[intrepid-notify] Intrepid status

Andrew Cherry acherry at alcf.anl.gov
Fri Jan 23 18:04:26 CST 2009


Update:

Because the instabilities on the system were resolved by reverting to
V1R2M0, we have aborted the driver upgrade and are now back on the
previous driver.  We will schedule test windows with IBM to test the
new driver more thoroughly before switching to it in the future.  If
you have recompiled your code, you will need to either go back to your
pre-upgrade executables as described below or recompile your
application.

It looks as if most of the jobs in the queue on Intrepid have  
executables that are older than one week, so we are releasing the  
queues for use.  As a precaution, I have put user holds on the  
following jobs that have executables with modification times newer  
than Monday:

94288:gjordan:/gpfs1/gjordan/2009/mass_study/8km_16o20/rundir_0004/flash3
94294:gjordan:/gpfs1/gjordan/2009/extreme_preexpansion/8km_multi_63_128o148_m1385/rundir_0010/flash3
94324:ckerr:/gpfs1/ckerr/c720-hiram.dev/scripts/runscript
94328:rajeshn:/gpfs/home/rajeshn/temp/install/bin/./hello
94335:chulwoo:/gpfs/home/chulwoo/CPS/v5_0_5-tw/reweight/BGP.x.v1
94337:lynnreid:/home/lynnreid/datadir/extreme_preexpansion/8km_63_128o148_1365_1/rundir_0002/flash3
94342:knomura:/gpfs/home/knomura/stencil/afd3_mpi_thread/afd3_mpi
94343:spanu:/gpfs/home/spanu/TurboRVB/bin/turborvb-mpi.x
94382:detar:/home/detar/kappa_tune/l48144f21b747m0036m018/SCRIPTS_detar/batch_kappa_0_000545.sh
94383:detar:/home/detar/kappa_tune/l48144f21b747m0036m018/SCRIPTS_detar/batch_kappa_36_000545.sh
94384:detar:/home/detar/kappa_tune/l48144f21b747m0036m018/SCRIPTS_detar/batch_kappa_72_000545.sh
94385:detar:/home/detar/kappa_tune/l48144f21b747m0036m018/SCRIPTS_detar/batch_kappa_108_000545.sh
94386:chan:/gpfs/home/chan/ts_abort/halfsleep_mpiabort
94387:kweide:/gpfs1/kweide/2009/extreme_preexpansion/8km_79_138o148_1365_1/rundir_0001/flash3
94389:chulwoo:/gpfs/home/chulwoo/CPS/v5_0_5-tw/reweight/BGP.x.v2
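
If you want to apply the same modification-time check to your own
executables, the following is a minimal Python sketch, not the script
used to build the list above.  The cutoff (midnight local time on
Monday, Jan 19, 2009) is an assumption inferred from the date of this
message, and modified_since is a hypothetical helper name.

```python
import os
from datetime import datetime

# Assumed cutoff: start of Monday, Jan 19, 2009 (local time), the Monday
# before this message was sent.
CUTOFF = datetime(2009, 1, 19)


def modified_since(path, cutoff=CUTOFF):
    """Return True if the file at `path` was last modified at or after `cutoff`."""
    mtime = datetime.fromtimestamp(os.path.getmtime(path))
    return mtime >= cutoff
```

A True result only means the file changed during the upgrade window;
it does not by itself prove the file was rebuilt against the new driver.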

If you have a job on the list above, I recommend you restore your  
original executables before releasing the hold on your job.
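
As a rough illustration of restoring a file from the GPFS snapshots
described in the quoted note further down, here is a minimal Python
sketch.  It assumes the layout given there (a .snapshots directory in
your home directory, with one subdirectory per day of the week that
mirrors your home directory).  The function name and the example path
are hypothetical, and this message does not say whether project file
systems such as /gpfs1 have a similar .snapshots directory.

```python
import shutil
from pathlib import Path


def restore_from_snapshot(home, day, relative_path):
    """Copy <home>/.snapshots/<day>/<relative_path> back to <home>/<relative_path>."""
    home = Path(home)
    source = home / ".snapshots" / day / relative_path
    target = home / relative_path
    if not source.is_file():
        raise FileNotFoundError(f"no snapshot copy at {source}")
    # Make sure the destination directory exists before copying.
    target.parent.mkdir(parents=True, exist_ok=True)
    # copy2 keeps permission bits (including the execute bit) and timestamps.
    shutil.copy2(source, target)
    return target
```

For example, restore_from_snapshot(Path.home(), "monday", "bin/my_app")
would overwrite ~/bin/my_app with Monday's copy; keep the rebuilt binary
under another name first if you may want it later.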

We apologize for any inconvenience, and again, thank you for your  
patience.  Please email support at alcf.anl.gov if you have further  
questions or concerns.

Andrew Cherry
ALCF Support

On Jan 23, 2009, at 3:39 PM, Andrew Cherry wrote:

> If you've recompiled any of your code for the new BlueGene V1R3M0  
> driver on Intrepid, or are in the process of doing so, please read  
> this note.
>
> As mentioned in the communication below, there is a possibility that  
> we will back out of the new V1R3M0 driver installation on Intrepid  
> if we don't find a resolution to the problems we are seeing by the  
> close of business today.  One implication of this is that binaries  
> rebuilt for the new driver will most likely not work on the old  
> one.  If you are rebuilding your code, please make sure to keep a  
> copy of your old binaries in case we need to revert back.
>
> If you have already rebuilt your code and overwritten or deleted your
> original binaries, you can still recover the original files from one
> of the GPFS snapshots.  If you cd into the .snapshots directory in
> your home directory, you will find several directories named after
> days of the week.  Each of these contains a snapshot of your home
> directory from the last occurrence of that day; for example, if you
> need to recover your files from last Saturday, you'll find them in
> the "saturday" directory.  The snapshots are only good for a week
> (i.e., the "saturday" directory is reused every Saturday), so there
> is limited time to copy data out of them.  Unfortunately, we didn't
> get a snapshot on Sunday this past week, so if you need something
> older than Monday's copy you'll need to get it from the Saturday
> copy.  If you do need any files from last Saturday's or Monday's
> snapshots, I suggest you copy them today, before the snapshots are
> overwritten tomorrow and Monday.
>
> Sorry for any inconvenience.  We will let you know later today if we  
> end up reverting back to the old driver.
>
> Andrew Cherry
> ALCF Support
>
> On Jan 22, 2009, at 10:58 PM, tstacey at alcf.anl.gov wrote:
>
>> We continue to experience extremely unstable behavior when running
>> jobs at scale on the new V1R3M0 driver.  Debugging has been
>> complicated by diagnostics calling out intermittent hardware issues.
>> We currently have a level 1 PMR (very high priority trouble ticket)
>> open with IBM and they are working on this problem.  If we have not
>> been able to resolve the problem by tomorrow afternoon, we will
>> revert back to our previous driver, so that jobs can run over the
>> weekend.
>>
>> We apologize for any inconvenience this outage has caused and are
>> grateful for your continued patience as we work through this issue.
>> Please contact support at alcf.anl.gov if you have any further
>> questions.
>>
>> Thanks, The ALCF Support Team
>> _______________________________________________
>> intrepid-notify mailing list
>> intrepid-notify at alcf.anl.gov
>> http://lists.alcf.anl.gov/cgi-bin/mailman/listinfo/intrepid-notify
>
