[intrepid-notify] TIME SENSITIVE - Intrepid recompilation notes

Andrew Cherry acherry at alcf.anl.gov
Fri Jan 23 15:39:43 CST 2009


If you've recompiled any of your code for the new BlueGene V1R3M0  
driver on Intrepid, or are in the process of doing so, please read  
this note.

As mentioned in the message quoted below, we may back out the new
V1R3M0 driver installation on Intrepid if we don't resolve the problems
we are seeing by the close of business today.  One implication is that
binaries rebuilt for the new driver will most likely not work on the
old one.  If you are rebuilding your code, please keep a copy of your
old binaries in case we need to revert.
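
One way to keep a copy of your old binaries before rebuilding is
sketched below in Python.  This is only an illustration: the function
name, the "old-driver-" directory prefix, and the example paths are
hypothetical and not part of any ALCF tooling.  A plain cp into a safe
directory works just as well.

```python
import shutil
import time
from pathlib import Path


def backup_binaries(binaries, backup_root):
    """Copy each binary into a new timestamped directory and return it.

    The "old-driver-" prefix is a hypothetical naming choice.
    """
    dest = Path(backup_root) / time.strftime("old-driver-%Y%m%d-%H%M%S")
    # exist_ok=False: refuse to reuse a directory, so an earlier backup
    # is never silently overwritten.
    dest.mkdir(parents=True, exist_ok=False)
    for binary in binaries:
        # copy2 preserves permission bits and timestamps where possible.
        shutil.copy2(binary, dest / Path(binary).name)
    return dest
```

For example, backup_binaries(["bin/my_app"], "binary-backups") would
copy the hypothetical bin/my_app into a new directory under
binary-backups before you rebuild.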

If you have already rebuilt your code and have overwritten or deleted
your original binaries, you can still recover the original files from
one of the GPFS snapshots.  If you cd into the .snapshots directory in
your home directory, you will find several directories named after days
of the week.  Each of these contains a snapshot of your home directory
from the last occurrence of that day; for example, if you need to
recover your files from last Saturday, you'll find them in the
"saturday" directory.  The snapshots are only good for a week (i.e. the
"saturday" directory is reused every Saturday), so there's limited time
to copy data out of them.  Unfortunately, we didn't get a snapshot on
Sunday this past week, so if you need something older than Monday's
copy you'll need to get it from the Saturday copy.  If you do need any
files from last Saturday's or Monday's snapshots, I suggest you copy
them out today, before those snapshots are overwritten tomorrow and on
Monday respectively.
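
The recovery described above could be scripted roughly as follows.
This is a sketch under assumptions: it takes the layout from this note
(a .snapshots directory in your home directory with one subdirectory
per day, such as "saturday") and assumes that paths inside a snapshot
mirror the paths in your home directory.  The function name and the
".recovered" suffix are hypothetical.

```python
import shutil
from pathlib import Path


def restore_from_snapshot(relative_path, day, home=None, suffix=".recovered"):
    """Copy one file out of a daily snapshot, next to its current location.

    relative_path is relative to the home directory; day is the snapshot
    directory name, e.g. "saturday".  The copy gets a suffix so that a
    rebuilt binary at the same path is left untouched.
    """
    home = Path(home) if home else Path.home()
    # Assumed layout: <home>/.snapshots/<day>/<same relative path>
    source = home / ".snapshots" / day / relative_path
    if not source.is_file():
        raise FileNotFoundError(f"not found in snapshot: {source}")
    target = home / relative_path
    target = target.with_name(target.name + suffix)
    target.parent.mkdir(parents=True, exist_ok=True)
    # copy2 preserves permission bits, so the binary stays executable.
    shutil.copy2(source, target)
    return target
```

For example, restore_from_snapshot("bin/my_app", "saturday") would copy
a hypothetical ~/.snapshots/saturday/bin/my_app to
~/bin/my_app.recovered.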

Sorry for any inconvenience.  We will let you know later today if we
end up reverting to the old driver.

Andrew Cherry
ALCF Support

On Jan 22, 2009, at 10:58 PM, tstacey at alcf.anl.gov wrote:

> We continue to experience extremely unstable behavior when running
> jobs at scale on the new V1R3M0 driver.  Debugging has been
> complicated by diagnostics calling out intermittent hardware issues.
> We currently have a level 1 PMR (very high priority trouble ticket)
> open with IBM and they are working on this problem.  If we have not
> been able to resolve the problem by tomorrow afternoon, we will
> revert back to our previous driver, so that jobs can run over the
> weekend.
>
> We apologize for any inconvenience this outage has caused and are
> grateful for your continued patience as we work through this issue.
> Please contact support at alcf.anl.gov if you have any further questions.
>
> Thanks, The ALCF Support Team
> _______________________________________________
> intrepid-notify mailing list
> intrepid-notify at alcf.anl.gov
> http://lists.alcf.anl.gov/cgi-bin/mailman/listinfo/intrepid-notify
