[intrepid-notify] TIME SENSITIVE - Intrepid recompilation notes
Andrew Cherry
acherry at alcf.anl.gov
Fri Jan 23 15:39:43 CST 2009

If you've recompiled any of your code for the new BlueGene V1R3M0
driver on Intrepid, or are in the process of doing so, please read
this note.

As mentioned in the communication below, there is a possibility that
we will back out of the new V1R3M0 driver installation on Intrepid if
we don't find a resolution to the problems we are seeing by the close
of business today. One implication of this is that binaries rebuilt
for the new driver will most likely not work on the old one. If you
are rebuilding your code, please make sure to keep a copy of your old
binaries in case we need to revert.

If you have already rebuilt your code and overwritten or deleted your
original binaries, it is still possible to recover your original files
from one of the GPFS snapshots. If you cd into the .snapshots
directory in your home directory, you will find several directories
named after days of the week. Each of these contains a snapshot of
your home directory from the last occurrence of that day; for example,
if you need to recover your files from last Saturday, you'll find them
in the "saturday" directory. The snapshots are only good for a week
(i.e. the "saturday" directory is reused every Saturday), so there's
limited time to copy data out of them. Unfortunately, we didn't get a
snapshot on Sunday this past week, so if you need something older than
Monday's copy you'll need to get it from the Saturday copy. If you do
need to get any files from last Saturday's or Monday's snapshots, I
suggest you do so today before the snapshots are overwritten tomorrow
and Monday.
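
The recovery step described above can be scripted. The following is a
minimal Python sketch, not an official ALCF tool. It assumes that the
snapshot directories use lowercase day names (as with "saturday" above)
and that each one mirrors the layout of your home directory. The
function names and the ".old-driver" suffix are hypothetical, chosen
only for illustration.

```python
import shutil
from pathlib import Path

# Assumed snapshot directory names: lowercase days of the week.
DAYS = ("monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday")


def snapshot_path(home, day, relative_path):
    """Build the path of a file inside <home>/.snapshots/<day>/."""
    day = day.lower()
    if day not in DAYS:
        raise ValueError(f"unknown snapshot day: {day!r}")
    return Path(home) / ".snapshots" / day / relative_path


def restore_from_snapshot(home, day, relative_path, suffix=".old-driver"):
    """Copy a file out of a snapshot, next to the current file.

    The copy is written under a new name (original name plus suffix),
    so a binary rebuilt for the new driver is left untouched.
    """
    src = snapshot_path(home, day, relative_path)
    if not src.is_file():
        raise FileNotFoundError(f"no such file in snapshot: {src}")
    dest = Path(home) / (str(relative_path) + suffix)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # copy2 keeps permission bits and timestamps, so an executable
    # stays executable after it is restored.
    shutil.copy2(src, dest)
    return dest
```

For example, restore_from_snapshot(Path.home(), "saturday", "bin/mycode")
would copy last Saturday's version of a hypothetical bin/mycode to
bin/mycode.old-driver in your home directory, leaving the rebuilt binary
in place.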

Sorry for any inconvenience. We will let you know later today if we
end up reverting to the old driver.

Andrew Cherry
ALCF Support

On Jan 22, 2009, at 10:58 PM, tstacey at alcf.anl.gov wrote:
> We continue to experience extremely unstable behavior when running
> jobs at scale on the new V1R3M0 driver. Debugging has been
> complicated by diagnostics calling out intermittent hardware issues.
> We currently have a level 1 PMR (very high priority trouble ticket)
> open with IBM and they are working on this problem. If we have not
> been able to resolve the problem by tomorrow afternoon, we will
> revert back to our previous driver, so that jobs can run over the
> weekend.
>
> We apologize for any inconvenience this outage has caused and are
> grateful for your continued patience as we work through this issue.
> Please contact support at alcf.anl.gov if you have any further questions.
>
> Thanks, The ALCF Support Team