[intrepid-notify] Status update on Intrepid, Eureka, and Challenger

ALCF User Services support at alcf.anl.gov
Thu Apr 4 17:39:41 CDT 2013


Dear ALCF Users,

As stated in the notification Wednesday evening, the /intrepid-fs0 file 
system showed signs of corruption. ALCF staff believe this corruption is 
due to the sudden building power outage on March 19.

We are in the process of repairing the file system. We need to complete 
the fsck in repair mode on /intrepid-fs0 and then run another scan to 
verify that all problems are fixed. Based on our estimates, this process
will complete early in the morning on Monday, April 8. We believe it is likely
that all issues will be resolved and we will be able to return to normal 
production upon completion of our normal Monday maintenance. However, if 
additional errors are found during the verification scan, we will need 
another iteration of repair / verify and that would likely take us until 
the following Thursday. Please see the steps at the end of the message 
for more detail.

In the meantime, we have enabled access to the login nodes, /home, and 
/intrepid-fs1 (a PVFS volume) on Intrepid, Challenger, and Eureka. This
gives users access to the system and enables some users to run jobs from 
the PVFS file system. You may continue to submit jobs to the default 
queue, but they will not run until /intrepid-fs0 is restored to service.

We apologize for the inconvenience and are working to resolve this issue 
as quickly as possible. If you have any questions, please don't hesitate
to contact your Catalyst or the ALCF help desk (support at alcf.anl.gov).

PVFS and Running in this Mode:

If you believe you can run your jobs from the PVFS file system
(/intrepid-fs1), please send an email to support at alcf.anl.gov or your
assigned Catalyst and we will work with you to evaluate whether this
short-term solution will work for you.

If you are a user of the PVFS file system, compiling on PVFS is not
advised. ALCF staff recommend that you compile in your home directory 
and read and write data from the PVFS file system.

The team has created special queues on Challenger and Intrepid. To run:

qsub -q Q.pvfsruns --kernel pvfs -n <node count> -t <walltime> -A \
<project> [any other options] <executable>

On Eureka, you do not have to specify a --kernel option, but you do
still have to use -q Q.pvfsruns. There is also a queue on Eureka for the
pubnet nodes: Q.pvfsruns-pubnet.
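
As a minimal sketch of the difference between the machines, the helper
below only prints the qsub line described above; it does not submit
anything. The function name pvfs_qsub_cmd and the sample values (node
counts, walltime, the project MyProject, ./a.out) are illustrative
assumptions, not ALCF-provided tooling.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the qsub line for a PVFS run.
# Usage: pvfs_qsub_cmd <machine> <node count> <walltime> <project> <executable>
pvfs_qsub_cmd() {
    machine=$1; nodes=$2; walltime=$3; project=$4; exe=$5
    case "$machine" in
        intrepid|challenger)
            # Intrepid and Challenger need the pvfs kernel option.
            echo "qsub -q Q.pvfsruns --kernel pvfs -n $nodes -t $walltime -A $project $exe"
            ;;
        eureka)
            # Eureka takes no --kernel option but still needs the queue.
            echo "qsub -q Q.pvfsruns -n $nodes -t $walltime -A $project $exe"
            ;;
        *)
            echo "unknown machine: $machine" >&2
            return 1
            ;;
    esac
}

# Sample values only; substitute your own job parameters.
pvfs_qsub_cmd intrepid 512 60 MyProject ./a.out
pvfs_qsub_cmd eureka 4 60 MyProject ./a.out
```

Printing the command first makes it easy to check the queue and kernel
options before pasting the line into a login shell.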

Again, if you have any questions, please don't hesitate to contact your
Catalyst or the ALCF help desk (support at alcf.anl.gov).

File System Details:

As in December, when we discovered file system corruption, ALCF staff are
working through these steps to repair /intrepid-fs0:

1. Complete a full fsck in scan mode (determine what the problems are,
but don't fix them).
2. Identify corrupted or problematic files.
3. Copy corrupted or problematic files off of the file system.
4. Delete those files from the file system.
5. Run fsck in repair mode (fix the problems).
6. Verify all errors have been fixed.
7. If no errors are reported, we are finished. If errors remain, return
to step 2.

We are working on step 5. Again, fsck in repair mode will have to be 
repeated until there are no more reported errors.
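
The repair / verify iteration above can be pictured as a simple loop.
This is a toy sketch only: verify_scan and repair_pass are hypothetical
stand-ins that manipulate a counter, not the actual fsck commands, and
the starting error count and the number cleared per pass are made-up
values for illustration.

```shell
#!/bin/sh
# Toy model of the repair / verify cycle. Nothing here touches a file system.
remaining_errors=3   # hypothetical starting state

# Stand-in for the verification scan: prints how many errors are still reported.
verify_scan() {
    echo "$remaining_errors"
}

# Stand-in for steps 2-5 (identify, copy off, delete, fsck in repair mode).
# Assumes, for illustration only, that each pass clears up to two errors.
repair_pass() {
    if [ "$remaining_errors" -gt 2 ]; then
        remaining_errors=$((remaining_errors - 2))
    else
        remaining_errors=0
    fi
}

passes=0
# Keep repairing until the verification scan reports no errors.
while [ "$(verify_scan)" -gt 0 ]; do
    repair_pass
    passes=$((passes + 1))
done
echo "clean after $passes repair pass(es)"
```

The point of the sketch is that the number of passes is not known in
advance, which is why the completion estimate depends on what the
verification scan finds.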

Thank you,
ALCF Support