[intrepid-notify] Status update on Intrepid, Eureka, and Challenger
ALCF User Services
support at alcf.anl.gov
Thu Apr 4 17:39:41 CDT 2013
Dear ALCF Users,
As stated in the notification Wednesday evening, the /intrepid-fs0 file
system showed signs of corruption. ALCF staff believe this corruption is
due to the sudden building power outage on March 19.
We are in the process of repairing the file system. We need to complete
the fsck in repair mode on /intrepid-fs0 and then run another scan to
verify that all problems are fixed. Based on our estimates, this process
will complete early in the morning on Monday, April 8. We believe it is likely
that all issues will be resolved and we will be able to return to normal
production upon completion of our normal Monday maintenance. However, if
additional errors are found during the verification scan, we will need
another iteration of repair / verify and that would likely take us until
the following Thursday. Please see the steps at the end of the message
for more detail.
In the meantime, we have enabled access to the login nodes, /home, and
/intrepid-fs1 (a PVFS volume) on Intrepid, Challenger, and Eureka. This
gives users access to the system and enables some users to run jobs from
the PVFS file system. You may continue to submit jobs to the default
queue, but they will not run until /intrepid-fs0 is restored to service.
We apologize for the inconvenience and are working to resolve this issue
as quickly as possible. If you have any questions, please don't hesitate
to contact your Catalyst or the ALCF help desk (support at alcf.anl.gov).
PVFS and Running in this Mode:
If you believe you can run your jobs from the PVFS file system
(/intrepid-fs1), please send an email to support at alcf.anl.gov or your
assigned Catalyst, and we will work with you to evaluate whether this
short-term solution will work for you.
If you use the PVFS file system, we do not advise compiling on PVFS.
ALCF staff recommend that you compile in your home directory and read
and write data on the PVFS file system.
The team has created special queues on Challenger and Intrepid. To run:
qsub -q Q.pvfsruns --kernel pvfs -n <node count> -t <walltime> -A \
<project> [any other options] <executable>
On Eureka, you do not have to specify a --kernel option, but you
still have to use -q Q.pvfsruns. There is also a queue on Eureka for the
pubnet nodes: Q.pvfsruns-pubnet.
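As an illustration only, the short script below fills in the qsub template above with made-up values and prints the resulting command for review without submitting anything. The node count, walltime, project name, and executable path are all hypothetical placeholders; substitute your own and use whatever walltime format you normally pass to -t.

```shell
#!/bin/sh
# Hypothetical values for illustration only; replace with your own.
NODES=512                      # node count (assumed)
WALLTIME=30                    # walltime value (assumed)
PROJECT=MyProject              # project name (hypothetical)
EXE="$HOME/myapp/bin/myapp"    # executable built in the home directory (hypothetical path)

# Assemble the command from the options given in this message.
CMD="qsub -q Q.pvfsruns --kernel pvfs -n $NODES -t $WALLTIME -A $PROJECT $EXE"

# Print the command for review; nothing is submitted by this script.
echo "$CMD"
```

On Eureka, the same sketch would apply with the --kernel pvfs option left out. Keeping the executable under the home directory and pointing the job's input and output at /intrepid-fs1 follows the recommendation above.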
Again, if you have any questions, please don't hesitate to contact your
Catalyst or the ALCF help desk (support at alcf.anl.gov).
File System Details:
As in December when we discovered file system corruption, ALCF staff are
working through these steps to repair intrepid-fs0:
1. Complete a full fsck in scan mode (determine what the problems are,
but don't fix them).
2. Identify corrupted or problematic files.
3. Copy the corrupted or problematic files off of the file system.
4. Delete those files.
5. Run fsck in repair mode (fix the problems).
6. Verify that all errors have been fixed.
7. If no errors remain, we are finished. If errors remain, return to step 2.
We are working on step 5. Again, fsck in repair mode will have to be
repeated until there are no more reported errors.
Thank you,
ALCF Support