<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<font face="Helvetica, Arial, sans-serif">Dear ALCF Users,<br>
<br>
As stated in the notification Wednesday evening, the /intrepid-fs0
file system showed signs of corruption. ALCF staff believe this
corruption is due to the sudden building power outage on March 19.
<br>
<br>
We are in the process of repairing the file system. We need to
complete the fsck in repair mode on /intrepid-fs0 and then run
another scan to verify that all problems are fixed. Based on our
estimates, this process will complete early in the morning on Monday
the 8th. We believe it is likely that all issues will be resolved and
that we will be able to return to normal production upon completion of
our normal Monday maintenance. However, if additional errors are
found during the verification scan, we will need another
repair-and-verify iteration, which would likely take until the
following Thursday. Please see the steps at the end of this message
for more detail.<br>
<br>
In the meantime, we have enabled access to the login nodes, /home,
and /intrepid-fs1 (a PVFS volume) on Intrepid, Challenger, and
Eureka. This gives users access to the system and enables some
users to run jobs from the PVFS file system. You may continue to
submit jobs to the default queue, but they will not run until
/intrepid-fs0 is restored to service.<br>
<br>
We apologize for the inconvenience and are working to resolve this
issue as quickly as possible. If you have any questions, please
don't hesitate to contact your Catalyst or the ALCF help desk
(<a class="moz-txt-link-abbreviated" href="mailto:support@alcf.anl.gov">support@alcf.anl.gov</a>).<br>
<br>
PVFS and Running in this Mode:<br>
<br>
If you believe you can run your jobs from the </font><font
face="Helvetica, Arial, sans-serif">PVFS</font><font
face="Helvetica, Arial, sans-serif"> file system (/intrepid-fs1),
please send an email to <a class="moz-txt-link-abbreviated" href="mailto:support@alcf.anl.gov">support@alcf.anl.gov</a> or your assigned
Catalyst, and we will work with you to evaluate whether this
short-term solution is suitable for you.<br>
<br>
If you are a user of the </font><font face="Helvetica, Arial,
sans-serif">PVFS</font><font face="Helvetica, Arial, sans-serif">
file system, compiling on PVFS is not advised. ALCF staff
recommend that you compile in your home directory and use the PVFS
file system only for reading and writing data.<br>
<br>
The team has created special queues on Challenger and Intrepid. To
run:<br>
<br>
qsub -q Q.pvfsruns --kernel pvfs -n &lt;node count&gt; -t
&lt;walltime&gt; -A \<br>
&lt;project&gt; [any other options] &lt;executable&gt;<br>
<br>
On Eureka, you do not need to specify the --kernel option, but you
must still use -q Q.pvfsruns. There is also a queue on
Eureka for the pubnet nodes: Q.pvfsruns-pubnet.<br>
<br>
Again, if you have any questions, please don't hesitate to contact
your Catalyst or the ALCF help desk (<a class="moz-txt-link-abbreviated" href="mailto:support@alcf.anl.gov">support@alcf.anl.gov</a>).<br>
<br>
File System Details:<br>
<br>
As in December, when we discovered file system corruption, ALCF
staff are working through these steps to repair /intrepid-fs0:<br>
<br>
1. Complete a full fsck in scan mode (determine what the problems
are, but don't fix them).<br>
2. Identify corrupted or problematic files.<br>
3. Copy the corrupted or problematic files off of the file system.<br>
4. Delete those files from the file system.<br>
5. Run fsck in repair mode (fix the problems).<br>
6. Run a verification scan to confirm all errors have been fixed.<br>
7. If no errors remain, we are finished. If errors remain, return to step 2.<br>
<br>
We are working on step 5. Again, fsck in repair mode will have to
be repeated until there are no more reported errors.<br>
<br>
Thank you,<br>
ALCF Support</font>
</body>
</html>