[intrepid-notify] Ongoing GPFS issues

Tisha Stacey tstacey at alcf.anl.gov
Wed Nov 26 16:13:30 CST 2008


As many of you are aware, we have been experiencing problems with the
GPFS file systems recently.  Symptoms include jobs failing to boot,
jobs unable to write output, stale NFS handles, and very slow ls on
home directories.  We are aware of the problem(s) and
are working to resolve them as rapidly as possible.  Our current
hypothesis is that a fault causing Myricom ports to suddenly generate
extremely high numbers of CRC errors is interacting with a GPFS bug
that keeps GPFS from handling these errors as it should.  We are
working closely with both IBM and Myricom to resolve
this problem, but it is intermittent and is proving difficult to
isolate.  We will continue to investigate and will send updates as we
learn more.  If we have not solved it before our downtime on
Dec. 8th - 10th, we will take advantage of having Myricom staff
on site to inspect the hardware and run tests as we come back up.

We apologize for the inconvenience and appreciate your cooperation in
providing us with bug reports and your patience as we work through this
issue.

Thank you,
ALCF System Team


