[intrepid-notify] Intrepid / Eureka System Status
Jini Ramprakash
jini at alcf.anl.gov
Tue May 25 12:48:10 CDT 2010
Hello,
The BlueGene/P Intrepid and several other subsystems at Argonne's
Leadership Computing Facility are currently offline while a lab chilled
water problem is resolved. Details below:
* Last night at approximately 7pm, the temperature in the Interim
Supercomputing Support Facility began to rise. Automated scripts then
shut down subsystems to reduce the machine room heat load.
* At about 9pm, cooling appeared to have been restored, and staff began
bringing subsystems online, one at a time. By 11pm, the staff
determined that the chillers were not keeping up with the load, and
that it was therefore not possible to bring up all the systems. By 1:00am,
the team had brought up some of the file systems and other subsystems,
but decided to leave Intrepid offline while facility support staff worked
to bring the chilled water system back to full capacity.
* By 5:30am this morning, the chilled water loop had stabilized for the
partial load currently on the system.
We are working closely with the facility staff to bring the main
systems back online. We will slowly increase the load on the chillers
by bringing up 8 racks of Intrepid at a time. If all goes well, we
should have all of the ALCF systems back online by this evening.
Root Cause: We are still working with ANL facility staff to understand
what caused the lab-supplied chilled water to fail. Currently, we
believe that it was a combination of scheduled maintenance on the
exterior cooling towers and the unseasonably warm weather. We continue
to work with ANL staff to bring our systems back online and ensure a
reliable chilled water supply, and will provide a more detailed update
after we have returned our systems to service.
--
Thanks & Regards,
ALCF Support Team.