[intrepid-notify] Intrepid / Eureka System Status

Jini Ramprakash jini at alcf.anl.gov
Tue May 25 12:48:10 CDT 2010


Hello,

The BlueGene/P Intrepid and several other subsystems at Argonne's 
Leadership Computing Facility are currently offline while a lab chilled 
water problem is resolved.  Details below:

* Last night at approximately 7pm, the temperature in the Interim 
Supercomputing Support Facility began to rise.  Automated scripts 
reduced the machine room heat load by turning off subsystems.

* At about 9pm, cooling appeared to have been restored, and staff began 
bringing subsystems online, one at a time.  By 11pm, the staff 
determined that the chillers were not keeping up with the load, so it 
was not possible to bring up all the systems.  By 1:00am, the team had 
brought up some of the file systems and other subsystems, but decided to 
leave Intrepid offline while facility support staff worked to bring the 
chilled water system back to full capacity.

* By 5:30am this morning, the chilled water loop had stabilized for the 
partial load currently on the system.

We are working closely with the facility staff to bring the main 
systems back online.  We will increase the load on the chillers 
gradually, bringing up 8 racks of Intrepid at a time.  If all goes 
well, we should have all of the ALCF systems back online by this 
evening.

Root Cause:  We are still working with ANL facility staff to understand 
what caused the lab-supplied chilled water to fail.  Currently, we 
believe that it was a combination of scheduled maintenance on the 
exterior cooling towers and the unseasonably warm weather.  We continue 
to work with ANL staff to bring our systems back online and to ensure 
reliable chilled water going forward, and will provide a more detailed 
update after we have returned our systems to service.

-- 

Thanks & Regards,
ALCF Support Team.

