[notify] Update on intrepid situation - 100T back in service

Tisha Stacey tstacey at alcf.anl.gov
Fri Apr 18 16:25:24 CDT 2008


The 100T (row 0) has been restored to service.  The 500T (rows 1-4) remains disabled as we continue to work on it.  We will send another notice when the 500T has been returned to service.

Thank you,
ALCF Systems Team

Andrew Cherry wrote:
| We are continuing to have issues on intrepid after restarting it this
| afternoon.  We have determined that the issues are triggered by
| booting large numbers of nodes at the same time, which can cause
| partitions to time out and fail to boot properly.  Since we have not
| had difficulty booting single rows, we have restored row 0 (the 100T)  
| to service in order to allow production jobs to run overnight.   For  
| the time being, we will be leaving the 500T (rows 1-4) disabled until  
| we can get a better understanding of the problem, in order to avoid  
| further impacting production use of the system.  We have reserved
| all 5 rows of intrepid starting at 9 tomorrow morning in order to
| perform more troubleshooting, and hopefully get down to the root
| cause of this issue.
| 
| Thank you for your continued patience.
| 
| ALCF Support
| 
| On Apr 17, 2008, at 11:29 AM, Andrew Cherry wrote:
| 
|> Late last night, we identified a problem with the BG/P  
|> environmental monitoring on intrepid that needs to be addressed.   
|> Unfortunately, there is a possibility that this may result in  
|> having to restart the control system again (we are still working  
|> with IBM to try to cover all other options before we resort to a  
|> control system restart).  We have therefore reserved the entire  
|> system for a possible restart at 2 PM.   New jobs queued on  
|> intrepid will not be started if they are long enough to run past 2
|> PM.  We don't expect any currently running production jobs to be  
|> impacted (since they will be finished by the time the work begins),  
|> but long-running early science jobs may need to be killed if they  
|> are still running and we determine that the restart is needed.
|>
|> Access to the login nodes and the filesystems will not be impacted  
|> by this work - this will only affect the BlueGene itself.
|>
|> We will send out another note when we know for sure what the impact  
|> will be.
|>
|> Thanks...
|>
|> -Andrew Cherry
|>  ALCF Support
|>



