[notify] Update on intrepid situation

Andrew Cherry acherry at mcs.anl.gov
Thu Apr 17 20:51:57 CDT 2008


We are continuing to have issues on intrepid after restarting it this  
afternoon.  We have determined that the issues are triggered by  
booting large numbers of nodes at the same time, which can cause  
partitions to time out and fail to boot properly.  Since we have not  
had difficulty booting single rows, we have restored row 0 (the 100T)  
to service in order to allow production jobs to run overnight.   For  
the time being, we will be leaving the 500T (rows 1-4) disabled until  
we can get a better understanding of the problem, in order to avoid  
further impacting production use of the system.  We have reserved all
5 rows of intrepid at 9 tomorrow morning in order to perform more
troubleshooting and, we hope, get down to the root cause of this
issue.

Thank you for your continued patience.

ALCF Support

On Apr 17, 2008, at 11:29 AM, Andrew Cherry wrote:

> Late last night, we identified a problem with the BG/P  
> environmental monitoring on intrepid that needs to be addressed.   
> Unfortunately, there is a possibility that this may result in  
> having to restart the control system again (we are still working  
> with IBM to try to cover all other options before we resort to a  
> control system restart).  We have therefore reserved the entire  
> system for a possible restart at 2 PM.   New jobs queued on  
> intrepid will not be started if they are long enough to run past 2
> PM.  We don't expect any currently running production jobs to be
> impacted (since they will be finished by the time the work begins),  
> but long-running early science jobs may need to be killed if they  
> are still running and we determine that the restart is needed.
>
> Access to the login nodes and the filesystems will not be impacted  
> by this work - this will only affect the BlueGene itself.
>
> We will send out another note when we know for sure what the impact  
> will be.
>
> Thanks...
>
> -Andrew Cherry
>  ALCF Support
>



