[intrepid-notify] Re: Possible pending restart of Intrepid

Andrew Cherry acherry at mcs.anl.gov
Fri Aug 29 17:33:14 CDT 2008


We have restored service on the production row (row 0).  The 32-racks  
(rows 1-4) are still being worked on at this time.  We will send  
another notice when service is fully restored.

Andrew Cherry
ALCF Support

On Aug 29, 2008, at 2:31 PM, Andrew Cherry wrote:

> Folks-
>
> We've detected a problem on Intrepid that is preventing us from  
> servicing failed components on the system.  The issue appears to be  
> one that impacts service actions on the entire system, and is not  
> restricted to a particular rack or component.  Jobs are currently  
> not being affected by the problem, but it is still a concern since  
> it affects our ability to replace bad hardware (failed nodes, bad  
> power supplies, etc.), and it's particularly important that we  
> replace any failed power supplies before the upcoming holiday  
> weekend.  We have therefore placed a temporary hold on the system  
> starting at 3:45 PM just in case we end up having to do a full  
> control system restart.  This hold will prevent new jobs from  
> starting if they would go past 3:45 PM, so if you find that your  
> jobs aren't being launched as expected, this is the reason why.
>
> We are working with IBM on the issue and will release the hold as  
> soon as it is resolved.
>
> Thanks for your patience.
>
> Andrew Cherry
> ALCF Support
>



