[intrepid-notify] Re: Possible pending restart of Intrepid

Andrew Cherry acherry at mcs.anl.gov
Fri Aug 29 18:44:12 CDT 2008


Service has been restored on the 32 racks, and Intrepid is now fully  
functional.

Thanks for your patience.

Andrew Cherry
ALCF Support

On Aug 29, 2008, at 5:33 PM, Andrew Cherry wrote:

> We have restored service on the production row (row 0).  The 32
> racks (rows 1-4) are still being worked on at this time.  We will
> send another notice when service is fully restored.
>
> Andrew Cherry
> ALCF Support
>
> On Aug 29, 2008, at 2:31 PM, Andrew Cherry wrote:
>
>> Folks-
>>
>> We've detected a problem on Intrepid that is preventing us from  
>> servicing failed components on the system.  The issue appears to be  
>> one that impacts service actions on the entire system, and is not  
>> restricted to a particular rack or component.  Jobs are currently  
>> not being affected by the problem, but it is still a concern since  
>> it affects our ability to replace bad hardware (failed nodes, bad  
>> power supplies, etc.), and it's particularly important that we  
>> replace any failed power supplies before the upcoming holiday  
>> weekend.  We have therefore placed a temporary hold on the system  
>> starting at 3:45 PM just in case we end up having to do a full  
>> control system restart.  This hold will prevent new jobs from  
>> starting if they would go past 3:45 PM, so if you find that your  
>> jobs aren't being launched as expected, this is the reason why.
>>
>> We are working with IBM on the issue and will release the hold as  
>> soon as we find a resolution.
>>
>> Thanks for your patience.
>>
>> Andrew Cherry
>> ALCF Support
>>
>



