[intrepid-notify] Re: Possible pending restart of Intrepid
Andrew Cherry
acherry at mcs.anl.gov
Fri Aug 29 18:44:12 CDT 2008
Service has been restored on the 32 racks, and Intrepid is now fully
functional.
Thanks for your patience.
Andrew Cherry
ALCF Support
On Aug 29, 2008, at 5:33 PM, Andrew Cherry wrote:
> We have restored service on the production row (row 0). The 32
> racks (rows 1-4) are still being worked on at this time. We will
> send another notice when service is fully restored.
>
> Andrew Cherry
> ALCF Support
>
> On Aug 29, 2008, at 2:31 PM, Andrew Cherry wrote:
>
>> Folks-
>>
>> We've detected a problem on Intrepid that is preventing us from
>> servicing failed components on the system. The issue appears to be
>> one that impacts service actions on the entire system, and is not
>> restricted to a particular rack or component. Jobs are currently
>> not being affected by the problem, but it is still a concern since
>> it affects our ability to replace bad hardware (failed nodes, bad
>> power supplies, etc.), and it's particularly important that we
>> replace any failed power supplies before the upcoming holiday
>> weekend. We have therefore placed a temporary hold on the system
>> starting at 3:45 PM just in case we end up having to do a full
>> control system restart. This hold will prevent new jobs from
>> starting if they would go past 3:45 PM, so if you find that your
>> jobs aren't being launched as expected, this is the reason why.
>>
>> We are working with IBM on the issue and will release the hold as
>> soon as we find a resolution.
>>
>> Thanks for your patience.
>>
>> Andrew Cherry
>> ALCF Support
>>
>