[intrepid-notify] Re: Possible pending restart of Intrepid
Andrew Cherry
acherry at mcs.anl.gov
Fri Aug 29 17:33:14 CDT 2008
We have restored service on the production row (row 0). The remaining
32 racks (rows 1-4) are still being worked on at this time. We will
send another notice when service is fully restored.
Andrew Cherry
ALCF Support
On Aug 29, 2008, at 2:31 PM, Andrew Cherry wrote:
> Folks-
>
> We've detected a problem on Intrepid that is preventing us from
> servicing failed components on the system. The issue appears to be
> one that impacts service actions on the entire system, and is not
> restricted to a particular rack or component. Jobs are currently
> not being affected by the problem, but it is still a concern since
> it affects our ability to replace bad hardware (failed nodes, bad
> power supplies, etc.), and it's particularly important that we
> replace any failed power supplies before the upcoming holiday
> weekend. We have therefore placed a temporary hold on the system
> starting at 3:45 PM just in case we end up having to do a full
> control system restart. This hold will prevent new jobs from
> starting if they would go past 3:45 PM, so if you find that your
> jobs aren't being launched as expected, this is the reason why.
>
> We are working with IBM on the issue and will release the hold as
> soon as we find a resolution.
>
> Thanks for your patience.
>
> Andrew Cherry
> ALCF Support
>