[intrepid-notify] Possible pending restart of Intrepid
Andrew Cherry
acherry at alcf.anl.gov
Fri Aug 29 14:31:23 CDT 2008
Folks-
We've detected a problem on Intrepid that is preventing us from
servicing failed components on the system. The issue appears to be
one that impacts service actions on the entire system, and is not
restricted to a particular rack or component. Jobs are not currently
affected, but the problem is still a concern because it limits our
ability to replace bad hardware (failed nodes, bad power supplies,
etc.). It is particularly important that we replace any failed power
supplies before the upcoming holiday weekend. We have
therefore placed a temporary hold on the system starting at 3:45 PM
just in case we end up having to do a full control system restart.
This hold will prevent new jobs from starting if they would run past
3:45 PM, so if you find that your jobs aren't being launched as
expected, this is why.
We are working with IBM on the issue and will release the hold as soon
as we find a resolution.
Thanks for your patience.
Andrew Cherry
ALCF Support