[intrepid-notify] Possible pending restart of Intrepid

Andrew Cherry acherry at alcf.anl.gov
Fri Aug 29 14:31:23 CDT 2008


Folks-

We've detected a problem on Intrepid that is preventing us from  
servicing failed components on the system.  The issue appears to be  
one that impacts service actions on the entire system, and is not  
restricted to a particular rack or component.  Jobs are currently not  
being affected by the problem, but it is still a concern since it  
affects our ability to replace bad hardware (failed nodes, bad power  
supplies, etc.), and it's particularly important that we replace any  
failed power supplies before the upcoming holiday weekend.  We have  
therefore placed a temporary hold on the system starting at 3:45 PM  
just in case we end up having to do a full control system restart.   
This hold will prevent new jobs from starting if they would run past  
3:45 PM, so if you find that your jobs aren't being launched as  
expected, this is why.

We are working with IBM on the issue and will release the hold as soon  
as we find a resolution.

Thanks for your patience.

Andrew Cherry
ALCF Support