[intrepid-notify] UPDATE - intrepid back up

Andrew Cherry acherry at mcs.anl.gov
Tue Apr 29 13:35:29 CDT 2008


It looks like the network situation has stabilized.  Home directories  
are accessible again, and most jobs that are running now look like  
they are still producing output.  There are a couple that might be  
hung - I will contact individuals separately on those.

I have re-enabled all queues on intrepid.  If you have any jobs that  
completed on intrepid between 10:30 AM and 1:00 PM today, you may  
want to check on your .error file(s) and see if there were any  
problems, resubmitting if necessary.

Sorry for any inconvenience.

ALCF Support Team

On Apr 29, 2008, at 12:09 PM, ALCF Support wrote:

> FYI-
>
> At around 10:30 AM today, we began to encounter widespread network  
> issues on intrepid. The network problems are affecting home  
> directory access on the frontend nodes, as well as BG/P jobs on some  
> parts of the machine.   To prevent job failures, we have  
> temporarily disabled all Cobalt queueing on intrepid (though we  
> have not killed any running jobs).  Some of the running jobs may be  
> OK, but we don't yet know the full scope of the problem, which has  
> broadened since we first noticed the issue.  Once we have  
> everything back online and things have stabilized, we should be  
> able to assess which jobs have been affected by the problem.
>
> We will send out another note as soon as everything is working  
> properly again.
>
> Thanks,
> ALCF Support Team
>
> _______________________________________________
> intrepid-notify mailing list
> intrepid-notify at alcf.anl.gov
> http://lists.alcf.anl.gov/cgi-bin/mailman/listinfo/intrepid-notify



