[intrepid-notify] Cobalt issues on Intrepid

Andrew Cherry acherry at alcf.anl.gov
Wed Feb 11 23:23:32 CST 2009


We've encountered an issue with Cobalt on Intrepid that has caused it to lose track of which partitions are and aren't in use.  As a result, a number of jobs have attempted to start on partitions that are already blocked or busy, resulting in job failures.  To remedy the situation, we have temporarily stopped all queues to allow most of the remaining jobs to finish running.  Once the system has drained, we will restart Cobalt and resume service.  Based on the walltimes of currently running jobs, I estimate that we will be able to restore service in about two hours.

We apologize for any inconvenience.

Andrew Cherry
ALCF Support


