[intrepid-notify] Re: Unexpected outage on Intrepid
Andrew Cherry
acherry at mcs.anl.gov
Wed Jul 23 18:53:29 CDT 2008
FYI, Intrepid is back online (with the exception of some hardware on
ANL-R10-R11-2048 and ANL-R15-1024, which is being serviced).
Andrew Cherry
ALCF Support
On Jul 23, 2008, at 6:19 PM, Andrew Cherry wrote:
> As of 5:40 PM today, we have encountered a major issue with the
> BlueGene/P control system, which caused it to reset itself. We are
> in the process of investigating the issue; at this point we are not
> certain of the root cause. We will be holding all jobs in the queue
> until the system is operational again. We will send another note
> when the system is back online.
>
> FYI, the following jobs died as a result of the control system going
> down and will need to be resubmitted:
>
> 27046 close-2.namd fkhalili
> 03:00:00 00:32:55 03:08:27 1024 running ANL-R04-1024
> dual 2048 prod-medium 07/23/08 14:45:31 None
> 27050 log.1365.backward.antinucleon.px-1 pochinsk
> 01:56:00 01:35:46 01:44:34 512 running ANL-R07-M0-512
> vn 2048 prod-medium 07/23/08 16:09:25 None
> 27058 log.1370.backward.nucleon.px-1 pochinsk
> 01:19:00 01:30:28 00:14:40 1024 running ANL-R06-1024
> vn 4096 prod-medium 07/23/08 17:39:18 None
> 27059 log.1370.backward.antinucleon.px0 pochinsk
> 01:56:00 00:00:41 01:44:22 512 running ANL-R05-M0-512
> vn 2048 prod-medium 07/23/08 16:09:37 None
> 27060 log.1370.backward.antinucleon.px-1 pochinsk
> 01:56:00 00:08:24 01:35:37 512 running ANL-R05-M1-512
> vn 2048 prod-medium 07/23/08 16:18:21 None
> 27068 log.435.backward.antinucleon.px0 pochinsk
> 01:09:00 00:40:31 00:19:32 2048 running ANL-R02-R03-2048
> vn 8192 prod-medium 07/23/08 17:34:27 None
> 27070 log.675.backward.antinucleon.px0 pochinsk
> 01:09:00 00:00:39 00:14:54 2048 running ANL-R00-R01-2048
> vn 8192 prod-medium 07/23/08 17:39:05 None
>
> Sorry for the inconvenience.
>
> Andrew Cherry
> ALCF Support
>