[intrepid-notify] Unexpected outage on Intrepid
Andrew Cherry
acherry at alcf.anl.gov
Wed Jul 23 18:19:55 CDT 2008
At 5:40 PM today, we encountered a major issue with the
BlueGene/P control system, which caused it to reset itself. We are
investigating the issue; the root cause is not yet known. We will hold
all jobs in the queue until the system is operational again, and we will
send another note when the system is back online.
FYI, the following jobs died as a result of the control system going
down and will need to be resubmitted:
27046  close-2.namd                        fkhalili  03:00:00  00:32:55  03:08:27  1024  running  ANL-R04-1024      dual  2048  prod-medium  07/23/08 14:45:31  None
27050  log.1365.backward.antinucleon.px-1  pochinsk  01:56:00  01:35:46  01:44:34   512  running  ANL-R07-M0-512    vn    2048  prod-medium  07/23/08 16:09:25  None
27058  log.1370.backward.nucleon.px-1      pochinsk  01:19:00  01:30:28  00:14:40  1024  running  ANL-R06-1024      vn    4096  prod-medium  07/23/08 17:39:18  None
27059  log.1370.backward.antinucleon.px0   pochinsk  01:56:00  00:00:41  01:44:22   512  running  ANL-R05-M0-512    vn    2048  prod-medium  07/23/08 16:09:37  None
27060  log.1370.backward.antinucleon.px-1  pochinsk  01:56:00  00:08:24  01:35:37   512  running  ANL-R05-M1-512    vn    2048  prod-medium  07/23/08 16:18:21  None
27068  log.435.backward.antinucleon.px0    pochinsk  01:09:00  00:40:31  00:19:32  2048  running  ANL-R02-R03-2048  vn    8192  prod-medium  07/23/08 17:34:27  None
27070  log.675.backward.antinucleon.px0    pochinsk  01:09:00  00:00:39  00:14:54  2048  running  ANL-R00-R01-2048  vn    8192  prod-medium  07/23/08 17:39:05  None
Sorry for the inconvenience.
Andrew Cherry
ALCF Support