[intrepid-notify] Cobalt issues on Intrepid
Andrew Cherry
acherry at alcf.anl.gov
Tue Sep 9 19:37:41 CDT 2008
FYI, we encountered a problem with the Cobalt job scheduler on
Intrepid that required us to restart Cobalt. Although no jobs have
been killed by this process, it will take about 30 minutes for Cobalt
to reinitialize. During this time, no new jobs will start. In
addition, since Cobalt loses state information about running jobs
during a restart, the output of qstat may not accurately reflect the
current state of the system. During the next 30 minutes or so, the
list of "running" jobs in qstat may actually show some jobs that have
already completed. After initialization is done, all jobs that were
running at the time of the restart will disappear from qstat's output
(even if they are actually still running on the system). As a result,
there may be a period of time where parts of the system are blocked
but you don't see jobs running on them in the qstat output. In this
case, you can see the "hidden" jobs by running the bg-listjobs and
bg-listblocks commands.
I will set up cron jobs to kill off the remaining "unmanaged" jobs if
they go over their requested walltime.
To make things easier, I've included a list of jobs from qstat and
their corresponding bg-listjobs entries. If you are checking the
status of one of your jobs and don't see it in the qstat/cqstat
output, double-check bg-listjobs first to see if it is actually still
running.
qstat: 43557 chulwoo 04:00:00 1024 running ANL-R42-1024
listjobs: 96755 chulwoo ANL-R42-1024 00/00/ R 09/09 18:03:03
qstat: 43558 chulwoo 04:00:00 1024 running ANL-R34-1024
listjobs: 96756 chulwoo ANL-R34-1024 00/00/ R 09/09 18:03:31
qstat: 43670 fkhalili 11:40:00 1024 running ANL-R36-1024
listjobs: 96757 fkhalili ANL-R36-1024 00/00/ R 09/09 18:03:59
qstat: 43656 fkhalili 04:00:00 512 running ANL-R35-M0-512
listjobs: 96761 fkhalili ANL-R35-M0-512 00/00/ R 09/09 18:04:56
qstat: 43657 fkhalili 04:00:00 512 running ANL-R35-M1-512
listjobs: 96760 fkhalili ANL-R35-M1-512 00/00/ R 09/09 18:04:51
qstat: 43388 chulwoo 04:00:00 2048 running ANL-R46-R47-2048
listjobs: 96758 chulwoo ANL-R46-R47-204 00/00/ R 09/09 18:04:01
qstat: 43668 fkhalili 10:00:00 2048 running ANL-R40-R41-2048
listjobs: 96759 fkhalili ANL-R40-R41-204 00/00/ R 09/09 18:04:34
qstat: 43661 fkhalili 04:00:00 512 running ANL-R27-M0-512
listjobs: 96767 fkhalili ANL-R27-M0-512 00/00/ R 09/09 18:05:51
qstat: 43293 chulwoo 03:00:00 2048 running ANL-R44-R45-2048
listjobs: 96762 chulwoo ANL-R44-R45-204 00/00/ R 09/09 18:05:20
qstat: 43658 fkhalili 04:00:00 512 running ANL-R37-M1-512
listjobs: 96763 fkhalili ANL-R37-M1-512 00/00/ R 09/09 18:05:26
qstat: 43660 fkhalili 04:00:00 512 running ANL-R27-M1-512
listjobs: 96764 fkhalili ANL-R37-M0-512 00/00/ R 09/09 18:05:27
qstat: 43660 fkhalili 04:00:00 512 running ANL-R27-M1-512
listjobs: 96766 fkhalili ANL-R27-M1-512 00/00/ R 09/09 18:05:28
qstat: 43663 fkhalili 04:00:00 512 running ANL-R25-M0-512
listjobs: 96769 fkhalili ANL-R25-M0-512 00/00/ R 09/09 18:06:08
qstat: 43662 fkhalili 04:00:00 512 running ANL-R25-M1-512
listjobs: 96768 fkhalili ANL-R25-M1-512 00/00/ R 09/09 18:06:08
qstat: 43654 fkhalili 03:00:00 512 running ANL-R07-M0-512
listjobs: 96771 fkhalili ANL-R07-M0-512 00/00/ R 09/09 18:07:43
qstat: 43677 gottlieb 09:15:00 8192 running ANL-R10-R17-8192
listjobs: 96770 gottlieb ANL-R10-R17-819 00/00/ R 09/09 18:07:42
Completed jobs as of the time of this writing (these will remain in
the qstat output until Cobalt is done initializing):
43655 gottlieb 02:00:00 2048 running ANL-R04-R05-2048
43659 fkhalili 04:00:00 512 running ANL-R37-M0-512
43671 fkhalili 11:40:00 1024 running ANL-R26-1024
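
If you want to do the qstat-to-listjobs matching above yourself, here
is a minimal sketch in Python. It assumes the column layout shown in
the listing in this message (which may not be the exact output format
of either command), and it assumes that bg-listjobs truncates
partition names to 15 characters, as the entries above suggest. The
function names are made up for illustration.

```python
# Sketch only: the field order is taken from the listing in this message,
# not from the documented output of qstat or bg-listjobs.

# Assumption: bg-listjobs shows at most 15 characters of the partition name
# (e.g. "ANL-R46-R47-2048" appears as "ANL-R46-R47-204").
LISTJOBS_NAME_WIDTH = 15


def parse_qstat(line):
    """Parse a line of the form: jobid user walltime nodes state partition."""
    jobid, user, walltime, nodes, state, partition = line.split()
    return {"jobid": jobid, "user": user, "partition": partition}


def parse_listjobs(line):
    """Parse a line whose first three fields are: jobid user partition."""
    fields = line.split()
    return {"jobid": fields[0], "user": fields[1], "partition": fields[2]}


def match_jobs(qstat_lines, listjobs_lines):
    """Pair each qstat job with the listjobs entry for the same user and partition.

    Returns a list of (qstat_jobid, listjobs_jobid) tuples; listjobs_jobid is
    None when no entry matches.
    """
    by_key = {}
    for line in listjobs_lines:
        entry = parse_listjobs(line)
        by_key[(entry["user"], entry["partition"])] = entry["jobid"]

    pairs = []
    for line in qstat_lines:
        job = parse_qstat(line)
        # Truncate the qstat partition name so it compares equal to the
        # (possibly truncated) name shown by bg-listjobs.
        key = (job["user"], job["partition"][:LISTJOBS_NAME_WIDTH])
        pairs.append((job["jobid"], by_key.get(key)))
    return pairs
```

For example, passing the qstat and listjobs lines for chulwoo's job on
ANL-R46-R47-2048 from the list above pairs qstat job 43388 with
listjobs entry 96758.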