[intrepid-notify] Cobalt issues on Intrepid

Andrew Cherry acherry at alcf.anl.gov
Tue Sep 9 19:37:41 CDT 2008


FYI, we encountered a problem with the Cobalt job scheduler on
Intrepid that required us to restart Cobalt.  No jobs have been killed
by the restart, but Cobalt will take about 30 minutes to reinitialize,
and no new jobs will start during that time.

Because Cobalt loses state information about running jobs during a
restart, the output of qstat may not accurately reflect the current
state of the system.  For the next 30 minutes or so, the list of
"running" jobs in qstat may include some jobs that have already
completed.  Once initialization is done, all jobs that were running at
the time of the restart will disappear from qstat's output, even if
they are actually still running on the system.  For a while, parts of
the system may therefore be blocked even though qstat shows no jobs
running on them.  In that case, you can see the "hidden" jobs by
running the bg-listjobs and bg-listblocks commands.

I will set up cron jobs to kill off the remaining "unmanaged" jobs if  
they go over their requested walltime.
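
As a rough illustration of the walltime check described above, here is a
minimal Python sketch.  It is not the actual cron job; the function names
are hypothetical, and it assumes the HH:MM:SS walltime format shown in
qstat and a start time that has already been parsed into a datetime (the
year 2008 is taken from the date of this message).

```python
from datetime import datetime, timedelta


def parse_walltime(walltime):
    """Convert an HH:MM:SS walltime string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def is_over_walltime(start, walltime, now):
    """Return True if a job started at `start` has run longer than `walltime`."""
    return now - start > parse_walltime(walltime)


# Example values: a job started 09/09 18:03:03 with a 04:00:00 walltime.
start = datetime(2008, 9, 9, 18, 3, 3)
print(is_over_walltime(start, "04:00:00", datetime(2008, 9, 9, 19, 37, 41)))
print(is_over_walltime(start, "04:00:00", datetime(2008, 9, 9, 22, 3, 4)))
```

The first call checks the job at the time this message was sent, when it
is still within its walltime; the second checks it one second past the
four-hour limit.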

To make things easier, I've included a list of jobs from qstat and
their corresponding bg-listjobs entries.  If you are checking the
status of one of your jobs and don't see it in the qstat/cqstat
output, check bg-listjobs to see whether it is actually still
running.

qstat:    43557  chulwoo   04:00:00  1024   running    ANL-R42-1024
listjobs: 96755  chulwoo  ANL-R42-1024    00/00/   R 09/09 18:03:03

qstat:    43558  chulwoo   04:00:00  1024   running    ANL-R34-1024
listjobs: 96756  chulwoo  ANL-R34-1024    00/00/   R 09/09 18:03:31

qstat:    43670  fkhalili  11:40:00  1024   running    ANL-R36-1024
listjobs: 96757  fkhalili ANL-R36-1024    00/00/   R 09/09 18:03:59

qstat:    43656  fkhalili  04:00:00  512    running    ANL-R35-M0-512
listjobs: 96761  fkhalili ANL-R35-M0-512  00/00/   R 09/09 18:04:56

qstat:    43657  fkhalili  04:00:00  512    running    ANL-R35-M1-512
listjobs: 96760  fkhalili ANL-R35-M1-512  00/00/   R 09/09 18:04:51

qstat:    43388  chulwoo   04:00:00  2048   running    ANL-R46-R47-2048
listjobs: 96758  chulwoo  ANL-R46-R47-204 00/00/   R 09/09 18:04:01

qstat:    43668  fkhalili  10:00:00  2048   running    ANL-R40-R41-2048
listjobs: 96759  fkhalili ANL-R40-R41-204 00/00/   R 09/09 18:04:34

qstat:    43661  fkhalili  04:00:00  512    running    ANL-R27-M0-512
listjobs: 96767  fkhalili ANL-R27-M0-512  00/00/   R 09/09 18:05:51

qstat:    43293  chulwoo   03:00:00  2048   running    ANL-R44-R45-2048
listjobs: 96762  chulwoo  ANL-R44-R45-204 00/00/   R 09/09 18:05:20

qstat:    43658  fkhalili  04:00:00  512    running    ANL-R37-M1-512
listjobs: 96763  fkhalili ANL-R37-M1-512  00/00/   R 09/09 18:05:26

qstat:    43660  fkhalili  04:00:00  512    running    ANL-R27-M1-512
listjobs: 96764  fkhalili ANL-R37-M0-512  00/00/   R 09/09 18:05:27

qstat:    43660  fkhalili  04:00:00  512    running    ANL-R27-M1-512
listjobs: 96766  fkhalili ANL-R27-M1-512  00/00/   R 09/09 18:05:28

qstat:    43663  fkhalili  04:00:00  512    running    ANL-R25-M0-512
listjobs: 96769  fkhalili ANL-R25-M0-512  00/00/   R 09/09 18:06:08

qstat:    43662  fkhalili  04:00:00  512    running    ANL-R25-M1-512
listjobs: 96768  fkhalili ANL-R25-M1-512  00/00/   R 09/09 18:06:08

qstat:    43654  fkhalili  03:00:00  512    running    ANL-R07-M0-512
listjobs: 96771  fkhalili ANL-R07-M0-512  00/00/   R 09/09 18:07:43

qstat:    43677  gottlieb  09:15:00  8192   running    ANL-R10-R17-8192
listjobs: 96770  gottlieb ANL-R10-R17-819 00/00/   R 09/09 18:07:42
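
If you want to look up pairs from the list above in a script, the
following Python sketch may help.  It only parses the "qstat:" /
"listjobs:" layout used in this message, not the raw output of either
command, and the function name pair_jobs is hypothetical.  It returns a
list rather than a dictionary so that repeated qstat IDs are not silently
dropped.

```python
def pair_jobs(listing):
    """Return (cobalt_job_id, bg_job_id) tuples from qstat:/listjobs: line pairs."""
    pairs = []
    cobalt_id = None
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or unrelated line
        if fields[0] == "qstat:":
            cobalt_id = fields[1]  # remember the ID until its listjobs line
        elif fields[0] == "listjobs:" and cobalt_id is not None:
            pairs.append((cobalt_id, fields[1]))
            cobalt_id = None
    return pairs


SAMPLE = """\
qstat:    43557  chulwoo   04:00:00  1024   running    ANL-R42-1024
listjobs: 96755  chulwoo  ANL-R42-1024    00/00/   R 09/09 18:03:03

qstat:    43388  chulwoo   04:00:00  2048   running    ANL-R46-R47-2048
listjobs: 96758  chulwoo  ANL-R46-R47-204 00/00/   R 09/09 18:04:01
"""

for cobalt_id, bg_id in pair_jobs(SAMPLE):
    print(cobalt_id, "->", bg_id)
```

Note that the listjobs lines show long partition names cut off (for
example ANL-R46-R47-204), so the sketch pairs entries by position rather
than by partition name.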


Completed jobs as of the time of this writing (these will remain in  
the qstat output until Cobalt is done initializing):

43655  gottlieb  02:00:00  2048   running    ANL-R04-R05-2048
43659  fkhalili  04:00:00  512    running    ANL-R37-M0-512
43671  fkhalili  11:40:00  1024   running    ANL-R26-1024




