[intrepid-notify] Job issues on Intrepid
Andrew Cherry
acherry at alcf.anl.gov
Tue Feb 17 12:16:35 CST 2009
After yesterday's upgrade, we are encountering scheduling issues on
Intrepid. The job scheduler is attempting to start jobs on partitions
that have already been allocated to other jobs, causing failures.
The symptoms of this problem are error messages like the ones below:
<Feb 17 11:45:40.214985> BE_MPI (ERROR): Error booting partition - INCOMPATIBLE_STATE
or
<Feb 17 10:42:58.858432> BE_MPI (ERROR): Current user is not the owner of the partition
This appears to be a timing issue or race condition, and the Cobalt
developers are working to determine the root cause. In the meantime,
we have stopped further queueing in the prod queue to prevent job
failures. The prod-devel queue will remain active, since jobs there
run infrequently enough that they are not hitting the problem. We will
update you again when we have a resolution.
As always, thank you for your patience, and we apologize for the
problem.
ALCF Support Team