[intrepid-notify] Job issues on Intrepid

Andrew Cherry acherry at alcf.anl.gov
Tue Feb 17 12:16:35 CST 2009


After yesterday's upgrade, we are encountering scheduling issues on  
Intrepid.  The job scheduler is attempting to start jobs on partitions  
that have been previously allocated for other jobs, causing failures.   
The symptoms of this problem are error messages like the ones below:

<Feb 17 11:45:40.214985> BE_MPI (ERROR): Error booting partition - INCOMPATIBLE_STATE
or
<Feb 17 10:42:58.858432> BE_MPI (ERROR): Current user is not the owner of the partition

This appears to be a timing issue or race condition, and the Cobalt  
developers are working to determine the root cause.  In the meantime,  
we have stopped further queueing in the prod queue to prevent job  
failures.  The prod-devel queue will remain active, since jobs there
run infrequently enough that they are not triggering the problem.  We
will update you again when we have a resolution.

As always, thank you for your patience, and we apologize for the  
problem.

ALCF Support Team




