[notify] Update on intrepid situation - 100T back in service
Tisha Stacey
tstacey at alcf.anl.gov
Fri Apr 18 16:25:24 CDT 2008
The 100T (row 0) has been restored to service. The 500T (rows 1-4) remains disabled as we continue to work on it. We will send another notice when the 500T has been returned to service.
Thank you,
ALCF Systems Team
Andrew Cherry wrote:
| We are continuing to have issues on intrepid after restarting it this
| afternoon. We have determined that the issues are triggered by
| booting large numbers of nodes at the same time, which can cause
| partitions to time out and fail to boot properly. Since we have not
| had difficulty booting single rows, we have restored row 0 (the 100T)
| to service in order to allow production jobs to run overnight. For
| the time being, we will be leaving the 500T (rows 1-4) disabled until
| we can get a better understanding of the problem, in order to avoid
| further impacting production use of the system. We have reserved
| all 5 rows of intrepid tomorrow morning at 9 in order to perform
| more troubleshooting, and hopefully get down to the root cause of
| this issue.
|
| Thank you for your continued patience.
|
| ALCF Support
|
| On Apr 17, 2008, at 11:29 AM, Andrew Cherry wrote:
|
|> Late last night, we identified a problem with the BG/P
|> environmental monitoring on intrepid that needs to be addressed.
|> Unfortunately, there is a possibility that this may result in
|> having to restart the control system again (we are still working
|> with IBM to try to cover all other options before we resort to a
|> control system restart). We have therefore reserved the entire
|> system for a possible restart at 2 PM. New jobs queued on
|> intrepid will not be started if they are long enough to run past
|> 2 PM. We don't expect any currently running production jobs to be
|> impacted (since they will be finished by the time the work begins),
|> but long-running early science jobs may need to be killed if they
|> are still running and we determine that the restart is needed.
|>
|> Access to the login nodes and the filesystems will not be impacted
|> by this work - this will only affect the BlueGene itself.
|>
|> We will send out another note when we know for sure what the impact
|> will be.
|>
|> Thanks...
|>
|> -Andrew Cherry
|> ALCF Support
|>