[intrepid-notify] Extended maintenance period beginning Jan. 11, 2010

Tisha Stacey tstacey at alcf.anl.gov
Tue Jan 5 16:08:49 CST 2010


At 8:00 a.m. (U.S. Central time) on Monday, January 11, 2010, we will take Intrepid and Eureka down for an extended maintenance period.  We expect the downtime to last approximately 3.5 days (our target system release is 12:00 noon on Thursday, January 14).  There will be no access to Intrepid and Eureka at all, including the login nodes.

During this time, we will be replacing all the remaining Zarlink optical transceivers in our Myricom data network.  Working with Myricom, we have determined that our Zarlink transceivers came from a bad wafer, and that these faulty transceivers have been the source of many of our network and filesystem instability issues.  We have previously tried to "weed out" bad Zarlink transceivers without success.  This time, we are replacing all of them with Avago transceivers, which have run flawlessly since the system first came up.  While we apologize for the extended downtime, we're confident that this work will result in a much more stable system.

There will be a brief maintenance period on the following Monday, January 18, which we expect to last no longer than two hours.

We are very sorry for the extremely short notice.  We realize that this will be very inconvenient for some of you, but we assure you that the work is necessary, and we're optimistic that it will bring significant improvement.

Please contact us with any questions or concerns.

Thank you,
The ALCF Support Team

