<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Update:<div><br></div><div>Because of system instabilities that were resolved by reverting to V1R2M0, we have aborted the driver upgrade and are back on the previous driver. We will schedule test windows with IBM to test the new driver more thoroughly before switching over to it in the future. If you have recompiled your code, you will need to either restore your pre-upgrade executables as described below or recompile your application.</div><div><br></div><div>It looks as if most of the jobs in the queue on Intrepid have executables that are older than one week, so we are releasing the queues for use. As a precaution, I have put user holds on the following jobs, whose executables have modification times newer than Monday:</div><div><br></div><div><div>94288:gjordan:/gpfs1/gjordan/2009/mass_study/8km_16o20/rundir_0004/flash3</div><div>94294:gjordan:/gpfs1/gjordan/2009/extreme_preexpansion/8km_multi_63_128o148_m1385/rundir_0010/flash3</div><div>94324:ckerr:/gpfs1/ckerr/c720-hiram.dev/scripts/runscript</div><div>94328:rajeshn:/gpfs/home/rajeshn/temp/install/bin/./hello</div><div>94335:chulwoo:/gpfs/home/chulwoo/CPS/v5_0_5-tw/reweight/BGP.x.v1</div><div>94337:lynnreid:/home/lynnreid/datadir/extreme_preexpansion/8km_63_128o148_1365_1/rundir_0002/flash3</div><div>94342:knomura:/gpfs/home/knomura/stencil/afd3_mpi_thread/afd3_mpi</div><div>94343:spanu:/gpfs/home/spanu/TurboRVB/bin/turborvb-mpi.x</div><div>94382:detar:/home/detar/kappa_tune/l48144f21b747m0036m018/SCRIPTS_detar/batch_kappa_0_000545.sh</div><div>94383:detar:/home/detar/kappa_tune/l48144f21b747m0036m018/SCRIPTS_detar/batch_kappa_36_000545.sh</div><div>94384:detar:/home/detar/kappa_tune/l48144f21b747m0036m018/SCRIPTS_detar/batch_kappa_72_000545.sh</div><div>94385:detar:/home/detar/kappa_tune/l48144f21b747m0036m018/SCRIPTS_detar/batch_kappa_108_000545.sh</div><div>94386:chan:/gpfs/home/chan/ts_abort/halfsleep_mpiabort</div><div>94387:kweide:/gpfs1/kweide/2009/extreme_preexpansion/8km_79_138o148_1365_1/rundir_0001/flash3</div><div>94389:chulwoo:/gpfs/home/chulwoo/CPS/v5_0_5-tw/reweight/BGP.x.v2</div><div><br></div><div>If you have a job on the list above, I recommend you restore your original executables before releasing the hold on your job.</div><div><br></div><div>We apologize for any inconvenience, and again, thank you for your patience. Please email <a href="mailto:support@alcf.anl.gov">support@alcf.anl.gov</a> if you have further questions or concerns.</div><div><br></div><div>Andrew Cherry</div><div>ALCF Support</div><div><br></div><div><div>On Jan 23, 2009, at 3:39 PM, Andrew Cherry wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>If you've recompiled any of your code for the new BlueGene V1R3M0 driver on Intrepid, or are in the process of doing so, please read this note. </div><div><br></div><div>As mentioned in the communication below, there is a possibility that we will back out of the new V1R3M0 driver installation on Intrepid if we don't find a resolution to the problems we are seeing by the close of business today. One implication of this is that binaries rebuilt for the new driver will most likely not work on the old one. If you are rebuilding your code, please make sure to keep a copy of your old binaries in case we need to revert.</div><div><br></div><div>If you have already rebuilt your code and overwritten or deleted your original binaries, it is still possible to recover your original files from one of the GPFS snapshots. If you cd into the .snapshots directory in your home directory, you will find several directories named after days of the week.
Each of these contains a snapshot of your home directory from the last occurrence of that day; for example, if you need to recover your files from last Saturday, you'll find them in the "saturday" directory. Each snapshot is kept for only one week (i.e., the "saturday" directory is reused every Saturday), so there's limited time to copy data out of them. Unfortunately, we didn't get a snapshot on Sunday this past week, so if you need something older than Monday's copy you'll need to get it from the Saturday copy. If you do need to get any files from last Saturday's or Monday's snapshots, I suggest you do so today before the snapshots are overwritten tomorrow and Monday.</div><div><br></div><div>Sorry for any inconvenience. We will let you know later today if we end up reverting to the old driver.</div><div><br></div><div>Andrew Cherry</div><div>ALCF Support</div><div><br></div><div><div>On Jan 22, 2009, at 10:58 PM, <a href="mailto:tstacey@alcf.anl.gov">tstacey@alcf.anl.gov</a> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>We continue to experience extremely unstable behavior when running jobs<br>at scale on the new V1R3M0 driver. Debugging has been complicated by<br>diagnostics calling out intermittent hardware issues. We currently have<br>a level 1 PMR (very high priority trouble ticket) open with IBM, and they<br>are working on this problem. 
If we have not been able to resolve the<br>problem by tomorrow afternoon, we will revert to our previous<br>driver so that jobs can run over the weekend.<br><br>We apologize for any inconvenience this outage has caused and are<br>grateful for your continued patience as we work through this issue.<br>Please contact <a href="mailto:support@alcf.anl.gov">support@alcf.anl.gov</a> if you have any further questions.<br><br>Thanks, The ALCF Support Team<br>_______________________________________________<br>intrepid-notify mailing list<br><a href="mailto:intrepid-notify@alcf.anl.gov">intrepid-notify@alcf.anl.gov</a><br><a href="http://lists.alcf.anl.gov/cgi-bin/mailman/listinfo/intrepid-notify">http://lists.alcf.anl.gov/cgi-bin/mailman/listinfo/intrepid-notify</a><br></div></blockquote></div><br></div></blockquote></div><br></div></body></html>