So it turns out that running in c32 mode yields nearly a 2x speedup over c16 mode (one thread per process in both cases). Unfortunately, this raises another question.<br><br>My previous strong scaling test ran the same problem on 1, 2, 4, 8, 16, ..., 16384 processes, using c1 through c16 for the first tests and c16 for the rest.<br>
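<br>To make the test layout concrete, the sketch below maps each process count to a ranks-per-node mode and a node count. It assumes that a cN mode places N MPI ranks on each node; the function name <code>layout</code> and its cap parameter are illustrative and not part of any BGQ tool.<br>

```python
def layout(processes, max_ranks_per_node):
    """Return (ranks_per_node, nodes) for a power-of-two process count."""
    ranks_per_node = min(processes, max_ranks_per_node)
    nodes = processes // ranks_per_node
    return ranks_per_node, nodes

# Process counts 1, 2, 4, ..., 16384, as in the strong scaling test.
for exponent in range(15):
    p = 2 ** exponent
    c16 = layout(p, 16)   # original runs: c1 through c16, then c16
    c32 = layout(p, 32)   # hypothetical runs capped at c32 instead
    print(f"{p:6d} procs: c{c16[0]} on {c16[1]} nodes | c{c32[0]} on {c32[1]} nodes")
```

This only counts ranks and nodes; it says nothing about which cores the ranks actually land on.<br>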

<br>Since my code apparently benefits from using 2 MPI processes per core, I would like to run the equivalent tests. However, I'm not certain how to launch, for instance, two MPI processes on one node and have them both run on the same core. I could run on one node with c2 mode, but I think that this would be a bit dishonest, as I suspect that it would really make use of two cores.<br>

<br>Any ideas how to do this?<br><br>Jack<br><br><div class="gmail_quote">On Tue, Mar 5, 2013 at 12:27 PM, Jack Poulson <span dir="ltr">&lt;<a href="mailto:jack.poulson@gmail.com" target="_blank">jack.poulson@gmail.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">The code is almost certainly memory bandwidth limited, and 25 vs. 80 

GB/s would almost explain the 4x difference in performance (the >2x 

factor is *after* adjusting for the fact that BGQ's clock is 1.75x 

slower than my 2.8 GHz desktop).<br>
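<br>As a rough check of the arithmetic in this exchange, the sketch below divides the observed 4x slowdown by the 1.75x clock ratio and compares the remainder with the STREAM bandwidth figures quoted in the thread (25-30 GB/s on BGQ, 80 GB/s on a late-model Xeon). The variable names are illustrative, and the 80 GB/s figure is the thread's number for a late-model Xeon, which may not match the desktop used here.<br>

```python
# Figures quoted in the thread; treat them as rough estimates.
observed_slowdown = 4.0   # BGQ vs. desktop on the same number of cores
clock_ratio = 1.75        # desktop clock relative to BGQ, as quoted

# Slowdown left over after normalizing for clock speed.
residual = observed_slowdown / clock_ratio

# STREAM bandwidth ratios, Xeon relative to BGQ (assumed 80 GB/s Xeon).
xeon_bw = 80.0
bw_ratio_low = xeon_bw / 30.0
bw_ratio_high = xeon_bw / 25.0

print(f"residual slowdown after clock normalization: {residual:.2f}x")
print(f"bandwidth ratio range: {bw_ratio_low:.2f}x to {bw_ratio_high:.2f}x")
```

Under these numbers the residual factor falls below the quoted bandwidth ratios, so a bandwidth-limited code could plausibly account for it.<br>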

<br>Also, the desktop results were not using any vendor libraries at all. Just g++-4.7 with Ubuntu's stock math libraries.<br><br>Jack<br><br></div><div class="gmail_quote"><div class="im">On Tue, Mar 5, 2013 at 11:17 AM, Jeff Hammond <span dir="ltr">&lt;<a href="mailto:jhammond@alcf.anl.gov" target="_blank">jhammond@alcf.anl.gov</a>&gt;</span> wrote:<br>


</div><div><div class="h5"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The BGQ core is fully in-order with a short instruction pipeline and<br>

single-issue per hardware thread and dual-issue per core provided the<br>

ALU and FPU instructions come from two different hardware threads.<br>

<br>

A Xeon core is out-of-order with deep pipelines and can decode up to<br>

four instructions per cycle.  The Internet refuses to tell me for<br>

certain if this means that it is proper to say a Sandy Bridge is<br>

quad-issue, but it seems that way.<br>

<br>

The memory bandwidth measured by STREAM may be anywhere from 50% to 200%<br>

higher on Intel Xeon than BGQ.  BGQ does 25-30 GB/s whereas a late<br>

model Xeon can do 80 GB/s.  If your code is BW-limited, it isn't<br>

surprising if a Xeon is ~2x faster.<br>

<br>

In addition to normalizing w.r.t. clock-rate, you should normalize<br>

w.r.t. watts per socket.  BGQ uses 60-70W per node unless you're<br>

running HPL.  An Intel Xeon uses twice that just for the socket, not<br>

including DRAM, IO, etc.<br>

<br>

Note also that the BGQ QPX vector ISA is much more restrictive than<br>

AVX w.r.t. alignment.  Additionally, the Intel compilers are way<br>

better than IBM XL at vectorizing.<br>

<br>

Finally, ESSL sucks compared to MKL.  That alone may be worth 2x in<br>

LAPACK-intensive applications.<br>

<br>

Jeff<br>

<div><div><br>

On Tue, Mar 5, 2013 at 12:59 PM, Jack Poulson &lt;<a href="mailto:jack.poulson@gmail.com" target="_blank">jack.poulson@gmail.com</a>&gt; wrote:<br>

> Hello,<br>

><br>

> I have been benchmarking my code on Vesta and, while I have been seeing excellent<br>

> strong scaling, I am a little underwhelmed by the wall-clock timings<br>

> relative to my desktop (Intel(R) Xeon(R) CPU E5-1603 0 @ 2.80GHz). I am<br>

> using the newest version of bgclang++ on Vesta, and g++-4.7.2 on my desktop<br>

> (both used -O3), and I am seeing roughly a factor of four difference in<br>

> performance on the same number of cores.<br>

><br>

> If I ignored the fact that I am using a vendor math library on BGQ and<br>

> reference implementations on my desktop, I would expect the BGQ timings to<br>

> be a factor of 1.75 slower due to clockspeed differences. Would anyone have<br>

> an explanation for the additional factor of more than 2x? My algorithm<br>

> spends most of its time in sin/cos/sqrt evaluations and dgemm with two<br>

> right-hand sides.<br>

><br>

> Thanks,<br>

> Jack<br>

><br>

</div></div><div><div>> _______________________________________________<br>

> llvm-bgq-discuss mailing list<br>

> <a href="mailto:llvm-bgq-discuss@lists.alcf.anl.gov" target="_blank">llvm-bgq-discuss@lists.alcf.anl.gov</a><br>

> <a href="https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss" target="_blank">https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss</a><br>

><br>

<br>

<br>

<br>

</div></div><span><font color="#888888">--<br>

Jeff Hammond<br>

Argonne Leadership Computing Facility<br>

University of Chicago Computation Institute<br>

<a href="mailto:jhammond@alcf.anl.gov" target="_blank">jhammond@alcf.anl.gov</a> / <a href="tel:%28630%29%20252-5381" value="+16302525381" target="_blank">(630) 252-5381</a><br>

<a href="http://www.linkedin.com/in/jeffhammond" target="_blank">http://www.linkedin.com/in/jeffhammond</a><br>

<a href="https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond" target="_blank">https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond</a><br>

</font></span></blockquote></div></div></div><br>

</blockquote></div><br>