The code is almost certainly memory-bandwidth limited, and 25 vs. 80
GB/s would almost explain the 4x difference in performance (the >2x
factor is *after* adjusting for the fact that BGQ's clock is 1.75x
slower than my 2.8 GHz desktop).<br>
<br>Also, the desktop results were not using any vendor libraries at all. Just g++-4.7 with Ubuntu's stock math libraries.<br><br>Jack<br><br><div class="gmail_quote">On Tue, Mar 5, 2013 at 11:17 AM, Jeff Hammond <span dir="ltr"><<a href="mailto:jhammond@alcf.anl.gov" target="_blank">jhammond@alcf.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The BGQ core is fully in-order with a short instruction pipeline and<br>
single-issue per hardware thread and dual-issue per core provided the<br>
ALU and FPU instructions come from two different hardware threads.<br>
<br>
A Xeon core is out-of-order with deep pipelines and can decode up to<br>
four instructions per cycle. The Internet refuses to tell me for<br>
certain if this means that it is proper to say a Sandy Bridge is<br>
quad-issue, but it seems that way.<br>
<br>
The memory bandwidth measured by STREAM may be anywhere from 50% to 200%<br>
higher on Intel Xeon than BGQ. BGQ does 25-30 GB/s whereas a<br>
late-model Xeon can do 80 GB/s. If your code is BW-limited, it isn't<br>
surprising if a Xeon is ~2x faster.<br>
<br>
In addition to normalizing w.r.t. clock-rate, you should normalize<br>
w.r.t. watts per socket. BGQ uses 60-70W per node unless you're<br>
running HPL. An Intel Xeon uses twice that just for the socket, not<br>
including DRAM, IO, etc.<br>
<br>
Note also that the BGQ QPX vector ISA is much more restrictive than<br>
AVX w.r.t. alignment. Additionally, the Intel compilers are way<br>
better than IBM XL at vectorizing.<br>
<br>
Finally, ESSL sucks compared to MKL. That alone may be worth 2x in<br>
LAPACK-intensive applications.<br>
<br>
Jeff<br>
<div class="HOEnZb"><div class="h5"><br>
On Tue, Mar 5, 2013 at 12:59 PM, Jack Poulson <<a href="mailto:jack.poulson@gmail.com">jack.poulson@gmail.com</a>> wrote:<br>
> Hello,<br>
><br>
> I have been benchmarking my code on Vesta and, while I have been seeing excellent<br>
> strong scaling, I am a little underwhelmed by the wall-clock timings<br>
> relative to my desktop (Intel(R) Xeon(R) CPU E5-1603 0 @ 2.80GHz). I am<br>
> using the newest version of bgclang++ on Vesta, and g++-4.7.2 on my desktop<br>
> (both used -O3), and I am seeing roughly a factor of four difference in<br>
> performance on the same number of cores.<br>
><br>
> If I ignored the fact that I am using a vendor math library on BGQ and<br>
> reference implementations on my desktop, I would expect the BGQ timings to<br>
> be a factor of 1.75 slower due to clockspeed differences. Would anyone have<br>
> an explanation for the additional factor of more than 2x? My algorithm<br>
> spends most of its time in sin/cos/sqrt evaluations and dgemm with two<br>
> right-hand sides.<br>
><br>
> Thanks,<br>
> Jack<br>
><br>
</div></div><div class="HOEnZb"><div class="h5">> _______________________________________________<br>
> llvm-bgq-discuss mailing list<br>
> <a href="mailto:llvm-bgq-discuss@lists.alcf.anl.gov">llvm-bgq-discuss@lists.alcf.anl.gov</a><br>
> <a href="https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss" target="_blank">https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss</a><br>
><br>
<br>
<br>
<br>
</div></div><span class="HOEnZb"><font color="#888888">--<br>
Jeff Hammond<br>
Argonne Leadership Computing Facility<br>
University of Chicago Computation Institute<br>
<a href="mailto:jhammond@alcf.anl.gov">jhammond@alcf.anl.gov</a> / <a href="tel:%28630%29%20252-5381" value="+16302525381">(630) 252-5381</a><br>
<a href="http://www.linkedin.com/in/jeffhammond" target="_blank">http://www.linkedin.com/in/jeffhammond</a><br>
<a href="https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond" target="_blank">https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond</a><br>
</font></span></blockquote></div><br>
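<br>
As a rough sanity check on the arithmetic in this thread, the sketch below recomputes the ratios from the figures quoted above. It is only a back-of-envelope calculation, not a measurement. The constant and function names are hypothetical, and it assumes the 80 GB/s "late-model Xeon" STREAM figure applies to the desktop, which the thread does not confirm.<br>

```python
# Back-of-envelope check of the figures quoted in this thread.
# Every input is an approximate value taken from the emails, not a new measurement.

DESKTOP_CLOCK_GHZ = 2.8         # Xeon E5-1603 clock from the original post
CLOCK_RATIO = 1.75              # desktop clock / BGQ clock, as stated in the thread
OBSERVED_SLOWDOWN = 4.0         # "roughly a factor of four" on the same core count
BGQ_STREAM_GBS = (25.0, 30.0)   # BGQ STREAM range quoted in the reply
XEON_STREAM_GBS = 80.0          # ASSUMPTION: the quoted Xeon figure applies to the desktop


def implied_bgq_clock_ghz():
    """BGQ clock implied by the 1.75x ratio and the 2.8 GHz desktop."""
    return DESKTOP_CLOCK_GHZ / CLOCK_RATIO


def residual_after_clock():
    """Slowdown left unexplained once the clock difference is divided out."""
    return OBSERVED_SLOWDOWN / CLOCK_RATIO


def bandwidth_ratio_range():
    """(low, high) Xeon-to-BGQ STREAM bandwidth ratio over the quoted BGQ range."""
    bgq_low, bgq_high = BGQ_STREAM_GBS
    return XEON_STREAM_GBS / bgq_high, XEON_STREAM_GBS / bgq_low


if __name__ == "__main__":
    ratio_low, ratio_high = bandwidth_ratio_range()
    print(f"implied BGQ clock:        {implied_bgq_clock_ghz():.2f} GHz")
    print(f"residual after clock:     {residual_after_clock():.2f}x")
    print(f"STREAM bandwidth ratio:   {ratio_low:.2f}x to {ratio_high:.2f}x")
```

If the code really is bandwidth-bound, the bandwidth ratio is the number to compare against the observed 4x, rather than the clock-normalized residual, since a bandwidth-bound kernel does not scale with core clock.<br>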