[Llvm-bgq-discuss] Performance relative to Xeons

Tue Mar 5 13:17:01 CST 2013

The BGQ core is fully in-order with a short instruction pipeline and is
single-issue per hardware thread and dual-issue per core, provided the
ALU and FPU instructions come from two different hardware threads.
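
As a rough illustration of that issue rule, here is a toy cycle counter. It is a sketch under simplifying assumptions: each instruction is tagged only as "ALU" or "FPU", latencies, stalls and cache misses are ignored, and threads are polled in a fixed priority order. The function name `cycles_needed` is hypothetical.

```python
def cycles_needed(threads: list[list[str]]) -> int:
    """Toy BGQ issue model. threads[t] is hardware thread t's in-order
    instruction stream, each entry "ALU" or "FPU". Per cycle: at most one
    instruction per thread (single-issue per thread) and at most one ALU plus
    one FPU instruction per core, so a dual-issue pair always comes from two
    different threads. Latencies and stalls are ignored."""
    pos = [0] * len(threads)   # next instruction index for each thread
    cycles = 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        busy = set()           # units already claimed this cycle
        for i, stream in enumerate(threads):
            if pos[i] < len(stream) and stream[pos[i]] not in busy:
                busy.add(stream[pos[i]])
                pos[i] += 1
        cycles += 1
    return cycles

print(cycles_needed([["FPU", "ALU", "FPU", "ALU"]]))    # one thread: 4
print(cycles_needed([["FPU", "FPU"], ["ALU", "ALU"]]))  # two threads: 2
```

With one thread, four instructions take four cycles; split across one FPU-heavy and one ALU-heavy thread, the same four take two. Running a single thread per core leaves half of the issue slots idle in this model.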

A Xeon core is out-of-order with deep pipelines and can decode up to
four instructions per cycle.  The Internet refuses to tell me for
certain whether this means it is proper to call Sandy Bridge
quad-issue, but it seems that way.
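
A back-of-the-envelope comparison of peak instruction rates, as a sketch only. The 1.6 GHz BGQ clock is the one implied by the 2.8 GHz / 1.75 ratio in the quoted question, and the 4-wide Xeon issue width is the uncertain figure above, so treat these as upper bounds rather than predictions. The helper name is illustrative.

```python
# Toy peak-issue comparison. Real throughput depends on dependencies,
# cache misses and (on Xeon) out-of-order scheduling.
BGQ_CLOCK_GHZ = 1.6    # implied by 2.8 GHz / 1.75
XEON_CLOCK_GHZ = 2.8   # Xeon E5-1603

def peak_ginstr_per_s(clock_ghz: float, issue_width: int) -> float:
    """Upper bound on instructions issued per second, in billions."""
    return clock_ghz * issue_width

bgq_one_thread = peak_ginstr_per_s(BGQ_CLOCK_GHZ, 1)  # single-issue per thread
bgq_full_core = peak_ginstr_per_s(BGQ_CLOCK_GHZ, 2)   # ALU+FPU from 2 threads
xeon_core = peak_ginstr_per_s(XEON_CLOCK_GHZ, 4)      # assumes 4-wide issue

print(f"clock ratio only:           {XEON_CLOCK_GHZ / BGQ_CLOCK_GHZ:.2f}x")
print(f"Xeon vs BGQ, 1 thread/core: {xeon_core / bgq_one_thread:.2f}x")
print(f"Xeon vs BGQ, full core:     {xeon_core / bgq_full_core:.2f}x")
```

This prints 1.75x, 7.00x and 3.50x. Real code reaches neither peak, but the spread between the clock-only ratio and the issue-adjusted ones is one reason a purely clock-based estimate undershoots.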

The memory bandwidth measured by STREAM may be anywhere from 50% to
200% higher on an Intel Xeon than on BGQ.  BGQ does 25-30 GB/s, whereas
a late-model Xeon can do 80 GB/s.  If your code is bandwidth-limited,
it isn't surprising if a Xeon is ~2x faster.
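
To see why a DGEMM with only two right-hand sides is likely bandwidth-limited, here is a roofline-style sketch. It assumes the product is C[m x 2] += A[m x k] B[k x 2] in double precision and that each matrix streams from memory exactly once; 28 GB/s is a hypothetical midpoint of the 25-30 GB/s BGQ figure, and the function names are illustrative.

```python
def dgemm_2rhs_intensity(m: int, k: int) -> float:
    """Flops per byte for C[m x 2] += A[m x k] @ B[k x 2] in double precision,
    assuming each matrix moves through memory exactly once."""
    flops = 2 * m * k * 2                     # multiply + add per (i, l, j)
    bytes_moved = 8 * (m * k + k * 2 + 2 * m * 2)  # read A, B; read+write C
    return flops / bytes_moved

def bw_bound_gflops(intensity: float, bw_gb_s: float) -> float:
    """Roofline memory ceiling: attainable GFLOP/s when bandwidth-limited."""
    return intensity * bw_gb_s

ai = dgemm_2rhs_intensity(4000, 4000)
for name, bw in [("BGQ node", 28.0), ("late-model Xeon", 80.0)]:
    print(f"{name:16s} AI={ai:.3f} flop/B  "
          f"ceiling={bw_bound_gflops(ai, bw):5.1f} GFLOP/s")
```

The intensity comes out just under 0.5 flop/byte, giving memory ceilings of roughly 14 vs. 40 GFLOP/s, well below floating-point peak on either machine, so the ratio between them is simply the bandwidth ratio.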

In addition to normalizing w.r.t. clock rate, you should normalize
w.r.t. watts per socket.  BGQ uses 60-70 W per node unless you're
running HPL.  An Intel Xeon uses twice that just for the socket, not
including DRAM, IO, etc.
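
A minimal sketch of that per-watt normalization, assuming 65 W as the midpoint of the BGQ node figure and a Xeon socket drawing exactly twice that. Both numbers are rounded from the paragraph above, and the Xeon figure still excludes DRAM and IO, so it flatters the Xeon. The helper name is hypothetical.

```python
BGQ_NODE_W = 65.0               # midpoint of the 60-70 W per BGQ node
XEON_SOCKET_W = 2 * BGQ_NODE_W  # "twice that just for the socket"

def xeon_perf_per_watt_ratio(xeon_speedup: float) -> float:
    """Xeon performance-per-watt divided by BGQ performance-per-watt, given
    how many times faster one Xeon socket is than one BGQ node."""
    xeon = xeon_speedup / XEON_SOCKET_W
    bgq = 1.0 / BGQ_NODE_W
    return xeon / bgq

for speedup in (1.0, 2.0, 4.0):
    print(f"Xeon {speedup:.0f}x faster -> "
          f"{xeon_perf_per_watt_ratio(speedup):.2f}x perf/W")
```

Under these assumptions a Xeon socket that is 2x faster than a BGQ node only breaks even per watt; it has to be more than 2x faster to come out ahead.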

Note also that the BGQ QPX vector ISA is much more restrictive than
AVX w.r.t. alignment.  Additionally, the Intel compilers are way
better than IBM XL at vectorizing.
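
Assuming QPX vector loads need 32-byte (4 x double) alignment, a vectorizer typically peels scalar iterations until a pointer is aligned. The hypothetical helpers below count those iterations for a given address and show why two arrays with different misalignments cannot both be aligned by the same peel.

```python
VECTOR_BYTES = 32  # assumption: one QPX register = 4 doubles = 32 bytes
ELEM_BYTES = 8     # double precision

def peel_count(addr: int, elem_bytes: int = ELEM_BYTES,
               vector_bytes: int = VECTOR_BYTES) -> int:
    """Leading scalar iterations needed before addr + i*elem_bytes is
    vector-aligned. Returns -1 if no number of elements reaches alignment."""
    misalign = addr % vector_bytes
    if misalign == 0:
        return 0
    if misalign % elem_bytes != 0:
        return -1  # e.g. a double array that is not even 8-byte aligned
    return (vector_bytes - misalign) // elem_bytes

def same_peel_aligns_both(addr_a: int, addr_b: int) -> bool:
    """One peel prologue aligns both arrays only if they share a misalignment."""
    return addr_a % VECTOR_BYTES == addr_b % VECTOR_BYTES

print(peel_count(0x1000), peel_count(0x1008), peel_count(0x1018))  # 0 3 1
print(same_peel_aligns_both(0x1008, 0x2010))                        # False
```

AVX has unaligned load instructions (e.g. `vmovupd`), so a compiler can fall back to them when two arrays disagree; that fallback is part of why AVX is more forgiving.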

Finally, ESSL sucks compared to MKL.  That alone may be worth 2x in
LAPACK-intensive applications.
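
Whether a faster library turns into a 2x faster application depends on how much of the runtime it covers. A quick Amdahl's-law sketch, assuming (hypothetically) that MKL is exactly 2x faster than ESSL on the library calls themselves:

```python
def overall_speedup(lib_fraction: float, lib_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only the fraction of runtime
    spent in the math library becomes lib_speedup times faster."""
    return 1.0 / ((1.0 - lib_fraction) + lib_fraction / lib_speedup)

for f in (0.5, 0.9, 0.99):
    print(f"{f:.0%} of time in LAPACK, library 2x faster -> "
          f"{overall_speedup(f, 2.0):.2f}x overall")
```

Only when nearly all of the time is spent in LAPACK does a 2x library advantage approach 2x overall, which is what "LAPACK-intensive" implies.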

Jeff

On Tue, Mar 5, 2013 at 12:59 PM, Jack Poulson <jack.poulson at gmail.com> wrote:
> Hello,
>
> I have been benchmarking my code on Vesta and, while I have been seeing excellent
> strong scaling, I am a little underwhelmed by the wall-clock timings
> relative to my desktop (Intel(R) Xeon(R) CPU E5-1603 0 @ 2.80GHz). I am
> using the newest version of bgclang++ on Vesta, and g++-4.7.2 on my desktop
> (both used -O3), and I am seeing roughly a factor of four difference in
> performance on the same number of cores.
>
> If I ignored the fact that I am using a vendor math library on BGQ and
> reference implementations on my desktop, I would expect the BGQ timings to
> be a factor of 1.75 slower due to clock speed differences. Would anyone have
> an explanation for the additional factor of more than 2x? My algorithm
> spends most of its time in sin/cos/sqrt evaluations and dgemm with two
> right-hand sides.
>
> Thanks,
> Jack
>
> _______________________________________________
> llvm-bgq-discuss mailing list
> llvm-bgq-discuss at lists.alcf.anl.gov
> https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
>

-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond