[Llvm-bgq-discuss] Performance relative to Xeons

Jack Poulson jack.poulson at gmail.com
Tue Mar 5 14:27:22 CST 2013


The code is almost certainly memory bandwidth limited, and 25 vs. 80 GB/s
would almost explain the 4x difference in performance (the >2x factor is
*after* adjusting for the fact that BGQ's clock is 1.75x slower than my 2.8
GHz desktop).
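
As a rough sanity check on the figures quoted in this thread, here is a minimal sketch of the arithmetic. The 1.6 GHz BGQ clock is an assumption implied by the 1.75x ratio against the 2.8 GHz desktop. The 80 GB/s figure is Jeff's number for a late-model Xeon and may not match this particular desktop. All variable names are illustrative.

```python
# Back-of-the-envelope check of the numbers quoted in this thread.
XEON_CLOCK_GHZ = 2.8           # desktop E5-1603, from the original post
BGQ_CLOCK_GHZ = 1.6            # assumption: implied by the 1.75x clock ratio
OBSERVED_SLOWDOWN = 4.0        # "roughly a factor of four" on equal core counts
XEON_STREAM_GBS = 80.0         # Jeff's figure for a late-model Xeon
BGQ_STREAM_GBS = (25.0, 30.0)  # Jeff's range for BGQ


def residual_factor(observed, clock_ratio):
    """Slowdown left unexplained after normalizing by clock rate."""
    return observed / clock_ratio


clock_ratio = XEON_CLOCK_GHZ / BGQ_CLOCK_GHZ
residual = residual_factor(OBSERVED_SLOWDOWN, clock_ratio)
bw_ratios = tuple(XEON_STREAM_GBS / bgq for bgq in BGQ_STREAM_GBS)

print(f"clock ratio:            {clock_ratio:.2f}x")
print(f"residual after clock:   {residual:.2f}x")
print(f"STREAM bandwidth ratio: {bw_ratios[1]:.2f}x to {bw_ratios[0]:.2f}x")
```

Under these assumptions the residual factor after clock normalization is a bit over 2x, while the quoted STREAM bandwidth ratio alone is between roughly 2.7x and 3.2x, which is why a bandwidth-bound code could plausibly account for most of the gap.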

Also, the desktop results were not using any vendor libraries at all. Just
g++-4.7 with Ubuntu's stock math libraries.

Jack

On Tue, Mar 5, 2013 at 11:17 AM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:

> The BGQ core is fully in-order with a short instruction pipeline.  It is
> single-issue per hardware thread and dual-issue per core, provided the
> ALU and FPU instructions come from two different hardware threads.
>
> A Xeon core is out-of-order with deep pipelines and can decode up to
> four instructions per cycle.  The Internet refuses to tell me for
> certain if this means that it is proper to say a Sandy Bridge is
> quad-issue, but it seems that way.
>
> The memory bandwidth measured by STREAM may be anywhere from 50% to 200%
> higher on Intel Xeon than BGQ.  BGQ does 25-30 GB/s, whereas a
> late-model Xeon can do 80 GB/s.  If your code is BW-limited, it isn't
> surprising if a Xeon is ~2x faster.
>
> In addition to normalizing w.r.t. clock-rate, you should normalize
> w.r.t. watts per socket.  BGQ uses 60-70W per node unless you're
> running HPL.  An Intel Xeon uses twice that just for the socket, not
> including DRAM, IO, etc.
>
> Note also that the BGQ QPX vector ISA is much more restrictive than
> AVX w.r.t. alignment.  Additionally, the Intel compilers are way
> better than IBM XL at vectorizing.
>
> Finally, ESSL sucks compared to MKL.  That alone may be worth 2x in
> LAPACK-intensive applications.
>
> Jeff
>
> On Tue, Mar 5, 2013 at 12:59 PM, Jack Poulson <jack.poulson at gmail.com> wrote:
> > Hello,
> >
> > I have been benchmarking my code on Vesta and, while I have been seeing
> > excellent strong scaling, I am a little underwhelmed by the wall-clock
> > timings relative to my desktop (Intel(R) Xeon(R) CPU E5-1603 0 @ 2.80GHz).
> > I am using the newest version of bgclang++ on Vesta, and g++-4.7.2 on my
> > desktop (both used -O3), and I am seeing roughly a factor of four
> > difference in performance on the same number of cores.
> >
> > If I ignored the fact that I am using a vendor math library on BGQ and
> > reference implementations on my desktop, I would expect the BGQ timings
> > to be a factor of 1.75 slower due to clockspeed differences. Would anyone
> > have an explanation for the additional factor of more than 2x? My
> > algorithm spends most of its time in sin/cos/sqrt evaluations and dgemm
> > with two right-hand sides.
> >
> > Thanks,
> > Jack
> >
> > _______________________________________________
> > llvm-bgq-discuss mailing list
> > llvm-bgq-discuss at lists.alcf.anl.gov
> > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
> >
>
>
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>