[Llvm-bgq-discuss] Performance relative to Xeons

Jack Poulson jack.poulson at gmail.com
Tue Mar 5 20:49:17 CST 2013


So it turns out that running in c32 mode yields nearly a 2x speedup over
c16 mode (one thread per process in both cases). Unfortunately, this
result raises another question.

My previous strong scaling test ran the same problem on 1, 2, 4, 8, 16,
..., and 16384 processes, using c1 through c16 for the first tests and c16
for the rest.

Since my code apparently benefits from using 2 MPI processes per core, I
would like to run the equivalent tests. However, I'm not certain how to
launch, for instance, two MPI processes on one node and have them both run
on the same core. I could run on one node in c2 mode, but I think that
would be a bit dishonest, as I suspect it would really make use of two
separate cores.

Any ideas how to do this?
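
For concreteness, here is the kind of thing I have been imagining: an
explicit mapfile passed to runjob via --mapping, with one "A B C D E T"
line per rank. The big assumption (which I have not been able to verify)
is that consecutive T values land on the same core rather than being
spread across cores first:

    # my.map -- hypothetical mapfile trying to pin ranks 0 and 1 to one core;
    # columns are the five torus coordinates plus T (the processor on the node)
    0 0 0 0 0 0
    0 0 0 0 0 1

    runjob --np 2 --ranks-per-node 2 --mapping my.map : ./mybench

If T is instead interpreted so that ranks get distributed across cores
first, this would silently give me exactly the two-core behavior I am
trying to avoid.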

Jack

On Tue, Mar 5, 2013 at 12:27 PM, Jack Poulson <jack.poulson at gmail.com> wrote:

> The code is almost certainly memory-bandwidth limited, and 25 vs. 80 GB/s
> (a 3.2x ratio) would almost explain the 4x difference in performance on
> its own; the remaining >2x factor (4/1.75 ~ 2.3) is what is left *after*
> adjusting for the fact that BGQ's 1.6 GHz clock is 1.75x slower than my
> 2.8 GHz desktop.
>
> Also, the desktop results were not using any vendor libraries at all. Just
> g++-4.7 with Ubuntu's stock math libraries.
>
> Jack
>
> On Tue, Mar 5, 2013 at 11:17 AM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
>
>> The BGQ core is fully in-order with a short instruction pipeline; it is
>> single-issue per hardware thread and dual-issue per core, provided the
>> ALU and FPU instructions come from two different hardware threads.
>>
>> A Xeon core is out-of-order with deep pipelines and can decode up to
>> four instructions per cycle.  The Internet refuses to tell me for
>> certain whether this means it is proper to call a Sandy Bridge
>> quad-issue, but it seems that way.
>>
>> The memory bandwidth measured by STREAM may be anywhere from 50% to 200%
>> higher on an Intel Xeon than on BGQ.  BGQ does 25-30 GB/s, whereas a
>> late-model Xeon can do 80 GB/s.  If your code is bandwidth-limited, it
>> isn't surprising for a Xeon to be ~2x faster.
>>
>> In addition to normalizing w.r.t. clock-rate, you should normalize
>> w.r.t. watts per socket.  BGQ uses 60-70W per node unless you're
>> running HPL.  An Intel Xeon uses twice that just for the socket, not
>> including DRAM, IO, etc.
>>
>> Note also that the BGQ QPX vector ISA is much more restrictive than
>> AVX w.r.t. alignment.  Additionally, the Intel compilers are way
>> better than IBM XL at vectorizing.
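>>
>> Concretely, that means handing QPX 32-byte-aligned buffers (a 4-wide
>> double vector is 32 bytes), whereas AVX will tolerate unaligned
>> accesses at some cost.  A sketch, assuming a POSIX allocator:
>>
>>     #include <cstdlib>
>>
>>     // Allocate n doubles on a 32-byte boundary so that QPX quad
>>     // loads/stores behave as intended; returns nullptr on failure.
>>     double* qpx_alloc(std::size_t n) {
>>         void* p = nullptr;
>>         if (posix_memalign(&p, 32, n * sizeof(double)) != 0)
>>             return nullptr;
>>         return static_cast<double*>(p);
>>     }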
>>
>> Finally, ESSL sucks compared to MKL.  That alone may be worth 2x in
>> LAPACK-intensive applications.
>>
>> Jeff
>>
>> On Tue, Mar 5, 2013 at 12:59 PM, Jack Poulson <jack.poulson at gmail.com>
>> wrote:
>> > Hello,
>> >
>> > I have been benchmarking my code on Vesta and, while I have been seeing
>> excellent
>> > strong scaling, I am a little underwhelmed by the wall-clock timings
>> > relative to my desktop (Intel(R) Xeon(R) CPU E5-1603 0 @ 2.80GHz). I am
>> > using the newest version of bgclang++ on Vesta, and g++-4.7.2 on my
>> desktop
>> > (both used -O3), and I am seeing roughly a factor of four difference in
>> > performance on the same number of cores.
>> >
>> > If I ignored the fact that I am using a vendor math library on BGQ and
>> > reference implementations on my desktop, I would expect the BGQ timings
>> to
>> > be a factor of 1.75 slower due to clockspeed differences. Would anyone
>> have
>> > an explanation for the additional factor of more than 2x? My algorithm
>> > spends most of its time in sin/cos/sqrt evaluations and dgemm with two
>> > right-hand sides.
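>> >
>> > For concreteness, the hot call has this shape (illustrative names and
>> > a column-major C interface, not my actual code):
>> >
>> >     // C (m x 2) = A (m x k) * B (k x 2): a dgemm whose n dimension is
>> >     // 2, so it behaves more like a pair of matrix-vector products
>> >     // than a square GEMM
>> >     cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
>> >                 m, 2, k, 1.0, A, m, B, k, 0.0, C, m);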
>> >
>> > Thanks,
>> > Jack
>> >
>>
>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>
>
>