[Llvm-bgq-discuss] Performance relative to Xeons

Jeff Hammond jhammond at alcf.anl.gov
Tue Mar 5 20:56:17 CST 2013


If you are running MPI-only, c32 is the bare minimum for saturating
the issue rate, and even that only holds in the ridiculous case where
you have a perfect 50-50 mix of ALU and FPU ops.  Most codes need 3
hardware threads per core to saturate the instruction rate, but since
c48 doesn't exist except in my universe
(https://wiki.alcf.anl.gov/parts/index.php/MARPN), many codes resort
to c64 solely because of instruction issue limits (and their
inability to thread).

However, if your code runs faster in c32 than in c16, it is not
bandwidth-limited: BGQ hits the bandwidth limit with one thread per
core even without any vector loads/stores (per Bob Walkup's talk at
MiraCon today, if nothing else).
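
To make that check concrete, here is a minimal STREAM-triad-style
kernel (my own sketch, not Bob's code; it assumes plain C99 with MPI
and nothing BGQ-specific).  Run it on a single node in c16 and again
in c32: if the aggregate GB/s barely moves, the node is already at
its bandwidth limit with one rank per core.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)   /* ~4M doubles per array per rank; shrink if memory is tight */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (int i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    const int reps = 20;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; ++r) {
        double scalar = 3.0 + r;
        for (int i = 0; i < N; ++i)
            a[i] = b[i] + scalar * c[i];   /* triad: 24 bytes of traffic per element */
        double *tmp = a; a = b; b = tmp;   /* rotate arrays so the reps can't be collapsed */
    }
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    /* 3 arrays x 8 bytes x N elements x reps, summed over all ranks in the job */
    double gbytes = 3.0 * 8.0 * (double)N * reps * size / 1e9;
    if (rank == 0)
        printf("aggregate triad bandwidth: %.1f GB/s over %d ranks (a[0]=%g)\n",
               gbytes / (t1 - t0), size, a[0]);

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}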

If you want to run 2 MPI ranks on the same core, I can hack MARPN to
give you this via c32 and a fake world, or you can implement it
yourself using the approach MARPN takes: query the hardware location
of each rank and MPI_Comm_split off a communicator that has two ranks
on the same core.
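
Here is a minimal sketch of the manual route.  It assumes an MPI-3
MPI_Comm_split_type for the per-node split and the BG/Q SPI call
Kernel_ProcessorCoreID() from spi/include/kernel/location.h for the
core ID; it is the shape of the idea, not MARPN itself.

#include <mpi.h>
#include <stdio.h>
#include <spi/include/kernel/location.h>   /* Kernel_ProcessorCoreID() (assumed BG/Q SPI) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Ranks that share a node go into one communicator... */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);

    /* ...and within the node, ranks pinned to the same physical core split
     * off together.  In c32 each of these communicators holds two ranks. */
    int core = Kernel_ProcessorCoreID();
    MPI_Comm core_comm;
    MPI_Comm_split(node_comm, core, world_rank, &core_comm);

    int core_rank, core_size;
    MPI_Comm_rank(core_comm, &core_rank);
    MPI_Comm_size(core_comm, &core_size);
    printf("world rank %d is rank %d of %d on core %d\n",
           world_rank, core_rank, core_size, core);

    /* Run the two-ranks-per-core experiment inside core_comm, or use one
     * such pair per core to build the "fake world" mentioned above. */
    MPI_Comm_free(&core_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}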

Best,

Jeff

On Tue, Mar 5, 2013 at 8:49 PM, Jack Poulson <jack.poulson at gmail.com> wrote:
> So it turns out that running in c32 mode yields nearly a 2x speedup over c16
> mode (one thread per process in both cases). Unfortunately this raises
> another question.
>
> My previous strong scaling test ran the same problem on 1, 2, 4, 8, 16, ...,
> and 16384 processes, using c1 through c16 for the first tests and c16 for
> the rest.
>
> Since my code apparently benefits from using 2 MPI processes per core, I
> would like to run the equivalent tests. However, I'm not certain how to
> launch, for instance, two MPI processes on one node and have them both run
> on the same core. I could run on one node with c2 mode, but I think that
> this would be a bit dishonest, as I suspect that it would really make use of
> two cores.
>
> Any ideas how to do this?
>
> Jack
>
> On Tue, Mar 5, 2013 at 12:27 PM, Jack Poulson <jack.poulson at gmail.com>
> wrote:
>>
>> The code is almost certainly memory bandwidth limited, and 25 vs. 80 GB/s
>> would almost explain the 4x difference in performance (the >2x factor is
>> *after* adjusting for the fact that BGQ's clock is 1.75x slower than my 2.8
>> GHz desktop).
>>
>> Also, the desktop results were not using any vendor libraries at all. Just
>> g++-4.7 with Ubuntu's stock math libraries.
>>
>> Jack
>>
>> On Tue, Mar 5, 2013 at 11:17 AM, Jeff Hammond <jhammond at alcf.anl.gov>
>> wrote:
>>>
>>> The BGQ core is fully in-order with a short instruction pipeline; it is
>>> single-issue per hardware thread and dual-issue per core, provided the
>>> ALU and FPU instructions come from two different hardware threads.
>>>
>>> A Xeon core is out-of-order with deep pipelines and can decode up to
>>> four instructions per cycle.  The Internet refuses to tell me for
>>> certain if this means that it is proper to say a Sandy Bridge is
>>> quad-issue, but it seems that way.
>>>
>>> The memory bandwidth measured by STREAM may be anywhere from 50% to 200%
>>> higher on an Intel Xeon than on BGQ.  BGQ does 25-30 GB/s whereas a
>>> late-model Xeon can do 80 GB/s.  If your code is BW-limited, it isn't
>>> surprising if a Xeon is ~2x faster.
>>>
>>> In addition to normalizing w.r.t. clock-rate, you should normalize
>>> w.r.t. watts per socket.  BGQ uses 60-70W per node unless you're
>>> running HPL.  An Intel Xeon uses twice that just for the socket, not
>>> including DRAM, IO, etc.
>>>
>>> Note also that the BGQ QPX vector ISA is much more restrictive than
>>> AVX w.r.t. alignment.  Additionally, the Intel compilers are way
>>> better than IBM XL at vectorizing.
>>>
>>> Finally, ESSL sucks compared to MKL.  That alone may be worth 2x in
>>> LAPACK-intensive applications.
>>>
>>> Jeff
>>>
>>> On Tue, Mar 5, 2013 at 12:59 PM, Jack Poulson <jack.poulson at gmail.com>
>>> wrote:
>>> > Hello,
>>> >
>>> > I have been benchmarking my code on Vesta and, while I have been seeing
>>> > excellent strong scaling, I am a little underwhelmed by the wall-clock timings
>>> > relative to my desktop (Intel(R) Xeon(R) CPU E5-1603 0 @ 2.80GHz). I am
>>> > using the newest version of bgclang++ on Vesta, and g++-4.7.2 on my desktop
>>> > (both used -O3), and I am seeing roughly a factor of four difference in
>>> > performance on the same number of cores.
>>> >
>>> > If I ignored the fact that I am using a vendor math library on BGQ and
>>> > reference implementations on my desktop, I would expect the BGQ timings to
>>> > be a factor of 1.75 slower due to clock-speed differences. Would anyone have
>>> > an explanation for the additional factor of more than 2x? My algorithm
>>> > spends most of its time in sin/cos/sqrt evaluations and dgemm with two
>>> > right-hand sides.
>>> >
>>> > Thanks,
>>> > Jack
>>> >
>>>
>>
>>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond

