[Llvm-bgq-discuss] Performance relative to Xeons

Jeff Hammond jhammond at alcf.anl.gov
Tue Mar 5 21:32:06 CST 2013


MARPN is about 100 LOC and it will take me about 3 minutes to hack
c2-on-1core for you but I'm not going to do it tonight.

Calling MPI_Comm_split with color = Kernel_ProcessorCoreID() in c32
mode and then using the resulting color=0 subcommunicator will give
you the same behavior, provided your app can be initialized on an
arbitrary subcomm.  If you're using MPI_COMM_WORLD directly, MPI
zealots everywhere (or maybe just Argonne) will wag their fingers at
you :-)
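
Something like this, as an untested sketch (the SPI header path and
the app_init hook are placeholders for however your code bootstraps):

    /* Split MPI_COMM_WORLD by core ID; in c32 mode each color collects
     * the two ranks that share a core. */
    #include <mpi.h>
    #include <spi/include/kernel/location.h>  /* Kernel_ProcessorCoreID() */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        int color = Kernel_ProcessorCoreID();
        MPI_Comm core_comm;
        MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &core_comm);

        if (color == 0) {
            /* Only the two ranks sharing core 0 run the app; everyone
             * else idles until MPI_Finalize.  app_init is a placeholder
             * for your initialization routine. */
            /* app_init(core_comm); */
        }

        MPI_Comm_free(&core_comm);
        MPI_Finalize();
        return 0;
    }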

MPI_Reduce_scatter_block sucks on BG, so you should continue to use
MPI_Allreduce+memcpy, as I suggested on BGP.  See the attached plot
showing allreduce > reduce, and know that, at least in some cases
(e.g. MPI_COMM_WORLD), reduce >> reduce_scatter.  Also, double >
integer, but I assume you are using doubles.
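
For reference, the replacement I mean is just this (untested sketch;
assumes doubles, MPI_SUM, and equal block sizes):

    /* Emulate MPI_Reduce_scatter_block: allreduce the whole vector on
     * every rank, then copy out this rank's block. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void reduce_scatter_block_via_allreduce(const double *sendbuf,
                                            double *recvbuf,
                                            int recvcount, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        double *full = malloc((size_t)recvcount * size * sizeof(double));
        MPI_Allreduce(sendbuf, full, recvcount * size,
                      MPI_DOUBLE, MPI_SUM, comm);

        memcpy(recvbuf, full + (size_t)rank * recvcount,
               (size_t)recvcount * sizeof(double));
        free(full);
    }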

I can send you my complete collective test results if you care.

Jeff

On Tue, Mar 5, 2013 at 9:08 PM, Jack Poulson <jack.poulson at gmail.com> wrote:
> Yikes. I will take that as "no, there is not any easy way to do that". I
> guess the fair thing to do in the mean time is to just start my strong
> scaling test from one node and go from there. I was seeing weird spikes in
> MPI_Reduce_scatter_block communication time when using c2-c8 anyway.
>
> Jack
>
>
> On Tue, Mar 5, 2013 at 6:56 PM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
>>
>> If you are running MPI-only, c32 is the bare minimum for saturating
>> the issue rate, and even that holds only in the ridiculous case where
>> you have a perfect 50-50 mix of ALU and FPU ops.  Most codes need 3
>> hardware threads per core to saturate the instruction rate, but since
>> c48 doesn't exist except in my universe
>> (https://wiki.alcf.anl.gov/parts/index.php/MARPN), many codes resort
>> to c64 solely because of instruction-issue-rate limits (and their
>> inability to thread).
>>
>> However, if your code runs faster with c32 than c16, it is not
>> bandwidth-limited, because BGQ hits the bandwidth limit with 1 thread
>> per core without any vector load/store (per Bob Walkup's talk at
>> MiraCon today, if nothing else).
>>
>> If you want to run 2 MPI ranks on the same core, I can hack MARPN to
>> give you this via c32 and a fake world, or you can implement it
>> yourself using the approach MARPN uses: query the hardware location
>> of each rank and MPI_Comm_split off a comm that has two ranks on the
>> same core.
>>
>> Best,
>>
>> Jeff
>>
>> On Tue, Mar 5, 2013 at 8:49 PM, Jack Poulson <jack.poulson at gmail.com>
>> wrote:
>> > So it turns out that running in c32 mode yields nearly a 2x speedup
>> > over c16 mode (one thread per process in both cases). Unfortunately,
>> > this results in another question.
>> >
>> > My previous strong scaling test ran the same problem on 1, 2, 4, 8,
>> > 16, ..., and 16384 processes, using c1 through c16 for the first
>> > tests and c16 for the rest.
>> >
>> > Since my code apparently benefits from using 2 MPI processes per
>> > core, I would like to run the equivalent tests. However, I'm not
>> > certain how to launch, for instance, two MPI processes on one node
>> > and have them both run on the same core. I could run on one node
>> > with c2 mode, but I think that this would be a bit dishonest, as I
>> > suspect that it would really make use of two cores.
>> >
>> > Any ideas how to do this?
>> >
>> > Jack
>> >
>> > On Tue, Mar 5, 2013 at 12:27 PM, Jack Poulson <jack.poulson at gmail.com>
>> > wrote:
>> >>
>> >> The code is almost certainly memory-bandwidth limited, and 25 vs.
>> >> 80 GB/s would almost explain the 4x difference in performance (the
>> >> >2x factor is *after* adjusting for the fact that BGQ's clock is
>> >> 1.75x slower than my 2.8 GHz desktop).
>> >>
>> >> Also, the desktop results were not using any vendor libraries at
>> >> all, just g++-4.7 with Ubuntu's stock math libraries.
>> >>
>> >> Jack
>> >>
>> >> On Tue, Mar 5, 2013 at 11:17 AM, Jeff Hammond <jhammond at alcf.anl.gov>
>> >> wrote:
>> >>>
>> >>> The BGQ core is fully in-order with a short instruction pipeline;
>> >>> it is single-issue per hardware thread and dual-issue per core,
>> >>> provided the ALU and FPU instructions come from two different
>> >>> hardware threads.
>> >>>
>> >>> A Xeon core is out-of-order with deep pipelines and can decode up
>> >>> to four instructions per cycle.  The Internet refuses to tell me
>> >>> for certain whether this means it is proper to call Sandy Bridge
>> >>> quad-issue, but it seems that way.
>> >>>
>> >>> The memory bandwidth measured by STREAM may be anywhere from 50%
>> >>> to 200% higher on an Intel Xeon than on BGQ.  BGQ does 25-30 GB/s,
>> >>> whereas a late-model Xeon can do 80 GB/s.  If your code is
>> >>> BW-limited, it isn't surprising for a Xeon to be ~2x faster.
>> >>>
>> >>> In addition to normalizing w.r.t. clock-rate, you should normalize
>> >>> w.r.t. watts per socket.  BGQ uses 60-70W per node unless you're
>> >>> running HPL.  An Intel Xeon uses twice that just for the socket, not
>> >>> including DRAM, IO, etc.
>> >>>
>> >>> Note also that the BGQ QPX vector ISA is much more restrictive than
>> >>> AVX w.r.t. alignment.  Additionally, the Intel compilers are way
>> >>> better than IBM XL at vectorizing.
>> >>>
>> >>> Finally, ESSL sucks compared to MKL.  That alone may be worth 2x in
>> >>> LAPACK-intensive applications.
>> >>>
>> >>> Jeff
>> >>>
>> >>> On Tue, Mar 5, 2013 at 12:59 PM, Jack Poulson <jack.poulson at gmail.com>
>> >>> wrote:
>> >>> > Hello,
>> >>> >
>> >>> > I have been benchmarking my code on Vesta and, while I have been
>> >>> > seeing excellent strong scaling, I am a little underwhelmed by
>> >>> > the wall-clock timings relative to my desktop (Intel(R) Xeon(R)
>> >>> > CPU E5-1603 0 @ 2.80GHz). I am using the newest version of
>> >>> > bgclang++ on Vesta and g++-4.7.2 on my desktop (both with -O3),
>> >>> > and I am seeing roughly a factor-of-four difference in
>> >>> > performance on the same number of cores.
>> >>> >
>> >>> > If I ignored the fact that I am using a vendor math library on
>> >>> > BGQ and reference implementations on my desktop, I would expect
>> >>> > the BGQ timings to be a factor of 1.75 slower due to clock-speed
>> >>> > differences. Would anyone have an explanation for the additional
>> >>> > factor of more than 2x? My algorithm spends most of its time in
>> >>> > sin/cos/sqrt evaluations and dgemm with two right-hand sides.
>> >>> >
>> >>> > Thanks,
>> >>> > Jack
>> >>> >
>> >>> > _______________________________________________
>> >>> > llvm-bgq-discuss mailing list
>> >>> > llvm-bgq-discuss at lists.alcf.anl.gov
>> >>> > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Jeff Hammond
>> >>> Argonne Leadership Computing Facility
>> >>> University of Chicago Computation Institute
>> >>> jhammond at alcf.anl.gov / (630) 252-5381
>> >>> http://www.linkedin.com/in/jeffhammond
>> >>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reduce_vs_allreduce_n49152_c16.png
Type: image/png
Size: 98519 bytes
Desc: not available
URL: <http://lists.alcf.anl.gov/pipermail/llvm-bgq-discuss/attachments/20130305/ee91fd87/attachment-0001.png>

