[Llvm-bgq-discuss] Performance relative to Xeons

Jeff Hammond jhammond at alcf.anl.gov
Mon Mar 11 06:32:56 CDT 2013


subcomms tend to use MPICH collectives rather than PAMI ones, so
conventional wisdom does apply w.r.t. the performance of collectives.

jeff

On Mon, Mar 11, 2013 at 12:09 AM, Jack Poulson <jack.poulson at gmail.com> wrote:
> FYI, on 1024 nodes in c64 mode MPI_Reduce_scatter_block was noticeably
> faster in my application than MPI_Allreduce composed with memcpy (the
> payloads were double-precision floating-point data). I should mention that
> each MPI_Allreduce call took place over a different subcommunicator of four
> processes, and that the whole algorithm only sent log p messages from each
> of the p=65,536 processes.
>
> Jack
>
>
> On Tue, Mar 5, 2013 at 7:32 PM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
>>
>> MARPN is about 100 LOC, and it will take me about 3 minutes to hack
>> c2-on-1core for you, but I'm not going to do it tonight.
>>
>> Calling MPI_Comm_split with color=Kernel_ProcessorCoreID() in c32
>> mode and using the resulting comm on the ranks with color=0 will give
>> you the same behavior if your app can be initialized with any
>> subcomm.  If you're using MPI_COMM_WORLD directly, MPI zealots
>> everywhere (or maybe just Argonne) will wag their finger at you :-)
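>>
>> A minimal sketch of that split; the SPI header path and the
>> app_init() placeholder are illustrative assumptions, not something
>> from this thread:
>>
>> #include <mpi.h>
>> /* Assumed BG/Q SPI header providing Kernel_ProcessorCoreID(). */
>> #include <spi/include/kernel/location.h>
>>
>> int main(int argc, char **argv)
>> {
>>     MPI_Init(&argc, &argv);
>>
>>     int wrank;
>>     MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
>>
>>     /* Ranks sharing a core get the same color, so in c32 mode each
>>        resulting subcomm holds the 2 ranks living on one core. */
>>     int color = Kernel_ProcessorCoreID();
>>
>>     MPI_Comm core_comm;
>>     MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &core_comm);
>>
>>     if (color == 0) {
>>         /* Initialize the app on core_comm instead of MPI_COMM_WORLD,
>>            e.g. app_init(core_comm);  (hypothetical entry point) */
>>     }
>>
>>     MPI_Comm_free(&core_comm);
>>     MPI_Finalize();
>>     return 0;
>> }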
>>
>> MPI_Reduce_scatter_block sucks on BG and you should continue to use
>> MPI_Allreduce+memcpy as I suggested on BGP.  See the attached results
>> showing allreduce > reduce, and know that, at least in some cases
>> (e.g. MPI_COMM_WORLD), reduce >> reduce_scatter.  Also double >
>> integer, but I assume you are using doubles.
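>>
>> The allreduce+memcpy pattern is roughly the following (a sketch only;
>> the function name and the MPI_SUM/double choices are illustrative):
>>
>> #include <mpi.h>
>> #include <string.h>
>>
>> /* sendbuf holds size*blocksize doubles, recvbuf holds blocksize
>>    doubles, workbuf is scratch of size*blocksize doubles. */
>> static void reduce_scatter_block_via_allreduce(const double *sendbuf,
>>                                                double *recvbuf,
>>                                                int blocksize,
>>                                                double *workbuf,
>>                                                MPI_Comm comm)
>> {
>>     int rank, size;
>>     MPI_Comm_rank(comm, &rank);
>>     MPI_Comm_size(comm, &size);
>>
>>     /* Reduce the whole concatenated buffer on every rank ... */
>>     MPI_Allreduce(sendbuf, workbuf, size * blocksize, MPI_DOUBLE,
>>                   MPI_SUM, comm);
>>
>>     /* ... then keep only this rank's block of the result. */
>>     memcpy(recvbuf, workbuf + (size_t)rank * blocksize,
>>            (size_t)blocksize * sizeof(double));
>> }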
>>
>> I can send you my complete collective test results if you care.
>>
>> Jeff
>>
>> On Tue, Mar 5, 2013 at 9:08 PM, Jack Poulson <jack.poulson at gmail.com>
>> wrote:
>> > Yikes. I will take that as "no, there is not any easy way to do that".
>> > I guess the fair thing to do in the meantime is to just start my strong
>> > scaling test from one node and go from there. I was seeing weird spikes
>> > in MPI_Reduce_scatter_block communication time when using c2-c8 anyway.
>> >
>> > Jack
>> >
>> >
>> > On Tue, Mar 5, 2013 at 6:56 PM, Jeff Hammond <jhammond at alcf.anl.gov>
>> > wrote:
>> >>
>> >> If you are running MPI-only, c32 is the bare minimum requirement for
>> >> saturating issue rate, and even that is only true in the ridiculous
>> >> case where you have a perfect 50-50 mix of ALU and FPU ops.  Most
>> >> codes need 3 hw threads per core to saturate instruction rate, but
>> >> since c48 doesn't exist except in my universe
>> >> (https://wiki.alcf.anl.gov/parts/index.php/MARPN), many codes resort
>> >> to c64 solely because of instruction issue rate issues (and their
>> >> inability to thread).
>> >>
>> >> However, if your code runs faster with c32 than c16, it is not
>> >> bandwidth-limited, because BGQ hits the bandwidth limit with 1 thread
>> >> per core without any vector load/store (per Bob Walkup's talk at
>> >> MiraCon today, if nothing else).
>> >>
>> >> If you want to run 2 MPI ranks on the same core, I can hack MARPN to
>> >> give you this via c32 and a fake world or you can implement it
>> >> yourself manually using the approach that MARPN uses, which is to
>> >> query the hardware location of ranks and MPI_Comm_split off a comm
>> >> that has two ranks on the same core.
>> >>
>> >> Best,
>> >>
>> >> Jeff
>> >>
>> >> On Tue, Mar 5, 2013 at 8:49 PM, Jack Poulson <jack.poulson at gmail.com>
>> >> wrote:
>> >> > So it turns out that running in c32 mode yields nearly a 2x speedup
>> >> > over c16 mode (one thread per process in both cases). Unfortunately,
>> >> > this raises another question.
>> >> >
>> >> > My previous strong scaling test ran the same problem on 1, 2, 4, 8,
>> >> > 16, ..., and 16384 processes, using c1 through c16 for the first
>> >> > tests and c16 for the rest.
>> >> >
>> >> > Since my code apparently benefits from using 2 MPI processes per
>> >> > core, I would like to run the equivalent tests. However, I'm not
>> >> > certain how to launch, for instance, two MPI processes on one node
>> >> > and have them both run on the same core. I could run on one node
>> >> > with c2 mode, but I think that this would be a bit dishonest, as I
>> >> > suspect that it would really make use of two cores.
>> >> >
>> >> > Any ideas how to do this?
>> >> >
>> >> > Jack
>> >> >
>> >> > On Tue, Mar 5, 2013 at 12:27 PM, Jack Poulson
>> >> > <jack.poulson at gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> The code is almost certainly memory bandwidth limited, and 25 vs.
>> >> >> 80 GB/s would almost explain the 4x difference in performance
>> >> >> (the >2x factor is *after* adjusting for the fact that BGQ's clock
>> >> >> is 1.75x slower than my 2.8 GHz desktop).
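>> >> >>
>> >> >> (Spelling out the arithmetic: 4x observed / 1.75x clock leaves
>> >> >> roughly a 2.3x gap, and an 80/25 = 3.2x STREAM-bandwidth ratio is
>> >> >> in the right range to cover it.)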
>> >> >>
>> >> >> Also, the desktop results were not using any vendor libraries at
>> >> >> all. Just g++-4.7 with Ubuntu's stock math libraries.
>> >> >>
>> >> >> Jack
>> >> >>
>> >> >> On Tue, Mar 5, 2013 at 11:17 AM, Jeff Hammond
>> >> >> <jhammond at alcf.anl.gov>
>> >> >> wrote:
>> >> >>>
>> >> >>> The BGQ core is fully in-order with a short instruction pipeline;
>> >> >>> it is single-issue per hardware thread and dual-issue per core,
>> >> >>> provided the ALU and FPU instructions come from two different
>> >> >>> hardware threads.
>> >> >>>
>> >> >>> A Xeon core is out-of-order with deep pipelines and can decode up
>> >> >>> to four instructions per cycle.  The Internet refuses to tell me
>> >> >>> for certain whether this means it is proper to say a Sandy Bridge
>> >> >>> is quad-issue, but it seems that way.
>> >> >>>
>> >> >>> The memory bandwidth measured by STREAM may be anywhere from 50%
>> >> >>> to 200% higher on an Intel Xeon than on BGQ.  BGQ does 25-30 GB/s,
>> >> >>> whereas a late-model Xeon can do 80 GB/s.  If your code is
>> >> >>> BW-limited, it isn't surprising if a Xeon is ~2x faster.
>> >> >>>
>> >> >>> In addition to normalizing w.r.t. clock-rate, you should normalize
>> >> >>> w.r.t. watts per socket.  BGQ uses 60-70W per node unless you're
>> >> >>> running HPL.  An Intel Xeon uses twice that just for the socket,
>> >> >>> not including DRAM, IO, etc.
>> >> >>>
>> >> >>> Note also that the BGQ QPX vector ISA is much more restrictive than
>> >> >>> AVX w.r.t. alignment.  Additionally, the Intel compilers are way
>> >> >>> better than IBM XL at vectorizing.
>> >> >>>
>> >> >>> Finally, ESSL sucks compared to MKL.  That alone may be worth 2x in
>> >> >>> LAPACK-intensive applications.
>> >> >>>
>> >> >>> Jeff
>> >> >>>
>> >> >>> On Tue, Mar 5, 2013 at 12:59 PM, Jack Poulson
>> >> >>> <jack.poulson at gmail.com>
>> >> >>> wrote:
>> >> >>> > Hello,
>> >> >>> >
>> >> >>> > I have been benchmarking my code on Vesta and, while I have been
>> >> >>> > seeing excellent strong scaling, I am a little underwhelmed by
>> >> >>> > the wall-clock timings relative to my desktop (Intel(R) Xeon(R)
>> >> >>> > CPU E5-1603 0 @ 2.80GHz). I am using the newest version of
>> >> >>> > bgclang++ on Vesta, and g++-4.7.2 on my desktop (both used -O3),
>> >> >>> > and I am seeing roughly a factor of four difference in
>> >> >>> > performance on the same number of cores.
>> >> >>> >
>> >> >>> > If I ignored the fact that I am using a vendor math library on
>> >> >>> > BGQ and reference implementations on my desktop, I would expect
>> >> >>> > the BGQ timings to be a factor of 1.75 slower due to clock-speed
>> >> >>> > differences. Would anyone have an explanation for the additional
>> >> >>> > factor of more than 2x? My algorithm spends most of its time in
>> >> >>> > sin/cos/sqrt evaluations and dgemm with two right-hand sides.
>> >> >>> >
>> >> >>> > Thanks,
>> >> >>> > Jack
>> >> >>> >
>> >> >>> > _______________________________________________
>> >> >>> > llvm-bgq-discuss mailing list
>> >> >>> > llvm-bgq-discuss at lists.alcf.anl.gov
>> >> >>> > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
>> >> >>> >
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> Jeff Hammond
>> >> >>> Argonne Leadership Computing Facility
>> >> >>> University of Chicago Computation Institute
>> >> >>> jhammond at alcf.anl.gov / (630) 252-5381
>> >> >>> http://www.linkedin.com/in/jeffhammond
>> >> >>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jeff Hammond
>> >> Argonne Leadership Computing Facility
>> >> University of Chicago Computation Institute
>> >> jhammond at alcf.anl.gov / (630) 252-5381
>> >> http://www.linkedin.com/in/jeffhammond
>> >> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>> >
>> >
>>
>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond

