[Llvm-bgq-discuss] Performance relative to Xeons

Jack Poulson jack.poulson at gmail.com
Mon Mar 11 00:09:17 CDT 2013


FYI, on 1024 nodes in c64 mode MPI_Reduce_scatter_block was noticeably
faster in my application than MPI_Allreduce composed with memcpy (the
payloads were double-precision floating-point data). I should mention that
each MPI_Allreduce call took place over a different subcommunicator of four
processes, and that the whole algorithm only sent log p messages from each
of the p=65,536 processes.
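
For concreteness, the two variants look roughly like this (an untested
sketch; buffer names and the per-rank count n are placeholders, not my
actual code):

  #include <mpi.h>
  #include <string.h>

  /* Reduce-scatter n doubles to each rank of a 4-process subcomm.
     sendbuf holds 4*n doubles (n destined for each rank); recvbuf
     holds n. */
  void variant_reduce_scatter(double *sendbuf, double *recvbuf,
                              int n, MPI_Comm subcomm)
  {
      MPI_Reduce_scatter_block(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM,
                               subcomm);
  }

  /* Same result via an allreduce of the full 4*n payload followed by a
     memcpy of this rank's block; tmp holds 4*n doubles. */
  void variant_allreduce(double *sendbuf, double *recvbuf,
                         double *tmp, int n, MPI_Comm subcomm)
  {
      int rank;
      MPI_Comm_rank(subcomm, &rank);
      MPI_Allreduce(sendbuf, tmp, 4 * n, MPI_DOUBLE, MPI_SUM, subcomm);
      memcpy(recvbuf, tmp + (size_t)rank * n, (size_t)n * sizeof(double));
  }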

Jack

On Tue, Mar 5, 2013 at 7:32 PM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:

> MARPN is about 100 LOC and it will take me about 3 minutes to hack
> c2-on-1core for you but I'm not going to do it tonight.
>
> Calling MPI_Comm_split with color=Kernel_ProcessorCoreID() in c32 mode
> and using the resulting comm for color=0 will give you the same
> behavior, provided your app can be initialized with any subcomm.  If
> you're using MPI_COMM_WORLD directly, MPI zealots everywhere (or maybe
> just Argonne) will wag their finger at you :-)
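>
> A rough, untested sketch of what I mean (I believe the core-ID call is
> declared in spi/include/kernel/location.h, but double-check the path):
>
>   #include <mpi.h>
>   #include <spi/include/kernel/location.h>
>
>   /* In c32 mode the two ranks on a core share a core ID, so on a single
>      node each resulting comm is exactly that pair.  Take the comm for
>      color 0 and initialize the app with it instead of MPI_COMM_WORLD. */
>   MPI_Comm split_by_core(void)
>   {
>       int wrank;
>       MPI_Comm subcomm;
>       MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
>       MPI_Comm_split(MPI_COMM_WORLD, Kernel_ProcessorCoreID(), wrank,
>                      &subcomm);
>       return subcomm;
>   }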
>
> MPI_Reduce_scatter_block sucks on BG and you should continue to use
> MPI_Allreduce+memcpy as I suggested on BGP.  See attached showing
> allreduce > reduce and know that, at least in some cases (e.g.
> MPI_COMM_WORLD), reduce >> reduce_scatter.  Also double > integer, but
> I assume you are using doubles.
>
> I can send you my complete collective test results if you care.
>
> Jeff
>
> On Tue, Mar 5, 2013 at 9:08 PM, Jack Poulson <jack.poulson at gmail.com>
> wrote:
> > Yikes. I will take that as "no, there is not any easy way to do that". I
> > guess the fair thing to do in the mean time is to just start my strong
> > scaling test from one node and go from there. I was seeing weird spikes in
> > MPI_Reduce_scatter_block communication time when using c2-c8 anyway.
> >
> > Jack
> >
> >
> > On Tue, Mar 5, 2013 at 6:56 PM, Jeff Hammond <jhammond at alcf.anl.gov>
> > wrote:
> >>
> >> If you are running MPI-only, c32 is the bare minimum requirement for
> >> saturating issue rate and this is only true in the ridiculous case
> >> where you have a perfect 50-50 mix of ALU and FPU ops.  Most codes
> >> need 3 hw threads per core to saturate instruction rate but since c48
> >> doesn't exist except in my universe
> >> (https://wiki.alcf.anl.gov/parts/index.php/MARPN), many codes resort
> >> to c64 solely because of instruction issue rate issues (and their
> >> inability to thread).
> >>
> >> However, if your code runs faster with c32 than c16, it is not
> >> bandwidth-limited, because BGQ hits the bandwidth limit with 1 thread
> >> per core without any vector load/store (per Bob Walkup's talk at
> >> MiraCon today, if nothing else).
> >>
> >> If you want to run 2 MPI ranks on the same core, I can hack MARPN to
> >> give you this via c32 and a fake world or you can implement it
> >> yourself manually using the approach that MARPN uses, which is to
> >> query the hardware location of ranks and MPI_Comm_split off a comm
> >> that has two ranks on the same core.
> >>
> >> Best,
> >>
> >> Jeff
> >>
> >> On Tue, Mar 5, 2013 at 8:49 PM, Jack Poulson <jack.poulson at gmail.com>
> >> wrote:
> >> > So it turns out that running in c32 mode yields nearly a 2x speedup over
> >> > c16 mode (one thread per process in both cases). Unfortunately this
> >> > results in another question.
> >> >
> >> > My previous strong scaling test ran the same problem on 1, 2, 4, 8, 16,
> >> > ..., and 16384 processes, using c1 through c16 for the first tests and
> >> > c16 for the rest.
> >> >
> >> > Since my code apparently benefits from using 2 MPI processes per core, I
> >> > would like to run the equivalent tests. However, I'm not certain how to
> >> > launch, for instance, two MPI processes on one node and have them both
> >> > run on the same core. I could run on one node with c2 mode, but I think
> >> > that this would be a bit dishonest, as I suspect that it would really
> >> > make use of two cores.
> >> >
> >> > Any ideas how to do this?
> >> >
> >> > Jack
> >> >
> >> > On Tue, Mar 5, 2013 at 12:27 PM, Jack Poulson
> >> > <jack.poulson at gmail.com> wrote:
> >> >>
> >> >> The code is almost certainly memory bandwidth limited, and 25 vs. 80
> >> >> GB/s would almost explain the 4x difference in performance (the >2x
> >> >> factor is *after* adjusting for the fact that BGQ's clock is 1.75x
> >> >> slower than my 2.8 GHz desktop).
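> >> >>
> >> >> (Spelling out the arithmetic: 2.8 GHz / 1.75 = 1.6 GHz for the BGQ
> >> >> clock, so of the ~4x overall gap roughly 4 / 1.75 ~ 2.3x remains
> >> >> after the clock adjustment, and the 80/25 ~ 3.2x bandwidth ratio
> >> >> covers that comfortably.)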
> >> >>
> >> >> Also, the desktop results were not using any vendor libraries at all.
> >> >> Just g++-4.7 with Ubuntu's stock math libraries.
> >> >>
> >> >> Jack
> >> >>
> >> >> On Tue, Mar 5, 2013 at 11:17 AM, Jeff Hammond
> >> >> <jhammond at alcf.anl.gov> wrote:
> >> >>>
> >> >>> The BGQ core is fully in-order with a short instruction pipeline. It
> >> >>> is single-issue per hardware thread and dual-issue per core, provided
> >> >>> the ALU and FPU instructions come from two different hardware threads.
> >> >>>
> >> >>> A Xeon core is out-of-order with deep pipelines and can decode up to
> >> >>> four instructions per cycle.  The Internet refuses to tell me for
> >> >>> certain if this means that it is proper to say a Sandy Bridge is
> >> >>> quad-issue, but it seems that way.
> >> >>>
> >> >>> The memory bandwidth measured by STREAM may be anywhere from 50% to
> >> >>> 200% higher on an Intel Xeon than on BGQ.  BGQ does 25-30 GB/s whereas
> >> >>> a late-model Xeon can do 80 GB/s.  If your code is BW-limited, it
> >> >>> isn't surprising if a Xeon is ~2x faster.
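> >> >>>
> >> >>> For reference, one of the kernels STREAM times is the triad, roughly
> >> >>> like this (sketch only; array names and n are placeholders):
> >> >>>
> >> >>>   #include <stddef.h>
> >> >>>
> >> >>>   /* STREAM-style triad: streams three arrays through memory, so the
> >> >>>      sustained rate is set by DRAM bandwidth rather than the FPU. */
> >> >>>   void triad(double *a, const double *b, const double *c,
> >> >>>              double scalar, size_t n)
> >> >>>   {
> >> >>>       for (size_t i = 0; i < n; ++i)
> >> >>>           a[i] = b[i] + scalar * c[i];
> >> >>>   }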
> >> >>>
> >> >>> In addition to normalizing w.r.t. clock-rate, you should normalize
> >> >>> w.r.t. watts per socket.  BGQ uses 60-70W per node unless you're
> >> >>> running HPL.  An Intel Xeon uses twice that just for the socket, not
> >> >>> including DRAM, IO, etc.
> >> >>>
> >> >>> Note also that the BGQ QPX vector ISA is much more restrictive than
> >> >>> AVX w.r.t. alignment.  Additionally, the Intel compilers are way
> >> >>> better than IBM XL at vectorizing.
> >> >>>
> >> >>> Finally, ESSL sucks compared to MKL.  That alone may be worth 2x in
> >> >>> LAPACK-intensive applications.
> >> >>>
> >> >>> Jeff
> >> >>>
> >> >>> On Tue, Mar 5, 2013 at 12:59 PM, Jack Poulson
> >> >>> <jack.poulson at gmail.com> wrote:
> >> >>> > Hello,
> >> >>> >
> >> >>> > I have been benchmarking my code on Vesta and, while I have been
> >> >>> > seeing excellent strong scaling, I am a little underwhelmed by the
> >> >>> > wall-clock timings relative to my desktop (Intel(R) Xeon(R) CPU
> >> >>> > E5-1603 0 @ 2.80GHz). I am using the newest version of bgclang++ on
> >> >>> > Vesta, and g++-4.7.2 on my desktop (both used -O3), and I am seeing
> >> >>> > roughly a factor of four difference in performance on the same
> >> >>> > number of cores.
> >> >>> >
> >> >>> > If I ignore the fact that I am using a vendor math library on BGQ
> >> >>> > and reference implementations on my desktop, I would expect the BGQ
> >> >>> > timings to be a factor of 1.75 slower due to clockspeed differences.
> >> >>> > Would anyone have an explanation for the additional factor of more
> >> >>> > than 2x? My algorithm spends most of its time in sin/cos/sqrt
> >> >>> > evaluations and dgemm with two right-hand sides.
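> >> >>> >
> >> >>> > For context, the dgemm calls are essentially a matrix applied to a
> >> >>> > two-column block, roughly like this (a CBLAS sketch, not my actual
> >> >>> > interface; m, k, and the storage order are illustrative):
> >> >>> >
> >> >>> >   #include <cblas.h>
> >> >>> >
> >> >>> >   /* C (m x 2) = A (m x k) * B (k x 2): dgemm with two right-hand
> >> >>> >      sides, column-major storage. */
> >> >>> >   void apply(const double *A, const double *B, double *C,
> >> >>> >              int m, int k)
> >> >>> >   {
> >> >>> >       cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
> >> >>> >                   m, 2, k, 1.0, A, m, B, k, 0.0, C, m);
> >> >>> >   }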
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Jack
> >> >>> >
> >> >>> > _______________________________________________
> >> >>> > llvm-bgq-discuss mailing list
> >> >>> > llvm-bgq-discuss at lists.alcf.anl.gov
> >> >>> > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
> >> >>> >
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Jeff Hammond
> >> >>> Argonne Leadership Computing Facility
> >> >>> University of Chicago Computation Institute
> >> >>> jhammond at alcf.anl.gov / (630) 252-5381
> >> >>> http://www.linkedin.com/in/jeffhammond
> >> >>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Jeff Hammond
> >> Argonne Leadership Computing Facility
> >> University of Chicago Computation Institute
> >> jhammond at alcf.anl.gov / (630) 252-5381
> >> http://www.linkedin.com/in/jeffhammond
> >> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> >
> >
>
>
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>