[Llvm-bgq-discuss] Performance relative to Xeons

Jack Poulson jack.poulson at gmail.com
Tue Mar 5 21:08:58 CST 2013


Yikes. I will take that as "no, there is not any easy way to do that". I
guess the fair thing to do in the meantime is to just start my strong
scaling test from one node and go from there. I was seeing weird spikes in
MPI_Reduce_scatter_block communication time when using c2-c8 anyway.

Jack

On Tue, Mar 5, 2013 at 6:56 PM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:

> If you are running MPI-only, c32 is the bare minimum requirement for
> saturating the issue rate, and even that is only true in the ridiculous
> case where you have a perfect 50-50 mix of ALU and FPU ops.  Most codes
> need 3 hw threads per core to saturate the instruction rate, but since
> c48 doesn't exist except in my universe
> (https://wiki.alcf.anl.gov/parts/index.php/MARPN), many codes resort
> to c64 solely because of instruction issue rate issues (and their
> inability to thread).
>
> However, if your code runs faster with c32 than c16, it is not
> bandwidth-limited, because BGQ hits the bandwidth limit with 1 thread
> per core without any vector load/store (per Bob Walkup's talk at
> MiraCon today, if nothing else).
>
> If you want to run 2 MPI ranks on the same core, I can hack MARPN to
> give you this via c32 and a fake world, or you can implement it yourself
> using the approach that MARPN uses, which is to query the hardware
> location of each rank and MPI_Comm_split off a comm that has two ranks
> on the same core.
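
For concreteness, here is a minimal, untested sketch of the split-by-core
idea described above. The locality query is a stand-in (it just fakes a
round-robin placement), not the actual SPI/MARPN call, and the
32-ranks-per-node and 16-cores-per-node figures are assumptions:

    #include <mpi.h>
    #include <stdio.h>

    /* Placeholder locality query, for illustration only: pretend ranks
       are placed round-robin over 16 cores per node.  On BGQ the real
       core id would come from the kernel/SPI locality interface. */
    static int core_of(int node_rank) { return node_rank % 16; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Assume c32, i.e. 32 ranks per node. */
        const int ranks_per_node = 32;
        int node      = world_rank / ranks_per_node;
        int node_rank = world_rank % ranks_per_node;

        /* Color = (node, core): each communicator produced by the split
           contains exactly the ranks that share one physical core. */
        int color = node * 16 + core_of(node_rank);
        MPI_Comm core_comm;
        MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &core_comm);

        int core_size;
        MPI_Comm_size(core_comm, &core_size);
        if (world_rank == 0)
            printf("ranks sharing rank 0's core: %d\n", core_size);

        /* ... run the benchmark over core_comm (or a comm derived from
           it) instead of MPI_COMM_WORLD ... */

        MPI_Comm_free(&core_comm);
        MPI_Finalize();
        return 0;
    }
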
>
> Best,
>
> Jeff
>
> On Tue, Mar 5, 2013 at 8:49 PM, Jack Poulson <jack.poulson at gmail.com>
> wrote:
> > So it turns out that running in c32 mode yields nearly a 2x speedup over
> > c16 mode (one thread per process in both cases). Unfortunately this
> > results in another question.
> >
> > My previous strong scaling test ran the same problem on 1, 2, 4, 8, 16,
> > ..., and 16384 processes, using c1 through c16 for the first tests and
> > c16 for the rest.
> >
> > Since my code apparently benefits from using 2 MPI processes per core, I
> > would like to run the equivalent tests. However, I'm not certain how to
> > launch, for instance, two MPI processes on one node and have them both
> > run on the same core. I could run on one node with c2 mode, but I think
> > that this would be a bit dishonest, as I suspect that it would really
> > make use of two cores.
> >
> > Any ideas how to do this?
> >
> > Jack
> >
> > On Tue, Mar 5, 2013 at 12:27 PM, Jack Poulson <jack.poulson at gmail.com>
> > wrote:
> >>
> >> The code is almost certainly memory bandwidth limited, and 25 vs. 80
> >> GB/s would almost explain the 4x difference in performance (the >2x
> >> factor is *after* adjusting for the fact that BGQ's clock is 1.75x
> >> slower than my 2.8 GHz desktop).
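
As a rough sanity check on that claim, using the 25 and 80 GB/s STREAM
figures quoted above:

    \[
    \frac{4\times\ \text{(observed)}}{1.75\times\ \text{(clock)}}
        \approx 2.3\times\ \text{(residual)},
    \qquad
    \frac{80\ \text{GB/s}}{25\ \text{GB/s}}
        = 3.2\times\ \text{(STREAM ratio)},
    \]

so the bandwidth gap alone more than covers the ~2.3x left over after the
clock adjustment, which is consistent with a bandwidth-bound code.
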
> >>
> >> Also, the desktop results were not using any vendor libraries at all.
> >> Just g++-4.7 with Ubuntu's stock math libraries.
> >>
> >> Jack
> >>
> >> On Tue, Mar 5, 2013 at 11:17 AM, Jeff Hammond <jhammond at alcf.anl.gov>
> >> wrote:
> >>>
> >>> The BGQ core is fully in-order with a short instruction pipeline; it
> >>> is single-issue per hardware thread and dual-issue per core, provided
> >>> the ALU and FPU instructions come from two different hardware threads.
> >>>
> >>> A Xeon core is out-of-order with deep pipelines and can decode up to
> >>> four instructions per cycle.  The Internet refuses to tell me for
> >>> certain whether this means it is proper to say a Sandy Bridge is
> >>> quad-issue, but it seems that way.
> >>>
> >>> The memory bandwidth measured by STREAM may be anywhere from 50% to
> >>> 200% higher on an Intel Xeon than on BGQ.  BGQ does 25-30 GB/s,
> >>> whereas a late-model Xeon can do 80 GB/s.  If your code is BW-limited,
> >>> it isn't surprising if a Xeon is ~2x faster.
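
For reference, a minimal STREAM-triad-style probe (not the official STREAM
benchmark) that gives a ballpark per-process figure to compare against the
25-30 and 80 GB/s numbers above; the array size and single-pass timing are
arbitrary choices, so treat the result as a rough estimate only:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        /* Three arrays of 2^25 doubles (~256 MiB each) -- large enough
           to defeat the caches on either machine. */
        const size_t N = (size_t)1 << 25;
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;
        for (size_t i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; ++i)       /* triad: a = b + s*c */
            a[i] = b[i] + 3.0 * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec   = (t1.tv_sec - t0.tv_sec)
                     + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        double bytes = 3.0 * N * sizeof(double);  /* 2 read + 1 write */
        printf("triad: %.1f GB/s\n", bytes / sec / 1e9);

        free(a); free(b); free(c);
        return 0;
    }
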
> >>>
> >>> In addition to normalizing w.r.t. clock-rate, you should normalize
> >>> w.r.t. watts per socket.  BGQ uses 60-70W per node unless you're
> >>> running HPL.  An Intel Xeon uses twice that just for the socket, not
> >>> including DRAM, IO, etc.
> >>>
> >>> Note also that the BGQ QPX vector ISA is much more restrictive than
> >>> AVX w.r.t. alignment.  Additionally, the Intel compilers are way
> >>> better than IBM XL at vectorizing.
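
One practical consequence of the alignment point above: keeping hot arrays
32-byte aligned gives the compiler a chance to emit QPX (or AVX) loads
without peeling or scalar fallback. A small sketch; the 32-byte figure
matches the QPX quad-double register width and is an assumption here, not
something stated in the mail:

    #include <stdlib.h>

    /* Allocate n doubles on a 32-byte boundary; caller free()s the
       result.  Returns NULL on failure. */
    static double *alloc_aligned_doubles(size_t n)
    {
        void *p = NULL;
        if (posix_memalign(&p, 32, n * sizeof(double)) != 0)
            return NULL;
        return (double *)p;
    }
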
> >>>
> >>> Finally, ESSL sucks compared to MKL.  That alone may be worth 2x in
> >>> LAPACK-intensive applications.
> >>>
> >>> Jeff
> >>>
> >>> On Tue, Mar 5, 2013 at 12:59 PM, Jack Poulson <jack.poulson at gmail.com>
> >>> wrote:
> >>> > Hello,
> >>> >
> >>> > I have been benchmarking my code on Vesta and, while I have been
> >>> > seeing excellent strong scaling, I am a little underwhelmed by the
> >>> > wall-clock timings relative to my desktop (Intel(R) Xeon(R) CPU
> >>> > E5-1603 0 @ 2.80GHz). I am using the newest version of bgclang++ on
> >>> > Vesta, and g++-4.7.2 on my desktop (both used -O3), and I am seeing
> >>> > roughly a factor of four difference in performance on the same
> >>> > number of cores.
> >>> >
> >>> > If I ignored the fact that I am using a vendor math library on BGQ
> >>> > and reference implementations on my desktop, I would expect the BGQ
> >>> > timings to be a factor of 1.75 slower due to clockspeed differences.
> >>> > Would anyone have an explanation for the additional factor of more
> >>> > than 2x? My algorithm spends most of its time in sin/cos/sqrt
> >>> > evaluations and dgemm with two right-hand sides.
> >>> >
> >>> > Thanks,
> >>> > Jack
> >>> >
> >>> > _______________________________________________
> >>> > llvm-bgq-discuss mailing list
> >>> > llvm-bgq-discuss at lists.alcf.anl.gov
> >>> > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Jeff Hammond
> >>> Argonne Leadership Computing Facility
> >>> University of Chicago Computation Institute
> >>> jhammond at alcf.anl.gov / (630) 252-5381
> >>> http://www.linkedin.com/in/jeffhammond
> >>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> >>
> >>
> >
>
>
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>