[Llvm-bgq-discuss] clang on BGQ performance

Tue Mar 25 13:03:31 CDT 2014

----- Original Message -----
> From: "Thomas Gooding" <tgooding at us.ibm.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "John A. Biddiscombe" <biddisco at cscs.ch>, llvm-bgq-discuss at lists.alcf.anl.gov,
> llvm-bgq-discuss-bounces at lists.alcf.anl.gov
> Sent: Tuesday, March 25, 2014 12:58:36 PM
> Subject: Re: [Llvm-bgq-discuss] clang on BGQ performance
> 
> 
> 
> bgclang's good non-OMP COPY performance is due to an implicit call to
> memcpy(), which is QPX optimized (see the previous thread about
> built-ins ;)

Yes, although bgclang does not even generate that memcpy() call when OpenMP is enabled. On the other hand, the QPX code generation for the other kernels (when OpenMP is not enabled) is actually not horrible.
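
As a rough illustration of the implicit memcpy() point, here is a minimal C sketch. The array size N and the function names are made up for this example and are not taken from stream.c. An optimizer that recognizes the loop in copy_loop() as a copy idiom is free to replace it with a memcpy() call, which would then pick up whatever tuning the library routine has. Whether that actually happens for a given compiler and flag set has to be checked in the generated assembly; the sketch only shows that the two forms compute the same thing.

```c
#include <stddef.h>
#include <string.h>

#define N 1000 /* hypothetical size; STREAM itself uses far larger arrays */

static double a[N], c[N], c_ref[N];

/* Element-wise copy, in the shape of the STREAM Copy kernel.
 * A compiler may recognize this loop as a copy idiom and emit memcpy(). */
void copy_loop(void)
{
    for (size_t j = 0; j < N; j++)
        c[j] = a[j];
}

/* The same operation written as an explicit library call. */
void copy_memcpy(void)
{
    memcpy(c_ref, a, sizeof a);
}

/* Fills the source array, runs both variants, and returns 1 if the
 * two destination arrays are byte-for-byte identical, 0 otherwise. */
int copies_agree(void)
{
    for (size_t j = 0; j < N; j++)
        a[j] = 1.0 + (double)j;

    copy_loop();
    copy_memcpy();

    return memcmp(c, c_ref, sizeof c) == 0;
}
```

Calling copies_agree() from a small main() should return 1. Comparing the assembly for copy_loop() with and without OpenMP enabled is one way to see whether the memcpy() substitution took place.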

> 
> If it helps, my sampling profiler gives this breakdown for the -fopenmp build:
> 
> thread 0 count= 772 (19.30%) ..omp_microtask.37
> thread 0 count= 748 (18.70%) ..omp_microtask.35
> thread 0 count= 717 (17.93%) ..omp_microtask.36
> thread 0 count= 622 (15.55%) ..omp_microtask.34
> thread 0 count= 102 ( 2.55%) .checkSTREAMresults
> thread 0 count=  72 ( 1.80%) ..omp_microtask.15
> thread 0 count=  44 ( 1.10%) ..omp_microtask.12
> 
> runjob --strace 0 shows several calls to gettimeofday, but not much
> other kernel activity. So I suspect it's spending its time in an OMP
> runtime optimization opportunity.

I think the primary problem is that we're losing loop bounds information and pointer aliasing information when we do the OpenMP outlining (the creation of the inner microtask functions).
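
To make the outlining issue concrete, here is a simplified C sketch. The struct triad_ctx, the function names, and the (lo, hi) calling convention are hypothetical and chosen only for illustration; they are not the real OpenMP runtime interface or the code Clang actually generates. In triad_direct() the compiler can see three distinct global arrays and a constant trip count. In triad_microtask() the same loop only sees pointers loaded from a context object and bounds that arrive at run time, so without further information it cannot assume that the arrays do not overlap.

```c
#include <stddef.h>

#define N 1000 /* hypothetical size for this sketch */

static double a[N], b[N], c[N];

/* Before outlining: distinct global arrays, compile-time trip count. */
void triad_direct(double scalar)
{
    for (size_t j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Hypothetical context object handed to an outlined region. */
struct triad_ctx {
    double *a;
    const double *b;
    const double *c;
    double scalar;
};

/* After outlining (sketch): bounds are run-time values and the arrays
 * are reached through pointers that, as far as this function can
 * tell, might alias one another. */
void triad_microtask(struct triad_ctx *ctx, size_t lo, size_t hi)
{
    for (size_t j = lo; j < hi; j++)
        ctx->a[j] = ctx->b[j] + ctx->scalar * ctx->c[j];
}

/* Runs both variants on exactly representable inputs and returns 1
 * if each produces a[j] == 7*j for every j, 0 otherwise. */
int triad_variants_agree(void)
{
    const double scalar = 3.0;
    struct triad_ctx ctx = { a, b, c, scalar };
    int ok = 1;

    for (size_t j = 0; j < N; j++) {
        b[j] = (double)j;
        c[j] = 2.0 * (double)j;
    }

    triad_direct(scalar);
    for (size_t j = 0; j < N; j++)
        if (a[j] != 7.0 * (double)j)
            ok = 0;

    for (size_t j = 0; j < N; j++)
        a[j] = 0.0;

    triad_microtask(&ctx, 0, N);
    for (size_t j = 0; j < N; j++)
        if (a[j] != 7.0 * (double)j)
            ok = 0;

    return ok;
}
```

Both variants compute the same values; the difference is only in how much the optimizer can prove about the loop, which is what matters for vectorization and scheduling.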

 -Hal

> 
> Tom Gooding
> Senior Engineer / Blue Gene SW Lead / CAPI
> tgooding at us.ibm.com 507-253-0747
> 
> 
> From: Hal Finkel <hfinkel at anl.gov>
> To: "John A. Biddiscombe" <biddisco at cscs.ch>
> Cc: llvm-bgq-discuss at lists.alcf.anl.gov
> Date: 03/25/2014 12:03 PM
> Subject: Re: [Llvm-bgq-discuss] clang on BGQ performance
> Sent by: llvm-bgq-discuss-bounces at lists.alcf.anl.gov
> 
> John,
> 
> Thanks for looking into this (and providing a useful benchmark)!
> You'll find this interesting:
> 
> bgclang -O3 -fopenmp with 1 thread:
> 
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:             635.7     0.251708     0.251708     0.251709
> Scale:            519.7     0.307855     0.307855     0.307856
> Add:              802.0     0.299267     0.299266     0.299267
> Triad:            753.4     0.318716     0.318566     0.318735
> 
> gcc 4.7.2 -O3 -fopenmp with 1 thread:
> 
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:            2067.4     0.077393     0.077392     0.077395
> Scale:           1329.4     0.120353     0.120353     0.120354
> Add:             1943.5     0.123490     0.123489     0.123490
> Triad:           1872.4     0.128179     0.128178     0.128179
> 
> gcc without OpenMP is actually slightly worse, go figure ;)
> 
> bgclang -O3 with 1 thread (with no -fopenmp):
> 
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:           15660.2     0.010296     0.010217     0.010870
> Scale:           5523.7     0.028967     0.028966     0.028967
> Add:             6283.2     0.038198     0.038197     0.038198
> Triad:           6331.9     0.037906     0.037903     0.037920
> 
> bgxlc_r -O3 -qsmp=omp with 1 thread:
> 
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:            3762.0     0.042535     0.042531     0.042538
> Scale:           5083.5     0.031481     0.031474     0.031494
> Add:             7394.2     0.032487     0.032458     0.032510
> Triad:           7397.6     0.032481     0.032443     0.032499
> 
> bgxlc_r -O3 (no -qsmp=omp) with 1 thread:
> 
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:            3574.1     0.044768     0.044767     0.044769
> Scale:           3301.2     0.048468     0.048467     0.048469
> Add:             4233.2     0.056696     0.056694     0.056699
> Triad:           4350.1     0.055173     0.055171     0.055177
> 
> All of these builds defined TUNED (only because it puts the kernels
> into separate functions). It seems that the OpenMP outlining in
> Clang/LLVM is seriously interfering with the ability of the
> vectorizer and instruction scheduler to do useful work. I assume
> that most of this is because pointer aliasing information is
> lost in the OpenMP transformation. We'll need to work on this! (I'm
> actually in the middle of working on a new pointer aliasing
> framework for LLVM, and I'll be able to use that to solve a lot of
> these issues.)
> 
> -Hal
> 
> ----- Original Message -----
> > From: "John A. Biddiscombe" <biddisco at cscs.ch>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: llvm-bgq-discuss at lists.alcf.anl.gov
> > Sent: Tuesday, March 25, 2014 11:44:50 AM
> > Subject: RE: [Llvm-bgq-discuss] clang on BGQ performance
> > 
> > > Can you please provide details on exactly what you did? What
> > > compile flags
> > > did you use, did you define TUNED?
> > 
> > I edited the Makefile to skip the Fortran build and set the bgclang variables:
> > 
> > bbpbgas040:~/bgas/clang/build/stream$ cat Makefile
> > 
> > CC = bgclang
> > CFLAGS = -O3 -fopenmp -L/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/bgclang/omp/lib/
> > 
> > all: stream_c.exe
> > 
> > stream_c.exe: stream.c
> > 	$(CC) $(CFLAGS) stream.c -o stream_c.exe
> > 
> > clean:
> > 	rm -f stream_c.exe *.o
> > 
> > 
> > Then I just ran make. I didn't set any other variables (such as TUNED).
> > 
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory