[Llvm-bgq-discuss] clang on BGQ performance

Tue Mar 25 12:02:59 CDT 2014

John,

Thanks for looking into this (and providing a useful benchmark)! You'll find this interesting:

bgclang -O3 -fopenmp with 1 thread:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             635.7     0.251708     0.251708     0.251709
Scale:            519.7     0.307855     0.307855     0.307856
Add:              802.0     0.299267     0.299266     0.299267
Triad:            753.4     0.318716     0.318566     0.318735

gcc 4.7.2 -O3 -fopenmp with 1 thread:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2067.4     0.077393     0.077392     0.077395
Scale:           1329.4     0.120353     0.120353     0.120354
Add:             1943.5     0.123490     0.123489     0.123490
Triad:           1872.4     0.128179     0.128178     0.128179

gcc without OpenMP is actually slightly worse, go figure ;)

bgclang -O3 (no -fopenmp) with 1 thread:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           15660.2     0.010296     0.010217     0.010870
Scale:           5523.7     0.028967     0.028966     0.028967
Add:             6283.2     0.038198     0.038197     0.038198
Triad:           6331.9     0.037906     0.037903     0.037920

bgxlc_r -O3 -qsmp=omp with 1 thread:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3762.0     0.042535     0.042531     0.042538
Scale:           5083.5     0.031481     0.031474     0.031494
Add:             7394.2     0.032487     0.032458     0.032510
Triad:           7397.6     0.032481     0.032443     0.032499

bgxlc_r -O3 (no -qsmp=omp) with 1 thread:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3574.1     0.044768     0.044767     0.044769
Scale:           3301.2     0.048468     0.048467     0.048469
Add:             4233.2     0.056696     0.056694     0.056699
Triad:           4350.1     0.055173     0.055171     0.055177

All of these were built with TUNED defined (only because that puts the kernels into separate functions). It seems that the OpenMP outlining in Clang/LLVM is seriously interfering with the ability of the vectorizer and instruction scheduler to do useful work: with -fopenmp, bgclang's Triad rate drops from 6331.9 to 753.4 MB/s, roughly a factor of 8. I assume that most of this is because pointer aliasing information is lost in the OpenMP transformation. We'll need to work on this! (I'm actually in the middle of working on a new pointer aliasing framework for LLVM, and I'll be able to use that to solve a lot of these issues.)
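
For readers unfamiliar with outlining, here is a rough, hand-written sketch of the effect described above. It is not what Clang actually emits; the names triad_ctx, triad_outlined and triad_via_outlining are hypothetical and exist only for illustration. The idea is that the loop body is moved into a separate function that receives its arrays through a block of captured variables, so any no-alias guarantee attached to the original pointers (here expressed with restrict) is no longer visible where the loop is compiled.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel as the programmer wrote it: restrict promises that a, b and c
   do not overlap, which makes the loop straightforward to vectorize. */
void triad_direct(double *restrict a, const double *restrict b,
                  const double *restrict c, double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Hypothetical stand-in for the block of captured variables that an
   outlining transformation hands to the extracted loop body. */
struct triad_ctx {
    double *a;          /* no restrict here: the no-alias promise is gone */
    const double *b;
    const double *c;
    double scalar;
};

/* Hand-written imitation of an outlined region. Inside this function the
   compiler only sees three pointers loaded from a struct, so it has to
   assume they may alias unless it can prove otherwise. */
void triad_outlined(void *arg, size_t lo, size_t hi)
{
    struct triad_ctx *ctx = arg;
    assert(lo <= hi);
    for (size_t j = lo; j < hi; j++)
        ctx->a[j] = ctx->b[j] + ctx->scalar * ctx->c[j];
}

/* Stand-in for the call into the OpenMP runtime; with one thread the
   whole iteration range goes to a single invocation. */
void triad_via_outlining(double *a, const double *b, const double *c,
                         double scalar, size_t n)
{
    struct triad_ctx ctx = { a, b, c, scalar };
    triad_outlined(&ctx, 0, n);
}
```

Both versions compute the same result; the difference is only in how much the optimizer can assume about the pointers when it compiles the loop.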

 -Hal

----- Original Message -----
> From: "John A. Biddiscombe" <biddisco at cscs.ch>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-bgq-discuss at lists.alcf.anl.gov
> Sent: Tuesday, March 25, 2014 11:44:50 AM
> Subject: RE: [Llvm-bgq-discuss] clang on BGQ performance
> 
> > Can you please provide details on exactly what you did? What compile
> > flags did you use, did you define TUNED?
> 
> I edited the Makefile to skip the Fortran build and set the bgclang vars:
> 
> bbpbgas040:~/bgas/clang/build/stream$ cat Makefile
> 
> CC = bgclang
> CFLAGS = -O3 -fopenmp -L/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/bgclang/omp/lib/
> 
> all:  stream_c.exe
> 
> stream_c.exe: stream.c
>         $(CC) $(CFLAGS) stream.c -o stream_c.exe
> 
> clean:
>         rm -f stream_c.exe *.o
> 
> 
> Then just a make. I didn't set any other vars (like TUNED etc.).
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory