<html><body>
<p><font size="2" face="sans-serif">bgclang's good non-OpenMP COPY performance comes from an implicit call to memcpy(), which is QPX-optimized (see the previous thread about built-ins ;)</font><br>
<br>
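<font size="2" face="sans-serif">As an illustration of the point above, here is a minimal sketch (not taken from stream.c or from bgclang output) of a COPY-style loop next to the memcpy() call it is equivalent to when the arrays do not overlap. The function names and the restrict-qualified parameters are assumptions for the example; in STREAM the arrays are distinct static globals. Whether a given compiler build actually performs this substitution has to be checked in the generated code.</font><br>

```c
#include <stddef.h>
#include <string.h>

/* COPY-style kernel: c[j] = a[j]. Because the arrays are promised not to
   overlap, an optimizer is allowed to treat this loop as a memcpy idiom. */
void copy_loop(double *restrict c, const double *restrict a, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j];
}

/* The library call the loop above is equivalent to for non-overlapping
   arrays. */
void copy_memcpy(double *restrict c, const double *restrict a, size_t n)
{
    memcpy(c, a, n * sizeof *a);
}
```

<font size="2" face="sans-serif">Both functions leave the same bytes in c; only the second is guaranteed to go through the library's memcpy().</font><br>
<br>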
<font size="2" face="sans-serif">If it helps, my sampling profiler shows this breakdown for the -fopenmp build:</font>
<ul style="padding-left: 18pt"><tt><font size="2">thread 0 count= 772 (19.30%) ..omp_microtask.37</font></tt><br>
<tt><font size="2">thread 0 count= 748 (18.70%) ..omp_microtask.35</font></tt><br>
<tt><font size="2">thread 0 count= 717 (17.93%) ..omp_microtask.36</font></tt><br>
<tt><font size="2">thread 0 count= 622 (15.55%) ..omp_microtask.34</font></tt><br>
<tt><font size="2">thread 0 count= 102 (2.55%) .checkSTREAMresults</font></tt><br>
<tt><font size="2">thread 0 count= 72 (1.80%) ..omp_microtask.15</font></tt><br>
<tt><font size="2">thread 0 count= 44 (1.10%) ..omp_microtask.12</font></tt></ul>
<br>
<br>
<font size="2" face="sans-serif">runjob --strace 0 shows several calls to gettimeofday, but not much other kernel activity. So I suspect it's spending its time in an OMP runtime optimization opportunity.</font><br>
<br>
<font size="2" face="sans-serif">Tom Gooding<br>
Senior Engineer / Blue Gene SW Lead / CAPI<br>
tgooding@us.ibm.com 507-253-0747<br>
</font><br>
<br>
<font size="2" color="#424282" face="sans-serif">Hal Finkel wrote on 03/25/2014 12:03:10 PM:</font><br>
<br>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr valign="top"><td width="1%"><ul style="padding-left: 4pt"><font size="1" color="#5F5F5F" face="sans-serif">From:</font></ul>
</td><td width="100%"><font size="1" face="sans-serif">Hal Finkel &lt;hfinkel@anl.gov&gt;</font></td></tr>
<tr valign="top"><td width="1%"><ul style="padding-left: 4pt"><font size="1" color="#5F5F5F" face="sans-serif">To:</font></ul>
</td><td width="100%"><font size="1" face="sans-serif">"John A. Biddiscombe" &lt;biddisco@cscs.ch&gt;</font></td></tr>
<tr valign="top"><td width="1%"><ul style="padding-left: 4pt"><font size="1" color="#5F5F5F" face="sans-serif">Cc:</font></ul>
</td><td width="100%" valign="middle"><font size="1" face="sans-serif">llvm-bgq-discuss@lists.alcf.anl.gov</font></td></tr>
<tr valign="top"><td width="1%"><ul style="padding-left: 4pt"><font size="1" color="#5F5F5F" face="sans-serif">Date:</font></ul>
</td><td width="100%"><font size="1" face="sans-serif">03/25/2014 12:03 PM</font></td></tr>
<tr valign="top"><td width="1%"><ul style="padding-left: 4pt"><font size="1" color="#5F5F5F" face="sans-serif">Subject:</font></ul>
</td><td width="100%"><font size="1" face="sans-serif">Re: [Llvm-bgq-discuss] clang on BGQ performance</font></td></tr>
<tr valign="top"><td width="1%"><ul style="padding-left: 4pt"><font size="1" color="#5F5F5F" face="sans-serif">Sent by:</font></ul>
</td><td width="100%"><font size="1" face="sans-serif">llvm-bgq-discuss-bounces@lists.alcf.anl.gov</font></td></tr>
</table>
<hr width="100%" size="2" align="left" noshade style="color:#8091A5; "><br>
<br>
<br>
<tt><font size="2">John,<br>
<br>
Thanks for looking into this (and providing a useful benchmark)! You'll find this interesting:<br>
<br>
bgclang -O3 -fopenmp with 1 thread:<br>
<br>
Function Best Rate MB/s Avg time Min time Max time<br>
Copy: 635.7 0.251708 0.251708 0.251709<br>
Scale: 519.7 0.307855 0.307855 0.307856<br>
Add: 802.0 0.299267 0.299266 0.299267<br>
Triad: 753.4 0.318716 0.318566 0.318735<br>
<br>
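For readers checking the tables: the Best Rate column is consistent with bytes moved divided by Min time, with 1 MB = 10^6 bytes. The sketch below is a hypothetical helper (stream_rate_mbs is not a function in stream.c). It assumes 8-byte doubles and an array length of 10,000,000 elements; that length is inferred from the numbers and is not stated in this thread. Copy and Scale touch two arrays per iteration, Add and Triad three.<br>
<br>

```c
#include <stddef.h>

/* Bandwidth in MB/s (1 MB = 1e6 bytes) for a kernel that reads or writes
   `arrays` arrays of `n` doubles in `seconds` seconds.
   Hypothetical helper for illustration; assumes sizeof(double) == 8. */
double stream_rate_mbs(int arrays, size_t n, double seconds)
{
    double bytes = (double)arrays * (double)n * (double)sizeof(double);
    return 1.0e-6 * bytes / seconds;
}
```

With the assumed length, stream_rate_mbs(2, 10000000, 0.251708) gives roughly the 635.7 MB/s reported for Copy above, and stream_rate_mbs(3, 10000000, 0.299266) roughly the 802.0 MB/s reported for Add.<br>
<br>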
gcc 4.7.2 -O3 -fopenmp with 1 thread:<br>
<br>
Function Best Rate MB/s Avg time Min time Max time<br>
Copy: 2067.4 0.077393 0.077392 0.077395<br>
Scale: 1329.4 0.120353 0.120353 0.120354<br>
Add: 1943.5 0.123490 0.123489 0.123490<br>
Triad: 1872.4 0.128179 0.128178 0.128179<br>
<br>
gcc without OpenMP is actually slightly worse, go figure ;)<br>
<br>
bgclang -O3 (no -fopenmp) with 1 thread:<br>
<br>
Function Best Rate MB/s Avg time Min time Max time<br>
Copy: 15660.2 0.010296 0.010217 0.010870<br>
Scale: 5523.7 0.028967 0.028966 0.028967<br>
Add: 6283.2 0.038198 0.038197 0.038198<br>
Triad: 6331.9 0.037906 0.037903 0.037920<br>
<br>
bgxlc_r -O3 -qsmp=omp with 1 thread:<br>
<br>
Function Best Rate MB/s Avg time Min time Max time<br>
Copy: 3762.0 0.042535 0.042531 0.042538<br>
Scale: 5083.5 0.031481 0.031474 0.031494<br>
Add: 7394.2 0.032487 0.032458 0.032510<br>
Triad: 7397.6 0.032481 0.032443 0.032499<br>
<br>
bgxlc_r -O3 (no -qsmp=omp) with 1 thread:<br>
<br>
Function Best Rate MB/s Avg time Min time Max time<br>
Copy: 3574.1 0.044768 0.044767 0.044769<br>
Scale: 3301.2 0.048468 0.048467 0.048469<br>
Add: 4233.2 0.056696 0.056694 0.056699<br>
Triad: 4350.1 0.055173 0.055171 0.055177<br>
<br>
All of these builds defined TUNED (only because it puts the kernels into separate functions). It seems that the OpenMP outlining in Clang/LLVM is seriously interfering with the ability of the vectorizer and instruction scheduler to do useful work. I assume that most of this is because pointer aliasing information is lost in the OpenMP transformation. We'll need to work on this! (I'm actually in the middle of working on a new pointer aliasing framework for LLVM, and I'll be able to use that to solve a lot of these issues.)<br>
<br>
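To make the aliasing point concrete, here is a hand-written analogue of what outlining does to a Triad-style loop. It is a sketch under stated assumptions, not actual Clang output: struct triad_args, triad_outlined and triad_restrict are hypothetical names. In the outlined-style version the arrays arrive as plain pointers through a context block, so the compiler has to assume that a store to a[j] may change b[] or c[]; the restrict version puts the no-overlap promise back.<br>
<br>

```c
#include <stddef.h>

/* Hypothetical stand-in for the block of captured variables passed to an
   outlined parallel region. Not actual compiler output. */
struct triad_args {
    double *a;
    const double *b;
    const double *c;
    double scalar;
    size_t n;
};

/* Outlined-style body: nothing here says that a, b and c are distinct
   arrays, so loads of b[] and c[] cannot be freely reordered across the
   stores to a[]. */
void triad_outlined(void *ctx)
{
    struct triad_args *t = ctx;
    for (size_t j = 0; j < t->n; j++)
        t->a[j] = t->b[j] + t->scalar * t->c[j];
}

/* Same kernel with the no-alias promise stated through restrict. */
void triad_restrict(double *restrict a, const double *restrict b,
                    const double *restrict c, double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

Both functions compute the same result; the difference is only in how much the optimizer is allowed to assume, which is the kind of information that can get lost in the transformation described above.<br>
<br>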
-Hal<br>
<br>
----- Original Message -----<br>
> From: "John A. Biddiscombe" &lt;biddisco@cscs.ch&gt;<br>
> To: "Hal Finkel" &lt;hfinkel@anl.gov&gt;<br>
> Cc: llvm-bgq-discuss@lists.alcf.anl.gov<br>
> Sent: Tuesday, March 25, 2014 11:44:50 AM<br>
> Subject: RE: [Llvm-bgq-discuss] clang on BGQ performance<br>
> <br>
> > Can you please provide details on exactly what you did? What<br>
> > compile flags<br>
> > did you use, did you define TUNED?<br>
> <br>
> edited Makefile to skip the fortran and set bgclang vars<br>
> <br>
> bbpbgas040:~/bgas/clang/build/stream$ cat Makefile<br>
> <br>
> CC = bgclang<br>
> CFLAGS = -O3 -fopenmp -L/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/bgclang/omp/lib/<br>
> <br>
> all: stream_c.exe<br>
> <br>
> stream_c.exe: stream.c<br>
> $(CC) $(CFLAGS) stream.c -o stream_c.exe<br>
> <br>
> clean:<br>
> rm -f stream_c.exe *.o<br>
> <br>
> <br>
> then just a make. I didn't set any other vars (like TUNED etc)<br>
> <br>
> <br>
<br>
-- <br>
Hal Finkel<br>
Assistant Computational Scientist<br>
Leadership Computing Facility<br>
Argonne National Laboratory<br>
_______________________________________________<br>
llvm-bgq-discuss mailing list<br>
llvm-bgq-discuss@lists.alcf.anl.gov<br>
</font></tt><tt><font size="2"><a href="https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss">https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss</a></font></tt><tt><font size="2"><br>
<br>
</font></tt><br>
<br>
</body></html>