[Llvm-bgq-discuss] Patches for r176829-20130309 (the current vesta version)

Thu Apr 18 09:15:23 CDT 2013

----- Original Message -----
> From: "Michael Kruse" <MichaelKruse at meinersbur.de>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Michael Kruse" <reply at meinersbur.de>, llvm-bgq-discuss at lists.alcf.anl.gov
> Sent: Thursday, April 18, 2013 8:45:33 AM
> Subject: Re: [Llvm-bgq-discuss] Patches for r176829-20130309 (the current vesta version)
> 
> 2013/4/17 Hal Finkel <hfinkel at anl.gov>:
> >> Not a problem; I'd rather get reports early and often.
> 
> Good to know. I only get embarrassed when I report something that was
> caused by my own stupidity and consumed someone else's time.
> 
> >> > 2. bgclang is quite unreliable on inline assembler. With this:
> >> > asm (
> >> >             "dcbt       0,%[ptr]  \n"
> >> >             "dcbt  %[c64],%[ptr]  \n"
> >> >             "dcbt %[c128],%[ptr]  \n"
> >> >             "dcbt %[c192],%[ptr]  \n"
> >> >             "dcbt %[c256],%[ptr]  \n"
> >> >             "dcbt %[c320],%[ptr]  \n"
> >> >             : :
> >> >                 [ptr] "+r" (ptr),
> >> >               [c64]  "b" (64),
> >> >               [c128] "b" (128),
> >> >               [c192] "b" (192),
> >> >               [c256] "b" (256),
> >> >               [c320] "b" (320)
> >> >         );
> >> >
> >> > I sometimes get
> >> > error: invalid input constraint '+r' in asm
> >> > other times
> >> > fatal error: error in backend: Do not know how to split the result of this operator!
> >> > (though I am not sure it's this piece of code, clang doesn't give me a location)
> >
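
On the first error message: in GCC-style extended asm, the `+` modifier marks a read-write output operand, so it is only accepted in the output list (after the first colon). In the snippet quoted above, `[ptr] "+r"` sits in the input list (after the second colon), which is what "invalid input constraint '+r'" complains about. A minimal sketch of the same block with `ptr` as a plain input is below. The helper name `prefetch_block` and the `__builtin_prefetch` fallback for non-PowerPC hosts are assumptions for illustration; this has not been checked against bgclang r176829 and says nothing about the second ("split the result") error.

```c
#include <stddef.h>

/* Hypothetical helper: prefetch six consecutive 64-byte lines starting at ptr. */
static inline void prefetch_block(const void *ptr)
{
#if defined(__powerpc__) || defined(__powerpc64__) || defined(__PPC__)
    __asm__ volatile (
        "dcbt       0,%[ptr]  \n"
        "dcbt  %[c64],%[ptr]  \n"
        "dcbt %[c128],%[ptr]  \n"
        "dcbt %[c192],%[ptr]  \n"
        "dcbt %[c256],%[ptr]  \n"
        "dcbt %[c320],%[ptr]  \n"
        : /* no outputs */
        : [ptr]  "r" (ptr),   /* plain input: dcbt does not modify it */
          [c64]  "b" (64),    /* "b": base register, so never r0 */
          [c128] "b" (128),
          [c192] "b" (192),
          [c256] "b" (256),
          [c320] "b" (320)
    );
#else
    /* Assumed fallback so the sketch also builds on non-PowerPC hosts. */
    const char *p = (const char *)ptr;
    for (size_t off = 0; off <= 320; off += 64)
        __builtin_prefetch(p + off);
#endif
}
```

The "b" constraint on the offsets matters because a register number of 0 in the first dcbt operand is read as the literal value zero rather than as r0.
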
> > Also, in the meantime, you can use the __dcbt intrinsic. It works
> > just like in xlc (so special header currently required).
> 
> There is actually a reason why I use this inline assembly.
> 
> I have a big loop body working on two contiguous streams of data. Per
> iteration, there are 1536 + 2304 bytes to read that I want to
> prefetch. Using
> 
> __dcbt(p+0)
> __dcbt(p+64)
> __dcbt(p+128)
> ...
> 
> will make xlc generate code like
> 
> dcbt 0, r1
> li r2, 64
> dcbt r2, r1
> li r2, 128
> dcbt r2, r1
> ...
> 
> because there are more constants involved than general purpose
> registers available. So the constants have to be rematerialised in
> every loop iteration.
> Using a scheme like
> 
> dcbt 0, r1
> dcbt r3, r1
> dcbt r4, r1
> dcbt r5, r1
> dcbt r6, r1
> addi r1, r1, 320
> dcbt 0, r1
> dcbt r3, r1
> dcbt r4, r1
> dcbt r5, r1
> dcbt r6, r1
> addi r1, r1, 320
> 
> only 4 constants need to be kept in registers, and they can be
> preserved across loop iterations.

Interesting. Both xlc and llvm have 'loop strength reduction' passes that are supposed to take care of this kind of thing. However, please do let me know how clang behaves for you. If it is not doing the right thing here, then I'd like to fix it. Prefetching is *very* important on the Q because of the L1P access latency, and we need to make sure that the compiler supports it as well as it possibly can (and dcbt is really the only common instruction without an update form).
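
As a rough source-level illustration of the pointer-bumping scheme quoted above (not output from either compiler), the sketch below prefetches the next 320-byte chunk at five fixed offsets from one base pointer and then advances by 320. It uses the GCC/clang builtin `__builtin_prefetch` as a portable stand-in for `__dcbt`. The function name `sum_with_prefetch`, the byte-summing loop body, and the one-chunk prefetch distance are assumptions made only so the example is complete. Whether the offsets actually stay in registers is up to the compiler's loop strength reduction.

```c
#include <stddef.h>
#include <stdint.h>

enum { LINE = 64, CHUNK = 320 };  /* 5 cache lines per step, as in the listing */

/* Hypothetical kernel: sums n bytes while prefetching one chunk ahead. */
static uint64_t sum_with_prefetch(const unsigned char *data, size_t n)
{
    uint64_t sum = 0;
    size_t pos = 0;

    for (; pos + CHUNK <= n; pos += CHUNK) {
        /* Prefetch the next chunk only if it lies entirely inside the buffer. */
        if (pos + 2 * CHUNK <= n) {
            const unsigned char *next = data + pos + CHUNK;
            __builtin_prefetch(next + 0 * LINE);
            __builtin_prefetch(next + 1 * LINE);
            __builtin_prefetch(next + 2 * LINE);
            __builtin_prefetch(next + 3 * LINE);
            __builtin_prefetch(next + 4 * LINE);
        }
        for (size_t i = 0; i < CHUNK; ++i)
            sum += data[pos + i];
    }
    for (; pos < n; ++pos)  /* tail shorter than one chunk */
        sum += data[pos];
    return sum;
}
```

Comparing the assembly each compiler emits for a loop of this shape (for example with -S) would show whether the four offsets are hoisted out of the loop or rematerialised with li on every iteration.
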

> 
> The situation is even worse when using vec_ld, which generates code
> with just a single qvlfdux (with u=update) instruction and lots of
> "li"s.

Please let me know; I've seen llvm generate chains of update forms on some loops, but if you have cases in which it does not work well, then I'll improve it.

The whole point of having our own compiler is so that we can make sure it gets these kinds of things right :)

Thanks again,
Hal

> 
> I don't know yet how clang behaves here.
> 
> Regards,
> Michael
> 
> 
> --
> Tardyzentrismus verboten!
>