[Llvm-bgq-discuss] Patches for r176829-20130309 (the current vesta version)

Michael Kruse MichaelKruse at meinersbur.de
Thu Apr 18 08:45:33 CDT 2013


2013/4/17 Hal Finkel <hfinkel at anl.gov>:
>> Not a problem; I'd rather get reports early and often.

Good to know. I only get embarrassed when I report something that was
caused by my own stupidity and consumed someone else's time.

>> > 2. bgclang is quite unreliable on inline assembler. With this:
>> > asm (
>> >             "dcbt       0,%[ptr]  \n"
>> >             "dcbt  %[c64],%[ptr]  \n"
>> >             "dcbt %[c128],%[ptr]  \n"
>> >             "dcbt %[c192],%[ptr]  \n"
>> >             "dcbt %[c256],%[ptr]  \n"
>> >             "dcbt %[c320],%[ptr]  \n"
>> >             : :
>> >                 [ptr] "+r" (ptr),
>> >               [c64]  "b" (64),
>> >               [c128] "b" (128),
>> >               [c192] "b" (192),
>> >               [c256] "b" (256),
>> >               [c320] "b" (320)
>> >         );
>> >
>> > I sometimes get
>> > error: invalid input constraint '+r' in asm
>> > other times
>> > fatal error: error in backend: Do not know how to split the result of this operator!
>> > (though I am not sure it's this piece of code, clang doesn't give me a location)
>
> Also, in the meantime, you can use the __dcbt intrinsic. It works just like in xlc (so special header currently required).

There is actually a reason why I use this inline assembly.

I have a big loop body working on two contiguous streams of data. Per
iteration, there are 1536 + 2304 = 3840 bytes to read that I want to
prefetch, which is 60 dcbt instructions at a 64-byte stride. Using

__dcbt(p+0)
__dcbt(p+64)
__dcbt(p+128)
...

will make xlc generate code like

dcbt 0, r1
li r2, 64
dcbt r2, r1
li r2, 128
dcbt r2, r1
...

because there are more constants involved than general-purpose
registers available, so the constants have to be rematerialised in
every loop iteration.

Using a scheme like

dcbt 0, r1
dcbt r3, r1
dcbt r4, r1
dcbt r5, r1
dcbt r6, r1
addi r1, r1, 320
dcbt 0, r1
dcbt r3, r1
dcbt r4, r1
dcbt r5, r1
dcbt r6, r1
addi r1, r1, 320

only 4 constants (64, 128, 192 and 256) are needed, and they can stay
in registers across loop iterations.
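
As a rough sketch of that scheme in C (not the actual loop): the helper
names prefetch_group and prefetch_stream are made up for illustration,
and the non-PowerPC branch uses the GCC/clang builtin __builtin_prefetch
only so the example compiles elsewhere. The pointer is passed as a plain
"r" input here, on the assumption that the '+' modifier is only accepted
on output operands, which would explain the "invalid input constraint"
error quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Stride and group size taken from the scheme above: 5 lines, then +320. */
enum { LINE = 64, GROUP_LINES = 5, GROUP_BYTES = LINE * GROUP_LINES };

/* Hypothetical helper: touch five consecutive 64-byte lines starting at p. */
static inline void prefetch_group(const char *p)
{
#if defined(__powerpc__) || defined(__powerpc64__)
    /* The offsets sit in the RA slot of dcbt, so they use "b" (any GPR
       except r0). The pointer is only read, so it is an "r" input. */
    __asm__ volatile(
        "dcbt 0,%[ptr]\n\t"
        "dcbt %[c64],%[ptr]\n\t"
        "dcbt %[c128],%[ptr]\n\t"
        "dcbt %[c192],%[ptr]\n\t"
        "dcbt %[c256],%[ptr]\n\t"
        : /* no outputs */
        : [ptr] "r"(p),
          [c64] "b"(64L), [c128] "b"(128L),
          [c192] "b"(192L), [c256] "b"(256L));
#else
    /* Portable stand-in so the sketch builds on other targets. */
    for (int i = 0; i < GROUP_LINES; ++i)
        __builtin_prefetch(p + i * LINE);
#endif
}

/* Hypothetical helper: prefetch `bytes` bytes starting at p and return
   the number of 64-byte lines that were touched. */
size_t prefetch_stream(const char *p, size_t bytes)
{
    size_t lines = 0;

    assert(p != NULL);

    /* Full groups: four offset constants are reused, the base advances. */
    while (bytes >= GROUP_BYTES) {
        prefetch_group(p);
        p += GROUP_BYTES;              /* the "addi r1, r1, 320" step */
        bytes -= GROUP_BYTES;
        lines += GROUP_LINES;
    }

    /* Tail: whatever is left after the last full group, line by line. */
    while (bytes > 0) {
        __builtin_prefetch(p);
        p += LINE;
        bytes = bytes > LINE ? bytes - LINE : 0;
        ++lines;
    }
    return lines;
}
```

For the two streams mentioned above this touches 24 and 36 lines
respectively. Whether the compiler really keeps the four offsets in
registers across groups is up to its register allocator; the sketch
only makes that possible.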

The situation is even worse when using vec_ld, which generates code
with just a single qvlfdux (with u=update) instruction and lots of
"li"s.

I don't know yet how clang behaves here.

Regards,
Michael


--
Tardyzentrismus verboten!

