[Llvm-bgq-discuss] more issues from trying bgclang with GROMACS

Mark Abraham mark.abraham at scilifelab.se
Fri Feb 7 12:35:26 CST 2014


On Fri, Feb 7, 2014 at 3:44 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
> > From: "Mark Abraham" <mark.abraham at scilifelab.se>
> > Cc: llvm-bgq-discuss at lists.alcf.anl.gov
> > Sent: Friday, February 7, 2014 8:29:18 AM
> > Subject: Re: [Llvm-bgq-discuss] more issues from trying bgclang with
> GROMACS
> >
> > Hi Hal,
> >
> > AFAICS not the problem, but it's hard for a mere user to find this
> > kind of stuff out:
> >
> > [juqueen3 ~ (juq-homedir)] $ ls /bgsys/drivers/
> > ppcfloor toolchain V1R2M0 V1R2M1
>
> Yep; that seems good. FYI, if you run ls -l /bgsys/drivers you can see to
> which driver ppcfloor is symlinked, and that will give you the answer.
>

Yes, V1R2M1.


> >
> > Jeff helpfully tried to sort me out with an ALCF account last year,
> > but the cryptocard they shipped never worked with the PIN they sent
> > with it, and the helpdesk insisted on me calling from Sweden to get
> > any help at all, so I gave up. :-( That was about the time Congress
> > decided to act more like children than usual, so maybe things were
> > messier than usual! :-)
>
> That's odd. They should be able to authenticate your identity by some
> mechanism other than using caller-id. :( -- When you have a chance,
> please try again; if they still won't help you, I'll raise the issue
> internally.
>

OK.

> On the bgclang issue, if you can reasonably provide instructions on how to
> repeat a test showing this issue, I can try it on my end as well. More
> likely than not, if there is a correctness issue to debug, I'd need to do
> that at some point anyway.
>

I haven't pinpointed where the problem arises this time, but last time it
was with an OpenMP parallel loop over threads executing the innermost
kernels for the MD nonbonded interactions.
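
To be concrete about the shape of that code (a minimal sketch only, with
made-up names and a toy pair interaction, nowhere near the real GROMACS
kernels): an OpenMP loop over particles, with each thread accumulating
into its own force buffer, followed by a reduction over the buffers.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1024

/* Stand-in for one nonbonded inner kernel: accumulate a toy pairwise
   "force" on particle i from all j > i into the caller's buffer. */
static void inner_kernel(int i, const double *x, double *f)
{
    for (int j = i + 1; j < N; j++) {
        double dx  = x[i] - x[j];
        double fij = 1.0 / (dx * dx + 1.0);
        f[i] += fij;
        f[j] -= fij;
    }
}

int main(void)
{
    int     nthreads = omp_get_max_threads();
    double  x[N], f_total[N];
    /* One zero-initialized force buffer per thread, reduced afterwards. */
    double *f_local  = calloc((size_t)nthreads * N, sizeof(*f_local));

    if (!f_local)
        return 1;
    for (int i = 0; i < N; i++) {
        x[i]       = 0.01 * i;
        f_total[i] = 0.0;
    }

#pragma omp parallel
    {
        double *f = f_local + (size_t)omp_get_thread_num() * N;
#pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            inner_kernel(i, x, f);
    }

    /* Serial reduction over the thread-local buffers. */
    for (int t = 0; t < nthreads; t++)
        for (int i = 0; i < N; i++)
            f_total[i] += f_local[(size_t)t * N + i];

    printf("f_total[0] = %g\n", f_total[0]);
    free(f_local);
    return 0;
}

The thread-local buffers are there because a kernel updates both particles
of a pair, so threads cannot safely share one force array; the final
reduction over those buffers is where the NaNs I mentioned in my previous
mail would show up, whether they originate there or inside the kernels.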

I've looked into how simply I can reproduce the OpenMP crash, and it is
ugly. The OpenMP debug build (-g) runs OK with both the plain C and the
QPX-specific kernels. The OpenMP release build (-O3) runs OK with plain C,
but with QPX it produces junk results somewhere, leading to a subsequent
segfault, so the core-file stack trace is not useful.

Altogether, that strongly suggests that the problems start in the same
OpenMP parallel loop as before, and that they are at least somewhat
specific to bgclang. It might be possible to do a dirty comparison of the
resulting energy and force at -O3 -g between the C and QPX versions. My
guess is that the problem will be visible by the end of the very first
inner loop, which should be binary reproducible between the two versions.
If not, then the problem probably lies in the subsequent reduction from
the thread-local force buffers. I just don't have the time to try that
myself and risk being on the wrong track, waiting on two parallel
debugging sessions that lead to no conclusion, especially since we don't
strictly need bgclang to work because xlc does, etc. Happy to advise if
there's something you can identify, though. I can tarball the code +
build instructions + a single input file off-list if you'd like to try.
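
To illustrate the kind of dirty comparison I mean (a sketch only, with
dummy data; in practice the inputs would be the forces dumped from the
plain-C and QPX kernel runs after that first inner loop):

#include <stdio.h>
#include <string.h>

/* Bitwise comparison of two force arrays. If the first inner loop is
   binary reproducible between the plain-C and QPX kernels, the first
   differing element points at the kernel rather than at the later
   reduction from the thread-local buffers. */
static int compare_forces(const double *f_c, const double *f_qpx, int n)
{
    int ndiff = 0;
    for (int i = 0; i < n; i++) {
        if (memcmp(&f_c[i], &f_qpx[i], sizeof(double)) != 0) {
            printf("i=%d  C=%.17g  QPX=%.17g\n", i, f_c[i], f_qpx[i]);
            ndiff++;
        }
    }
    return ndiff;
}

int main(void)
{
    /* Dummy data standing in for dumped kernel output. */
    double f_c[4]   = { 1.0, 2.0, 3.0, 4.0 };
    double f_qpx[4] = { 1.0, 2.0, 3.0000000000000004, 4.0 };
    printf("%d differing elements\n", compare_forces(f_c, f_qpx, 4));
    return 0;
}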

Both the plain C and the QPX kernels are a horrible mess of nested file
#inclusion and subsequent #ifdefs, because we have about 70 different
kernels per SIMD flavour (and growing). (We're working on a Python
generator instead, but that's not ready yet.) I'll preprocess them by
hand into self-contained single files if you want.
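
For a flavour of what that nesting looks like (a toy sketch with invented
names, two files shown together; the real GROMACS layout is much deeper):

/* kernel_template.h -- included once per flavour; the includer defines
   KERNEL_NAME and optionally CALC_ENERGY before each inclusion. */
#ifdef CALC_ENERGY
void KERNEL_NAME(const double *x, double *f, double *energy)
#else
void KERNEL_NAME(const double *x, double *f)
#endif
{
    /* ... inner loop, with further #ifdef CALC_ENERGY blocks ... */
}

/* kernels_c.c -- stamps out the flavours for one SIMD level. */
#define KERNEL_NAME nb_kernel_elec_vdw_energy
#define CALC_ENERGY
#include "kernel_template.h"
#undef CALC_ENERGY
#undef KERNEL_NAME

#define KERNEL_NAME nb_kernel_elec_vdw_force_only
#include "kernel_template.h"
#undef KERNEL_NAME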

The other correctness issue could be anywhere; unfortunately, basically
all of our tests are end-to-end, so when a bunch of them fail you have to
work out the theme(s). I'll have to do that, and it probably won't be
soon!

> Ah, one more thing: are you linking against anything that also links in
> IBM's OpenMP runtime (like ESSL SMP)? That can also cause issues like this.
>

There had been a dependency on a system FFTW, but the above was done with a
fully independent GROMACS. So I think the issue is not an
OpenMP-runtime-version clash.

Mark


>  -Hal
>
> >
> > Mark
> >
> >
> >
> >
> >
> >
> > On Fri, Feb 7, 2014 at 2:27 PM, Hal Finkel < hfinkel at anl.gov > wrote:
> >
> >
> >
> > ----- Original Message -----
> > > From: "Mark Abraham" < mark.j.abraham at gmail.com >
> > > To: llvm-bgq-discuss at lists.alcf.anl.gov
> >
> > > Sent: Friday, February 7, 2014 4:55:24 AM
> > > Subject: Re: [Llvm-bgq-discuss] more issues from trying bgclang
> > > with GROMACS
> > >
> > > Hi,
> > >
> >
> > > Unfortunately, the OpenMP runs failed outright (results from
> > > reduction over threads were nan, reason unclear), and there was
> > > some
> > > other issue. That will take some time to dig into, because we don't
> > > have a "known good with bgclang" code version with which to
> > > compare.
> > > I'll get back to this, but it'll be a few weeks, sorry.
> >
> > What driver version is the machine running? (there are known issues
> > with OpenMP and driver version V1R1M2 (and earlier) -- which I did
> > not think anyone was still using, but it seems some folks still
> > are).
> >
> > -Hal
> >
> >
> >
> > >
> > >
> > > Thanks again,
> > >
> > >
> > > Mark
> > >
> > >
> > >
> > > On Fri, Feb 7, 2014 at 2:23 AM, Mark Abraham <
> > > mark.j.abraham at gmail.com > wrote:
> > >
> > >
> > >
> > > Oops, I did indeed forget to unpack that RPM. Thanks for the tip!
> > > With it, the OpenMP part of the build was flawless. I was able to
> > > work around the other bug by compiling those files at -O2 - which
> > > is fine for normal GROMACS. Test run in the queue :-)
> > >
> > >
> > > Mark
> > >
> > > On Fri, Feb 7, 2014 at 1:37 AM, Hal Finkel < hfinkel at anl.gov >
> > > wrote:
> > >
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Mark Abraham" < mark.j.abraham at gmail.com >
> > > > To: llvm-bgq-discuss at lists.alcf.anl.gov
> > > > Sent: Thursday, February 6, 2014 6:26:55 PM
> > > > Subject: [Llvm-bgq-discuss] more issues from trying bgclang with
> > > > GROMACS
> > > >
> > > > Hi,
> > > >
> > > > I had another go compiling GROMACS 5.0 beta with bgclang latest
> > > > RPM
> > > > (r200401-20140129). CMake detection of OpenMP support in mpiclang
> > > > failed. Detection should just work because using the -fopenmp
> > > > flag
> > > > is a standard way to do it. When I tried a manual compile:
> > > >
> > > >
> > > >
> > > > $ ~/progs/bgclang/current/bin/bgclang -fopenmp test.c -o test
> > > > /homea/slbio/slbio013/progs/bgclang/r200401-20140129/binutils/bin/ld:
> > > > cannot find -liomp5
> > > > clang: error: linker command failed with exit code 1 (use -v to
> > > > see
> > > > invocation)
> > > >
> > > >
> > > > That looks like a lingering Intel-ism?
> > >
> > > Yes, but that's okay, the libomp package should create the
> > > necessary
> > > symlink for you. Did you install
> > > bgclang-libomp-r200401-20140129-1-1.ppc64.rpm?
> > >
> > > -Hal
> > >
> > >
> > > >
> > > >
> > > > The MPI plus non-OpenMP build seemed to go OK, but a file in our
> > > > bundled lapack subset provoked a bug (attached in tarball). That
> > > > file was not a problem in ~August 2013.
> > > >
> > > > Thanks again for the effort!
> > > >
> > > > Cheers,
> > > >
> > > > Mark
> > > > _______________________________________________
> > > > llvm-bgq-discuss mailing list
> > > > llvm-bgq-discuss at lists.alcf.anl.gov
> > > > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > >
> > >
> > >
> > > _______________________________________________
> > > llvm-bgq-discuss mailing list
> > > llvm-bgq-discuss at lists.alcf.anl.gov
> > > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
> > >
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> > _______________________________________________
> > llvm-bgq-discuss mailing list
> > llvm-bgq-discuss at lists.alcf.anl.gov
> > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
> >
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>