[Llvm-bgq-discuss] more issues from trying bgclang with GROMACS

Hal Finkel hfinkel at anl.gov
Fri Feb 7 13:10:21 CST 2014


----- Original Message -----
> From: "Mark Abraham" <mark.abraham at scilifelab.se>
> To: llvm-bgq-discuss at lists.alcf.anl.gov
> Sent: Friday, February 7, 2014 12:35:26 PM
> Subject: Re: [Llvm-bgq-discuss] more issues from trying bgclang with GROMACS
> 
> On Fri, Feb 7, 2014 at 3:44 PM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> 
> 
> ----- Original Message -----
> > From: "Mark Abraham" < mark.abraham at scilifelab.se >
> > Cc: llvm-bgq-discuss at lists.alcf.anl.gov
> > Sent: Friday, February 7, 2014 8:29:18 AM
> > Subject: Re: [Llvm-bgq-discuss] more issues from trying bgclang
> > with GROMACS
> > 
> > 
> > 
> > 
> > 
> 
> > Hi Hal,
> > 
> > AFAICS not the problem, but it's hard for a mere user to find this
> > kind of stuff out:
> > 
> > [juqueen3 ~ (juq-homedir)] $ ls /bgsys/drivers/
> > ppcfloor toolchain V1R2M0 V1R2M1
> 
> Yep; that seems good. FYI, if you run ls -l /bgsys/drivers you can
> see to which driver ppcfloor is symlinked, and that will give you
> the answer.
> 
> 
> 
> 
> Yes, V1R2M1.
> 
> 
> 
> 
> > 
> > Jeff helpfully tried to sort me out with an ALCF account last year,
> > but the cryptocard they shipped never worked with the PIN they sent
> > with it, and the helpdesk insisted on me calling from Sweden to get
> > any help at all, so I gave up. :-( That was about the time Congress
> > decided to act more like children than usual, so maybe things were
> > messier than usual! :-)
> 
> That's odd. They should be able to authenticate your identity by some
> mechanism other than caller ID. :( When you have a chance, please
> try again; if they still won't help you, I'll raise the issue
> internally.
> 
> 
> 
> OK.
> 
> 
> 
> On the bgclang issue, if you can reasonably provide instructions on
> how to repeat a test showing this issue, I can try it on my end as
> well. More likely than not, if there is a correctness issue to
> debug, I'd need to do that at some point anyway.
> 
> 
> 
> I haven't pinpointed where the problem arises this time, but last
> time it was with an omp parallel for over threads executing the
> innermost kernels for MD nonbonded interactions.
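
The pattern described here (a parallel loop over threads, each running an inner kernel into its own force buffer, followed by a reduction over those buffers) can be sketched roughly as follows. This is not GROMACS code: toy_kernel, compute_forces, the ring of pair interactions and the use of double are all assumptions made for illustration. In C the directive would be "#pragma omp parallel for"; without -fopenmp the pragma is ignored and the loop runs serially.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a nonbonded kernel. Pair p couples atom p
 * with atom (p + 1) % natoms and applies equal and opposite forces,
 * so one thread may touch atoms that another thread also touches. */
static void toy_kernel(double *f_local, int natoms, int begin, int end)
{
    int p;
    for (p = begin; p < end; p++) {
        int q = (p + 1) % natoms;
        double w = (double)(p + 1);
        f_local[p] += w;
        f_local[q] -= w;
    }
}

/* Each thread accumulates into its own buffer; the buffers are then
 * summed into f. Returns the number of NaN entries found in f, or -1
 * if the scratch buffers could not be allocated. */
int compute_forces(double *f, int natoms, int nthreads)
{
    double *buf;
    int t, i, bad = 0;

    assert(f != NULL && natoms > 0 && nthreads > 0);
    buf = calloc((size_t)nthreads * (size_t)natoms, sizeof *buf);
    if (buf == NULL) {
        return -1;
    }

#pragma omp parallel for num_threads(nthreads)
    for (t = 0; t < nthreads; t++) {
        /* Split the pair list into contiguous chunks, one per thread. */
        int begin = (int)((long)natoms * t / nthreads);
        int end = (int)((long)natoms * (t + 1) / nthreads);
        toy_kernel(buf + (size_t)t * (size_t)natoms, natoms, begin, end);
    }

    /* Reduction over the thread-local buffers. */
    for (i = 0; i < natoms; i++) {
        double sum = 0.0;
        for (t = 0; t < nthreads; t++) {
            sum += buf[(size_t)t * (size_t)natoms + i];
        }
        f[i] = sum;
        if (sum != sum) {   /* only NaN compares unequal to itself */
            bad++;
        }
    }
    free(buf);
    return bad;
}
```

Checking for NaN right after the reduction, as done here, is one cheap way to tell whether junk enters in the kernels or only appears once the buffers are combined.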
> 
> I've looked into how simple I can make reproducing the OpenMP crash,
> and it is ugly. The OpenMP debug build (-g) runs OK, with both plain
> C and QPX-specific kernels. The OpenMP release build (-O3) runs OK
> with plain C, but with QPX it gives junk results somewhere, which
> leads to a subsequent segfault. So the core file stack trace is not
> useful.
> 
> Altogether, that strongly suggests that the problems start in the
> same omp parallel for, and that they are at least somewhat
> specific to bgclang. It might be possible to do a quick-and-dirty
> comparison of the resulting energies and forces at -O3 -g between the
> C and QPX versions. My guess is that the problem will be visible by
> the end of the very first inner loop, whose output should be binary
> reproducible between the two versions. If not, then the problem
> probably arises in the subsequent reduction over the thread-local
> force buffers. I just don't have the time to try that: I might be on
> the wrong track, I would sit waiting while trying to run two parallel
> debugging sessions, it might lead to no conclusion, and we don't
> actually need bgclang to work because xlc does. I'm happy to advise
> if there's something you can identify, though. I can send you a
> tarball of code + build instructions + a single input file off list
> if you'd like to try.

Yes, please.

 -Hal
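
The comparison suggested in the quoted text (dump the forces after the first inner loop from the plain C and QPX builds and check whether they are binary identical) could be done with something like the sketch below. The function names and the choice of double are assumptions; how the two arrays get dumped from the two runs is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Index of the first element whose bit pattern differs between the two
 * arrays, or -1 if all n elements are bitwise identical. Note that
 * 0.0 and -0.0 count as different here. */
long first_bitwise_mismatch(const double *ref, const double *test, size_t n)
{
    size_t i;
    assert(ref != NULL && test != NULL);
    for (i = 0; i < n; i++) {
        if (memcmp(&ref[i], &test[i], sizeof ref[i]) != 0) {
            return (long)i;
        }
    }
    return -1;
}

/* Index of the first NaN in x, or -1 if there is none. */
long first_nan(const double *x, size_t n)
{
    size_t i;
    assert(x != NULL);
    for (i = 0; i < n; i++) {
        if (x[i] != x[i]) {   /* only NaN compares unequal to itself */
            return (long)i;
        }
    }
    return -1;
}
```

If the kernel outputs match bitwise but the reduced forces do not, that would point at the reduction over the thread-local buffers rather than at the kernels themselves.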

> 
> Both plain C and QPX kernels are a horrible mess of nested file
> #inclusion and subsequent #ifdefs, because we have about 70
> different kernels per SIMD flavour (and growing). (We're working on
> a python generator instead, but that's not here yet.) I'll
> preprocess them by hand into correct single files if you want.
> 
> 
> The other correctness issue could be anywhere - unfortunately
> basically all of our tests are end-to-end, so when a bunch of them
> fail you have to work out the theme(s). I'll have to do that and
> that probably won't be soon!
> 
> 
> 
> 
> Ah, one more thing: are you linking against anything that also links
> in IBM's OpenMP runtime (like ESSL SMP)? That can also cause issues
> like this.
> 
> 
> 
> There had been a dependency on a system FFTW, but the above was done
> with a fully independent GROMACS. So I think the issue is not an
> OpenMP-runtime-version clash.
> 
> Mark
> 
> 
> 
> 
> -Hal
> 
> 
> 
> > 
> > Mark
> > 
> > 
> > 
> > 
> > 
> > 
> > On Fri, Feb 7, 2014 at 2:27 PM, Hal Finkel < hfinkel at anl.gov >
> > wrote:
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "Mark Abraham" < mark.j.abraham at gmail.com >
> > > To: llvm-bgq-discuss at lists.alcf.anl.gov
> > 
> > > Sent: Friday, February 7, 2014 4:55:24 AM
> > > Subject: Re: [Llvm-bgq-discuss] more issues from trying bgclang
> > > with GROMACS
> > > 
> > > 
> > > 
> > > Hi,
> > > 
> > > 
> > 
> > > > Unfortunately, the OpenMP runs failed outright (results from
> > > > reduction over threads were nan, reason unclear), and there was
> > > > some other issue. That will take some time to dig into, because
> > > > we don't have a "known good with bgclang" code version with which
> > > > to compare. I'll get back to this, but it'll be a few weeks,
> > > > sorry.
> > 
> > What driver version is the machine running? (there are known issues
> > with OpenMP and driver version V1R1M2 (and earlier) -- which I did
> > not think anyone was still using, but it seems some folks still
> > are).
> > 
> > -Hal
> > 
> > 
> > 
> > > 
> > > 
> > > Thanks again,
> > > 
> > > 
> > > Mark
> > > 
> > > 
> > > 
> > > On Fri, Feb 7, 2014 at 2:23 AM, Mark Abraham <
> > > mark.j.abraham at gmail.com > wrote:
> > > 
> > > 
> > > 
> > > > Oops, I did indeed forget to unpack that RPM. Thanks for the tip!
> > > > With it, the OpenMP aspect of the build was flawless. I was able
> > > > to work around the other bug by compiling those files at -O2,
> > > > which is fine for normal GROMACS. Test run is in the queue :-)
> > > 
> > > 
> > > Mark
> > > 
> > > 
> > > 
> > > 
> > > 
> > > On Fri, Feb 7, 2014 at 1:37 AM, Hal Finkel < hfinkel at anl.gov >
> > > wrote:
> > > 
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Mark Abraham" < mark.j.abraham at gmail.com >
> > > > To: llvm-bgq-discuss at lists.alcf.anl.gov
> > > > Sent: Thursday, February 6, 2014 6:26:55 PM
> > > > > Subject: [Llvm-bgq-discuss] more issues from trying bgclang
> > > > > with GROMACS
> > > > 
> > > > 
> > > > 
> > > > Hi,
> > > > 
> > > > 
> > > > > I had another go at compiling GROMACS 5.0 beta with the latest
> > > > > bgclang RPM (r200401-20140129). CMake detection of OpenMP
> > > > > support in mpiclang failed. Detection should just work, because
> > > > > using the -fopenmp flag is a standard way to do it. When I
> > > > > tried a manual compile:
> > > > 
> > > > 
> > > > 
> > > > $ ~/progs/bgclang/current/bin/bgclang -fopenmp test.c -o test
> > > > > /homea/slbio/slbio013/progs/bgclang/r200401-20140129/binutils/bin/ld: cannot find -liomp5
> > > > > clang: error: linker command failed with exit code 1 (use -v to see invocation)
> > > > 
> > > > 
> > > > That looks like a lingering Intel-ism?
> > > 
> > > > Yes, but that's okay; the libomp package should create the
> > > > necessary symlink for you. Did you install
> > > > bgclang-libomp-r200401-20140129-1-1.ppc64.rpm?
> > > 
> > > -Hal
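
The contents of test.c are not shown in the thread. A minimal sketch of a file that exercises both the -fopenmp compile step and the link against the OpenMP runtime might look like the following; thread_count and sum_to are hypothetical names, and the omp.h include is guarded so the file also builds when OpenMP is disabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of threads a parallel region actually gets; 1 if the file
 * was compiled without OpenMP support. */
int thread_count(void)
{
    int n = 1;
#ifdef _OPENMP
#pragma omp parallel
    {
#pragma omp master
        n = omp_get_num_threads();
    }
#endif
    return n;
}

/* Sum 1..n using a reduction over threads; the loop simply runs
 * serially if the pragma is ignored. */
long sum_to(int n)
{
    long total = 0;
    int i;

    assert(n >= 0);
#pragma omp parallel for reduction(+:total)
    for (i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}
```

With a small main that prints sum_to(100), a correctly linked build should report 5050 regardless of thread count. The call to omp_get_num_threads is what forces the runtime library to be resolved at link time, which is the step that failed with "cannot find -liomp5".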
> > > 
> > > 
> > > > 
> > > > 
> > > > > The MPI plus non-OpenMP build seemed to go OK, but a file in our
> > > > > bundled lapack subset provoked a bug (attached in tarball). That
> > > > > file was not a problem in ~August 2013.
> > > > 
> > > > 
> > > > Thanks again for the effort!
> > > > 
> > > > 
> > > > Cheers,
> > > > 
> > > > 
> > > > Mark
> > > > _______________________________________________
> > > > llvm-bgq-discuss mailing list
> > > > llvm-bgq-discuss at lists.alcf.anl.gov
> > > > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
> > > > 
> > > 
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> 
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

