[Llvm-bgq-discuss] Details behind MPI wrapper for bgclang++

Hal Finkel hfinkel at anl.gov
Fri Mar 1 14:43:32 CST 2013


----- Original Message -----
> From: "Jack Poulson" <jack.poulson at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Jeff Hammond" <jhammond at alcf.anl.gov>, llvm-bgq-discuss at lists.alcf.anl.gov
> Sent: Friday, March 1, 2013 2:22:10 PM
> Subject: Re: [Llvm-bgq-discuss] Details behind MPI wrapper for bgclang++
> 
> On Fri, Mar 1, 2013 at 12:04 PM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> 
> 
> ----- Original Message -----
> > From: "Jack Poulson" < jack.poulson at gmail.com >
> > To: "Hal Finkel" < hfinkel at anl.gov >
> > Cc: "Jeff Hammond" < jhammond at alcf.anl.gov >,
> > llvm-bgq-discuss at lists.alcf.anl.gov
> 
> > Sent: Friday, March 1, 2013 10:16:24 AM
> > Subject: Re: [Llvm-bgq-discuss] Details behind MPI wrapper for
> > bgclang++
> > 
> 
> 
> > On Thu, Feb 28, 2013 at 10:15 PM, Hal Finkel < hfinkel at anl.gov >
> > wrote:
> > 
> > 
> > 
> > 
> > 
> > Not a problem! Thanks for being a beta tester :) I've updated the
> > installed libc++ libraries to use CLOCK_REALTIME instead of
> > CLOCK_MONOTONIC. Please try again.
> > 
> > -Hal
> > 
> > 
> > 
> > 
> > One more problem taken care of it seems. Unfortunately my program
> > now segfaults in an MPI_Gather call (and the trace still seems a bit
> > corrupted, see core.13). There is really only one instance in my
> > program where MPI_Gather is called, and it looks like this:
> > 
> > 
> > vector<int> myCoords(d), coords(1);
> > // <fill myCoords here>
> > if( commRank == 0 )
> >     coords.resize( d*commSize );
> > MPI_Gather( &myCoords[0], d, MPI_INT, &coords[0], d, MPI_INT, 0, comm );
> > 
> > 
> > In the above snippet, 'd' is the dimension of the domain, which is
> > two for the executable in question, and space for storing every
> > process's coordinates is only allocated on the root process. This is
> > pretty straightforward MPI in my opinion, so I am skeptical that I
> > have a bug here.
> 
> Unfortunately, the debug info seems completely useless here. Some of
> our IBM contributors have been working on fixing problems with debug
> info, so hopefully this will improve soon.
> 
> In any case, the actual crash is in:
> dbf::bfly::PotentialField<float, 2ul,
> 8ul>::Evaluate(std::__1::array<float, 2ul> const&) const
> 
> just after a call to:
> dbf::bfly::Context<float, 2ul, 8ul>::Lagrange(unsigned long,
> std::__1::array<float, 2ul> const&) const
> 
> Does that give enough context to guess at the source location? Also,
> can you try linking the executable statically? I wonder if this is
> some kind of PIC problem.
> 
> 
> 
> That is infinitely more information than I had before. What did you
> do to find this out?

The lightweight core files are really text files. I looked at the line:
While executing instruction at..........0x000000000100c7c4

Then I ran powerpc64-bgq-linux-objdump -C -d Backproj-2d and looked at the assembly around address 100c7c4 (if you search for it in the file, note that objdump may omit the leading 0s in the address).
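
If you want to automate that lookup, here is a minimal sketch of turning the core-file line into something you can search for in the objdump output. The function name objdumpSearchKey is hypothetical (it is not part of any toolchain), and the sketch assumes the address appears after a "0x" prefix, as in the line quoted above. It strips the leading zeros because objdump may omit them.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of any BG/Q tool): extract the faulting
// address from one line of a lightweight core file, for example
//   "While executing instruction at..........0x000000000100c7c4"
// and return its hex digits in lowercase with leading zeros removed, so the
// result can be searched for in "objdump -C -d" output.
// Returns an empty string if the line contains no "0x"-prefixed address.
std::string objdumpSearchKey(const std::string& line) {
    const std::size_t prefix = line.find("0x");
    if (prefix == std::string::npos) {
        return "";
    }

    // Collect the run of hex digits that follows the "0x" prefix.
    std::string digits;
    for (std::size_t i = prefix + 2; i < line.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(line[i]);
        if (!std::isxdigit(c)) {
            break;
        }
        digits.push_back(static_cast<char>(std::tolower(c)));
    }
    if (digits.empty()) {
        return "";
    }

    // Drop leading zeros; an all-zero address collapses to "0".
    const std::size_t firstNonZero = digits.find_first_not_of('0');
    if (firstNonZero == std::string::npos) {
        return "0";
    }
    return digits.substr(firstNonZero);
}
```

Passing the line above gives the string 100c7c4, which is what to search for in the disassembly; the enclosing function's demangled name (thanks to -C) appears at the top of that block.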

> 
> The latter routine heavily used restrict, but after removing all
> usages of restrict from my entire program and recompiling I received
> an essentially identical coredump file (though I suppose that it is
> possible that the crash occurred somewhere else).

I don't think that's the problem, but thanks for checking!

Can you try compiling/linking with /home/projects/llvm/r175919-20130222/bin/bgclang++ instead of the default one? This is a newer build, and I'd like to see whether it still has whatever bug is yielding this miscompile.

Thanks again,
Hal

> 
> Jack
> 

