[Llvm-bgq-discuss] trouble with latest clang install

Hal Finkel hfinkel at anl.gov
Fri Feb 21 02:48:10 CST 2014


----- Original Message -----
> From: "John A. Biddiscombe" <biddisco at cscs.ch>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-bgq-discuss at lists.alcf.anl.gov
> Sent: Friday, February 21, 2014 1:13:30 AM
> Subject: RE: [Llvm-bgq-discuss] trouble with latest clang install
> 
> Hal,
> 
> Just as an update, we tracked down the cause of the (2nd) crash that
> I was getting and it turned out to be a stack overflow in the HPX
> initialization which was fixed by setting some hpx flags.
> Hello world programs are running, but I've yet to test my newly
> rewritten main program - that should happen soon.
> 
> One thing I ought to mention is that although I'm compiling code to
> run on CNK, I'm also compiling code to run on the IO nodes, which run
> Linux. If there are flags which set CNK-specific options that would
> break a Linux (Red Hat Enterprise Linux 6.4) build, then please do
> warn me so that I can make sure I set things appropriately. In actual
> fact the majority of my code is running on the IO nodes at the moment,
> so it's more important for me to get things right there.
> 
> Everything is running as expected at the moment. (NB. I'm compiling
> my code using cmake and the bgclang++11 wrappers, and setting all
> the mpi stuff 'by hand' so NOT using the mpiclang++11 wrappers
> [mpi=mvapich on the IONs]).

Interesting. If you're compiling for the IONs, you should pass -mllvm -qpx-stack-unaligned (this tells the backend that the stack is only 16-byte aligned, not 32-byte aligned as it is under CNK -- sorry, I know this is badly named), and either turn off QPX generation with -mno-qpx or force dynamic stack relocation in all functions with -mllvm -ppc-always-use-base-pointer. Unless you really need QPX on the IONs, I recommend just turning it off, because forcing dynamic stack relocation in every function will slow everything else down.
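
For reference, a minimal ION-side compile line with QPX turned off might look like this (the source file name is just a placeholder):

    bgclang++11 -mno-qpx -mllvm -qpx-stack-unaligned -c mycode.cpp -o mycode.o

and if you do need QPX on the IONs, keep it enabled but force the base pointer instead:

    bgclang++11 -mllvm -qpx-stack-unaligned -mllvm -ppc-always-use-base-pointer -c mycode.cpp -o mycode.o

With CMake you would typically add these to CMAKE_CXX_FLAGS for the ION build only.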

 -Hal

> 
> yours
> 
> JB
> 
> > -----Original Message-----
> > From: llvm-bgq-discuss-bounces at lists.alcf.anl.gov
> > [mailto:llvm-bgq-discuss-
> > bounces at lists.alcf.anl.gov] On Behalf Of Hal Finkel
> > Sent: 20 February 2014 22:40
> > To: Thomas Gooding
> > Cc: llvm-bgq-discuss at lists.alcf.anl.gov
> > Subject: Re: [Llvm-bgq-discuss] trouble with latest clang install
> > 
> > ----- Original Message -----
> > > From: "Thomas Gooding" <tgooding at us.ibm.com>
> > > To: "Hal Finkel" <hfinkel at anl.gov>
> > > Cc: llvm-bgq-discuss at lists.alcf.anl.gov, "thom heller"
> > > <thom.heller at gmail.com>
> > > Sent: Thursday, February 20, 2014 3:04:54 PM
> > > Subject: Re: [Llvm-bgq-discuss] trouble with latest clang install
> > >
> > >
> > >
> > > Hi Hal,
> > >
> > > CNK has support for .tbss/.tdata segments (thread-specific data),
> > > which is what glibc uses to track thread-specific locale
> > > information. The rest of the support is entirely within glibc; I
> > > don't recall that it was disabled. As I recall, if you don't have
> > > support for .tbss/.tdata, programs will crash when printing
> > > floating-point values (glibc needs to know whether to print a
> > > comma or a decimal point per the locale).
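
As a quick check, readelf will show whether a given binary actually carries those thread-local sections (the binary name below is just a placeholder):

    # list section headers and keep only the TLS data/BSS sections
    readelf -S ./hello_world | grep -E '\.tdata|\.tbss'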
> > 
> > I'll double-check the patchset again. As I recall, they only
> > compile in the data
> > tables for the 'C' locale, and nothing else is available
> > (dynamically or
> > otherwise). From a space-saving standpoint this probably makes
> > sense.
> > 
> >  -Hal
> > 
> > >
> > > Tom
> > >
> > > Tom Gooding
> > > Senior Engineer / Blue Gene SW Lead / C2 tgooding at us.ibm.com
> > > 507-253-0747
> > >
> > >
> > >
> > > From: Hal Finkel <hfinkel at anl.gov>
> > > To: thom heller <thom.heller at gmail.com>
> > > Cc: llvm-bgq-discuss at lists.alcf.anl.gov
> > > Date: 02/20/2014 01:39 PM
> > > Subject: Re: [Llvm-bgq-discuss] trouble with latest clang install
> > > Sent by: llvm-bgq-discuss-bounces at lists.alcf.anl.gov
> > >
> > > ----- Original Message -----
> > > > From: "Thomas Heller" <thom.heller at gmail.com>
> > > > To: "John A. Biddiscombe" <biddisco at cscs.ch>
> > > > Cc: llvm-bgq-discuss at lists.alcf.anl.gov, "Hal Finkel"
> > > > <hfinkel at anl.gov>
> > > > Sent: Friday, February 14, 2014 2:02:27 PM
> > > > Subject: Re: [Llvm-bgq-discuss] trouble with latest clang
> > > > install
> > > >
> > > > Hi all,
> > > >
> > > > OK, I think I tracked it down.
> > > > If my suspicions are correct, the segfault isn't caused by
> > > > bgclang or HPX directly. It looks like parts of Boost can't deal
> > > > with locales correctly on John's system. Here is how it happens:
> > > > On a regular BGQ compute node you don't have interactive access
> > > > and, I think, no locale information available. However, John's
> > > > scenario is slightly different:
> > > > 1) He uses SLURM to get on the nodes (interactively or through
> > > > batch jobs)
> > > > 2) He uses the BGAS nodes directly
> > > >
> > > > Now, 1) runs into a feature of SLURM: once the job has enough
> > > > resources, the bash it spawns inherits all the environment
> > > > variables that were set at job submission (this includes LANG
> > > > and LC_*). It looks like some flavors of Linux (especially in
> > > > the embedded world) have a problem with this. I ran into a
> > > > similar problem when porting HPX to the Xeon Phi.
> > > > Everything was working nicely on our local machine (no job
> > > > control, direct access through ssh etc.). I then moved on to
> > > > Stampede; when logging into one of the Phis directly, everything
> > > > still worked great -- but only until I stopped using interactive
> > > > mode and started to submit jobs through the batch system, which
> > > > led to problems similar to those John is running into right now.
> > > > As for 2), I am not exactly sure how it is related to the
> > > > problem at hand ...
> > > >
> > > > Anyway, I was able to reproduce the problem on one of the
> > > > CNK-based compute nodes on JUQUEEN by using this jobscript:
> > > > # @ job_name = HPX_Hello_World
> > > > # @ comment = "HPX Hello World testrun"
> > > > # @ error = $(job_name).$(jobid).err
> > > > # @ output = $(job_name).$(jobid).out
> > > > # @ environment = COPY_ALL
> > > > # @ wall_clock_limit = 00:30:00
> > > > # @ notification = error
> > > > # @ notify_user = thom.heller at gmail.com
> > > > # @ job_type = bluegene
> > > > # @ bg_size = 32
> > > > # @ queue
> > > >
> > > > APP="$HOME/build/hpx/debug/bin/hello_world"
> > > >
> > > > ENVS="LANG=en_US LC_CTYPE=\"en_US\" LC_NUMERIC=\"en_US\"
> > > > LC_TIME=en_GB LC_COLLATE=\"en_US\" LC_MONETARY=\"en_US\"
> > > > LC_MESSAGES=\"en_US\" LC_PAPER=\"en_US\" LC_NAME=\"en_US\"
> > > > LC_ADDRESS=\"en_US\" LC_TELEPHONE=\"en_US\" LC_MEASUREMENT=\"en_US\"
> > > > LC_IDENTIFICATION=\"en_US\" LC_ALL=\"en_US\""
> > > >
> > > > runjob --ranks-per-node 1 --exe $APP --args "-t1" --envs $ENVS
> > > >
> > > > This led to the exact same error. What I am unsure about,
> > > > though, is whose fault it is. The stack trace John posted
> > > > earlier comes out of the static-initialization section of the
> > > > binary, which initializes some globals in the Boost.Filesystem
> > > > library. So we have three candidates:
> > > > 1) Boost.Filesystem
> > > > 2) libc++
> > > > 3) the libc/POSIX layer on the BGAS node
> > >
> > > This sounds right. CNK (and, specifically, its associated build of
> > > glibc) doesn't have locale support enabled. As a result, as I
> > > recall, only the default ('C') locale is supported.
> > >
> > > If it turns out that this is a bug in libc++, then we should fix it
> > > there. Maybe it is worthwhile to have a Boost build that is patched
> > > to avoid this problem as well?
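
As a quick sanity check on a BGAS/ION node, forcing the C locale for a single run should make the collate_byname abort disappear if the inherited LANG/LC_* settings are indeed the trigger (the binary path is a placeholder):

    # show the locale-related variables the shell inherited
    env | grep -E '^(LANG|LC_)'

    # re-run once with the C locale forced
    env LC_ALL=C LANG=C ./bin/hello_world -t1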
> > >
> > > In any case, thanks for investigating this and sharing your
> > > findings!
> > >
> > > -Hal
> > >
> > > >
> > > > The solution to this problem, by the way, is to unset all those
> > > > environment variables.
> > > > I committed a fix for HPX that works around this problem without
> > > > requiring you to manually unset those environment variables:
> > > > https://github.com/STEllAR-GROUP/hpx/commit/65ce125466ae43e68e19e89b3e50ece0721786de
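
For anyone on an older HPX checkout, a minimal sketch of the manual workaround is simply to clear those variables in the job script (or interactive shell) before launching; the variable list mirrors the ENVS block above and the binary path is a placeholder:

    # drop the inherited locale settings so the launched process never sees them
    unset LANG LC_ALL LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY \
          LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE \
          LC_MEASUREMENT LC_IDENTIFICATION

    ./bin/hello_world -t1

Whether this is sufficient depends on how the scheduler forwards the submission environment to the job, as described above.
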
> > > > Thanks for the patience.
> > > >
> > > > Regards,
> > > > Thomas
> > > >
> > > > On Friday, February 14, 2014 12:52:24 Biddiscombe, John A.
> > > > wrote:
> > > > > Hal
> > > > >
> > > > > Apologies, I didn’t realize I was using the wrong wrapper.
> > > > >
> > > > > I recompiled using the bgclang++11 wrapper and things work
> > > > > much
> > > > > better.
> > > > > I first compiled boost OK, but had trouble linking to it - I
> > > > > ran into the cxxABI link error with boost program_options::__1
> > > > > etc. etc.
> > > > >
> > > > > A bit of googling around explained the std C++ library issue
> > > > > to me, so I had another go using the following settings …
> > > > >
> > > > > export CC=/gpfs/bbp.cscs.ch/home/biddisco/bgas/apps/clang/bin/bgclang
> > > > > export CXX=/gpfs/bbp.cscs.ch/home/biddisco/bgas/apps/clang/bin/bgclang++11
> > > > > export PATH=/gpfs/bbp.cscs.ch/home/biddisco/bgas/apps/clang/bin:$PATH
> > > > >
> > > > > I found some info about building boost with clang and followed
> > > > > the instructions here:
> > > > > http://stackoverflow.com/questions/11081818/linking-troubles-with-boostprogram-options-on-osx-using-llvm?lq=1
> > > > > I modified tools/build/v2/user-config.jam to include the
> > > > > clang-11 option:
> > > > >
> > > > > using clang : 11
> > > > >     : "/gpfs/bbp.cscs.ch/home/biddisco/bgas/apps/clang/bin/bgclang++11"
> > > > >     : <cxxflags>"-std=c++11 -stdlib=libc++ -ftemplate-depth=512"
> > > > >       <linkflags>"-stdlib=libc++"
> > > > >     ;
> > > > >
> > > > >
> > > > > And then I proceeded to build boost using the following
> > > > > commands:
> > > > >
> > > > > ./bootstrap.sh --with-toolset=clang-11
> > > > > ./b2 -j 16 toolset=clang-11 cxxflags="-fPIC" --threading=multi \
> > > > >     --without-mpi --without-python \
> > > > >     --prefix=/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/boost_1_54_0
> > > > >
> > > > > And boost compiles fine.
> > > > > "The Boost C++ Libraries were successfully built!"
> > > > >
> > > > > To test, I compiled the boost serialisation demo from this page:
> > > > > http://www.boost.org/doc/libs/1_42_0/libs/serialization/example/demo.cpp
> > > > > and also a simple boost::program_options demo and a
> > > > > boost::filesystem demo; they all run fine.
> > > > >
> > > > > Thank you very much for the help and all the work you’ve put
> > > > > into getting the clang stuff running.
> > > > >
> > > > > But…
> > > > >
> > > > > when I run simple demos from the HPX library:
> > > > >
> > > > > bbpbg2:~/bgas/build/hpx$ bin/hello_world
> > > > > terminate called after throwing an instance of 'std::__1::runtime_error'
> > > > >   what():  collate_byname<char>::collate_byname failed to construct for
> > > > > Aborted (core dumped)
> > > > >
> > > > >
> > > > > gdb shows me a trace …
> > > > > (gdb) where
> > > > > #0  0x00000fffb3458c5c in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:67
> > > > > #1  0x00000fffb345abd4 in abort () at abort.c:92
> > > > > #2  0x00000fffb3aa7b00 in __gnu_cxx::__verbose_terminate_handler () at /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/gcc-4.4.6/libstdc++-v3/libsupc++/vterminate.cc:93
> > > > > #3  0x00000fffb3aa4d74 in __cxxabiv1::__terminate (handler=<value optimized out>) at /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/gcc-4.4.6/libstdc++-v3/libsupc++/eh_terminate.cc:38
> > > > > #4  0x00000fffb3aa4db8 in std::terminate () at /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/gcc-4.4.6/libstdc++-v3/libsupc++/eh_terminate.cc:48
> > > > > #5  0x00000fffb47b1c14 in .__clang_call_terminate () from /gpfs/bbp.cscs.ch/home/biddisco/apps/clang/boost_1_54_0/lib/libboost_filesystem.so.1.54.0
> > > > > #6  0x00000fffb47b48a0 in ._ZNK5boost10filesystem4path7compareERKS1_ () from /gpfs/bbp.cscs.ch/home/biddisco/apps/clang/boost_1_54_0/lib/libboost_filesystem.so.1.54.0
> > > > > Backtrace stopped: frame did not save the PC
> > > > >
> > > > >
> > > > > It looks very suspicious, as there are some libstdc++
> > > > > appearances in there.
> > > > >
> > > > > Does anything here give you any idea of what might have gone
> > > > > wrong? I’ve tried a number of rebuilds and the error persists,
> > > > > whilst simple demos run OK. I’m not sure where to look to
> > > > > diagnose what’s up (I’ve contacted the HPX people as well).
> > > > > One question is why the shared clang libc++ links to the
> > > > > libstdc++ one. If I do an
> > > > >
> > > > > bbpbg2:~/bgas/build/c++test$ ldd /gpfs/bbp.cscs.ch/home/biddisco/bgas/apps/clang/libc++/lib/libc++.so.1.0
> > > > >
> > > > > linux-vdso64.so.1 => (0x00000fff9ad40000)
> > > > > libpthread.so.0 => /bgsys/drivers/V1R2M1/ppc64/gnu-linux/powerpc64-bgq-linux/lib/libpthread.so.0 (0x00000fff9ab00000)
> > > > > librt.so.1 => /bgsys/drivers/V1R2M1/ppc64/gnu-linux/powerpc64-bgq-linux/lib/librt.so.1 (0x00000fff9a9d0000)
> > > > > libc.so.6 => /bgsys/drivers/V1R2M1/ppc64/gnu-linux/powerpc64-bgq-linux/lib/libc.so.6 (0x00000fff9a790000)
> > > > > libstdc++.so.6 => /bgsys/drivers/V1R2M1/ppc64/gnu-linux/powerpc64-bgq-linux/lib/libstdc++.so.6 (0x00000fff9a550000)
> > > > > /lib64/ld64.so.1 (0x0000000032420000)
> > > > > libm.so.6 => /bgsys/drivers/V1R2M1/ppc64/gnu-linux/powerpc64-bgq-linux/lib/libm.so.6 (0x00000fff9a430000)
> > > > > libgcc_s.so.1 => /bgsys/drivers/V1R2M1/ppc64/gnu-linux/powerpc64-bgq-linux/lib/libgcc_s.so.1 (0x00000fff9a320000)
> > > > >
> > > > >
> > > > > It seems odd. Could this be causing the trouble? (The demos
> > > > > run fine though, so I guess not.)
> > > > >
> > > > > Anyway, I’ll keep poking around; if anything comes to mind,
> > > > > I’d be grateful for the help.
> > > > >
> > > > > Thanks
> > > > >
> > > > > JB
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > llvm-bgq-discuss mailing list
> > > > > llvm-bgq-discuss at lists.alcf.anl.gov
> > > > > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
> > > >
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > > _______________________________________________
> > > llvm-bgq-discuss mailing list
> > > llvm-bgq-discuss at lists.alcf.anl.gov
> > > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
> > >
> > >
> > >
> > 
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> > _______________________________________________
> > llvm-bgq-discuss mailing list
> > llvm-bgq-discuss at lists.alcf.anl.gov
> > https://lists.alcf.anl.gov/mailman/listinfo/llvm-bgq-discuss
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

