[Llvm-bgq-discuss] [alcf-support #325179] Opening application executable failed, errno 2 No such file or directory

Hal Finkel hfinkel at anl.gov
Fri Feb 3 15:24:33 CST 2017


Hi Jozsef,

Interesting. I think the best way to track this down is to run the code 
to get binary (i.e. "real") core files so that we can look at what is going 
on in the debugger. See 
https://www.alcf.anl.gov/user-guides/core-file-settings (in short, set 
the 'BG_COREDUMPBINARY=*' environment variable). If you can do this 
and provide me with access to your directory, or alternatively, provide 
me with the source necessary to reproduce the issue, I'm happy to help.

  -Hal


On 02/03/2017 02:52 PM, Jozsef Bakosi wrote:
> Hi Hal,
>
> I'm not sure how useful this will be but this is the backtrace I get from
> coreprocessor:
>
> 0 : (IAR=Node)Node (2)
> 1 : (IAR=0x0000000000000000)    0000000000000000 (1)
> 2 : (IAR=0x0000000001fa5994)        .__libc_start_main (1)
> 3 : (IAR=0x0000000001fa5468)            .generic_start_main (1)
> 4 : (IAR=0x0000000001fa5f10)                .__libc_csu_init (1)
> 5 : (IAR=0x000000000100fc08)                    ._GLOBAL__sub_I_Parser.C (1)
> 6 : (IAR=0x000000000100fb80)                        .__cxx_global_var_init.53 (1)
> 7 : (IAR=0x00000000011a8348) .tk::Print::Print(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, std::__1::basic_ostream<char, std::__1::char_traits<char> >&) (1)
> 1 :    <traceback not fetched> (1)
>
> tk::Print is my code. It calls the default constructor of std::stringstream,
> which I believe is where the segfault comes from: coreprocessor's "Location"
> points to
> /soft/compilers/bgclang/r284961-stable/libc++/include/c++/v1/sstream:246.
>
> Jozsef
>
> On 02.03.2017 14:42, Hal Finkel wrote:
>> Hi Jozsef,
>>
>> [-support; cc'ing support and this mailing list is going to be confusing
>> because not all of the messages will appear on the mailing list]
>>
>> Can you provide the backtrace? I don't recall running into a problem in this
>> specific place, but I have seen problems with streams in the past for
>> various reasons (e.g. things like basic locale support that BG/Q does not
>> support).
>>
>>   -Hal
>>
>>
>> On 02/01/2017 12:29 PM, Jozsef Bakosi wrote:
>>> Hi Ramesh and Tim,
>>>
>>> Thanks for your help. I recompiled with debug info, ran using a single core, and
>>> used the coreprocessor to find that I get the segfault from the standard
>>> library, libc++:
>>>
>>> Location: /soft/compilers/bgclang/r284961-stable/libc++/include/c++/v1/sstream:246:
>>>
>>> 241 template <class _CharT, class _Traits, class _Allocator>
>>> 242 basic_stringbuf<_CharT, _Traits, _Allocator>::basic_stringbuf(ios_base::openmode __wch)
>>> 243     : __hm_(0),
>>> 244       __mode_(__wch)
>>> 245 {
>>> 246     str(string_type());
>>> 247 }
>>>
>>> I'm CCing the bgclang list. Has anyone ever seen this basic_stringbuf
>>> constructor segfaulting at this location? Is there another libc++ version I can
>>> try?
>>>
>>> In the meantime, I will probably try using GNU libstdc++ instead of libc++.
>>>
>>> Thanks for all your help,
>>> Jozsef
>>>
>>> On 01.31.2017 22:16, Balakrishnan, Ramesh wrote:
>>>>      We have a Perl-based tool called [1]coreprocessor.pl. Make sure you
>>>>      compile your code with the -g flag (in addition to the others that you
>>>>      use) and use this tool to look at the core files (assuming that you are
>>>>      getting core files). If you are not getting core files, you may want to
>>>>      force the job to produce core files by using [2]--env
>>>>      BG_COREDUMPONEXIT=1 in your qsub invocation.
>>>>
>>>>      Hope this helps.
>>>>
>>>>      Ramesh
>>>>
>>>>      On Jan 31, 2017, at 3:56 PM, Jozsef Bakosi <jbakosi at lanl.gov> wrote:
>>>>
>>>>      Hi Ramesh,
>>>>
>>>>      I have built the executable using mpic++11. Is there a way to get more
>>>>      information than the following?
>>>>
>>>>      2017-01-31 21:41:37.936 (WARN ) [0x4000122bde0] CET-02400-13731-128:1911876:ibm.runjob.client.Job: terminated by signal 11
>>>>      2017-01-31 21:41:37.936 (WARN ) [0x4000122bde0] CET-02400-13731-128:1911876:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 16
>>>>
>>>>      Thanks,
>>>>      Jozsef
>>>>
>>>>      On 01.31.2017 21:32, Balakrishnan, Ramesh wrote:
>>>>
>>>>          Jozsef,
>>>>
>>>>          I am not sure how you are building your code, but I noticed in your
>>>>          earlier email that you are using bgclang++11. bgclang++11 is fine for
>>>>          non-MPI builds, but you will need to pull in a long list of libraries
>>>>          if you want to use bgclang++11 for building MPI code, and this route
>>>>          can lead to runtime errors. Instead, can you try building your MPI code
>>>>          with mpiclang++11 as opposed to bgclang++11? The mpiclang++11 wrapper
>>>>          around the bgclang++11 compiler will pull in all of the libraries
>>>>          necessary for your MPI code.
>>>>
>>>>          Ramesh
>>>>
>>>>          On Jan 31, 2017, at 2:00 PM, Jozsef Bakosi <[1]jbakosi at lanl.gov> wrote:
>>>>
>>>>          Hi Ramesh,
>>>>
>>>>          Based on your qsub line I tried this:
>>>>
>>>>          $ qsub -t 10 -n 1 --mode c16 /home/jbakosi/code/quinoa/build/clang/Main/unittest -v
>>>>
>>>>          and beside 16 core files, I get, in the job error file:
>>>>
>>>>          2017-01-31 19:51:26.031 (INFO ) [0x4000122bde0] CET-40000-51331-128:1911641:ibm.runjob.client.Job: job 1911641 started
>>>>          2017-01-31 19:51:31.066 (INFO ) [0x40000c334e0] 15824:tatu.runjob.monitor: tracklib completed
>>>>          2017-01-31 19:51:43.674 (WARN ) [0x4000122bde0] CET-40000-51331-128:1911641:ibm.runjob.client.Job: terminated by signal 11
>>>>          2017-01-31 19:51:43.675 (WARN ) [0x4000122bde0] CET-40000-51331-128:1911641:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 4
>>>>          2017-01-31 19:51:43.675 (INFO ) [0x4000122bde0] tatu.runjob.client: task terminated by signal 11
>>>>
>>>>          I guess it started fine, but it segfaults right away?
>>>>          How can I get a more detailed output from my application? My job output
>>>>          file is zero length.
>>>>
>>>>          Jozsef
>>>>
>>>>        References
>>>>          1. mailto:jbakosi at lanl.gov
>>>>
>>>> References
>>>>
>>>>      1. http://www.alcf.anl.gov/user-guides/coreprocessor
>>>>      2. https://www.alcf.anl.gov/user-guides/core-file-settings
>> -- 
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


