flecsi-tutorial/00-driver stopped working with legion runtime #52

Open
korobkin opened this issue Mar 23, 2018 · 20 comments

@korobkin

After flecsi-third-party commit bfd2828, the flecsi-tutorial/00-driver example no longer works with the Legion runtime on Wolf.
Steps to reproduce:

  1. module load cmake/3.9.0 python/2.7-anaconda-4.1.1 gcc/6.4.0 boost/1.61 mpich/3.2.1
  2. Check out flecsi-third-party commit bfd2828.
  3. Check out flecsi, current master branch (or commit laristra/flecsi@ecd3c76).
  4. Configure and build flecsi-third-party (complete config: CMakeCache-ftp-wolf.txt):
       cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/tmp/issue03 \
                -DENABLE_HPX=OFF -DENABLE_LEGION=ON -DLEGION_USE_OPENMP=OFF
       make -j8
  5. Configure, build, and install flecsi (full config: CMakeCache-flecsi-wolf.txt):
       cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/tmp/issue03 \
                -DENABLE_HPX=OFF -DFLECSI_RUNTIME_MODEL=legion
       make -j8
       make install
  6. Set up the tutorial environment and run the test:
       source $CMAKE_INSTALL_PREFIX/bin/flecsi-tutorial.sh
       cd flecsi/flecsi-tutorial/00-driver
       flecsit compile driver.cc
       ./driver
  • Expected: "Hello world!" output
  • Actual:
*** Caught a fatal signal: SIGSEGV(11) on node 0/1
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 
in the environment to generate a backtrace. 
Segmentation fault (core dumped)

Full error output with the flag GASNET_BACKTRACE=1: error_message.txt

@korobkin
Author

Note: with the previous commit of flecsi-third-party (1e8dc30), or with -DFLECSI_RUNTIME_MODEL=mpi the example works fine.

korobkin referenced this issue in laristra/flecsi Mar 23, 2018
Add global_object_wrapper.h to the list of header files to be installed.
ipdemes self-assigned this Mar 28, 2018
@ipdemes

ipdemes commented Mar 28, 2018

@korobkin : there was a bug in flecsi-tutorial that made this test fail for us. I have a fix in laristra/flecsi@839edaf, and I should merge it into master sometime today. Could you please check whether the test still fails for you after the fix is merged into master?

@korobkin
Author

@ipdemes : unfortunately it still fails with the same error (segfault).

@charest
Contributor

charest commented Mar 29, 2018

Was that version of mpich built with "--enable-threads=multiple"?

@korobkin
Author

@charest : good call! But yes, I built it with that option:

   ./configure --prefix=$PROJECT/share --enable-threads=multiple --enable-fortran=yes

@ipdemes

ipdemes commented Mar 30, 2018

@korobkin : did you build and use your own mpich, or do you use the one from the module "mpich/3.2.1"?

@korobkin
Author

@ipdemes : I built it myself, because turquoise does not have mpich/3.2.1. Maybe something is wrong with my build? I packaged it as a module, so that must have been confusing.

@korobkin
Author

korobkin commented Mar 31, 2018

@ipdemes : I will try other versions of MPI.
On the other hand, it was working before: there is no error if I run with runtime = legion and the previous version of flecsi-third-party (before bfd2828), or if I run with runtime = mpi.

@lightsighter
Collaborator

If this is a single-node run, you should pull the latest version of the Legion master branch, as I believe the issue you are seeing has been fixed.

@korobkin
Author

korobkin commented Apr 1, 2018

@lightsighter : Great! I checked out the master branch of Legion and tried CMAKE_BUILD_TYPE = Debug, and the segfault disappeared.
However, without the Debug option (the default), the segfault persists.

@ipdemes

ipdemes commented Apr 1, 2018

@korobkin : that explains why I don’t see the issue: I always build in Debug mode.

@lightsighter
Collaborator

@korobkin How many nodes is this with? Is it the same backtrace as before? Is there any chance you can compile with optimizations on but also with the "-g" flag, so we can get line numbers for the backtrace?

@korobkin
Author

korobkin commented Apr 2, 2018

@lightsighter @ipdemes : OK, I just found out that the old trick of simply increasing the stack limit makes the segfault disappear for both Debug and Release builds: ulimit -s unlimited. Shall we close the issue?

@ipdemes

ipdemes commented Apr 2, 2018

@lightsighter : I think I found the issue: the legion_defines.h file has the DEBUG options enabled for both Debug and Release builds:
#ifndef DEBUG_REALM
#define DEBUG_REALM
#endif
#ifndef DEBUG_LEGION
#define DEBUG_LEGION
#endif
#ifndef PRIVILEGE_CHECKS
#define PRIVILEGE_CHECKS
#endif
#ifndef BOUNDS_CHECKS
#define BOUNDS_CHECKS
#endif

@ipdemes

ipdemes commented Apr 2, 2018

@korobkin : I don't think so. I think we should fix the issue with "legion_defines.h" first.

@korobkin
Author

korobkin commented Apr 2, 2018

@ipdemes : do you mean this file? legion/cmake/legion_defines.h.in
Can you reproduce the problem? If not, what fix would you suggest, so I can try it out?

@lightsighter : yes, this is a single-node run. If I run it in gdb after recompiling with the '-g' option, the problem disappears. If I increase the stack limit, it also disappears. The file attached above (error_message.txt) hints at the approximate location of the problem, but without exact line numbers.

@lightsighter
Collaborator

@ipdemes Let's fix the issue that you identified and then see where we are.

@ipdemes

ipdemes commented Apr 2, 2018

@lightsighter , @korobkin : sorry for the confusion; it seems the issue is already fixed in the most recent Legion master branch.

@lightsighter
Collaborator

@ipdemes Yes, that would make sense to me given the backtrace that was reported earlier. @korobkin can you still create the segfault case on the most recent master branch of Legion?

@ipdemes

ipdemes commented Feb 12, 2019

@korobkin : can we close the issue?

4 participants