
regent_cuda.cu build failure on PPC #1511

Closed
Tracked by #1032
cmelone opened this issue Jul 24, 2023 · 21 comments
cmelone (Contributor) commented Jul 24, 2023

Running control replication on Lassen. This error started popping up after 79ef214c. I think it might be related to this change, but I'm not sure.

/usr/tce/packages/cuda/cuda-11.8.0/bin/nvcc -o regent_cuda.cu.o -c regent_cuda.cu  -Xcompiler -fPIC -ccbin mpicxx -O2   -DVOLTA_ARCH -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -Xcudafe --diag_suppress=boolean_controlling_expr_is_constant  -I/usr/WS1/stanf_ci/psaap-ci/artifacts/1340172/legion/bindings/regent -I/usr/WS1/stanf_ci/psaap-ci/artifacts/1340172/legion/runtime -I/usr/WS1/stanf_ci/psaap-ci/artifacts/1340172/legion/runtime/mappers -I/usr/tce/packages/cuda/cuda-11.8.0/include -I/usr/WS1/stanf_ci/psaap-ci/codes/legion-latest/gpu-release/language/gasnet/release/include -I/usr/WS1/stanf_ci/psaap-ci/codes/legion-latest/gpu-release/language/gasnet/release/include/ibv-conduit  
/usr/include/sys/platform/ppc.h(31): error: identifier "__builtin_ppc_get_timebase" is undefined
/usr/include/sys/platform/ppc.h(31): error: identifier "__builtin_ppc_get_timebase" is undefined
1 error detected in the compilation of "regent_cuda.cu".
make: *** [/usr/WS1/stanf_ci/psaap-ci/artifacts/1340172/legion/runtime/runtime.mk:1441: regent_cuda.cu.o] Error 1
make: *** Waiting for unfinished jobs....
1 error detected in the compilation of "/usr/WS1/stanf_ci/psaap-ci/artifacts/1340172/legion/runtime/legion/legion_redop.cu".
make: *** [/usr/WS1/stanf_ci/psaap-ci/artifacts/1340172/legion/runtime/runtime.mk:1453: /usr/WS1/stanf_ci/psaap-ci/artifacts/1340172/legion/runtime/legion/legion_redop.cu.o] Error 1
cmelone (Contributor, author) commented Jul 25, 2023

@elliottslaughter could you please add this to #1032? thanks

lightsighter (Contributor) commented:

@muraj What was the motivation for adding the __GLIBC__ requirement on this line?
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/timers.h?ref_type=heads#L30

elliottslaughter (Contributor) commented:

@cmelone as a workaround you could try building with CXXFLAGS=-DREALM_TIMERS_USE_RDTSC=0, which would also help us confirm where the issue is.

cmelone (Contributor, author) commented Jul 25, 2023

I get the same issue when building with that flag.

elliottslaughter (Contributor) commented:

cmelone (Contributor, author) commented Jul 25, 2023

That allows it to build successfully.

elliottslaughter (Contributor) commented:

Ok, so then we seem to have two issues:

  1. Setting REALM_TIMERS_USE_RDTSC=0 does not actually disable RDTSC in the build
  2. RDTSC is broken on PPC

elliottslaughter (Contributor) commented:

> @muraj What was the motivation for adding the __GLIBC__ requirement on this line? https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/timers.h?ref_type=heads#L30

I think this is not actually the issue. As you can see, the commit you referenced not only added the __GLIBC__ guard, it also added the call to __ppc_get_timebase_freq(). We seem to have __GLIBC__, so we're going down that path, but __ppc_get_timebase_freq() is not defined. So either:

  1. The guard is wrong and we need a tighter (not looser) bound
  2. We're not getting the right header file, probably due to some sort of OS/distro difference

elliottslaughter (Contributor) commented:

@cmelone it might help us track this down to know what OS, distro, and compiler you are on (including versions, as appropriate).

cmelone (Contributor, author) commented Jul 25, 2023

Lassen is using RHEL 7.9 and GCC 8.3.1

[melone1@lassen709:scale]$ uname -a
Linux lassen709 4.14.0-115.35.1.3chaos.ch6a.ppc64le #1 SMP Wed Jul 21 17:12:16 PDT 2021 ppc64le ppc64le ppc64le GNU/Linux

muraj commented Jul 26, 2023

@cmelone Sorry about this, I don't have a PPC build or test in CI for this. __ppc_get_timebase_freq should come from sys/platform/ppc.h, which is included on line 44 of timers.inl. I had originally used some inline assembly for this path, but there wasn't much concrete documentation on how to retrieve the PPC timebase frequency for calibration, so I figured it would be okay to use the glibc builtins for this.

As to the fact that REALM_TIMERS_USE_RDTSC=0 doesn't work, I'm not sure why; I'm looking into it now. I think there are still some #ifdefs rather than #ifs in timers.inl, which is causing the problem. Give me just a moment to fix it.

muraj commented Jul 26, 2023

@cmelone can you also provide the version of glibc you have? I am seeing the following in ppc.h:

https://github.com/bminor/glibc/blob/4290aed05135ae4c0272006442d147f2155e70d7/sysdeps/powerpc/sys/platform/ppc.h#L28

This is where __builtin_ppc_get_timebase is being referenced, which seems to require at least GCC 4.8. What I wonder is whether nvcc is actually causing an issue here. Can you try compiling the following code snippet from godbolt (which seems to work on all the GCC versions supported there) and see if it works for you? If so, then maybe we should not use nvcc as our main compiler for all our source files (that's usually not a good idea in general...).

https://godbolt.org/z/6e7hMG9ja

Thanks and apologies for the issue.

muraj commented Jul 26, 2023

@elliottslaughter the build failure with the RDTSC define set to zero is because the include itself is causing the failure, and it was only protected by an #ifdef, not a #if. Fix incoming for that.

cmelone (Contributor, author) commented Jul 26, 2023

@muraj, you're all good! I'm happy to test anything on ppc in the future if that would be helpful

Lassen's glibc version is 2.17. I ran the test snippet and it compiles with both g++ and nvcc.

muraj commented Jul 26, 2023

That's really weird: the test snippet works, but the same thing in Realm doesn't. Also, I double-checked, and the glibc code here hasn't changed since 2.17. I'm really not sure why this would fail. Are you sure you're using the same environment? I also noticed this:
-ccbin mpicxx
Can you try compiling the code snippet with mpicxx? If that fails, then we can figure out what mpicxx is doing.

cmelone (Contributor, author) commented Jul 26, 2023

Yup, 100% sure they are the same environment. this is what I'm running to compile the snippet:

nvcc -o test1 -c test.cpp
nvcc -o test2 -c test.cpp -ccbin mpicxx

both succeed

muraj commented Jul 26, 2023

Okay, I have no idea how that is possible. Anyway, I just merged a change to master that makes it possible to skip this path and effectively disable RDTSC on PPC, so you can try disabling it as Sean suggested earlier with the latest master branch.

cmelone (Contributor, author) commented Jul 26, 2023

Sounds good, thanks. To illustrate, this is how I'm setting up the environment. I doubt it, but I'm not sure whether setup_env.py is changing the environment in a way that causes this discrepancy:

run.sh

module load gcc/8.3.1 cuda/11.8.0 cmake/3.23.1 python/3.8.2
export CC=gcc
export CXX=g++
export CONDUIT=ibv

export P=1341712
export LEGION_DIR=/usr/WS1/stanf_ci/psaap-ci/artifacts/$P/legion
export HDF_ROOT="$LEGION_DIR"/language/hdf/install
export USE_CUDA=1
export USE_OPENMP=1
export USE_GASNET=1
export USE_HDF=1
export CUDA_HOME=/usr/tce/packages/cuda/cuda-11.8.0
export CUDA="$CUDA_HOME"
export GPU_ARCH=volta

nvcc -o test1 -c test.cpp
nvcc -o test2 -c test.cpp -ccbin mpicxx

cd $LEGION_DIR/language
scripts/setup_env.py

cmelone (Contributor, author) commented Jul 30, 2023

I can confirm that the extra flag allows Legion to compile. The solver itself fails to build with the same error, and adding the flag doesn't seem to help.

This machine is where the vast majority of our users run the code, and if possible I'd like to avoid adding to our already complex build instructions, given that this issue wasn't present a couple of weeks ago.

Is there anything else I can do to help debug this, or is this something on the system side that needs to be resolved? Thanks again for all your efforts.

Edit: I went back and re-verified the old builds, which rules out any issue with the system itself.

Succeeds: 79ef214c. Fails: 1881f68e. This is the smallest diff I've found that causes the issue.

cmelone (Contributor, author) commented Aug 6, 2023

Hi @muraj, just following up.

cmelone closed this as completed Aug 17, 2023