AMREX Cuda issue with Cuda 11.2/11.3/11.6/11.7 #3598

Open
nishaag opened this issue Oct 18, 2023 · 11 comments

nishaag commented Oct 18, 2023

I built the AMReX/amrex/Tests/GPU/Vector code for an NVIDIA A100 GPU with the command make CUDA_ARCH=80. It built successfully but threw the error below at runtime. I tried CUDA versions 11.2, 11.3, 11.6, and 11.7 and hit the same issue every time. Please help.

[/AMReX/amrex/Tests/GPU/Vector]$ ./main3d.gnu.CUDA.ex inputs
Initializing CUDA...
CUDA initialized with 1 device.
amrex::Abort::0::GPU last error detected in file ../../..//Src/Base/AMReX_GpuLaunchFunctsG.H line 885: invalid argument !!!
SIGABRT
See Backtrace.0 file for details
(cuda-11.7) aglnisha@scn37-mn:~/AMReX/amrex/Tests/GPU/Vector$ exit
exit


@WeiqunZhang
Member

Could you provide more information?

What does nvidia-smi show?

What does free show?

What does git log -n 1 show?

What does git diff HEAD show?

How do you build the test? If you do this in amrex/Tests/GPU/Vector,

make clean
make -j8 USE_CUDA=TRUE CUDA_ARCH=80 >& make.ou

What do you get in make.ou?

The error message says "See Backtrace.0 file for details". Could we see that file?
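
The checklist above could be gathered in one pass with a small script. This is a minimal sketch, not part of AMReX: the `run` and `collect` helper names are hypothetical, and it assumes a Linux machine with the script started from inside the amrex checkout.

```python
import shutil
import subprocess

# The diagnostic commands requested in the comment above.
COMMANDS = [
    ["nvidia-smi"],
    ["free"],
    ["git", "log", "-n", "1"],
    ["git", "diff", "HEAD"],
]


def run(cmd, cwd=None):
    """Run one command and return its combined stdout and stderr as text."""
    if shutil.which(cmd[0]) is None:
        # Report a missing tool instead of raising, so the rest still runs.
        return f"(not found: {cmd[0]})"
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return result.stdout + result.stderr


def collect(cwd=None):
    """Return a dict mapping each command line to its captured output."""
    return {" ".join(cmd): run(cmd, cwd) for cmd in COMMANDS}


if __name__ == "__main__":
    for name, output in collect().items():
        print(f"==== {name} ====")
        print(output)
```

Redirecting the script's output to a file gives a single attachment for the issue. The build log (make.ou) and Backtrace.0 would still need to be attached separately.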

@nishaag
Author

nishaag commented Oct 19, 2023

Hi WeiqunZhang,

Below are the requested outputs

output of nvidia-smi
++++++++++++++++++++++++

nvidia-smi

Free
+++++++++++++++++++++++++++++++++++
free

Git log
++++++++++++++++++++++++
git-log-n-amrexDir

git diff head
++++++++++++++++++++++++++

git-diff-head-amrexDir

make.ou
+++++++++++++++++++++++++

make.ou.txt

Backtrace.0
++++++++++++++++++++++++++++++++
Backtrace.0.txt

@WeiqunZhang
Member

Could you run the two git commands in the amrex directory? I am trying to see which version of amrex you are using and whether there are any local changes.

From the backtrace file, it seems that it dies at the first GPU kernel. The issue might be that the driver is incompatible with the CUDA toolkit. See https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id7.
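
One way to test the driver/toolkit hypothesis is to ask the CUDA runtime library for both versions. The following is a minimal sketch, assuming `libcudart.so` can be found by the dynamic loader; the library name and the helper names are assumptions, while `cudaDriverGetVersion` and `cudaRuntimeGetVersion` are CUDA runtime API calls that report versions encoded as 1000*major + 10*minor.

```python
import ctypes


def decode_cuda_version(encoded):
    """Decode CUDA's integer encoding, e.g. 11070 -> (11, 7)."""
    return encoded // 1000, (encoded % 1000) // 10


def query_cuda_versions(libname="libcudart.so"):
    """Return ((driver_major, driver_minor), (runtime_major, runtime_minor)).

    Returns None if the runtime library cannot be loaded or a call fails.
    The library name is an assumption; some systems only ship a versioned
    file name such as libcudart.so.11.0.
    """
    try:
        cudart = ctypes.CDLL(libname)
    except OSError:
        return None
    driver = ctypes.c_int(0)
    runtime = ctypes.c_int(0)
    # Both calls return cudaSuccess (0) on success.
    if cudart.cudaDriverGetVersion(ctypes.byref(driver)) != 0:
        return None
    if cudart.cudaRuntimeGetVersion(ctypes.byref(runtime)) != 0:
        return None
    return decode_cuda_version(driver.value), decode_cuda_version(runtime.value)


if __name__ == "__main__":
    versions = query_cuda_versions()
    if versions is None:
        print("Could not query the CUDA runtime library.")
    else:
        driver, runtime = versions
        print("Highest CUDA version supported by the driver:", driver)
        print("CUDA runtime version:", runtime)
        if driver < runtime:
            print("The driver reports an older CUDA version than the runtime.")
```

The driver value is the highest CUDA version the installed driver supports, so a driver value lower than the runtime value would be consistent with the incompatibility suspected above.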

@ax3l
Member

ax3l commented Oct 19, 2023

@nishaag your drivers are too old.
You need to update to newer drivers than 450.236.01 to support recent CUDA releases. Here is a list:
https://gist.github.com/ax3l/9489132

For 11.2 for instance, you need at least 460.27.04.
For 11.7 for instance, you need at least 515.65.01.

There is some compatibility that relaxes this strict constraint since the 455.23.05 driver series (~CUDA 11.1+), but your currently installed system drivers are also too old for that.
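
The strict version check described above can be sketched as a comparison of dotted version strings. This is only an illustration: the table holds just the two minimums quoted in this comment, the helper names are hypothetical, and the check deliberately ignores the relaxed compatibility mentioned in the last paragraph.

```python
# Minimum driver versions as quoted in this thread (not an exhaustive table).
MIN_DRIVER = {
    "11.2": "460.27.04",
    "11.7": "515.65.01",
}


def parse_version(text):
    """Turn '460.27.04' into (460, 27, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_is_sufficient(installed, cuda_release):
    """Strict check: is the installed driver at least the quoted minimum?"""
    return parse_version(installed) >= parse_version(MIN_DRIVER[cuda_release])


if __name__ == "__main__":
    installed = "450.236.01"  # driver version reported in this issue
    for release in MIN_DRIVER:
        ok = driver_is_sufficient(installed, release)
        print(f"CUDA {release}: driver {installed} sufficient? {ok}")
```

Comparing integer tuples instead of raw strings avoids mistakes such as treating "450.236.01" as newer than "460.27.04" because 236 is larger than 27.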

@nishaag
Author

nishaag commented Oct 27, 2023

> Could you run the two git commands in the amrex directory? I am trying to see which version of amrex you are using and whether there are any local changes.
>
> From the backtrace file, it seems that it dies at the first GPU kernel. The issue might be that the driver is incompatible with the CUDA toolkit. See https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id7.

@WeiqunZhang If the cause were an incompatibility between the driver and the CUDA toolkit, it should work with CUDA toolkit 11.0 and the driver (450.236.xx) that we have on the system. Unfortunately, it does not.

Output of the git commands in the amrex directory:

git log -n 1

git-log-n-amrexDir

git diff HEAD

git-diff-head-amrexDir

@nishaag
Author

nishaag commented Nov 8, 2023

@WeiqunZhang If the cause were an incompatibility between the driver and the CUDA toolkit, it should work with CUDA toolkit 11.0 and the driver (450.236.xx) that we have on the system. Unfortunately, it does not.

@nishaag
Author

nishaag commented Nov 8, 2023

@ax3l If the cause were an incompatibility between the driver and the CUDA toolkit, it should work with CUDA toolkit 11.0 and the driver (450.236.xx) that we have on the system. Unfortunately, it does not.

@WeiqunZhang
Member

I don't have any explanation.

@superDNY

Hello, I encountered the same error with CUDA 11.7. Have you solved it?

@WeiqunZhang
Member

First of all, I don't believe the issue is in AMReX. I have tested the current amrex on various machines with CUDA 11.x and 12.x. They all work just fine.

On my workstation using Ubuntu, I have had various issues with the CUDA installation in the past. Sometimes a simple reboot could resolve the issue. Sometimes upgrading CUDA helped. Sometimes, I had to remove all packages containing the word nvidia or cuda, and then reinstall CUDA. This last resort has always worked for me.

@nishaag
Author

nishaag commented Feb 16, 2024

> Hello, I encountered the same error with CUDA 11.7. Have you solved it?

Hi SuperDNY,

I found that the issue is the NVIDIA device driver version installed on the system: it does not work with driver version 450.236.xxx, but it works with driver version 470.xxx.
