cuda 10 #5

Open
hrheydarian opened this issue Mar 18, 2019 · 7 comments

Comments

@hrheydarian
Contributor

Dear @benvanwerkhoven,

Recently, we got a new GPU card and an update for the CUDA drivers. We now have CUDA 10.1.
Fortunately, running the Makefile for the 3D code works without any errors (only some warnings about the gcc version). However, when I run our MATLAB script I get the following error:

invalid MEX-file '/home/.../MATLAB/all2all/mex_expdist.mexa64':
lib//expdist.so: undefined symbol: cudaSetupArgument.

Could you please tell me if we need to change something in the Makefile to adapt it to CUDA 10?

Thanks
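
One way to narrow down an error like this is to check whether the shared library itself lists the symbol as undefined. The following is a minimal sketch, not part of the project: the helper name lists_undefined_symbol and the path lib/expdist.so are assumptions for illustration, and it relies on nm from GNU binutils being installed.

```shell
# Hypothetical helper (not part of the project): reads the output of
# `nm -D --undefined-only` on stdin and succeeds if SYMBOL is listed.
lists_undefined_symbol() {
    grep -E "(^|[[:space:]])U[[:space:]]+$1(@|\$)" >/dev/null
}

# Assumed location of the library; adjust to wherever `make install` put it.
lib=lib/expdist.so
if [ -f "$lib" ] && command -v nm >/dev/null 2>&1; then
    if nm -D --undefined-only "$lib" | lists_undefined_symbol cudaSetupArgument; then
        echo "$lib still references cudaSetupArgument"
    else
        echo "$lib does not reference cudaSetupArgument"
    fi
fi
```

If the symbol shows up, it has to be provided at load time by some other library, so the next thing to look at is which CUDA runtime gets loaded alongside it.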

@hrheydarian
Contributor Author

@benvanwerkhoven
Dear Ben,

Do you have time to look at this issue?

Bests,
Hamidreza

@benvanwerkhoven
Collaborator

Hi Hamidreza,

It would help me a lot to have access to the HPC servers so that I can reproduce the problem. I finally have a TUDelft guest account, but it seems Ronald still needs to add me to the HPC servers. Did you check my suggestion that MATLAB may still be using the old MEX-file? Given this error, I would expect that the build system puts the compiled MEX-file in a location different from where MATLAB is looking for it.

Best,
Ben

@hrheydarian
Contributor Author

Hi @ronligt

Would it be possible for you to give access to Ben for the HPC servers?

@benvanwerkhoven Yes, I did that. On the same machine, when I load CUDA 8.0, a fresh copy of the code works fine. With CUDA 10.1 it also builds fine, but I get this error at runtime.

Bests,
Hamidreza

@ronligt

ronligt commented May 1, 2019

@hrheydarian, the account for @benvanwerkhoven has been created and he should be able to log in to hpc24, hpc29 and hpc30.

@benvanwerkhoven
Collaborator

Hi Hamidreza,

So far I have been unable to reproduce the error that you are seeing. I built everything on the hpc18 machine under cuda80 (typing cmake ., make, and make install). Then I logged in to hpc29, typed module load cuda/10.1, and ran the demo script using:
matlab -nodesktop -nodisplay -nosplash -r "demo_all2all"

And I get the output:

all2all registration started !
There are 255 rows !
row 1 started!
Starting parallel pool (parpool) using the 'local' profile ...
connected to 12 workers.
row 1 done in 34.0614 seconds
row 2 started!
row 2 done in 6.9226 seconds
row 3 started!
row 3 done in 7.1196 seconds
...

I did have to add the directory containing the shared libraries generated by make to my LD_LIBRARY_PATH variable, a step that is currently missing from the README. Perhaps that's where things go wrong. Could it be that the shared library loader (which follows LD_LIBRARY_PATH) picks up an old version of the shared library somewhere on your system?
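
To rule out a stale copy being picked up, the directory holding the freshly built libraries can be placed first in LD_LIBRARY_PATH. A minimal sketch follows; the helper name prepend_lib_dir and the ./lib location are assumptions for illustration, not something taken from the project.

```shell
# Hypothetical helper: put DIR at the front of LD_LIBRARY_PATH so the loader
# searches it before any other directory already listed there.
prepend_lib_dir() {
    dir=$1
    export LD_LIBRARY_PATH="${dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}

# Assumption: the freshly built libraries were installed into ./lib.
prepend_lib_dir "$PWD/lib"
echo "$LD_LIBRARY_PATH"
```

Directories listed earlier in LD_LIBRARY_PATH are searched first, so starting MATLAB from the same shell afterwards should make the new build take precedence over older copies elsewhere.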

@hrheydarian
Contributor Author

Hi @benvanwerkhoven ,

Thanks for checking that.

I followed the same procedure as you did and it works without problems. However, the problem appears when you also compile the code with cuda/10.1. In that case the code again compiles without error, but when you run the script the error that I mentioned occurs.

Bests,
Hamidreza

@benvanwerkhoven
Collaborator

Hi @hrheydarian,

I noticed that when I ran CMake for the first time on hpc29 (with module cuda/10.1 loaded), CMake still found and used the CUDA 8 installation instead of the CUDA 10 installation. You can force CMake to use a specific version by specifying the path to the CUDA root directory:
cmake -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.1 .
Make sure to also run make and make install.

If you build the code like that, and make sure that only the newly built shared libraries can be loaded through LD_LIBRARY_PATH, do you still run into the error? For me it runs on hpc29 if I build like this with CUDA 10.

Best,
Ben
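
CMake stores the paths it finds in CMakeCache.txt and reuses them on later runs, which is one plausible reason an older toolkit keeps being selected. Below is a minimal sketch for inspecting the cached value, assuming the build was configured in the current directory; the helper name cached_cmake_var is hypothetical.

```shell
# Hypothetical helper: print the value CMake cached for variable NAME.
# CMakeCache.txt stores entries as NAME:TYPE=VALUE.
cached_cmake_var() {
    name=$1
    cache=${2:-CMakeCache.txt}
    sed -n "s/^${name}:[^=]*=//p" "$cache"
}

# Assumption: cmake was run in the current directory.
if [ -f CMakeCache.txt ]; then
    cached_cmake_var CUDA_TOOLKIT_ROOT_DIR
fi
```

If the printed path points at the CUDA 8 installation, one option is to delete CMakeCache.txt and configure again with the -D CUDA_TOOLKIT_ROOT_DIR setting shown above, followed by make and make install.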


3 participants