Describe the bug
The TensorFlow ML notebook fails to run on GPU. The suggestions in the thread don't work.
To Reproduce
Steps to reproduce the behavior:
1. In the JupyterHub instance terminal, run: mamba install -c nvidia cuda-nvcc
2. Shut down the kernels of any notebooks that were active.
3. Run the example in the thread, or any example that uses TensorFlow.
4. You will see either a libdevice not found error or a StatefulPartitionedCall_2 error.
Expected behavior
Images prior to April (when the thread was posted) gave the StatefulPartitionedCall_2 error, whereas the latest images give the libdevice not found error.
Docker Image Version
The following images were tried:
libdevice not found error
2023.11.14
2023.10.24
Hi @racheetmatai, thanks for opening this bug report. It seems like you've tested docker images up to tag 2023.11.14. I'm wondering if any of the newer ones, e.g. 2024.01.03 which includes the CUDA 11.2 to 11.8 update (#505) might help with this issue?
There are a few things we can try. There have been some major changes on conda-forge related to CUDA 12, and I'm wondering if the cuda-nvcc issue could be handled differently now if we update from CUDA 11.8 to 12, cc @ngam. There are also some tensorflow updates we need to do related to flax (#489), but I'm not sure if they would help here.
Hi @weiji14, I should have mentioned, I tried the example today (the default TF ML notebook on pangeo) before submitting the issue and got the StatefulPartitionedCall_2 error. I know that the CUDA version, driver version and cuDNN version have to match exactly (or at least that's how it used to be), and this is particularly painful on Ubuntu. Just updating one of these used to sometimes leave the build broken.
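For checking which CUDA and cuDNN versions the installed TensorFlow build expects, a minimal sketch along these lines may help. The helper name `tf_cuda_build_info` is hypothetical; it relies on `tf.sysconfig.get_build_info()` and `tf.config.list_physical_devices()`, and it assumes the reported build versions are what need to line up with the CUDA packages in the image.

```python
def tf_cuda_build_info():
    """Return the CUDA/cuDNN versions TensorFlow was built against.

    Returns None when TensorFlow is not installed. The helper name is
    hypothetical and only meant for illustration.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        # These keys may be absent on CPU-only builds, hence .get().
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
        # Empty list means TensorFlow cannot see any GPU.
        "visible_gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


print(tf_cuda_build_info())
```

Comparing this output with the versions listed by `mamba list` for the CUDA packages is one way to spot a mismatch.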
@weiji14 Running pip install 'flax==0.7.2' 'jax<=0.4.13' 'ml_dtypes==0.2.0' and mamba install cuda-nvcc==11.6.* -c nvidia, and adding os.environ['XLA_FLAGS'] = '--xla_gpu_cuda_data_dir=/srv/conda/envs/notebook' at the beginning of my notebook works. Thanks :)
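The XLA_FLAGS part of the workaround above could be sketched as follows. The helpers `set_xla_cuda_data_dir` and `find_libdevice` are hypothetical names. The sketch assumes XLA looks for libdevice under `nvvm/libdevice` inside the configured data directory, and that the flag must be set before TensorFlow is imported.

```python
import os
from pathlib import Path


def set_xla_cuda_data_dir(cuda_dir, env=os.environ):
    """Add --xla_gpu_cuda_data_dir to XLA_FLAGS unless it is already present."""
    flag = f"--xla_gpu_cuda_data_dir={cuda_dir}"
    existing = env.get("XLA_FLAGS", "")
    if "--xla_gpu_cuda_data_dir" not in existing:
        env["XLA_FLAGS"] = f"{existing} {flag}".strip()
    return env["XLA_FLAGS"]


def find_libdevice(cuda_dir):
    """List libdevice bitcode files under <cuda_dir>/nvvm/libdevice (assumed layout)."""
    libdevice_dir = Path(cuda_dir) / "nvvm" / "libdevice"
    if not libdevice_dir.is_dir():
        return []
    return sorted(str(p) for p in libdevice_dir.glob("libdevice*.bc"))


CUDA_DIR = "/srv/conda/envs/notebook"  # path taken from the comment above
print(set_xla_cuda_data_dir(CUDA_DIR))
print(find_libdevice(CUDA_DIR) or f"no libdevice*.bc found under {CUDA_DIR}")

# Import TensorFlow only after XLA_FLAGS is set, e.g.:
# import tensorflow as tf
```

If `find_libdevice` returns an empty list, the libdevice not found error is likely to persist, and the cuda-nvcc install step may need to be revisited.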
Images that gave the StatefulPartitionedCall_2 error:
2023.05.18
2023.04.15
2023.01.04
Infrastructure (Where you are running this image):
Additional context
The same notebook runs fine on the CPU TensorFlow ML notebooks.