Tensorflow ML notebook error on GPU #511

racheetmatai · 2024-01-25T19:59:01Z

Describe the bug
Tensorflow ML notebook fails to run on GPU. The suggestions in the thread don't work.

To Reproduce
Steps to reproduce the behavior:

1. In the JupyterHub instance terminal, run mamba install -c nvidia cuda-nvcc
2. Shut down the kernels of any notebooks that were active.
3. Run the example in the thread or any example which uses TF.
4. You will see either a "libdevice not found" error or a "StatefulPartitionedCall_2" error.

Expected behavior
Images prior to April (when the thread was posted) gave the StatefulPartitionedCall_2 error, whereas the latest images were giving the libdevice not found error.

Docker Image Version
The following images were tried:
libdevice not found error
2023.11.14
2023.10.24

StatefulPartitionedCall_2 error
2023.05.18
2023.04.15
2023.01.04

Infrastructure (Where you are running this image):

Pangeo JupyterHub

Additional context
The same notebook runs fine on the CPU Tensorflow ML notebooks


weiji14 · 2024-01-25T21:34:48Z

Hi @racheetmatai, thanks for opening this bug report. It seems like you've tested docker images up to tag 2023.11.14. I'm wondering if any of the newer ones, e.g. 2024.01.03 which includes the CUDA 11.2 to 11.8 update (#505) might help with this issue?

There are a few things we can try. There have been some major changes on conda-forge related to CUDA 12, and I'm wondering if the cuda-nvcc issue could be handled differently now if we update from CUDA 11.8 to 12, cc @ngam. There are also some tensorflow updates we need to do related to flax (#489), but I'm not sure if they would help here.

racheetmatai · 2024-01-25T22:01:54Z

Hi @weiji14, I should have mentioned, I tried the example today (the default TF ML notebook on Pangeo) before submitting the issue and got the StatefulPartitionedCall_2 error. I know that the CUDA version, driver version and cuDNN version have to match exactly (or at least that's how it used to be), and this is particularly painful on Ubuntu. Just updating one of these sometimes used to leave the build broken.

racheetmatai · 2024-05-07T18:41:11Z

@weiji14 Running

pip install 'flax==0.7.2' 'jax<=0.4.13' 'ml_dtypes==0.2.0'
mamba install cuda-nvcc==11.6.* -c nvidia

and adding os.environ['XLA_FLAGS'] = '--xla_gpu_cuda_data_dir=/srv/conda/envs/notebook' at the beginning of my notebook works. Thanks :)
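
For anyone applying the workaround above, here is a minimal sketch of what the first notebook cell could look like. It assumes the conda environment lives at /srv/conda/envs/notebook, as in the comment. The helper names set_xla_cuda_data_dir and find_libdevice are hypothetical and not part of TensorFlow or XLA. The libdevice search is only a rough sanity check and assumes the bitcode file is named something like libdevice*.bc somewhere under that directory.

```python
import os
from pathlib import Path

# Path taken from the comment above; adjust it if your env lives elsewhere.
CUDA_DATA_DIR = "/srv/conda/envs/notebook"


def set_xla_cuda_data_dir(cuda_dir: str = CUDA_DATA_DIR) -> str:
    """Add --xla_gpu_cuda_data_dir to XLA_FLAGS, keeping any existing flags."""
    flag = f"--xla_gpu_cuda_data_dir={cuda_dir}"
    existing = os.environ.get("XLA_FLAGS", "")
    if flag not in existing.split():
        os.environ["XLA_FLAGS"] = f"{existing} {flag}".strip()
    return os.environ["XLA_FLAGS"]


def find_libdevice(cuda_dir: str = CUDA_DATA_DIR) -> list:
    """Hypothetical sanity check: list libdevice bitcode files under cuda_dir."""
    root = Path(cuda_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("libdevice*.bc"))


# Set the flag first, then import TensorFlow in the same or a later cell.
set_xla_cuda_data_dir()
# import tensorflow as tf
```

The flag is set before TensorFlow is imported so that it is in the environment when XLA starts up. If find_libdevice() returns an empty list, the cuda-nvcc install step probably did not put libdevice where this sketch looks.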

weiji14 added the bug Something isn't working label Jan 25, 2024
