Nightly built wheels from CI don't work on Navi 31 #2142
Comments
Made some progress building TF for 7900XTX. #2191 |
We have a thread, ROCm/ROCm#1880, for it, and after #2101 TensorFlow should work to some extent. It's even already integrated into their CI. But your PR shows that a comma is missing, so I don't know where it went. |
It looks like #2101 is already merged. It's weird that the comma is still missing in the main source. With the comma added, my build is using the 7900XTX. I still see instability here and there. Hopefully, they will release the working code soon. |
Is anyone else having issues building TF from source? For me, the build usually fails 2 to 3 times, usually after consuming a lot of memory. However, if I run the build script again, it eventually completes and generates the whl file. I'm now trying to build in a ROCm 5.7 environment and the problem seems to be much worse. I retried at least 10 times and it still fails, usually with the following error. If I retry, it compiles for a while and then crashes again.
|
It seems the build couldn't find the ROCm components at all. I suspect it's related to the LLVM version, though I can't guarantee that. If you want to try building from source, we recommend using rocm/tensorflow-build. |
To use ROCm 5.7, I changed ROCM_INSTALL_DIR in build_rocm_python3 to point to /opt/rocm. Making it point to /opt/rocm-5.7.0 explicitly seems to fix the issue. I'm not sure why, though, since /opt/rocm is a symlink to the exact same location. 🤷‍♂️ |
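One way to check the symlink observation above is to resolve both paths and compare them. This is a minimal sketch using only the Python standard library. The helper names `resolve_rocm_dir` and `same_install` are hypothetical, and the sketch does not explain why the build script treats the two paths differently.

```python
import os


def resolve_rocm_dir(path="/opt/rocm"):
    """Return the real directory that `path` refers to, following symlinks."""
    return os.path.realpath(path)


def same_install(path_a, path_b):
    """Return True if both paths resolve to the same real location."""
    return os.path.realpath(path_a) == os.path.realpath(path_b)
```

Calling `same_install("/opt/rocm", "/opt/rocm-5.7.0")` on the build machine would confirm whether the two paths really resolve to one directory. If they do, the difference is more likely in how the build script uses the path string than in the install itself.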
I'm not sure what your environment is, but could you try it in rocm/tensorflow-build? Thanks |
@i-chaochen Thanks! |
If you have installed the Docker container, you can follow the instructions from our repo: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream#tensorflow-rocm-port
Within the launched container, you can git clone our TensorFlow repo and set the correct ROCm version and directory path in
PS: I am using rocm5.6.0 as an example, which is the same as in this script. If you're using the rocm5.7.0 Docker container, you need to change it to rocm-5.7.0 as well, e.g., /opt/rocm-5.7.0 in https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/build_rocm_python3#L33 |
Secondly, modify this line. In my case: 32 GB of RAM, 16 cores, and 32 threads.
|
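Since the builds above tend to fail after consuming a lot of memory, a rough way to pick a parallel job count from the machine's RAM and thread count can be sketched as follows. This is only an illustration: the function name `suggested_jobs` and the default of 2 GB per job are assumptions, not values from the thread, and the thread does not say which line of the build script was modified.

```python
def suggested_jobs(ram_gb, threads, gb_per_job=2.0):
    """Hypothetical heuristic: cap parallel build jobs so RAM is not oversubscribed.

    gb_per_job is an assumed memory budget per compile job, not a measured value.
    """
    if ram_gb <= 0 or threads <= 0 or gb_per_job <= 0:
        raise ValueError("ram_gb, threads and gb_per_job must be positive")
    by_memory = int(ram_gb // gb_per_job)  # jobs that fit in RAM under the assumed budget
    return max(1, min(threads, by_memory))  # never exceed threads, never drop below 1


# The machine described above: 32 GB of RAM and 32 threads.
jobs = suggested_jobs(32, 32)
```

The resulting number could then be used for whatever parallelism setting the build exposes, for example Bazel's `--jobs` option.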
Issue Type
Bug
Have you reproduced the bug with TF nightly?
Yes
Source
source
Tensorflow Version
tf_nightly_rocm-2.14.0dev20230628.550-cp310-cp310-manylinux2014_x86_64.whl
Custom Code
Yes
OS Platform and Distribution
Ubuntu 22.04.2
Tested in ROCm 5.5.1 host and ROCm 5.5.0 container.
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current Behaviour?
The sample script doesn't work when using tf-nightly-rocm from CI, which should have Navi 31 support out of the box for the time being. The log for a previously succeeded run can be found here: ROCm/ROCm#1880 (comment).
Standalone code to reproduce the issue
Relevant log output