2023-10-30 23:26:05.514405: E tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc:1284] failed to query device memory info: HIP_ERROR_InvalidValue #2289
Comments
feel free to reach me directly/internally ... thank you
I observed the same behaviour and thought of an incompatibility between ROCm 5.6 and TF 2.13. But that was just a wild guess.
My home setup with the new tensorflow:latest Docker image does the same (different GPU, a Radeon VII).
The default ROCm version seems to be 5.7, but HIP is 5.6?
Any takers for this issue?
Is there anyone?
echo ... echo ... echo
Shoot an email to paolod AT amd.com
Same here. Is there any update?
@gzitzlsb-it4i no updates on my side
keeping the comments alive ...
I see this issue with both rocm5.7-tf2.12-dev and rocm5.7-tf2.13-dev. Maybe this is related to the change from ROCm 5.6 to 5.7?
Thanksgiving ... take your time.
I'm observing the same problem with ROCm 5.7 and both TF 2.12 and TF 2.13.
Can anyone redirect me to a person I can talk to?
I guess we will wait for ROCm 6.
I tried to pull again; there is no new version.
Is there a TensorFlow Docker image for ROCm 6?
Keeping this alive because the last pull did not fix this.
Any update?
I thought the latest drop would address this ....
I also confirm that ROCm 6.0 and TensorFlow 2.14 still do not work on MI250X; the same error pops up:
ROCm + TensorFlow is becoming badly out of date and unusable on large HPC systems that made the mistake of buying AMD MI250X.
@paolodalberto @jpata what AMDGPU driver version are you trying to run the container on? On our HPC system we have a rather old one; I believe that no matter which container version you use, the issue is the driver on the host system.
@dipietrantonio excellent point, thanks a lot! I confirm that the LUMI HPC system where I'm experiencing this issue uses
I used my home system (Vega VII with an upgraded Ubuntu) and two more advanced ones with MI100, upgraded recently. PyTorch works.
Dear @paolodalberto @jpata, we have installed a newer version of the ROCm driver (6.0.5) on a bunch of nodes for testing, and now my container with ROCm 5.7 and TF 2.13 works on the code posted in the description of this issue. The error is gone :) So it is a driver issue, as I expected.
@dipietrantonio
When you run a container you rely on the host kernel, not the one installed in your container. The driver is a kernel module, so you need to update the driver on the system you are running the container on (at least when you use the Singularity container engine, but I think it is the same for Docker). For me, the ROCm 5.2 driver was the problem; I was not expecting that even the ROCm 5.7 driver could have this issue. But as I said, driver version 6.0.5 solved it for me.
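A quick way to check for this kind of mismatch is to compare the ROCm userspace version shipped in the container with the amdgpu driver (kernel module) version coming from the host. A minimal sketch, assuming the typical file locations on ROCm installs (/opt/rocm/.info/version in the container, /sys/module/amdgpu/version on the host); both paths are assumptions and may differ on your system:

```python
# Minimal sketch (assumed paths): compare the ROCm userspace version inside the
# container with the amdgpu kernel-module (driver) version provided by the host.
from pathlib import Path

def read_version(path: str) -> str:
    p = Path(path)
    return p.read_text().strip() if p.exists() else "not found"

# Userspace ROCm version: comes from the container image.
print("ROCm userspace:", read_version("/opt/rocm/.info/version"))

# Kernel driver version: comes from the host kernel, even inside Docker/Singularity.
print("amdgpu driver :", read_version("/sys/module/amdgpu/version"))
```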
hmm ...
Still no new Docker image with ROCm 6.
The new Docker image arrived.
This is with my system at home; I will check on Monday on the real machine.
#2289 (comment)
At least it works for one GPU.
Let me check what I can do on my large machine ...
The large machine now kicks me out during evaluation.
Yep, multiple GPUs do not work (a single GPU works).
In practice, multiple GPUs fail so badly that the Docker application stalls the machine and breaks the Docker daemon, which I have to restart manually. This is on a system above ... The funny part is that this was working on 5.7, six months ago, for TensorFlow and PyTorch ...
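For what it's worth, since a single GPU works, one possible workaround (my own sketch, not something confirmed in this thread) is to hide all but one GPU from TensorFlow before any device is initialized:

```python
# Workaround sketch (not confirmed in this thread): restrict TensorFlow to a single
# GPU so the failing multi-GPU path is never exercised. Setting HIP_VISIBLE_DEVICES=0
# in the environment before launching should have a similar effect.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Make only the first GPU visible to this process.
    tf.config.set_visible_devices(gpus[0], "GPU")
    print("Visible GPUs:", tf.config.get_visible_devices("GPU"))
```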
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
binary
TensorFlow version
v2.13.0-4108-g619eb25934e 2.13.0
Custom code
No
OS platform and distribution
Linux xsjfislx32 5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Mobile device
No response
Python version
Python 3.9.18
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
This is the smallest piece of code from a tutorial that reproduces my problem.
Standalone code to reproduce the issue
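The original snippet is not preserved here. As a stand-in, a minimal sketch of the kind of tutorial code that triggers GPU initialization (and with it the failing device-memory query); the model and data are placeholders, not the exact code from the report:

```python
# Minimal sketch (placeholder, not the exact snippet from the report): any small Keras
# training run that touches the GPU triggers device initialization on ROCm.
import numpy as np
import tensorflow as tf

print("GPUs:", tf.config.list_physical_devices("GPU"))

x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=1, batch_size=64)
```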
Relevant log output