Getting the 5500 XT to work with a custom build of PyTorch and ROCm #47
Comments
Building ROCm 5.4(.3) indeed fixes the MIOpen compiler error users faced with 5.2.3. The checks passed with flying colors, except that ws_size is 0 rather than 576.
run-miopen-img.sh: produces the exact same image as the reference, further confirming that it works.
My ROCm 5.4.3 build log against the gfx1012 target. Now AMD MIGraphX fails to compile, though it compiled fine with 5.2.3. I wasn't able to figure out how to resolve this, but it's probably not necessary for PyTorch anyway.
check.sh:
env.sh:
Furthermore, I didn't have to patch PyTorch, since ROCm 5.4.3 contains definitions that were absent in 5.2.3. The MNIST sample training sessions correctly utilize my GPU without needing the HSA_OVERRIDE_GFX_VERSION environment variable. Since I compiled only for the RX 5500 XT, it won't work with other cards without rebuilding. However, creating a pip wheel for manylinux_2_35_x86_64 and installing it on my Arch host doesn't quite work. I created a PyTorch diagnosis script to test basic tensor and matrix operations. The script fails whenever hipMAGMA is involved in the calculation. Interestingly, this doesn't happen in the Ubuntu 22.04 Docker container!
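For illustration, here's a minimal sketch of the kind of operations my diagnosis script exercises; the tensor sizes and the choice of torch.linalg.svd as the MAGMA-backed call are stand-ins, not the actual script:

```sh
# Sketch only: plain tensor/matrix ops succeed, while a hipMAGMA-backed
# linear-algebra op is the kind of call that crashes on my Arch host.
python3 - <<'EOF'
import torch

assert torch.cuda.is_available()   # ROCm builds report through the CUDA API
dev = torch.device("cuda")         # no HSA_OVERRIDE_GFX_VERSION needed here

a = torch.randn(512, 512, device=dev)
b = torch.randn(512, 512, device=dev)
print((a @ b).sum().item())        # basic matrix multiply: works fine

u, s, vh = torch.linalg.svd(a)     # linear algebra that goes through (hip)MAGMA
print(s[:4])                       # this is where the host install falls over
EOF
```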
Running the MNIST training session on the host fails with a similar error:
If you or anyone else has an idea of how to remove these runtime errors, please let me know, and I'll look into it. Much appreciated.
@TheTrustedComputer You can use this Docker image, which already has PyTorch prebuilt with ROCm for the RX 5500 XT: https://hub.docker.com/r/serhiin/rocm_gfx1012_pytorch
UPDATE: I figured out how to resolve the issue when building MIGraphX (ONNX Runtime depends on it as an alternative execution provider) on ROCm 5.4.3. It turns out that … Then, I bumped versions in … Thank you @serhii-nakon for creating a Docker container of the prebuilt ROCm and PyTorch for this card and sharing it with everyone. I ended up not needing it.
Environment
What is the Expected Behavior
All build scripts will pass and install their respective packages, and unit tests won't raise runtime errors. It should behave exactly like the precompiled wheel packages for PyTorch 1.13.1 stable and 2.0.0 nightly, which already feel ancient given how quickly this ecosystem moves.
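For reference, the prebuilt wheels I'm comparing against can be installed the usual way; the index URL below is the standard PyTorch one for the ROCm 5.2 builds:

```sh
# Reference installation: the stable prebuilt PyTorch 1.13.1 wheel for ROCm 5.2.
pip3 install torch==1.13.1 torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/rocm5.2
```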
The latest stable ROCm version that works properly with the RX 5000 series cards is 5.2.x. Since I'm aware that later versions (5.3+) break compatibility with these cards, I'll try my luck compiling PyTorch 2.2.0, the latest stable PyTorch release as of this writing, against ROCm 5.2.3 using your build script.
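For anyone following along, the PyTorch half of that plan boils down to something like the sketch below; it assumes the ROCm 5.2.3 toolchain is already installed, and the exact steps are in my build log:

```sh
# Sketch: build PyTorch from source for the RX 5500 XT (gfx1012) only.
cd pytorch
python3 tools/amd_build/build_amd.py   # hipify the CUDA sources for ROCm
export USE_ROCM=1
export PYTORCH_ROCM_ARCH=gfx1012       # emit kernels for this one GPU family
python3 setup.py bdist_wheel           # or `python3 setup.py install`
```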
I read that someone created a wheel with PyTorch 2.1.0 and can confirm that it works on my system without crashing.
What Actually Happens
Building rocALUTION failed with an "illegal instruction detected" error, similar to the linked comment on issue #35. I guess it can't be used on this card without hacky workarounds. Fortunately, it's not a requirement for PyTorch. All other toolchains succeed without errors. Here's the build log I kept while doing this, including the code patches and additional build dependencies that were needed along the way.
After all the builds finished, I ran your check scripts to ensure everything was installed properly. With the exception of rocALUTION, which apparently isn't supported on this family of cards, everything looked fine. However, the installation is only partially functional: the run-miopen.sh and run-miopen-img.sh check scripts produced compilation errors. The other checks all run without problems, and their output is virtually identical to the prebuilds. Below is the output of run-miopen.sh:
check.sh:
I've tried different versions of GCC (11.4, 10.5, and 9.4); all resulted in the same error. This is something I cannot fix, sadly. In the systemd journal logs, I see several messages saying "Could not parse number of program headers from core file: invalid `Elf' handle". Investigation shows that this was reported upstream and is somewhat specific to ROCm 5.2.3; it has been fixed in 5.3, as have the illegal-instruction errors in rocALUTION.
ROCm/MIOpen#1764
rocm-arch/rocm-arch#857
Nevertheless, I then proceeded to compile PyTorch 2.2.0 with hipMAGMA support, torchaudio 2.2.0, and torchvision 0.17. It doesn't build out of the box because the code uses constants that are missing from ROCm 5.2.3; all of them are present in your target version, 5.4.
After making some modifications to the PyTorch code (see the build log), I was able to make it work. If you have any patches that backport these four hipBLAS and MIOpen constants, please provide them and let me know how to apply them. Thank you very much!
How to Reproduce
Create an Ubuntu 22.04 Docker container with these flags, and perform a `repo init` and `repo sync` on ROCm 5.2.x (see the sketch below). You can change the volume mount point to whatever you have on your end. Then, implement those adjustments as indicated in the build log.
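Concretely, the checkout step looks something like this; the manifest URL is the standard ROCm one, and the branch name assumes the 5.2.x release line:

```sh
# Fetch the ROCm 5.2.x sources with the repo tool inside the container.
mkdir -p ~/rocm-src && cd ~/rocm-src
repo init -u https://github.com/RadeonOpenCompute/ROCm.git -b roc-5.2.x
repo sync
```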
For reference, here's my `env.sh` file:

Also, do an

```sh
apt update && apt install sudo xxd kmod libtinfo5 graphviz libgmp-dev libcjson-dev
```

beforehand, or your `install-dependency.sh` script and the builds of specific toolchains like ROCR-Runtime, HIP, rocminfo, ROCgdb, and AMD MIGraphX won't run.