ZLUDA Wave64 implementation may have issue on GFX8/9 #208
Good analysis. I don't have a wave64 GPU at hand (neither pre-RDNA nor CDNA), but I can explain what you are seeing and offer some pointers. I think you are correct in suspecting ZLUDA wave64 mode. Only a handful of functions require special treatment in wave64, but they are fairly tricky and there may well be bugs, especially since I did not even write it for pre-RDNA desktop cards, but for CDNA. I'm always surprised by all the complex workloads (well, except this one) that reportedly work on pre-RDNA.

You would need to figure out which exact kernel is the first one that produces a different result. If I had the repro (PTX module with the kernel, kernel name, input, and good & bad output), I could figure out what's wrong. The tricky part is getting there. There is no single simple solution, but how to get there:
BTW, why even use ZLUDA with Stable Diffusion, especially with pre-RDNA? Is there no better path there? I am asking because I never even touched SD and want to focus on workloads that are right now impossible with AMD cards. It was not on the list of potential ZLUDA workloads when I was with AMD; the thinking was that nod.ai had it covered.
I do have the source code, but that is PyTorch, which sits several abstraction layers higher. It seems I will eventually have to step through all the operators involved first. There are few options for AMD cards on PyTorch, given that MIOpen for Windows is still not available and PyTorch is not yet set up to build without it. DirectML works, but it is buggy, slow, and no longer maintained. ZLUDA is one of the best ways to get there right now, as ROCm 6.1 again comes without Windows support.
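To narrow down which operator first goes wrong without diving into the kernels, one option is to compare per-module outputs between CPU and GPU with forward hooks. This is only a rough sketch of the idea, assuming a `model_cpu` / `model_gpu` pair with identical structure and an `example_input` tensor (all hypothetical names, not from this thread):

```python
# Rough sketch: find the first module whose CUDA output diverges from the CPU
# output. model_cpu / model_gpu / example_input are hypothetical placeholders.
import torch

def find_first_divergence(model_cpu, model_gpu, example_input, atol=1e-3):
    cpu_out, gpu_out, handles = {}, {}, []

    def make_hook(name, store):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                store[name] = output.detach().float().cpu()
        return hook

    for name, m in model_cpu.named_modules():
        handles.append(m.register_forward_hook(make_hook(name, cpu_out)))
    for name, m in model_gpu.named_modules():
        handles.append(m.register_forward_hook(make_hook(name, gpu_out)))

    with torch.no_grad():
        model_cpu(example_input)
        model_gpu(example_input.cuda())
    for h in handles:
        h.remove()

    # cpu_out preserves execution order, so the first entry over the tolerance
    # is the earliest diverging module.
    for name, ref in cpu_out.items():
        if name in gpu_out and ref.shape == gpu_out[name].shape:
            diff = (ref - gpu_out[name]).abs().max().item()
            if diff > atol:
                print(f"first divergent module: {name} (max abs diff {diff:.4g})")
                return name
    print("no divergence above tolerance")
    return None
```

Once the offending module is known, it should be much easier to extract a small kernel-level reproduction like the LayerNorm one below.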
I'm going to log some of my progress here so it doesn't get lost. For the code below, the outputs go bad as soon as `magic >= 256`.

```python
import torch

magic = 256

ln = torch.nn.LayerNorm((magic, ), eps=1e-05, elementwise_affine=True)
ln_cuda = torch.nn.LayerNorm((magic, ), eps=1e-05, elementwise_affine=True).cuda()
weight_values = torch.ones(magic)
bias_values = torch.zeros(magic)
ln.weight.data = weight_values
ln.bias.data = bias_values
ln_cuda.weight.data = weight_values.cuda()
ln_cuda.bias.data = bias_values.cuda()

input = torch.rand(1, 1, magic)
with torch.no_grad():
    output_cpu = ln(input)
    output_gpu = ln_cuda(input.cuda())

print(torch.sum(output_cpu))
print(torch.sum(output_gpu.cpu()))
```
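The printed sums only hint at the problem; to make the divergence explicit, the two outputs can also be compared element-wise (a small addition of mine, not part of the original snippet):

```python
# Compare the CPU and GPU results directly instead of their sums
# (illustrative addition, not part of the original reproduction).
diff = (output_cpu - output_gpu.cpu()).abs()
print("max abs diff:", diff.max().item())
print("allclose:", torch.allclose(output_cpu, output_gpu.cpu(), atol=1e-4))
```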
Hmm... zluda_dump gives me a single PTX file and 9 ELF files... not sure what I can do with them. But anyway, here are the files:
Some more discoveries... PyTorch has a vectorized layer norm optimization that applies when `magic % 4 == 0`. The above code only has a problem when that optimization is applied and when `magic > 128`. The branch is here: I'm making the code deterministic so the problem can be spotted more easily (a size sweep is also sketched after the example outputs below):

```python
import torch

magic = 132

ln = torch.nn.LayerNorm((magic, ), eps=1e-05, elementwise_affine=True)
ln_cuda = torch.nn.LayerNorm((magic, ), eps=1e-05, elementwise_affine=True).cuda()
weight_values = torch.ones(magic)
bias_values = torch.zeros(magic)
ln.weight.data = weight_values
ln.bias.data = bias_values
ln_cuda.weight.data = weight_values.cuda()
ln_cuda.bias.data = bias_values.cuda()

input = torch.linspace(-1., 1., steps=magic).unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    output_cpu = ln(input)
    output_gpu = ln_cuda(input.cuda())

print(output_cpu)
print(output_gpu.cpu())
```

Example output when `magic = 132`:
Example output when `magic = 256` (the output is really far off):
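Since the bad results seem tied to the vectorized path (`magic % 4 == 0`) and a size threshold, a quick sweep over sizes can show exactly where the GPU result starts drifting. A rough sketch of mine, assuming the same functional layer norm is exercised at these sizes:

```python
# Rough sketch: sweep layer-norm widths and report the max CPU-vs-GPU deviation
# to localize where the vectorized path starts to misbehave.
import torch
import torch.nn.functional as F

for magic in (64, 124, 128, 132, 256, 512):
    x = torch.linspace(-1., 1., steps=magic).reshape(1, 1, magic)
    w, b = torch.ones(magic), torch.zeros(magic)
    ref = F.layer_norm(x, (magic,), w, b, eps=1e-5)
    got = F.layer_norm(x.cuda(), (magic,), w.cuda(), b.cuda(), eps=1e-5).cpu()
    print(f"magic={magic:4d}  max abs diff={(ref - got).abs().max().item():.6f}")
```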
That is perfectly normal and expected. nvcc will compile your code into multiple code modules (module = kernels + globals). Then a single module gets compiled into a single fat binary. A fat binary contains multiple variants of the same module: usually one for each target GPU architecture, plus PTX for unknown architectures. What you see is a single fat binary split into those architecture-specific variants and a PTX. The log is slightly weird, though: it contains only a single kernel dispatch.
Expected output is this:
It looks good.
The log is from the minimal reproduction I posted above. It only contains a LayerNorm operation.
Hmmm, I have a suspicion about what specifically went wrong. Can you try and use ZLUDA from
Additionally, could you run this test (on the version you already have, and if it fails, then on the
ZLUDA on RDNA can be forced (by changing the source code) to run wave64. I've tried that and I can't reproduce the issue:
Well, there's only one way to know now. I've just ordered a Vega 10 GPU. It will arrive in a few days and then I'll be able to actually debug this.
Tried
The PTX test is not failing.
By the way, while I was trying to figure out which special instruction is involved:
Thanks! It is fixed for me, verified with gfx906 Vega 20 (Radeon Pro VII).
This is purely based on deduction.
What's known
When using ZLUDA with Stable Diffusion, a Vega 20 user got this sort of image:
It is currently known that gfx803 / gfx900 / gfx906 users all get similar output.
The exact reason behind this image is unknown. However, it is known that the problem exists somewhere in the CLIP-UNet stage, and I have tried VAE Encode/Decode with no issue. I have also tried several basic PyTorch operators, and they all succeed with correct results.
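A sketch of the kind of basic operator check meant here (illustrative only, not the exact script that was run):

```python
# Rough sketch of basic operator sanity checks: compare a few common ops
# between CPU and the CUDA (ZLUDA) device. Illustrative only.
import torch
import torch.nn.functional as F

def check(name, cpu_result, gpu_result, atol=1e-4):
    diff = (cpu_result - gpu_result.cpu()).abs().max().item()
    print(f"{name}: max abs diff {diff:.6f}")

a = torch.rand(256, 256)
b = torch.rand(256, 256)
check("matmul", a @ b, a.cuda() @ b.cuda())
check("softmax", torch.softmax(a, dim=-1), torch.softmax(a.cuda(), dim=-1))
check("conv2d",
      F.conv2d(a.view(1, 1, 256, 256), torch.ones(1, 1, 3, 3)),
      F.conv2d(a.cuda().view(1, 1, 256, 256), torch.ones(1, 1, 3, 3).cuda()))
```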
What's not causing the issue
I tried to mitigate the issue in several ways, and found out:
My deduction
At this point, there are only two components that could cause this issue. One is ZLUDA, and the other is the ROCm driver. I'm not sure what's happening on the driver side as it is closed-source, and I'm not seeing many similar issues on Windows. (Notable issues: ollama/ollama#2453 (comment) and ROCm/rocBLAS#1218 (comment), but the situation there seems somewhat different, as gfx9 is mostly issue-free in them.)
The one big difference between gfx8/9 and gfx10/11 is the support of Wave32. While `DoubleWave32OnWave64` has this sort of issue, I have asked an RX580 user to turn on `ZLUDA_WAVE64_SLOW_MODE=1` for `Wave32OnWave64`, and he got this error constantly:

At this point, I suspect the implementation of Wave64 in ZLUDA has something to do with this issue. Hopefully someone can point me in the right direction on how to get this fixed.