Where is the code about "remaining layers use faster half precision accumulate"? #10
The repo description reads: "Flux diffusion model implementation using quantized fp8 matmul & remaining layers use faster half precision accumulate, which is ~2x faster on consumer devices."

Hello there!
Thanks for sharing your quantization implementation of Flux!
I have a question about "remaining layers use faster half precision accumulate". Could you point out the lines in the repo that enable the faster half-precision accumulate?
Thanks in advance!

Comments
It's the CublasLinear layers. It's a repo I made which allows matmuls to run with half-precision accumulate within the matmul kernel, which doubles the TFLOPS for most consumer GPUs. The source is here: https://github.com/aredden/torch-cublas-hgemm. So, wherever you see CublasLinear replacements happening (I think it's actually in the float8_quantize.py file), that's where it occurs.
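For readers looking for the idea rather than the exact code, here is a minimal sketch of that kind of swap: walking a model and replacing nn.Linear modules with CublasLinear so their matmuls use fp16 accumulate. The `cublas_ops` import path, the CublasLinear constructor arguments, and the helper name are assumptions for illustration; the actual replacement logic lives in this repo's float8_quantize.py and the torch-cublas-hgemm README documents the real API.

```python
# Hypothetical sketch: swap nn.Linear modules for CublasLinear (fp16 accumulate).
# The import path and constructor signature below are assumptions, not this
# repo's actual code; see float8_quantize.py and torch-cublas-hgemm for the real thing.
import torch
import torch.nn as nn

from cublas_ops import CublasLinear  # assumed import path


def swap_linear_for_cublas(module: nn.Module) -> nn.Module:
    """Recursively replace nn.Linear children with CublasLinear copies."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            replacement = CublasLinear(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                device=child.weight.device,
                dtype=torch.float16,
            )
            # Copy the original weights (cast to fp16) into the replacement layer.
            replacement.weight.data.copy_(child.weight.data.to(torch.float16))
            if child.bias is not None:
                replacement.bias.data.copy_(child.bias.data.to(torch.float16))
            setattr(module, name, replacement)
        else:
            swap_linear_for_cublas(child)
    return module
```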
@aredden Thanks for your detailed answer!
Hey @aredden, will a datacenter GPU (an L40S, for example) get any benefit from the cublas swap?
Not really. It has enough SRAM that it gets the same TFLOPS for fp16 with fp32 accumulate as it does for fp16 with fp16 accumulate. @spejamas
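If you want to check this on your own hardware, a rough benchmark like the sketch below compares a plain fp16 nn.Linear (fp32 accumulate by default in cuBLAS) against the CublasLinear drop-in. The `cublas_ops` import path and the shapes are illustrative assumptions, not part of this repo.

```python
# Rough check of whether a GPU benefits from fp16 accumulate: time a plain
# fp16 nn.Linear against the (assumed) CublasLinear drop-in replacement.
import torch
import torch.nn as nn

from cublas_ops import CublasLinear  # assumed import path


@torch.no_grad()
def bench(layer: nn.Module, x: torch.Tensor, iters: int = 100) -> float:
    """Return average milliseconds per forward pass, measured with CUDA events."""
    for _ in range(10):  # warmup
        layer(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        layer(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
fp32_acc = nn.Linear(4096, 4096, bias=False, device="cuda", dtype=torch.float16)
fp16_acc = CublasLinear(4096, 4096, bias=False, device="cuda", dtype=torch.float16)
fp16_acc.weight.data.copy_(fp32_acc.weight.data)

print(f"nn.Linear (fp32 accumulate):    {bench(fp32_acc, x):.3f} ms")
print(f"CublasLinear (fp16 accumulate): {bench(fp16_acc, x):.3f} ms")
```

Per the comments above, on most consumer GPUs the second number should come out roughly half the first, while on a datacenter part like an L40S the two should be about the same.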