Activation quantization is a different technique from weight quantization: activations are quantized dynamically during inference, whereas weights are quantized statically after training. I don't think it would help address the major performance bottleneck of LLM inference, so we didn't add it. But we encourage users to copy-paste, fork, and play around with the repo with new ideas, so feel free to try it if you are interested.
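For context, here is a minimal sketch of what dynamic activation quantization (W8A8 with per-token activation scales) looks like in PyTorch. The function names and shapes are illustrative, not part of gpt-fast, and the matmul is emulated in float for portability; a real implementation would dispatch to an int8 GEMM kernel to get the actual speedup.

```python
import torch

def dynamic_quantize_activations(x: torch.Tensor):
    # Per-token scale: one scale per row, computed from the runtime values of x.
    # This is what makes activation quantization "dynamic" -- the scale cannot
    # be baked in ahead of time the way weight scales can.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    x_int8 = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_int8, scale

def w8a8_linear(x: torch.Tensor, w_int8: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    # w_int8: (out_features, in_features) int8, w_scale: (out_features,) float,
    # both produced by ordinary static weight quantization.
    x_int8, x_scale = dynamic_quantize_activations(x)
    # Emulated in float here; a real kernel would accumulate in int32
    # (e.g. an int8 GEMM such as torch._int_mm on CUDA).
    acc = x_int8.float() @ w_int8.float().t()
    return (acc * x_scale * w_scale).to(x.dtype)

# Usage: quantize a weight statically, then run with dynamic activation scales.
w = torch.randn(256, 128)
w_scale = w.abs().amax(dim=-1) / 127.0
w_int8 = torch.clamp(torch.round(w / w_scale[:, None]), -128, 127).to(torch.int8)
x = torch.randn(4, 128)
y = w8a8_linear(x, w_int8, w_scale)  # approximates x @ w.t()
```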
Many recent papers have addressed the challenges of quantizing activations for LLMs.
Examples:
https://github.com/ziplab/QLLM?tab=readme-ov-file#%F0%9F%9B%A0-install
https://github.com/mit-han-lab/lmquant?tab=readme-ov-file#efficiency-benchmarks
https://github.com/spcl/QuaRot
Is it possible to add activation quantization support to gpt-fast for even more speedup?
Any insight on the limitations and possibilities is appreciated.