Some minor quant strategies tweaks #117
base: main
Conversation
Here's what I'd suggest for starters:

- Rationalize Q2_K_S ffn_down and attn_v (+1% size, -2.5% ppl).
- Bump attn_v and attn_k for Q2_K_S and Q2_K if GQA >= 2, and uncripple attn_k for IQ3_XXS / IQ3_XS if GQA >= 2 (a sketch of this condition follows the list).
  -> Gemma v2 (GQA2) is popular and sensitive to both; L3 models as well.
- Apply the 8-experts rules to:
  - MoEs with more than 8 experts;
  - MoEs with 4 experts, which should be treated as 8, considering that their shared tensors are already small relative to their ffn tensors;
  - models with 2 or more experts (such Frankenstein hybrids are published on HF with 2 experts; let them have MoE quants equivalent in bpw to standard models).
- Rationalize MoE attn_k and attn_v for the 1- and 2-bit IQ quants, and attn_q for the 1, 2 and small 3 bpw quants.
- Rationalize attn_output for IQ2_XXS, IQ2_XS, IQ2_S and IQ2_M (IQ3_XXS is sufficient), in line with what was done for the IQ1 quants, themselves shrunk to IQ2_KS. (No tests made today except for IQ2_S and IQ2_M; it's mere common sense.)
- Rationalize ffn_down for IQ2_S and IQ2_M. (Size stays equivalent thanks to the attn_output shrink, and ppl drops by 0.5%.)

Tests were made today on Sheared Llama 2.7b, but I have been using these recipes, among others, for a long time already.

Further ideas for a subsequent PR:

- IQ and IQ_K quants should maybe not be mixed together unless they are switchable 1:1 on all supported hardware, accounting also for which of them have a CUDA MMQ kernel available and which don't.
- Maybe the IQ1/IQ2 tree should also be dismantled and spread into the per-tensor trees like every other quant.
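For concreteness, here is a minimal standalone sketch of the GQA / expert-count condition described above. It is not the actual ik_llama.cpp code (the real logic lives in the quantizer's tensor-type selection, llama_tensor_get_type in mainline llama.cpp); the type enum, hyperparameter struct, thresholds and target types below are illustrative assumptions only.

```cpp
// Standalone illustration only: simplified stand-ins for ggml types and model
// hyperparameters, not the real quantizer code.
#include <string>

enum class qtype { Q2_K, Q3_K, Q4_K };   // illustrative subset of quant types

struct model_hparams {
    int n_head    = 32;   // attention heads
    int n_head_kv = 8;    // KV heads; GQA factor = n_head / n_head_kv
    int n_expert  = 0;    // MoE expert count (0 for dense models)
};

static int n_gqa(const model_hparams & hp) {
    return hp.n_head_kv > 0 ? hp.n_head / hp.n_head_kv : 1;
}

// Given the default type the Q2_K / Q2_K_S recipe would pick for a tensor,
// bump attn_v and attn_k one step when the model uses GQA >= 2 (Gemma v2,
// Llama 3, ...) or is a MoE: these tensors are small relative to the whole
// model but disproportionately perplexity-sensitive. The target types here
// are placeholders for "one step up", not the values proposed in this PR.
static qtype pick_type(const std::string & name, qtype def, const model_hparams & hp) {
    const bool bump = n_gqa(hp) >= 2 || hp.n_expert >= 2;
    if (def == qtype::Q2_K && bump) {
        if (name.find("attn_v.weight") != std::string::npos) return qtype::Q4_K;
        if (name.find("attn_k.weight") != std::string::npos) return qtype::Q3_K;
    }
    return def;
}
```

In the same spirit, the existing "8 experts" branches would key off a condition like `n_expert >= 2` rather than an exact expert count, which is what covers the 4-expert and 2-expert cases listed above.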
Can you provide some data to support these changes?
Not really, IK; I'd have to redo all the tests I did over the previous months. I never knew how to log llama.cpp data properly (i.e., in an automated fashion), so I accumulated knowledge and edits along the way, and I'm just passing the simplest part of it on to you. I submit this in a "trust me bro" fashion because I suppose that you know what I know and then some, and simply have more interesting things to do with your skillset than to tinker hamster-style with quant strategies as I have since early 2024. Broadly, there are a few principles that I discovered through your work:
So, without any disrespect, pick what you like (I'm sure some of it makes sense to you; I often replicate your own quant-strategy patterns, since after all they are based on similar observations, ones you sometimes didn't systematize) and ditch whatever is "too much" for your taste. And if you'd like me to keep working on the quant strategies, please tell me; I'd be glad to help with something I can actually grasp and have experience with. Here is something so you can eventually take a look at some experiments I made and check how far I went: 07ad6c6f321ea3643cff5d38766ce8f13a785bfcmaster_loot_2/