Slow speed for some models. #33
When loading a model, it tells you the quantization version. Versions 0 and 2 are slow: 0 because it is old, and 2 because upstream GPTQ prefers accuracy over speed. If you want fast models, use version 1. They usually show up on Hugging Face as compatible with KoboldAI.
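If you're not sure which version a given checkpoint is, one way to get a hint without loading the whole model is to inspect the tensor names in the quantized file. A minimal sketch, assuming a safetensors checkpoint; the file name is a placeholder, and the "zeros" vs "qzeros" heuristic is my assumption about how older vs newer GPTQ-for-LLaMa checkpoints differ, not something the loader guarantees:

```python
# Hypothetical diagnostic: list zero-point/scale tensors in a quantized
# checkpoint to get a hint about its GPTQ format. The file name is a
# placeholder, and the "zeros" vs "qzeros" distinction is an assumption
# about how older vs newer checkpoints differ.
from safetensors.torch import load_file

state_dict = load_file("4bit-128g.safetensors")
for name, tensor in state_dict.items():
    if "zeros" in name or "scales" in name:
        print(name, tuple(tensor.shape), tensor.dtype)
```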
If you can't achieve that, then I have two questions:
I've tried a few models and am seeing the same: 2 tokens/s with this version of KoboldAI (the same speed as standard) and 10-12 tokens/s on oobabooga with the same models using ExLlama.
Hey,
I tried using this fork and realized that the speed was really slow for some of the models I was using:
https://huggingface.co/reeducator/vicuna-13b-cocktail/tree/main
For vicuna-cocktail, for example, I get something like 2 tokens/s even though I easily reach 10 tokens/s on ooba's webui.
Some other models (like raw LLaMA 13B) give me 7 tokens/s, which is fine.
I guess this has to do with vicuna-cocktail not having been saved with the "save_pretrained" option? I don't know, just trying to guess here.
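In case it helps, this is the kind of round-trip I mean; a minimal sketch, assuming the full-precision repo is what needs re-saving (the output path is a placeholder):

```python
# Hypothetical round-trip: load a checkpoint with Transformers and write it
# back out with save_pretrained so it ends up in the standard layout.
# The destination path is a placeholder, and this assumes the
# full-precision weights (not the 4-bit file) fit in memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "reeducator/vicuna-13b-cocktail"
dst = "./vicuna-13b-cocktail-resaved"

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(src)

model.save_pretrained(dst)
tokenizer.save_pretrained(dst)
```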
Anyway, if you could look at that and try to get "normal" speed in every situation, that would be cool.
Thanks in advance.