
Slow speed for some models. #33

Open
BadisG opened this issue May 21, 2023 · 4 comments

BadisG commented May 21, 2023

Hey,

I tried using this fork and realized that the speed was really slow for some of the models I was using:
https://huggingface.co/reeducator/vicuna-13b-cocktail/tree/main

For vicuna-cocktail, for example, I get something like 2 tokens/s, even though I easily reach 10 tokens/s on ooba's webui.

Some other models (like raw llama 13b) give me 7 tokens/s, which is fine.

I guess this has to do with vicuna-cocktail not having been saved with the "save_pretrained" option? I don't know, just trying to guess here.

Anyway, if you could look at that and try to get "normal" speed in every situation, that would be cool.

Thanks in advance.

0cc4m (Owner) commented May 21, 2023

When loading a model, it tells you the quantization version. Versions 0 and 2 are slow: 0 because it is an old format, 2 because upstream GPTQ prefers accuracy over speed. If you want fast models, use version 1. They usually show up on Hugging Face as compatible with KoboldAI.
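To make the trade-off above concrete, here is a minimal sketch of how a loader might report a checkpoint's quantization version. The function name and the version-to-description mapping are hypothetical illustrations of the comment above, not the fork's actual loader code:

```python
# Hypothetical illustration of the version trade-off described above;
# not 0cc4m's actual loader code.

def describe_quant_version(version: int) -> str:
    """Summarize the speed/accuracy trade-off for a GPTQ checkpoint version."""
    if version == 0:
        return "version 0: legacy format, slow kernels"
    if version == 1:
        return "version 1: fast kernels, usually marked KoboldAI-compatible"
    if version == 2:
        return "version 2: upstream GPTQ format, prefers accuracy over speed"
    raise ValueError(f"unknown quantization version: {version}")

for v in (0, 1, 2):
    print(describe_quant_version(v))
```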

BadisG (Author) commented May 21, 2023

> When loading a model, it tells you the quantization version.

Oh yeah, I have Version 2.

But still, even with those "slow" models I can get 10 tokens/s on ooba's webui, so there must be a way to get the same speed on KoboldAI.

BadisG (Author) commented May 21, 2023

> But still, even with those "slow" models I can get 10 tokens/s on ooba's webui, so there must be a way to get the same speed on KoboldAI.

If you can't achieve that, I then have two questions:

  1. How do you make a "Version 1 GPTQ" when you decide to quantize a model?
  2. Do you lose a lot of accuracy when using version 1?


liquidsnakeblue commented Aug 5, 2023

I've tried a few models and am seeing the same: 2 tk/s with this version of KoboldAI (the same speed as standard) and 10-12 tk/s in oobabooga with the same models using exllama.
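For anyone comparing backends, a rough way to measure throughput yourself is to time a generation call and divide the new-token count by the elapsed time. The sketch below is generic; the token count and the sleep are placeholders standing in for a real `model.generate(...)` call, not code from either UI:

```python
import time


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Compute generation throughput from a token count and wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_s


# Time a (placeholder) generation call and report throughput.
start = time.perf_counter()
generated_tokens = 200   # stand-in for len(output_ids) - len(input_ids)
time.sleep(0.01)         # stand-in for the actual model.generate(...) call
elapsed = time.perf_counter() - start

print(f"{tokens_per_second(generated_tokens, elapsed):.1f} tokens/s")
```

Measuring both UIs with the same prompt and generation length makes the 2 vs 10 tk/s comparison apples-to-apples.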

pi6am added a commit that referenced this issue Aug 28, 2023