Replies: 8 comments
-
Hey - the naming convention of "q8_0" is primarily there for legacy reasons; it doesn't necessarily mean that only Q8 quantization is supported. The approach you took with the registry is actually the recommended method for loading different model checkpoints. Feel free to name your model however fits your use case. Related discussion: #1398
-
You should just add this to the documentation and close this - it's not really an issue. I ran FP16, Q6_K, and Q5_M all using the same q8_0.v2.gguf file name without any problem.
-
Any updates on this?
-
@Mte90 you can make your own registry with models in different quantizations. The file name has no bearing on whether the model runs as Q6 or Q8. Check my registry-tabby project for an example.
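For the record, an entry along these lines is all a custom registry needs - a rough sketch, assuming the models.json layout registry-tabby used around the time of this thread (name, prompt_template, urls, sha256); the URL, checksum, and template below are placeholders:

```json
[
  {
    "name": "DeepseekCoder-6.7B-Q4",
    "prompt_template": "<your model's FIM template>",
    "urls": [
      "https://huggingface.co/<your-repo>/resolve/main/deepseek-coder-6.7b.Q4_K_M.gguf"
    ],
    "sha256": "<sha256 of the Q4_K_M file>"
  }
]
```

If I recall right, you can then point tabby at the fork by prefixing the model name with your registry's namespace (e.g. `--model <your-github-user>/DeepseekCoder-6.7B-Q4`) - check the docs in case that's changed.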
-
It isn't very handy to create a custom registry just to try models. To simplify, I was thinking of downloading the model manually and renaming/placing the file inside the model-specific folder under .tabby.
-
That works too, I did that a lot; you still need to have the prompt_template configured correctly.
-
The problem is that if I put a single file in place of a Tabby model that is split into multiple files in the registry, Tabby automatically deletes it and downloads the q8 again.
-
So I did it this way:
This way it downloads the models without creating a new registry somewhere :-)
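Roughly, the manual route looks like this - the URL is a placeholder, and the ~/.tabby directory layout here is an assumption based on the default setup that may differ between tabby versions:

```sh
# Grab the quantization you actually want (placeholder URL)
wget -O model.Q4_K_M.gguf \
  "https://huggingface.co/<repo>/resolve/main/deepseek-coder-6.7b.Q4_K_M.gguf"

# Drop it into tabby's model folder under the file name tabby expects;
# this path is an assumption based on the default ~/.tabby layout
mkdir -p ~/.tabby/models/TabbyML/DeepseekCoder-6.7B/ggml
mv model.Q4_K_M.gguf \
  ~/.tabby/models/TabbyML/DeepseekCoder-6.7B/ggml/q8_0.v2.gguf
```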
-
Please describe the feature you want
Currently it appears that tabby internally assumes that all models use Q8 quantization, but that doesn't appear to be an actual requirement. I forked the registry and modified a Q8 download to instead fetch a Q4_K_M build of the deepseek 6.7B model, as I needed a smaller RAM footprint so I could run on my nvidia 2080 SUPER.
Tabby still downloads the model to a file named q8_0.v2.gguf, but the sha256sum matches the Q4_K_M.gguf that I substituted in my fork of registry-tabby. Once I performed this "override" via my fork, I was able to load the model without issue, since llama-cpp doesn't require that we use only Q8 models.
I think this would probably require an additional field in the registry-tabby json structure that would allow tabby to map the model to a different filename; in addition, we couldn't hardcode the model filename as has been done here.
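Sketching the idea - the model_file field below is hypothetical, not something the current registry-tabby schema supports; the URL and sha256 are placeholders:

```json
{
  "name": "DeepseekCoder-6.7B",
  "urls": [
    "https://huggingface.co/<repo>/resolve/main/deepseek-coder-6.7b.Q4_K_M.gguf"
  ],
  "sha256": "<sha256 of the Q4_K_M file>",
  "model_file": "q4_k_m.v2.gguf"
}
```

With something like that, tabby could save the download under model_file instead of hardcoding q8_0.v2.gguf.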
Implementation details aside, my main point is that llama-cpp supports loading ggml/gguf models other than Q8, and it would be nice if tabby supported this without the ugly registry hack that I've done.

Additional context
Please reply with a 👍 if you want this feature.