
Feature Request: GGUF format support #69

Open
orkutmuratyilmaz opened this issue Feb 12, 2024 · 6 comments

@orkutmuratyilmaz

Hello and thanks for this beautiful repo,

Do you have plans to provide a GGUF file? It would be great if we could have one.

Best,
Orkut

@onurgu

onurgu commented Feb 27, 2024

Hi, thanks for the interest. We're working on it 👍🏼

@helizac

helizac commented Jun 13, 2024

Hello, you can find GGUF support at helizac/TURNA_GGUF and a usage example in TURNA_GGUF_USAGE.ipynb.

Currently, only CPU usage is supported, but CUDA support will be implemented if huggingface/candle supports it. For more information, see this related issue.

llama.cpp does not support quantized T5 models at the moment; support will be added here if that changes.

I recommend using Q8_1 or Q8K models for efficiency. At the moment, these models generate 5-6 tokens per second.
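
As a rough sketch (not from the thread), one way to fetch one of these quantized files is with huggingface_hub; the filename below is a placeholder, so check the helizac/TURNA_GGUF file list for the actual Q8_1/Q8K names:

```python
# Rough sketch: download a quantized TURNA GGUF from the Hub.
# The filename is hypothetical -- check helizac/TURNA_GGUF for the real file names.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="helizac/TURNA_GGUF",
    filename="TURNA_Q8_1.gguf",  # placeholder name
)
print("GGUF saved to:", gguf_path)
```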

@gokceuludogan
Contributor

That's great news! Thank you for your contribution. We look forward to the implementation of CUDA support.

@onurgu

onurgu commented Jun 14, 2024

Thank you @helizac! How did you do this? The llama.cpp repo was not supporting T5 models; I see there were some developments yesterday:

ggerganov/llama.cpp#5763

Did you do it yourself? If so, where is the code?

@helizac

helizac commented Jun 14, 2024

Hello, unfortunately I did not make the development in the llama.cpp issue mentioned above, but I will try that branch and report back in this issue. I implemented it in Rust with the huggingface/candle framework, as follows. I saw that CUDA support could be provided in some of the framework's examples, but I ran into problems during implementation. I think CUDA support can be added with a few changes to:
https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized-t5/main.rs

Related issue: huggingface/candle#2266

The .gguf conversion process (currently CPU-only) is below.

RUST_GGUF_CONVERT:
https://colab.research.google.com/drive/1s97zTs8hfT0wyGTDHvs8cVOm9mVgXd9G?usp=sharing

With the methods in this notebook, TURNA can be used in .gguf format.
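
As a quick sanity check on the converted file (a sketch only, assuming the gguf Python package and a hypothetical output path), the metadata keys and tensor names can be listed like this:

```python
# Sketch: inspect a converted GGUF file with the gguf Python package.
# "turna.gguf" is a hypothetical output path from the notebook above.
from gguf import GGUFReader

reader = GGUFReader("turna.gguf")

# Print the metadata keys stored in the file header.
for key in reader.fields:
    print("metadata:", key)

# Print each tensor's name, shape, and quantization type.
for tensor in reader.tensors:
    print("tensor:", tensor.name, tensor.shape, tensor.tensor_type)
```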

@helizac

helizac commented Jun 14, 2024

So, I tried the new t5 branch -> https://github.com/fairydreaming/llama.cpp/tree/t5, but it is not suitable for TURNA at the moment.

First, the t5 branch expects a spiece.model file, but TURNA uses HF tokenizers, so I adapted the code to the HF tokenizer. Then I ran into a second problem. Because the tensor mappings are defined as
MODEL_TENSOR.DEC_FFN_UP: "decoder.block.{bid}.layer.2.DenseReluDense.wi"
and
MODEL_TENSOR.ENC_FFN_UP: "encoder.block.{bid}.layer.1.DenseReluDense.wi"

in TENSOR_MODELS, it didn't work, because TURNA expects:

INFO:hf-to-gguf:dec.blk.0.ffn_up.weight, torch.float32 --> F32, shape = {1024, 2816}
INFO:hf-to-gguf:dec.blk.0.dense_relu_dense.wi_1.weight, torch.float32 --> F32, shape = {1024, 2816}
INFO:hf-to-gguf:enc.blk.0.ffn_up.weight, torch.float32 --> F32, shape = {1024, 2816}
INFO:hf-to-gguf:enc.blk.0.dense_relu_dense.wi_1.weight, torch.float32 --> F32, shape = {1024, 2816}

I defined the tensors on my own and was able to export a .gguf output, but llama.cpp won't load it, failing with "error loading model vocabulary: Index out of array bounds in XCDA array!". To fix this, the relevant functions in the llama.cpp file would need to be examined in detail and rewritten.
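
For reference, a rough sketch of the kind of renaming a TURNA-aware converter would need for the gated FFN weights; the wi_0 -> ffn_gate and wi_1 -> ffn_up assignment and the enc/dec block prefixes follow the log above as assumptions, and the exact constants in the t5 branch may differ:

```python
import re

# Sketch: map Hugging Face gated-FFN tensor names (wi_0 / wi_1, as used by TURNA)
# to GGUF-style block names like those in the hf-to-gguf log above.
# The wi_0 -> ffn_gate / wi_1 -> ffn_up assignment is an assumption.
def map_gated_ffn_name(hf_name: str):
    m = re.fullmatch(
        r"(encoder|decoder)\.block\.(\d+)\.layer\.\d+\.DenseReluDense\.wi_([01])\.weight",
        hf_name,
    )
    if m is None:
        return None  # not a gated FFN weight; leave it to the existing mapping
    side = "enc" if m.group(1) == "encoder" else "dec"
    part = "ffn_gate" if m.group(3) == "0" else "ffn_up"
    return f"{side}.blk.{m.group(2)}.{part}.weight"

# Prints "dec.blk.0.ffn_up.weight"
print(map_gated_ffn_name("decoder.block.0.layer.2.DenseReluDense.wi_1.weight"))
```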

For now, the huggingface/candle Rust implementation described above will be more comfortable to use. If GPU support arrives soon, the model can easily be used this way:
https://colab.research.google.com/drive/1s97zTs8hfT0wyGTDHvs8cVOm9mVgXd9G?usp=sharing (RUST_GGUF_CONVERT)
