Feature Request: GGUF format support #69
Hello, and thanks for this beautiful repo.

Do you have plans to provide a GGUF file? It would be great if we could have it.

Best,
Orkut

Comments
Hi, thanks for the interest. We're working on it 👍🏼
Hello, you can find GGUF support at helizac/TURNA_GGUF and a usage example in TURNA_GGUF_USAGE.ipynb. Currently, only CPU usage is supported, but CUDA support will be implemented if huggingface/candle supports it; for more information, see the related issue. llama.cpp does not support quantized T5 models at the moment, but this will be implemented if that changes. I recommend using the Q8_1 or Q8K models for efficiency. At the moment, these models generate 5-6 tokens per second.
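For anyone who wants to try this right away, below is a minimal sketch of what that CPU-only usage looks like with candle's quantized-t5 support. The file names (turna-q8_1.gguf, config.json, tokenizer.json) and the prompt are placeholders, not the exact contents of TURNA_GGUF_USAGE.ipynb:

```rust
// Minimal sketch: CPU inference over a quantized TURNA GGUF file with
// huggingface/candle, following candle's quantized-t5 example. All file
// names below are placeholders, not the exact TURNA_GGUF artifacts.
use candle_core::{Device, Tensor};
use candle_transformers::models::quantized_t5 as t5;
use tokenizers::Tokenizer;

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu; // only CPU is supported for now

    // Load the quantized weights directly from the .gguf file.
    let vb = t5::VarBuilder::from_gguf("turna-q8_1.gguf", &device)?;
    let config: t5::Config =
        serde_json::from_str(&std::fs::read_to_string("config.json")?)?;
    let mut model = t5::T5ForConditionalGeneration::load(vb, &config)?;

    // TURNA ships a Hugging Face `tokenizers` tokenizer, not spiece.model.
    let tokenizer =
        Tokenizer::from_file("tokenizer.json").map_err(anyhow::Error::msg)?;
    let tokens = tokenizer
        .encode("Bir varmış, bir yokmuş", true)
        .map_err(anyhow::Error::msg)?
        .get_ids()
        .to_vec();
    let input_ids = Tensor::new(&tokens[..], &device)?.unsqueeze(0)?;

    // Run the encoder once; the decoder is then driven token by token
    // (see the generation sketch further down in this thread).
    let encoder_output = model.encode(&input_ids)?;
    println!("encoder output shape: {:?}", encoder_output.dims());
    Ok(())
}
```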
That's great news! Thank you for your contribution. We look forward to the implementation of CUDA support.
Thank you @helizac! How did you do this? The llama.cpp repo was not supporting T5 models, though I see there were some developments yesterday. Did you do it yourself? If so, where is the code?
Hello, unfortunately the development in the llama.cpp issue mentioned is not mine, but I will try that branch and report back in this issue. I implemented GGUF support myself in Rust with the huggingface/candle framework, as follows. I saw that CUDA support could be provided in some of the framework's examples, but I ran into problems in the implementation; I think CUDA support can be provided with a few changes (related issue: huggingface/candle#2266). The currently CPU-only .gguf conversion process is below. RUST_GGUF_CONVERT: with the methods in this notebook, TURNA can be used in .gguf format.
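To illustrate that conversion step, here is a rough sketch of the safetensors-to-GGUF quantization with candle, along the lines of candle's `tensor-tools quantize` command. The paths, the rank-2 quantization heuristic, and the empty metadata are assumptions rather than the exact RUST_GGUF_CONVERT code; since the config and tokenizer are loaded from their own JSON files at inference time, this GGUF carries only tensors:

```rust
// Rough sketch: quantize float safetensors weights into a .gguf file with
// candle. Paths and the rank-2 heuristic are assumptions for illustration.
use candle_core::quantized::{gguf_file, GgmlDType, QTensor};
use candle_core::Device;

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    // Read the original float weights.
    let tensors = candle_core::safetensors::load("model.safetensors", &device)?;

    // Quantize the 2-D weight matrices (their last dim must be a multiple of
    // the Q8_1 block size, 32); keep everything else in F32.
    let mut qtensors = Vec::new();
    for (name, tensor) in tensors.iter() {
        let dtype = if tensor.rank() == 2 {
            GgmlDType::Q8_1
        } else {
            GgmlDType::F32
        };
        qtensors.push((name.clone(), QTensor::quantize(tensor, dtype)?));
    }
    let qtensors: Vec<(&str, &QTensor)> =
        qtensors.iter().map(|(n, t)| (n.as_str(), t)).collect();

    // Write the quantized tensors (with no extra metadata) into a .gguf file.
    let mut out = std::fs::File::create("turna-q8_1.gguf")?;
    gguf_file::write(&mut out, &[], &qtensors)?;
    Ok(())
}
```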
So, I tried the newly edited t5 branch -> https://github.com/fairydreaming/llama.cpp/tree/t5, but it's not suitable for TURNA at the moment. To begin with, the t5 branch expects a spiece.model file, while TURNA uses Hugging Face tokenizers, so I converted the code to the HF tokenizer implementation. Then I hit a second problem: because the tensor names are defined in TENSOR_MODELS, the conversion didn't work, since TURNA expects tensors such as: INFO:hf-to-gguf:dec.blk.0.ffn_up.weight, torch.float32 --> F32, shape = {1024, 2816}. I defined the tensors on my own and could export a .gguf output, but llama.cpp won't work with it, failing at load time with "error loading model vocabulary: Index out of array bounds in XCDA array!". Fixing this would require examining it in detail and rewriting the functions in the llama.cpp file. For now, the earlier huggingface/candle Rust implementation will be more comfortable to use. If GPU support comes soon, the model can be used this way easily:
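The generation loop looks roughly like the sketch below, again following the pattern of candle's quantized-t5 example; the 512-token cap and greedy sampling are arbitrary choices, and once quantized CUDA kernels land in candle, swapping `Device::Cpu` for `Device::cuda_if_available(0)` should be the only change needed:

```rust
// Sketch of a greedy decode loop for the quantized T5 model, mirroring
// candle's quantized-t5 example. The 512-token cap is arbitrary.
use candle_core::{Device, Tensor};
use candle_transformers::generation::LogitsProcessor;
use candle_transformers::models::quantized_t5 as t5;

fn generate(
    model: &mut t5::T5ForConditionalGeneration,
    config: &t5::Config,
    encoder_output: &Tensor,
    device: &Device,
) -> anyhow::Result<Vec<u32>> {
    // Temperature `None` makes the processor pick the argmax (greedy).
    let mut logits_processor = LogitsProcessor::new(299792458, None, None);
    let mut output_ids =
        vec![config.decoder_start_token_id.unwrap_or(config.pad_token_id) as u32];
    for i in 0..512 {
        // With the KV cache enabled, only the newest token is fed after step 0.
        let decoder_ids = if i == 0 || !config.use_cache {
            Tensor::new(output_ids.as_slice(), device)?.unsqueeze(0)?
        } else {
            Tensor::new(&[*output_ids.last().unwrap()], device)?.unsqueeze(0)?
        };
        let logits = model.decode(&decoder_ids, encoder_output)?.squeeze(0)?;
        let next = logits_processor.sample(&logits)?;
        if next as usize == config.eos_token_id {
            break;
        }
        output_ids.push(next);
    }
    Ok(output_ids)
}
```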