Replies: 4 comments 4 replies
-
I do have some scripts that make setup and quanting a lot simpler, including a version that can upload to repos automatically in a format similar to how turbo uploads his quants. This should make it much easier for anyone to automate the process and get quants up. Also, I think the biggest reason is that ExLlama is GPU-only while GGUF has CPU and Apple Silicon support, which allows people with less VRAM to use bigger models.
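For anyone who wants to roll their own automation along these lines, here's a rough sketch of what a quant-and-upload loop could look like. It assumes ExLlamaV2's `convert.py` and the `huggingface_hub` client; all paths, repo names and bitrates are placeholders, not the actual scripts mentioned above:

```python
# Hypothetical sketch: quantize a model with ExLlamaV2's convert.py and push each
# result to the Hugging Face Hub. Paths, repo names and bitrates are placeholders.
import subprocess
from pathlib import Path

from huggingface_hub import HfApi  # pip install huggingface_hub

MODEL_DIR = Path("models/MyModel-7B")       # unquantized HF model (assumed local)
CONVERT = Path("exllamav2/convert.py")      # ExLlamaV2 conversion script
BITRATES = [4.0, 5.0, 6.5]                  # bits per weight to produce

api = HfApi()  # uses the token from `huggingface-cli login`

for bpw in BITRATES:
    out_dir = Path(f"quants/MyModel-7B-{bpw}bpw-exl2")
    work_dir = Path(f"work/{bpw}")
    # Run the conversion; -cf writes a complete, ready-to-upload model folder.
    subprocess.run(
        ["python", str(CONVERT),
         "-i", str(MODEL_DIR),
         "-o", str(work_dir),
         "-cf", str(out_dir),
         "-b", str(bpw)],
        check=True,
    )
    repo_id = f"my-user/MyModel-7B-exl2-{bpw}bpw"  # placeholder repo name
    api.create_repo(repo_id, exist_ok=True)
    api.upload_folder(folder_path=str(out_dir), repo_id=repo_id)
```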
-
Gittb is right that integration is a serious issue. Especially as ooba is struggling with text-generation-webui lately: it's even still on v0.0.20 by default instead of v0.1.5, which doesn't help either. More quants might help a little, but there are already diligent people like Bartowski who upload a lot. Yet very few people ever use them, which goes back to limited integration, but also to lack of awareness. People often don't know what exl2 really is, what the potential benefits are, or how to use it. Above all though, let's not forget that while turboderp does some real heavy lifting, the support for llama.cpp / GGUF is on a completely different level, both financially and community-wise. Proposed solution: fundraiser to clone turboderp.
-
@Vhallo I agree on the latter points. No doubt hurdles larger than just quant coverage exist to reach GGUF's share; awareness and ease of use IMO are strong factors. I think the Exl2 format is easy to use, but awareness could be increased with better quant coverage. If @Anthonyg5005 is willing to share some of the automation he has done, I'd consider offering up some recurring compute during off hours to ensure we publish enough quants. While we know making quants is quite easy, I hope my initial post gives some insight into the mindset of a deployer with purely inference in mind navigating the hurdles. I feel it's possible most won't consider rolling their own quants, and I have a hunch these are the kind of folks that are applying pressure to the inference libraries for adoption. With better coverage, hopefully we can reframe deployers' decision process from feeling like they are making an availability decision to feeling like they are making a utility decision ("What is best for my use case?"), as it really should be.
-
I wouldn't worry too much about TGW right now. TabbyAPI provides an OpenAI-style API and has up-to-date support for the latest features, most notably continuous batching, concurrent streams and Q6/Q8 cache. Depending on your frontend of choice it's most likely just a better solution right now. Not entirely sure, but it might even work as a backend for TGW?

Anyway, I honestly don't think too much about widespread adoption. I feel like there are quite a lot of EXL2 models on HF, a lot of users, a lot of members on the Discord server. Can't really complain. The important thing for me is just being able to continue to make some meaningful contribution to open-source AI. I have some neat ideas I'm working on, and I fully expect them to either be forgotten if they don't work out, or incorporated into llama.cpp, Aphrodite or whatever if they do. So I can help with the overall effort that way. But I can't really afford to think in terms of what would make ExLlama more suitable for deployments, because there just aren't enough hours in a day for me to go down that route and continue to make improvements.
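Since TabbyAPI exposes an OpenAI-style API, most OpenAI-compatible clients should be able to talk to it directly. A minimal sketch of what that could look like; the base URL, port, API key and model name below are placeholder assumptions, check your own TabbyAPI config:

```python
# Minimal sketch: point the standard OpenAI client at a local TabbyAPI instance.
# The URL, key and model name are placeholders; adjust to your TabbyAPI setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",   # TabbyAPI's OpenAI-compatible endpoint (assumed port)
    api_key="your-tabby-api-key",          # whatever key is set in TabbyAPI's config
)

response = client.chat.completions.create(
    model="MyModel-7B-exl2-5.0bpw",        # the EXL2 model loaded by TabbyAPI (placeholder)
    messages=[{"role": "user", "content": "Hello! What backend are you running on?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```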
-
Could the current lack of comprehensive quant coverage on Hugging Face be a hurdle for ExLLaMAv2's widespread adoption? GGUF's easy availability and creation process might initially seem more attractive, overshadowing the potential of ExLLaMAv2's superior quality preservation.
From personal experience, I've spent time setting up an inference server, initially considering only GGUF-supporting inference libraries due to the abundance of GGUF quants on HF. This underscores a key point: many users rely on pre-quantized models rather than creating their own.
Observing the pressure in vLLM's issue tracker to support GGUF, I notice that ExLLaMAv2 isn't discussed as frequently. Yet could better quant coverage on Hugging Face tip the scale in favor of ExLLaMAv2 adoption? As users discover the benefits of ExLLaMAv2 over time, the demand for its inclusion in inference engines might well increase.