Replies: 4 comments 4 replies
-
I do have some scripts that make setup and quanting a lot simpler, including a version that can upload to repos automatically in a format similar to how turbo uploads his quants. This should make it much easier for anyone to automate the process and get quants up. Also, I think the biggest reason is that ExLlama is GPU-only while GGUF has CPU and Apple Silicon support, which allows people with less VRAM to use bigger models.
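For anyone who wants to roll their own automation along these lines, here's a rough sketch of what a quant-and-upload loop could look like. It assumes ExLlamaV2's `convert.py` and the `huggingface_hub` client; all paths, repo names and bitrates are placeholders, not the actual scripts mentioned above:

```python
# Hypothetical sketch: quantize a model with ExLlamaV2's convert.py and push each
# result to the Hugging Face Hub. Paths, repo names and bitrates are placeholders.
import subprocess
from pathlib import Path

from huggingface_hub import HfApi  # pip install huggingface_hub

MODEL_DIR = Path("models/MyModel-7B")       # unquantized HF model (assumed local)
CONVERT = Path("exllamav2/convert.py")      # ExLlamaV2 conversion script
BITRATES = [4.0, 5.0, 6.5]                  # bits per weight to produce

api = HfApi()  # uses the token from `huggingface-cli login`

for bpw in BITRATES:
    out_dir = Path(f"quants/MyModel-7B-{bpw}bpw-exl2")
    work_dir = Path(f"work/{bpw}")
    # Run the conversion; -cf writes a complete, ready-to-upload model folder.
    subprocess.run(
        ["python", str(CONVERT),
         "-i", str(MODEL_DIR),
         "-o", str(work_dir),
         "-cf", str(out_dir),
         "-b", str(bpw)],
        check=True,
    )
    repo_id = f"my-user/MyModel-7B-exl2-{bpw}bpw"  # placeholder repo name
    api.create_repo(repo_id, exist_ok=True)
    api.upload_folder(folder_path=str(out_dir), repo_id=repo_id)
```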
-
Gittb is right that integration is a serious issue. Especially as ooba is struggling with text-generation-webui lately: it's even still on v0.0.20 by default instead of v0.1.5, which doesn't help either. More quants might help a little, but there are already diligent people like Bartowski who upload a lot. Yet very few people ever use them, which goes back to limited integration, but also to lack of awareness. People often don't know what exl2 really is, what the potential benefits are, or how to use it. Above all though, let's not forget that while turboderp does some real heavy lifting, the support for llama.cpp / GGUF is on a completely different level, both financially and community-wise. Proposed solution: fundraiser to clone turboderp.
-
@Vhallo I agree on the latter points. No doubt hurdles larger than just quant coverage exist to reach GGUF's share; awareness and ease of use IMO are strong factors. I think the Exl2 format is easy to use, but awareness could be increased with better quant coverage. If @Anthonyg5005 is willing to share some of the automation he has done, I'd consider offering up some recurring compute during off hours to ensure we publish enough quants. While we know making quants is quite easy, I hope my initial post gives some insight into the mindset of a deployer with purely inference in mind navigating the hurdles. I feel it's possible most won't consider rolling their own quants, and I have a hunch these are the kind of folks that are applying pressure to the inference libraries for adoption. With better coverage, hopefully we can reframe deployers' decision process from feeling like they are making an availability decision to feeling like they are making a utility decision ("What is best for my use case?"), as it really should be.
-
I wouldn't worry too much about TGW right now. TabbyAPI provides an OpenAI-style API and has up-to-date support for the latest features, most notably continuous batching, concurrent streams and Q6/Q8 cache. Depending on your frontend of choice it's most likely just a better solution right now. Not entirely sure, but it might even work as a backend for TGW?

Anyway, I honestly don't think too much about widespread adoption. I feel like there are quite a lot of EXL2 models on HF, a lot of users, a lot of members on the Discord server. Can't really complain. The important thing for me is just being able to continue to make some meaningful contribution to open-source AI. I have some neat ideas I'm working on, and I fully expect them to either be forgotten if they don't work out, or incorporated into llama.cpp, Aphrodite or whatever if they do. So I can help with the overall effort that way. But I can't really afford to think in terms of what would make ExLlama more suitable for deployments, because there just aren't enough hours in a day for me to go down that route and continue to make improvements.
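Since TabbyAPI exposes an OpenAI-style API, most OpenAI-compatible clients should be able to talk to it directly. A minimal sketch of what that could look like; the base URL, port, API key and model name below are placeholder assumptions, check your own TabbyAPI config:

```python
# Minimal sketch: point the standard OpenAI client at a local TabbyAPI instance.
# The URL, key and model name are placeholders; adjust to your TabbyAPI setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",   # TabbyAPI's OpenAI-compatible endpoint (assumed port)
    api_key="your-tabby-api-key",          # whatever key is set in TabbyAPI's config
)

response = client.chat.completions.create(
    model="MyModel-7B-exl2-5.0bpw",        # the EXL2 model loaded by TabbyAPI (placeholder)
    messages=[{"role": "user", "content": "Hello! What backend are you running on?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```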
-
Could the current lack of comprehensive quant coverage on Hugging Face be a hurdle for ExLLaMAv2's widespread adoption? GGUF's easy availability and creation process might initially seem more attractive, overshadowing the potential of ExLLaMAv2's superior quality preservation.
From personal experience, I've spent time setting up an inference server, initially considering only GGUF-supporting inference libraries due to the abundance of GGUF quants on HF. This underscores a key point: many users rely on pre-quantized models rather than creating their own.
Observing the pressure in vLLM's issue tracker to support GGUF, I notice that ExLLaMAv2 isn't discussed as frequently. Yet could better quant coverage on Hugging Face tip the scale in favor of ExLLaMAv2 adoption? As users discover the benefits of ExLLaMAv2 over time, the demand for its inclusion in inference engines might well increase.