
[REQUEST] Sage Attention? Anyone tried it with exllama? #702

Open
3 tasks done

Ph0rk0z opened this issue Dec 21, 2024 · 2 comments

Comments


Ph0rk0z commented Dec 21, 2024

Problem

Sage attention is the new kid on the block. On SD it saves a lot of time, especially with larger outputs; I can shave a few seconds off SDXL and much more off video models. The library is simple to use and, with triton, appears to work on older cards too. It's maybe 10-15% faster than xformers, and they claim it beats flash attention as well. They have 4-bit and 8-bit attention, so the speedup on a 4090 is even bigger than on other cards.

Solution

Apply sage attention the same way xformers is used, as an alternative attention mechanism. I was going to try it myself since it's drop-in, but I wanted to put the idea out there to see if anyone else has already tried it and found that it ruins outputs or is otherwise worse.

Alternatives

No response

Explanation

A universal attention kernel, similar to flash attention, that claims benefits over it.

Examples

https://github.com/thu-ml/SageAttention

Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
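
For concreteness, the README linked above presents sageattn as a plug-and-play replacement for torch's scaled_dot_product_attention. A minimal sketch of a direct call, with argument names taken from that README (treat them as assumptions and check your installed version):

```python
# Minimal sketch of calling SageAttention directly, based on the thu-ml/SageAttention
# README. The argument names (tensor_layout, is_causal) are taken from that README and
# may differ between versions, so verify against the installed package.
import torch
from sageattention import sageattn

# Dummy tensors in (batch, heads, seq_len, head_dim) layout, i.e. tensor_layout="HND".
q = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")

# Intended as a drop-in replacement for torch.nn.functional.scaled_dot_product_attention.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=True)
```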

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
Ph0rk0z (Author) commented Dec 21, 2024

So I swapped it in for pytorch attention because that's the only one I could get working. With sageattn and pipeline parallel I get 570 t/s, while with tensor parallel flash attention I only get 411 t/s in prompt processing. No issues on Turing cards either.

Just don't know if it supports all the bells and whistles for caching, etc. Processing the same context faster than TP is something, right?
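
A rough sketch of the kind of swap described here (not the exact patch used in this comment; the wrapper below is hypothetical and only covers the plain torch-SDPA path, since sageattn takes no attention mask or dropout):

```python
# Hypothetical illustration of routing torch SDPA calls through SageAttention.
# Must run before the model code binds scaled_dot_product_attention to a local name.
import torch
import torch.nn.functional as F
from sageattention import sageattn

_torch_sdpa = F.scaled_dot_product_attention

def sdpa_with_sage(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, **kwargs):
    # Fall back to the stock kernel for anything sageattn doesn't handle
    # (masks, dropout, extra kwargs such as scale).
    if attn_mask is not None or dropout_p != 0.0 or kwargs:
        return _torch_sdpa(q, k, v, attn_mask=attn_mask, dropout_p=dropout_p,
                           is_causal=is_causal, **kwargs)
    # Torch SDPA's default (batch, heads, seq, head_dim) layout matches "HND".
    return sageattn(q, k, v, tensor_layout="HND", is_causal=is_causal)

F.scaled_dot_product_attention = sdpa_with_sage
```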

DocShotgun (Contributor) commented

It doesn't seem to support paged attention, so it would only be a suitable substitute for "fallback mode" and wouldn't support continuous batching.

> So I swapped it in for pytorch attention because that's the only one I could get working. With sageattn and pipeline parallel I get 570 t/s, while with tensor parallel flash attention I only get 411 t/s in prompt processing. No issues on Turing cards either.
>
> Just don't know if it supports all the bells and whistles for caching, etc. Processing the same context faster than TP is something, right?

TP actually hurts prompt processing speed in exllamav2 - so a fairer performance comparison would be non-TP vs SageAttention2.
