
[REQUEST] Sage Attention? Anyone tried it with exllama? #702

Open
3 tasks done

Ph0rk0z opened this issue Dec 21, 2024 · 2 comments

Comments


Ph0rk0z commented Dec 21, 2024

Problem

Sage attention is the new kid on the block. On SD it saves a lot of time, especially with larger outputs; I can shave a few seconds off SDXL and much more off video models. The library is simple to use and, with triton, appears to work on older cards too. It's maybe 10-15% faster than xformers, and they claim it beats flash attention as well. They have 4-bit and 8-bit attention, so the speedup on a 4090 is even bigger than on other cards.

Solution

Apply sage attention the same way xformers is used, as an alternative attention mechanism. I was going to try it myself since it's drop-in, but I wanted to put the idea out there to see if anyone else has already tried it and found that it ruins outputs or is otherwise worse.

Alternatives

No response

Explanation

A universal attention kernel, similar to flash attention, that claims benefits over it.

Examples

https://github.com/thu-ml/SageAttention

Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
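
For concreteness, the README linked above presents sageattn as a plug-and-play replacement for torch's scaled_dot_product_attention. A minimal sketch of a direct call, with argument names taken from that README (treat them as assumptions and check your installed version):

```python
# Minimal sketch of calling SageAttention directly, based on the thu-ml/SageAttention
# README. The argument names (tensor_layout, is_causal) are taken from that README and
# may differ between versions, so verify against the installed package.
import torch
from sageattention import sageattn

# Dummy tensors in (batch, heads, seq_len, head_dim) layout, i.e. tensor_layout="HND".
q = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")

# Intended as a drop-in replacement for torch.nn.functional.scaled_dot_product_attention.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=True)
```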

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
Ph0rk0z (Author) commented Dec 21, 2024

So I swapped it in for pytorch attention because that's the only one I could get working. With sageattn and pipeline parallel I get 570 t/s, while with tensor parallel flash attention I only get 411 t/s in prompt processing. No issues on Turing cards either.

Just don't know if it supports all the bells and whistles for caching, etc. Processing the same context faster than TP is something, right?
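
A rough sketch of the kind of swap described here (not the exact patch used in this comment; the wrapper below is hypothetical and only covers the plain torch-SDPA path, since sageattn takes no attention mask or dropout):

```python
# Hypothetical illustration of routing torch SDPA calls through SageAttention.
# Must run before the model code binds scaled_dot_product_attention to a local name.
import torch
import torch.nn.functional as F
from sageattention import sageattn

_torch_sdpa = F.scaled_dot_product_attention

def sdpa_with_sage(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, **kwargs):
    # Fall back to the stock kernel for anything sageattn doesn't handle
    # (masks, dropout, extra kwargs such as scale).
    if attn_mask is not None or dropout_p != 0.0 or kwargs:
        return _torch_sdpa(q, k, v, attn_mask=attn_mask, dropout_p=dropout_p,
                           is_causal=is_causal, **kwargs)
    # Torch SDPA's default (batch, heads, seq, head_dim) layout matches "HND".
    return sageattn(q, k, v, tensor_layout="HND", is_causal=is_causal)

F.scaled_dot_product_attention = sdpa_with_sage
```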

DocShotgun (Contributor) commented

It doesn't seem to support paged attention, so it would only be a suitable substitute for "fallback mode" and wouldn't support continuous batching.

> So I swapped it in for pytorch attention because that's the only one I could get working. With sageattn and pipeline parallel I get 570 t/s, while with tensor parallel flash attention I only get 411 t/s in prompt processing. No issues on Turing cards either.
>
> Just don't know if it supports all the bells and whistles for caching, etc. Processing the same context faster than TP is something, right?

TP actually hurts prompt processing speed in exllamav2 - so a fairer performance comparison would be non-TP vs SageAttention2.
