Problem
Sage attention is the new boy on the block. On SD it saves a lot of time, especially with larger outputs: I can shave a few seconds off SDXL and much more off video models. The library is simple to use, and with Triton it appears to work on older cards too. It's maybe 10-15% faster than xformers, and they claim to beat flash attention as well. They have 4- and 8-bit attention, so the speedup is even bigger on a 4090 than on other cards.
Solution
Apply sage attention just like xformers, as an alternative attention mechanism. I was going to try it myself since it's drop-in, but I wanted to put the idea out there to see if anyone else has tried it already and found it to ruin outputs or be worse. A rough sketch of what I mean by "drop-in" is below.
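If it helps, here's roughly what I had in mind: a small monkey-patch that routes PyTorch SDPA calls through sageattn and falls back to the stock kernel when a mask, dropout, or custom scale is involved. The sageattn(q, k, v, tensor_layout=..., is_causal=...) call is just my reading of the repo's README, so treat this as an untested sketch rather than a recipe.

```python
# Untested sketch: route torch SDPA through SageAttention where possible.
# Assumes the sageattn(q, k, v, tensor_layout="HND", is_causal=False) entry
# point described in the thu-ml/SageAttention README; argument names may
# differ between releases.
import torch
import torch.nn.functional as F
from sageattention import sageattn

_orig_sdpa = F.scaled_dot_product_attention

def sdpa_with_sage(query, key, value, attn_mask=None, dropout_p=0.0,
                   is_causal=False, scale=None):
    # As far as I can tell SageAttention has no mask/dropout/custom-scale path,
    # so keep the stock kernel for those cases.
    if attn_mask is not None or dropout_p != 0.0 or scale is not None:
        return _orig_sdpa(query, key, value, attn_mask=attn_mask,
                          dropout_p=dropout_p, is_causal=is_causal, scale=scale)
    # Inputs are assumed to be (batch, heads, seq_len, head_dim), i.e. "HND".
    return sageattn(query, key, value, tensor_layout="HND", is_causal=is_causal)

# Patch before the model/pipeline is built so its attention blocks pick it up.
F.scaled_dot_product_attention = sdpa_with_sage
```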
Alternatives
No response
Explanation
Universal attention that's like flash attention and claims benefits over it.
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Examples
https://github.com/thu-ml/SageAttention
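For reference, the basic call looks something like this, going by my reading of the README (keyword names and supported shapes/dtypes may vary by version; this sketch assumes fp16 CUDA tensors and head dim 64):

```python
# Rough usage sketch from my reading of the thu-ml/SageAttention README;
# keyword names and supported shapes may differ across versions.
import torch
from sageattention import sageattn

# (batch, heads, seq_len, head_dim) -- the "HND" layout.
q = torch.randn(2, 16, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 16, 4096, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 16, 4096, 64, dtype=torch.float16, device="cuda")

# The kernel quantizes Q/K internally and returns a regular fp16 output,
# so the surrounding model code shouldn't need to change.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
print(out.shape)  # torch.Size([2, 16, 4096, 64])
```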
Additional context
No response
Acknowledgements
I have looked for similar requests before submitting this one.
I understand that the developers have lives and my issue will be answered when possible.
I understand the developers of this program are human, and I will make my requests politely.
So I swapped it in for pytorch attention because that's the only one I could get to work. With sageattn and pipeline parallel I get 570 t/s in prompt processing, while with tensor parallel and flash attention I only get 411 t/s. No issues on Turing cards either.
Just don't know if it supports all the bells and whistles for caching, etc. Processing the same context faster than TP is something, right?
It doesn't seem to support paged attention, so it would only be a suitable substitute for "fallback mode" and wouldn't support continuous batching.
TP actually hurts prompt processing speed in exllamav2 - so a fairer performance comparison would be non-TP vs SageAttention2.