Description
cuDNN SDPA is disabled by default. To enable it, the following are required:
(1) cuDNN 9.3 or a newer version must be installed.
(2) Set the environment variable ORT_ENABLE_CUDNN_FLASH_ATTENTION=1, or set the sdpa_kernel=8 CUDA provider option (see the sketch below).
(3) It only works on devices with compute capability >= 8.0.
Note that some combinations of parameters might be rejected due to limited support for certain head dimensions or sequence lengths.
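As an illustration, here is a minimal Python sketch of the two enablement paths described above. The model path and session setup are placeholders; only ORT_ENABLE_CUDNN_FLASH_ATTENTION and the sdpa_kernel provider option come from this change.

```python
import os

# Option A: environment variable (set before the CUDA execution provider is created).
os.environ["ORT_ENABLE_CUDNN_FLASH_ATTENTION"] = "1"

import onnxruntime as ort

# Option B: CUDA provider option; 8 selects the cuDNN flash attention kernel.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to a model that uses MultiHeadAttention
    providers=[
        ("CUDAExecutionProvider", {"sdpa_kernel": 8}),
        "CPUExecutionProvider",
    ],
)
```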
Future Works:
(1) FP8 APIs. Currently, only APIs for FP16 and BF16 are exposed in cudnn_flash_attention.h.
(2) Add API to support ragged batching (padding removed in inputs).
(3) Support other input formats (like QKV_BS3NH).
(4) Currently, q is converted to BSNH, while k/v are converted to either BSNH or BNSH format. Some experiments could show whether converting q to BNSH is better in some cases (the layouts are sketched below).
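For reference, the layout acronyms follow the usual attention convention (B = batch, S = sequence length, N = number of heads, H = head size). A small sketch of the conversion, with illustrative shapes only:

```python
import numpy as np

B, S, N, H = 2, 128, 16, 64  # illustrative sizes

q_bsnh = np.zeros((B, S, N, H), dtype=np.float16)  # BSNH: (batch, sequence, num_heads, head_size)
q_bnsh = q_bsnh.transpose(0, 2, 1, 3)              # BNSH: (batch, num_heads, sequence, head_size)
```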
Example Benchmark Results on H100
The following tests use the FP16 MultiHeadAttention operator without attention mask or attention bias.
Test Setting 1
Test Setting 2