[CUDA] FusedMHARunnerFP16v2 thread-safe #21420

tianleiwu · 2024-07-19T16:34:09Z

Description

Rewrite FusedMHARunnerFP16v2 to make it thread-safe.
Add multi-threading tests

Previously, the kernel parameters (params) were stored as a member of the MHA runner, so different threads could modify them at the same time and affect one another.

For example, if batch_size and seq_len were changed to larger values by another thread in setup(...), a buffer overrun could happen in run(...), because the kernel could read or write memory outside the allocated buffers.
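The hazard described above can be sketched without a GPU. The snippet below is only an illustration under stated assumptions: `Params`, `SharedStateRunner`, and `interleaving_overruns_buffer` are hypothetical names, not the real onnxruntime types, and the bad interleaving is replayed on a single thread so the result is deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel parameters; this is NOT the real
// Fused_multihead_attention_params_v2 layout.
struct Params {
  int batch_size = 0;
  int seq_len = 0;
};

// Hypothetical runner that keeps params as a mutable member, as in the old design.
class SharedStateRunner {
 public:
  void setup(int seq_len, int batch_size) {
    params_.seq_len = seq_len;
    params_.batch_size = batch_size;
  }

  // Number of elements a kernel launched with the current params would touch.
  std::size_t elements_touched() const {
    return static_cast<std::size_t>(params_.batch_size) *
           static_cast<std::size_t>(params_.seq_len);
  }

 private:
  Params params_;  // shared by every caller of this runner
};

// Replays one bad interleaving: caller A sets up and allocates, caller B
// overwrites the shared params, then A "launches" with B's sizes.
// Returns true if A's launch would go past A's buffer.
bool interleaving_overruns_buffer() {
  SharedStateRunner runner;

  runner.setup(/*seq_len=*/128, /*batch_size=*/2);         // caller A
  std::vector<float> buffer_a(runner.elements_touched());  // sized for A
  assert(buffer_a.size() == 256);

  runner.setup(/*seq_len=*/512, /*batch_size=*/8);         // caller B

  // A would now touch 4096 elements of a 256-element buffer.
  return runner.elements_touched() > buffer_a.size();
}
```

Calling `interleaving_overruns_buffer()` returns true: the element count seen at launch time no longer matches the buffer that the first caller allocated.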

The new implementation changes the API and removes mutable member variables to make the runner thread-safe. Below is a summary of the change:

Before:

class FusedMHARunnerFP16v2::mhaImpl {
   void setup(int seq_len, int batch_size) {
      // change scalar params
   }

   void run(input, output) {
      // change params for input and output pointers
      // launch kernel using params
   }

   Fused_multihead_attention_params_v2 params; // mutable, not thread-safe
};

After:

class FusedMHARunnerFP16v2::FmhaImpl {
   void setup(int seq_len, int batch_size, Fused_multihead_attention_params_v2& params) {
      // change params
   }

   void run(params, input, output) {
      // change params with input and output pointers
      // launch kernel using params
   }
};
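As a rough sketch of why the "After" shape is thread-safe, the example below passes a caller-owned params object into `setup` and `run`, so one runner can be shared across threads. Everything here is an assumption for illustration: `Params`, `StatelessRunner`, and `concurrent_runs_are_correct` are hypothetical, and a CPU loop stands in for the kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the kernel parameters (not the real struct).
struct Params {
  int batch_size = 0;
  int seq_len = 0;
  const float* input = nullptr;
  float* output = nullptr;
};

// Hypothetical runner with no mutable members: all per-call state lives in
// the Params object supplied by the caller.
class StatelessRunner {
 public:
  void setup(int seq_len, int batch_size, Params& params) const {
    params.seq_len = seq_len;
    params.batch_size = batch_size;
  }

  // A CPU loop stands in for the kernel launch: output[i] = 2 * input[i].
  void run(Params& params, const float* input, float* output) const {
    params.input = input;
    params.output = output;
    const std::size_t n = static_cast<std::size_t>(params.batch_size) *
                          static_cast<std::size_t>(params.seq_len);
    for (std::size_t i = 0; i < n; ++i) {
      params.output[i] = params.input[i] * 2.0f;
    }
  }
};

// Shares one runner across num_threads threads, each with its own shape and
// its own Params, then checks every output on the calling thread.
bool concurrent_runs_are_correct(int num_threads) {
  const StatelessRunner runner;
  std::vector<std::vector<float>> outputs(num_threads);
  std::vector<std::thread> threads;

  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&runner, &outputs, t] {
      const int seq_len = 16 * (t + 1);
      const int batch_size = t + 1;
      std::vector<float> input(
          static_cast<std::size_t>(seq_len) * batch_size, 1.0f);
      outputs[t].assign(input.size(), 0.0f);  // each thread owns outputs[t]

      Params params;  // local to this call, never shared between threads
      runner.setup(seq_len, batch_size, params);
      runner.run(params, input.data(), outputs[t].data());
    });
  }
  for (auto& th : threads) th.join();

  for (int t = 0; t < num_threads; ++t) {
    const std::size_t expected =
        static_cast<std::size_t>(16 * (t + 1)) * (t + 1);
    if (outputs[t].size() != expected) return false;
    for (float v : outputs[t]) {
      if (v != 2.0f) return false;
    }
  }
  return true;
}
```

Because each thread builds its own `Params` on the stack, no thread can observe another thread's batch_size or seq_len, which is the property the refactoring is after. On some toolchains, building code that uses `std::thread` may require the `-pthread` flag.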

Motivation and Context

#18854
#21413

onnxruntime/test/python/transformers/test_mha.py

+def parity_check_mha_multi_threading(
+    test_inputs: List[Dict],
+    rtol: float = 1e-3,
+    atol: float = 1e-3,
+    sdpa_kernel: int = SdpaKernel.DEFAULT,
+    max_threads: int = 5,
+    verbose: bool = False,
+):



FusedMHARunnerFP16v2 thread safe

c2ae4d7

tianleiwu marked this pull request as draft July 19, 2024 16:34

tianleiwu added 3 commits July 20, 2024 16:11

refactoring

b3897ad

merge main

f9fc075

multi thread tests for MultiHeadAttention

6f399a6

github-advanced-security bot found potential problems Jul 21, 2024


limit sm

1f6738e

tianleiwu marked this pull request as ready for review July 22, 2024 07:06

tianleiwu requested review from wangyems, yufenglee and kunal-vaishnavi July 22, 2024 17:20

kunal-vaishnavi approved these changes Jul 22, 2024


tianleiwu merged commit a6c5e2c into main Jul 22, 2024
95 of 98 checks passed

tianleiwu deleted the tlwu/thread_safe_attention branch July 22, 2024 17:41

tianleiwu mentioned this pull request Sep 30, 2024

FusedMHARunnerFP16v2 will cause onnxruntime coredump when multi-host-threads run session.run() #22262

Closed
