add split3inner #19886
Conversation
I think that the Split op introduces unnecessary data copies. It can always be done together with the BNSH transpose.
Regarding the use of pinned memory: the split sizes vary, so the number of outputs varies, and a vector is needed to hold all the split output pointers. That vector has to be visible to the kernel in CUDA memory, so the current implementation stages the output pointers in a pinned-memory vector and then copies it into a temporary CUDA vector. Here I only handle the case where the number of outputs is 3, so there is no need for pinned memory or a temporary CUDA vector; the kernel just takes 3 output parameters.
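A minimal sketch of the staging described above, assuming a float Split with N outputs; the function name `LaunchGenericSplit` and the commented-out `SplitGenericKernel` are hypothetical, not the actual ONNX Runtime code:

```cpp
// Sketch of the general N-output path: output pointers are staged in a pinned
// host buffer and then copied to a temporary device buffer that a generic
// split kernel would read.
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

void LaunchGenericSplit(const float* input, std::vector<float*>& outputs,
                        cudaStream_t stream) {
  const size_t bytes = outputs.size() * sizeof(float*);

  // Pinned host staging buffer holding one pointer per split output.
  float** host_ptrs = nullptr;
  cudaMallocHost(&host_ptrs, bytes);
  std::copy(outputs.begin(), outputs.end(), host_ptrs);

  // Temporary device copy of the pointer table for the kernel to consume.
  float** device_ptrs = nullptr;
  cudaMalloc(&device_ptrs, bytes);
  cudaMemcpyAsync(device_ptrs, host_ptrs, bytes,
                  cudaMemcpyHostToDevice, stream);

  // SplitGenericKernel<<<grid, block, 0, stream>>>(input, device_ptrs, ...);

  cudaStreamSynchronize(stream);  // wait before releasing the staging buffers
  cudaFreeHost(host_ptrs);
  cudaFree(device_ptrs);
}
```

With exactly three outputs this staging disappears: the three output pointers can be passed directly as kernel arguments, as in the sketch under Description below.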
@kailums, is it better to support the packed QKV format in the attention operators so that there is no need to Split? Currently Attention/MultiHeadAttention support the packed QKV format. We can also support it in other operators if needed.
Yes, most attention ops support packed QKV, but there are still scenarios where the exported ONNX model has a Split in it, and this prevents it from using CUDA graph. Our scenario is vLLM + ORT; the paged attention op doesn't support packed QKV yet, so we have a Split op. From a more general perspective, this Split op should avoid using pin_memory, but I haven't figured out how to implement a general version of Split that supports arbitrary split output sizes without using pin_memory.
Is there a test case that covers the new code? If not, please add one.
using ShapeAndDataT = ShapeAndData<uint8_t>;
std::vector<ShapeAndDataT> outputs;
int64_t num_outputs = -1;  // when split_sizes is provided, num_outputs should not be provided
const int batch = 16;
Check warning: Code scanning / PREfast
The const variable 'batch' can be computed at compile-time. Consider using constexpr (con.5).
const int data_len = 96;  // should be multiple of 3
Check warning: Code scanning / PREfast
The const variable 'data_len' can be computed at compile-time. Consider using constexpr (con.5).
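One possible way to address both warnings, assuming nothing else needs these to be mutable or address-taken at runtime:

```cpp
// Both values are known at compile time, so constexpr expresses that directly
// and satisfies the con.5 guideline flagged by PREfast.
constexpr int batch = 16;
constexpr int data_len = 96;  // should be a multiple of 3
```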
This PR causes the ROCm pipeline unit tests to start timing out on ...
Description
The Split op uses pin_memory when splitting into different sizes, but pin_memory is not compatible with CUDA graph.
This change adds a new implementation for the transformer scenario only: it splits qkv_proj into q, k, and v without using pin_memory.
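For illustration, a minimal sketch of a fixed three-output split along the inner dimension; this is not the PR's actual kernel, and the names (`Split3InnerKernel`, `q_size`/`k_size`/`v_size`) are assumptions:

```cpp
// Sketch only: with exactly three outputs, the per-output pointers and sizes
// are plain kernel arguments, so no pinned-memory pointer table is needed and
// the launch can be captured in a CUDA graph.
#include <cuda_runtime.h>

template <typename T>
__global__ void Split3InnerKernel(const T* input,    // [rows, q_size + k_size + v_size]
                                  T* q, T* k, T* v,  // the three fixed outputs
                                  int q_size, int k_size, int v_size,
                                  int rows) {
  const int inner = q_size + k_size + v_size;
  // Grid-stride loop keeps the launch configuration independent of tensor size.
  for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
       idx < rows * inner;
       idx += gridDim.x * blockDim.x) {
    const int row = idx / inner;
    const int col = idx % inner;
    if (col < q_size) {
      q[row * q_size + col] = input[idx];
    } else if (col < q_size + k_size) {
      k[row * k_size + (col - q_size)] = input[idx];
    } else {
      v[row * v_size + (col - q_size - k_size)] = input[idx];
    }
  }
}
```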
Motivation and Context