add split3inner #19886

Merged
merged 15 commits into main from kailums-dev/split3inner
Jun 27, 2024

Conversation

kailums
Contributor

@kailums kailums commented Mar 13, 2024

Description

The Split op uses pinned memory when splitting into different sizes.

But pinned memory is not compatible with CUDA graph capture.

This PR adds a new implementation for the transformer scenario only: it splits qkv_proj into q, k, and v without using pinned memory.
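
As a rough illustration of the idea (a minimal sketch with hypothetical names, not the actual kernel added by this PR): with the output count fixed at three, the output pointers can be passed directly as kernel arguments, so no host-side pointer table and no pinned-memory staging buffer are needed, and the launch stays CUDA-graph friendly.

// Hedged sketch (hypothetical names): split the last dimension of `input`
// into three fixed outputs, e.g. qkv_proj -> q, k, v. Because there are
// exactly three outputs, the output pointers are plain kernel arguments.
__global__ void Split3InnerSketch(const float* input,
                                  float* q, float* k, float* v,
                                  int64_t outer, int64_t q_size, int64_t k_size, int64_t v_size) {
  const int64_t inner = q_size + k_size + v_size;
  const int64_t total = outer * inner;
  const int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < total; i += stride) {
    const int64_t row = i / inner;   // index into the outer dimensions
    const int64_t col = i % inner;   // position within the concatenated inner dim
    if (col < q_size) {
      q[row * q_size + col] = input[i];
    } else if (col < q_size + k_size) {
      k[row * k_size + (col - q_size)] = input[i];
    } else {
      v[row * v_size + (col - q_size - k_size)] = input[i];
    }
  }
}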

Motivation and Context

@wangyems
Contributor

I think the Split op introduces unnecessary data copies; it can always be done along with the BNSH transpose.

@kailums
Contributor Author

kailums commented Jun 18, 2024

I think the Split op introduces unnecessary data copies; it can always be done along with the BNSH transpose.

Regarding the use of pinned memory: it is mainly because the split sizes vary, so the number of outputs varies. The op therefore needs a vector holding all the output pointers, and that vector has to live in CUDA memory, so it stores the pointers in a pinned-memory vector and then copies them into a temporary CUDA vector.

Here I only handle the case where the number of outputs is 3, so I don't need the pinned memory or the temporary CUDA vector; I just use 3 output parameters.
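
For contrast, a rough sketch of what the generic N-output staging described above might look like (hypothetical helper names, not the actual onnxruntime code): the runtime-sized pointer table is written to pinned host memory and then copied to a temporary device buffer that the split kernel indexes.

#include <cuda_runtime.h>
#include <vector>

// Hedged sketch of the generic N-output path that the 3-output case avoids
// (hypothetical helper). `pinned_table` is assumed to come from cudaMallocHost
// and `device_table` from cudaMalloc, each large enough for outputs.size() entries.
cudaError_t StageOutputPointers(const std::vector<float*>& outputs,
                                float** pinned_table,
                                float** device_table,
                                cudaStream_t stream) {
  // 1) Write the runtime-sized list of output pointers into pinned host memory.
  for (size_t i = 0; i < outputs.size(); ++i) {
    pinned_table[i] = outputs[i];
  }
  // 2) Copy the pointer table to device memory so the split kernel can index it.
  //    Per the PR description, this pinned-memory staging is what blocks CUDA
  //    graph usage for the generic Split path.
  return cudaMemcpyAsync(device_table, pinned_table,
                         outputs.size() * sizeof(float*),
                         cudaMemcpyHostToDevice, stream);
}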

@tianleiwu
Contributor

@kailums, is it better to support packed QKV format in attention operators so that there is no need to Split?

Currently Attention/MultiHeadAttention supports packed qkv format. We can also support it in other operators if needed.

@kailums
Contributor Author

kailums commented Jun 19, 2024

@kailums, is it better to support packed QKV format in attention operators so that there is no need to Split?

Currently Attention/MultiHeadAttention supports packed qkv format. We can also support it in other operators if needed.

Yes, most attention ops support packed QKV, but there are still scenarios where the exported ONNX model contains a Split, and that prevents it from using CUDA graphs.

Our scenario is vLLM + ORT; the paged attention op doesn't support packed QKV yet, so we have a Split op.

From a more general perspective, I think this Split op should avoid using pinned memory, but I haven't figured out how to implement a general version of Split that supports any number of split outputs without using pinned memory.

@tianleiwu
Contributor

Is there an existing test case that covers the new code? If not, please add a new test case.

@kailums kailums merged commit a1bbfeb into main Jun 27, 2024
92 of 98 checks passed
@kailums kailums deleted the kailums-dev/split3inner branch June 27, 2024 10:53
using ShapeAndDataT = ShapeAndData<uint8_t>;
std::vector<ShapeAndDataT> outputs;
int64_t num_outputs = -1;  // when split_sizes is provided, num_outputs should not be provided
const int batch = 16;

Check warning (Code scanning / PREfast):

The const variable 'batch' can be computed at compile-time. Consider using constexpr (con.5).

std::vector<ShapeAndDataT> outputs;
int64_t num_outputs = -1;  // when split_sizes is provided, num_outputs should not be provided
const int batch = 16;
const int data_len = 96; // should be multiple of 3

Check warning (Code scanning / PREfast):

The const variable 'data_len' can be computed at compile-time. Consider using constexpr (con.5).
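
The fix PREfast is asking for in both cases is mechanical: mark the compile-time constants constexpr, for example:

// PREfast con.5: values known at compile time should be constexpr.
constexpr int batch = 16;
constexpr int data_len = 96;  // should be multiple of 3
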
@cloudhan
Contributor

cloudhan commented Jul 2, 2024

This PR causes the ROCm pipeline unit test SplitOperatorTest.Axis2EqualSplit to start timing out. A fix or revert is desired.

cloudhan added a commit that referenced this pull request Jul 2, 2024