
[js/webgpu] Optimize Gather op #17625

Merged
merged 1 commit into microsoft:main on Sep 22, 2023

Conversation

@qjia7 (Contributor) commented Sep 20, 2023

Description

This PR optimizes the Gather op, improving the Segment Anything model by ~6 ms on ADL.
The problem with the original algorithm is that each shader invocation runs a for loop over a whole block of data. However, the block size may be very large, e.g. 65536. In a GPU shader we should avoid such large loops and instead use more threads to do the work in parallel (see the sketch after the profiling numbers below).

Before:

[profiling] kernel "41771992|[Gather] 41771992" input[0]: [4,65536] | float32, input[1]: [1] | int64, output[0]: [1,65536] | float32, execution time: 6886207 ns

After:

[profiling] kernel "41771992|[Gather] 41771992" input[0]: [4,65536] | float32, input[1]: [1] | int64, output[0]: [1,65536] | float32, execution time: 11719 ns
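
For illustration, here is a minimal sketch of the parallel pattern the description refers to; it is not the actual ONNX Runtime kernel. It assumes a 2-D `[rows, blockSize]` float32 input gathered on axis 0, and the `Uniforms` struct, `blockSize`, and `outputSize` names are invented for the example.

```wgsl
// Hypothetical sketch only, not the actual ONNX Runtime shader.
// Gather on axis 0 of a [rows, blockSize] float32 input: every invocation
// copies exactly one output element instead of looping over blockSize items.

@group(0) @binding(0) var<storage, read> data : array<f32>;
@group(0) @binding(1) var<storage, read> indices : array<i32>;
@group(0) @binding(2) var<storage, read_write> output : array<f32>;

struct Uniforms {
  blockSize : u32,   // elements per gathered row (e.g. 65536)
  outputSize : u32,  // total number of output elements
};
@group(0) @binding(3) var<uniform> uniforms : Uniforms;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i >= uniforms.outputSize) {
    return;
  }
  // Map the flat output index to (gathered row, offset within the block).
  let row = i / uniforms.blockSize;
  let offset = i % uniforms.blockSize;
  let srcRow = u32(indices[row]);
  output[i] = data[srcRow * uniforms.blockSize + offset];
}
```

Dispatching roughly `outputSize / 64` workgroups spreads the copy across many threads, which is the parallelization the PR describes.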

@qjia7 (Contributor, Author) commented Sep 20, 2023

@guschmue @fs-eire @satyajandhyala Please take a look, thanks.

@qjia7 (Contributor, Author) commented Sep 20, 2023

cc @dakenf

@dakenf (Contributor) commented Sep 20, 2023

LGTM

@fs-eire (Contributor) commented Sep 22, 2023

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@fs-eire (Contributor) commented Sep 22, 2023

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

1 similar comment from @azure-pipelines.

@fs-eire merged commit 891fba3 into microsoft:main on Sep 22, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024
siweic0 pushed a commit to siweic0/onnxruntime-web that referenced this pull request May 9, 2024