
[js/webgpu] Refactor timestamp-query and introduce timestamp-query-inside-passes #18894

Merged
merged 11 commits into microsoft:main Jan 13, 2024

Conversation


@gyagp gyagp commented Dec 20, 2023

We submit kernels in batches (a fixed size of 16 is used, except for the last batch) for better performance. However, timestamp-query support works at the pass level, so the previous implementation disabled batched execution in profiling mode. In fact, a batch can contain multiple passes, so batched execution no longer needs to be disabled; that is the first enhancement of this PR.
Furthermore, WebGPU has an extension that supports timestamp queries inside passes, which is not available on all platforms (e.g., Windows supports it, while macOS doesn't). It is expected to have lower cost than the multiple-passes solution, so this PR also introduces that support where available.
This PR also refactors some of the implementation related to kernelInfo and unifies the related kernel names.
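The batching described above can be sketched as a plain grouping step. This is an illustrative helper, not ORT's actual API; `MAX_DISPATCH_NUMBER` and `chunkKernels` are hypothetical names standing in for the backend's real batching state.

```typescript
// Hypothetical sketch of the batched dispatch described in the PR: kernels are
// grouped into fixed-size batches (16 here, the size mentioned above); the last
// batch holds whatever remains. With this PR, a batch may contain several
// compute passes, so profiling no longer has to fall back to unbatched execution.
const MAX_DISPATCH_NUMBER = 16;

function chunkKernels<T>(kernels: T[], batchSize: number = MAX_DISPATCH_NUMBER): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < kernels.length; i += batchSize) {
    batches.push(kernels.slice(i, i + batchSize));
  }
  return batches;
}
```

For example, 35 queued kernels would be submitted as batches of 16, 16, and 3.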

@gyagp gyagp closed this Dec 20, 2023
@gyagp gyagp reopened this Dec 23, 2023
@gyagp gyagp marked this pull request as draft December 23, 2023 02:55
@gyagp gyagp changed the title [js/webgpu] Experiment on timestamp query between atPasses and insidePasses [js/webgpu] Introduce timestamp-query-inside-passes Dec 23, 2023

gyagp commented Dec 23, 2023

I ran some tests on Windows to compare timestamp query at pass boundaries with timestamp query inside passes, and didn't find an obvious difference. More details can be found at https://docs.google.com/document/d/1eAavWUvp2YdvfiR1a2kpUH_BvTjF2APwCuYY18l7G9g/edit.
In a later discussion, WebGPU folks told me that a compute pass on Windows is implemented as a no-op, so the observation above is expected. On other platforms, such as Vulkan, there might be a difference: timestamp query inside a pass is expected to have lower cost there, but it doesn't come totally free (the order of commands must be preserved, so some optimizations can't be as aggressive). We still need to understand the difference between the two better, so it's worth keeping both paths. Please note that the two solutions can be unified, so supporting timestamp query inside passes doesn't add much code.
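Keeping both paths means picking one at startup based on what the adapter exposes. A minimal sketch of that selection (not ORT's actual code; `chooseTimestampMode` is a hypothetical helper): `'timestamp-query'` is the standard WebGPU feature used for per-pass `timestampWrites`, and `'chromium-experimental-timestamp-query-inside-passes'` is the Chromium/Dawn feature name for in-pass queries.

```typescript
// Which timestamp-query path to use, given the features the adapter reports.
type TimestampMode = 'inside-passes' | 'at-passes' | 'none';

function chooseTimestampMode(adapterFeatures: ReadonlySet<string>): TimestampMode {
  if (adapterFeatures.has('chromium-experimental-timestamp-query-inside-passes')) {
    // In-pass queries are expected to be cheaper where available (e.g., Windows).
    return 'inside-passes';
  }
  if (adapterFeatures.has('timestamp-query')) {
    // Fall back to one query pair per pass via timestampWrites (e.g., macOS).
    return 'at-passes';
  }
  // No timestamp support at all: GPU profiling is disabled.
  return 'none';
}
```

In real code the set would come from `adapter.features`; here it is passed in so the decision logic stands alone.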

…Passes

This experiments with the two timestamp-query solutions, atPasses and
insidePasses, to see whether insidePasses has obviously lower cost.
@gyagp gyagp marked this pull request as ready for review December 25, 2023 06:30

gyagp commented Dec 25, 2023

@qjia, can you take a first look?


@qjia7 qjia7 left a comment


LGTM with some nits.

Review comments on: js/web/lib/wasm/jsep/webgpu/program-manager.ts, js/web/lib/wasm/jsep/backend-webgpu.ts, js/web/lib/wasm/jsep/init.ts

@qjia7 qjia7 left a comment


LGTM with some nits.
@fs-eire @guschmue @satyajandhyala Please take a look, thanks.

Review comments on: js/web/lib/wasm/jsep/backend-webgpu.ts, js/web/lib/wasm/jsep/init.ts
@gyagp gyagp changed the title [js/webgpu] Introduce timestamp-query-inside-passes [js/webgpu] Refactor timestamp-query and introduce timestamp-query-inside-passes Dec 27, 2023
Review comments on: js/web/lib/wasm/jsep/backend-webgpu.ts, js/web/lib/wasm/jsep/init.ts

gyagp commented Jan 4, 2024

@qjia7 @fs-eire @guschmue @satyajandhyala
I just merged in the recent changes, including the trace framework. Based on it, I added full trace support in this PR, especially for the GPU timeline. Please take another look!


gyagp commented Jan 11, 2024

@fs-eire, thanks for the comments! I addressed all of them and pulled in the latest code. Please take another look!

fs-eire
fs-eire previously approved these changes Jan 11, 2024

fs-eire commented Jan 11, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline


fs-eire commented Jan 11, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Android CI Pipeline


Azure Pipelines successfully started running 9 pipeline(s).


gyagp commented Jan 12, 2024

The IO-binding tests failed: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1264363&view=logs&j=4cf9212c-8936-5e77-cbdb-290c1c5567eb&t=8a4f1cc9-2423-5c1e-71f3-5ad7e67ee036

The root cause is that I intentionally removed endComputePass() in flush() to avoid duplication. However, in io-binding, runAsync() calls flush() without ending the GPUComputePassEncoder. Adding endComputePass() back fixes the issue; the local tests below pass now.
npm test -- suite1 -b=webgpu --io-binding=gpu-location
npm test -- suite1 -b=webgpu --io-binding=gpu-tensor

I also renamed the function writeTimeStamp() to writeTimestamp().
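The fix described above can be sketched with a tiny state tracker. `PassTracker` is a hypothetical stand-in for the backend's real encoder state; it only models the open/closed pass bookkeeping, not actual GPU work.

```typescript
// Sketch of the restored behavior: flush() ends any still-open compute pass
// before submitting, so callers like io-binding's runAsync() that flush without
// explicitly closing the GPUComputePassEncoder stay correct.
class PassTracker {
  private passOpen = false;
  private endedPasses = 0;

  beginPass(): void {
    this.passOpen = true;
  }

  // Safe to call even when no pass is open; closes the pass at most once.
  endComputePass(): void {
    if (this.passOpen) {
      this.passOpen = false;
      this.endedPasses++;
    }
  }

  // Returns the total number of passes ended so far.
  flush(): number {
    this.endComputePass(); // the call that was removed and then added back
    return this.endedPasses;
  }
}
```

The guard inside endComputePass() is what makes it safe to call from both the normal dispatch path and flush() without double-ending a pass.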


gyagp commented Jan 12, 2024

run web CI


fs-eire commented Jan 12, 2024

/azp run ONNX Runtime Web CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@guschmue

/azp run ONNX Runtime Web CI Pipeline

@guschmue

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@guschmue

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jan 12, 2024

Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).


gyagp commented Jan 13, 2024

run web CI


fs-eire commented Jan 13, 2024

/azp run ONNX Runtime Web CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).


fs-eire commented Jan 13, 2024

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline


fs-eire commented Jan 13, 2024

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 9 pipeline(s).


Azure Pipelines successfully started running 7 pipeline(s).


fs-eire commented Jan 13, 2024

/azp run Windows ARM64 QNN CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@fs-eire fs-eire merged commit e803f8e into microsoft:main Jan 13, 2024
64 checks passed
@gyagp gyagp deleted the timestamp branch January 13, 2024 11:08
mszhanyi pushed a commit that referenced this pull request Jan 15, 2024
…side-passes (#18894)

siweic0 pushed a commit to siweic0/onnxruntime-web that referenced this pull request May 9, 2024
…side-passes (microsoft#18894)
