[js/webgpu] Refactor timestamp-query and introduce timestamp-query-inside-passes #18894
I ran some tests on Windows comparing timestamp queries at passes with timestamp queries inside passes, and didn't find an obvious difference. More details can be found at https://docs.google.com/document/d/1eAavWUvp2YdvfiR1a2kpUH_BvTjF2APwCuYY18l7G9g/edit.
…Passes This is to experiment with the timestamp-query solutions atPasses and insidePasses to see whether insidePasses has an obviously lower cost.
@qjia, can you take a first look? |
LGTM with some nits.
LGTM with some nits.
@fs-eire @guschmue @satyajandhyala Please take a look, thanks.
@qjia7 @fs-eire @guschmue @satyajandhyala |
@fs-eire, thanks for the comments! I addressed all your comments and pulled the latest code. Please take another look! |
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline |
/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Android CI Pipeline |
Azure Pipelines successfully started running 9 pipeline(s). |
The root cause is that I intentionally removed endComputePass() from flush() to avoid duplication. However, in io-binding, runAsync calls flush() without ending the GPUComputePassEncoder. I added endComputePass() back to fix the issue. Local tests below pass now. I also renamed the function writeTimeStamp() to writeTimestamp().
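To illustrate the fix described above, here is a minimal sketch of a flush() that ends any open compute pass before finishing the command encoder. The `PassLike` and `EncoderLike` interfaces, the `BackendSketch` class, and the fake encoder are hypothetical stand-ins so the example runs without a GPU; they are not the actual onnxruntime-web code.

```typescript
// Minimal stand-ins for the encoder objects; hypothetical, not the real WebGPU types.
interface PassLike {
  end(): void;
}
interface EncoderLike {
  beginComputePass(): PassLike;
  finish(): string;
}

class BackendSketch {
  private encoder: EncoderLike;
  private submit: (commandBuffer: string) => void;
  private pass: PassLike | null = null;

  constructor(encoder: EncoderLike, submit: (commandBuffer: string) => void) {
    this.encoder = encoder;
    this.submit = submit;
  }

  // Lazily open a compute pass on first use.
  getComputePass(): PassLike {
    if (this.pass === null) {
      this.pass = this.encoder.beginComputePass();
    }
    return this.pass;
  }

  // Idempotent: does nothing when no pass is open.
  endComputePass(): void {
    if (this.pass !== null) {
      this.pass.end();
      this.pass = null;
    }
  }

  // flush() ends any open pass itself, so callers that skip endComputePass() stay safe.
  flush(): void {
    this.endComputePass();
    this.submit(this.encoder.finish());
  }
}

// Fake encoder that records the order of calls.
const flushLog: string[] = [];
const fakeEncoder: EncoderLike = {
  beginComputePass: () => {
    flushLog.push('begin');
    return { end: () => { flushLog.push('end'); } };
  },
  finish: () => {
    flushLog.push('finish');
    return 'command-buffer';
  },
};

const backend = new BackendSketch(fakeEncoder, (cb) => { flushLog.push(`submit:${cb}`); });
backend.getComputePass(); // a caller opens a pass ...
backend.flush();          // ... and flushes without ending it
```

Because endComputePass() is a no-op when no pass is open, calling it from both the caller and flush() does not end a pass twice, which is one way to avoid the duplication concern mentioned above.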
run web CI |
/azp run ONNX Runtime Web CI Pipeline |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run ONNX Runtime Web CI Pipeline |
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline |
/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline |
Azure Pipelines successfully started running 1 pipeline(s). |
Azure Pipelines successfully started running 7 pipeline(s). |
Azure Pipelines successfully started running 9 pipeline(s). |
run web CI |
/azp run ONNX Runtime Web CI Pipeline |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline |
/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline |
Azure Pipelines successfully started running 9 pipeline(s). |
Azure Pipelines successfully started running 7 pipeline(s). |
/azp run Windows ARM64 QNN CI Pipeline |
Azure Pipelines successfully started running 1 pipeline(s). |
We submit kernels in batches (a fixed size of 16 is used, except for the last batch) for better performance. However, timestamp queries are supported at pass level, so the previous implementation disabled batch execution in profiling mode. In fact, a batch can contain multiple passes, so batch execution doesn't have to be disabled; this is the first enhancement of this PR.
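A rough sketch of the multiple-passes idea: each kernel gets its own pass with a begin/end timestamp pair, while submission still happens once per batch. The field names `beginningOfPassWriteIndex` and `endOfPassWriteIndex` follow WebGPU's `GPUComputePassTimestampWrites`; the function `planProfiledBatches`, the `PassPlan` type, and the choice to restart query indices at 0 in every batch are assumptions made for illustration, not the PR's actual code.

```typescript
// Shape mirrors the index fields of GPUComputePassTimestampWrites (querySet omitted here).
interface PassPlan {
  kernel: number;
  beginningOfPassWriteIndex: number;
  endOfPassWriteIndex: number;
}

// Split `kernelCount` kernels into batches of at most `batchSize`, one pass per kernel.
// Assumption: query indices restart at 0 in every batch, i.e. the query set is
// resolved and reused after each submit.
function planProfiledBatches(kernelCount: number, batchSize: number): PassPlan[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches: PassPlan[][] = [];
  let current: PassPlan[] = [];
  for (let kernel = 0; kernel < kernelCount; kernel++) {
    const slot = kernel % batchSize;
    if (slot === 0) {
      // Start a new batch; the previous one would be submitted at this point.
      current = [];
      batches.push(current);
    }
    current.push({
      kernel,
      beginningOfPassWriteIndex: 2 * slot,
      endOfPassWriteIndex: 2 * slot + 1,
    });
  }
  return batches;
}

// 35 kernels with a batch size of 16 yield batches of 16, 16 and 3 passes.
const profiledPlan = planProfiledBatches(35, 16);
```

With this layout a query set of `2 * batchSize` entries is enough for a whole batch, so profiling no longer forces a submit after every kernel.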
Furthermore, WebGPU has an extension that supports timestamp queries inside passes, which isn't available on all platforms (e.g., Windows supports it, while macOS doesn't). It is expected to have a lower cost than the multiple-passes solution, so this PR also introduces support for it when available.
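The "when available" selection could look roughly like the sketch below, which prefers the inside-passes extension and falls back to pass-level queries. `'timestamp-query'` is the standard WebGPU feature name; the inside-passes feature string, the `selectTimestampQueryMode` function, and the `TimestampQueryMode` type are assumptions for illustration and may differ from what the PR actually uses.

```typescript
type TimestampQueryMode = 'none' | 'at-passes' | 'inside-passes';

// Standard WebGPU feature name for pass-level timestamp queries.
const FEATURE_AT_PASSES = 'timestamp-query';
// Assumption: the experimental feature string exposed by the browser for
// timestamp queries inside passes; verify against the target browser.
const FEATURE_INSIDE_PASSES = 'chromium-experimental-timestamp-query-inside-passes';

// Anything with a Set-like has(), e.g. GPUAdapter.features or a plain Set<string>.
interface FeatureSetLike {
  has(name: string): boolean;
}

function selectTimestampQueryMode(features: FeatureSetLike, profiling: boolean): TimestampQueryMode {
  if (!profiling) {
    return 'none'; // no timestamps needed outside profiling mode
  }
  if (features.has(FEATURE_INSIDE_PASSES)) {
    return 'inside-passes'; // expected to be the cheaper option
  }
  if (features.has(FEATURE_AT_PASSES)) {
    return 'at-passes'; // fall back to one pass per kernel
  }
  return 'none'; // profiling requested, but no timestamp support
}

// Example with plain sets standing in for adapter.features.
const modeWithExtension = selectTimestampQueryMode(
  new Set([FEATURE_AT_PASSES, FEATURE_INSIDE_PASSES]), true);
const modeWithoutExtension = selectTimestampQueryMode(new Set([FEATURE_AT_PASSES]), true);
```

In a browser, the same function could be called with `adapter.features`, and the chosen feature would then be requested through `requiredFeatures` when creating the device.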
This PR also refactors some of the implementation related to kernelInfo and tries to unify the related kernel names.