[WebNN EP] Enable IO Bindings with MLTensor #21301

Merged
guschmue merged 21 commits into microsoft:main from create_mlbuffer on Sep 28, 2024

Conversation

egalli
Contributor

@egalli egalli commented Jul 9, 2024

Description

Enables using the MLTensor to pass data between models.

Motivation and Context

Using MLTensor instead of ArrayBuffers reduces the number of copies between the CPU and devices as well as the renderer and GPU process in Chromium.

Contributor

@Honry Honry left a comment

Super! Great work @egalli

Some first-pass comments. I can't wait to try it, and will provide more feedback later.

js/common/lib/tensor-impl.ts (2 review threads, outdated and resolved)
js/web/lib/wasm/jsep/backend-webnn.ts (2 review threads, outdated and resolved)
js/web/lib/wasm/session-handler-inference.ts (1 review thread, outdated and resolved)
js/web/lib/wasm/proxy-messages.ts (1 review thread, outdated and resolved)
@guschmue guschmue added the ep:WebNN WebNN execution provider label Jul 11, 2024
onnxruntime/core/providers/webnn/allocator.cc (1 review thread, outdated and resolved)
onnxruntime/wasm/pre-jsep.js (1 review thread, outdated and resolved)
onnxruntime/core/providers/webnn/data_transfer.cc (2 review threads, outdated and resolved)
js/common/lib/tensor.ts (1 review thread, outdated and resolved)
@egalli egalli force-pushed the create_mlbuffer branch from 85ca43b to 1e07f85 Compare July 17, 2024 22:34
@Honry
Contributor

Honry commented Jul 18, 2024

@egalli, I tested a simple merged model (with If control flow) and set only preferredOutputLocation = 'ml-buffer'. It throws an error:
[image: screenshot of the error]

Since WebNN doesn't support If, the graph is partitioned into 2 subgraphs. ORT then has to copy outputs across devices, i.e. CopyOutputsAcrossDevices is called, which triggers an upload to a new ML buffer. At that point ensureMLBuffer() has not been called to create the ML buffer, which is why the error above is thrown.

With a pre-allocated output ML buffer, it works.

@egalli
Contributor Author

egalli commented Jul 18, 2024

@Honry, I have changed from getMLBuffer to ensureBuffer when retrieving outputs. This change fixes the issue on partitioned graphs.
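
The difference between a strict lookup and an ensure-style lookup can be sketched as follows. This is an illustration only, not the code from this PR: `BufferManager`, `FakeBuffer`, and both method bodies are hypothetical, and the real getMLBuffer/ensureBuffer signatures may differ.

```typescript
// Hypothetical stand-in for a device buffer; not a WebNN or ORT type.
interface FakeBuffer {
  id: number;
  byteLength: number;
}

class BufferManager {
  private buffers = new Map<number, FakeBuffer>();

  // Strict lookup: fails when nothing has been allocated for this id yet.
  getBuffer(id: number): FakeBuffer {
    const buffer = this.buffers.get(id);
    if (buffer === undefined) {
      throw new Error(`Buffer ${id} has not been created`);
    }
    return buffer;
  }

  // Lazy lookup: allocates on first use, then returns the same buffer.
  ensureBuffer(id: number, byteLength: number): FakeBuffer {
    let buffer = this.buffers.get(id);
    if (buffer === undefined) {
      buffer = { id, byteLength };
      this.buffers.set(id, buffer);
    }
    return buffer;
  }
}

const manager = new BufferManager();

// An output that was never pre-allocated makes the strict lookup throw.
let strictLookupFailed = false;
try {
  manager.getBuffer(1);
} catch {
  strictLookupFailed = true;
}

// The ensure-style lookup creates the buffer once and reuses it afterwards.
const created = manager.ensureBuffer(1, 16);
const reused = manager.ensureBuffer(1, 16);
```

With an ensure-style lookup, an output that crosses a partition boundary no longer depends on the caller having pre-allocated it, which matches the behaviour reported above.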

@Honry
Contributor

Honry commented Jul 19, 2024

> @Honry, I have changed from getMLBuffer to ensureBuffer when retrieving outputs. This change fixes the issue on partitioned graphs.

It works, thanks!

Contributor

@Honry Honry left a comment

@egalli, some final comments. :)

js/web/test/test-runner.ts (6 review threads, outdated and resolved)
js/web/lib/wasm/session-handler-inference.ts (1 review thread, outdated and resolved)
js/web/lib/wasm/wasm-types.ts (1 review thread, outdated and resolved)
onnxruntime/core/providers/webnn/data_transfer.cc (1 review thread, outdated and resolved)
onnxruntime/wasm/pre-jsep.js (1 review thread, outdated and resolved)
@egalli egalli force-pushed the create_mlbuffer branch 2 times, most recently from 0c853c0 to e203298 Compare July 22, 2024 23:16
Contributor

@Honry Honry left a comment

Thank you @egalli, LGTM % two nits.

onnxruntime/wasm/pre-jsep.js (2 review threads, outdated and resolved)
@egalli
Contributor Author

egalli commented Jul 29, 2024

MLBuffer specification has changed. createBuffer is now async.

@fs-eire
Contributor

fs-eire commented Aug 5, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@fs-eire
Contributor

fs-eire commented Aug 5, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@fs-eire
Contributor

fs-eire commented Aug 5, 2024

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 10 pipeline(s).

@huningxin

FYI, @qwu16 and @Honry are measuring performance on a set of models with ORT Web builds before and after applying this PR. Once the data is ready, we can review it and make a decision.

@fs-eire
Contributor

fs-eire commented Aug 7, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@fs-eire
Contributor

fs-eire commented Aug 7, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@fdwr
Contributor

fdwr commented Sep 18, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@fdwr
Contributor

fdwr commented Sep 18, 2024

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 8 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

@fs-eire
Contributor

fs-eire commented Sep 20, 2024

I am generally OK with this change. I have a question: does this MLTensor only represent a WebNN tensor that is on the GPU/NPU, or can it also represent a WebNN tensor on the CPU?

@egalli
Contributor Author

egalli commented Sep 20, 2024

MLTensor can represent data in any of the device types (CPU, GPU, or NPU). Moreover, MLContext.compute() is getting removed from the WebNN specification. There won't be a way to use WebNN without MLTensor.

@fs-eire
Contributor

fs-eire commented Sep 20, 2024

> MLTensor can represent data in any of the device types (CPU, GPU, or NPU). Moreover, MLContext.compute() is getting removed from the WebNN specification. There won't be a way to use WebNN without MLTensor.

I understand that MLTensor will be the type that the WebNN interface uses. My question is how MLTensor manages the location of its data. I don't see how a user can set locations for a model's inputs and outputs when using MLTensor (it's not yet in the spec). If an MLTensor is created with an explicitly set location, then at least for CPU usage we should allow users to use a normal ort.Tensor as model input/output instead of having to use ort.Tensor.fromMLTensor().

@egalli
Contributor Author

egalli commented Sep 21, 2024

There is a PR explainer for MLTensor. As for the location, MLTensors are created from an MLContext. Each MLContext has a device associated with it: CPU, GPU, or NPU. Therefore, an MLTensor inherits its context's location/device. Also, MLTensors are only valid when used in the context that created them (i.e. MLTensors are non-transferable between devices/MLContexts). Hopefully that helps.
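
The ownership rule described here can be modelled in a few lines. This is a minimal sketch under stated assumptions: `ModelContext`, `ModelTensor`, and `accepts` are hypothetical names invented for illustration, and the real MLContext and MLTensor are browser objects with a different API.

```typescript
type DeviceType = 'cpu' | 'gpu' | 'npu';

// Hypothetical model of a context; the real MLContext is a browser object.
class ModelContext {
  constructor(readonly deviceType: DeviceType) {}

  // Tensors are always created through a context, which becomes their owner.
  createTensor(byteLength: number): ModelTensor {
    return new ModelTensor(this, byteLength);
  }

  // A tensor may only be used with the context that created it.
  accepts(tensor: ModelTensor): boolean {
    return tensor.owner === this;
  }
}

class ModelTensor {
  constructor(readonly owner: ModelContext, readonly byteLength: number) {}

  // The tensor has no device of its own; it reports its context's device.
  get deviceType(): DeviceType {
    return this.owner.deviceType;
  }
}

const gpuContext = new ModelContext('gpu');
const cpuContext = new ModelContext('cpu');
const tensor = gpuContext.createTensor(64);

const inheritedDevice = tensor.deviceType;
const validInOwnContext = gpuContext.accepts(tensor);
const validInOtherContext = cpuContext.accepts(tensor);
```

In this model the tensor reports the device of the context that created it, and a second context rejects it.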

@bbernhar

@fs-eire Thanks for helping us integrate MLTensor into ORT web. To expand on what @egalli said, MLTensor handles the memory allocation for its data on behalf of the web developer. Unless you use createTensor(GPUDevice) to explicitly target GPU memory, the developer can't directly specify the memory location. Instead, createTensor(MLTensorUsageFlags) allows you to influence whether the data should prioritize certain resources (like CPU or GPU), though it doesn't grant full control over the specific device.

We are still working on fully specifying how tensor data will move between devices, so at this stage, it's unclear if the location will move predictably or remain fixed on a given deviceType. Stay tuned for updates as we refine this feature.

Let me know if you'd like more adjustments or details.

@fs-eire
Contributor

fs-eire commented Sep 24, 2024

@egalli @bbernhar Thank you for the explanation. I think it is totally fine to consider "MLTensor" as a virtual (logical) location in the context of ort.Tensor.location. There are 2 questions:

  • do we expect to allow ort.Tensor(CPU) to be used as input/output for ort.InferenceSession.run? (This may require implementing implicit conversions from buffer to MLTensor internally.)
  • do we want to align with the current resource management of ort.Tensor? (This may require implementing Tensor.download and Tensor.dispose for graph outputs.)

@huningxin

@fs-eire

> do we expect to allow ort.Tensor(CPU) to be used as input/output for ort.InferenceSession.run? (This may require implementing implicit conversions from buffer to MLTensor internally.)

I suppose this would be a common inference scenario and we should support it. Yulong, what's your guidance?

I understand @egalli also has a plan to improve the performance of ort.Tensor(CPU) input/output by reusing MLTensor. Enrico, did you happen to create an issue tracking this optimization?

> do we want to align with the current resource management of ort.Tensor? (This may require implementing Tensor.download and Tensor.dispose for graph outputs.)

I suppose we should support Tensor.download and Tensor.dispose. Could this be implemented in a follow-up CL?

@egalli
Contributor Author

egalli commented Sep 24, 2024

@huningxin I have not created an issue, but I have a preliminary change set ready

> do we expect to allow ort.Tensor(CPU) to be used as input/output for ort.InferenceSession.run? (This may require implementing implicit conversions from buffer to MLTensor internally.)

The code currently uses the Allocator and DataTransfer classes in the C++ code. Is this a problem?

> do we want to align with the current resource management of ort.Tensor? (This may require implementing Tensor.download and Tensor.dispose for graph outputs.)

Is this different than the currently implemented solution?

By the way, keep in mind that unlike ort.Tensor(CPU), MLTensor(CPU) is not easily accessible to JS/Wasm. In Chromium, MLTensor(CPU) is allocated by the TFLite backend outside of the renderer process. Therefore, IPC/Mojo calls are required to read from and write to it.

@fs-eire
Contributor

fs-eire commented Sep 24, 2024

The basic idea of resource management in onnxruntime-web is:

CPU:

Users should never need to worry about resource management for CPU tensors. A CPU ort.Tensor always uses a TypedArray (non-string types) or string[] (string type) as its underlying data, so no resources need to be released manually.

Since no resource management is needed, there is also no lifecycle consideration for CPU tensors.

GPU (and other non-CPU location):

If an instance of a non-CPU ort.Tensor is created by the user (via Tensor.fromGpuBuffer or Tensor.fromMLTensor), the user needs to manage the underlying resource. Specifically:

  • the user needs to make sure the underlying resource stays valid while the ort.Tensor instance is in use.
  • the user is responsible for managing the lifecycle of the underlying resource.

If an instance of a non-CPU ort.Tensor is created by onnxruntime-web as an output (this only happens when sessionOptions.preferredOutputLocation is correctly set), the underlying resource is allocated by onnxruntime-web. In this case, the ort.Tensor instance should contain valid download and dispose properties.

  • the user should release the underlying resource after use by calling ort.Tensor.dispose().
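
These ownership rules can be condensed into a small helper. The sketch below is an assumption-laden illustration: `SketchTensor` and `releaseIfRuntimeOwned` are hypothetical, and the real ort.Tensor interface is richer than this. It only encodes the rule stated above, that a dispose function is present when the runtime allocated the resource.

```typescript
// Hypothetical shape for illustration; not the real ort.Tensor interface.
interface SketchTensor {
  location: 'cpu' | 'gpu-buffer' | 'ml-tensor';
  // Present only when the runtime allocated the underlying resource.
  dispose?: () => void;
}

// Release only what the runtime allocated. CPU tensors need no cleanup, and
// resources the user supplied stay under the user's control.
function releaseIfRuntimeOwned(tensor: SketchTensor): boolean {
  if (tensor.location === 'cpu' || tensor.dispose === undefined) {
    return false;
  }
  tensor.dispose();
  return true;
}

let disposeCalls = 0;

// A CPU tensor, a user-created device tensor, and a runtime-created output.
const cpuTensor: SketchTensor = { location: 'cpu' };
const userTensor: SketchTensor = { location: 'ml-tensor' };
const outputTensor: SketchTensor = {
  location: 'ml-tensor',
  dispose: () => {
    disposeCalls += 1;
  },
};

const released = [cpuTensor, userTensor, outputTensor].map((t) => releaseIfRuntimeOwned(t));
```

Only the runtime-created output is released here. The user-created tensor is left alone because its underlying resource belongs to the user.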

@fs-eire
Contributor

fs-eire commented Sep 24, 2024

> The code currently uses the Allocator and DataTransfer classes in the C++ code. Is this a problem?

No problem.

> Is this different than the currently implemented solution?

It should be OK.

> By the way, keep in mind that unlike ort.Tensor(CPU), MLTensor(CPU) is not easily accessible to JS/Wasm. In Chromium, MLTensor(CPU) is allocated by the TFLite backend outside of the renderer process. Therefore, IPC/Mojo calls are required to read from and write to it.

I think that if the user requires the "location" to be on CPU, it means using ort.Tensor on CPU; if the user wants to use an MLTensor(CPU), they should specify the location as "MLTensor".

@egalli
Contributor Author

egalli commented Sep 24, 2024

> I think that if the user requires the "location" to be on CPU, it means using ort.Tensor on CPU; if the user wants to use an MLTensor(CPU), they should specify the location as "MLTensor".

Considering that the WebNN WG wants to remove deviceType from the specification, that sounds reasonable to me.

@bbernhar

@fs-eire

I don't expect we need to allow mapping ort.Tensor(CPU) to MLTensor(CPU), since that requires the tensor data to stay behind an MLContext and not a TypedArray. It does make sense to me that we align non-CPU ort.Tensor resource management with MLTensor, even if it happens to be a CPU device, by supporting "download" and "dispose" (+1 to doing so in a subsequent PR).

Does this answer your question/concern?

@fs-eire
Contributor

fs-eire commented Sep 24, 2024

> @fs-eire
>
> I don't expect we need to allow mapping ort.Tensor(CPU) to MLTensor(CPU), since that requires the tensor data to stay behind an MLContext and not a TypedArray. It does make sense to me that we align non-CPU ort.Tensor resource management with MLTensor, even if it happens to be a CPU device, by supporting "download" and "dispose" (+1 to doing so in a subsequent PR).
>
> Does this answer your question/concern?

Is using ort.Tensor(CPU) as input allowed? Or do you want to require ort.Tensor (via ort.Tensor.fromMLTensor) everywhere with the WebNN EP?

I would prefer that using ort.Tensor(CPU) is allowed and is the default behavior for input/output, because it is much easier and more user-friendly. Of course, if I want to use MLTensor, I can do so via the corresponding interface.

@egalli
Contributor Author

egalli commented Sep 25, 2024

My understanding is that ort.Tensor(CPU) is already supported as both input and output of ort.InferenceSession.run. It just uses a less efficient path: JS -(copy)-> wasm -> webnn::Allocate::Alloc -> webnn::DataTransfer::CopyTensor -(copy)-> MLTensor (2 copies). While this could be reduced to 1 copy, that would require knowing whether the first node in the graph is going to run in the WebNN EP or the CPU EP (fallback). Therefore, I would rather tackle this in another PR.
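
The two-copy path versus a one-copy path can be illustrated with plain typed arrays. This is a toy simulation, not ORT code: the "Wasm heap" and "device tensor" below are ordinary Float32Array instances standing in for the real allocations, and `copyInto` is a hypothetical helper that only counts full copies.

```typescript
// Counts every full copy of the tensor data.
let copies = 0;

function copyInto(source: Float32Array, target: Float32Array): void {
  target.set(source);
  copies += 1;
}

const input = new Float32Array([1, 2, 3, 4]);

// Copy 1: JS typed array into a (simulated) Wasm heap allocation.
const wasmHeap = new Float32Array(input.length);
copyInto(input, wasmHeap);

// Copy 2: Wasm heap into a (simulated) device-side tensor.
const deviceTensor = new Float32Array(input.length);
copyInto(wasmHeap, deviceTensor);

const copiesViaWasm = copies;

// Direct path: write the JS data straight into the device-side tensor.
copies = 0;
const directTensor = new Float32Array(input.length);
copyInto(input, directTensor);

const copiesDirect = copies;
```

Both paths end with the same data on the device side. The direct path is only safe when the data is known to be consumed by the WebNN EP, which is the condition described above.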

Is there anything else blocking this PR?

@fs-eire
Contributor

fs-eire commented Sep 25, 2024

> My understanding is that ort.Tensor(CPU) is already supported as both input and output of ort.InferenceSession.run. It just uses a less efficient path: JS -(copy)-> wasm -> webnn::Allocate::Alloc -> webnn::DataTransfer::CopyTensor -(copy)-> MLTensor (2 copies). While this could be reduced to 1 copy, that would require knowing whether the first node in the graph is going to run in the WebNN EP or the CPU EP (fallback). Therefore, I would rather tackle this in another PR.
>
> Is there anything else blocking this PR?

No. I think it's all good.

@fs-eire
Contributor

fs-eire commented Sep 25, 2024

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-ortmodule-distributed,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 6 pipeline(s).

@bbernhar

bbernhar commented Sep 25, 2024

@fs-eire

> Is using ort.Tensor(CPU) as input allowed? Or do you want to require ort.Tensor (via ort.Tensor.fromMLTensor) everywhere with the WebNN EP?

Yes, it can be allowed so long as ort.Tensor(CPU) doesn't assert that the final device location of the MLTensor must also be on the CPU (copies are OK).

@bbernhar

@fs-eire @fdwr ready to merge this PR? @egalli does not have write access.

@guschmue guschmue merged commit 52a8c1c into microsoft:main Sep 28, 2024
79 checks passed
@fdwr
Contributor

fdwr commented Sep 28, 2024

> Considering that the WebNN WG wants to remove deviceType from the specification, that sounds reasonable to me.

@egalli That idea was proposed, but there's no consensus on it. More likely, it will end up being relaxed to a hint rather than a requirement, given that CoreML's MLComputeUnits do not support requesting only the NPU or only the GPU (in CoreML, you always implicitly get the CPU too as a fallback). See, there remain ambiguous cases where a power preference alone is inadequate for device selection, such as devices where the GPU is actually slower than the NPU, or devices with both integrated and discrete GPUs, where you use the deviceType to request a GPU and the power preference to select between them.
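
The combination described here, a device type acting as a hint plus a power preference choosing between GPUs, might look roughly like the sketch below. Everything in it is hypothetical: the `Adapter` shape, `pickAdapter`, and the rule that 'high-performance' prefers a discrete GPU are assumptions made for illustration and are not defined by the WebNN specification.

```typescript
type DeviceKind = 'cpu' | 'gpu' | 'npu';
type PowerPreference = 'default' | 'high-performance' | 'low-power';

// Hypothetical description of a device; not part of the WebNN API.
interface Adapter {
  name: string;
  kind: DeviceKind;
  discrete: boolean;
}

function pickAdapter(
  adapters: Adapter[],
  hint: DeviceKind,
  power: PowerPreference,
): Adapter | undefined {
  // The device type narrows the candidates. Because it is only a hint,
  // a missing match falls back to the CPU instead of failing.
  const matching = adapters.filter((a) => a.kind === hint);
  if (matching.length === 0) {
    return adapters.find((a) => a.kind === 'cpu');
  }
  if (power === 'default') {
    return matching[0];
  }
  // The power preference only chooses among devices of the hinted kind.
  // Assumption: 'high-performance' maps to a discrete device.
  const wantDiscrete = power === 'high-performance';
  return matching.find((a) => a.discrete === wantDiscrete) ?? matching[0];
}

const adapters: Adapter[] = [
  { name: 'cpu', kind: 'cpu', discrete: false },
  { name: 'integrated-gpu', kind: 'gpu', discrete: false },
  { name: 'discrete-gpu', kind: 'gpu', discrete: true },
];

const fast = pickAdapter(adapters, 'gpu', 'high-performance');
const frugal = pickAdapter(adapters, 'gpu', 'low-power');
const fallback = pickAdapter(adapters, 'npu', 'default');
```

On this made-up machine, a power preference alone could not say whether a GPU is wanted at all; the device type supplies that part of the request, and the CPU remains the fallback when no NPU exists.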

@egalli egalli deleted the create_mlbuffer branch October 22, 2024 01:28
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request Nov 19, 2024
Labels
ep:WebNN WebNN execution provider
9 participants