Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Context-based graph execution methods for different threading models. #257

Merged
merged 13 commits into from
Apr 30, 2022

Conversation

wchao1115
Copy link
Collaborator

@wchao1115 wchao1115 commented Mar 10, 2022

Alternative design to #255 for various graph execution methods, with both immediate and async execution methods folded into the context interface, making it easier for developer to associate execution method to the context created with different options. The queued execution method, however, is left to a separate MLCommandEncoder interface as it requires multi-step calling pattern that ends with the finish method, consistent with the WebGPU GPU command encoder calling pattern.


Preview | Diff

index.bs Outdated

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface MLCommandEncoder {
undefined initializeGraph(MLGraph graph, MLNamedGPUInputs inputs);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to name inputs arguments passed to initializeGraph() and dispatch() differently to make it clearer their semantics differ. I think this would improve developer ergonomics.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add more descriptive text into describing these two methods in more detail. I dont think just changing the param name will make any material difference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I'll encourage other reviewers to take a look at this PR and given their opinion on it relative to #255. I hope we can come to an agreement on this important technical topic soon. It has been a while in the making.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look at the latest commit. I've removed MLCommandEncoder and let computeAsync call handle both the CPU and GPU async use case per discussion with @RafaelCintron and @bbernhar. Detail please refer to this reply #255 (comment)

index.bs Outdated
{{MLContext/compute()}} method represents a way the execution of the graph is carried out immediately
on the calling thread, which must also be a worker thread. The execution produces the results of the computation
from all the inputs bound to the graph. This type of execution is limited only to when the computational device
bound to the context is a CPU device.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as noted in #255, I'm still unsure why this is limited to CPU execution.

That restriction means the only way to use that API is in a try … catch structure with a fallback to the async version, since the developer has no way to know if the the MLContext it obtained is bound to a CPU or not (afaik).

Given that one of the premises of having the sync-version is to make it usable when converting sync-only code-bases, I don't think this fulfills the requirements.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dontcallmedom The GPU execution is inherently async but not on the app's timeline. It's async on the GPU timeline. So having a sync call on a GPU device would mean stalling the CPU until the GPU queue has a chance to be executed. That is undesirable and none of our known use case will do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that this would happen on a DedicatedWorker, I don't think having the worker blocked from doing other CPU work is particularly an issue (the developer can choose to use the said worker only for things that need to wait for the ML processing to happen; or they should use async) - it's similar to what happens when using the FileReaderSync methods.

I see that you argue in https://github.com/webmachinelearning/webnn/pull/251/files#r832248425 that device selection should not be a preference or a hint, but be normative; if so, that would address the gist of my concern in terms of API shape, although I'm not sure that interpretation is inline with what we've been discussing so far (or at least with what I've understood of our intents).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TF.js use case on a GPU device is such that the execution call is done on the main thread and not the worker thread. However, the compute result is deferred until the GPU executes the result and the readback occurs. The reason the sync API is limited to only the worker thread is for a different purpose. It's because even for the CPU execution, the main thread should never block.

Again, this change assumes that the app knows what type of MLContext it creates. WebNN makes that assumption.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the app knows what type of MLContext it creates. WebNN makes that assumption.

To clarify my point - I'm not arguing here (as I did in #255) that the MLContext may be coming from a third party library. My assertion is that as a developer, even if I set MLDdevicePreference to CPU, the spec gives no guarantee that I'll get back a CPU-based MLContext, and doesn't even expose if the MLContext I got back is CPU-based or not. So as a developer, I don't know if I can or cannot execute the sync version of the API, except by actually executing it and getting an exception (which I don't think is a practical API to use).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I commented in #255, I think it should support GPU as well. It would allow the caller who requires sync API could also access GPU, e.g., TFLite WebNN delegate. When the compute device is GPU, it would block the calling thread until the result is readback into the output ArrayBuffers. This would not impact the responsiveness of the main (UI) thread as its usage is restricted in the DedicatedWorker.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer this to @RafaelCintron who insisted that GPU sync call is bad for user experience.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's bring this to a separate issue - I think I would keep "cpu" as a name and instead make it the WebIDL-default value of MLContextOptions.devicePreference; but my sense is that the primary question around such a change is more about the impact of making devicePreference no longer a hint but a setting.

In the latest commit, I've removed the default option and rename MLDevicePreference to MLDeviceType to make it more clear that it is not merely a hint. @dontcallmedom Please take a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we agree sync call is not bound to cpu device, I would suggest to keep the default option that allows user agent to select the best device (cpu, gpu or ai accelerator) for application. And for this type of ML Context, always use ArrayBufferView as compute input and output. The sync version of compute is restricted in worker to avoid block the UI thread. The async version of compute is allowed to use in both main thread and worker. I think this design would support most common usages.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way you explain it here is not really "default" but rather "auto" option i.e. automatic device selection. This is a hard subject with potential backward and forward compatibility guarantee involved. Also the definition of "best" varies greatly depending on what the application is trying to do. Generally speaking, this type of policy behavior should not belong to the backend component like the WebNN API. Rather, it should be part of the application logic or in the framework on behalf of the application.

"default" and "auto" aren't really the same thing. Default is a well-defined behavior that is chosen to be a default behavior when nothing more explicit is specified, while the behavior of auto is dynamic implying the notion of "best fit", a more open-ended policy.

I prefer that we discuss automatic device selection as a separate issue.

index.bs Outdated
{{MLContext/compute()}} method represents a way the execution of the graph is carried out immediately
on the calling thread, which must also be a worker thread. The execution produces the results of the computation
from all the inputs bound to the graph. This type of execution is limited only to when the computational device
bound to the context is a CPU device.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I commented in #255, I think it should support GPU as well. It would allow the caller who requires sync API could also access GPU, e.g., TFLite WebNN delegate. When the compute device is GPU, it would block the calling thread until the result is readback into the output ArrayBuffers. This would not impact the responsiveness of the main (UI) thread as its usage is restricted in the DedicatedWorker.

explainer.md Outdated Show resolved Hide resolved
explainer.md Outdated Show resolved Hide resolved
@wchao1115 wchao1115 force-pushed the graph_execution_context branch from c0f66c5 to 26d5bc1 Compare March 28, 2022 00:18
index.bs Outdated
MLPowerPreference powerPreference = "default";
GPUDevice gpuDevice = null;
WebGLRenderingContext glContext = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, developer may set conflicting options, e.g., {deviceType: 'cpu', gpuDevice: device}. The existing design won't let developers to make such conflicting case. Actually these options exclusively used to create ML context of three types: default, WebGPU-based, WebGL-based. I suppose the existing design reflects the usages more clearly.

Copy link
Collaborator Author

@wchao1115 wchao1115 Mar 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When deviceType is set to anything but "gpu", the gpuDevice even if given is ignored. I don't think it is uncommon to have cross-member dependencies considering that every field in the MLContextOptions is already an optional field. By doing it this way, we avoid having multiple overload createContext methods on the ML interface.

index.bs Outdated
{{MLContext/compute()}} method represents a way the execution of the graph is carried out immediately
on the calling thread, which must also be a worker thread. The execution produces the results of the computation
from all the inputs bound to the graph. This type of execution is limited only to when the computational device
bound to the context is a CPU device.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we agree sync call is not bound to cpu device, I would suggest to keep the default option that allows user agent to select the best device (cpu, gpu or ai accelerator) for application. And for this type of ML Context, always use ArrayBufferView as compute input and output. The sync version of compute is restricted in worker to avoid block the UI thread. The async version of compute is allowed to use in both main thread and worker. I think this design would support most common usages.

index.bs Outdated
<tr><td>cpu<td>Yes<td>No<td>No<td>No<td>No
<tr><td>gpu (GPUDevice == null)<td>Yes<td>No<td>No<td>No<td>No
<tr><td>gpu (GPUDevice != null)<td>Yes<td>Yes<td>Yes<td>No<td>No
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the MLContext is created from a GPUDevice, I would suggest it only accepts the GPUBuffer and GPUTexture. This design would help simplify the spec / implementation of WebNN API and leave the ArrayBufferView uploading / readback to WebGPU API where it is already well defined.

@huningxin
Copy link
Contributor

huningxin commented Mar 28, 2022

Summarize my points, there should be three types of MLContext that support three usages respectively:

Usage Creation Method Device Type Execution Method Buffer Type
Default Create from an MLContextOptions CPU or GPU,
device selected by device and power preferences
async computeAsync (both main thread and worker)
sync compute (worker only)
ArrayBufferView
WebGPU interop Create from
a GPUDevice
GPU,
device selected by WebGPU API
submit MLGraph as a GPUCommandBuffer to GPUQueue GPUBuffer and GPUTexture
WebGL interop Create from a WebGLRenderingContext GPU,
device selected by WebGL API
sync compute (both main thread and worker) ? WebGLBuffer and WebGLTexture

I have an open for the execution method for WebGL interop usage: should it be sync API and be available in both main thread and worker given GPU execution is async by natural? But how would it happen and be implemented? It seems to me there is lack of investigation on WebGL interop. So should we remove the WebGL interop support and leave it for following PR?

@wchao1115
Copy link
Collaborator Author

wchao1115 commented Mar 28, 2022

@huningxin The default device option is problematic for our use case because the caller doesn't know what type of device the user agent has selected. That type of design based on behavior hints is fine as long as the user agent retains full control of all the related behaviors when it decides to select a certain type of device. But in our case, the caller must choose the execution method that is best for their situation. This means that the caller must be conscious of their choice of the device type to use and therefore cannot defer that decision to the user agent. @dontcallmedom expressed this concern in his review, which I agreed.

On a separate issue, a design that allows synchronous GPU execution should be avoided because the CPU could be left needlessly stalled for a long period of time. Similarly, allowing synchronous CPU execution on the main thread should also be avoided as it could disrupt the user experience. The synchronous compute method is carefully designed to only allow CPU execution on a worker thread.

@huningxin
Copy link
Contributor

@wchao1115 , my point is the spec should allow developers to specify their intent that "select the best device automatically". For that use case, the "auto" proposal from @yuhonglin is probably a good fit.

On a separate issue, a design that allows synchronous GPU execution should be avoided because the CPU could be left needlessly stalled for a long period of time.

I don't think this is an issue, because the sync compute method is restricted to be used only in the worker thread. The stall of worker thread won't impact the responsiveness of UI which is running in the main thread. If developers don't want CPU to wait for GPU execution, they are free to use the async compute method.

As my understanding, the sync compute method is required to support a framework code base that expects sync API call in a backend, e.g., TFLite WebNN delegate. The spec shouldn't restrict this type of frameworks only to CPU execution.

@wchao1115
Copy link
Collaborator Author

Can we keep the 'auto' proposal as a separate issue for now? This is a very hard topic that to my knowledge none of the leading cross-vendor ML platform has fully solved. The only known case when this is supported today is CoreML and that only works on Apple M1 hardware platform.

As for supporting sync call on the GPU device on the worker thread, I would like to understand a bit more of the current use case. Since the GPU execution is async by nature (but on the GPU timeline), it seems odd that a worker thread would want to block for it.

@huningxin
Copy link
Contributor

huningxin commented Mar 29, 2022

As for supporting sync call on the GPU device on the worker thread, I would like to understand a bit more of the current use case.

As I mentioned, the sync compute call is required by a framework backend that infers sub-graph synchronously. For example, TFLite WebNN delegate needs to implement the sync SimpleDelegateKernelInterface::Eval method for sub-graph inference. Besides TFLite, as I reported in my presentation for TPAC, the prototype of ONNX Runtime WebNN EP and OpenCV.js WebNN backend prototype use the sync compute method. The spec should allow these frameworks to use GPU in the WebNN backend implementation.

index.bs Outdated Show resolved Hide resolved
@wchao1115
Copy link
Collaborator Author

As I mentioned, the sync compute call is required by a framework backend that infers sub-graph synchronously. For example, TFLite WebNN delegate needs to implement the sync SimpleDelegateKernelInterface::Eval

Thanks @huningxin. I think it makes sense. Can we assume that the TFLite delegate use case only happen in the worker thread?

@huningxin
Copy link
Contributor

@wchao1115

Can we assume that the TFLite delegate use case only happen in the worker thread?

I suppose so. People experienced unresponsiveness issue by inferring large model with TF.js Wasm backend (same applies to TFLite Wasm backend) in main thread and request to move all heavy processing to worker thread. /cc @pyu10055

Once this PR landed, we can update the design of TFLite WebNN delegate to reflect this restriction. WDYT?

@wchao1115
Copy link
Collaborator Author

Ok. Sounds good. I'll make that change.

…text only supports CPU inputs and outputs (automatic upload/download). Reintroduce MLCommandEncoder for WebGPU interop.
@wchao1115
Copy link
Collaborator Author

wchao1115 commented Apr 18, 2022

Commit ef9262b contains the following changes. Hopefully this is the last for this PR.

  1. Allow GPU sync execution (limited to worker thread calling thread). I spoke with @RafaelCintron and he is signed off on that.
  2. Both sync and async execution only accepts ArrayBufferView -- meaning default context (those created with MLContextOptions) always upload/download inputs/outputs if needed by the GPU device.
  3. Device type must be explicit, default to CPU device. We assume that the caller of WebNN always deal with specific device type, as defering that choice to the user agent will only make resource management more complicate and less efficient.
  4. Reintroduce MLCommandEncoder. After much back and forth, I think a separate context type that explicitly deals with WebGPU interop is the most practical and clean approach. An implementation that can't support this requirement should fail context creation. Also, after much deliberation, I prefer to stick with the original proposal of MLCommandEncoder that produces WebGPU-compatible GPUCommandBuffer as opposed to a WebNN-specific command buffer. We need to have a follow-up with the WebGPU CG on how to implement WebNN in a way that produces a compatible GPUCommandBuffer.

This change is unlikely to make everyone happy, that solution probably doesn't exist. But I hope it's a reasonable enough compromise where everyone gets something they want. The net result of completing this change at this stage of the draft spec is certainly going to be far better than nothing.

Copy link
Contributor

@huningxin huningxin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks much @wchao1115 for this update.

Allow GPU sync execution (limited to worker thread calling thread).

+1

Both sync and async execution only accepts ArrayBufferView

+1

Device type must be explicit, default to CPU device.

+1

Reintroduce MLCommandEncoder.

+1 with an open on constants initialization. Please take a look.

index.bs Outdated Show resolved Hide resolved
index.bs Outdated Show resolved Hide resolved
… clarify graph initialization stage and remove the unnecessary second param.
index.bs Outdated Show resolved Hide resolved
@bbernhar
Copy link

@wchao1115

I like the use of MLCommandBuffer but I suggest we offer it for WebNN Interop. WebGPU interop wasn't meant to be a means for gate keeping GPU access on the web. Changing of the WebGPU execution model, by allowing ML work to execute under WebGPU, and requiring a new means of WebGPU interop, by sharing command-buffers with WebNN are just what-ifs and far from practical - does WebGPU do this for every API ?!?

Since WebNN has no actual dependency on WebGPU's execution model, such a justification to WebGPU WG will be very hard and unlikely anyway and WebNN Interop would allow progress...

@huningxin
Copy link
Contributor

huningxin commented Apr 27, 2022

@bbernhar

does WebGPU do this for every API ?!?

I suppose not. I think the use case, such as video background blur on GPU (#249) , would drive the requirement of WebGPU interop. As the investigation of that use case, the apps can do this with WebGPU-only pipeline. They may want to plug WebNN into that pipeline for better ML compute performance. The MLCommandEncoder proposal allows this usage.

I like the use of MLCommandBuffer but I suggest we offer it for WebNN Interop.

WebNN interop is a good perspective. As we discussed in the last WG call, @wchao1115 already did a great job in this PR that gives a clear separation between WebNN standalone usage and WebGPU interop. I believe it would allow more efficient iterations for each usage respectively. So I would suggest to merge this PR first and file a separate issue for WebNN interop proposal.

WDYT?

@bbernhar
Copy link

@huningxin

Interop can go both ways. Plugging WebGPU command buffers into WebNN achieves the exact same result; except, it doesn't break WebGPU and you can progress immediately. If not, I suggest we define it as MLExternalCommandBuffer (read-only) to denote such usage is one-way.

@huningxin
Copy link
Contributor

Thanks @bbernhar for your feedback. I am interested in knowing more. Could you please file a separate issue? So we can focus on it.

Copy link
Contributor

@huningxin huningxin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks again for @wchao1115 's great work.

@wchao1115 wchao1115 merged commit 4c5b70b into main Apr 30, 2022
@wchao1115 wchao1115 deleted the graph_execution_context branch April 30, 2022 18:03
github-actions bot added a commit that referenced this pull request Apr 30, 2022
…#257)

SHA: 4c5b70b
Reason: push, by @wchao1115

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@bbernhar
Copy link

bbernhar commented May 2, 2022

@huningxin Done.

#264

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants