-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Context-based graph execution methods for different threading models. #257
Conversation
index.bs
Outdated
|
||
[SecureContext, Exposed=(Window, DedicatedWorker)] | ||
interface MLCommandEncoder { | ||
undefined initializeGraph(MLGraph graph, MLNamedGPUInputs inputs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to name inputs
arguments passed to initializeGraph()
and dispatch()
differently to make it clearer their semantics differ. I think this would improve developer ergonomics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add more descriptive text into describing these two methods in more detail. I dont think just changing the param name will make any material difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I'll encourage other reviewers to take a look at this PR and given their opinion on it relative to #255. I hope we can come to an agreement on this important technical topic soon. It has been a while in the making.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please look at the latest commit. I've removed MLCommandEncoder
and let computeAsync
call handle both the CPU and GPU async use case per discussion with @RafaelCintron and @bbernhar. Detail please refer to this reply #255 (comment)
index.bs
Outdated
{{MLContext/compute()}} method represents a way the execution of the graph is carried out immediately | ||
on the calling thread, which must also be a worker thread. The execution produces the results of the computation | ||
from all the inputs bound to the graph. This type of execution is limited only to when the computational device | ||
bound to the context is a CPU device. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as noted in #255, I'm still unsure why this is limited to CPU execution.
That restriction means the only way to use that API is in a try … catch
structure with a fallback to the async version, since the developer has no way to know if the the MLContext
it obtained is bound to a CPU or not (afaik).
Given that one of the premises of having the sync-version is to make it usable when converting sync-only code-bases, I don't think this fulfills the requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dontcallmedom The GPU execution is inherently async but not on the app's timeline. It's async on the GPU timeline. So having a sync call on a GPU device would mean stalling the CPU until the GPU queue has a chance to be executed. That is undesirable and none of our known use case will do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that this would happen on a DedicatedWorker, I don't think having the worker blocked from doing other CPU work is particularly an issue (the developer can choose to use the said worker only for things that need to wait for the ML processing to happen; or they should use async) - it's similar to what happens when using the FileReaderSync methods.
I see that you argue in https://github.com/webmachinelearning/webnn/pull/251/files#r832248425 that device selection should not be a preference or a hint, but be normative; if so, that would address the gist of my concern in terms of API shape, although I'm not sure that interpretation is inline with what we've been discussing so far (or at least with what I've understood of our intents).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TF.js use case on a GPU device is such that the execution call is done on the main thread and not the worker thread. However, the compute result is deferred until the GPU executes the result and the readback occurs. The reason the sync API is limited to only the worker thread is for a different purpose. It's because even for the CPU execution, the main thread should never block.
Again, this change assumes that the app knows what type of MLContext it creates. WebNN makes that assumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the app knows what type of MLContext it creates. WebNN makes that assumption.
To clarify my point - I'm not arguing here (as I did in #255) that the MLContext
may be coming from a third party library. My assertion is that as a developer, even if I set MLDdevicePreference
to CPU, the spec gives no guarantee that I'll get back a CPU-based MLContext
, and doesn't even expose if the MLContext
I got back is CPU-based or not. So as a developer, I don't know if I can or cannot execute the sync version of the API, except by actually executing it and getting an exception (which I don't think is a practical API to use).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I commented in #255, I think it should support GPU as well. It would allow the caller who requires sync API could also access GPU, e.g., TFLite WebNN delegate. When the compute device is GPU, it would block the calling thread until the result is readback into the output ArrayBuffers. This would not impact the responsiveness of the main (UI) thread as its usage is restricted in the DedicatedWorker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll defer this to @RafaelCintron who insisted that GPU sync call is bad for user experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's bring this to a separate issue - I think I would keep
"cpu"
as a name and instead make it the WebIDL-default value ofMLContextOptions.devicePreference
; but my sense is that the primary question around such a change is more about the impact of makingdevicePreference
no longer a hint but a setting.
In the latest commit, I've removed the default
option and rename MLDevicePreference
to MLDeviceType
to make it more clear that it is not merely a hint. @dontcallmedom Please take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we agree sync call is not bound to cpu device, I would suggest to keep the default
option that allows user agent to select the best device (cpu, gpu or ai accelerator) for application. And for this type of ML Context, always use ArrayBufferView
as compute input and output. The sync version of compute is restricted in worker to avoid block the UI thread. The async version of compute is allowed to use in both main thread and worker. I think this design would support most common usages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way you explain it here is not really "default" but rather "auto" option i.e. automatic device selection. This is a hard subject with potential backward and forward compatibility guarantee involved. Also the definition of "best" varies greatly depending on what the application is trying to do. Generally speaking, this type of policy behavior should not belong to the backend component like the WebNN API. Rather, it should be part of the application logic or in the framework on behalf of the application.
"default" and "auto" aren't really the same thing. Default is a well-defined behavior that is chosen to be a default behavior when nothing more explicit is specified, while the behavior of auto
is dynamic implying the notion of "best fit", a more open-ended policy.
I prefer that we discuss automatic device selection as a separate issue.
index.bs
Outdated
{{MLContext/compute()}} method represents a way the execution of the graph is carried out immediately | ||
on the calling thread, which must also be a worker thread. The execution produces the results of the computation | ||
from all the inputs bound to the graph. This type of execution is limited only to when the computational device | ||
bound to the context is a CPU device. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I commented in #255, I think it should support GPU as well. It would allow the caller who requires sync API could also access GPU, e.g., TFLite WebNN delegate. When the compute device is GPU, it would block the calling thread until the result is readback into the output ArrayBuffers. This would not impact the responsiveness of the main (UI) thread as its usage is restricted in the DedicatedWorker.
c0f66c5
to
26d5bc1
Compare
index.bs
Outdated
MLPowerPreference powerPreference = "default"; | ||
GPUDevice gpuDevice = null; | ||
WebGLRenderingContext glContext = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, developer may set conflicting options, e.g., {deviceType: 'cpu', gpuDevice: device}
. The existing design won't let developers to make such conflicting case. Actually these options exclusively used to create ML context of three types: default, WebGPU-based, WebGL-based. I suppose the existing design reflects the usages more clearly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When deviceType
is set to anything but "gpu", the gpuDevice
even if given is ignored. I don't think it is uncommon to have cross-member dependencies considering that every field in the MLContextOptions
is already an optional field. By doing it this way, we avoid having multiple overload createContext
methods on the ML
interface.
index.bs
Outdated
{{MLContext/compute()}} method represents a way the execution of the graph is carried out immediately | ||
on the calling thread, which must also be a worker thread. The execution produces the results of the computation | ||
from all the inputs bound to the graph. This type of execution is limited only to when the computational device | ||
bound to the context is a CPU device. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we agree sync call is not bound to cpu device, I would suggest to keep the default
option that allows user agent to select the best device (cpu, gpu or ai accelerator) for application. And for this type of ML Context, always use ArrayBufferView
as compute input and output. The sync version of compute is restricted in worker to avoid block the UI thread. The async version of compute is allowed to use in both main thread and worker. I think this design would support most common usages.
index.bs
Outdated
<tr><td>cpu<td>Yes<td>No<td>No<td>No<td>No | ||
<tr><td>gpu (GPUDevice == null)<td>Yes<td>No<td>No<td>No<td>No | ||
<tr><td>gpu (GPUDevice != null)<td>Yes<td>Yes<td>Yes<td>No<td>No |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the MLContext
is created from a GPUDevice
, I would suggest it only accepts the GPUBuffer
and GPUTexture
. This design would help simplify the spec / implementation of WebNN API and leave the ArrayBufferView
uploading / readback to WebGPU API where it is already well defined.
Summarize my points, there should be three types of
I have an open for the execution method for WebGL interop usage: should it be sync API and be available in both main thread and worker given GPU execution is async by natural? But how would it happen and be implemented? It seems to me there is lack of investigation on WebGL interop. So should we remove the WebGL interop support and leave it for following PR? |
@huningxin The On a separate issue, a design that allows synchronous GPU execution should be avoided because the CPU could be left needlessly stalled for a long period of time. Similarly, allowing synchronous CPU execution on the main thread should also be avoided as it could disrupt the user experience. The synchronous |
@wchao1115 , my point is the spec should allow developers to specify their intent that "select the best device automatically". For that use case, the "auto" proposal from @yuhonglin is probably a good fit.
I don't think this is an issue, because the sync As my understanding, the sync |
Can we keep the 'auto' proposal as a separate issue for now? This is a very hard topic that to my knowledge none of the leading cross-vendor ML platform has fully solved. The only known case when this is supported today is CoreML and that only works on Apple M1 hardware platform. As for supporting sync call on the GPU device on the worker thread, I would like to understand a bit more of the current use case. Since the GPU execution is async by nature (but on the GPU timeline), it seems odd that a worker thread would want to block for it. |
As I mentioned, the sync |
Thanks @huningxin. I think it makes sense. Can we assume that the TFLite delegate use case only happen in the worker thread? |
I suppose so. People experienced unresponsiveness issue by inferring large model with TF.js Wasm backend (same applies to TFLite Wasm backend) in main thread and request to move all heavy processing to worker thread. /cc @pyu10055 Once this PR landed, we can update the design of TFLite WebNN delegate to reflect this restriction. WDYT? |
Ok. Sounds good. I'll make that change. |
…text only supports CPU inputs and outputs (automatic upload/download). Reintroduce MLCommandEncoder for WebGPU interop.
Commit ef9262b contains the following changes. Hopefully this is the last for this PR.
This change is unlikely to make everyone happy, that solution probably doesn't exist. But I hope it's a reasonable enough compromise where everyone gets something they want. The net result of completing this change at this stage of the draft spec is certainly going to be far better than nothing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks much @wchao1115 for this update.
Allow GPU sync execution (limited to worker thread calling thread).
+1
Both sync and async execution only accepts ArrayBufferView
+1
Device type must be explicit, default to CPU device.
+1
Reintroduce MLCommandEncoder.
+1 with an open on constants initialization. Please take a look.
… clarify graph initialization stage and remove the unnecessary second param.
…efault GPU context.
I like the use of Since WebNN has no actual dependency on WebGPU's execution model, such a justification to WebGPU WG will be very hard and unlikely anyway and WebNN Interop would allow progress... |
I suppose not. I think the use case, such as video background blur on GPU (#249) , would drive the requirement of WebGPU interop. As the investigation of that use case, the apps can do this with WebGPU-only pipeline. They may want to plug WebNN into that pipeline for better ML compute performance. The
WebNN interop is a good perspective. As we discussed in the last WG call, @wchao1115 already did a great job in this PR that gives a clear separation between WebNN standalone usage and WebGPU interop. I believe it would allow more efficient iterations for each usage respectively. So I would suggest to merge this PR first and file a separate issue for WebNN interop proposal. WDYT? |
Interop can go both ways. Plugging WebGPU command buffers into WebNN achieves the exact same result; except, it doesn't break WebGPU and you can progress immediately. If not, I suggest we define it as |
Thanks @bbernhar for your feedback. I am interested in knowing more. Could you please file a separate issue? So we can focus on it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks again for @wchao1115 's great work.
…#257) SHA: 4c5b70b Reason: push, by @wchao1115 Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@huningxin Done. |
Alternative design to #255 for various graph execution methods, with both immediate and async execution methods folded into the context interface, making it easier for developer to associate execution method to the context created with different options. The queued execution method, however, is left to a separate
MLCommandEncoder
interface as it requires multi-step calling pattern that ends with thefinish
method, consistent with the WebGPU GPU command encoder calling pattern.Preview | Diff