Add MLBuffer exploration doc #541
# `MLBuffer` Exploration

By @a-sully

## What is this?

This is an exploration - primarily via code samples of key use cases - of what
ML compute might look like using a device-agnostic buffer, as proposed in
[#482](https://github.com/webmachinelearning/webnn/issues/482) as `MLBuffer`.

This is not intended to be a formal explainer, though it could become one if
that would be useful. My intention here is to describe our priorities (such that
we can ensure the design satisfies these priorities), bring attention to some
open questions and related issues, toss around some ideas, and encourage
discussion about how this proposal will be specified.

## Goals

- Minimize round-trips to JavaScript/CPU needed for synchronization of work on
  buffers which may not live on the CPU
- Minimize buffer copies
  - In particular, we should support zero-copy buffer sharing between WebNN and
    WebGPU if this is supported by the underlying hardware
- Support the XPU (i.e. CPU, GPU, NPU, TPU, etc.) with one consistent API
- Follow recommended [design
  principles](https://w3ctag.github.io/design-principles/)
  - In my opinion, this likely entails [mirroring WebGPU's design
    decisions](https://w3ctag.github.io/design-principles/#naming-consultation),
    where appropriate

## Overarching Questions

Many of these questions are not _specific_ to `MLBuffer`, but are important
enough that their answers will strongly influence the shape of the `MLBuffer`
proposal.

- What are WebNN's timelines and how do they interact with WebGPU's timelines?
  See [#529](https://github.com/webmachinelearning/webnn/issues/529)
- Where will an `MLBuffer`'s memory be allocated on systems where an `MLContext`
  may not be as closely tied to a given physical device as an
  [`IDMLDevice`](https://learn.microsoft.com/en-us/windows/win32/api/directml/nn-directml-idmldevice)?
  See [#350](https://github.com/webmachinelearning/webnn/issues/350)
- How will errors be surfaced? See
  [#477](https://github.com/webmachinelearning/webnn/issues/477). Do we need a
  concept similar to [WebGPU's error
  scopes](https://www.w3.org/TR/webgpu/#error-scopes)?
- Must an `MLBuffer` only be used with the `MLContext` it was created from
  (or `MLGraph`s created from that `MLContext`, and so forth)?
- If what we're building is a device-agnostic buffer, it will surely be used for
  things other than ML (in the long run). In the spirit of
  [future-proofing](https://w3ctag.github.io/design-principles/#naming-future-proofing),
  should we name it something other than `MLBuffer`?

## Use Case: Chained Inference

Here's a code sample showing how `MLBuffer`s can be used for chained inference
and then read back to an `ArrayBuffer`:

```js
// Create new MLBuffers to be used for chained inference.
const inputMlBuffer = mlContext.createBuffer({inputSize});
const intermediateMlBuffer = mlContext.createBuffer({intermediateSize});
const outputMlBuffer = mlContext.createBuffer({outputSize});

// Copy the contents of an ArrayBuffer into an MLBuffer, to be later used as inputs.
mlContext.writeBuffer(
  inputMlBuffer,
  /*dstOffset=*/0,
  /*srcData=*/someJsArrayBuffer,
);

// Perform some ✧*✧* machine learning *✧*✧ described by `graph`.
mlContext.dispatch(
  graph,
  /*inputs=*/{buffer: inputMlBuffer},
  /*outputs=*/{buffer: intermediateMlBuffer},
);

// Feed the output of one execution as the input to the next. Chained inference!
mlContext.dispatch(
  anotherGraph,
  /*inputs=*/{buffer: intermediateMlBuffer},
  /*outputs=*/{buffer: outputMlBuffer},
);

// Read back the results to script.
const resultBuffer = await outputMlBuffer.mapAsync();
```

Let's dive into what happens at each of these steps:

### `MLBuffer` creation

```js
const inputMlBuffer = mlContext.createBuffer({inputSize});
```

#### How it works:

- Enqueue a request on some WebNN timeline to allocate memory on the device
  associated with `mlContext`
- The memory allocation will be zeroed (as it is for [WebGPU's `createBuffer()`
  method](https://www.w3.org/TR/webgpu/#dom-gpudevice-createbuffer))

#### Questions:

- Can an `MLBuffer`'s size always be known at the time of buffer allocation?
  - In this case and many other cases it seems possible; it's presumably a
    function of the model and/or video input. But since WebNN always rents a
    buffer to WebGPU - never the other way around - this introduces a constraint
    that the size of an `MLBuffer` must always be known at the time of buffer
    allocation
- When will `inputMlBuffer` be deallocated if `destroy()` is not called?

### Writing to an `MLBuffer`

```js
mlContext.writeBuffer(
  inputMlBuffer,
  /*dstOffset=*/0,
  /*srcData=*/someJsArrayBuffer,
);
```

#### How it works:

- Enqueue a request on some WebNN timeline to copy the contents of
  `someJsArrayBuffer` to `inputMlBuffer`. This is very similar to [the
  corresponding WebGPU
  method](https://www.w3.org/TR/webgpu/#dom-gpuqueue-writebuffer), though the
  implementation details will vary depending on which device `inputMlBuffer` is
  allocated on. For example, if allocated on:
  - a CPU, the buffer contents will be copied directly (i.e. `memcpy()`)
  - a GPU, the behavior will likely match `GPUQueue.writeBuffer()`. On UMA
    systems, a `memcpy()` might suffice. Other implementations may use a hidden
    "upload" buffer to get the data onto the GPU. This implies two copies:
    *`ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*
  - an XPU... it depends!
- `someJsArrayBuffer` is unaffected, since the bytes are copied
  - Note that the aforementioned copies are _in addition_ to any copies needed
    to get the data into the `ArrayBuffer` in the first place. If the data is
    weights being read from a `File`, for example, this will require first
    copying the bytes from the `File` into the `ArrayBuffer`. This means
    **copying the weights into GPU-accessible memory could take as many as four
    copies!**

#### Questions:

- Should there be a corresponding
  [`mappedAtCreation`](https://www.w3.org/TR/webgpu/#dom-gpubufferdescriptor-mappedatcreation)
  capability?
  - If the data is not already in an `ArrayBuffer`, this eliminates the data
    copy into an `ArrayBuffer` altogether, since we could write to the "upload"
    buffer directly:
    ```js
    const mlBuffer = mlContext.createBuffer({size, mappedAtCreation: true});

    const floatArray = new Float32Array(mlBuffer.getMappedRange());

    // Write to `floatArray`
    // ...

    // Write the buffer contents to the XPU
    mlBuffer.unmap();
    ```
    Before: *some source → `ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*
    After: *some source → "upload" buffer → high-GPU-bandwidth buffer*
- Should there be the equivalent of
  [`MAP_WRITE`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-map_write) +
  [`COPY_SRC`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-copy_src) for
  `MLBuffer`s?
  - If we know the usage of `inputMlBuffer` (e.g. that it's read-only by WebNN)
    then we may be able to eliminate the data copy from the "upload" buffer to
    the high-GPU-bandwidth buffer in the non-UMA case:
    ```js
    const mlBuffer = mlContext.createBuffer({size, usage: MAP_WRITE | INPUT_ONLY});
    ```
    This may not make a difference for DirectML, which appears to require [bound
    resources](https://learn.microsoft.com/en-us/windows/win32/api/directml/nf-directml-idmlbindingtable-bindpersistentresource)
    to use `D3D12_HEAP_TYPE_DEFAULT`, but it could eliminate a copy on other
    systems. I'm not familiar enough with other systems to know the answer here!
  - Combining this with the above techniques brings (as many as) 4 copies down
    to (as few as) 2:
    Before: *some source → `ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*
    After: *some source → "upload" buffer*

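The copy arithmetic above can be made concrete with a small accounting sketch. Nothing here is WebNN API; `countCopies` and the stage names are invented for illustration, and the convention that producing the bytes at the source (e.g. reading them out of a `File`) counts as one copy is an assumption chosen to match the totals quoted above:

```javascript
// Illustrative accounting only, not WebNN API. Assumption: producing the bytes
// at the source costs one copy, plus one copy per "→" hop between stages.
const countCopies = (stages) => 1 + (stages.length - 1);

// Worst case: weights land in an ArrayBuffer, are staged in an "upload"
// buffer, then copied into the high-GPU-bandwidth buffer.
const before = ['some source', 'ArrayBuffer', '"upload" buffer',
                'high-GPU-bandwidth buffer'];
// With mappedAtCreation plus usage flags, script writes the "upload" buffer
// directly and WebNN can (on some systems) bind it as-is.
const after = ['some source', '"upload" buffer'];

console.log(countCopies(before)); // 4 - "as many as four copies"
console.log(countCopies(after));  // 2 - "(as few as) 2"
```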
### Execute an `MLGraph`

```js
mlContext.dispatch(
  graph,
  /*inputs=*/{buffer: inputMlBuffer},
  /*outputs=*/{buffer: intermediateMlBuffer},
);
```

#### How it works:

- Enqueues a request to compute the graph onto some WebNN timeline
- Execution cannot start until all input and output `MLBuffer`s are available
- All input and output `MLBuffer`s are unavailable while execution is in
  progress
- All work submitted after this `dispatch()` call which relies on an input or
  output `MLBuffer` will be queued behind this execution

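As a thought experiment, the queueing behavior above can be modeled as a single FIFO promise chain. `FakeTimeline` is invented for this sketch and is not part of any proposal; it only illustrates why chained `dispatch()` calls and a final readback need no JavaScript round-trips in between:

```javascript
// Hypothetical model of a WebNN timeline as a FIFO promise chain. Each piece
// of work that touches a buffer is appended to the chain, so later work
// observes all earlier work without script awaiting in between.
class FakeTimeline {
  constructor() {
    this.tail = Promise.resolve();
    this.log = [];
  }
  // Append `label` to the chain; resolves once all prior work has finished.
  enqueue(label) {
    this.tail = this.tail.then(() => { this.log.push(label); });
    return this.tail;
  }
}

const timeline = new FakeTimeline();
// Two chained dispatches are enqueued back-to-back; script never waits.
timeline.enqueue('dispatch(graph)');
timeline.enqueue('dispatch(anotherGraph)');
// The readback resolves only after everything queued ahead of it.
timeline.enqueue('mapAsync readback').then(() => {
  console.log(timeline.log.join(', '));
  // → dispatch(graph), dispatch(anotherGraph), mapAsync readback
});
```

Because each operation is appended to the same chain, ordering is enforced on the (modeled) timeline rather than by `await`ing in JavaScript between steps.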
#### Questions:

- This approach is flexible enough to allow for graph execution on all backends.
  Do we need a separate `compute()` method?
- Should this method be on the `MLGraph` (related to
  [#303](https://github.com/webmachinelearning/webnn/issues/303))? Is there a
  use case not satisfied by the following?
  ```js
  graph.dispatch(
    /*inputs=*/{buffer: inputMlBuffer},
    /*outputs=*/{buffer: intermediateMlBuffer},
  );
  ```
- Is it valid to pass the same `MLBuffer` as both an input and output of the
  same `dispatch()` call? e.g.
  ```js
  graph.dispatch(
    /*inputs=*/{buffer: someMlBuffer},
    /*outputs=*/{buffer: someMlBuffer},
  );
  ```

### Read back data from an `MLBuffer`

```js
const resultBuffer = await outputMlBuffer.mapAsync();
```

#### How it works:

- After the completion of all currently-enqueued operations that use
  `outputMlBuffer`, WebNN will copy the contents of `outputMlBuffer` to
  `resultBuffer`. This is very similar to
  [`GPUBuffer.mapAsync()`](https://www.w3.org/TR/webgpu/#dom-gpubuffer-mapasync),
  with a key difference being that WebGPU only allows `mapAsync()` on buffers
  which have the `MAP_READ` usage flag. In this case, if using a GPU, we may
  need to create an intermediate "readback" buffer to facilitate the transfer.
  This may require two copies:
  *high-GPU-bandwidth buffer → "readback" buffer → `ArrayBuffer`*
  (On a UMA GPU, the implementation may be able to eliminate the "readback"
  buffer and need only one copy.)

#### Questions:

- What should this method be called? I've proposed `mapAsync()` here to mirror
  WebGPU since the behavior is very similar.
- Should there be the equivalent of
  [`MAP_READ`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-map_read) +
  [`COPY_DST`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-copy_dst) for
  `MLBuffer`s?
  - If we know the usage of `outputMlBuffer` (e.g. that it's
    write-once by WebNN) then we could eliminate the data copy from the
    high-GPU-bandwidth buffer to the "readback" buffer in the non-UMA case:
    ```js
    // The buffer may be allocated on a "readback" buffer
    const mlBuffer = mlContext.createBuffer({size, usage: MAP_READ | OUTPUT_ONLY});

    // `mlBuffer` may be used as an output to MLGraph execution
    // ...

    // Read back with fewer data copies!
    const resultBuffer = await mlBuffer.mapAsync();
    ```
    Again, this will not help DirectML and may or may not help other systems.

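To summarize the readback paths discussed above, here is a hedged decision sketch. `readbackCopies`, the device names, and the `OUTPUT_ONLY` flag are illustrative stand-ins for implementation choices (the flag is only a proposal above), not specified behavior:

```javascript
// Illustrative only: how an implementation *might* choose the mapAsync()
// readback path depending on where the MLBuffer lives and its usage flags.
function readbackCopies({device, usage}) {
  // CPU-backed buffer: a single memcpy() into the ArrayBuffer.
  if (device === 'cpu') return 1;
  // UMA GPU: shared memory may let the implementation skip the staging buffer.
  if (device === 'uma-gpu') return 1;
  // Discrete GPU: high-GPU-bandwidth buffer → "readback" buffer → ArrayBuffer,
  // unless the buffer was allocated on a "readback" heap up front because the
  // implementation knew it was write-once output.
  return usage.includes('OUTPUT_ONLY') ? 1 : 2;
}

console.log(readbackCopies({device: 'dgpu', usage: []}));              // 2
console.log(readbackCopies({device: 'dgpu', usage: ['OUTPUT_ONLY']})); // 1
console.log(readbackCopies({device: 'uma-gpu', usage: []}));           // 1
```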
## Use Case: WebGPU Interop

Here's a code example in which WebNN performs selfie segmentation on a video
frame without needing round-trips to JavaScript to synchronize WebNN and WebGPU
compute:

```js
const applyEffectToFrame = () => {
  const gpuVideoTexture = gpuDevice.importExternalTexture({source: video});

  // Create a new MLBuffer to be used to facilitate WebGPU interop.
  //
  // Note that a more optimized implementation might allocate this buffer - or a
  // ring of buffers - ahead of time such that memory can be reused.
  const tensorizedMlBuffer = mlContext.createBuffer({size: tensorizedBufferSize});

  // Rent out the MLBuffer to WebGPU.
  const tensorizedGpuBuffer = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);

  // Create a bind group for `gpuVideoTexture`, create a command encoder, etc.
  // to "tensorize" `gpuVideoTexture` and store the result in `tensorizedGpuBuffer`
  // ...

  gpuDevice.queue.submit([tensorizationCommandEncoder.finish()]);

  // Return the buffer to WebNN.
  tensorizedMlBuffer.unmapFromGpuBuffer();

  // Perform some inference described by `graph` on the frame
  // (e.g. selfie segmentation)
  mlContext.dispatch(
    graph,
    /*inputs=*/{buffer: tensorizedMlBuffer},
    /*outputs=*/{buffer: tensorizedMlBuffer},
  );

  // Rent the MLBuffer back out to WebGPU.
  const tensorizedGpuBufferAfterInference = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);

  // Create a bind group for `tensorizedGpuBufferAfterInference`,
  // create a command encoder, etc. to feed `tensorizedGpuBufferAfterInference`
  // into a GPU shader which may blur the frame or replace background sections,
  // and then render the result
  // ...

  gpuDevice.queue.submit([texturizeAndRenderCommandEncoder.finish()]);

  // Call this method for each frame.
  video.requestVideoFrameCallback(applyEffectToFrame);
};
```

Let's again dive into what happens at each of these steps, skipping those
already covered above:

### Rent out an `MLBuffer` to WebGPU | ||||||||||||||||||
|
||||||||||||||||||
```js | ||||||||||||||||||
const tensorizedGpuBuffer = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice); | ||||||||||||||||||
``` | ||||||||||||||||||
|
||||||||||||||||||
#### How it works: | ||||||||||||||||||
|
||||||||||||||||||
- Two fences are created: | ||||||||||||||||||
1. a "start access" fence which is to be signaled by WebNN and waited on by | ||||||||||||||||||
WebGPU | ||||||||||||||||||
2. an "end access" fence which is to be signaled by WebGPU and waited on by | ||||||||||||||||||
WebNN | ||||||||||||||||||
- `gpuDevice` enqueues a command to its `GPUQueue` to wait for the "start | ||||||||||||||||||
access" fence to be signaled | ||||||||||||||||||
- WebNN (on some queue or timeline yet to be specified) will signal the "start | ||||||||||||||||||
access" fence after the completion of all currently-enqueued operations that | ||||||||||||||||||
use `tensorizedMlBuffer`. This is very similar to how `mapAsync()` works | ||||||||||||||||||
- In this case, there is only one currently-enqueued operation: | ||||||||||||||||||
`MLContext.createBuffer()` | ||||||||||||||||||
- In the latter `mapAsGpuBuffer()` call, the "start access" fence will not be | ||||||||||||||||||
signaled by WebNN until the `dispatch()` call is complete. This implicitly | ||||||||||||||||||
blocks execution of the commands in `texturizeAndRenderCommandEncoder` that | ||||||||||||||||||
are enqueued to WebGPU until WebNN is finished with `tensorizedMlBuffer` | ||||||||||||||||||
- WebNN will wait for the "end access" fence to be signaled. In the meantime, | ||||||||||||||||||
all work involving `tensorizedMlBuffer` is blocked | ||||||||||||||||||
- `gpuDevice` has exclusive, read/write access to this memory for as long as the | ||||||||||||||||||
"end access" fence is not signaled | ||||||||||||||||||
- If `tensorizedMlBuffer` was allocated in memory shared by `gpuDevice`, this | ||||||||||||||||||
will be a zero-copy mapping. Otherwise a new buffer will be allocated on | ||||||||||||||||||
`gpuDevice` and the contents of `tensorizedMlBuffer` will be copied into this | ||||||||||||||||||
buffer | ||||||||||||||||||
- The memory backing `tensorizedMlBuffer` becomes inaccessible to WebNN (or | ||||||||||||||||||
script, or anything else), regardless of whether a copy is made. | ||||||||||||||||||
- Ideally these states and their transitions can be expressed similarly to a | ||||||||||||||||||
`GPUBuffer`'s [internal | ||||||||||||||||||
state](https://www.w3.org/TR/webgpu/#buffer-internals-state) | ||||||||||||||||||
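
As a non-normative illustration, the two-fence handoff described above can be
modeled with Promises, where each fence is a Promise that its owner resolves
once its queue drains. All names here (`createFence`, `rentOut`, the task
queues) are hypothetical; a real implementation would use device-level fences,
not JavaScript Promises:

```javascript
// Hypothetical model of the two-fence handoff between WebNN and WebGPU.
// A "fence" is modeled as a Promise plus its resolver.
function createFence() {
  let signal;
  const signaled = new Promise((resolve) => { signal = resolve; });
  return { signal, signaled };
}

// Each "queue" is an ordered list of async tasks, run one after another.
async function drainQueue(tasks) {
  for (const task of tasks) await task();
}

// Models one `mapAsGpuBuffer()` + `unmapFromGpuBuffer()` round trip.
async function rentOut(log, webnnTasks, webgpuTasks) {
  const startAccess = createFence(); // signaled by WebNN, awaited by WebGPU
  const endAccess = createFence();   // signaled by WebGPU, awaited by WebNN

  // WebNN timeline: finish currently-enqueued work touching the buffer,
  // then hand off; all further WebNN work is blocked on "end access".
  const webnnTimeline = (async () => {
    await drainQueue(webnnTasks);
    startAccess.signal();
    await endAccess.signaled;
    log.push('webnn: buffer returned');
  })();

  // WebGPU queue: wait for the handoff, run its enqueued commands against
  // the mapped buffer, then hand the buffer back.
  const webgpuTimeline = (async () => {
    await startAccess.signaled;
    await drainQueue(webgpuTasks);
    endAccess.signal();
  })();

  await Promise.all([webnnTimeline, webgpuTimeline]);
  return log;
}
```

Running the model with one WebNN task and one WebGPU task shows the enforced
ordering: WebNN's work completes before WebGPU's, and WebNN does not regain
access until WebGPU is done.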
||||||||||||||||||
#### Questions: | ||||||||||||||||||
||||||||||||||||||
- What are the usage flags of `tensorizedGpuBuffer`? | ||||||||||||||||||
- While `tensorizedMlBuffer` is rented out to WebGPU as `tensorizedGpuBuffer`: | ||||||||||||||||||
- What happens if `destroy()` is called on `tensorizedMlBuffer`? | ||||||||||||||||||
- What happens if `destroy()` is called on `tensorizedGpuBuffer`? | ||||||||||||||||||
||||||||||||||||||
### Return a rented-out `MLBuffer` back to WebNN | ||||||||||||||||||
||||||||||||||||||
```js | ||||||||||||||||||
tensorizedMlBuffer.unmapFromGpuBuffer(); | ||||||||||||||||||
``` | ||||||||||||||||||
||||||||||||||||||
#### How it works: | ||||||||||||||||||
||||||||||||||||||
- If `tensorizedMlBuffer` was allocated in memory shared by `gpuDevice`, this | ||||||||||||||||||
will be a zero-copy unmapping. Otherwise the contents of `tensorizedGpuBuffer` | ||||||||||||||||||
will be copied into `tensorizedMlBuffer` | ||||||||||||||||||
- Informs `gpuDevice` to signal the "end access" fence created in the | ||||||||||||||||||
`mapAsGpuBuffer()` method after the completion of currently-enqueued | ||||||||||||||||||
operations that use `tensorizedGpuBuffer`. This is very similar to how | ||||||||||||||||||
`mapAsync()` works | ||||||||||||||||||
- The WebNN timeline receives the signal and may resume execution | ||||||||||||||||||
- WebNN has exclusive, read/write access to this memory until further notice | ||||||||||||||||||
- `tensorizedGpuBuffer` becomes
  [expired](https://gpuweb.github.io/gpuweb/#dom-gpuexternaltexture-expired-slot),
  in the sense of `GPUExternalTexture`'s `[[expired]]` slot
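
If these states and transitions are expressed similarly to a `GPUBuffer`'s
internal state, the rent/return lifecycle might look like the following sketch.
The class, state names, and the `expired` flag here are hypothetical stand-ins,
not proposed API:

```javascript
// Hypothetical state model for an MLBuffer's rent/return lifecycle,
// loosely mirroring GPUBuffer's internal state machine.
class MLBufferStateModel {
  constructor() {
    this.state = 'available';    // WebNN has exclusive read/write access
  }
  mapAsGpuBuffer() {
    if (this.state !== 'available') {
      throw new Error(`cannot map while ${this.state}`);
    }
    this.state = 'rented-out';   // memory inaccessible to WebNN and script
    return { expired: false };   // stand-in for the returned GPUBuffer
  }
  unmapFromGpuBuffer(gpuBuffer) {
    if (this.state !== 'rented-out') {
      throw new Error(`cannot unmap while ${this.state}`);
    }
    gpuBuffer.expired = true;    // like GPUExternalTexture's [[expired]]
    this.state = 'available';    // WebNN regains exclusive access
  }
  destroy() {
    // Open question: what should happen if called while 'rented-out'?
    this.state = 'destroyed';
  }
}
```

Under this model, mapping twice without an intervening unmap is an error, and
the GPU-side handle expires as soon as the buffer is returned to WebNN.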
||||||||||||||||||
#### Questions: | ||||||||||||||||||
||||||||||||||||||
- What happens to `tensorizedMlBuffer` if this method is never called? |
> **Review comment:** I suppose the size should be known before the
> allocation. WebNN only supports static-shape tensors (i.e. the shape can be
> queried by `MLOperand.shape()`).
>
> **Reply:** I agree that it seems reasonable that the size should always be
> known. I mostly just wanted to explicitly call it out as a constraint of
> this proposal :)