Add MLBuffer exploration doc #541

Closed
395 changes: 395 additions & 0 deletions mlbuffer-exploration.md
@@ -0,0 +1,395 @@
# `MLBuffer` Exploration

By @a-sully

## What is this?

This is an exploration - primarily via code samples of key use cases - of what
ML compute might look like using a device-agnostic buffer, as proposed in
[#482](https://github.com/webmachinelearning/webnn/issues/482) as `MLBuffer`.

This is not intended to be a formal explainer, though it could become one if
that would be useful. My intention here is to describe our priorities (such that
we can ensure the design satisfies these priorities), bring attention to some
open questions and related issues, toss around some ideas, and encourage
discussion about how this proposal will be specified.

## Goals

- Minimize round-trips to JavaScript/CPU needed for synchronization of work on
buffers which may not live on the CPU
- Minimize buffer copies
- In particular, we should support zero-copy buffer sharing between WebNN and
WebGPU if this is supported by the underlying hardware
- Support the XPU (i.e. CPU, GPU, NPU, TPU, etc...) with one consistent API
- Follow recommended [design
principles](https://w3ctag.github.io/design-principles/)
- In my opinion, this likely entails [mirroring WebGPU's design
decisions](https://w3ctag.github.io/design-principles/#naming-consultation),
where appropriate

## Overarching Questions

Many of these questions are not _specific_ to `MLBuffer`, but are important
enough that their answers will strongly influence the shape of the `MLBuffer`
proposal.

- What are WebNN's timelines and how do they interact with WebGPU's timelines?
See [#529](https://github.com/webmachinelearning/webnn/issues/529)
- Where will an `MLBuffer`'s memory be allocated on systems where an `MLContext`
may not be as closely tied to a given physical device as an
[`IDMLDevice`](https://learn.microsoft.com/en-us/windows/win32/api/directml/nn-directml-idmldevice)?
See [#350](https://github.com/webmachinelearning/webnn/issues/350)
- How will errors be surfaced? See
[#477](https://github.com/webmachinelearning/webnn/issues/477). Do we need a
concept similar to [WebGPU's error
scopes](https://www.w3.org/TR/webgpu/#error-scopes)?
- Must an `MLBuffer` only be used with the `MLContext` it was created from?
(or `MLGraph`s created from that `MLContext`, and so forth)
- If what we're building is a device-agnostic buffer, it will surely be used for
things other than ML (in the long run). In the spirit of
[future-proofing](https://w3ctag.github.io/design-principles/#naming-future-proofing),
should we name it something other than `MLBuffer`?

## Use Case: Chained Inference

Here's a code sample showing how `MLBuffer`s can be used for chained inference
and then read back to an `ArrayBuffer`:

```js
// Create new MLBuffers to be used for chained inference.
const inputMlBuffer = mlContext.createBuffer({size: inputSize});
const intermediateMlBuffer = mlContext.createBuffer({size: intermediateSize});
const outputMlBuffer = mlContext.createBuffer({size: outputSize});

// Copy the contents of an ArrayBuffer into an MLBuffer, to be later used as inputs.
mlContext.writeBuffer(
    inputMlBuffer,
    /*dstOffset=*/0,
    /*srcData=*/someJsArrayBuffer,
);

// Perform some ✧*✧* machine learning *✧*✧ described by `graph`.
mlContext.dispatch(
    graph,
    /*inputs=*/{buffer: inputMlBuffer},
    /*outputs=*/{buffer: intermediateMlBuffer},
);

// Feed the output of one execution as the input to the next. Chained inference!
mlContext.dispatch(
    anotherGraph,
    /*inputs=*/{buffer: intermediateMlBuffer},
    /*outputs=*/{buffer: outputMlBuffer},
);

// Read back the results to script.
const resultBuffer = await outputMlBuffer.mapAsync();
```

Let's dive into what happens at each of these steps:

### `MLBuffer` creation

```js
const inputMlBuffer = mlContext.createBuffer({size: inputSize});
```
#### How it works:

- Enqueue a request on some WebNN timeline to allocate memory on the device
associated with `mlContext`
- The memory allocation will be zeroed (as it is for [WebGPU's `createBuffer()`
method](https://www.w3.org/TR/webgpu/#dom-gpudevice-createbuffer))
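
  For instance, zero-initialization should be observable from script. A
  minimal sketch, assuming the `mapAsync()` readback method proposed later in
  this doc:

```js
// A freshly-created MLBuffer should read back as all zeros, matching
// WebGPU's createBuffer() behavior.
const freshMlBuffer = mlContext.createBuffer({size: 16});
const contents = await freshMlBuffer.mapAsync();
console.assert(new Uint8Array(contents).every((byte) => byte === 0));
```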

#### Questions:

- Can an `MLBuffer`'s size always be known at the time of buffer allocation?
  **Comment from a contributor:** I suppose the size should be known before
  the allocation. WebNN only supports static-shape tensors (i.e. the shape can
  be queried by `MLOperand.shape()`).

  **Reply from @a-sully (author):** I agree that it seems reasonable that the
  size should always be known. I mostly just wanted to explicitly call it out
  as a constraint of this proposal :)

- In this case and many other cases it seems possible; it's presumably a
function of the model and/or video input. But since WebNN always rents a
buffer to WebGPU - never the other way around - this introduces a constraint
that the size of an `MLBuffer` must always be known at the time of buffer
allocation
- When will `inputMlBuffer` be deallocated if `destroy()` is not called?

**Comment from @huningxin (Contributor, Feb 2, 2024):** Should we consider
creating an `MLBuffer` based on its usage?

For the chained inference use case, a buffer usage could be one of the
following three:

1. Input: `ArrayBuffer` → `MLBuffer` → `MLGraph`
2. Output: `MLGraph` → `MLBuffer` → `ArrayBuffer`
3. Default: `MLGraph` → `MLBuffer` → `MLGraph`

Different backends/devices may arrange/optimize the buffer allocation
depending on its usage. For example, the current Chromium WebNN XNNPACK and
DirectML backends arrange the input and output buffers as follows:

- The XNNPACK backend allocates input buffers with `XNN_EXTRA_BYTES` because
  XNNPACK may read beyond array bounds for performance reasons. Output buffers
  don't need the extra bytes, because XNNPACK never writes beyond array bounds.
- For a UMA (unified memory architecture) GPU, the DirectML backend allocates
  input and output CPU buffers and binds them as GPU graph execution inputs
  and outputs. It also sets the CPU page property optimized for CPU writing or
  reading according to the cache architecture.
- For a NUMA (non-UMA) GPU, the DirectML backend always allocates default GPU
  buffers as GPU graph execution inputs and outputs. For input data uploads,
  it allocates a dedicated upload CPU buffer and does a two-step upload from
  CPU to GPU. Output data is read back through a dedicated readback CPU buffer
  via a two-step download from GPU to CPU.

For the default buffer:

- The XNNPACK backend may still need to allocate extra bytes because XNNPACK
  would read data from it.
- The DirectML backend may just allocate the buffer in `DEFAULT_HEAP` because
  it doesn't need CPU access.

The following table captures the buffer allocation differences by usage:

| Buffer usage | XNNPACK / CPU | DirectML / UMA GPU | DirectML / NUMA GPU |
| --- | --- | --- | --- |
| Input | buffer with `XNN_EXTRA_BYTES` | CUSTOM buffer (CPUPageProperty = CacheCoherentUMA ? WRITE_BACK : WRITE_COMBINE, MEMORY_POOL_L0) | DEFAULT buffer + UPLOAD buffer |
| Output | buffer without `XNN_EXTRA_BYTES` | CUSTOM buffer (CPUPageProperty = WRITE_BACK, MEMORY_POOL_L0) | DEFAULT buffer + READBACK buffer |
| Default | buffer with `XNN_EXTRA_BYTES` | DEFAULT buffer (MEMORY_POOL_L0) | DEFAULT buffer |

**Comment from a collaborator:** I think the main question is whether we can
keep these details implementation-specific, and use `MLBuffer` to identify the
buffers we pass between stages, eventually with dynamic labels such as
input/output/intermediate? (In this case we will have to specify behavior in
the spec in more detail.)

**Reply from @a-sully (author):**

> Should we consider creating an `MLBuffer` based on its usage?

"Usage" can mean a lot of different things (readable/writable?
input/output/intermediate? usable by WebGPU? mappable in some way?) but in
broad strokes, yes, I agree :P

We should strive to ensure the user agent has enough information when an
`MLBuffer` is created to allocate its buffer in the most optimal place,
according to how it will be used. It's worth enumerating those uses in detail.
The table is a very useful start - thanks!

> I think the main question is whether we can keep these details
> implementation-specific

IMHO it's critical that these details are opaque to the website. If creating
an "input" `MLBuffer` for the XNNPACK backend, the website should not be able
to detect that `XNN_EXTRA_BYTES` have been added. For example:

```js
const mlBuffer = mlContext.createBuffer({ usage: INPUT, size: 100 });
console.assert(mlBuffer.size === 100);
```

Similarly:

```js
const gpuBuffer = mlBuffer.mapAsGpuBuffer(gpuDevice);
console.assert(gpuBuffer.size === mlBuffer.size);

// ...

const arrayBuffer = await mlBuffer.mapAsync();
console.assert(arrayBuffer.byteLength === mlBuffer.size);
```

> eventually with dynamic labels such as input/output/intermediate

Could you elaborate more on this? :)

### Writing to an `MLBuffer`

```js
mlContext.writeBuffer(
    inputMlBuffer,
    /*dstOffset=*/0,
    /*srcData=*/someJsArrayBuffer,
);
```

#### How it works:

- Enqueue a request on some WebNN timeline to copy the contents of
`someJsArrayBuffer` to `inputMlBuffer`. This is very similar to [the
corresponding WebGPU
method](https://www.w3.org/TR/webgpu/#dom-gpuqueue-writebuffer), though the
implementation details will vary depending on which device `inputMlBuffer` is
allocated on. For example, if allocated on:
- a CPU, the buffer contents will be copied directly (i.e. `memcpy()`)
- a GPU, the behavior will likely match `GPUQueue.writeBuffer()`. On UMA
systems, a `memcpy()` might suffice. Other implementations may use a hidden
"upload" buffer to get the data onto the GPU. This implies two copies:\
*  `ArrayBuffer` → "upload" buffer → high-GPU-bandwidth
buffer*
- an XPU... it depends!
- `someJsArrayBuffer` is unaffected, since the bytes are copied
- Note that the aforementioned copies are _in addition_ to any copies needed
to get the data into the `ArrayBuffer` in the first place. If the data is
weights being read from a `File`, for example, this will require first
copying the bytes from the `File` into the `ArrayBuffer`. This means
**copying the weights into GPU-accessible memory could take as many as four
copies!**
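
For illustration, here is that worst-case path for weights loaded from a
`File` (variable names are hypothetical; `writeBuffer()` as proposed above):

```js
// Copies 1-2: getting the File's bytes into an ArrayBuffer may itself
// involve more than one copy.
const weightsArrayBuffer = await weightsFile.arrayBuffer();

// Copies 3-4 (non-UMA case, as an implementation detail):
// ArrayBuffer → "upload" buffer → high-GPU-bandwidth buffer.
mlContext.writeBuffer(
    weightsMlBuffer,
    /*dstOffset=*/0,
    /*srcData=*/weightsArrayBuffer,
);
```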

#### Questions:

- Should there be a corresponding
[`mappedAtCreation`](https://www.w3.org/TR/webgpu/#dom-gpubufferdescriptor-mappedatcreation)
capability?
- If the data is not already in an `ArrayBuffer`, this eliminates the data
copy into an `ArrayBuffer` altogether, since we could write to the "upload"
buffer directly:
```js
const mlBuffer = mlContext.createBuffer({size, mappedAtCreation: true});

const floatArray = new Float32Array(mlBuffer.getMappedRange());

// Write to `floatArray`
// ...

// Write the buffer contents to the XPU
mlBuffer.unmap();
```
  Before: *some source → `ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*\
  After: *some source → "upload" buffer → high-GPU-bandwidth buffer*
- Should there be the equivalent of
[`MAP_WRITE`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-map_write) +
[`COPY_SRC`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-copy_src) for
`MLBuffer`s?
- If we know the usage of `inputMlBuffer` (e.g. that it's read-only by WebNN)
then we may be able to eliminate the data copy from the "upload" buffer to
the high-GPU-bandwidth buffer in the non-UMA case:
```js
const mlBuffer = mlContext.createBuffer({size, usage: MAP_WRITE | INPUT_ONLY});
```
This may not make a difference for DirectML, which appears to require [bound
resources](https://learn.microsoft.com/en-us/windows/win32/api/directml/nf-directml-idmlbindingtable-bindpersistentresource)
to use `D3D12_HEAP_TYPE_DEFAULT`, but it could eliminate a copy on other
systems. I'm not familiar enough with other systems to know the answer here!
- Combining this with the above techniques brings (as many as) 4 copies down
    to (as few as) 2:\
    Before: *some source → `ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*\
    After: *some source → "upload" buffer*

### Execute an `MLGraph`

```js
mlContext.dispatch(
    graph,
    /*inputs=*/{buffer: inputMlBuffer},
    /*outputs=*/{buffer: intermediateMlBuffer},
);
```

#### How it works:

- Enqueues a request to compute the graph onto some WebNN timeline
- Execution cannot start until all input and output `MLBuffer`s are available
- All input and output `MLBuffer`s are unavailable while execution is in
progress
- All work submitted after this `dispatch()` call which relies on an input or
output `MLBuffer` will be queued behind this execution
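
Because this ordering is handled on the WebNN timeline, dependent calls need
no explicit synchronization in script. A sketch reusing the buffers from the
sample above:

```js
// The second dispatch and the readback are implicitly queued behind the
// first dispatch via their shared buffers; no `await`s are needed in between.
mlContext.dispatch(
    graph,
    /*inputs=*/{buffer: inputMlBuffer},
    /*outputs=*/{buffer: intermediateMlBuffer},
);
mlContext.dispatch(
    anotherGraph,
    /*inputs=*/{buffer: intermediateMlBuffer},
    /*outputs=*/{buffer: outputMlBuffer},
);
const result = await outputMlBuffer.mapAsync(); // resolves after both complete
```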

#### Questions:

- This approach is flexible enough to allow for graph execution on all backends.
Do we need a separate `compute()` method?
**Comment from a contributor:** They might be different if the graph has
multiple outputs. For example, the current Chromium WebNN DirectML backend
implementation allocates one default buffer and one readback buffer big enough
for all outputs. When `compute()` is done by the GPU, it copies all results
from the default buffer to the readback buffer, then to array buffers in one
shot, and resolves the promise.

For `dispatch()`, a naive implementation may need to do all of that (enqueue a
copy from the default buffer to the readback buffer, wait for the GPU copy to
complete, copy data from the readback buffer to the array buffer, resolve the
promise) for each buffer when the user calls `MLBuffer.mapAsync()`.
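
An editorial sketch of the contrast described above - `compute()` per the
current WebNN spec, `dispatch()`/`mapAsync()` as proposed in this doc, with
hypothetical input/output names:

```js
// compute(): one combined GPU→CPU readback; the promise resolves with all
// outputs at once.
const {outputs} = await mlContext.compute(graph, inputViews, outputViews);

// dispatch(): results stay in MLBuffers; each mapAsync() below may enqueue
// its own readback copy.
mlContext.dispatch(
    graph,
    /*inputs=*/{input: inputMlBuffer},
    /*outputs=*/{output0: outputMlBuffer0, output1: outputMlBuffer1},
);
const [result0, result1] = await Promise.all(
    [outputMlBuffer0.mapAsync(), outputMlBuffer1.mapAsync()]);
```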

**Comment from @bbernhar (Feb 7, 2024):** I think we need to determine if it's
a requirement [for WebNN] to have the developer transfer data to/from
`MLBuffer` using the exact same APIs, regardless of backend. If so, I don't
think we can rely on buffer mapping alone to cover upload/download for any
device type.

**Reply from @a-sully (author):**

> I think we need to determine if it's a requirement [for WebNN] to have the
> developer transfer data to/from MLBuffer using the exact same APIs,
> regardless of backend. If so, I don't think we can rely on buffer mapping
> alone to cover upload/download for any device type.

This discussion relates to Question 4 from #544:

> Can dispatch be exclusive to `MLBuffer` bindings? @bbernhar

At the core of this discussion seems to be the question of whether `MLBuffer`
should be a device-agnostic buffer, or whether it's only expected to be
allocated on the GPU or NPU for chained execution, or even just on the GPU in
the case where we want WebGPU interop (effectively making it a `GPUBuffer` in
disguise).

My earlier interpretation was that `MLBuffer` was proposing to be the former.
In the case of WebGPU interop, for example, I would naively expect that as
long as the appropriate usage flags are set, the user agent should be able to
share memory if the memory can be shared (e.g. UMA GPUs); otherwise a copy
would be made on both `mapAsGpuBuffer()` and `unmapFromGpuBuffer()`.

Reading through some of these issues again and recalling the discussion in the
last WG meeting, it seems like @bbernhar you might be proposing the latter?
Could you please clarify? :)

To clarify some things on my end... I know that WebGPU has a very specific
definition of "buffer mapping". I've been using the term much more loosely
here (apologies if that's caused any confusion); in broad strokes to mean
"transferring ownership", which may or may not require copies under the hood.
For example:

- For a CPU-based `MLBuffer`, "mapping" the `MLBuffer` to an `ArrayBuffer`
  (and vice versa) would behave very similarly to how `compute()` works today
  (see https://webmachinelearning.github.io/webnn/#mlnamedarraybufferviews-transfer).
  The user agent can make this "mapping" very cheap.
- For a dNPU-based `MLBuffer`, "mapping" the `MLBuffer` to a dGPU would require:
  - on `mapAsGpuBuffer()`, copying the data into a new buffer on the GPU and
    setting flags to e.g. mark the `MLBuffer` as inaccessible
  - on `unmapFromGpuBuffer()`, copying back the contents of the GPU buffer
    into the original buffer on the NPU and setting flags to e.g. mark the
    `MLBuffer` as accessible once again

**Reply:** Thanks @a-sully, I agree we'll need more than what I proposed to
cover all the use cases here.

> Reading through some of these issues again and recalling the discussion in
> the last WG meeting, it seems like @bbernhar you might be proposing the
> latter? Could you please clarify? :)

Since we require CPU-backed `MLBuffer`(s) to be equally efficient (avoid
copies), I don't think my original proposal to have `MLBuffer` be a (simple)
control (vs storage) object will work =(.

It makes sense to me to have WebNN rely on buffer mapping (with buffer usages)
to determine what gets copied [to/from the context/device] before or after
dispatch executes. It seems possible we could keep the buffer mapping part
like WebGPU, though.

```js
// GPU staging buffers
ml_context = ML.createContext({"gpu"});
input_buf = ml_context.createBuffer({size: 4, usage: MLBufferUsage.MAP_WRITE});
output_buf = ml_context.createBuffer({size: 4, usage: MLBufferUsage.MAP_READ});
// output_buf = ml_context.createBuffer({size: 4}); // pre-allocate on GPU only

// Populate MLBuffer input with source data
await input_buf.mapAsync(MLMapMode.WRITE);
const write_data = input_buf.getMappedRange();
/* use write_data */
input_buf.unmap();

// If dGPU,
//  * input requires GPU copy due to MLBufferUsage.MAP_WRITE
//  * output requires GPU copy due to MLBufferUsage.MAP_READ.
// If iGPU,
//  * neither input nor output requires GPU copy.
ml_context.dispatch({inputs: input_buf, outputs: output_buf});

// MLBuffer output was populated by dispatch()
await output_buf.mapAsync(MLMapMode.READ);
const read_data = output_buf.getMappedRange();
/* use read_data */
output_buf.unmap();
```

The "mappable buffer gets filled for you" magic could be the only
WebNN-specific behavior (for us WebGPU folk). Notice, it's a bit weird to see
a "map" usage also perform a "copy". WDYT about creating a new descriptor type
to clarify this transfer or copy behavior?

**Reply from @a-sully (author, Feb 12, 2024):**

> Since we require CPU-backed MLBuffer(s) to be equally efficient (avoid
> copies), I don't think my original proposal to have MLBuffer be a (simple)
> control (vs storage) object will work =(.

+1 I think this answers question 5 from #542:

> Does `MLBuffer` require explicit buffer usages (ex. input, output, or
> both)? @bbernhar

Now the question becomes: what set of usage flags is needed?

> It makes sense to me to have WebNN rely on buffer mapping (with buffer
> usages) to determine what gets copied [to/from the context/device] before or
> after dispatch executes. It seems possible we could keep the buffer mapping
> part like WebGPU, though.

+1. And this (familiar to developers!) WebGPU-like interface is flexible
enough that we should be able to provide this same interface even if the
buffer is not allocated on the GPU.

> The "mappable buffer gets filled for you" magic could be the only
> WebNN-specific behavior (for us WebGPU folk). Notice, it's a bit weird to
> see a "map" usage also perform a "copy". WDYT about creating a new
> descriptor type to clarify this transfer or copy behavior?

Seems reasonable. Hmmm... some ideas which come to mind:

- `rent` - suggests that WebNN maintains ownership even when the buffer is
  used by JS or WebGPU. And the rental may be returned
- `lend` - synonym to the above
- `transfer` - analogous to `ArrayBuffer.transfer()`, which also may require a
  copy but is zero-copy when possible
- `import` - parallels WebGPU, though perhaps a bit awkward
- `map` - if we want to re-use familiar names like `mapAsync()` then maybe
  "map" isn't so bad, even if it hides a copy

Since the relationship is asymmetric (in that WebNN always owns the buffer), I
think the naming should be, too. Some ideas for returning the buffer to WebNN
include:

- `return`
- `restore`
- `takeBack`
- `unmap` (probably only if we go with "map" above)

Feel free to suggest other terms!

**Reply from @a-sully (author):** Thanks for the sample code!

At a high level, I don't think we can:

- avoid having hidden data-synchronization across devices, and also
- have consistent usage patterns of an `MLBuffer` regardless of which device
  it's allocated on

I'll give an example:

> Setting both [MAP_WRITE and MAP_READ] could be the equivalent of
> `mappedAtCreation`

WebGPU only allows setting one or the other. Will this dual-mapping also be
allowed for all `MLBuffer`s, including those backed by a non-UMA GPU/NPU? In
that case, surely cross-device copies are needed for each mapping and
unmapping?

Meanwhile, there's a clear use case for allowing dual-mapping if the
`MLBuffer` is backed by the CPU or on a UMA system where buffer copies could
be avoided altogether.

> Just to make sure we're on the same page here, could you provide some sample
> code of what this would look like for both chained inference and WebGPU
> interop?

With the above "high level" hypothesis in mind, could you provide an example
of WebGPU interop, too? :)

> If the MLBuffer's buffer is allocated in memory the GPU that we're renting
> to can access, then a copy is not necessary. If not, then it must be. Do we
> hide the copy as an implementation detail? Force the developer to specify a
> copy, even if it's not necessary? Something else?

```js
// (abbreviated)
ml_bindings.setCopyOutputs({"output1"}); // copy back, explicit
await ml_buffer_output.mapAsync(MLMapMode.READ);
const read_data = ml_buffer_output.getMappedRange();
```

What exactly does `setCopyOutputs` do? Where is the data copied to/from? For
each memory access type?

> If you have a tentative set of MLBufferUsageFlags in mind, please do
> share! :)

> I think only two usages are required here: MAP_WRITE and MAP_READ. Setting
> both could be the equivalent of `mappedAtCreation`.

We could assume that:

- MAP_WRITE -> input buffers
- MAP_READ -> output buffers

but that doesn't hold true in the chained execution use case, where outputs
become inputs. In your example above, `ml_buffer_output` is used as both an
input and an output buffer, which means that e.g. the implementation of
`createBuffer()` must know to allocate `XNN_EXTRA_BYTES` - and also to chop
off those extra bytes when passing it as an input on the second inference.

Lacking explicit input/output/intermediate flags, this implies that
`MLBuffer`s to be used with the XNNPACK backend must always allocate
`XNN_EXTRA_BYTES`... which doesn't seem unreasonable. But if there are more
quirks like this on other platforms, then we may want to consider adding
explicit input/output/intermediate flags?

**Reply:**

> WebGPU only allows setting one or the other. Will this dual-mapping also be
> allowed for all MLBuffers, including those backed by a non-UMA GPU/NPU? In
> that case, surely cross-device copies are needed for each mapping and
> unmapping?

I believe you can create a `GPUBuffer` with combined usages
(MAP_READ | MAP_WRITE). It's only when you call `mapAsync` that it must be
READ or WRITE mode, not both. Because this mapping operation works like
WebGPU, it occurs on the same device used to create the buffer, so by
definition, no cross-device copies are allowed. We don't need to copy data
until dispatch or import anyway.

> If the MLBuffer's buffer is allocated in memory the GPU that we're renting
> to can access, then a copy is not necessary. If not, then it must be. Do we
> hide the copy as an implementation detail? Force the developer to specify a
> copy, even if it's not necessary? Something else?

Good point to clarify. I would expect `importExternalBuffer` to perform the
copy as an implementation detail. This operation is about transferring an
`MLBuffer`, for which a copy is allowed. After importing, I believe we can
simply `GPUBuffer.destroy()` it, which restores access to the shared
`MLBuffer` with WebNN.

> What exactly does setCopyOutputs do? Where is the data copied to/from? For
> each memory access type?

It tells WebNN to copy data between the context/device and CPU after dispatch
executes, if required. So it depends on the context device-type used to create
the buffer and whether it needs to be bound to the CPU for I/O.

> must always allocate XNN_EXTRA_BYTES... which doesn't seem unreasonable

+1. We have a similar situation with GPU-backed `MLBuffer`s - they also
require alignment, which always allocates extra bytes anyway.

Admittedly, there are downsides to this approach:

- Can't call `mapAsync` anywhere you want.
- Usages + bindings require validation.

But if the goal is to have WebNN mapping behave like WebGPU, it's totally
doable IMHO.

**Comment from @bbernhar (Feb 14, 2024):** For completion purposes, here's how
the other API approach could work. Let's refer to the previous idea as
"storage bindings". Under this design, `mapAsync` "hides" the copy from the
web developer. Note: interop does not change, so it's left out here
intentionally.

```js
ml_context = ML.createContext({"gpu"});

// MLBuffer data is created on the device used by the context.
ml_buffer_input = ml_context.createBuffer({size: 4});
ml_buffer_output = ml_context.createBuffer({size: 4});

// Upon calling unmap(), after a mapAsync(WRITE):
// * if CPU/iGPU, a copy to GPU IS NOT required.
// * Else dGPU/NPU, a copy to GPU IS required.
await ml_buffer_input.mapAsync(MLMapMode.WRITE);
const write_data = ml_buffer_input.getMappedRange();
// Write something into write_data
ml_buffer_input.unmap();

// Or use context.writeBuffer, regardless of device-type.

// Dispatch executes the copy required by calling unmap().
ml_context.dispatch(graph, inputs, outputs);

// Upon calling mapAsync(READ):
// * if CPU/iGPU/NPU, a copy to GPU IS NOT required.
// * Else dGPU, a copy to GPU IS required.
await ml_buffer_output.mapAsync(MLMapMode.READ);
const read_data = ml_buffer_output.getMappedRange();
// Read something from read_data
ml_buffer_output.unmap();

// So long as you don't use mapAsync, re-using outputs as inputs doesn't copy
// (i.e. chained inference).
ml_context.dispatch(graph, inputs, outputs);
```

Note: since this approach hides device data placement/movement, the WebNN
developer shouldn't freely use `mapAsync(READ)` for output, like they could in
WebGPU, as it could (inadvertently) convert into the more expensive
`readBuffer` operation (internally).

It's also not obvious to me what happens to the (implicit) staging buffer
that's created upon `mapAsync`. Similarly, deciding between `writeBuffer` vs
`mapAsync(WRITE)` for input is not obvious since there is no mappable/staging
buffer usage.

The approach is simpler (for implementations, namely) but comes at the cost of
requiring the WebNN developer to reference the spec to actually understand
what's going on, rather than writing code as they could elsewhere.

**Comment from a collaborator:** General feedback on this line of thought:

> [@huningxin] Should we consider creating an MLBuffer based on its usage?

Yes, I think it is reasonable to have enums on `MLBuffer` creation that help
the browser optimize memory usage.

> [@bbernhar] I was thinking we could have a descriptor/binding set [for
> dispatch], more like what we have in WebGPU.

I think we should avoid having stateful binding APIs, if possible. They tend
to act like global variables and are difficult to integrate with when you have
multiple frameworks active in an application. Even in WebGPU, binding objects
are created atomically from a dictionary and are immutable, not stateful.

Similarly, introducing an `MLDispatchMode` also makes it challenging for web
developers when dispatch code is far away from the code which calls `mapAsync`
and `writeBuffer`. If you submit work with `MLDispatchMode.Default` and
subsequently call `mapAsync`, the promise will reject because the buffer is in
the wrong state. For debugging, it's nice to sprinkle `mapAsync` calls in the
middle of a bunch of chained dispatches (without changing the dispatches
themselves) to see where things have gone south.

I would prefer that we keep things simple and have the inputs and outputs to
dispatch be included with the actual dispatch API, as originally proposed,
instead of introducing a binding API, especially a stateful one.

> [@a-sully] At a high level, I don't think we can:
>
> - avoid having hidden data-synchronization across devices, and also
> - have consistent usage patterns of an MLBuffer regardless of which device
>   it's allocated on

For what it's worth, the WebGPU CG has still not specced UMA buffer transfers.
See the outstanding issue "Cannot upload/download to UMA storage buffer
without an unnecessary copy and unnecessary memory", which has been active
since late 2021.

In general, I tend to agree with Austin. I think we should gain implementation
experience with doing copies on discrete hardware in `unmap` (or `dispatch`)
and see where it takes us. Both this and `writeBuffer` will require temporary
buffer memory that needs to be managed by the browser.

We haven't yet touched on how WebNN interacts with WebGPU for hybrid devices.
If the web developer creates a "high performance" WebNN context while the
WebGPU device is on the "low power" adapter, the browser will need to copy
between the two when transferring buffers between the APIs. Power preference
is just a request, so the web developer is not guaranteed to get what they
requested, especially over the entire course of their web session.

**Reply:** The simplicity of overloading `mapAsync` introduces inconsistent
runtime behavior.

I believe that's the primary reason WebGPU and others don't overload it. For
example, the WebGPU developer doesn't need to worry about OOMing when calling
`mapAsync` or `unmap`, since mappable buffers don't use a staging buffer and
are just mapped. Debugging OOM between WebNN and WebGPU becomes inconsistent:
the WebNN developer can't assert when `mapAsync` behaves like `readBuffer` or
whether the `MLBuffer`'s memory usage doubled. I could see how more
established APIs like ONNX Runtime preferred storage bindings; they, like
WebGPU, also want runtime and API consistency.

- Should this method be on the `MLGraph` (related to
[#303](https://github.com/webmachinelearning/webnn/issues/303))? Is there a
use case not satisfied by the following?
```js
graph.dispatch(
    /*inputs=*/{buffer: inputMlBuffer},
    /*outputs=*/{buffer: intermediateMlBuffer},
);
```
- Is it valid to pass the same `MLBuffer` as both an input and output of the
same `dispatch()` call? e.g.
```js
graph.dispatch(
    /*inputs=*/{buffer: someMlBuffer},
    /*outputs=*/{buffer: someMlBuffer},
);
```

### Read back data from an `MLBuffer`

```js
const resultBuffer = await outputMlBuffer.mapAsync();
```

#### How it works:

- After the completion of all currently-enqueued operations that use
`outputMlBuffer`, WebNN will copy the contents of `outputMlBuffer` to
`resultBuffer`. This is very similar to
[`GPUBuffer.mapAsync()`](https://www.w3.org/TR/webgpu/#dom-gpubuffer-mapasync),
with a key difference being that WebGPU only allows `mapAsync()` on buffers
which have the `MAP_READ` usage flag. In this case, if using a GPU, we may
need to create an intermediate "readback" buffer to facilitate the transfer.
  This may require two copies:\
  *high-GPU-bandwidth buffer → "readback" buffer → `ArrayBuffer`*
  **Comment from a contributor:** For a UMA GPU, the implementation may
  eliminate the "readback" buffer and need just one copy.

#### Questions:

- What should this method be called? I've proposed `mapAsync()` here to mirror
WebGPU since the behavior is very similar.
- Should there be the equivalent of
[`MAP_READ`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-map_read) +
[`COPY_DST`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-copy_dst) for
`MLBuffer`s?
- If we know the usage of `outputMlBuffer` (e.g. that it's
write-once by WebNN) then we could eliminate the data copy from the
high-GPU-bandwidth buffer to the "readback" buffer in the non-UMA case
```js
// The buffer may be allocated on a "readback" buffer
const mlBuffer = mlContext.createBuffer({size, usage: MAP_READ | OUTPUT_ONLY});

// `mlBuffer` may be used as an output to MLGraph execution
// ...

// Read back with fewer data copies!
const resultBuffer = await mlBuffer.mapAsync();
```
Again, this will not help DirectML and may or may not help other systems.

## Use Case: WebGPU Interop

Here’s a code example in which WebNN performs selfie segmentation on a video
frame without needing round-trips to JavaScript to synchronize WebNN and WebGPU
compute:

```js
const applyEffectToFrame = () => {
  const gpuVideoTexture = gpuDevice.importExternalTexture({source: video});

  // Create a new MLBuffer to be used to facilitate WebGPU interop.
  //
  // Note that a more optimized implementation might allocate this buffer - or
  // a ring of buffers - ahead of time such that memory can be reused.
  const tensorizedMlBuffer = mlContext.createBuffer({size: tensorizedBufferSize});

  // Rent out the MLBuffer to WebGPU.
  const tensorizedGpuBuffer = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);

  // Create a bind group for `gpuVideoTexture`, create a command encoder, etc.
  // to "tensorize" `gpuVideoTexture` and store the result in
  // `tensorizedGpuBuffer`
  // ...

  gpuDevice.queue.submit([tensorizationCommandEncoder.finish()]);

  // Return the buffer to WebNN.
  tensorizedMlBuffer.unmapFromGpuBuffer();

  // Perform some inference described by `graph` on the frame
  // (e.g. selfie segmentation)
  mlContext.dispatch(
      graph,
      /*inputs=*/{buffer: tensorizedMlBuffer},
      /*outputs=*/{buffer: tensorizedMlBuffer},
  );

  // Rent the MLBuffer back out to WebGPU.
  const tensorizedGpuBufferAfterInference = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);

  // Create a bind group for `tensorizedGpuBufferAfterInference`,
  // create a command encoder, etc. to feed `tensorizedGpuBufferAfterInference`
  // into a GPU shader which may blur the frame or replace background sections
  // and then render the result
  // ...

  gpuDevice.queue.submit([texturizeAndRenderCommandEncoder.finish()]);

  // Call this method for each frame.
  video.requestVideoFrameCallback(applyEffectToFrame);
};
```

Let's again dive into what happens at each of these steps, skipping those
already covered above:

### Rent out an `MLBuffer` to WebGPU

```js
const tensorizedGpuBuffer = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);
```

#### How it works:

- Two fences are created:
1. a "start access" fence which is to be signaled by WebNN and waited on by
WebGPU
2. an "end access" fence which is to be signaled by WebGPU and waited on by
WebNN
- `gpuDevice` enqueues a command to its `GPUQueue` to wait for the "start
access" fence to be signaled
- WebNN (on some queue or timeline yet to be specified) will signal the "start
access" fence after the completion of all currently-enqueued operations that
use `tensorizedMlBuffer`. This is very similar to how `mapAsync()` works
- In this case, there is only one currently-enqueued operation:
`MLContext.createBuffer()`
- In the latter `mapAsGpuBuffer()` call, the "start access" fence will not be
signaled by WebNN until the `dispatch()` call is complete. This implicitly
blocks execution of the commands in `texturizeAndRenderCommandEncoder` that
are enqueued to WebGPU until WebNN is finished with `tensorizedMlBuffer`
- WebNN will wait for the "end access" fence to be signaled. In the meantime,
all work involving `tensorizedMlBuffer` is blocked
- `gpuDevice` has exclusive, read/write access to this memory for as long as the
"end access" fence is not signaled
- If `tensorizedMlBuffer` was allocated in memory shared by `gpuDevice`, this
will be a zero-copy mapping. Otherwise a new buffer will be allocated on
`gpuDevice` and the contents of `tensorizedMlBuffer` will be copied into this
buffer
- The memory backing `tensorizedMlBuffer` becomes inaccessible to WebNN (or
script, or anything else), regardless of whether a copy is made.
- Ideally these states and their transitions can be expressed similarly to a
`GPUBuffer`'s [internal
state](https://www.w3.org/TR/webgpu/#buffer-internals-state)
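
Condensing the fence behavior above into the rental sequence from the sample
(no new API; the comments describe the implied synchronization):

```js
// WebGPU commands touching the rented buffer wait on the "start access"
// fence, which WebNN signals once its prior work on the buffer completes.
const tensorizedGpuBuffer = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);
gpuDevice.queue.submit([tensorizationCommandEncoder.finish()]);

// WebNN work enqueued afterwards waits on the "end access" fence, which
// WebGPU signals once the submitted commands are done with the buffer.
tensorizedMlBuffer.unmapFromGpuBuffer();
mlContext.dispatch(
    graph,
    /*inputs=*/{buffer: tensorizedMlBuffer},
    /*outputs=*/{buffer: tensorizedMlBuffer},
);
```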

#### Questions:

- What are the usage flags of `tensorizedGpuBuffer`?
- While `tensorizedMlBuffer` is rented out to WebGPU as `tensorizedGpuBuffer`:
- What happens if `destroy()` is called on `tensorizedMlBuffer`?
- What happens if `destroy()` is called on `tensorizedGpuBuffer`?

### Return a rented-out `MLBuffer` back to WebNN

```js
tensorizedMlBuffer.unmapFromGpuBuffer();
```

#### How it works:

- If `tensorizedMlBuffer` was allocated in memory shared by `gpuDevice`, this
will be a zero-copy unmapping. Otherwise the contents of `tensorizedGpuBuffer`
will be copied into `tensorizedMlBuffer`
- Informs `gpuDevice` to signal the "end access" fence created in the
`mapAsGpuBuffer()` method after the completion of currently-enqueued
operations that use `tensorizedGpuBuffer`. This is very similar to how
`mapAsync()` works
- The WebNN timeline receives the signal and may resume execution
- WebNN has exclusive, read/write access to this memory until further notice
- `tensorizedGpuBuffer` is
  [expired](https://gpuweb.github.io/gpuweb/#dom-gpuexternaltexture-expired-slot)
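
A sketch of the assumed expiry behavior, paralleling `GPUExternalTexture`
expiry (the exact validation behavior is an open question, and
`staleCommandEncoder` is a hypothetical encoder referencing the expired
buffer):

```js
tensorizedMlBuffer.unmapFromGpuBuffer();
// `tensorizedGpuBuffer` is now expired: a later submit referencing it should
// fail validation rather than touch memory that WebNN again owns (assumed).
gpuDevice.queue.submit([staleCommandEncoder.finish()]); // validation error
```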

#### Questions:

- What happens to `tensorizedMlBuffer` if this method is never called?