-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MLBuffer exploration doc #541
Conversation
I updated https://github.com/webmachinelearning/meetings/blob/main/telcons/2024-02-08-wg-agenda.md to include this exploration doc as an additional reference. Also created webgpu interop label for us. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This explainer is very useful. Thanks for putting this together, @a-sully !
I left some comments for chained inference use case. Please take a look.
|
||
#### Questions: | ||
|
||
- Can an `MLBuffer`'s size always be known at the time of buffer allocation? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the size should be known before the allocation. WebNN only supports static shape tensor (i.e. the shape can be quired by MLOperand.shape()
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it seems reasonable that the size should always be known. I mostly just wanted to explicitly call it out as a constraint of this proposal :)
that the size of an `MLBuffer` must always be known at the time of buffer | ||
allocation | ||
- When will `inputMlBuffer` be deallocated if `destroy()` is not called? | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider create MLBuffer based on its usage?
For the chained inference use case, a buffer usage could be one of the following three:
- Input:
ArrayBuffer
→MLBuffer
→MLGraph
- Output:
MLGraph
→MLBuffer
→ArrayBuffer
- Default:
MLGraph
→MLBuffer
→MLGraph
Different backends/devices may arrange/optimize the buffer allocation depending on its usage. For example, the current Chromium WebNN XNNPACK and DirectML backends arrange the input and output buffers as following:
- XNNPACK backend allocates input buffer with
XNN_EXTRA_BYTES
because XNNPACK may read beyond array bounds for performance reason. The output buffer doesn't need to allocate extra bytes, because XNNPACK never write beyond array bounds. - For UMA (unified memory architecture) GPU, DirectML backend allocates input CPU buffer / output CPU buffer and bind them as GPU graph execution input / output. DirectML backend also sets appropriate CPU page property optimized for CPU writing or reading according to the cache architecture.
- For NUMA (Non-UMA) GPU, DirectML backend always allocates default GPU buffers as GPU graph execution input and output. For input data uploading, it allocates dedicated upload CPU buffer and does two-steps uploading from CPU to GPU. Output data reading back is through a dedicated readback CPU buffer and two-steps downloading from GPU to CPU.
For default buffer:
- XNNPACK backend may still need to allocate extra bytes because XNNPACK would read data from it.
- DirectML backend may just allocate buffer in DEFAULT_HEAP because it doesn't need CPU access.
The following table captures the buffer allocation difference according to usage:
Buffer Usage | XNNPACK / CPU | DirectML / UMA GPU | DirectML / NUMA GPU |
---|---|---|---|
Input | buffer with XNN_EXTRA_BYTES |
CUSTOM buffer ( CPUPageProperty = CacheCoherentUMA ? WRITE_BACK : WRITE_COMBINE , MEMORY_POOL_L0 ) |
DEFAULT buffer + UPLOAD buffer |
Output | buffer without XNN_EXTRA_BYTES |
CUSTOM buffer ( CPUPageProperty = WRITE_BACK , MEMORY_POOL_L0 ) |
DEFAULT buffer + READBACK buffer |
default | buffer with XNN_EXTRA_BYTES |
DEFAULT buffer (MEMORY_POOL_L0 ) |
DEFAULT buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main question is whether we can keep these details implementation specific, and use MLBuffer to identify the buffers we pass between stages, eventually with dynamic labels such as input/output/intermediate? (in this case we will have to specify behavior in the spec in more details).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider create MLBuffer based on its usage?
"usage" can mean a lot of different things (readable/writable? input/output/intermediate? usable by WebGPU? mappable in some way?) but in broad strokes, yes I agree :P
We should strive to ensure the user agent has enough information when an MLBuffer
is created to allocate its buffer in the most optimal place, according to how it will be used. It's worth enumerating those uses in detail. The table is a very useful start - thanks!
I think the main question is whether we can keep these details implementation specific
IMHO it's critical that these details are opaque to the website. If creating an "input" MLBuffer
for the XNNPack backend, the website should not be able to detect that XNN_EXTRA_BYTES
have been added. For example:
const mlBuffer = mlContext.createBuffer({ usage: INPUT, size: 100 });
console.assert(mlBuffer.size, 100);
Similarly:
const gpuBuffer = mlBuffer.mapAsGpuBuffer(gpuDevice);
console.assert(gpuBuffer.size, mlBuffer.size);
// ...
const arrayBuffer = mlBuffer.mapAsync();
console.assert(arrayBuffer.byteLength, mlBuffer.size);
eventually with dynamic labels such as input/output/intermediate
Could you elaborate more on this? :)
#### Questions: | ||
|
||
- This approach is flexible enough to allow for graph execution on all backends. | ||
Do we need a separate `compute()` method? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They might be different if the graph has multiple outputs. For example, the current Chromium WebNN DirectML backend implementation allocates one default buffer and one readback buffer big enough for all outputs. When compute()
is done by GPU, it copies all results from the default buffer to readback buffer, and then to array buffers in one shot and resolves the promise.
For dispatch()
, a naive implementation may need to do that (enqueue a copy from default buffer to readback buffer, wait for GPU copy done, copy data from readback buffer to array buffer, resolve promise) for each buffer upon user calls MLBuffer.mapAsync()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to determine if it's a requirement [for WebNN] to have the developer transfer data to/from MLBuffer
using the exact same APIs, regardless of backend. If so, I don't think we can rely on buffer mapping alone to cover upload/download for any device type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to determine if it's a requirement [for WebNN] to have the developer transfer data to/from
MLBuffer
using the exact same APIs, regardless of backend. If so, I don't think we can rely on buffer mapping alone to cover upload/download for any device type.
This discussion relates to Question 4 from #544:
- Can dispatch be exclusive to MLBuffer bindings? @bbernhar
At the core of this discussion seems to be the question of whether MLBuffer
should be a device-agnostic buffer, or whether it's only expected to be allocated on the GPU or NPU for chained execution, or even just on the GPU in the case where we want WebGPU interop (effectively making it a GPUBuffer
in disguise).
My earlier interpretation was that MLBuffer
was proposing to be the former. In the case of WebGPU interop, for example, I would naively expect that as long as the appropriate usage flags are set, the user agent should be able to share memory if the memory can be shared (e.g. UMA GPUs); otherwise a copy would be made on both mapAsGpuBuffer()
and unmapFromGpuBuffer()
Reading through some of these issues again and recalling the discussion in the last WG meeting, it seems like @bbernhar you might be proposing the latter? Could you please clarify? :)
To clarify some things on my end... I know that WebGPU has a very specific definition of "buffer mapping". I've been using the term much more loosely here (apologies if that's caused any confusion); in broad strokes to mean "transferring ownership", which may or may not require copies under the hood. For example:
- For a CPU-based
MLBuffer
, "mapping" theMLBuffer
to anArrayBuffer
(and vice versa) would behave very similarly to howcompute()
works today (see https://webmachinelearning.github.io/webnn/#mlnamedarraybufferviews-transfer). The user agent can make this "mapping" very cheap - For a dNPU-based
MLBuffer
, "mapping" theMLBuffer
to a dGPU would require- on
mapAsGpuBuffer()
, copy the data into a new buffer on the GPU and set flags to e.g. mark theMLBuffer
as inaccessible - on
unmapFromGpuBuffer()
, copy back the contents of the GPU buffer into the original buffer on the NPU and set flags to e.g. mark theMLBuffer
as accessible once again
- on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @a-sully, I agree we'll need more than what I proposed to cover all the use-cases here.
Reading through some of these issues again and recalling the discussion in the last WG meeting, it seems like @bbernhar you might be proposing the latter? Could you please clarify? :)
Since we require CPU backed MLBuffer
(s) to be equally efficient (avoid copies), I don't think my original proposal to have MLBuffer
be a (simple) control (vs storage) object will work =(.
It makes sense to me to have WebNN rely on buffer mapping (with buffer usages) to determine what gets copied [to/from the context/device] before or after dispatch executes. It seems possible we could keep the buffer mapping part like WebGPU, though.
// GPU staging buffers
ml_context = ML.createContext({"gpu"});
input_buf = ml_context.createBuffer({size: 4, usage: MLBufferUsage.MAP_WRITE});
output_buf = ml_context.createBuffer({size: 4, usage: MLBufferUsage.MAP_READ});
// output_buf = ml_context.createBuffer({size: 4}); // pre-allocate on GPU only
// Populate MLBuffer input with source data
await input_buf.mapAsync(MLMapMode.WRITE);
const write_data = input_buf.getMappedRange();
/* use write_data */
input_buf.unmap();
// If dGPU,
// * input requires GPU copy due to MLBufferUsage.MAP_WRITE
// * output requires GPU copy due to MLBufferUsage.MAP_READ.
// If iGPU,
// * neither input or output requires GPU copy.
ml_context.dispatch({inputs: input_buf, outputs: output_buf});
// MLBuffer output was populated by dispatch()
await output_buf.mapAsync(MLMapMode.READ);
const read_data = output_buf.getMappedRange();
/* use read_data */
output_buf.unmap();
The "mappable buffer gets filled for you" magic could be the only WebNN-specific behavior (for us WebGPU folk). Notice, it's a bit weird to see a "map" usage also perform a "copy". WDYT about creating a new descriptor type to clarify this transfer or copy behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we require CPU backed
MLBuffer
(s) to be equally efficient (avoid copies), I don't think my original proposal to haveMLBuffer
be a (simple) control (vs storage) object will work =(.
+1 I think this answers question 5 from #542:
- Does MLBuffer require explicit buffer usages (ex. input, output, or both)? @bbernhar
Now the question becomes what set of usage flags are needed?
It makes sense to me to have WebNN rely on buffer mapping (with buffer usages) to determine what gets copied [to/from the context/device] before or after dispatch executes. It seems possible we could keep the buffer mapping part like WebGPU, though.
+1. And this (familiar to developers!) WebGPU-like interface is flexible enough that we should be able to provide this same interface even if the buffer is not allocated on the GPU
The "mappable buffer gets filled for you" magic could be the only WebNN-specific behavior (for us WebGPU folk). Notice, it's a bit weird to see a "map" usage also perform a "copy". WDYT about creating a new descriptor type to clarify this transfer or copy behavior?
Seems reasonable. Hmmm... some ideas which come to mind:
- rent - suggests that WebNN maintains ownership even when the buffer is used by JS or WebGPU. And the rental may be returned
- lend - synonym to the above
- transfer - analogous to
ArrayBuffer.transfer()
, which also may require a copy but is zero-copy when possible - import - parallels WebGPU, though perhaps a bit awkward
- map - if we want to re-use familiar names like
mapAsync()
then maybe "map" isn't so bad, even if it hides a copy
Since the relationship is asymmetric (in that WebNN always owns the buffer), I think the naming should be, too. Some ideas for returning the buffer to WebNN include:
- return
- restore
- takeBack
- unmap (probably only if we go with "map" above)
Feel free to suggest other terms!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the sample code!
At a high level, I don't think we can:
- avoid having hidden data-synchronization across devices, and also
- have consistent usage patterns of an
MLBuffer
regardless of which device it's allocated on
I'll give an example:
Setting both [
MAP_WRITE
andMAP_READ
] could be the equivalent ofmappedatcreation
WebGPU only allows setting one or the other. Will this dual-mapping also be allowed for all MLBuffers, including those backed by a non-UMA GPU/NPU? In that case, surely cross-device copies are needed for each mapping and unmapping?
Meanwhile, there's a clear use case for allowing dual-mapping if the MLBuffer is backed by the CPU or on an UMA system where buffer copies could be avoided altogether.
Just to make sure we're on the same page here, could you provide some sample code of what this would look like for both chained inference and WebGPU interop?
With the above "high level" hypothesis in mind, could you provide an example of WebGPU interop, too? :)
If the MLBuffer
's buffer is allocated in memory the GPU that we're renting to can access, then a copy is not necessary. If not, then it must be. Do we hide the copy as an implementation detail? Force the developer to specify a copy, even if it's not necessary? Something else?
// (abbreviated) ml_bindings.setCopyOutputs({"output1"}); // copy back, explicit await ml_buffer_output.mapAsync(MLMapMode.READ); const read_data = ml_buffer_output.getMappedRange();
What exactly does setCopyOutputs
do? Where is the data copied to/from? For each memory access type?
If you have a tentative set of MLBufferUsageFlags in mind, please do share! :)
I think only two usages are required here:
MAP_WRITE
andMAP_READ
. Setting both could be the equivalent ofmappedatcreation
.
We could assume that:
MAP_WRITE
-> input buffersMAP_READ
-> output buffers
but that doesn't hold true in the chained execution use case, where outputs become inputs. In your example above, ml_buffer_output
is used as both an input and output buffer, which means that e.g. the implementation of createBuffer()
must know to allocate XNN_EXTRA_BYTES
- and also to chop off those extra bytes when passing it as an input on the second inference.
Lacking explicit input/output/intermediate flags, this implies that MLBuffer
s to be used with the XNNPACK backend must always allocate XNN_EXTRA_BYTES
... which doesn't seem unreasonable. But if there are more quirks like this on other platforms, though, then we may want to consider adding explicit input/output/intermediate flags?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WebGPU only allows setting one or the other. Will this dual-mapping also be allowed for all MLBuffers, including those backed by a non-UMA GPU/NPU? In that case, surely cross-device copies are needed for each mapping and unmapping?
I believe you can create a GPUBuffer
with combined usages (MAP_READ | MAP_WRITE). It's only when you call mapAsync
, it must be READ or WRITE mode, not both. Because this mapping operation works like WebGPU, it occurs on the same device used to create the buffer so by definition, no cross-device copies are allowed. We don't need to copy data until dispatch or import anyway.
If the MLBuffer's buffer is allocated in memory the GPU that we're renting to can access, then a copy is not necessary. If not, then it must be. Do we hide the copy as an implementation detail? Force the developer to specify a copy, even if it's not necessary? Something else?
Good point to clarify. I would expect importExternalBuffer
to perform the copy as an implementation detail. This operation is about transferring a MLBuffer, which a copy is allowed. After importing, I believe we can simply GPUBuffer.destroy()
it, which restores access to the shared MLBuffer with WebNN.
What exactly does setCopyOutputs do? Where is the data copied to/from? For each memory access type?
It tells WebNN to copy data between the context/device and CPU after dispatch executes, if required. So, it depends on the context device-type used to create the buffer and if it needs to bound to the CPU for I/O.
must always allocate XNN_EXTRA_BYTES... which doesn't seem unreasonable
+1. We have a similar situation on GPU backed MLBuffers
- they also require alignment which always allocates extra bytes anyway.
Admittedly, there are downsides to this approach:
- Can't call
mapAsync
anywhere you want. - Usages + bindings require validation.
But if the goal is have WebNN mapping behave like WebGPU, it's totally doable IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completion purposes, here's how the other API approach could work. Let's refer to the previous idea as "storage bindings". Under this design, mapAsync
"hides" the copy from the web developer. Note: interop does not change so it's left out here intentionally.
ml_context = ML.createContext({"gpu"});
// MLBuffer data is created on the device used by the context.
ml_buffer_input = ml_context.createBuffer({size: 4});
ml_buffer_output = ml_context.createBuffer({size: 4});
// Upon calling Unmap(), after a mapAsync(WRITE):
// * if CPU/iGPU, a copy to GPU IS NOT required.
// * Else dGPU/NPU, a copy to GPU IS required.
await ml_buffer_input.mapAsync(MLMapMode.WRITE);
const write_data = ml_buffer_input.getMappedRange();
// Write something into write_data
ml_buffer_input.unmap()
// Or use context.writeBuffer, regardless of device-type.
// Dispatch executes the copy required by calling Unmap().
ml_context.dispatch(graph, inputs, outputs);
// Upon calling mapAsync(READ):
// * if CPU/iGPU/NPU, copy to GPU IS NOT required.
// * Else dGPU, a copy to GPU IS required.
await ml_buffer_output.mapAsync(MLMapMode.READ);
const read_data = ml_buffer_output.getMappedRange();
// Read something into Read_data
ml_buffer_output.unmap();
// So long as you don't use mapAsync, re-using outputs as inputs doesn't copy (ie. chained-inference).
ml_context.dispatch(graph, inputs, outputs);
Note, since this approach hides device data placement/movement, the WebNN developer shouldn't freely use mapAsync(READ)
for output, like they could in WebGPU, as it could (inadvertently) convert into the more expensive readBuffer
operation (internally).
It's also not obvious to me what happens to the (implicit) staging buffer that's created upon mapAsync
. Similarly, deciding use of writeBuffer
vs mapAsync(WRITE)
for input is not obvious since there is no mappable/staging buffer usage.
The approach is simpler (for implementations, namely) but comes at the cost of requiring the WebNN developer to reference the spec to actually understand what's going on and not writing code as they could elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General feedback on this line of thought:
[@huningxin] Should we consider create MLBuffer based on its usage?
Yes, I think it is reasonable to have enums on MLBuffer creation that help the browser optimize memory usage.
[@bbernhar] I was thinking we could have a descriptor/binding set [for dispatch], more like what we have in WebGPU.
I think we should avoid having stateful binding APIs, if possible. They tend to act like global variables and are difficult to integrate with when you have multiple frameworks active in an application. Even in WebGPU, binding objects are created atomically from a dictionary and are immutable, not stateful.
Similarly, introducing a MLDispatchMode
also makes it challenging for web developers when dispatch code is far away from code which calls mapAsync
and writeBuffer
. If you submit work with MLDispatchMode.Default
and subsequently call mapAsync
, the promise will reject because the buffer is in the wrong state. For debugging, it's nice to sprinkle mapAsync
calls in the middle of a bunch of chained dispatches (without changing the dispatches themsleves) to see where things have gone south.
I would prefer that we keep things simple and have the inputs and outputs to dispatch be included with the actual dispatch API like what was originally proposed instead of introducing a binding API, especially a stateful binding API.
[@a-sully]
At a high level, I don't think we can:
- avoid having hidden data-synchronization across devices, and also
- have consistent usage patterns of an MLBuffer regardless of which device it's allocated on
For what it's worth, the WebGPU CG has still not speced UMA buffer transfers. See the outstanding issue Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory, which has been active since late 2021.
In general, I tend to agree with Austin. I think we should gain implementation experience with doing copies on discrete hardware in unmap
(or dispatch) and see where it takes us. Both this and writeBuffer
will require temporary buffer memory that needs to be managed by the browser.
We haven't yet touched on how WebNN interacts with WebGPU for hybrid devices. If the web developer creates a "high performance" WebNN context while the WebGPU device is on the "low power" adapter, the browser will need to copy between the two when transferring buffers between the APIs. Power preference is just an request so the web developer is not guaranteed to get what they request, especially over the entire course of their web session.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simplicity of overloading mapAsync
introduces inconsistent runtime behavior.
I believe that's the primary reason WebGPU and others don't overload it. For example, the WebGPU developer doesn't need to worry about OOMing when calling mapAsync
or Umap
since mappable buffers don't use a staging buffer and are just mapped. Debugging OOM between WebNN and WebGPU becomes inconsistent, the WebNN developer can't assert when mapAsync
behaves like readBuffer
or if MLBuffer's memory usage doubled. I could see how more established APIs like ONNX runtime preferred storage bindings, they like WebGPU, also want runtime and API consistency.
need to create an intermediate "readback" buffer to facilitate the transfer. | ||
This may require two copies:\ | ||
*  high-GPU-bandwidth buffer → "readback" buffer → | ||
`ArrayBuffer`*\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For UMA GPU, the implementation may eliminate the "readback" buffer and just need one copy.
Taking a step back, there are a few things I'm noticing. On the one hand, this exploration doc has been very helpful in clarifying ambiguities in this proposal, generating new ideas, and distilling the key unresolved questions. On the other hand, I'm worried that we (myself included!) might be losing sight of the forest for the trees :) I'd like to refocus our approach to designing
My tentative suggestions are:
|
@a-sully I don't have a concern with the proposed plan to use |
@a-sully , your proposed plan sounds good to me. Also +1 to
|
Closing in favor of #754 |
The included doc explores the design of #482. That issue is getting quite long and I believe there's a lot more to say, so for the sake of not overwhelming that issue I've created this doc.
Here's how I'd love for this to go:
MLBuffer
proposal to clarify the expected behaviorI'm eager to hear your thoughts. Thanks!