diff --git a/d3d/WorkGraphs.md b/d3d/WorkGraphs.md index ab5194f..43a6147 100644 --- a/d3d/WorkGraphs.md +++ b/d3d/WorkGraphs.md @@ -1,5 +1,5 @@

D3D12 Work Graphs

-v1.005 4/30/2024 +v1.006 5/2/2024 --- @@ -717,7 +717,13 @@ A Node ID is in the form: `{string name, array index}`. The array index part me In terms of HLSL [shader function attributes](#shader-function-attributes), a node ID can be explicitly defined via `[NodeID("name",arrayIndex)]`, or just `[NodeID("name")]`, which implies array index 0. -In the absence of these attributes for explicitly indicating node IDs, node ID defaults to whatever name is in context (with array index 0) for convenience. For instance in a node shader definition, in the absence of a `[NodeID()]` attribute, the node ID defaults to `{shader export name, 0}`, or in the case of a declaration of a node output to another node to be written to by the body of the shader, in the absence of a `[NodeID()]` attribute, the node ID defaults to `{localVariableName, 0}`. +In the absence of these attributes for explicitly indicating node IDs, node ID defaults to whatever name is in context (with array index 0) for convenience. + +For instance if a [node shader definition](#shader-function-attributes) doesn't specify a `[NodeID()]` attribute, the node ID defaults to `{shader name in HLSL, 0}` in the compiled shader. The NodeID and shader name are separate entities in the shader. From the runtime point of view, it doesn't know whether the NodeID was specified explicitly or was a default assignment. The significance is if the shader export is renamed when importing into a state object, the NodeID doesn't also get renamed. To rename NodeIDs at state object creation, use node overrides such as [D3D12_COMMON_COMPUTE_NODE_OVERRIDES](#d3d12_common_compute_node_overrides) at the API. + +If a [node output declaration](#node-output-declaration) doesn't specify a `[NodeID()]` attribute the node ID defaults to `{localVariableName for the output node, 0}`. 
+ + Given a variable declared in a shader that represents an array of output nodes, say `myOutputArray` having a Node ID `{name,arrayIndex}`, the `arrayIndex` part of the ID serves as the *base* for array indexing. So the shader body can index the node array via `myOutputArray[i]`, which resolves to Node ID `{name,arrayIndex+i}`. @@ -1210,7 +1216,7 @@ Of course the app could figure this out on its own by using an atomic on a scrat Unfortunately [FinishedCrossGroupSharing()](#finishedcrossgroupsharing) can't be called in graph entrypoint nodes, reason discussed in the spec for the [FinishedCrossGroupSharing()](#finishedcrossgroupsharing) intrinsic. -If a node might call [FinishedCrossGroupSharing()](#finishedcrossgroupsharing) it must specify the [[NodeTrackRWInputSharing](#nodetrackrwinputsharing)] attribute on the [record struct](#record-struct). Matching this, the upstream node that produces the input record must specify the [[NodeTrackRWInputSharing](#nodetrackrwinputsharing)] on the corresponding [record struct](#record-struct) as well. The easiest way to do this is to use the same structure declaration with the attribute for both the producer and consumer nodes. +If a node might call [FinishedCrossGroupSharing()](#finishedcrossgroupsharing) it must specify the `[`[NodeTrackRWInputSharing](#nodetrackrwinputsharing)`]` attribute on the [record struct](#record-struct). Matching this, the upstream node that produces the input record must specify the `[`[NodeTrackRWInputSharing](#nodetrackrwinputsharing)`]` on the corresponding [record struct](#record-struct) as well. The easiest way to do this is to use the same structure declaration with the attribute for both the producer and consumer nodes. > If it could be interesting to optionally constrain a node such that only one record that has arrived at the node can be launching a shader at a time, scratch memory wouldn't need to be duplicated in every input record and could instead be associated with the node itself. 
@@ -1474,7 +1480,7 @@ There is a limit on output record storage that is a function of the following va `MaxRecords_WithTrackRWInputSharing` = Maximum output record count, like `MaxRecords` above, except only counting outputs that specify the - [[NodeTrackRWInputSharing](#nodetrackrwinputsharing)] attribute on the + `[`[NodeTrackRWInputSharing](#nodetrackrwinputsharing)`]` attribute on the [record struct](#record-struct). Requirements: @@ -2890,7 +2896,7 @@ Work graph specific rules: - Adding nodes to an existing work graph in a state object is accomplished by passing a new work graph definition to `AddToStateObject`, whose name matches the existing one, with just the new nodes listed in it. If the program name is unique in the state object, that means an entirely new work graph is being added (perhaps reusing some existing/new shaders), and later additions can be done to that as well. - Ways additions can fit onto an existing graph - - Filling holes in output node array ranges of the graph, where [[AllowSparseNodes]](#node-output-attributes) or `[UnboundedSparseNodes]` is specified on the output array. + - Filling holes in output node array ranges of the graph, where `[`[AllowSparseNodes](#node-output-attributes)`]` or `[UnboundedSparseNodes]` is specified on the output array. - Adding new entrypoint nodes. The entrypoint index of existing nodes, as reported by [GetEntrypointIndex()](#getentrypointindex), remains unchanged. New entrypoints will have entrypoint index values that continue past existing entrypoints. - Any of the above additions can be a graph of nodes in itself, like adding a subgraph to the overall graph. - New nodes can output to existing nodes in the graph, as long as the graph addition doesn't cause the maximum depth from an entrypoint of any existing node to increase. 
@@ -3844,7 +3850,7 @@ At most one call to `FinishedCrossGroupSharing` can be reached during execution Unfortunately `FinishedCrossGroupSharing()` can't be called on nodes that are entrypoints in a graph. The reason is some implementations need to allocate 4 extra bytes in input records and initialize them to the dispatch grid size to be able to implement `FinishedCrossGroupSharing()`. And for graph entrypoints, it is the app that owns the memory for the input records. A workaround to get this functionality at graph entrypoints is for an app to manually put an extra 4 byte field in input records that are fed into the graph, initialized with the dispatch grid size. The launched thread groups can then atomically decrement the number when they are done. It does mean that implementations that could have done better will be a bit less efficient when this workaround is needed at graph entrypoints. -Both producer and consumer nodes must specify the [[NodeTrackRWInputSharing](#nodetrackrwinputsharing)] attribute on the [record struct](#record-struct) if the consumer might call `FinishedCrossGroupSharing()`. This tells the producer's driver shader compilation to allocate 4 bytes of extra space per record if needed by the implementation, discussed above. And this space counts against its [node output limits](#node-output-limits). +Both producer and consumer nodes must specify the `[`[NodeTrackRWInputSharing](#nodetrackrwinputsharing)`]` attribute on the [record struct](#record-struct) if the consumer might call `FinishedCrossGroupSharing()`. This tells the producer's driver shader compilation to allocate 4 bytes of extra space per record if needed by the implementation, discussed above. And this space counts against its [node output limits](#node-output-limits). DXIL definition [here](#lowering-finishedcrossgroupsharing). @@ -5893,7 +5899,7 @@ Member | Definition `NodeIOKind` | The class of output. See the output enums in [D3D12DDI_NODE_IO_KIND_0108](#d3d12ddi_node_io_kind_0108). 
And see [Node output declaration](#node-output-declaration). `NodeIOFlags` | See the flags within `D3D12DDI_NODE_IO_FLAGS_FLAG_MASK` in [D3D12DDI_NODE_IO_FLAGS_0108](#d3d12ddi_node_io_flags_0108). And see [Node input declaration](#node-input-declaration). `RecordSizeInBytes` | Size of the output record. Can be 0 if `NodeIOKind` is [D3D12DDI_NODE_IO_KIND_EMPTY_OUTPUT_0108](#d3d12ddi_node_io_kind_0108). -`bAllowSparseNodes` | Whether sparse nodes are allowed. This comes from the [[AllowSparseNodes]](#node-output-attributes) attribute on a node output, or can be overridden at the API, so the final status is indicated here. +`bAllowSparseNodes` | Whether sparse nodes are allowed. This comes from the `[`[AllowSparseNodes](#node-output-attributes)`]` attribute on a node output, or can be overridden at the API, so the final status is indicated here. `pRecordDispatchGrid` | If `nullptr`, the output record doesn't contain [SV_DispatchGrid](#sv_dispatchgrid). Else, points to a description of how [SV_DispatchGrid](#sv_dispatchgrid) appears in the output record. See [D3D12DDI_RECORD_DISPATCH_GRID_0108](#d3d12ddi_record_dispatch_grid_0108). `pMaxRecords` | Maximum number of output records that a thread group will output to this output node/array. If the output record budget for this output is shared with another output, `pMaxRecords` is `nullptr` and `pMaxRecordsSharedWithIndex` is specified instead. If the shader declared `[MaxRecordsSharedWith()]`, it is valid to override it with `pMaxRecords`, which makes the output budget no longer shared. 
`pMaxRecordsSharedWithIndex` | If this output shares its output record budget with another output, this points to the 0 based index of that output based on the order they are declared, and how they appear in the `pOutputs` arrays in [D3D12DDI_BROADCASTING_LAUNCH_NODE_PROPERTIES_0108](#d3d12ddi_broadcasting_launch_node_properties_0108), [D3D12DDI_COALESCING_LAUNCH_NODE_PROPERTIES_0108](#d3d12ddi_coalescing_launch_node_properties_0108) and [D3D12DDI_THREAD_LAUNCH_NODE_PROPERTIES_0108](#d3d12ddi_thread_launch_node_properties_0108). The output that is pointed to will have `pMaxRecords` specified. If the current output does not share its output record budget, `pMaxRecordsSharedWithIndex` is `nullptr`. If the shader declared `[MaxRecords()]`, it is valid to override it with `pMaxRecordsSharedWithIndex`, which makes the output budget now shared with another output. @@ -6513,7 +6519,7 @@ v0.35|8/15/2022|
  • Under [Lowering GetNodeOutputRecord](#lowering-get-nodeoutpu v0.36|9/9/2022|
  • Under [Node output limits](#node-output-limits), clarified that even though outputs that are [EmptyNodeOutput](#node-output-declaration) don't count against node output data size limits, they still need to have [MaxOutputRecords or MaxOutputRecordsSharedWith](#node-output-attributes) declared to help scheduler implementations reason about work expansion potential and also avoid overflowing tracking of live work.
  • DXIL: Under [Lowering IncrementOutputCount](#lowering-incrementoutputcount) corrected parameter from %dx.types.NodeRecordHandle to %dx.types.NodeHandle.
  • DXIL: add `indexNodeRecordHandle` (see [Creating handles to node outputs](#creating-handles-to-node-outputs) and [Lowering GetNodeOutputRecord](#lowering-get-nodeoutputrecords)) and remove index from `getNodeRecordPtr` ([Lowering input/output loads and stores](#lowering-inputoutput-loads-and-stores)), and updated create sequences to add annotate and potential indexing operations
  • Removed stale mentions of the MaxDispatchGrid declaration in various places. These were meant to be removed in the v0.34 update.
v0.37|9/22/2022|
  • DXIL: In [Annotating node handles](#annotating-node-handles), minor clarification that second NodeRecordInfo field is 0 for other record types.
  • Brought `[MaxDispatchGrid()]` [shader function attribute](#shader-function-attributes) back after previously cutting it because it appeared nobody needed it. Now it has become clear there is some need for this declaration to give implementations an upper bound on the magnitude of `SV_DispatchGrid` values to expect in records arriving at the node. This is an API/DDI and compiler breaking change.
v0.38|2/24/2023|
  • Implicit record casting has been removed, replaced with explicit `Get()` methods, with note regarding future potential language support for `operator->` for convenience. See [Record access](#record-access) section.
  • Added `NodeOutputArray` and `EmptyNodeOutputArray` in place of native array declarations; array size moved to `[NodeArraySize(count)]` attribute. See [Node output declaration](#node-output-declaration).
  • Record objects have been broken down differently, renamed, and functions have moved to methods. Coalescing inputs are the only ones that support array type, and max record/empty input count moved to `[MaxRecords(maxCount)]` attribute. See new [Objects](#objects) table for linked objects and methods.
  • Split NodeInputRecord objects between launch types and combine coalescing single and array types into `{RW}GroupNodeInputRecords`; see [Node input declaration](#node-input-declaration)
  • Merge/rename `{GroupShared}NodeOutputRecord` and `{GroupShared}NodeOutputRecordArray` objects into `{Group\|Thread}NodeOutputRecords`; see [GetThreadNodeOutputRecords](#getthreadnodeoutputrecords) and [GetGroupNodeOutputRecords](#getgroupnodeoutputrecords)
  • Split [IncrementOutputCount](#incrementoutputcount) method into Group and Thread variants to match `Get{Group\|Thread}NodeOutputRecords`
  • Move [NodeTrackRWInputSharing](#nodetrackrwinputsharing) to the record struct declaration, removing it from function, input, and output attributes.
  • Update DXIL operations to eliminate separate record indexing and annotation steps and fold record indexing back into `dx.op.getNodeRecordPtr`; see [Creating handles to node input records](#creating-handles-to-node-input-records), [Lowering GetThreadNodeOutputRecords](#lowering-get-nodeoutputrecords) and [Lowering input/output loads and stores](#lowering-inputoutput-loads-and-stores)
  • Clarifications for non-uniform parameters and empty allocation for [GetThreadNodeOutputRecords](#getthreadnodeoutputrecords)
  • Corresponding updates to examples and miscellaneous corrections
  • Updated [`dx.types.Node{Record}Info`](#annotating-node-handles). For `dx.types.NodeInfo`: removed `OutputArraySize`. For `dx.types.NodeRecordInfo`: removed `MaxArraySize`, added `RecordSize`
  • [NodeIOFlags](#nodeioflags-and-nodeiokind-encoding) have been updated with a new `DispatchRecord` value.
  • Updated [NodeIOKind enum](#nodeioflags-and-nodeiokind-encoding) for each HLSL object kind
  • Added [Node Shader Parameters](#node-shader-parameters) section with table for supported compute system values by launch mode.
-v0.39|5/5/2023|
  • In [Node output attributes](#node-output-attributes), renamed `[MaxOutputRecords()]` and `[MaxOutputRecordsSharedWith()]` to `[MaxRecords()]` and `[MaxRecordsSharedWith()]`. This allows reusing the same `[MaxRecords()]` [Node input attribute](#node-input-attributes). It is clear from the context whether it applies to input or output. Made the same name changes at the API, to [D3D12_NODE_OUTPUT_OVERRIDES](#d3d12_node_output_overrides) and DDI, to [D3D12DDI_NODE_OUTPUT_0108](#d3d12ddi_node_output_0108).
  • Made similar renames to the DXIL [Node input and output metadata table](#node-input-and-output-metadata-table), where `NodeInputMaxRecordArraySize` and `NodeMaxOutputRecords` are collapsed into `NodeMaxRecords`. Removing an entry in the table also means renumbering the remaining fields to remove the gap. Also renamed `NodeMaxOutputRecordsSharedWith` to `NodeMaxRecordsSharedWith`
  • Also in DXIL [Node input and output metadata table](#node-input-and-output-metadata-table), added `NodeAllowSparseNodes`.
  • For Broadcasting launch nodes, reduced the maximum value of any component of the dispatch grid size from 2^22-1 to 65535 (just like vanilla Compute). It turns out not all hardware can support the raised limit. The larger limit could be reconsidered in some future version of Work Graphs / Compute.
  • In [SV_DispatchGrid](#sv_dispatchgrid), clarified that for a broadcasting launch node that doesn't specify a fixed grid size, if the shader doesn't declare any input, it is implied the shader inputs a 12 byte uint3 `SV_DispatchGrid` record.
  • Large batch of wording cleanups in various sections, particularly near the beginning of the spec, intro sections.
  • Added [graphics nodes](#graphics-nodes). This is a future addition to work graphs, not currently supported.
  • Also added proposed, not yet supported, way to grow an already created state object, applying the existing [AddToStateObject()](#addtostateobject) API from DXR to [programs](#program) and work graphs. Subobjects and programs can be added incrementally, and using those (and/or building blocks already in the state object), new leaf nodes can be added to leaf node arrays of a work graph (where [AllowSparseNodes](#node-output-attributes) has been declared by the writer to the array) by adding new programs to already created state objects.
  • Added restriction on [sharing input records across nodes](#sharing-input-records-across-nodes) such that if N nodes share records produced by a node, the cost of that record is multiplied by N towards [node output limits](#node-output-limits). See [node output limits](#node-output-limits) for more detail.
  • Breaking update to [D3D12DDI_NODE_IO_FLAGS_0108](#d3d12ddi_node_io_flags_0108) and [D3D12DDI_NODE_IO_KIND_0108](#d3d12ddi_node_io_kind_0108) to match the equivalents in DXIL in [NodeIOFlags and NodeIOKind encoding](#nodeioflags-and-nodeiokind-encoding).
  • In [D3D12DDI_NODE_OUTPUT_0108](#d3d12ddi_node_output_0108) added `bool bAllowSparseNodes` to indicate the final status of the [[AllowSparseNodes]](#node-output-attributes) attribute, which may have been overridden at the API. This used to be part of [D3D12DDI_NODE_IO_FLAGS_0108](#d3d12ddi_node_io_flags_0108), but was removed (and from the DXIL equivalent), so the `bool` parameter was needed.
  • In [Lowering IncrementOutputCount](#lowering-incrementoutputcount), merged Thread and Group DXIL intrinsics into one using a PerThread bool, for consistency with [`dx.op.allocateNodeOutputRecords`](#lowering-get-nodeoutputrecords).
  • Added [Joins - synchronizing within the graph](#joins---synchronizing-within-the-graph) to discuss how synchronization within a graph can be done manually now, with the possibility of adding new node types in the future that could support more bulk static dependencies between them.
+v0.39|5/5/2023|
  • In [Node output attributes](#node-output-attributes), renamed `[MaxOutputRecords()]` and `[MaxOutputRecordsSharedWith()]` to `[MaxRecords()]` and `[MaxRecordsSharedWith()]`. This allows reusing the same `[MaxRecords()]` [Node input attribute](#node-input-attributes). It is clear from the context whether it applies to input or output. Made the same name changes at the API, to [D3D12_NODE_OUTPUT_OVERRIDES](#d3d12_node_output_overrides) and DDI, to [D3D12DDI_NODE_OUTPUT_0108](#d3d12ddi_node_output_0108).
  • Made similar renames to the DXIL [Node input and output metadata table](#node-input-and-output-metadata-table), where `NodeInputMaxRecordArraySize` and `NodeMaxOutputRecords` are collapsed into `NodeMaxRecords`. Removing an entry in the table also means renumbering the remaining fields to remove the gap. Also renamed `NodeMaxOutputRecordsSharedWith` to `NodeMaxRecordsSharedWith`
  • Also in DXIL [Node input and output metadata table](#node-input-and-output-metadata-table), added `NodeAllowSparseNodes`.
  • For Broadcasting launch nodes, reduced the maximum value of any component of the dispatch grid size from 2^22-1 to 65535 (just like vanilla Compute). It turns out not all hardware can support the raised limit. The larger limit could be reconsidered in some future version of Work Graphs / Compute.
  • In [SV_DispatchGrid](#sv_dispatchgrid), clarified that for a broadcasting launch node that doesn't specify a fixed grid size, if the shader doesn't declare any input, it is implied the shader inputs a 12 byte uint3 `SV_DispatchGrid` record.
  • Large batch of wording cleanups in various sections, particularly near the beginning of the spec, intro sections.
  • Added [graphics nodes](#graphics-nodes). This is a future addition to work graphs, not currently supported.
  • Also added proposed, not yet supported, way to grow an already created state object, applying the existing [AddToStateObject()](#addtostateobject) API from DXR to [programs](#program) and work graphs. Subobjects and programs can be added incrementally, and using those (and/or building blocks already in the state object), new leaf nodes can be added to leaf node arrays of a work graph (where [AllowSparseNodes](#node-output-attributes) has been declared by the writer to the array) by adding new programs to already created state objects.
  • Added restriction on [sharing input records across nodes](#sharing-input-records-across-nodes) such that if N nodes share records produced by a node, the cost of that record is multiplied by N towards [node output limits](#node-output-limits). See [node output limits](#node-output-limits) for more detail.
  • Breaking update to [D3D12DDI_NODE_IO_FLAGS_0108](#d3d12ddi_node_io_flags_0108) and [D3D12DDI_NODE_IO_KIND_0108](#d3d12ddi_node_io_kind_0108) to match the equivalents in DXIL in [NodeIOFlags and NodeIOKind encoding](#nodeioflags-and-nodeiokind-encoding).
  • In [D3D12DDI_NODE_OUTPUT_0108](#d3d12ddi_node_output_0108) added `bool bAllowSparseNodes` to indicate the final status of the `[`[AllowSparseNodes](#node-output-attributes)`]` attribute, which may have been overridden at the API. This used to be part of [D3D12DDI_NODE_IO_FLAGS_0108](#d3d12ddi_node_io_flags_0108), but was removed (and from the DXIL equivalent), so the `bool` parameter was needed.
  • In [Lowering IncrementOutputCount](#lowering-incrementoutputcount), merged Thread and Group DXIL intrinsics into one using a PerThread bool, for consistency with [`dx.op.allocateNodeOutputRecords`](#lowering-get-nodeoutputrecords).
  • Added [Joins - synchronizing within the graph](#joins---synchronizing-within-the-graph) to discuss how synchronization within a graph can be done manually now, with the possibility of adding new node types in the future that could support more bulk static dependencies between them.
v0.40|5/24/2023|
  • Fixed typos in DDI defines: [D3D12DDI_NODE_IO_KIND_0108](#d3d12ddi_node_io_kind_0108) and [D3D12DDI_NODE_OUTPUT_0108](#d3d12ddi_node_output_0108).
v0.41|5/26/2023|
  • In DXIL [Node input and output metadata table](#node-input-and-output-metadata-table), `NodeMaxRecords` and `NodeRecordType` entries were reversed from what the compiler implemented, so swapped the entries in the spec to match the code.
  • In [Example of creating node input and output handles](#example-of-creating-node-input-and-output-handles) and [Example DXIL metadata diagram](#example-dxil-metadata-diagram) fixed metadata encoding values, several of which were stale.
  • In [Node input](#node-input) fixed examples that used old GetInputRecordCount() to be .count().
  • In DXIL [NodeIOFlags and NodeIOKind encoding](#nodeioflags-and-nodeiokind-encoding), there was a typo `GroupNodeOutputRecord` fixed to `GroupNodeOutputRecords` and `ThreadNodeOutputRecord` fixed to `ThreadNodeOutputRecords`. Similar typo in the DDI header under [D3D12DDI_NODE_IO_KIND_0108](#d3d12ddi_node_io_kind_0108). This will require new headers and driver recompile but fortunately not a binary breaking change.
v0.42|6/12/2023|
  • In [Shader function attributes](#shader-function-attributes) section removed stale text under `[NodeShareInputOf()]` that said that nodes sharing input need to have the same node type and dispatch grid size - these constraints were never actually needed.
@@ -6536,3 +6542,4 @@ v1.002|3/21/2024|
  • In [D3D12_STATE_OBJECT_TYPE](#d3d12_state_object_type) clar v1.003|3/25/2024|
  • In [shader function attributes](#shader-function-attributes), for `[NodeShareInputOf(nodeID)]`, corrected the semantics around node renaming to match shipped behavior. The text used to say that the specified nodeID is before any node renames, so if the target gets renamed, the sharing connection based on old name stays intact. This was incorrect. Instead, if the node to share from gets renamed, the `[NodeShareInputOf(nodeID)]` attribute can be overridden at the API to point to the new nodeID.
v1.004|4/22/2024|
  • In [Barrier](#barrier) and [Lowering Barrier](#lowering-barrier), clarify that barrier translation between old/new DXIL ops depending on shader model is not implemented.
v1.005|4/30/2024|
  • Fixed minor typos in various mesh node DDI names where the version was 0108 but was meant to be 0110.
+v1.006|5/02/2024|
  • In [NodeID](#node-id) section, clarified some semantics around default NodeID naming. Specifically, if a [node shader definition](#shader-function-attributes) doesn't specify a `[NodeID()]` attribute, the node ID defaults to `{shader name in HLSL, 0}` in the compiled shader. The NodeID and shader name are separate entities in the shader. From the runtime point of view, it doesn't know whether the NodeID was specified explicitly or was a default assignment. The significance is if the shader export is renamed when importing into a state object, the NodeID doesn't also get renamed. To rename NodeIDs at state object creation, use node overrides such as [D3D12_COMMON_COMPUTE_NODE_OVERRIDES](#d3d12_common_compute_node_overrides) at the API.
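
The default NodeID rules described in this change can be illustrated with a small model. The following is only a sketch in plain C++, not D3D12 or DXC code: the types `NodeID` and `CompiledNodeShader` and the helper functions are hypothetical names invented for illustration, and they model only the rules stated in the text (default NodeID of `{shader name in HLSL, 0}`, export renames leaving the NodeID untouched, node overrides renaming it, and the array index acting as a base for output array indexing).

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical model of a Node ID: {string name, array index}.
struct NodeID {
    std::string name;
    unsigned arrayIndex = 0;
    bool operator==(const NodeID& other) const {
        return name == other.name && arrayIndex == other.arrayIndex;
    }
};

// Hypothetical model of a compiled node shader. The export name and the
// NodeID are stored separately, mirroring "separate entities in the shader".
struct CompiledNodeShader {
    std::string exportName;
    NodeID nodeID;
};

// Models compilation: without an explicit [NodeID()] attribute, the NodeID
// defaults to {shader name in HLSL, 0}. After this point nothing records
// whether the NodeID was explicit or a default.
CompiledNodeShader compileNodeShader(const std::string& hlslName,
                                     std::optional<NodeID> explicitID = std::nullopt) {
    return { hlslName, explicitID.value_or(NodeID{hlslName, 0}) };
}

// Models renaming the shader export when importing into a state object:
// only the export name changes, the NodeID does not follow.
CompiledNodeShader importWithExportRename(CompiledNodeShader shader,
                                          const std::string& newExportName) {
    shader.exportName = newExportName;
    return shader;
}

// Models a node override at state object creation, which is the way the
// text says NodeIDs get renamed.
CompiledNodeShader applyNodeIDOverride(CompiledNodeShader shader, NodeID newID) {
    shader.nodeID = std::move(newID);
    return shader;
}

// Models output array indexing: myOutputArray[i] resolves to
// {name, arrayIndex + i}, with the declared arrayIndex as the base.
NodeID resolveOutputArrayIndex(const NodeID& base, unsigned i) {
    return { base.name, base.arrayIndex + i };
}
```

In this model, compiling a shader named `myNode` with no attribute yields NodeID `{"myNode", 0}`, and renaming its export afterwards leaves that NodeID as is; only `applyNodeIDOverride` changes it.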