Objective & Motivation
The cuDNN graph API allows fusing convolution or matmul operations with generalized prologue and/or epilogue DAGs consisting of pointwise (and one reduction) operations. The XLA fuser currently does not have the capability to perform such fusions, so integrating cuDNN graph API support is a great opportunity to optimize the graph further and enhance performance.
Scope
The scope of this document is limited to convolution/matmul with pointwise (and reduction) fusions only. Note that these graph APIs can also be used to fuse other operation modules, such as multi-headed attention in transformers. Moreover, cuDNN graph APIs support pre-compiled single-operation engines, runtime fusion engines, and pre-compiled specialized engines. This document discusses the high-level design of integrating runtime fusion engines into XLA.
Background
Shown below are a couple of fusion patterns supported by cuDNN, where g1 and g2 are DAGs.
g1 is an epilogue DAG that can consist of zero or any number of the following operations:
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
g2 is a prologue DAG that can consist of zero or any number of the following operations:
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
For details on additional fusion patterns and constraints on g1 and g2, refer to the cuDNN graph API documentation.
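For concreteness, a single pointwise node of the kind that may appear in g1/g2 might be built with the cudnn_frontend API roughly as follows. This is a minimal sketch: the tensor shapes, strides, ids, and the choice of ReLU mode are illustrative, not requirements of the design.

```cpp
#include <cudnn_frontend.h>

// Build one CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR node (here a ReLU).
// Dims/strides/ids below are arbitrary illustrative values.
cudnn_frontend::Operation MakeReluNode() {
  int64_t dim[] = {8, 64, 56, 56};
  int64_t stride[] = {64 * 56 * 56, 56 * 56, 56, 1};
  auto x = cudnn_frontend::TensorBuilder()
               .setDim(4, dim)
               .setStride(4, stride)
               .setId('x')
               .setAlignment(16)
               .setDataType(CUDNN_DATA_HALF)
               .build();
  auto y = cudnn_frontend::TensorBuilder()
               .setDim(4, dim)
               .setStride(4, stride)
               .setId('y')
               .setAlignment(16)
               .setDataType(CUDNN_DATA_HALF)
               .build();
  auto relu = cudnn_frontend::PointWiseDescBuilder()
                  .setMode(CUDNN_POINTWISE_RELU_FWD)
                  .setMathPrecision(CUDNN_DATA_FLOAT)
                  .build();
  return cudnn_frontend::OperationBuilder(
             CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR)
      .setxDesc(x)
      .setyDesc(y)
      .setpwDesc(relu)
      .build();
}
```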
High-level Design
In a nutshell, the idea is to lower the fused conv/matmul + pointwise operation through a custom-call whose variadic inputs are determined by the number of inputs to the DAGs g1 and g2, and to invoke the corresponding custom-call thunk at runtime.
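A minimal sketch of that lowering step is shown below. The custom-call target name "__cudnn$fusedGraph" is a placeholder, and for brevity the sketch assumes all of the conv/matmul's inputs arrive through a single prologue fusion:

```cpp
#include "absl/status/statusor.h"
#include "tsl/platform/errors.h"
#include "xla/hlo/ir/hlo_computation.h"
#include "xla/hlo/ir/hlo_instruction.h"

namespace xla::gpu {

// Replace fusion(g2) -> conv/matmul -> fusion(g1) with one variadic
// custom-call whose operands are the external inputs of g2 plus the extra
// (side) inputs of g1.
absl::StatusOr<HloInstruction*> LowerToCudnnGraphCustomCall(
    HloInstruction* conv, HloInstruction* prologue, HloInstruction* epilogue) {
  std::vector<HloInstruction*> operands(prologue->operands().begin(),
                                        prologue->operands().end());
  for (HloInstruction* operand : epilogue->operands()) {
    if (operand != conv) operands.push_back(operand);  // g1 side inputs
  }
  HloComputation* computation = conv->parent();
  HloInstruction* call =
      computation->AddInstruction(HloInstruction::CreateCustomCall(
          epilogue->shape(), operands,
          /*custom_call_target=*/"__cudnn$fusedGraph"));
  TF_RETURN_IF_ERROR(computation->ReplaceInstruction(epilogue, call));
  return call;
}

}  // namespace xla::gpu
```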
The design can be broadly divided into three phases:
HLO Fusion: This phase identifies the prologue and epilogue DAGs along with the conv/matmul operation and creates the custom-call HLO for the fused op graph. There are three options for identifying the prologue and epilogue DAGs, discussed below in increasing order of complexity:
(a) Leverage the existing XLA fuser as is: This option entails running the XLA fusion pipeline as is and adding a new fused cuDNN graph rewriter pass after FusionMerger or GpuMultiOutputFusion. This new rewriter pass simply inspects the generated fused computations and conv/matmul calls to find the pattern fusion(g2) -> conv/matmul -> fusion(g1) (see the sketch below) and fuses them to create a new custom-call. The autotuner must run after this pass.
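Sketched below is what that match might look like with XLA's pattern matcher, assuming the conv has already been rewritten to the existing cuDNN conv custom-call (which returns a (result, scratch) tuple). The exact shape of the match is illustrative; matmuls matched against cublas_gemm custom-calls would look analogous.

```cpp
#include "xla/service/gpu/cublas_cudnn.h"
#include "xla/service/pattern_matcher.h"

namespace m = xla::match;

// Option (a)'s pattern: a prologue fusion feeding the existing cuDNN conv
// custom-call, whose result (element 0 of its tuple) feeds an epilogue fusion.
bool MatchConvWithPrologueAndEpilogue(xla::HloInstruction* instr,
                                      xla::HloInstruction** conv) {
  return xla::Match(
      instr,
      m::Fusion()  // epilogue DAG g1, generated by the XLA fuser
          .WithOperand(
              0, m::GetTupleElement(
                     m::CustomCall(conv,
                                   {xla::gpu::kCudnnConvForwardCallTarget})
                         .WithOperand(0, m::Fusion()),  // prologue DAG g2
                     0)));
}
```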
(b) New fusion pass: Create a new cuDNNGraphFusion pass to identify and fuse the prologue and epilogue DAG fusion patterns based on the constraints imposed by the cuDNN graph API. The fused computations it generates can be registered as “cuDNN_fused_computation”, so the rewriter then needs to identify that same fusion(g2) -> conv/matmul -> fusion(g1) pattern and fuse it into a custom-call. There are two ways to phase-order this new fusion pass (a pipeline sketch follows the two variants):
A. Run the cuDNNGraphFusion pass to generate g1 and g2 -> rewriter to identify the above fusion pattern and create a new custom-call -> autotuner -> existing fuser passes run as is. Note that this requires two fusion phases; the first phase contains only the cuDNNGraphFusion pass, which generates the cuDNN prologue and epilogue fusions.
B. Add the cuDNNGraphFusion pass in the middle of the existing fusion pipeline. In this case, the rewriter and the autotuner should run at a later stage, after fusion.
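A sketch of phase-ordering variant A as an HloPassPipeline. Every pass name here other than HloPassPipeline itself (cuDNNGraphFusion and the rewriter/autotuner passes) is a hypothetical name for a pass this design would introduce:

```cpp
#include "xla/service/hlo_pass_pipeline.h"

// Variant A: a dedicated pre-fusion phase that runs before the existing
// XLA fusion pipeline. CudnnGraphFusion, CudnnFusedGraphRewriter, and
// CudnnGraphAutotuner are hypothetical pass classes.
void AddCudnnGraphFusionPhase(xla::HloPassPipeline& pipeline) {
  pipeline.AddPass<CudnnGraphFusion>();         // generate g1/g2 fused computations
  pipeline.AddPass<CudnnFusedGraphRewriter>();  // fuse into the cuDNN custom-call
  pipeline.AddPass<CudnnGraphAutotuner>();      // pick engine + knob configuration
  // ... the existing XLA fuser passes then run as is.
}
```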
(c) A self-contained cuDNN fused graph emitter phase: A separate pass, independent of the XLA fuser, can be implemented to identify the complete pattern. This pass (similar to a rewriter) carries the burden of identifying g1 and g2 and then fusing them with the conv/matmul. It would be a complex rewriter sharing concepts with the XLA fuser, but the upside of this approach is that the XLA fuser remains completely independent and takes on no additional complexity.
Options (b) and (c) share the same design philosophy: create new fusion patterns and then match these fused DAGs with a conv/matmul custom-call to create the fused cuDNN custom-call. Option (a) simply tries to identify fusions already generated by the existing XLA fuser (which are prologues and epilogues of a conv/matmul op) and create fused cuDNN custom-calls. Option (a), although it has a lower probability of triggering cuDNN fusions, also has a significantly lower probability of performance regressions. Options (b) and (c) would almost guarantee more instances of cuDNN fusions, but may also increase the chances of unforeseen regressions, apart from having greater implementation complexity. Hence, options (b) and (c) can be seen as enhancements to option (a). Note that it is also possible for the options to co-exist; a debug flag can be used to switch between them to evaluate performance.
The custom-call needs to serialize the DAG patterns somehow, either in the backend config or in two new attributes, prolog_function and epilog_function, that point to the g1 and g2 graphs. The mechanism for this is TBD. Also note that the cuDNN Matmul fusions need to be registered separately from the cublas_gemm custom-calls.
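One candidate mechanism (an assumption on our part; the document leaves this TBD) is to avoid string serialization entirely and attach g2 and g1 as called computations on the custom-call, which HloInstruction::CreateCustomCall already supports:

```cpp
#include "xla/hlo/ir/hlo_computation.h"
#include "xla/hlo/ir/hlo_instruction.h"

// Attach g2 (prolog_function) and g1 (epilog_function) as the custom-call's
// called computations; the backend config then only needs to carry the
// engine/knob choices, not the DAGs themselves. The target name is a
// placeholder.
xla::HloInstruction* MakeFusedCustomCall(
    xla::HloComputation* parent, const xla::Shape& result_shape,
    absl::Span<xla::HloInstruction* const> operands,
    xla::HloComputation* prolog_function,    // g2
    xla::HloComputation* epilog_function) {  // g1
  return parent->AddInstruction(xla::HloInstruction::CreateCustomCall(
      result_shape, operands,
      /*called_computations=*/{prolog_function, epilog_function},
      /*custom_call_target=*/"__cudnn$fusedGraph"));
}
```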
Autotuner: At the autotuner stage,
The DAG patterns embedded in the custom-call (or serialized) need to be parsed and translated into a cuDNN op graph.
Autotune the op graph and store the chosen engine and knob configuration in the custom-call (a sketch of this step follows the list). Note that it is now possible to serialize the plan itself rather than rebuild it at runtime; however, this capability does not currently exist in XLA. This is another opportunity for enhancement, but it is outside the scope of this document.
The autotuner calls into the XLA runtime through the gpu conv/gemm runner, which is described in the upcoming section.
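A minimal sketch of the op-graph construction and engine selection with the cudnn_frontend API. Error handling and real timing are elided, and `ops` is assumed to already hold the conv/matmul and pointwise operations translated from the custom-call's embedded DAGs; a real autotuner would time every candidate plan and record the winner's engine + knobs in the custom-call.

```cpp
#include <cudnn_frontend.h>

#include <stdexcept>
#include <vector>

// Build the op graph, query heuristics for candidate engine configs, and
// build an execution plan per candidate. Here we simply return the first
// plan that builds successfully instead of timing them all.
cudnn_frontend::ExecutionPlan PickPlan(
    cudnnHandle_t handle,
    std::vector<cudnn_frontend::Operation const*>& ops) {
  auto op_graph = cudnn_frontend::OperationGraphBuilder()
                      .setHandle(handle)
                      .setOperationGraph(ops.size(), ops.data())
                      .build();
  auto heuristics = cudnn_frontend::EngineHeuristicsBuilder()
                        .setOperationGraph(op_graph)
                        .setHeurMode(CUDNN_HEUR_MODE_INSTANT)
                        .build();
  auto& configs =
      heuristics.getEngineConfig(heuristics.getEngineConfigCount());
  for (auto& config : configs) {
    auto plan = cudnn_frontend::ExecutionPlanBuilder()
                    .setHandle(handle)
                    .setEngineConfig(config, op_graph.getTag())
                    .build();
    if (plan.get_status() == CUDNN_STATUS_SUCCESS) return plan;
  }
  throw std::runtime_error("no cuDNN engine supports this op graph");
}
```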
Runtime: The XLA runtime already supports cuDNN frontend APIs for conv+bias+act, i.e., it already uses cuDNN op graphs. However, that support is limited to fixed patterns (signatures) only. There are a couple of design options:
Extend the existing fused convolution runner and thunk to support fused Conv thunks with a variable number of inputs (i.e., generalized signatures), and create a new gpu_matmul_runner and matmul_thunks corresponding to the custom-calls registered against cuDNN Matmul fusion instructions.
Create a generalized thunk for cuDNN op graphs that would handle Conv/Matmul + pointwise fusions as well as any other generalized fusion patterns lowered through a custom-call HLO (see the sketch below). However, this option needs further analysis to determine its feasibility.
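A sketch of what such a generalized thunk could look like. This is heavily hedged: the class, its members, the header paths, and the handle/workspace plumbing are all hypothetical, and a real integration would go through XLA's thunk and buffer-assignment machinery.

```cpp
#include <cudnn_frontend.h>

#include <vector>

#include "xla/service/gpu/buffer_allocations.h"
#include "xla/service/gpu/thunk.h"  // header path approximate

namespace xla::gpu {

// Hypothetical generalized thunk: owns a prebuilt cuDNN execution plan plus a
// variadic list of buffer slices, so one thunk type covers conv and matmul
// fusions alike.
class CudnnGraphThunk : public Thunk {
 public:
  CudnnGraphThunk(ThunkInfo info, cudnnHandle_t handle,
                  cudnn_frontend::ExecutionPlan plan,
                  std::vector<BufferAllocation::Slice> args,
                  std::vector<int64_t> uids)
      : Thunk(Kind::kCustomCall, std::move(info)),
        handle_(handle),
        plan_(std::move(plan)),
        args_(std::move(args)),
        uids_(std::move(uids)) {}

  absl::Status ExecuteOnStream(const ExecuteParams& params) override {
    // Resolve each slice to a device pointer, bind pointers to tensor uids in
    // a variant pack, and execute the plan. (Stream binding, workspace setup,
    // and status checks are elided in this sketch.)
    std::vector<void*> ptrs;
    ptrs.reserve(args_.size());
    for (const auto& slice : args_) {
      ptrs.push_back(
          params.buffer_allocations->GetDeviceAddress(slice).opaque());
    }
    auto pack = cudnn_frontend::VariantPackBuilder()
                    .setDataPointers(ptrs.size(), ptrs.data())
                    .setUids(uids_.size(), uids_.data())
                    .build();
    cudnnBackendExecute(handle_, plan_.get_raw_desc(), pack.get_raw_desc());
    return absl::OkStatus();
  }

 private:
  cudnnHandle_t handle_;
  cudnn_frontend::ExecutionPlan plan_;
  std::vector<BufferAllocation::Slice> args_;
  std::vector<int64_t> uids_;
};

}  // namespace xla::gpu
```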