Support QDQ transformations with com.microsoft.Quantize/Dequantize ops (#17127)

### Description
- Enables int32 support for com.microsoft.DequantizeLinear (contrib op)
- Makes the `zero_point` input optional for the Quantize/Dequantize contrib ops (see the sketch after this list)
- Enables QDQ transformations with the Quantize/Dequantize contrib ops
- Updates tests: EnsureUniqueDQForNodeUnitTests, QDQTransformerTests, and TransposeOptimizerTests
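
As a rough illustration of the int32 and optional `zero_point` changes, a contrib-domain DequantizeLinear node can now be written with an int32 input and no zero-point input at all. The sketch below uses the ONNX protobuf C++ API; the tensor names and include path are illustrative only:

```cpp
#include "onnx/onnx_pb.h"  // ONNX protobuf definitions (include path may vary by build)

// Sketch: a com.microsoft DequantizeLinear node with an int32 input and no
// zero_point input. A missing zero point is treated as 0.
onnx::NodeProto MakeContribDequantizeNode() {
  onnx::NodeProto dq;
  dq.set_op_type("DequantizeLinear");
  dq.set_domain("com.microsoft");
  dq.add_input("bias_quantized_int32");  // T1 = tensor(int32) is now accepted
  dq.add_input("bias_scale");            // T2 = tensor(float)
  // Input 2 (zero_point) is intentionally omitted: it is now optional.
  dq.add_output("bias_dequantized");
  return dq;
}
```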

### Testing
List of tested graph transformations:
- [x] QDQSelectorActionTransformer
  - qdq_transformer_test.cc
- [x] QDQS8ToU8Transformer
  - qdq_transformer_test.cc
- [x] DoubleQDQPairsRemover
  - qdq_transformer_test.cc
- [x] IdenticalChildrenConsolidation
  - qdq_transformer_test.cc
- [x] QDQPropagation
  - qdq_transformer_test.cc
- [x] QDQFinalCleanup
  - qdq_transformer_test.cc
- [x] ClipQuantFusion
  - qdq_transformer_test.cc
- [x] ReluQuantFusion
  - qdq_transformer_test.cc
- [x] EnsureUniqueDQForNodeUnit 
  - ensure_unique_dq_for_node_unit_test.cc
- [x] TransposeOptimizer 
  - transpose_optimizer_test.cc
- [x] CommonSubexpressionElimination
  - graph_transform_test.cc
- [x] ConstantFolding
  - graph_transform_test.cc

### Motivation and Context
We need to [support mixed 16-bit/8-bit precision QDQ models](#17015). This PR is the first step toward that goal: making the QDQ contrib ops work with our optimizations/transformations.

---------

Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
3 people authored Aug 25, 2023
1 parent 79c4ed9 commit 5a83a67
Showing 32 changed files with 1,656 additions and 917 deletions.
20 changes: 10 additions & 10 deletions docs/ContribOperators.md
@@ -1330,15 +1330,15 @@ This version of the operator has been available since version 1 of the 'com.micr
<dd>The axis along which same quantization parameters are applied. It's optional.If it's not specified, it means per-tensor quantization and input 'x_scale' and 'x_zero_point' must be scalars.If it's specified, it means per 'axis' quantization and input 'x_scale' and 'x_zero_point' must be 1-D tensors.</dd>
</dl>

#### Inputs
#### Inputs (2 - 3)

<dl>
<dt><tt>x</tt> : T1</dt>
<dd>N-D quantized Input tensor to be de-quantized.</dd>
<dt><tt>x_scale</tt> : T2</dt>
<dd>Scale for input 'x'. It could be a scalar or a 1-D tensor, which means a per-tensor or per-axis quantization.If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
<dt><tt>x_zero_point</tt> : T1</dt>
<dd>Zero point for input 'x'. It could be a scalar or a 1-D tensor, which means a per-tensor or per-axis quantization.If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
<dd>Scale for input 'x'. It can be a scalar, which means a per-tensor/layer dequantization, or a 1-D tensor for per-axis dequantization.</dd>
<dt><tt>x_zero_point</tt> (optional) : T1</dt>
<dd>Zero point for input 'x'. Shape must match x_scale. It's optional. Zero point is 0 when it's not specified.</dd>
</dl>

#### Outputs
@@ -1351,8 +1351,8 @@ This version of the operator has been available since version 1 of the 'com.micr
#### Type Constraints

<dl>
<dt><tt>T1</tt> : tensor(int8), tensor(uint8)</dt>
<dd>Constrain 'x' and 'x_zero_point' to 8-bit integer tensors.</dd>
<dt><tt>T1</tt> : tensor(int8), tensor(uint8), tensor(int32)</dt>
<dd>Constrain 'x' and 'x_zero_point' to 8-bit integer tensors or 32-bit signed integer tensors.</dd>
<dt><tt>T2</tt> : tensor(float16), tensor(float)</dt>
<dd>Constrain 'y', 'x_scale' to float tensors.</dd>
</dl>
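
For reference, the per-tensor case this spec describes reduces to `y = (x - x_zero_point) * x_scale`, with the zero point taken as 0 when the optional input is absent (and typically 0 for int32 inputs). A minimal sketch of that computation, not the actual kernel code:

```cpp
#include <cstdint>
#include <vector>

// Per-tensor dequantization sketch: y = (x - zero_point) * scale.
// The zero point defaults to 0 when the optional input is not provided.
template <typename QuantType>  // e.g. int8_t, uint8_t, int32_t
std::vector<float> DequantizePerTensor(const std::vector<QuantType>& x,
                                       float scale,
                                       QuantType zero_point = 0) {
  std::vector<float> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    y[i] = (static_cast<float>(x[i]) - static_cast<float>(zero_point)) * scale;
  }
  return y;
}
```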
@@ -4209,15 +4209,15 @@ This version of the operator has been available since version 1 of the 'com.micr
<dd>The axis along which same quantization parameters are applied. It's optional.If it's not specified, it means per-tensor quantization and input 'x_scale' and 'x_zero_point' must be scalars.If it's specified, it means per 'axis' quantization and input 'x_scale' and 'x_zero_point' must be 1-D tensors.</dd>
</dl>

#### Inputs
#### Inputs (2 - 3)

<dl>
<dt><tt>x</tt> : T1</dt>
<dd>N-D full precision Input tensor to be quantized.</dd>
<dt><tt>y_scale</tt> : T1</dt>
<dd>Scale for doing quantization to get 'y'. It could be a scalar or a 1-D tensor,which means a per-tensor or per-axis quantization. If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
<dt><tt>y_zero_point</tt> : T2</dt>
<dd>Zero point for doing quantization to get 'y'. It could be a scalar or a 1-D tensor, which means a per-tensoror per-axis quantization. If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
<dd>Scale for doing quantization to get 'y'. It can be a scalar, which means per-tensor/layer quantization, or a 1-D tensor for per-axis quantization.</dd>
<dt><tt>y_zero_point</tt> (optional) : T2</dt>
<dd>Zero point for doing quantization to get 'y'. Shape must match y_scale. Default is uint8 with zero point of 0 if it's not specified.</dd>
</dl>

#### Outputs
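Analogously, per-tensor quantization is `y = saturate(round(x / y_scale) + y_zero_point)`; when `y_zero_point` is omitted, the output defaults to uint8 with a zero point of 0. A rough sketch of that default path, not the actual kernel code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Per-tensor quantization sketch for the default case (no y_zero_point input):
// output type uint8, zero point 0, values saturated to [0, 255].
std::vector<uint8_t> QuantizePerTensorDefault(const std::vector<float>& x, float scale) {
  std::vector<uint8_t> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    float q = std::nearbyint(x[i] / scale);   // round to nearest (ties to even by default)
    q = std::min(255.0f, std::max(0.0f, q));  // saturate to the uint8 range
    y[i] = static_cast<uint8_t>(q);
  }
  return y;
}
```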
2 changes: 1 addition & 1 deletion docs/OperatorKernels.md
@@ -439,7 +439,7 @@ Do not modify directly.*
|CDist|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|1+|**T** = tensor(double), tensor(float)|
|ConvTransposeWithDynamicPads|*in* X:**T**<br> *in* W:**T**<br> *in* Pads:**tensor(int64)**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|CropAndResize|*in* X:**T1**<br> *in* rois:**T1**<br> *in* batch_indices:**T2**<br> *in* crop_size:**T2**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int32)|
|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(float)|
|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int32), tensor(int8), tensor(uint8)<br/> **T2** = tensor(float)|
|DynamicQuantizeLSTM|*in* X:**T**<br> *in* W:**T2**<br> *in* R:**T2**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *in* W_scale:**T**<br> *in* W_zero_point:**T2**<br> *in* R_scale:**T**<br> *in* R_zero_point:**T2**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|1+|**T** = tensor(float)<br/> **T1** = tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
|DynamicQuantizeMatMul|*in* A:**T1**<br> *in* B:**T2**<br> *in* b_scale:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
|EmbedLayerNormalization|*in* input_ids:**T1**<br> *in* segment_ids:**T1**<br> *in* word_embedding:**T**<br> *in* position_embedding:**T**<br> *in* segment_embedding:**T**<br> *in* gamma:**T**<br> *in* beta:**T**<br> *in* mask:**T1**<br> *in* position_ids:**T1**<br> *out* output:**T**<br> *out* mask_index:**T1**<br> *out* embedding_sum:**T**|1+|**T** = tensor(float)|
2 changes: 2 additions & 0 deletions onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
@@ -56,6 +56,7 @@ class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLine
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLinearAveragePool);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, DequantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, DequantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int32_t, DequantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QuantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QuantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearLeakyRelu);
@@ -190,6 +191,7 @@ Status RegisterQuantizationKernels(KernelRegistry& kernel_registry) {
BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLinearAveragePool)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, DequantizeLinear)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, DequantizeLinear)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int32_t, DequantizeLinear)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QuantizeLinear)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QuantizeLinear)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearLeakyRelu)>,
9 changes: 9 additions & 0 deletions onnxruntime/contrib_ops/cpu/quantization/quantize_ops.cc
@@ -25,6 +25,15 @@ ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
.TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
DequantizeLinear<int8_t>);

ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
DequantizeLinear,
1,
int32_t,
KernelDefBuilder()
.TypeConstraint("T1", DataTypeImpl::GetTensorType<int32_t>())
.TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
DequantizeLinear<int32_t>);

ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
QuantizeLinear,
1,
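The mixed 16-bit/8-bit work called out in the motivation would presumably reuse this same registration pattern. Purely as a hypothetical follow-up (not part of this PR), a uint16 variant might look like:

```cpp
// Hypothetical sketch only -- this PR does not add 16-bit kernels.
ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
    DequantizeLinear,
    1,
    uint16_t,
    KernelDefBuilder()
        .TypeConstraint("T1", DataTypeImpl::GetTensorType<uint16_t>())
        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
    DequantizeLinear<uint16_t>);
```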
38 changes: 18 additions & 20 deletions onnxruntime/core/graph/contrib_ops/quantization_defs.cc
@@ -152,23 +152,24 @@ ONNX_MS_OPERATOR_SET_SCHEMA(
AttributeProto::INT, false)
.Input(0, "x", "N-D full precision Input tensor to be quantized.", "T1")
.Input(1, "y_scale",
"Scale for doing quantization to get 'y'. It could be a scalar or a 1-D tensor,"
"which means a per-tensor or per-axis quantization. If it's a 1-D tensor, "
"its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.",
"Scale for doing quantization to get 'y'. It can be a scalar, which means per-tensor/layer "
"quantization, or a 1-D tensor for per-axis quantization.",
"T1")
.Input(2, "y_zero_point",
"Zero point for doing quantization to get 'y'. It could be a scalar or a 1-D tensor, which means a "
"per-tensor"
"or per-axis quantization. If it's a 1-D tensor, its number of elements should be equal to the "
"dimension value of 'axis' dimension of input 'x'.",
"T2")
"Zero point for doing quantization to get 'y'. Shape must match y_scale. Default is "
"uint8 with zero point of 0 if it's not specified.",
"T2", OpSchema::Optional)
.Output(0, "y", "N-D quantized output tensor. It has same shape as input 'x'.", "T2")
.TypeConstraint("T1", {"tensor(float16)", "tensor(float)"}, "Constrain 'x', 'y_scale' to float tensors.")
.TypeConstraint("T2", {"tensor(int8)", "tensor(uint8)"},
"Constrain 'y_zero_point' and 'y' to 8-bit integer tensors.")
.SetDoc(QuantizeLinear_ver1_doc)
.TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
propagateElemTypeFromInputToOutput(ctx, 2, 0);
if (ctx.getNumInputs() == 3 && ctx.getInputType(2) != nullptr) {
propagateElemTypeFromInputToOutput(ctx, 2, 0);
} else {
updateOutputElemType(ctx, 0, ONNX_NAMESPACE::TensorProto::UINT8);
}

if (!hasInputShape(ctx, 0)) return;

@@ -192,21 +193,18 @@ ONNX_MS_OPERATOR_SET_SCHEMA(DequantizeLinear, 1,
AttributeProto::INT, false)
.Input(0, "x", "N-D quantized Input tensor to be de-quantized.", "T1")
.Input(1, "x_scale",
"Scale for input 'x'. It could be a scalar or a 1-D tensor, which means a "
"per-tensor or per-axis quantization."
"If it's a 1-D tensor, its number of elements should be equal to the dimension "
"value of 'axis' dimension of input 'x'.",
"Scale for input 'x'. It can be a scalar, which means a per-tensor/layer "
"dequantization, or a 1-D tensor for per-axis dequantization.",
"T2")
.Input(2, "x_zero_point",
"Zero point for input 'x'. It could be a scalar or a 1-D tensor, which means a "
"per-tensor or per-axis quantization."
"If it's a 1-D tensor, its number of elements should be equal to the dimension "
"value of 'axis' dimension of input 'x'.",
"T1")
"Zero point for input 'x'. Shape must match x_scale. It's optional. "
"Zero point is 0 when it's not specified.",
"T1", OpSchema::Optional)
.Output(0, "y", "N-D full precision output tensor. It has same shape as input 'x'.",
"T2")
.TypeConstraint("T1", {"tensor(int8)", "tensor(uint8)"},
"Constrain 'x' and 'x_zero_point' to 8-bit integer tensors.")
.TypeConstraint("T1", {"tensor(int8)", "tensor(uint8)", "tensor(int32)"},
"Constrain 'x' and 'x_zero_point' to 8-bit integer tensors or 32-bit "
"signed integer tensors.")
.TypeConstraint("T2", {"tensor(float16)", "tensor(float)"},
"Constrain 'y', 'x_scale' to float tensors.")
.SetDoc(DequantizeLinear_ver1_doc)
@@ -324,7 +324,8 @@ bool IsNodeSupported(const Node& node) {
// would result in it having multiple consumers for its output, and it being used in multiple QDQ node groups.
return !node.ContainsSubgraph() &&
optimizer_utils::IsOperationDeterministic(node.Domain(), node.OpType()) &&
!(node.Domain() == kOnnxDomain && node.OpType() == "DequantizeLinear");
!(node.Domain() == kOnnxDomain && node.OpType() == "DequantizeLinear") &&
!(node.Domain() == kMSDomain && node.OpType() == "DequantizeLinear");
}
} // namespace

23 changes: 10 additions & 13 deletions onnxruntime/core/optimizer/constant_folding.cc
@@ -123,17 +123,6 @@ Status ConstantFolding::ApplyImpl(Graph& graph, bool& modified, int graph_level,
} else {
InitializedTensorSet constant_inputs;

// we currently constant fold using the CPU EP only.
// if the node is assigned to a different EP we can run it if it's an ONNX op as we have CPU based
// implementations for all ONNX ops. If the node/op is from a different op domain or if the CPU implementation
// does not support the specific input type(s) required by the node (currently we only support a subset of
// types in some CPU kernels) then we can't proceed with constant folding for the node.
auto ep_type = node->GetExecutionProviderType();
bool cpu_ep = ep_type == kCpuExecutionProvider;
if (!cpu_ep && node->Domain() != kOnnxDomain) {
continue;
}

// Check if constant folding can be applied on this node.
const auto can_constant_fold_node = [&](const Node& n, bool skip_inputs_constant_check = false) {
return graph_utils::IsSupportedProvider(n, GetCompatibleExecutionProviders()) &&
@@ -196,18 +185,26 @@ Status ConstantFolding::ApplyImpl(Graph& graph, bool& modified, int graph_level,
fetch_mlvalue_idxs.push_back(info.GetMLValueIndex(node_out->Name()));
}

auto& ep_type = node->GetExecutionProviderType();
const bool node_on_cpu_ep = ep_type == kCpuExecutionProvider;

// override the EP assigned to the node so that it will use the CPU kernel for Compute.
if (!cpu_ep) {
if (!node_on_cpu_ep) {
node->SetExecutionProviderType(kCpuExecutionProvider);
}

auto kernel = info.CreateKernel(node);

// undo the EP change to the value that was assigned at graph partitioning time
if (!cpu_ep) {
if (!node_on_cpu_ep) {
node->SetExecutionProviderType(ep_type);
}

// We currently constant fold using the CPU EP only.
// If we can't find a CPU kernel for this node, then we can't proceed with constant folding.
//
// TODO(adrianlizarraga): Support constant folding with other execution providers. For example, we may be able
// to use a CUDA kernel to constant fold operators with data types not supported by the CPU EP kernel.
if (kernel == nullptr) {
LOGS(logger, WARNING) << "Could not find a CPU kernel and hence "
<< "can't constant fold " << node->OpType() << " node '" << node->Name() << "'";
2 changes: 1 addition & 1 deletion onnxruntime/core/optimizer/double_qdq_pairs_remover.cc
@@ -51,7 +51,7 @@ bool DoubleQDQPairsRemover::IsNodeRemovable(
}

// Type is either "tensor(uint8)" or "tensor(int8)"
const auto self_zp_type = *self->InputDefs()[InputIndex::ZERO_POINT_ID]->Type();
const auto& self_zp_type = *self->InputDefs()[InputIndex::ZERO_POINT_ID]->Type();
// child should be a Q, and have only one child, have the same type as self, and cannot be a graph output
child_index = self->OutputEdgesBegin()->GetNode().Index();
const Node* child = graph.GetNode(child_index);
@@ -150,7 +150,7 @@ Status TransformLayoutForEP(Graph& graph, bool& modified, const IExecutionProvid

const auto max_node_idx = graph.MaxNodeIndex();
OptimizeResult result = onnx_transpose_optimization::Optimize(*api_graph, execution_provider.Type(),
PostLayoutTransformCostCheck);
PostLayoutTransformCostCheck, OrtExtendedHandlers());

if (result.error_msg) {
return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Layout/Transpose optimization for ", execution_provider.Type(),
@@ -3,6 +3,7 @@

#include "core/optimizer/initializer.h"
#include "core/optimizer/qdq_transformer/clip_quantizelinear.h"
#include "core/optimizer/qdq_transformer/qdq_util.h"
#include "core/optimizer/utils.h"
#include "core/graph/graph_utils.h"

@@ -73,7 +74,7 @@ bool ClipQuantFusion::SatisfyCondition(const Graph& graph, const Node& node, con

// if Clip is followed by QuantizeLinear, it can be fused into QuantizeLinear potentially
const auto& next_node = *node.OutputNodesBegin();
if (!graph_utils::IsSupportedOptypeVersionAndDomain(next_node, "QuantizeLinear", {10, 13, 19})) {
if (!QDQ::MatchQNode(next_node)) {
return false;
}

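For context, the switch from a hard-coded ONNX-domain opset check to QDQ::MatchQNode is what lets ClipQuantFusion also match the contrib QuantizeLinear op. A matcher of that kind presumably reduces to something like the sketch below (the real QDQ::MatchQNode in qdq_util.h may differ):

```cpp
// Illustrative sketch of a domain-aware QuantizeLinear matcher; see
// core/optimizer/qdq_transformer/qdq_util.h for the real helper.
bool LooksLikeQNode(const Node& node) {
  return graph_utils::IsSupportedOptypeVersionAndDomain(node, "QuantizeLinear", {10, 13, 19}) ||
         graph_utils::IsSupportedOptypeVersionAndDomain(node, "QuantizeLinear", {1}, kMSDomain);
}
```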
@@ -52,7 +52,9 @@ Status DuplicateDQForOutputEdge(const graph_utils::GraphEdge& original_dq_output
QDQ::DQOpName,
MakeString("Added by ", kTransformerName),
dq_inputs,
{&new_dq_output_nodearg});
{&new_dq_output_nodearg},
nullptr, // attributes
original_dq_node.Domain());

// set up edges
// remove DQ -> Y