Support QDQ transformations with com.microsoft.Quantize/Dequantize ops (#17127)

### Description
- Enables int32 support for com.microsoft.DequantizeLinear (contrib op)
- Makes the `zero_point` input optional for the Quantize/Dequantize contrib ops (see the sketch after this list)
- Enables QDQ transformations with the Quantize/Dequantize contrib ops
- Updates tests: EnsureUniqueDQForNodeUnitTests, QDQTransformerTests, and TransposeOptimizerTests
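
As a rough illustration of the int32 and optional `zero_point` changes, a contrib-domain DequantizeLinear node can now be written with an int32 input and no zero-point input at all. The sketch below uses the ONNX protobuf C++ API; the tensor names and include path are illustrative only:

```cpp
#include "onnx/onnx_pb.h"  // ONNX protobuf definitions (include path may vary by build)

// Sketch: a com.microsoft DequantizeLinear node with an int32 input and no
// zero_point input. A missing zero point is treated as 0.
onnx::NodeProto MakeContribDequantizeNode() {
  onnx::NodeProto dq;
  dq.set_op_type("DequantizeLinear");
  dq.set_domain("com.microsoft");
  dq.add_input("bias_quantized_int32");  // T1 = tensor(int32) is now accepted
  dq.add_input("bias_scale");            // T2 = tensor(float)
  // Input 2 (zero_point) is intentionally omitted: it is now optional.
  dq.add_output("bias_dequantized");
  return dq;
}
```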

### Testing
List of tested graph transformations:
- [x] QDQSelectorActionTransformer
  - qdq_transformer_test.cc
- [x] QDQS8ToU8Transformer
  - qdq_transformer_test.cc
- [x] DoubleQDQPairsRemover
  - qdq_transformer_test.cc
- [x] IdenticalChildrenConsolidation
  - qdq_transformer_test.cc
- [x] QDQPropagation
  - qdq_transformer_test.cc
- [x] QDQFinalCleanup
  - qdq_transformer_test.cc
- [x] ClipQuantFusion
  - qdq_transformer_test.cc
- [x] ReluQuantFusion
  - qdq_transformer_test.cc
- [x] EnsureUniqueDQForNodeUnit 
  - ensure_unique_dq_for_node_unit_test.cc
- [x] TransposeOptimizer 
  - transpose_optimizer_test.cc
- [x] CommonSubexpressionElimination
  - graph_transform_test.cc
- [x] ConstantFolding
  - graph_transform_test.cc

### Motivation and Context
We need to [support mixed 16-bit/8-bit precision QDQ models](#17015). This PR is the first step toward that goal: making the QDQ contrib ops work with our optimizations/transformations.

---------

Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
3 people authored Aug 25, 2023
1 parent 79c4ed9 commit 5a83a67
Showing 32 changed files with 1,656 additions and 917 deletions.
20 changes: 10 additions & 10 deletions docs/ContribOperators.md
@@ -1330,15 +1330,15 @@ This version of the operator has been available since version 1 of the 'com.micr
<dd>The axis along which same quantization parameters are applied. It's optional.If it's not specified, it means per-tensor quantization and input 'x_scale' and 'x_zero_point' must be scalars.If it's specified, it means per 'axis' quantization and input 'x_scale' and 'x_zero_point' must be 1-D tensors.</dd>
</dl>

#### Inputs
#### Inputs (2 - 3)

<dl>
<dt><tt>x</tt> : T1</dt>
<dd>N-D quantized Input tensor to be de-quantized.</dd>
<dt><tt>x_scale</tt> : T2</dt>
<dd>Scale for input 'x'. It could be a scalar or a 1-D tensor, which means a per-tensor or per-axis quantization.If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
<dt><tt>x_zero_point</tt> : T1</dt>
<dd>Zero point for input 'x'. It could be a scalar or a 1-D tensor, which means a per-tensor or per-axis quantization.If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
<dd>Scale for input 'x'. It can be a scalar, which means a per-tensor/layer dequantization, or a 1-D tensor for per-axis dequantization.</dd>
<dt><tt>x_zero_point</tt> (optional) : T1</dt>
<dd>Zero point for input 'x'. Shape must match x_scale. It's optional. Zero point is 0 when it's not specified.</dd>
</dl>

#### Outputs
@@ -1351,8 +1351,8 @@ This version of the operator has been available since version 1 of the 'com.micr
#### Type Constraints

<dl>
<dt><tt>T1</tt> : tensor(int8), tensor(uint8)</dt>
<dd>Constrain 'x' and 'x_zero_point' to 8-bit integer tensors.</dd>
<dt><tt>T1</tt> : tensor(int8), tensor(uint8), tensor(int32)</dt>
<dd>Constrain 'x' and 'x_zero_point' to 8-bit integer tensors or 32-bit signed integer tensors.</dd>
<dt><tt>T2</tt> : tensor(float16), tensor(float)</dt>
<dd>Constrain 'y', 'x_scale' to float tensors.</dd>
</dl>
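
For reference, the per-tensor case this spec describes reduces to `y = (x - x_zero_point) * x_scale`, with the zero point taken as 0 when the optional input is absent (and typically 0 for int32 inputs). A minimal sketch of that computation, not the actual kernel code:

```cpp
#include <cstdint>
#include <vector>

// Per-tensor dequantization sketch: y = (x - zero_point) * scale.
// The zero point defaults to 0 when the optional input is not provided.
template <typename QuantType>  // e.g. int8_t, uint8_t, int32_t
std::vector<float> DequantizePerTensor(const std::vector<QuantType>& x,
                                       float scale,
                                       QuantType zero_point = 0) {
  std::vector<float> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    y[i] = (static_cast<float>(x[i]) - static_cast<float>(zero_point)) * scale;
  }
  return y;
}
```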
@@ -4209,15 +4209,15 @@ This version of the operator has been available since version 1 of the 'com.micr
<dd>The axis along which same quantization parameters are applied. It's optional.If it's not specified, it means per-tensor quantization and input 'x_scale' and 'x_zero_point' must be scalars.If it's specified, it means per 'axis' quantization and input 'x_scale' and 'x_zero_point' must be 1-D tensors.</dd>
</dl>

#### Inputs
#### Inputs (2 - 3)

<dl>
<dt><tt>x</tt> : T1</dt>
<dd>N-D full precision Input tensor to be quantized.</dd>
<dt><tt>y_scale</tt> : T1</dt>
<dd>Scale for doing quantization to get 'y'. It could be a scalar or a 1-D tensor,which means a per-tensor or per-axis quantization. If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
<dt><tt>y_zero_point</tt> : T2</dt>
<dd>Zero point for doing quantization to get 'y'. It could be a scalar or a 1-D tensor, which means a per-tensoror per-axis quantization. If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
<dd>Scale for doing quantization to get 'y'. It can be a scalar, which means per-tensor/layer quantization, or a 1-D tensor for per-axis quantization.</dd>
<dt><tt>y_zero_point</tt> (optional) : T2</dt>
<dd>Zero point for doing quantization to get 'y'. Shape must match y_scale. Default is uint8 with zero point of 0 if it's not specified.</dd>
</dl>

#### Outputs
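Analogously, per-tensor quantization is `y = saturate(round(x / y_scale) + y_zero_point)`; when `y_zero_point` is omitted, the output defaults to uint8 with a zero point of 0. A rough sketch of that default path, not the actual kernel code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Per-tensor quantization sketch for the default case (no y_zero_point input):
// output type uint8, zero point 0, values saturated to [0, 255].
std::vector<uint8_t> QuantizePerTensorDefault(const std::vector<float>& x, float scale) {
  std::vector<uint8_t> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    float q = std::nearbyint(x[i] / scale);   // round to nearest (ties to even by default)
    q = std::min(255.0f, std::max(0.0f, q));  // saturate to the uint8 range
    y[i] = static_cast<uint8_t>(q);
  }
  return y;
}
```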
2 changes: 1 addition & 1 deletion docs/OperatorKernels.md
@@ -439,7 +439,7 @@ Do not modify directly.*
|CDist|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|1+|**T** = tensor(double), tensor(float)|
|ConvTransposeWithDynamicPads|*in* X:**T**<br> *in* W:**T**<br> *in* Pads:**tensor(int64)**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|CropAndResize|*in* X:**T1**<br> *in* rois:**T1**<br> *in* batch_indices:**T2**<br> *in* crop_size:**T2**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int32)|
|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(float)|
|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int32), tensor(int8), tensor(uint8)<br/> **T2** = tensor(float)|
|DynamicQuantizeLSTM|*in* X:**T**<br> *in* W:**T2**<br> *in* R:**T2**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *in* W_scale:**T**<br> *in* W_zero_point:**T2**<br> *in* R_scale:**T**<br> *in* R_zero_point:**T2**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|1+|**T** = tensor(float)<br/> **T1** = tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
|DynamicQuantizeMatMul|*in* A:**T1**<br> *in* B:**T2**<br> *in* b_scale:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
|EmbedLayerNormalization|*in* input_ids:**T1**<br> *in* segment_ids:**T1**<br> *in* word_embedding:**T**<br> *in* position_embedding:**T**<br> *in* segment_embedding:**T**<br> *in* gamma:**T**<br> *in* beta:**T**<br> *in* mask:**T1**<br> *in* position_ids:**T1**<br> *out* output:**T**<br> *out* mask_index:**T1**<br> *out* embedding_sum:**T**|1+|**T** = tensor(float)|
2 changes: 2 additions & 0 deletions onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
@@ -56,6 +56,7 @@ class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLine
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLinearAveragePool);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, DequantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, DequantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int32_t, DequantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QuantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QuantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearLeakyRelu);
@@ -190,6 +191,7 @@ Status RegisterQuantizationKernels(KernelRegistry& kernel_registry) {
BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLinearAveragePool)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, DequantizeLinear)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, DequantizeLinear)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int32_t, DequantizeLinear)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QuantizeLinear)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QuantizeLinear)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearLeakyRelu)>,
9 changes: 9 additions & 0 deletions onnxruntime/contrib_ops/cpu/quantization/quantize_ops.cc
@@ -25,6 +25,15 @@ ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
.TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
DequantizeLinear<int8_t>);

ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
DequantizeLinear,
1,
int32_t,
KernelDefBuilder()
.TypeConstraint("T1", DataTypeImpl::GetTensorType<int32_t>())
.TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
DequantizeLinear<int32_t>);

ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
QuantizeLinear,
1,
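The mixed 16-bit/8-bit work called out in the motivation would presumably reuse this same registration pattern. Purely as a hypothetical follow-up (not part of this PR), a uint16 variant might look like:

```cpp
// Hypothetical sketch only -- this PR does not add 16-bit kernels.
ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
    DequantizeLinear,
    1,
    uint16_t,
    KernelDefBuilder()
        .TypeConstraint("T1", DataTypeImpl::GetTensorType<uint16_t>())
        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
    DequantizeLinear<uint16_t>);
```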
38 changes: 18 additions & 20 deletions onnxruntime/core/graph/contrib_ops/quantization_defs.cc
@@ -152,23 +152,24 @@ ONNX_MS_OPERATOR_SET_SCHEMA(
AttributeProto::INT, false)
.Input(0, "x", "N-D full precision Input tensor to be quantized.", "T1")
.Input(1, "y_scale",
"Scale for doing quantization to get 'y'. It could be a scalar or a 1-D tensor,"
"which means a per-tensor or per-axis quantization. If it's a 1-D tensor, "
"its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.",
"Scale for doing quantization to get 'y'. It can be a scalar, which means per-tensor/layer "
"quantization, or a 1-D tensor for per-axis quantization.",
"T1")
.Input(2, "y_zero_point",
"Zero point for doing quantization to get 'y'. It could be a scalar or a 1-D tensor, which means a "
"per-tensor"
"or per-axis quantization. If it's a 1-D tensor, its number of elements should be equal to the "
"dimension value of 'axis' dimension of input 'x'.",
"T2")
"Zero point for doing quantization to get 'y'. Shape must match y_scale. Default is "
"uint8 with zero point of 0 if it's not specified.",
"T2", OpSchema::Optional)
.Output(0, "y", "N-D quantized output tensor. It has same shape as input 'x'.", "T2")
.TypeConstraint("T1", {"tensor(float16)", "tensor(float)"}, "Constrain 'x', 'y_scale' to float tensors.")
.TypeConstraint("T2", {"tensor(int8)", "tensor(uint8)"},
"Constrain 'y_zero_point' and 'y' to 8-bit integer tensors.")
.SetDoc(QuantizeLinear_ver1_doc)
.TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
propagateElemTypeFromInputToOutput(ctx, 2, 0);
if (ctx.getNumInputs() == 3 && ctx.getInputType(2) != nullptr) {
propagateElemTypeFromInputToOutput(ctx, 2, 0);
} else {
updateOutputElemType(ctx, 0, ONNX_NAMESPACE::TensorProto::UINT8);
}

if (!hasInputShape(ctx, 0)) return;

@@ -192,21 +193,18 @@ ONNX_MS_OPERATOR_SET_SCHEMA(DequantizeLinear, 1,
AttributeProto::INT, false)
.Input(0, "x", "N-D quantized Input tensor to be de-quantized.", "T1")
.Input(1, "x_scale",
"Scale for input 'x'. It could be a scalar or a 1-D tensor, which means a "
"per-tensor or per-axis quantization."
"If it's a 1-D tensor, its number of elements should be equal to the dimension "
"value of 'axis' dimension of input 'x'.",
"Scale for input 'x'. It can be a scalar, which means a per-tensor/layer "
"dequantization, or a 1-D tensor for per-axis dequantization.",
"T2")
.Input(2, "x_zero_point",
"Zero point for input 'x'. It could be a scalar or a 1-D tensor, which means a "
"per-tensor or per-axis quantization."
"If it's a 1-D tensor, its number of elements should be equal to the dimension "
"value of 'axis' dimension of input 'x'.",
"T1")
"Zero point for input 'x'. Shape must match x_scale. It's optional. "
"Zero point is 0 when it's not specified.",
"T1", OpSchema::Optional)
.Output(0, "y", "N-D full precision output tensor. It has same shape as input 'x'.",
"T2")
.TypeConstraint("T1", {"tensor(int8)", "tensor(uint8)"},
"Constrain 'x' and 'x_zero_point' to 8-bit integer tensors.")
.TypeConstraint("T1", {"tensor(int8)", "tensor(uint8)", "tensor(int32)"},
"Constrain 'x' and 'x_zero_point' to 8-bit integer tensors or 32-bit "
"signed integer tensors.")
.TypeConstraint("T2", {"tensor(float16)", "tensor(float)"},
"Constrain 'y', 'x_scale' to float tensors.")
.SetDoc(DequantizeLinear_ver1_doc)
@@ -324,7 +324,8 @@ bool IsNodeSupported(const Node& node) {
// would result in it having multiple consumers for its output, and it being used in multiple QDQ node groups.
return !node.ContainsSubgraph() &&
optimizer_utils::IsOperationDeterministic(node.Domain(), node.OpType()) &&
!(node.Domain() == kOnnxDomain && node.OpType() == "DequantizeLinear");
!(node.Domain() == kOnnxDomain && node.OpType() == "DequantizeLinear") &&
!(node.Domain() == kMSDomain && node.OpType() == "DequantizeLinear");
}
} // namespace

23 changes: 10 additions & 13 deletions onnxruntime/core/optimizer/constant_folding.cc
@@ -123,17 +123,6 @@ Status ConstantFolding::ApplyImpl(Graph& graph, bool& modified, int graph_level,
} else {
InitializedTensorSet constant_inputs;

// we currently constant fold using the CPU EP only.
// if the node is assigned to a different EP we can run it if it's an ONNX op as we have CPU based
// implementations for all ONNX ops. If the node/op is from a different op domain or if the CPU implementation
// does not support the specific input type(s) required by the node (currently we only support a subset of
// types in some CPU kernels) then we can't proceed with constant folding for the node.
auto ep_type = node->GetExecutionProviderType();
bool cpu_ep = ep_type == kCpuExecutionProvider;
if (!cpu_ep && node->Domain() != kOnnxDomain) {
continue;
}

// Check if constant folding can be applied on this node.
const auto can_constant_fold_node = [&](const Node& n, bool skip_inputs_constant_check = false) {
return graph_utils::IsSupportedProvider(n, GetCompatibleExecutionProviders()) &&
@@ -196,18 +185,26 @@ Status ConstantFolding::ApplyImpl(Graph& graph, bool& modified, int graph_level,
fetch_mlvalue_idxs.push_back(info.GetMLValueIndex(node_out->Name()));
}

auto& ep_type = node->GetExecutionProviderType();
const bool node_on_cpu_ep = ep_type == kCpuExecutionProvider;

// override the EP assigned to the node so that it will use the CPU kernel for Compute.
if (!cpu_ep) {
if (!node_on_cpu_ep) {
node->SetExecutionProviderType(kCpuExecutionProvider);
}

auto kernel = info.CreateKernel(node);

// undo the EP change to the value that was assigned at graph partitioning time
if (!cpu_ep) {
if (!node_on_cpu_ep) {
node->SetExecutionProviderType(ep_type);
}

// We currently constant fold using the CPU EP only.
// If we can't find a CPU kernel for this node, then we can't proceed with constant folding.
//
// TODO(adrianlizarraga): Support constant folding with other execution providers. For example, we may be able
// to use a CUDA kernel to constant fold operators with data types not supported by the CPU EP kernel.
if (kernel == nullptr) {
LOGS(logger, WARNING) << "Could not find a CPU kernel and hence "
<< "can't constant fold " << node->OpType() << " node '" << node->Name() << "'";
2 changes: 1 addition & 1 deletion onnxruntime/core/optimizer/double_qdq_pairs_remover.cc
@@ -51,7 +51,7 @@ bool DoubleQDQPairsRemover::IsNodeRemovable(
}

// Type is either "tensor(uint8)" or "tensor(int8)"
const auto self_zp_type = *self->InputDefs()[InputIndex::ZERO_POINT_ID]->Type();
const auto& self_zp_type = *self->InputDefs()[InputIndex::ZERO_POINT_ID]->Type();
// child should be a Q, and have only one child, have the same type as self, and cannot be a graph output
child_index = self->OutputEdgesBegin()->GetNode().Index();
const Node* child = graph.GetNode(child_index);
@@ -150,7 +150,7 @@ Status TransformLayoutForEP(Graph& graph, bool& modified, const IExecutionProvid

const auto max_node_idx = graph.MaxNodeIndex();
OptimizeResult result = onnx_transpose_optimization::Optimize(*api_graph, execution_provider.Type(),
PostLayoutTransformCostCheck);
PostLayoutTransformCostCheck, OrtExtendedHandlers());

if (result.error_msg) {
return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Layout/Transpose optimization for ", execution_provider.Type(),
@@ -3,6 +3,7 @@

#include "core/optimizer/initializer.h"
#include "core/optimizer/qdq_transformer/clip_quantizelinear.h"
#include "core/optimizer/qdq_transformer/qdq_util.h"
#include "core/optimizer/utils.h"
#include "core/graph/graph_utils.h"

@@ -73,7 +74,7 @@ bool ClipQuantFusion::SatisfyCondition(const Graph& graph, const Node& node, con

// if Clip is followed by QuantizeLinear, it can be fused into QuantizeLinear potentially
const auto& next_node = *node.OutputNodesBegin();
if (!graph_utils::IsSupportedOptypeVersionAndDomain(next_node, "QuantizeLinear", {10, 13, 19})) {
if (!QDQ::MatchQNode(next_node)) {
return false;
}

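For context, the switch from a hard-coded ONNX-domain opset check to QDQ::MatchQNode is what lets ClipQuantFusion also match the contrib QuantizeLinear op. A matcher of that kind presumably reduces to something like the sketch below (the real QDQ::MatchQNode in qdq_util.h may differ):

```cpp
// Illustrative sketch of a domain-aware QuantizeLinear matcher; see
// core/optimizer/qdq_transformer/qdq_util.h for the real helper.
bool LooksLikeQNode(const Node& node) {
  return graph_utils::IsSupportedOptypeVersionAndDomain(node, "QuantizeLinear", {10, 13, 19}) ||
         graph_utils::IsSupportedOptypeVersionAndDomain(node, "QuantizeLinear", {1}, kMSDomain);
}
```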
@@ -52,7 +52,9 @@ Status DuplicateDQForOutputEdge(const graph_utils::GraphEdge& original_dq_output
QDQ::DQOpName,
MakeString("Added by ", kTransformerName),
dq_inputs,
{&new_dq_output_nodearg});
{&new_dq_output_nodearg},
nullptr, // attributes
original_dq_node.Domain());

// set up edges
// remove DQ -> Y