diff --git a/docs/ContribOperators.md b/docs/ContribOperators.md
index 5bd1a89c0dea1..95dc8c3cde46c 100644
--- a/docs/ContribOperators.md
+++ b/docs/ContribOperators.md
@@ -1351,8 +1351,8 @@ This version of the operator has been available since version 1 of the 'com.micr
#### Type Constraints
-- T1 : tensor(int8), tensor(uint8), tensor(int32)
-- Constrain 'x' and 'x_zero_point' to 8-bit integer tensors or 32-bit signed integer tensors.
+- T1 : tensor(int8), tensor(uint8), tensor(int16), tensor(uint16), tensor(int32)
+- Constrain 'x' and 'x_zero_point' to 8-bit integer tensors, 16-bit integer tensors, or 32-bit signed integer tensors.
- T2 : tensor(float16), tensor(float)
- Constrain 'y', 'x_scale' to float tensors.
@@ -4194,8 +4194,9 @@ This version of the operator has been available since version 1 of the 'com.micr
### **com.microsoft.QuantizeLinear**
The linear quantization operator. It consumes a full precision data, a scale, a zero point to compute the low precision / quantized tensor.
- The quantization formula is y = saturate ((x / y_scale) + y_zero_point).For saturation, it saturates to [0, 255] if it's uint8, or [-128, 127] if it's int8.
- For (x / y_scale), it's rounding to nearest ties to even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
+ The quantization formula is y = saturate ((x / y_scale) + y_zero_point). For saturation, it saturates to [0, 255] if it's uint8, [-128, 127] if it's int8,
+ [0, 65,535] if it's uint16, and [-32,768, 32,767] if it's int16. For (x / y_scale), it's rounding to nearest ties to even.
+ Refer to https://en.wikipedia.org/wiki/Rounding for details.
Scale and zero point must have same shape. They must be either scalar (per tensor) or 1-D tensor (per 'axis').
#### Version
@@ -4232,8 +4233,8 @@ This version of the operator has been available since version 1 of the 'com.micr
- T1 : tensor(float16), tensor(float)
- Constrain 'x', 'y_scale' to float tensors.
-- T2 : tensor(int8), tensor(uint8)
-- Constrain 'y_zero_point' and 'y' to 8-bit integer tensors.
+- T2 : tensor(int8), tensor(uint8), tensor(int16), tensor(uint16)
+- Constrain 'y_zero_point' and 'y' to 8-bit and 16-bit integer tensors.
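For readers, a minimal sketch of the documented mapping, assuming the default round-to-nearest-even floating-point environment; `QuantizeLinearRef` is a hypothetical helper for illustration, not the ORT/MLAS code path:

```cpp
#include <algorithm>
#include <cfenv>
#include <cmath>
#include <cstdint>
#include <limits>

// y = saturate(round_half_to_even(x / y_scale) + y_zero_point)
template <typename QuantType>
QuantType QuantizeLinearRef(float x, float y_scale, QuantType y_zero_point) {
  std::fesetround(FE_TONEAREST);                    // rounding to nearest, ties to even
  const float rounded = std::nearbyintf(x / y_scale);
  const float shifted = rounded + static_cast<float>(y_zero_point);
  const float lo = static_cast<float>(std::numeric_limits<QuantType>::min());  // 0 / -128 / -32768
  const float hi = static_cast<float>(std::numeric_limits<QuantType>::max());  // 255 / 127 / 65535 / 32767
  return static_cast<QuantType>(std::clamp(shifted, lo, hi));                  // saturate
}

// Example: QuantizeLinearRef<uint16_t>(1.5f, 0.001f, uint16_t{0}) == 1500;
//          QuantizeLinearRef<int16_t>(-100.0f, 0.001f, int16_t{0}) saturates to -32768.
```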
diff --git a/docs/OperatorKernels.md b/docs/OperatorKernels.md
index d46f3ed9bd262..33c187a28b62e 100644
--- a/docs/OperatorKernels.md
+++ b/docs/OperatorKernels.md
@@ -439,7 +439,7 @@ Do not modify directly.*
|CDist|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|1+|**T** = tensor(double), tensor(float)|
|ConvTransposeWithDynamicPads|*in* X:**T**<br> *in* W:**T**<br> *in* Pads:**tensor(int64)**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|CropAndResize|*in* X:**T1**<br> *in* rois:**T1**<br> *in* batch_indices:**T2**<br> *in* crop_size:**T2**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br> **T2** = tensor(int32)|
-|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int32), tensor(int8), tensor(uint8)<br> **T2** = tensor(float)|
+|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint8)<br> **T2** = tensor(float)|
|DynamicQuantizeLSTM|*in* X:**T**<br> *in* W:**T2**<br> *in* R:**T2**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *in* W_scale:**T**<br> *in* W_zero_point:**T2**<br> *in* R_scale:**T**<br> *in* R_zero_point:**T2**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|1+|**T** = tensor(float)<br> **T1** = tensor(int32)<br> **T2** = tensor(int8), tensor(uint8)|
|DynamicQuantizeMatMul|*in* A:**T1**<br> *in* B:**T2**<br> *in* b_scale:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br> **T2** = tensor(int8), tensor(uint8)|
|EmbedLayerNormalization|*in* input_ids:**T1**<br> *in* segment_ids:**T1**<br> *in* word_embedding:**T**<br> *in* position_embedding:**T**<br> *in* segment_embedding:**T**<br> *in* gamma:**T**<br> *in* beta:**T**<br> *in* mask:**T1**<br> *in* position_ids:**T1**<br> *out* output:**T**<br> *out* mask_index:**T1**<br> *out* embedding_sum:**T**|1+|**T** = tensor(float)|
@@ -472,7 +472,7 @@ Do not modify directly.*
|QLinearSigmoid|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* X_zero_point:**T**<br> *in* Y_scale:**tensor(float)**<br> *in* Y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QLinearSoftmax|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* x_zero_point:**T**<br> *in* y_scale:**tensor(float)**<br> *in* y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QLinearWhere|*in* condition:**B**<br> *in* X:**T**<br> *in* x_scale:**TF**<br> *in* x_zero_point:**T**<br> *in* Y:**T**<br> *in* y_scale:**TF**<br> *in* y_zero_point:**T**<br> *in* z_scale:**TF**<br> *in* z_zero_point:**T**<br> *out* Z:**T**|1+|**T** = tensor(int8), tensor(uint8)|
-|QuantizeLinear|*in* x:**T1**<br> *in* y_scale:**T1**<br> *in* y_zero_point:**T2**<br> *out* y:**T2**|1+|**T1** = tensor(float)<br> **T2** = tensor(int8), tensor(uint8)|
+|QuantizeLinear|*in* x:**T1**<br> *in* y_scale:**T1**<br> *in* y_zero_point:**T2**<br> *out* y:**T2**|1+|**T1** = tensor(float)<br> **T2** = tensor(int16), tensor(int8), tensor(uint16), tensor(uint8)|
|QuickGelu|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|Range|*in* start:**T**<br> *in* limit:**T**<br> *in* delta:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64)|
|SampleOp|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
diff --git a/onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc b/onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
index 660c8bd9e0624..0ec5088808656 100644
--- a/onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
+++ b/onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
@@ -56,9 +56,13 @@ class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLine
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLinearAveragePool);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, DequantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, DequantizeLinear);
+class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint16_t, DequantizeLinear);
+class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int16_t, DequantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int32_t, DequantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QuantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QuantizeLinear);
+class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint16_t, QuantizeLinear);
+class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int16_t, QuantizeLinear);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearLeakyRelu);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearLeakyRelu);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearSigmoid);
@@ -191,9 +195,13 @@ Status RegisterQuantizationKernels(KernelRegistry& kernel_registry) {
      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLinearAveragePool)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, DequantizeLinear)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, DequantizeLinear)>,
+      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint16_t, DequantizeLinear)>,
+      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int16_t, DequantizeLinear)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int32_t, DequantizeLinear)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QuantizeLinear)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QuantizeLinear)>,
+      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint16_t, QuantizeLinear)>,
+      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int16_t, QuantizeLinear)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearLeakyRelu)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearLeakyRelu)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearSigmoid)>,
diff --git a/onnxruntime/contrib_ops/cpu/quantization/quantize_ops.cc b/onnxruntime/contrib_ops/cpu/quantization/quantize_ops.cc
deleted file mode 100644
index 28a304bfc7f0e..0000000000000
--- a/onnxruntime/contrib_ops/cpu/quantization/quantize_ops.cc
+++ /dev/null
@@ -1,56 +0,0 @@
-// Copyright (c) Microsoft Corporation. All rights reserved.
-// Licensed under the MIT License.
-
-#include "core/providers/cpu/quantization/quantize_linear.h"
-#include "core/providers/common.h"
-
-namespace onnxruntime {
-namespace contrib {
-
-ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
- DequantizeLinear,
- 1,
- uint8_t,
- KernelDefBuilder()
-        .TypeConstraint("T1", DataTypeImpl::GetTensorType<uint8_t>())
-        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
-    DequantizeLinear<uint8_t>);
-
-ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
- DequantizeLinear,
- 1,
- int8_t,
- KernelDefBuilder()
-        .TypeConstraint("T1", DataTypeImpl::GetTensorType<int8_t>())
-        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
-    DequantizeLinear<int8_t>);
-
-ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
- DequantizeLinear,
- 1,
- int32_t,
- KernelDefBuilder()
-        .TypeConstraint("T1", DataTypeImpl::GetTensorType<int32_t>())
-        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
-    DequantizeLinear<int32_t>);
-
-ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
- QuantizeLinear,
- 1,
- uint8_t,
- KernelDefBuilder()
-        .TypeConstraint("T1", DataTypeImpl::GetTensorType<float>())
-        .TypeConstraint("T2", DataTypeImpl::GetTensorType<uint8_t>()),
-    QuantizeLinear<uint8_t>);
-
-ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
- QuantizeLinear,
- 1,
- int8_t,
- KernelDefBuilder()
-        .TypeConstraint("T1", DataTypeImpl::GetTensorType<float>())
-        .TypeConstraint("T2", DataTypeImpl::GetTensorType<int8_t>()),
-    QuantizeLinear<int8_t>);
-
-} // namespace contrib
-} // namespace onnxruntime
diff --git a/onnxruntime/core/graph/contrib_ops/quantization_defs.cc b/onnxruntime/core/graph/contrib_ops/quantization_defs.cc
index aa2ad9f1ff6b1..4313fae767fe5 100644
--- a/onnxruntime/core/graph/contrib_ops/quantization_defs.cc
+++ b/onnxruntime/core/graph/contrib_ops/quantization_defs.cc
@@ -136,8 +136,9 @@ Performs element-wise binary {name} on 8 bit data types (with Numpy-style broadc
static const char* QuantizeLinear_ver1_doc = R"DOC(
The linear quantization operator. It consumes a full precision data, a scale, a zero point to compute the low precision / quantized tensor.
-The quantization formula is y = saturate ((x / y_scale) + y_zero_point).For saturation, it saturates to [0, 255] if it's uint8, or [-128, 127] if it's int8.
-For (x / y_scale), it's rounding to nearest ties to even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
+The quantization formula is y = saturate ((x / y_scale) + y_zero_point). For saturation, it saturates to [0, 255] if it's uint8, [-128, 127] if it's int8,
+[0, 65,535] if it's uint16, and [-32,768, 32,767] if it's int16. For (x / y_scale), it's rounding to nearest ties to even.
+Refer to https://en.wikipedia.org/wiki/Rounding for details.
Scale and zero point must have same shape. They must be either scalar (per tensor) or 1-D tensor (per 'axis').)DOC";
ONNX_MS_OPERATOR_SET_SCHEMA(
@@ -161,8 +162,8 @@ ONNX_MS_OPERATOR_SET_SCHEMA(
"T2", OpSchema::Optional)
.Output(0, "y", "N-D quantized output tensor. It has same shape as input 'x'.", "T2")
.TypeConstraint("T1", {"tensor(float16)", "tensor(float)"}, "Constrain 'x', 'y_scale' to float tensors.")
- .TypeConstraint("T2", {"tensor(int8)", "tensor(uint8)"},
- "Constrain 'y_zero_point' and 'y' to 8-bit integer tensors.")
+ .TypeConstraint("T2", {"tensor(int8)", "tensor(uint8)", "tensor(int16)", "tensor(uint16)"},
+ "Constrain 'y_zero_point' and 'y' to 8-bit and 16-bit integer tensors.")
.SetDoc(QuantizeLinear_ver1_doc)
.TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
if (ctx.getNumInputs() == 3 && ctx.getInputType(2) != nullptr) {
@@ -202,9 +203,10 @@ ONNX_MS_OPERATOR_SET_SCHEMA(DequantizeLinear, 1,
"T1", OpSchema::Optional)
.Output(0, "y", "N-D full precision output tensor. It has same shape as input 'x'.",
"T2")
- .TypeConstraint("T1", {"tensor(int8)", "tensor(uint8)", "tensor(int32)"},
- "Constrain 'x' and 'x_zero_point' to 8-bit integer tensors or 32-bit "
- "signed integer tensors.")
+ .TypeConstraint("T1", {"tensor(int8)", "tensor(uint8)", "tensor(int16)",
+ "tensor(uint16)", "tensor(int32)"},
+ "Constrain 'x' and 'x_zero_point' to 8-bit integer tensors, "
+ "16-bit integer tensors, or 32-bit signed integer tensors.")
.TypeConstraint("T2", {"tensor(float16)", "tensor(float)"},
"Constrain 'y', 'x_scale' to float tensors.")
.SetDoc(DequantizeLinear_ver1_doc)
diff --git a/onnxruntime/core/mlas/lib/mlasi.h b/onnxruntime/core/mlas/lib/mlasi.h
index f517be185b3fa..b6ac4a1ca1d6c 100644
--- a/onnxruntime/core/mlas/lib/mlasi.h
+++ b/onnxruntime/core/mlas/lib/mlasi.h
@@ -633,6 +633,24 @@ void
int8_t ZeroPoint
);
+typedef
+void
+(MLASCALL MLAS_QUANTIZE_LINEAR_U16_KERNEL)(
+ const float* Input,
+ uint16_t* Output,
+ size_t N,
+ float Scale,
+ uint16_t ZeroPoint);
+
+typedef
+void
+(MLASCALL MLAS_QUANTIZE_LINEAR_S16_KERNEL)(
+ const float* Input,
+ int16_t* Output,
+ size_t N,
+ float Scale,
+ int16_t ZeroPoint);
+
template<typename OutputType>
struct MLAS_QUANT_KERNEL
{
@@ -749,6 +767,8 @@ extern "C" {
MLAS_QLINEAR_BINARY_OP_U8_KERNEL MlasQLinearAddU8Kernel;
MLAS_QUANTIZE_LINEAR_S8_KERNEL MlasQuantizeLinearS8Kernel;
MLAS_QUANTIZE_LINEAR_U8_KERNEL MlasQuantizeLinearU8Kernel;
+ MLAS_QUANTIZE_LINEAR_S16_KERNEL MlasQuantizeLinearS16Kernel;
+ MLAS_QUANTIZE_LINEAR_U16_KERNEL MlasQuantizeLinearU16Kernel;
#if defined(MLAS_TARGET_AMD64)
MLAS_COMPUTE_UNARY_FLOAT_KERNEL MlasErfKernelFma3;
MLAS_COMPUTE_UNARY_FLOAT_KERNEL MlasComputeExpF32KernelFma3;
@@ -959,6 +979,8 @@ struct MLAS_PLATFORM {
const MLAS_GEMM_QUANT_DISPATCH* GemmU8X8Dispatch;
MLAS_QUANTIZE_LINEAR_S8_KERNEL* QuantizeLinearS8Kernel;
MLAS_QUANTIZE_LINEAR_U8_KERNEL* QuantizeLinearU8Kernel;
+ MLAS_QUANTIZE_LINEAR_S16_KERNEL* QuantizeLinearS16Kernel;
+ MLAS_QUANTIZE_LINEAR_U16_KERNEL* QuantizeLinearU16Kernel;
#endif
#if defined(MLAS_TARGET_AMD64)
MLAS_SGEMM_KERNEL_M1_ROUTINE* KernelM1Routine;
@@ -986,6 +1008,8 @@ struct MLAS_PLATFORM {
MLAS_REDUCE_MINIMUM_MAXIMUM_FLOAT_KERNEL* ReduceMinimumMaximumF32Kernel;
MLAS_QUANTIZE_LINEAR_S8_KERNEL* QuantizeLinearS8Kernel;
MLAS_QUANTIZE_LINEAR_U8_KERNEL* QuantizeLinearU8Kernel;
+ MLAS_QUANTIZE_LINEAR_S16_KERNEL* QuantizeLinearS16Kernel;
+ MLAS_QUANTIZE_LINEAR_U16_KERNEL* QuantizeLinearU16Kernel;
uint32_t NchwcBlockSize;
uint32_t PreferredBufferAlignment;
int32_t MaximumThreadCount;
diff --git a/onnxruntime/core/mlas/lib/platform.cpp b/onnxruntime/core/mlas/lib/platform.cpp
index 86b7450a7c4e5..7e2b117d6f249 100644
--- a/onnxruntime/core/mlas/lib/platform.cpp
+++ b/onnxruntime/core/mlas/lib/platform.cpp
@@ -230,6 +230,8 @@ Return Value:
this->QLinearAddU8Kernel = MlasQLinearAddU8Kernel;
this->QuantizeLinearS8Kernel = MlasQuantizeLinearS8Kernel;
this->QuantizeLinearU8Kernel = MlasQuantizeLinearU8Kernel;
+        this->QuantizeLinearS16Kernel = MlasQuantizeLinearS16Kernel;
+        this->QuantizeLinearU16Kernel = MlasQuantizeLinearU16Kernel;
this->NchwcBlockSize = 8;
this->PreferredBufferAlignment = MLAS_DEFAULT_PREFERRED_BUFFER_ALIGNMENT;
@@ -475,6 +477,8 @@ Return Value:
this->GemmDoubleKernel = MlasDgemmKernel;
this->QuantizeLinearS8Kernel = MlasQuantizeLinearS8Kernel;
this->QuantizeLinearU8Kernel = MlasQuantizeLinearU8Kernel;
+ this->QuantizeLinearS16Kernel = MlasQuantizeLinearS16Kernel;
+ this->QuantizeLinearU16Kernel = MlasQuantizeLinearU16Kernel;
#if defined(__linux__)
unsigned long hwcap2 = getauxval(AT_HWCAP2);
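The typedefs in mlasi.h and the assignments above follow MLAS's usual dispatch pattern: the platform struct stores one function pointer per kernel, filled at startup, and call sites go through the pointer. A simplified sketch of that pattern with hypothetical names; the truncating reference kernel stands in for the real rounding/saturating implementation:

```cpp
#include <cstddef>
#include <cstdint>

// Function type mirroring the new MLAS_QUANTIZE_LINEAR_U16_KERNEL typedef.
typedef void(QuantizeLinearU16KernelFn)(const float* Input, uint16_t* Output,
                                        size_t N, float Scale, uint16_t ZeroPoint);

// Portable stand-in kernel (truncating; the real kernels round to nearest-even).
static void QuantizeLinearU16Reference(const float* Input, uint16_t* Output,
                                       size_t N, float Scale, uint16_t ZeroPoint) {
  for (size_t i = 0; i < N; ++i) {
    float v = Input[i] / Scale + static_cast<float>(ZeroPoint);
    if (v < 0.0f) v = 0.0f;          // saturate to the uint16 range
    if (v > 65535.0f) v = 65535.0f;
    Output[i] = static_cast<uint16_t>(v);
  }
}

// Per-CPU dispatch table: filled once at startup with the best kernel, then invoked
// through the pointer, analogous to GetMlasPlatform().QuantizeLinearU16Kernel(...).
struct PlatformTable {
  QuantizeLinearU16KernelFn* QuantizeLinearU16Kernel = &QuantizeLinearU16Reference;
};
```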
diff --git a/onnxruntime/core/mlas/lib/power/QuantizePower.cpp b/onnxruntime/core/mlas/lib/power/QuantizePower.cpp
index 0d38288c6d42c..830a3a6a492db 100644
--- a/onnxruntime/core/mlas/lib/power/QuantizePower.cpp
+++ b/onnxruntime/core/mlas/lib/power/QuantizePower.cpp
@@ -1,3 +1,4 @@
+#include <type_traits>
#include "mlasi.h"
 #include <altivec.h>
@@ -82,8 +83,15 @@ Return Value:
auto ShortVector0 = vec_pack(IntegerVector0, IntegerVector1);
auto ShortVector1 = vec_pack(IntegerVector2, IntegerVector3);
- auto CharVector = vec_pack(ShortVector0, ShortVector1);
- vec_xst(CharVector, 0, (int8_t *) Output);
+
+    if constexpr (std::is_same_v<OutputType, uint8_t> || std::is_same_v<OutputType, int8_t>) {
+ auto CharVector = vec_pack(ShortVector0, ShortVector1);
+ vec_xst(CharVector, 0, Output);
+ } else {
+        static_assert(std::is_same_v<OutputType, uint16_t> || std::is_same_v<OutputType, int16_t>);
+ vec_xst(ShortVector0, 0, Output);
+ vec_xst(ShortVector1, 0, &Output[8]);
+ }
Output += 16;
Input += 16;
@@ -124,3 +132,30 @@ MlasQuantizeLinearS8Kernel(
{
    MlasQuantizeLinearKernel<int8_t>(Input, Output, N, Scale, ZeroPoint);
}
+
+void
+MLASCALL
+MlasQuantizeLinearU16Kernel(
+ const float* Input,
+ uint16_t* Output,
+ size_t N,
+ float Scale,
+ uint16_t ZeroPoint
+ )
+{
+    MlasQuantizeLinearKernel<uint16_t>(Input, Output, N, Scale, ZeroPoint);
+}
+
+void
+MLASCALL
+MlasQuantizeLinearS16Kernel(
+ const float* Input,
+ int16_t* Output,
+ size_t N,
+ float Scale,
+ int16_t ZeroPoint
+ )
+{
+    MlasQuantizeLinearKernel<int16_t>(Input, Output, N, Scale, ZeroPoint);
+}
+
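For 16-bit outputs the POWER path stops after the first `vec_pack` stage and stores both eight-element halves. A scalar sketch of the intended effect, assuming the values were already clamped by the surrounding kernel (hypothetical helper, not the VSX code):

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of one 16-element iteration: vec_pack keeps the low 16 bits of each
// int32 lane, and for 16-bit outputs the two 8-element halves are stored back to back.
static void StorePacked16(const int32_t Clamped[16], uint16_t* Output) {
  for (size_t i = 0; i < 16; ++i) {
    Output[i] = static_cast<uint16_t>(Clamped[i] & 0xFFFF);  // modulo narrowing, safe after clamping
  }
}
```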
diff --git a/onnxruntime/core/mlas/lib/quantize.cpp b/onnxruntime/core/mlas/lib/quantize.cpp
index c6e8af38c0020..133ad79594c55 100644
--- a/onnxruntime/core/mlas/lib/quantize.cpp
+++ b/onnxruntime/core/mlas/lib/quantize.cpp
@@ -21,6 +21,7 @@ Module Name:
#include "mlasi.h"
#if defined(MLAS_NEON64_INTRINSICS) || defined(MLAS_SSE2_INTRINSICS)
+#include <type_traits>
//
// QuantizeLinear implementation using NEON or SSE2 intrinsics.
@@ -79,6 +80,20 @@ MlasQuantizeLinearPackBytes(
MLAS_INT32X4 IntegerVector
);
+template<typename OutputType>
+void
+MlasQuantizeLinearStore4PackedValues(
+ MLAS_INT32X4 IntegerVector,
+ OutputType* Output
+ );
+
+template<typename OutputType>
+void
+MlasQuantizeLinearStoreSingleValue(
+ MLAS_INT32X4 IntegerVector,
+ OutputType* Output
+ );
+
#if defined(MLAS_NEON64_INTRINSICS)
template<>
@@ -100,6 +115,104 @@ MlasQuantizeLinearPackBytes(
return vreinterpretq_s32_u8(ByteVector);
}
+template<>
+MLAS_INT32X4
+MlasQuantizeLinearPackBytes<uint16_t>(
+ MLAS_INT32X4 IntegerVector
+ )
+{
+ //
+ // Swizzle the least significant u16 from each int32_t element to the
+ // bottom eight bytes of the vector register.
+ //
+
+ uint16x8_t WordVector = vreinterpretq_u16_s32(IntegerVector);
+ WordVector = vuzp1q_u16(WordVector, WordVector);
+ return vreinterpretq_s32_u16(WordVector);
+}
+
+template<>
+MLAS_INT32X4
+MlasQuantizeLinearPackBytes<int16_t>(
+ MLAS_INT32X4 IntegerVector
+ )
+{
+ //
+ // Swizzle the least significant u16 from each int32_t element to the
+ // bottom eight bytes of the vector register.
+ //
+
+ int16x8_t WordVector = vreinterpretq_s16_s32(IntegerVector);
+ WordVector = vuzp1q_s16(WordVector, WordVector);
+ return vreinterpretq_s32_s16(WordVector);
+}
+
+template<typename OutputType>
+MLAS_FORCEINLINE
+void
+MlasQuantizeLinearStore4PackedValues(
+ MLAS_INT32X4 IntegerVector,
+ OutputType* Output
+ )
+{
+ // Copies the lower 4 packed elements of the vector into memory (Output).
+
+    if constexpr (std::is_same_v<OutputType, uint8_t> || std::is_same_v<OutputType, int8_t>) {
+        vst1q_lane_s32(reinterpret_cast<int32_t*>(Output), IntegerVector, 0);
+    } else {
+        static_assert(std::is_same_v<OutputType, uint16_t> || std::is_same_v<OutputType, int16_t>);
+        vst1q_lane_s64(reinterpret_cast<int64_t*>(Output), vreinterpretq_s64_s32(IntegerVector), 0);
+ }
+}
+
+template <>
+MLAS_FORCEINLINE
+void
+MlasQuantizeLinearStoreSingleValue(
+ MLAS_INT32X4 IntegerVector,
+ uint8_t* Output
+ )
+{
+ // Copies the lower 8-bit element of the vector into memory (Output).
+ vst1q_lane_u8(Output, vreinterpretq_u8_s32(IntegerVector), 0);
+}
+
+template <>
+MLAS_FORCEINLINE
+void
+MlasQuantizeLinearStoreSingleValue(
+ MLAS_INT32X4 IntegerVector,
+ int8_t* Output
+ )
+{
+ // Copies the lower 8-bit element of the vector into memory (Output).
+ vst1q_lane_s8(Output, vreinterpretq_s8_s32(IntegerVector), 0);
+}
+
+template <>
+MLAS_FORCEINLINE
+void
+MlasQuantizeLinearStoreSingleValue(
+ MLAS_INT32X4 IntegerVector,
+ uint16_t* Output
+ )
+{
+ // Copies the lower 16-bit element of the vector into memory (Output).
+ vst1q_lane_u16(Output, vreinterpretq_u16_s32(IntegerVector), 0);
+}
+
+template <>
+MLAS_FORCEINLINE
+void
+MlasQuantizeLinearStoreSingleValue(
+ MLAS_INT32X4 IntegerVector,
+ int16_t* Output
+ )
+{
+ // Copies the lower 16-bit element of the vector into memory (Output).
+ vst1q_lane_s16(Output, vreinterpretq_s16_s32(IntegerVector), 0);
+}
+
#else
template<>
@@ -128,6 +241,86 @@ MlasQuantizeLinearPackBytes(
return IntegerVector;
}
+template<>
+MLAS_FORCEINLINE
+MLAS_INT32X4
+MlasQuantizeLinearPackBytes<uint16_t>(
+ MLAS_INT32X4 IntegerVector
+ )
+{
+#if defined(MLAS_SSE41_INTRINSICS)
+ IntegerVector = _mm_packus_epi32(IntegerVector, IntegerVector); // 16-bit values packed in lower 8 bytes.
+#else
+ // Cannot use _mm_packus_epi32 because that was not available until SSE4.1.
+ // Instead, emulate by sign-extending the first 16-bits of each packed 32-bit element.
+ // Afterwards, can use _mm_packs_epi32, which is available on SSE2.
+ // See: https://stackoverflow.com/a/11028244
+
+ IntegerVector = _mm_slli_epi32(IntegerVector, 16);
+ IntegerVector = _mm_srai_epi32(IntegerVector, 16); // Sign-extend: undo left shift with right arithmetic shift
+ IntegerVector = _mm_packs_epi32(IntegerVector, IntegerVector); // 16-bit values packed in lower 8 bytes.
+#endif // defined(MLAS_SSE41_INTRINSICS)
+
+ return IntegerVector;
+}
+
+template<>
+MLAS_FORCEINLINE
+MLAS_INT32X4
+MlasQuantizeLinearPackBytes<int16_t>(
+ MLAS_INT32X4 IntegerVector
+ )
+{
+ IntegerVector = _mm_packs_epi32(IntegerVector, IntegerVector); // 16-bit values packed in lower 8 bytes.
+
+ return IntegerVector;
+}
+
+template<typename OutputType>
+MLAS_FORCEINLINE
+void
+MlasQuantizeLinearStore4PackedValues(
+ MLAS_INT32X4 IntegerVector,
+ OutputType* Output
+ )
+{
+ // Copies the lower 4 packed elements of the vector into memory (Output).
+
+    if constexpr (std::is_same_v<OutputType, uint8_t> || std::is_same_v<OutputType, int8_t>) {
+        *(reinterpret_cast<int32_t*>(Output)) = _mm_cvtsi128_si32(IntegerVector);
+    } else {
+        static_assert(std::is_same_v<OutputType, uint16_t> || std::is_same_v<OutputType, int16_t>);
+
+#if defined(MLAS_TARGET_IX86)
+        // x86 does not support _mm_cvtsi128_si64, so use _mm_maskmoveu_si128 instead.
+        constexpr uint32_t bytes_high_bit = 0x80808080;
+        const __m128i first_8_bytes_mask = _mm_set_epi32(0, 0, bytes_high_bit, bytes_high_bit);
+        _mm_maskmoveu_si128(IntegerVector, first_8_bytes_mask, reinterpret_cast<char*>(Output));
+#else
+        *(reinterpret_cast<int64_t*>(Output)) = _mm_cvtsi128_si64(IntegerVector);
+#endif // defined(MLAS_TARGET_IX86)
+ }
+}
+
+template<typename OutputType>
+MLAS_FORCEINLINE
+void
+MlasQuantizeLinearStoreSingleValue(
+ MLAS_INT32X4 IntegerVector,
+ OutputType* Output
+ )
+{
+    static_assert(std::is_same_v<OutputType, uint8_t> ||
+                  std::is_same_v<OutputType, int8_t> ||
+                  std::is_same_v<OutputType, uint16_t> ||
+                  std::is_same_v<OutputType, int16_t>);
+
+ // Copies the lower element of the vector into memory (Output).
+ // Expects that the 32-bit element in lane 0 is already within the valid numerical
+ // range of the OutputType.
+    *Output = static_cast<OutputType>(_mm_cvtsi128_si32(IntegerVector));
+}
+
#endif
template<typename OutputType>
@@ -180,12 +373,7 @@ Return Value:
MinimumValueVector, MaximumValueVector, ZeroPointVector);
IntegerVector = MlasQuantizeLinearPackBytes(IntegerVector);
-
-#if defined(MLAS_NEON64_INTRINSICS)
- vst1q_lane_s32((int32_t*)Output, IntegerVector, 0);
-#else
- *((int32_t*)Output) = _mm_cvtsi128_si32(IntegerVector);
-#endif
+ MlasQuantizeLinearStore4PackedValues(IntegerVector, Output);
Input += 4;
Output += 4;
@@ -202,11 +390,7 @@ Return Value:
auto IntegerVector = MlasQuantizeLinearVector(FloatVector, ScaleVector,
MinimumValueVector, MaximumValueVector, ZeroPointVector);
-#if defined(MLAS_NEON64_INTRINSICS)
- vst1q_lane_u8((uint8_t*)Output + n, vreinterpretq_u8_s32(IntegerVector), 0);
-#else
- *((uint8_t*)Output + n) = (uint8_t)_mm_cvtsi128_si32(IntegerVector);
-#endif
+ MlasQuantizeLinearStoreSingleValue(IntegerVector, &Output[n]);
}
}
@@ -236,6 +420,32 @@ MlasQuantizeLinearU8Kernel(
MlasQuantizeLinearKernel(Input, Output, N, Scale, ZeroPoint);
}
+void
+MLASCALL
+MlasQuantizeLinearU16Kernel(
+ const float* Input,
+ uint16_t* Output,
+ size_t N,
+ float Scale,
+ uint16_t ZeroPoint
+)
+{
+    MlasQuantizeLinearKernel<uint16_t>(Input, Output, N, Scale, ZeroPoint);
+}
+
+void
+MLASCALL
+MlasQuantizeLinearS16Kernel(
+ const float* Input,
+ int16_t* Output,
+ size_t N,
+ float Scale,
+ int16_t ZeroPoint
+)
+{
+    MlasQuantizeLinearKernel<int16_t>(Input, Output, N, Scale, ZeroPoint);
+}
+
template<>
void
MLASCALL
@@ -274,6 +484,44 @@ MlasQuantizeLinear(
Input, Output, N, Scale, ZeroPoint);
}
+template<>
+void
+MLASCALL
+MlasQuantizeLinear<uint16_t>(
+ const float* Input,
+ uint16_t* Output,
+ size_t N,
+ float Scale,
+ uint16_t ZeroPoint
+ )
+{
+#if defined(MLAS_TARGET_AMD64)
+ GetMlasPlatform().QuantizeLinearU16Kernel(
+#else
+ MlasQuantizeLinearU16Kernel(
+#endif
+ Input, Output, N, Scale, ZeroPoint);
+}
+
+template<>
+void
+MLASCALL
+MlasQuantizeLinear<int16_t>(
+ const float* Input,
+ int16_t* Output,
+ size_t N,
+ float Scale,
+ int16_t ZeroPoint
+ )
+{
+#if defined(MLAS_TARGET_AMD64)
+ GetMlasPlatform().QuantizeLinearS16Kernel(
+#else
+ MlasQuantizeLinearS16Kernel(
+#endif
+ Input, Output, N, Scale, ZeroPoint);
+}
+
#else
#if defined(MLAS_TARGET_POWER)
@@ -306,6 +554,34 @@ MlasQuantizeLinear(
GetMlasPlatform().QuantizeLinearU8Kernel(Input, Output, N, Scale, ZeroPoint);
}
+template<>
+void
+MLASCALL
+MlasQuantizeLinear<int16_t>(
+ const float* Input,
+ int16_t* Output,
+ size_t N,
+ float Scale,
+ int16_t ZeroPoint
+ )
+{
+ GetMlasPlatform().QuantizeLinearS16Kernel(Input, Output, N, Scale, ZeroPoint);
+}
+
+template<>
+void
+MLASCALL
+MlasQuantizeLinear<uint16_t>(
+ const float* Input,
+ uint16_t* Output,
+ size_t N,
+ float Scale,
+ uint16_t ZeroPoint
+ )
+{
+ GetMlasPlatform().QuantizeLinearU16Kernel(Input, Output, N, Scale, ZeroPoint);
+}
+
#endif
//
@@ -381,6 +657,29 @@ MlasQuantizeLinear(
float Scale,
uint8_t ZeroPoint
);
+
+template
+void
+MLASCALL
+MlasQuantizeLinear<int16_t>(
+ const float* Input,
+ int16_t* Output,
+ size_t N,
+ float Scale,
+ int16_t ZeroPoint
+ );
+
+template
+void
+MLASCALL
+MlasQuantizeLinear<uint16_t>(
+ const float* Input,
+ uint16_t* Output,
+ size_t N,
+ float Scale,
+ uint16_t ZeroPoint
+ );
+
#endif
#endif
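The SSE2 fallback in the new `uint16_t` pack specialization leans on a non-obvious identity: for values already clamped to [0, 65535], sign-extending the low 16 bits and then using the signed pack stores the same bit pattern that `_mm_packus_epi32` would have produced. A scalar sketch of that reasoning (hypothetical helper, per-lane equivalent only):

```cpp
#include <cstdint>

static uint16_t PackU16ViaSignedPath(int32_t clamped /* already clamped to [0, 65535] */) {
  // The vector code shifts each lane left by 16 and arithmetic-shifts it back, i.e. it
  // sign-extends the low 16 bits: 40000 becomes -25536 (bit pattern 0xFFFF9C40).
  const int16_t sign_extended = static_cast<int16_t>(clamped & 0xFFFF);
  // _mm_packs_epi32 saturates to [-32768, 32767]; a value that already fits in 16 bits is
  // never clipped, so the 16 bits written out equal the original unsigned value (0x9C40 == 40000).
  return static_cast<uint16_t>(sign_extended);
}
```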
diff --git a/onnxruntime/core/optimizer/double_qdq_pairs_remover.cc b/onnxruntime/core/optimizer/double_qdq_pairs_remover.cc
index b67f6d6ec0794..624679e7b1b4b 100644
--- a/onnxruntime/core/optimizer/double_qdq_pairs_remover.cc
+++ b/onnxruntime/core/optimizer/double_qdq_pairs_remover.cc
@@ -1,131 +1,37 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.
#include "core/optimizer/double_qdq_pairs_remover.h"
+#include <cassert>
#include "core/graph/graph_utils.h"
#include "core/optimizer/initializer.h"
+#include "core/optimizer/qdq_transformer/qdq_util.h"
namespace onnxruntime {
-Status DoubleQDQPairsRemover::ApplyImpl(
- Graph& graph,
- bool& modified,
- int /*graph_level*/,
- const logging::Logger& /*logger*/) const {
- const GraphViewer graph_viewer(graph);
- const auto& node_topology_list = graph_viewer.GetNodesInTopologicalOrder();
-
- for (const auto& self_index : node_topology_list) {
- NodeIndex parent_index = 0;
- NodeIndex child_index = 0;
- NodeIndex grandchild_index = 0;
- if (IsNodeRemovable(graph, self_index, parent_index, child_index, grandchild_index)) {
- graph.RemoveEdge(parent_index, self_index, 0, 0);
- graph.RemoveEdge(self_index, child_index, 0, 0);
- graph.RemoveEdge(child_index, grandchild_index, 0, 0);
- graph_utils::ReplaceNodeInput(*graph.GetNode(grandchild_index), 0, *graph.GetNode(self_index)->MutableInputDefs()[0]);
- graph.AddEdge(parent_index, grandchild_index, 0, 0);
- graph.RemoveNode(child_index);
- graph.RemoveNode(self_index);
- modified = true;
- }
- }
- return Status::OK();
-}
-
-bool DoubleQDQPairsRemover::IsNodeRemovable(
- Graph& graph,
- const NodeIndex& self_index,
- NodeIndex& parent_index,
- NodeIndex& child_index,
- NodeIndex& grandchild_index) {
- // Check if the self is a DQ, and have one parent and one child, and cannot be a graph output
- Node* self = graph.GetNode(self_index);
- if (self == nullptr ||
- self->OpType() != "DequantizeLinear" ||
- self->GetInputEdgesCount() != 1 ||
- self->GetOutputEdgesCount() != 1 ||
- self->InputDefs().size() != InputIndex::TOTAL_COUNT ||
- graph.NodeProducesGraphOutput(*self)) {
- return false;
- }
-
- // Type is either "tensor(uint8)" or "tensor(int8)"
- const auto& self_zp_type = *self->InputDefs()[InputIndex::ZERO_POINT_ID]->Type();
- // child should be a Q, and have only one child, have the same type as self, and cannot be a graph output
- child_index = self->OutputEdgesBegin()->GetNode().Index();
- const Node* child = graph.GetNode(child_index);
- if (child == nullptr ||
- child->OpType() != "QuantizeLinear" ||
- child->GetOutputEdgesCount() != 1 ||
- child->InputDefs().size() != InputIndex::TOTAL_COUNT ||
- *child->InputDefs()[InputIndex::ZERO_POINT_ID]->Type() != self_zp_type ||
- graph.NodeProducesGraphOutput(*child)) {
- return false;
- }
-
- // parent should be a Q, and have only one output, and cannot be a graph output
- parent_index = self->InputEdgesBegin()->GetNode().Index();
- Node* parent = graph.GetNode(parent_index);
- if (parent == nullptr ||
- parent->GetOutputEdgesCount() != 1 ||
- parent->OpType() != "QuantizeLinear" ||
- graph.NodeProducesGraphOutput(*parent)) {
- return false;
- }
-
- // grandchild should be a DQ
- grandchild_index = child->OutputEdgesBegin()->GetNode().Index();
- Node* grandchild = graph.GetNode(grandchild_index);
- if (grandchild == nullptr ||
- grandchild->OpType() != "DequantizeLinear") {
- return false;
- }
- const auto get_constant_initializer = [&graph](const std::string& initializer_name) {
- return graph.GetConstantInitializer(initializer_name, true);
- };
- if (!QDQ::IsQDQPairSupported(*parent, *self, get_constant_initializer, graph.ModelPath()) ||
- !QDQ::IsQDQPairSupported(*child, *grandchild, get_constant_initializer, graph.ModelPath())) {
- return false;
- }
- bool skip_reset = false;
- float new_scale = 0.0f;
- if (self_zp_type == "tensor(uint8)") {
- uint8_t new_zero_point = 0;
- if (!FindNewZeroPointAndScale(graph, *self, *child, new_scale, new_zero_point, skip_reset)) {
- return false;
- }
- if (skip_reset) {
- return true;
- }
- ApplyNewInputValue(graph, *grandchild, InputIndex::SCALE_ID, new_scale);
- ApplyNewInputValue(graph, *parent, InputIndex::SCALE_ID, new_scale);
- ApplyNewInputValue(graph, *grandchild, InputIndex::ZERO_POINT_ID, new_zero_point);
- ApplyNewInputValue(graph, *parent, InputIndex::ZERO_POINT_ID, new_zero_point);
- } else {
- int8_t new_zero_point = 0;
- if (!FindNewZeroPointAndScale(graph, *self, *child, new_scale, new_zero_point, skip_reset)) {
- return false;
- }
- if (skip_reset) {
- return true;
- }
- ApplyNewInputValue(graph, *grandchild, InputIndex::SCALE_ID, new_scale);
- ApplyNewInputValue(graph, *parent, InputIndex::SCALE_ID, new_scale);
- ApplyNewInputValue(graph, *grandchild, InputIndex::ZERO_POINT_ID, new_zero_point);
- ApplyNewInputValue(graph, *parent, InputIndex::ZERO_POINT_ID, new_zero_point);
- }
- return true;
+// Applies a new zero point or scale as the input for a Q/DQ node.
+template <typename T>
+static void ApplyNewInputValue(Graph& graph, Node& node, QDQ::InputIndex index, T value) {
+ const auto* input_tensor = graph_utils::GetConstantInitializer(graph, node.InputDefs()[index]->Name());
+ Initializer input_init{*input_tensor, graph.ModelPath()};
+ ONNX_NAMESPACE::TensorProto new_input_tensor(*input_tensor);
+  input_init.data<T>()[0] = value;
+ input_init.ToProto(new_input_tensor);
+ auto new_name = graph.GenerateNodeArgName("DoubleQDQRemoved_" + node.InputDefs()[index]->Name());
+ new_input_tensor.set_name(new_name);
+ NodeArg& new_input = graph_utils::AddInitializer(graph, new_input_tensor);
+ graph_utils::ReplaceNodeInput(node, index, new_input);
}
+// Returns a new zero point and scale value for the given Q/DQ nodes.
template <typename T>
-bool DoubleQDQPairsRemover::FindNewZeroPointAndScale(const Graph& graph, const Node& node1, const Node& node2,
- float& new_scale, T& new_zero_point, bool& skip_reset) {
+static bool FindNewZeroPointAndScale(const Graph& graph, const Node& node1, const Node& node2,
+ float& new_scale, T& new_zero_point, bool& skip_reset) {
// scale & zero point share same initializer, no need to reset the value
- const std::string& node1_scale_name = node1.InputDefs()[InputIndex::SCALE_ID]->Name();
- const std::string& node2_scale_name = node2.InputDefs()[InputIndex::SCALE_ID]->Name();
- const std::string& node1_zp_name = node1.InputDefs()[InputIndex::ZERO_POINT_ID]->Name();
- const std::string& node2_zp_name = node2.InputDefs()[InputIndex::ZERO_POINT_ID]->Name();
+ const std::string& node1_scale_name = node1.InputDefs()[QDQ::InputIndex::SCALE_ID]->Name();
+ const std::string& node2_scale_name = node2.InputDefs()[QDQ::InputIndex::SCALE_ID]->Name();
+ const std::string& node1_zp_name = node1.InputDefs()[QDQ::InputIndex::ZERO_POINT_ID]->Name();
+ const std::string& node2_zp_name = node2.InputDefs()[QDQ::InputIndex::ZERO_POINT_ID]->Name();
skip_reset = false;
if (node1_scale_name == node2_scale_name && node1_zp_name == node2_zp_name) {
skip_reset = true;
@@ -175,16 +81,141 @@ bool DoubleQDQPairsRemover::FindNewZeroPointAndScale(const Graph& graph, const N
return true;
}
-template <typename T>
-void DoubleQDQPairsRemover::ApplyNewInputValue(Graph& graph, Node& node, const InputIndex& index, T value) {
- const auto* input_tensor = graph_utils::GetConstantInitializer(graph, node.InputDefs()[index]->Name());
- Initializer input_init{*input_tensor, graph.ModelPath()};
- TensorProto new_input_tensor(*input_tensor);
-  input_init.data<T>()[0] = value;
- input_init.ToProto(new_input_tensor);
- auto new_name = graph.GenerateNodeArgName("DoubleQDQRemoved_" + node.InputDefs()[index]->Name());
- new_input_tensor.set_name(new_name);
- NodeArg& new_input = graph_utils::AddInitializer(graph, new_input_tensor);
- graph_utils::ReplaceNodeInput(node, index, new_input);
+// Recomputes the zero point and scale of the outer Q/DQ nodes (i.e., Q1 and DQ2). This is necessary because
+// the original two QDQ pairs may have different zero-points and scales. Ex: Q1 -> DQ1 -> Q2 -> DQ2, where
+// the first pair has (zp1, scale1) and the second pair has (zp2, scale2).
+// After removing the middle two nodes, the zero point and scale of the final (outer) ops must be recomputed
+// for correctness.
+template <typename ZeroPointType>
+static bool RecomputeOuterQDQZeroPointAndScale(Graph& graph, Node& q1, const Node& dq1, const Node& q2, Node& dq2) {
+ bool skip_reset = false;
+ float new_scale = 0.0f;
+ ZeroPointType new_zero_point = 0;
+ if (!FindNewZeroPointAndScale(graph, dq1, q2, new_scale, new_zero_point, skip_reset)) {
+ return false;
+ }
+ if (skip_reset) {
+ return true;
+ }
+ ApplyNewInputValue(graph, dq2, QDQ::InputIndex::SCALE_ID, new_scale);
+ ApplyNewInputValue(graph, q1, QDQ::InputIndex::SCALE_ID, new_scale);
+ ApplyNewInputValue(graph, dq2, QDQ::InputIndex::ZERO_POINT_ID, new_zero_point);
+ ApplyNewInputValue(graph, q1, QDQ::InputIndex::ZERO_POINT_ID, new_zero_point);
+
+ return true;
+}
+
+// Checks if the provided node index (dq1_index) is a part of a valid double QDQ pair sequence
+// (i.e., Q1 -> DQ1 -> Q2 -> DQ2) that can be reduced to the outer Q/DQ nodes (i.e., Q1 -> DQ2).
+// If so, the zero point and scale of the outer Q/DQ nodes are recomputed and the node indices of the other nodes
+// in the sequence (i.e., Q1, Q2, and DQ2) are returned via output parameters.
+static bool IsReducibleDoubleQDQSequence(Graph& graph, NodeIndex& q1_index, NodeIndex dq1_index,
+ NodeIndex& q2_index, NodeIndex& dq2_index) {
+ // Ensure that dq1 is a DQ operator, has one parent and one child, and is not a graph output
+ Node* dq1 = graph.GetNode(dq1_index);
+ if (dq1 == nullptr ||
+ dq1->OpType() != "DequantizeLinear" ||
+ dq1->GetInputEdgesCount() != 1 ||
+ dq1->GetOutputEdgesCount() != 1 ||
+ graph.NodeProducesGraphOutput(*dq1)) {
+ return false;
+ }
+
+ // Ensure that q2 is a Q operator, has only one child, and is not a graph output
+ q2_index = dq1->OutputEdgesBegin()->GetNode().Index();
+ const Node* q2 = graph.GetNode(q2_index);
+ if (q2 == nullptr ||
+ q2->OpType() != "QuantizeLinear" ||
+ q2->GetOutputEdgesCount() != 1 ||
+ graph.NodeProducesGraphOutput(*q2)) {
+ return false;
+ }
+
+ // Ensure that q1 is a Q operator, has only one output, and is not a graph output
+ q1_index = dq1->InputEdgesBegin()->GetNode().Index();
+ Node* q1 = graph.GetNode(q1_index);
+ if (q1 == nullptr ||
+ q1->GetOutputEdgesCount() != 1 ||
+ q1->OpType() != "QuantizeLinear" ||
+ graph.NodeProducesGraphOutput(*q1)) {
+ return false;
+ }
+
+ // Ensure the dq2 is a DQ operator.
+ dq2_index = q2->OutputEdgesBegin()->GetNode().Index();
+ Node* dq2 = graph.GetNode(dq2_index);
+ if (dq2 == nullptr ||
+ dq2->OpType() != "DequantizeLinear") {
+ return false;
+ }
+
+ const auto get_constant_initializer = [&graph](const std::string& initializer_name) {
+ return graph.GetConstantInitializer(initializer_name, true);
+ };
+
+ // Each QDQ pair (i.e., q1 -> dq1, q2 -> dq2) has to meet the following additional requirements:
+ // - Scalar/constant zero-point and scale.
+ // - The DQ and Q ops within a pair must have the same scale and zero-point.
+ // However, each pair is allowed to have different scales and zero-points.
+ //
+ // TODO: IsQDQPairSupported() requires an explicit zero-point input, but technically a default
+ // value of 0 could be fine.
+ if (!QDQ::IsQDQPairSupported(*q1, *dq1, get_constant_initializer, graph.ModelPath()) ||
+ !QDQ::IsQDQPairSupported(*q2, *dq2, get_constant_initializer, graph.ModelPath())) {
+ return false;
+ }
+
+ const auto& dq1_input_defs = dq1->InputDefs();
+ const ONNX_NAMESPACE::TensorProto* dq1_zp_tensor_proto = graph.GetConstantInitializer(
+ dq1_input_defs[QDQ::InputIndex::ZERO_POINT_ID]->Name(), true);
+
+ assert(dq1_zp_tensor_proto != nullptr); // IsQDQPairSupported should have checked that this exists.
+
+ auto dq1_zp_type = dq1_zp_tensor_proto->data_type();
+
+ if (dq1_zp_type == ONNX_NAMESPACE::TensorProto_DataType_UINT8) {
+    return RecomputeOuterQDQZeroPointAndScale<uint8_t>(graph, *q1, *dq1, *q2, *dq2);
+ }
+
+ if (dq1_zp_type == ONNX_NAMESPACE::TensorProto_DataType_INT8) {
+    return RecomputeOuterQDQZeroPointAndScale<int8_t>(graph, *q1, *dq1, *q2, *dq2);
+ }
+
+ if (dq1_zp_type == ONNX_NAMESPACE::TensorProto_DataType_UINT16) {
+    return RecomputeOuterQDQZeroPointAndScale<uint16_t>(graph, *q1, *dq1, *q2, *dq2);
+ }
+
+ if (dq1_zp_type == ONNX_NAMESPACE::TensorProto_DataType_INT16) {
+    return RecomputeOuterQDQZeroPointAndScale<int16_t>(graph, *q1, *dq1, *q2, *dq2);
+ }
+
+ return false; // Unsupported zero-point type
+}
+
+Status DoubleQDQPairsRemover::ApplyImpl(
+ Graph& graph,
+ bool& modified,
+ int /*graph_level*/,
+ const logging::Logger& /*logger*/) const {
+ const GraphViewer graph_viewer(graph);
+ const auto& node_topology_list = graph_viewer.GetNodesInTopologicalOrder();
+
+ for (const auto& dq1_index : node_topology_list) {
+ NodeIndex q1_index = 0;
+ NodeIndex q2_index = 0;
+ NodeIndex dq2_index = 0;
+ if (IsReducibleDoubleQDQSequence(graph, q1_index, dq1_index, q2_index, dq2_index)) {
+ graph.RemoveEdge(q1_index, dq1_index, 0, 0);
+ graph.RemoveEdge(dq1_index, q2_index, 0, 0);
+ graph.RemoveEdge(q2_index, dq2_index, 0, 0);
+ graph_utils::ReplaceNodeInput(*graph.GetNode(dq2_index), 0, *graph.GetNode(dq1_index)->MutableInputDefs()[0]);
+ graph.AddEdge(q1_index, dq2_index, 0, 0);
+ graph.RemoveNode(q2_index);
+ graph.RemoveNode(dq1_index);
+ modified = true;
+ }
+ }
+ return Status::OK();
}
+
} // namespace onnxruntime
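A toy numeric example (hypothetical uint8 scales and zero points, not the transformer's recomputation logic) of why the surviving Q1/DQ2 pair cannot simply keep its old parameters once the middle DQ1/Q2 nodes are removed:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

static uint8_t Quantize(float x, float scale, uint8_t zp) {
  return static_cast<uint8_t>(std::clamp(std::lround(x / scale) + zp, 0L, 255L));
}
static float Dequantize(uint8_t q, float scale, uint8_t zp) {
  return (static_cast<int>(q) - zp) * scale;
}

int main() {
  const float x = 1.3f;
  const float s1 = 0.02f, s2 = 0.01f;   // pair 1: (s1, zp1), pair 2: (s2, zp2)
  const uint8_t zp1 = 0, zp2 = 10;

  // Original sequence Q1 -> DQ1 -> Q2 -> DQ2.
  const float full = Dequantize(Quantize(Dequantize(Quantize(x, s1, zp1), s1, zp1), s2, zp2), s2, zp2);
  // Keeping Q1 with (s1, zp1) but feeding DQ2 with (s2, zp2) does NOT reproduce it,
  // which is why the outer pair's scale/zero-point must be recomputed.
  const float naive = Dequantize(Quantize(x, s1, zp1), s2, zp2);
  std::printf("full=%f naive=%f\n", full, naive);  // 1.30 vs 0.55 for these values
  return 0;
}
```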
diff --git a/onnxruntime/core/optimizer/double_qdq_pairs_remover.h b/onnxruntime/core/optimizer/double_qdq_pairs_remover.h
index c016f7181b7fe..1833b007674fd 100644
--- a/onnxruntime/core/optimizer/double_qdq_pairs_remover.h
+++ b/onnxruntime/core/optimizer/double_qdq_pairs_remover.h
@@ -3,19 +3,16 @@
#pragma once
-#include "core/common/common.h"
#include "core/optimizer/graph_transformer.h"
-#include "core/optimizer/qdq_transformer/qdq_util.h"
namespace onnxruntime {
-using ONNX_NAMESPACE::TensorProto;
-using ONNX_NAMESPACE::TensorProto_DataType;
-using QDQ::InputIndex;
-
/**
* @Class DoubleQDQPairsRemover
* @brief Remove one pair of Q-DQ from Double Q-DQ pairs.
+ * Specifically, this transformer converts the sequence Q1 -> DQ1 -> Q2 -> DQ2, where the first pair has (zp1, scale1)
+ * and the second pair has (zp2, scale2), into the sequence Q1 -> DQ2 by removing the middle two nodes. The zero-point
+ * and scale of the final QDQ pair is recomputed to preserve equality to the original sequence.
*/
class DoubleQDQPairsRemover : public GraphTransformer {
public:
@@ -27,28 +24,5 @@ class DoubleQDQPairsRemover : public GraphTransformer {
bool& modified,
int graph_level,
const logging::Logger& logger) const override;
-
- static bool IsNodeRemovable(
- Graph& graph,
- const NodeIndex& self_index,
- NodeIndex& parent_index,
- NodeIndex& child_index,
- NodeIndex& grandchild_index);
-
-  template <typename T>
- static bool FindNewZeroPointAndScale(
- const Graph& graph,
- const Node& node1,
- const Node& node2,
- float& new_scale,
- T& new_zero_point,
- bool& skip_reset);
-
-  template <typename T>
- static void ApplyNewInputValue(
- Graph& graph,
- Node& node,
- const InputIndex& index,
- T value);
};
} // namespace onnxruntime
diff --git a/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc b/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc
index d7039cb4b7cfc..0e383c3031ca6 100644
--- a/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc
+++ b/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc
@@ -2,6 +2,7 @@
// Licensed under the MIT License.
#include "core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.h"
+#include <vector>
#include "core/mlas/inc/mlas.h"
#include "core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h"
@@ -32,7 +33,8 @@ void SplitQDQRules(SelectorActionRegistry& qdq_selector_action_registry) {
// create rules for ops that don't change the data
void DropQDQNodesRules(SelectorActionRegistry& qdq_selector_action_registry) {
// 3 nodes. DQ, target, Q. Merge into target and remove DQ and Q.
- const std::string action_name{"drop"};
+ const std::string drop_action_name{"drop"};
+ const std::string drop_action_no_int16_name{"drop_no_int16_support"};
NTO::NodeLocation dq{NTO::NodeType::kInput, 0};
NTO::NodeLocation q{NTO::NodeType::kOutput, 0};
@@ -42,22 +44,33 @@ void DropQDQNodesRules(SelectorActionRegistry& qdq_selector_action_registry) {
MoveToSlot(dq, ArgType::kInput, 0, ArgType::kInput, 0),
MoveToSlot(q, ArgType::kOutput, 0, ArgType::kOutput, 0)};
-  std::unique_ptr<Action> action = std::make_unique<MergeIntoTarget>(std::move(moves));
+  std::unique_ptr<Action> drop_action_no_int16 = std::make_unique<MergeIntoTarget>(
+      std::vector<NodeAndMoveInfo>(moves));  // Copy before std::move(moves)
+  std::unique_ptr<Action> drop_action = std::make_unique<MergeIntoTarget>(std::move(moves));
#if !defined(ORT_MINIMAL_BUILD)
-  std::unique_ptr<NodeSelector> selector = std::make_unique<QDQ::DropQDQNodesSelector>();
- qdq_selector_action_registry.RegisterSelectorAndAction(action_name,
+ // Use a separate selector + action that disallows 16-bit types for MaxPool and Resize.
+ // int16 MaxPool is not supported by the ONNX specification.
+ // int16 Resize is not supported by the ORT implementation (although allowed by ONNX).
+  std::unique_ptr<NodeSelector> selector_disallow_16bit = std::make_unique<QDQ::DropQDQNodesSelector>(false);
+ qdq_selector_action_registry.RegisterSelectorAndAction(drop_action_no_int16_name,
+ {{"MaxPool", {12}},
+ {"Resize", {}}},
+ std::move(selector_disallow_16bit),
+ std::move(drop_action_no_int16));
+
+  std::unique_ptr<NodeSelector> selector = std::make_unique<QDQ::DropQDQNodesSelector>(true);
+ qdq_selector_action_registry.RegisterSelectorAndAction(drop_action_name,
{{"Gather", {}},
{"Reshape", {}},
{"Transpose", {}},
- {"MaxPool", {12}},
- {"Resize", {}},
{"Squeeze", {}},
{"Unsqueeze", {}}},
std::move(selector),
- std::move(action));
+ std::move(drop_action));
#else
- qdq_selector_action_registry.RegisterAction(action_name, std::move(action));
+ qdq_selector_action_registry.RegisterAction(drop_action_no_int16_name, std::move(drop_action_no_int16));
+ qdq_selector_action_registry.RegisterAction(drop_action_name, std::move(drop_action));
#endif
}
@@ -74,6 +87,7 @@ void DropDQNodesRules(SelectorActionRegistry& qdq_selector_action_registry) {
std::unique_ptr action = std::make_unique(std::move(moves));
#if !defined(ORT_MINIMAL_BUILD)
+ // TODO: Enable 16-bit types in selector when ArgMax supports 16-bit integer input tensors.
std::unique_ptr selector = std::make_unique();
qdq_selector_action_registry.RegisterSelectorAndAction(action_name,
{{"ArgMax", {}}},
@@ -91,6 +105,7 @@ void UnaryOpQDQRules(SelectorActionRegistry& qdq_selector_action_registry) {
std::unique_ptr action = std::make_unique(kMSDomain);
#if !defined(ORT_MINIMAL_BUILD)
+ // TODO: Enable 16-bit types in selector when unary QLinear* ops support 16-bit.
std::unique_ptr selector = std::make_unique();
qdq_selector_action_registry.RegisterSelectorAndAction(action_name,
{{"AveragePool", {}},
@@ -112,6 +127,7 @@ void BinaryOpQDQRules(SelectorActionRegistry& qdq_selector_action_registry) {
std::unique_ptr action = std::make_unique(kMSDomain);
#if !defined(ORT_MINIMAL_BUILD)
+ // TODO: Enable 16-bit types in selector when binary QLinear* ops support 16-bit.
std::unique_ptr selector = std::make_unique();
qdq_selector_action_registry.RegisterSelectorAndAction(action_name,
{{"Add", {}},
@@ -131,6 +147,7 @@ void VariadicOpQDQRules(SelectorActionRegistry& qdq_selector_action_registry) {
std::unique_ptr action = std::make_unique(kMSDomain);
#if !defined(ORT_MINIMAL_BUILD)
+ // TODO: Enable 16-bit types in selector when QLinearConcat supports 16-bit.
std::unique_ptr selector = std::make_unique();
qdq_selector_action_registry.RegisterSelectorAndAction(action_name,
@@ -152,6 +169,7 @@ void ConvQDQRules(SelectorActionRegistry& qdq_selector_action_registry, bool is_
std::unique_ptr action = std::make_unique();
#if !defined(ORT_MINIMAL_BUILD)
+ // TODO: Enable 16-bit types in selector when QLinearConv supports 16-bit.
std::unique_ptr selector = std::make_unique(is_int8_allowed);
qdq_selector_action_registry.RegisterSelectorAndAction(action_name,
@@ -174,6 +192,7 @@ void MatMulQDQRules(SelectorActionRegistry& qdq_selector_action_registry, bool i
std::unique_ptr action = std::make_unique();
#if !defined(ORT_MINIMAL_BUILD)
+ // TODO: Enable 16-bit types in selector when QLinearMatMul and MatMulInteger support 16-bit.
std::unique_ptr selector = std::make_unique(is_int8_allowed);
qdq_selector_action_registry.RegisterSelectorAndAction(action_name,
{{"MatMul", {}}},
@@ -195,6 +214,7 @@ void GemmQDQRules(SelectorActionRegistry& qdq_selector_action_registry) {
std::unique_ptr action = std::make_unique();
#if !defined(ORT_MINIMAL_BUILD)
+ // TODO: Enable 16-bit types in selector when QGemm supports 16-bit.
std::unique_ptr selector = std::make_unique();
qdq_selector_action_registry.RegisterSelectorAndAction(action_name,
{{"Gemm", {}}},
@@ -215,6 +235,7 @@ void WhereQDQRules(SelectorActionRegistry& qdq_selector_action_registry) {
std::unique_ptr action = std::make_unique();
#if !defined(ORT_MINIMAL_BUILD)
+ // TODO: Enable 16-bit types in selector when QLinearWhere supports 16-bit.
std::unique_ptr selector = std::make_unique();
qdq_selector_action_registry.RegisterSelectorAndAction(action_name,
{{"Where", {}}},
diff --git a/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc b/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc
index 02a7fb733813c..16c7bd5fce960 100644
--- a/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc
+++ b/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc
@@ -14,6 +14,12 @@
namespace onnxruntime {
namespace QDQ {
namespace {
+
+constexpr bool Is16BitIntType(int32_t data_type) {
+ return (data_type == ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_INT16) ||
+ (data_type == ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_UINT16);
+}
+
// adjust for an optional input/output that has an entry but does not exist
int NumActualValues(const Node& node, bool input) {
const auto& defs = input ? node.InputDefs() : node.OutputDefs();
@@ -110,6 +116,17 @@ bool DropQDQNodeGroupSelector::Check(const GraphViewer& graph_viewer,
return false;
}
+ int32_t dt_input = dq_nodes[0]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
+ int32_t dt_output = q_nodes[0]->OutputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
+
+ if (dt_input != dt_output) {
+ return false;
+ }
+
+ if (!allow_16bit_ && Is16BitIntType(dt_input)) {
+ return false;
+ }
+
const Node& dq_node = *dq_nodes.front();
const Node& q_node = *q_nodes.front();
@@ -124,7 +141,7 @@ bool DropDQNodeGroupSelector::Check(const GraphViewer& graph_viewer,
const Node& node,
const std::vector& dq_nodes,
const std::vector& q_nodes) const {
- int num_dq_inputs = NumActualValues(node, true);
+ constexpr int num_dq_inputs = 1;
  if (num_dq_inputs != gsl::narrow_cast<int>(dq_nodes.size())) {
return false;
}
@@ -136,6 +153,12 @@ bool DropDQNodeGroupSelector::Check(const GraphViewer& graph_viewer,
(void)q_nodes;
const Node& dq_node = *dq_nodes.front();
+ const int32_t dt_input = dq_node.InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
+
+ // 16-bit int types must be explicitly allowed.
+ if (!allow_16bit_ && Is16BitIntType(dt_input)) {
+ return false;
+ }
auto get_const_initializer = [&graph_viewer](const std::string& initializer_name) {
return graph_viewer.GetConstantInitializer(initializer_name, true);
@@ -154,7 +177,16 @@ bool UnaryNodeGroupSelector::Check(const GraphViewer& graph_viewer, const Node&
int32_t dt_input = dq_nodes[0]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
int32_t dt_output = q_nodes[0]->OutputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
- return dt_input == dt_output;
+ if (dt_input != dt_output) {
+ return false;
+ }
+
+ // 16-bit int types must be explicitly allowed.
+ if (!allow_16bit_ && Is16BitIntType(dt_input)) {
+ return false;
+ }
+
+ return true;
}
bool BinaryNodeGroupSelector::Check(const GraphViewer& graph_viewer,
@@ -168,8 +200,18 @@ bool BinaryNodeGroupSelector::Check(const GraphViewer& graph_viewer,
int32_t dt_input_1 = dq_nodes[0]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
int32_t dt_input_2 = dq_nodes[1]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
int32_t dt_output = q_nodes[0]->OutputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
- return dt_input_1 == dt_input_2 &&
- dt_input_1 == dt_output;
+
+ // All input and output types must match.
+ if (dt_input_1 != dt_input_2 || dt_input_1 != dt_output) {
+ return false;
+ }
+
+ // 16-bit int types must be explicitly allowed.
+ if (!allow_16bit_ && Is16BitIntType(dt_input_1)) {
+ return false;
+ }
+
+ return true;
}
bool VariadicNodeGroupSelector::Check(const GraphViewer& graph_viewer,
@@ -194,7 +236,17 @@ bool VariadicNodeGroupSelector::Check(const GraphViewer& graph_viewer,
return false;
}
}
- return dt_input == dt_output;
+
+ if (dt_input != dt_output) {
+ return false;
+ }
+
+ // 16-bit int types must be explicitly allowed.
+ if (!allow_16bit_ && Is16BitIntType(dt_input)) {
+ return false;
+ }
+
+ return true;
}
void InputVariadicSelector::UpdateBuilder(NodesToOptimizeIndicesBuilder& builder) const {
@@ -227,12 +279,19 @@ bool ConvNodeGroupSelector::Check(const GraphViewer& graph_viewer,
}
}
- if (dq_nodes.size() < 3) { // no bias
- return true;
+ if (dq_nodes.size() == 3) { // has bias
+ int32_t dt_bias = dq_nodes[2]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
+ if (dt_bias != ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_INT32) {
+ return false;
+ }
}
- int32_t dt_bias = dq_nodes[2]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
- return dt_bias == ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_INT32;
+ // 16-bit int types must be explicitly allowed.
+ if (!allow_16bit_ && (Is16BitIntType(dt_input) || Is16BitIntType(dt_weight))) {
+ return false;
+ }
+
+ return true;
}
void ConvSelector::UpdateBuilder(NodesToOptimizeIndicesBuilder& builder) const {
@@ -256,6 +315,11 @@ bool MatMulNodeGroupSelector::Check(const GraphViewer& graph_viewer,
}
}
+ // 16-bit int types must be explicitly allowed.
+ if (!allow_16bit_ && (Is16BitIntType(dt_input) || Is16BitIntType(dt_weight))) {
+ return false;
+ }
+
// potential match for QLinearMatMul or MatMulIntegerToFloat
bool qlinear = !q_nodes.empty();
@@ -299,6 +363,11 @@ bool GemmNodeGroupSelector::Check(const GraphViewer& graph_viewer,
}
}
+ // 16-bit int types must be explicitly allowed.
+ if (!allow_16bit_ && (Is16BitIntType(dt_A) || Is16BitIntType(dt_B))) {
+ return false;
+ }
+
if (dq_nodes.size() < 3) { // no bias
return true;
}
@@ -326,8 +395,18 @@ bool WhereNodeGroupSelector::Check(const GraphViewer& graph_viewer, const Node&
const int32_t dt_input_1 = dq_nodes[0]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
const int32_t dt_input_2 = dq_nodes[1]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
const int32_t dt_output = q_nodes[0]->OutputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
- return dt_input_1 == dt_input_2 &&
- dt_input_1 == dt_output;
+
+ // All input and output types must match.
+ if (dt_input_1 != dt_input_2 || dt_input_1 != dt_output) {
+ return false;
+ }
+
+ // 16-bit int types must be explicitly allowed.
+ if (!allow_16bit_ && Is16BitIntType(dt_input_1)) {
+ return false;
+ }
+
+ return true;
}
bool PadNodeGroupSelector::Check(const GraphViewer& graph_viewer, const Node& node,
diff --git a/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h b/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h
index 58ebf81508962..d8fefdd8dc3d9 100644
--- a/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h
+++ b/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h
@@ -52,45 +52,75 @@ class NodeGroupSelector {
// Single DQ -> node that does not change data -> Q.
// Zero point and scale are constant scalars and must match
class DropQDQNodeGroupSelector : public NodeGroupSelector {
+ public:
+ explicit DropQDQNodeGroupSelector(bool allow_16bit = true) : allow_16bit_(allow_16bit) {}
+
+ private:
bool Check(const GraphViewer& graph_viewer, const Node& node,
const std::vector& dq_nodes,
const std::vector& q_nodes) const override;
+
+ bool allow_16bit_;
};
// Single DQ -> node.
class DropDQNodeGroupSelector : public NodeGroupSelector {
+ public:
+ explicit DropDQNodeGroupSelector(bool allow_16bit = true) : allow_16bit_(allow_16bit) {}
+
+ private:
bool Check(const GraphViewer& graph_viewer, const Node& node,
const std::vector& dq_nodes,
const std::vector& q_nodes) const override;
+
+ bool allow_16bit_;
};
// single input. default is to only support uint8.
class UnaryNodeGroupSelector : public NodeGroupSelector {
+ public:
+ explicit UnaryNodeGroupSelector(bool allow_16bit = true) : allow_16bit_(allow_16bit) {}
+
+ private:
bool Check(const GraphViewer& graph_viewer, const Node& node,
const std::vector& dq_nodes,
const std::vector& q_nodes) const override;
+
+ bool allow_16bit_;
};
// 2 DQ nodes providing input -> node -> Q
class BinaryNodeGroupSelector : public NodeGroupSelector {
+ public:
+ explicit BinaryNodeGroupSelector(bool allow_16bit = true) : allow_16bit_(allow_16bit) {}
+
+ private:
bool Check(const GraphViewer& graph_viewer, const Node& node,
const std::vector& dq_nodes,
const std::vector& q_nodes) const override;
+
+ bool allow_16bit_;
};
// Variadic DQ nodes -> node -> Q
class VariadicNodeGroupSelector : public NodeGroupSelector {
+ public:
+ explicit VariadicNodeGroupSelector(bool allow_16bit = true) : allow_16bit_(allow_16bit) {}
+
private:
bool Check(const GraphViewer& graph_viewer, const Node& node,
const std::vector& dq_nodes,
const std::vector& q_nodes) const override;
+
+ bool allow_16bit_;
};
// DQ nodes for X, W and optionally B -> node -> Q
class ConvNodeGroupSelector : public NodeGroupSelector {
public:
// default to 'true'
- ConvNodeGroupSelector(bool int8_allowed = true) : int8_allowed_(int8_allowed) {}
+ ConvNodeGroupSelector(bool int8_allowed = true, bool allow_16bit = true)
+ : int8_allowed_(int8_allowed), allow_16bit_(allow_16bit) {}
private:
bool Check(const GraphViewer& graph_viewer, const Node& node,
@@ -98,16 +128,20 @@ class ConvNodeGroupSelector : public NodeGroupSelector {
const std::vector& q_nodes) const override;
bool int8_allowed_;
+ bool allow_16bit_;
};
class WhereNodeGroupSelector : public NodeGroupSelector {
public:
- WhereNodeGroupSelector() = default;
+ explicit WhereNodeGroupSelector(bool allow_16bit = true)
+ : allow_16bit_(allow_16bit) {}
private:
bool Check(const GraphViewer& graph_viewer, const Node& node,
const std::vector& dq_nodes,
const std::vector& q_nodes) const override;
+
+ bool allow_16bit_;
};
class PadNodeGroupSelector : public NodeGroupSelector {
@@ -125,9 +159,11 @@ class PadNodeGroupSelector : public NodeGroupSelector {
class MatMulNodeGroupSelector : public NodeGroupSelector {
public:
MatMulNodeGroupSelector(bool int8_allowed = true,
- bool matmulintegertofloat_allowed = false)
+ bool matmulintegertofloat_allowed = false,
+ bool allow_16bit = true)
: int8_allowed_(int8_allowed),
- matmulintegertofloat_allowed_(matmulintegertofloat_allowed) {
+ matmulintegertofloat_allowed_(matmulintegertofloat_allowed),
+ allow_16bit_(allow_16bit) {
}
private:
@@ -136,15 +172,21 @@ class MatMulNodeGroupSelector : public NodeGroupSelector {
const std::vector& q_nodes) const override;
bool int8_allowed_;
bool matmulintegertofloat_allowed_;
+ bool allow_16bit_;
};
// Input: DQ nodes for A, B and optional C
// Output: optional Q node for Y
class GemmNodeGroupSelector : public NodeGroupSelector {
+ public:
+ explicit GemmNodeGroupSelector(bool allow_16bit = true) : allow_16bit_(allow_16bit) {}
+
private:
bool Check(const GraphViewer& graph_viewer, const Node& node,
             const std::vector<const Node*>& dq_nodes,
             const std::vector<const Node*>& q_nodes) const override;
+
+ bool allow_16bit_;
};
// Input: DQ nodes for input, scale, and B
@@ -207,28 +249,33 @@ class BaseSelector : public NodeSelector {
class DropQDQNodesSelector : public BaseSelector {
public:
-  DropQDQNodesSelector() : BaseSelector(std::make_unique<DropQDQNodeGroupSelector>()) {}
+  explicit DropQDQNodesSelector(bool allow_16bit = false)
+      : BaseSelector(std::make_unique<DropQDQNodeGroupSelector>(allow_16bit)) {}
};
class DropDQNodesSelector : public BaseSelector {
public:
-  DropDQNodesSelector() : BaseSelector(std::make_unique<DropDQNodeGroupSelector>()) {}
+  explicit DropDQNodesSelector(bool allow_16bit = false)
+      : BaseSelector(std::make_unique<DropDQNodeGroupSelector>(allow_16bit)) {}
};
class UnarySelector : public BaseSelector {
public:
-  UnarySelector() : BaseSelector(std::make_unique<UnaryNodeGroupSelector>()) {}
+  explicit UnarySelector(bool allow_16bit = false)
+      : BaseSelector(std::make_unique<UnaryNodeGroupSelector>(allow_16bit)) {}
};
class BinarySelector : public BaseSelector {
public:
-  BinarySelector() : BaseSelector(std::make_unique<BinaryNodeGroupSelector>()) {}
+  explicit BinarySelector(bool allow_16bit = false)
+      : BaseSelector(std::make_unique<BinaryNodeGroupSelector>(allow_16bit)) {}
};
// Variadic DQ nodes -> node -> Q
class InputVariadicSelector : public BaseSelector {
public:
-  InputVariadicSelector() : BaseSelector(std::make_unique<VariadicNodeGroupSelector>()) {}
+  explicit InputVariadicSelector(bool allow_16bit = false)
+      : BaseSelector(std::make_unique<VariadicNodeGroupSelector>(allow_16bit)) {}
void UpdateBuilder(NodesToOptimizeIndicesBuilder&) const override;
};
@@ -244,46 +291,36 @@ class OutputVariadicSelector : public BaseSelector {
// DQ nodes for X, W and optionally B -> node -> Q
class ConvSelector : public BaseSelector {
public:
-  ConvSelector(bool int8_allowed = false) : BaseSelector(std::make_unique<ConvNodeGroupSelector>(int8_allowed)) {}
+  ConvSelector(bool int8_allowed = false, bool allow_16bit = false)
+      : BaseSelector(std::make_unique<ConvNodeGroupSelector>(int8_allowed, allow_16bit)) {}
void UpdateBuilder(NodesToOptimizeIndicesBuilder&) const override;
};
+
class WhereSelector : public BaseSelector {
public:
-  WhereSelector() : BaseSelector(std::make_unique<WhereNodeGroupSelector>()) {}
+  explicit WhereSelector(bool allow_16bit = false)
+      : BaseSelector(std::make_unique<WhereNodeGroupSelector>(allow_16bit)) {}
};
+
// 2 DQ nodes for input -> node -> optional Q if QLinearMatMul, MatMulIntegerToFloat if not
class MatMulSelector : public BaseSelector {
public:
-  MatMulSelector(bool int8_allowed)
-      : BaseSelector(std::make_unique<MatMulNodeGroupSelector>(int8_allowed, /*matmulintegertofloat_allowed*/ true)) {}
+  MatMulSelector(bool int8_allowed, bool allow_16bit = false)
+      : BaseSelector(std::make_unique<MatMulNodeGroupSelector>(int8_allowed, /*matmulintegertofloat_allowed*/ true,
+                                                               allow_16bit)) {}
};
// Input: DQ nodes for A, B and optional C
// Output: optional Q node for Y
class GemmSelector : public BaseSelector {
public:
-  GemmSelector()
-      : BaseSelector(std::make_unique<GemmNodeGroupSelector>()) {}
+  explicit GemmSelector(bool allow_16bit = false)
+      : BaseSelector(std::make_unique<GemmNodeGroupSelector>(allow_16bit)) {}
void UpdateBuilder(NodesToOptimizeIndicesBuilder&) const override;
};
-// Input: DQ nodes for input, scale, and B (bias)
-// Output: Q node for output
-class InstanceNormalizationSelector : public BaseSelector {
- public:
- InstanceNormalizationSelector()
-      : BaseSelector(std::make_unique<InstanceAndLayerNormalizationNodeGroupSelector>()) {}
-};
-
-// DQ nodes for X, W and optionally B, (mean, var not required) -> node -> Q
-class BatchNormalizationSelector : public BaseSelector {
- public:
- BatchNormalizationSelector(bool int8_allowed = false)
-      : BaseSelector(std::make_unique<BatchNormalizationNodeGroupSelector>(int8_allowed)) {}
-};
-
} // namespace QDQ
} // namespace onnxruntime
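
Note (illustration, not part of the patch): the node groups these selectors match are ordinary DQ -> op -> Q chains; the new `allow_16bit` flag only controls whether a group whose quantized tensors are 16-bit integers is accepted. A minimal Python sketch of such a group, built with `onnx.helper` using illustrative tensor names, and placed in the `com.microsoft` domain since 16-bit Q/DQ is only available as a contrib op:

```python
from onnx import TensorProto, helper

# A uint16 DequantizeLinear -> Relu -> QuantizeLinear group. With allow_16bit=true the
# selectors above may accept this group; with allow_16bit=false the 16-bit zero points
# disqualify it. All names below are illustrative only.
dq = helper.make_node("DequantizeLinear", ["x_q", "x_scale", "x_zp"], ["x_f"], domain="com.microsoft")
relu = helper.make_node("Relu", ["x_f"], ["y_f"])
q = helper.make_node("QuantizeLinear", ["y_f", "y_scale", "y_zp"], ["y_q"], domain="com.microsoft")

graph = helper.make_graph(
    [dq, relu, q],
    "qdq_relu_uint16",
    [helper.make_tensor_value_info("x_q", TensorProto.UINT16, [4])],
    [helper.make_tensor_value_info("y_q", TensorProto.UINT16, [4])],
    initializer=[
        helper.make_tensor("x_scale", TensorProto.FLOAT, [], [0.05]),
        helper.make_tensor("x_zp", TensorProto.UINT16, [], [32767]),
        helper.make_tensor("y_scale", TensorProto.FLOAT, [], [0.05]),
        helper.make_tensor("y_zp", TensorProto.UINT16, [], [32767]),
    ],
)
model = helper.make_model(
    graph, opset_imports=[helper.make_opsetid("", 19), helper.make_opsetid("com.microsoft", 1)]
)
```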
diff --git a/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc b/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc
index 3723ee6032582..2c11bf144999e 100644
--- a/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc
+++ b/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc
@@ -1195,7 +1195,7 @@ bool TransposeQuantizeDequantizeAxis(const api::GraphRef& graph, const std::vect
static bool HandleQuantizeDequantizeAxis(const api::GraphRef& graph, const std::vector& perm,
api::NodeRef& node, int64_t opset) {
if (opset < 13) {
- // no `axis` value until opset 13
+ // no `axis` attribute until opset 13
return true;
}
diff --git a/onnxruntime/core/providers/cpu/quantization/quantize_linear.cc b/onnxruntime/core/providers/cpu/quantization/quantize_linear.cc
index 67a9a5991939a..a0d75e8cc0e69 100644
--- a/onnxruntime/core/providers/cpu/quantization/quantize_linear.cc
+++ b/onnxruntime/core/providers/cpu/quantization/quantize_linear.cc
@@ -5,13 +5,47 @@
#include "core/framework/element_type_lists.h"
#include "core/framework/float8.h"
#include "core/framework/float16.h"
-#include "core/providers/cpu/quantization/quantize_linear.h"
+#include "core/framework/op_kernel.h"
#include "core/providers/common.h"
#include "core/mlas/inc/mlas.h"
#include "core/util/qmath.h"
namespace onnxruntime {
+template <typename T>
+class DequantizeLinear final : public OpKernel {
+ public:
+ explicit DequantizeLinear(const OpKernelInfo& info) : OpKernel(info) {
+ if (!info.GetAttr("axis", &axis_).IsOK()) {
+ axis_ = 1;
+ }
+ }
+
+ Status Compute(OpKernelContext* context) const override;
+
+ private:
+ int64_t axis_;
+};
+
+template <typename T>
+class QuantizeLinear final : public OpKernel {
+ public:
+ explicit QuantizeLinear(const OpKernelInfo& info) : OpKernel(info) {
+ if (!info.GetAttr("axis", &axis_).IsOK()) {
+ axis_ = 1;
+ }
+ if (!info.GetAttr("saturate", &saturate_).IsOK()) {
+ saturate_ = 1;
+ }
+ }
+
+ Status Compute(OpKernelContext* context) const override;
+
+ private:
+ int64_t axis_;
+ int64_t saturate_;
+};
+
static void PrepareForQDQ(const TensorShape& input_shape,
const Tensor& scale,
const Tensor* zero_point_ptr,
@@ -86,6 +120,59 @@ REGISTER_DEQUANTIZELINEAR_VERSIONED(int8_t)
REGISTER_DEQUANTIZELINEAR_VERSIONED(uint8_t)
REGISTER_DEQUANTIZELINEAR_VERSIONED(int32_t)
+#if !defined(DISABLE_CONTRIB_OPS)
+namespace contrib {
+
+// Register alternate MS domain versions of the DequantizeLinear kernel.
+// The MS domain versions additionally support 16-bit integer quantization types.
+ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
+    DequantizeLinear,
+    1,
+    uint8_t,
+    KernelDefBuilder()
+        .TypeConstraint("T1", DataTypeImpl::GetTensorType<uint8_t>())
+        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
+    DequantizeLinear<uint8_t>);
+
+ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
+    DequantizeLinear,
+    1,
+    int8_t,
+    KernelDefBuilder()
+        .TypeConstraint("T1", DataTypeImpl::GetTensorType<int8_t>())
+        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
+    DequantizeLinear<int8_t>);
+
+ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
+    DequantizeLinear,
+    1,
+    uint16_t,
+    KernelDefBuilder()
+        .TypeConstraint("T1", DataTypeImpl::GetTensorType<uint16_t>())
+        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
+    DequantizeLinear<uint16_t>);
+
+ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
+    DequantizeLinear,
+    1,
+    int16_t,
+    KernelDefBuilder()
+        .TypeConstraint("T1", DataTypeImpl::GetTensorType<int16_t>())
+        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
+    DequantizeLinear<int16_t>);
+
+ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
+    DequantizeLinear,
+    1,
+    int32_t,
+    KernelDefBuilder()
+        .TypeConstraint("T1", DataTypeImpl::GetTensorType<int32_t>())
+        .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
+    DequantizeLinear<int32_t>);
+
+} // namespace contrib
+#endif // !defined(DISABLE_CONTRIB_OPS)
+
template <typename T, typename OutT>
struct DequantizeLinearApply {
void op(int64_t N, int64_t broadcast_dim, int64_t block_size, const T* input, const OutT* scale, OutT* output, const T* zero_point) {
@@ -220,6 +307,49 @@ REGISTER_QUANTIZELINEAR(Float8E5M2FNUZ)
REGISTER_QUANTIZELINEAR_VERSIONED(int8_t)
REGISTER_QUANTIZELINEAR_VERSIONED(uint8_t)
+#if !defined(DISABLE_CONTRIB_OPS)
+namespace contrib {
+
+// Register alternate MS domain versions of the QuantizeLinear kernel.
+// The MS domain versions additionally support 16-bit integer quantization types.
+ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
+    QuantizeLinear,
+    1,
+    uint8_t,
+    KernelDefBuilder()
+        .TypeConstraint("T1", DataTypeImpl::GetTensorType<float>())
+        .TypeConstraint("T2", DataTypeImpl::GetTensorType<uint8_t>()),
+    QuantizeLinear<uint8_t>);
+
+ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
+    QuantizeLinear,
+    1,
+    int8_t,
+    KernelDefBuilder()
+        .TypeConstraint("T1", DataTypeImpl::GetTensorType<float>())
+        .TypeConstraint("T2", DataTypeImpl::GetTensorType<int8_t>()),
+    QuantizeLinear<int8_t>);
+
+ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
+    QuantizeLinear,
+    1,
+    uint16_t,
+    KernelDefBuilder()
+        .TypeConstraint("T1", DataTypeImpl::GetTensorType<float>())
+        .TypeConstraint("T2", DataTypeImpl::GetTensorType<uint16_t>()),
+    QuantizeLinear<uint16_t>);
+
+ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
+    QuantizeLinear,
+    1,
+    int16_t,
+    KernelDefBuilder()
+        .TypeConstraint("T1", DataTypeImpl::GetTensorType<float>())
+        .TypeConstraint("T2", DataTypeImpl::GetTensorType<int16_t>()),
+    QuantizeLinear<int16_t>);
+} // namespace contrib
+#endif // !defined(DISABLE_CONTRIB_OPS)
+
template <typename InputType, typename OutputType>
void ParQuantizeLinear(const InputType* Input,
OutputType* Output,
@@ -279,5 +409,4 @@ Status QuantizeLinear::Compute(OpKernelContext* ctx) const {
return Status::OK();
}
-
} // namespace onnxruntime
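
As a quick numeric check (not part of the patch), the kernels registered above implement y = (x - x_zero_point) * x_scale; the following reproduces the int16 values used in the contrib-op unit test added later in this diff:

```python
import numpy as np

# Dequantize y = (x - zero_point) * scale with the int16 test vector
# (scale = 2.0, zero_point = -1024) from quantize_ops_test.cc below.
x = np.array([-300, -30, -1025, 1270], dtype=np.int16)
scale = np.float32(2.0)
zero_point = np.int32(-1024)

y = (x.astype(np.int32) - zero_point).astype(np.float32) * scale
print(y)  # [ 1448.  1988.    -2.  4588.]
```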
diff --git a/onnxruntime/core/providers/cpu/quantization/quantize_linear.h b/onnxruntime/core/providers/cpu/quantization/quantize_linear.h
deleted file mode 100644
index 60e9d09665ab2..0000000000000
--- a/onnxruntime/core/providers/cpu/quantization/quantize_linear.h
+++ /dev/null
@@ -1,45 +0,0 @@
-// Copyright (c) Microsoft Corporation. All rights reserved.
-// Licensed under the MIT License.
-
-#pragma once
-
-#include "core/common/common.h"
-#include "core/framework/op_kernel.h"
-#include "core/util/math_cpuonly.h"
-
-namespace onnxruntime {
-
-template <typename T>
-class DequantizeLinear final : public OpKernel {
- public:
- DequantizeLinear(const OpKernelInfo& info) : OpKernel(info) {
- if (!info.GetAttr("axis", &axis_).IsOK()) {
- axis_ = 1;
- }
- }
-
- Status Compute(OpKernelContext* context) const override;
-
- private:
- int64_t axis_;
-};
-
-template <typename T>
-class QuantizeLinear final : public OpKernel {
- public:
- QuantizeLinear(const OpKernelInfo& info) : OpKernel(info) {
- if (!info.GetAttr("axis", &axis_).IsOK()) {
- axis_ = 1;
- }
- if (!info.GetAttr("saturate", &saturate_).IsOK()) {
- saturate_ = 1;
- }
- }
-
- Status Compute(OpKernelContext* context) const override;
-
- private:
- int64_t axis_;
- int64_t saturate_;
-};
-} // namespace onnxruntime
diff --git a/onnxruntime/core/providers/qnn/builder/opbuilder/simple_op_builder.cc b/onnxruntime/core/providers/qnn/builder/opbuilder/simple_op_builder.cc
index 556a86bb1519b..8081033c35618 100644
--- a/onnxruntime/core/providers/qnn/builder/opbuilder/simple_op_builder.cc
+++ b/onnxruntime/core/providers/qnn/builder/opbuilder/simple_op_builder.cc
@@ -30,6 +30,12 @@ class SimpleOpBuilder : public BaseOpBuilder {
private:
Status ExplicitOpCheck(const QnnModelWrapper& qnn_model_wrapper, const NodeUnit& node_unit) const;
+ Status ProcessSigmoidOrTanhOutput(QnnModelWrapper& qnn_model_wrapper,
+ const NodeUnit& node_unit,
+                                    std::vector<std::string>&& input_names,
+                                    std::vector<std::string>&& param_tensor_names,
+ const logging::Logger& logger,
+ bool do_op_validation) const ORT_MUST_USE_RESULT;
static constexpr std::array gridsample_supported_modes = {"bilinear", "nearest"};
static constexpr std::array gridsample_supported_padding_modes = {"zeros", "border", "reflection"};
@@ -279,10 +285,120 @@ Status SimpleOpBuilder::ProcessAttributesAndOutputs(QnnModelWrapper& qnn_model_w
ORT_RETURN_IF_ERROR(ProcessGridSampleAttributes(qnn_model_wrapper, node_unit, param_tensor_names));
}
- ORT_RETURN_IF_ERROR(ProcessOutputs(qnn_model_wrapper, node_unit,
- std::move(input_names),
- std::move(param_tensor_names),
- logger, do_op_validation, GetQnnOpType(op_type)));
+ if (op_type == "Sigmoid" || op_type == "Tanh") {
+ // QNN requires 16-bit QDQ Sigmoid and Tanh to use specific output scale and zero-point values
+ // regardless of floating-point range.
+ return ProcessSigmoidOrTanhOutput(qnn_model_wrapper,
+ node_unit,
+ std::move(input_names),
+ std::move(param_tensor_names),
+ logger, do_op_validation);
+ }
+
+ return ProcessOutputs(qnn_model_wrapper, node_unit,
+ std::move(input_names),
+ std::move(param_tensor_names),
+ logger, do_op_validation, GetQnnOpType(op_type));
+}
+
+/**
+ * Overrides offset and scale quantization parameters for operators (e.g., Sigmoid or Tanh) that require
+ * specific values. Returns true if the quantization parameters were overridden.
+ *
+ * \param op_type The ONNX operator type.
+ * \param qnn_data_type The QNN tensor data type.
+ * \param quant_params Output scale/offset parameter that may be overridden.
+ * \return True if the offset and scale were overridden.
+ */
+static bool OverrideQuantParams(const std::string& op_type, Qnn_DataType_t qnn_data_type,
+ Qnn_ScaleOffset_t& quant_params) {
+ const int32_t orig_offset = quant_params.offset;
+ const float orig_scale = quant_params.scale;
+
+ if (op_type == "Sigmoid") {
+ switch (qnn_data_type) {
+ case QNN_DATATYPE_UFIXED_POINT_16:
+ quant_params.offset = 0;
+ quant_params.scale = 1.0f / 65536.0f;
+ break;
+ case QNN_DATATYPE_SFIXED_POINT_16:
+ quant_params.offset = 0;
+ quant_params.scale = 1.0f / 32768.0f;
+ break;
+ default:
+ break; // Do nothing.
+ }
+ }
+
+ if (op_type == "Tanh") {
+ switch (qnn_data_type) {
+ case QNN_DATATYPE_UFIXED_POINT_16:
+ quant_params.offset = -32768;
+ quant_params.scale = 1.0f / 32768.0f;
+ break;
+ case QNN_DATATYPE_SFIXED_POINT_16:
+ quant_params.offset = 0;
+ quant_params.scale = 1.0f / 32768.0f;
+ break;
+ default:
+ break; // Do nothing.
+ }
+ }
+
+ return quant_params.offset != orig_offset || quant_params.scale != orig_scale;
+}
+
+/**
+ * Processes the output for Sigmoid or Tanh operators and creates the corresponding QNN operator.
+ * These operator types are handled separately because QNN requires 16-bit QDQ Sigmoid and Tanh operators to use
+ * specific scale and zero-point values regardless of floating-point range.
+ *
+ * \param qnn_model_wrapper The QNN model wrapper object.
+ * \param node_unit The QDQ node unit for the Sigmoid or Tanh node.
+ * \param input_names List of input names.
+ * \param param_tensor_names List of param tensor names.
+ * \param logger Logger used to report information.
+ * \param do_op_validation True if the new QNN node should be validated.
+ */
+Status SimpleOpBuilder::ProcessSigmoidOrTanhOutput(QnnModelWrapper& qnn_model_wrapper,
+ const NodeUnit& node_unit,
+                                                    std::vector<std::string>&& input_names,
+                                                    std::vector<std::string>&& param_tensor_names,
+ const logging::Logger& logger,
+ bool do_op_validation) const {
+ const std::string& op_type = node_unit.OpType();
+ const auto& output = node_unit.Outputs()[0];
+ const std::string& output_name = output.node_arg.Name();
+
+ OnnxInputInfo output_info = {};
+
+ // TODO(adrianlizarraga): Rename GetOnnxInputInfo() since it can be used for outputs as well.
+ ORT_RETURN_IF_ERROR(qnn_model_wrapper.GetOnnxInputInfo(output, output_info));
+
+ if (output_info.quant_param.quantizationEncoding == QNN_QUANTIZATION_ENCODING_SCALE_OFFSET) {
+ if (OverrideQuantParams(op_type, output_info.qnn_data_type, output_info.quant_param.scaleOffsetEncoding)) {
+ const int32_t offset = output_info.quant_param.scaleOffsetEncoding.offset;
+ const float scale = output_info.quant_param.scaleOffsetEncoding.scale;
+
+ LOGS(logger, VERBOSE) << "QNN requires that 16-bit quantized " << op_type << " operators use offset/scale values "
+ << "of <" << offset << ", " << scale << ">. QNN EP will override the original values.";
+ }
+ }
+
+ Qnn_TensorType_t tensor_type = qnn_model_wrapper.IsGraphOutput(output_name) ? QNN_TENSOR_TYPE_APP_READ
+ : QNN_TENSOR_TYPE_NATIVE;
+ QnnTensorWrapper output_tensorwrapper(output_name, tensor_type, output_info.qnn_data_type, output_info.quant_param,
+ std::move(output_info.shape));
+ ORT_RETURN_IF_NOT(qnn_model_wrapper.AddTensorWrapper(std::move(output_tensorwrapper)), "Failed to add tensor.");
+ ORT_RETURN_IF_NOT(qnn_model_wrapper.CreateQnnNode(GetNodeName(node_unit),
+ QNN_OP_PACKAGE_NAME_QTI_AISW,
+ GetQnnOpType(op_type),
+ std::move(input_names),
+ {output_name},
+ std::move(param_tensor_names),
+ do_op_validation),
+ "Failed to add node.");
+
return Status::OK();
}
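
Why these particular constants: under QNN's scale/offset convention (real = scale * (quantized + offset), which is also why ProcessOffset below negates the ONNX zero point), the overridden values make the 16-bit grid line up exactly with the operator's output range. A quick check, not part of the patch:

```python
# QNN scale/offset dequantization: real = scale * (quantized + offset).
def qnn_real_range(scale, offset, qmin, qmax):
    return scale * (qmin + offset), scale * (qmax + offset)

# Sigmoid outputs lie in (0, 1):
print(qnn_real_range(1.0 / 65536.0, 0, 0, 65535))       # uint16 -> (0.0, ~0.99998)
print(qnn_real_range(1.0 / 32768.0, 0, -32768, 32767))  # int16  -> (-1.0, ~0.99997); negative half unused

# Tanh outputs lie in (-1, 1):
print(qnn_real_range(1.0 / 32768.0, -32768, 0, 65535))  # uint16 -> (-1.0, ~0.99997)
print(qnn_real_range(1.0 / 32768.0, 0, -32768, 32767))  # int16  -> (-1.0, ~0.99997)
```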
diff --git a/onnxruntime/core/providers/qnn/builder/qnn_model_wrapper.cc b/onnxruntime/core/providers/qnn/builder/qnn_model_wrapper.cc
index eebe75d839b12..9d339387b0a43 100644
--- a/onnxruntime/core/providers/qnn/builder/qnn_model_wrapper.cc
+++ b/onnxruntime/core/providers/qnn/builder/qnn_model_wrapper.cc
@@ -301,6 +301,16 @@ bool QnnModelWrapper::ProcessOffset(const std::string& offset_name,
offset_value = 0 - (uint8_span.data()[0]);
break;
}
+ case ONNX_NAMESPACE::TensorProto_DataType_UINT16: {
+      auto uint16_span = ReinterpretAsSpan<const uint16_t>(gsl::make_span(unpacked_tensor));
+      offset_value = -static_cast<int32_t>(uint16_span.data()[0]);
+ break;
+ }
+ case ONNX_NAMESPACE::TensorProto_DataType_INT16: {
+      auto int16_span = ReinterpretAsSpan<const int16_t>(gsl::make_span(unpacked_tensor));
+      offset_value = -static_cast<int32_t>(int16_span.data()[0]);
+ break;
+ }
case ONNX_NAMESPACE::TensorProto_DataType_INT32: {
      auto int32_span = ReinterpretAsSpan<const int32_t>(gsl::make_span(unpacked_tensor));
offset_value = -(int32_span.data()[0]);
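
The sign flip above comes from the two conventions differing only in the sign of the zero point: ONNX dequantizes as real = scale * (q - zero_point), while QNN uses real = scale * (q + offset). A one-line sketch (not part of the patch):

```python
# ONNX: real = scale * (q - zero_point);  QNN: real = scale * (q + offset)  =>  offset = -zero_point.
def to_qnn_offset(onnx_zero_point: int) -> int:
    return -int(onnx_zero_point)

print(to_qnn_offset(32767))  # -32767, e.g. for a uint16 zero point of 32767
```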
diff --git a/onnxruntime/python/tools/quantization/onnx_quantizer.py b/onnxruntime/python/tools/quantization/onnx_quantizer.py
index 924d4c72b6390..2d1e418f9d2b4 100644
--- a/onnxruntime/python/tools/quantization/onnx_quantizer.py
+++ b/onnxruntime/python/tools/quantization/onnx_quantizer.py
@@ -104,7 +104,7 @@ def __init__(
)
self.q_matmul_const_b_only = "MatMulConstBOnly" in self.extra_options and self.extra_options["MatMulConstBOnly"]
self.is_weight_symmetric = (
- weight_qType in (QuantType.QInt8, QuantType.QFLOAT8E4M3FN)
+ weight_qType in (QuantType.QInt8, QuantType.QInt16, QuantType.QFLOAT8E4M3FN)
if "WeightSymmetric" not in self.extra_options
else self.extra_options["WeightSymmetric"]
)
diff --git a/onnxruntime/python/tools/quantization/qdq_quantizer.py b/onnxruntime/python/tools/quantization/qdq_quantizer.py
index f87a9d8228bac..e595b580b20df 100644
--- a/onnxruntime/python/tools/quantization/qdq_quantizer.py
+++ b/onnxruntime/python/tools/quantization/qdq_quantizer.py
@@ -25,6 +25,7 @@
add_quant_output_suffix,
add_quant_suffix,
find_by_name,
+ ms_domain,
)
from .registry import CreateQDQQuantizer
@@ -119,6 +120,20 @@ def __init__(
else extra_options["QDQOpTypePerChannelSupportToAxis"]
)
+ self.qdq_op_domain = ms_domain if extra_options.get("UseQDQContribOps", False) else None
+
+ # The ONNX spec does not yet support 16-bit Q/DQ ops. So, must override the Q/DQ op domain to 'com.microsoft'
+ # if the activation or weight types are 16-bit integers.
+ # TODO: Remove this override (and use only the 'UseQDQContribOps' option) if/when ONNX adds 16-bit support.
+ int16_types = (TensorProto.UINT16, TensorProto.INT16)
+ if not self.qdq_op_domain and (self.activation_qType in int16_types or self.weight_qType in int16_types):
+ logging.warning(
+ "ONNX QuantizeLinear and DequantizeLinear operators do not support 16-bit integer quantization types. "
+ f"The domain of QuantizeLinear and DequantizeLinear operators will be set to '{ms_domain}' to "
+ "enable support."
+ )
+ self.qdq_op_domain = ms_domain
+
def _is_tensor_quantizable(self, tensor_name):
"""
Check if tensor can be quantized
@@ -249,6 +264,7 @@ def _create_qdq_nodes(
[q_output],
quant_node_name,
axis=axis,
+ domain=self.qdq_op_domain,
)
dequant_node = onnx.helper.make_node(
DEQUANT_OP_NAME,
@@ -256,6 +272,7 @@ def _create_qdq_nodes(
[dq_output],
dequant_node_name,
axis=axis,
+ domain=self.qdq_op_domain,
)
self.model.add_nodes([qlinear_node, dequant_node])
@@ -300,6 +317,7 @@ def _add_qdq_pair_for_initializer(self, weight_proto, tensor_type, axis=None):
[weight_dequant_output],
add_dequant_suffix(weight_name),
axis=axis,
+ domain=self.qdq_op_domain,
)
self.model.add_node(dequant_node)
@@ -443,6 +461,7 @@ def _quantize_bias_tensors(self):
[bias_name],
node_name,
axis=quant_value.axis,
+ domain=self.qdq_op_domain,
)
else:
dequant_node = onnx.helper.make_node(
@@ -450,6 +469,7 @@ def _quantize_bias_tensors(self):
inputs,
[bias_name],
node_name,
+ domain=self.qdq_op_domain,
)
else:
raise RuntimeError(f"Unexpected operator type {quant_value.node_type!r}.")
diff --git a/onnxruntime/python/tools/quantization/quant_utils.py b/onnxruntime/python/tools/quantization/quant_utils.py
index 4d5bcca29618f..74e54c3f1fa37 100644
--- a/onnxruntime/python/tools/quantization/quant_utils.py
+++ b/onnxruntime/python/tools/quantization/quant_utils.py
@@ -72,6 +72,8 @@ class QuantType(Enum):
QInt8 = 0
QUInt8 = 1
QFLOAT8E4M3FN = 2
+ QInt16 = 3
+ QUInt16 = 4
def __str__(self):
return self.name
@@ -89,6 +91,10 @@ def tensor_type(self):
return TensorProto.INT8
if self == QuantType.QUInt8:
return TensorProto.UINT8
+ if self == QuantType.QUInt16:
+ return TensorProto.UINT16
+ if self == QuantType.QInt16:
+ return TensorProto.INT16
if self == QuantType.QFLOAT8E4M3FN:
return TensorProto.FLOAT8E4M3FN
raise ValueError(f"Unexpected value qtype={self!r}.")
@@ -112,12 +118,35 @@ def from_string(format):
ONNX_TYPE_TO_NP_TYPE = {
onnx_proto.TensorProto.INT8: numpy.dtype("int8"),
onnx_proto.TensorProto.UINT8: numpy.dtype("uint8"),
+ onnx_proto.TensorProto.INT16: numpy.dtype("int16"),
+ onnx_proto.TensorProto.UINT16: numpy.dtype("uint16"),
onnx_proto.TensorProto.FLOAT8E4M3FN: float8e4m3fn,
}
+ONNX_INT_TYPE_RANGE = {
+ onnx_proto.TensorProto.UINT8: (0, 255),
+ onnx_proto.TensorProto.INT8: (-128, 127),
+ onnx_proto.TensorProto.UINT16: (0, 65535),
+ onnx_proto.TensorProto.INT16: (-32768, 32767),
+}
+
+ONNX_INT_TYPE_SYMMETRIC_RANGE = {
+ onnx_proto.TensorProto.INT8: (-127, 127),
+ onnx_proto.TensorProto.INT16: (-32767, 32767),
+}
+
+ONNX_INT_TYPE_REDUCED_RANGE = {
+ onnx_proto.TensorProto.UINT8: (0, 127),
+ onnx_proto.TensorProto.INT8: (-64, 64),
+ onnx_proto.TensorProto.UINT16: (0, 32767),
+ onnx_proto.TensorProto.INT16: (-16384, 16384),
+}
+
def quantize_nparray(qType, arr, scale, zero_point, low=None, high=None):
- assert qType in ONNX_TYPE_TO_NP_TYPE, f"Unexpected data type {qType} requested. Only INT8 and UINT8 are supported."
+ assert (
+ qType in ONNX_TYPE_TO_NP_TYPE
+ ), f"Unexpected data type {qType} requested. Only INT8, UINT8, INT16, and UINT16 are supported."
if qType in (
onnx_proto.TensorProto.FLOAT8E4M3FN,
onnx_proto.TensorProto.FLOAT8E4M3FNUZ,
@@ -146,8 +175,10 @@ def quantize_nparray(qType, arr, scale, zero_point, low=None, high=None):
return ref.run(None, {"X": arr.astype(numpy.float32), "scale": scale.astype(numpy.float32)})[0]
else:
dtype = ONNX_TYPE_TO_NP_TYPE[qType]
- cliplow = max(0 if dtype == numpy.uint8 else -127, -127 if low is None else low)
- cliphigh = min(255 if dtype == numpy.uint8 else 127, 255 if high is None else high)
+ (qmin, qmax) = get_qmin_qmax_for_qType(qType, reduce_range=False, symmetric=True)
+
+ cliplow = max(qmin, low) if low is not None else qmin
+ cliphigh = min(qmax, high) if high is not None else qmax
arr_fp32 = numpy.asarray((arr.astype(numpy.float32) / scale).round() + zero_point)
numpy.clip(arr_fp32, cliplow, cliphigh, out=arr_fp32)
return arr_fp32.astype(dtype)
@@ -267,7 +298,7 @@ def quantize_data(data, qType, symmetric, reduce_range=False):
)
return rmin, rmax, zero_point, scale, quantized_data
- if qType in (TensorProto.INT8, TensorProto.UINT8):
+ if qType in (TensorProto.INT8, TensorProto.UINT8, TensorProto.INT16, TensorProto.UINT16):
if len(data):
qmin, qmax = get_qmin_qmax_for_qType(qType, reduce_range, symmetric=symmetric)
zero_point, scale = compute_scale_zp(rmin, rmax, qmin, qmax, symmetric)
@@ -283,18 +314,22 @@ def get_qmin_qmax_for_qType(qType, reduce_range=False, symmetric=False): # noqa
    :parameter qType: onnx.onnx_pb.TensorProto.UINT8 or onnx.onnx_pb.TensorProto.INT8
:return: qmin, qmax
"""
- if qType == onnx_proto.TensorProto.UINT8:
- (qmin, qmax) = (0, 127) if reduce_range else (0, 255)
- elif qType == onnx_proto.TensorProto.INT8:
- if symmetric:
- (qmin, qmax) = (-64, 64) if reduce_range else (-127, 127)
- else:
- (qmin, qmax) = (-64, 64) if reduce_range else (-128, 127)
- elif qType == onnx_proto.TensorProto.FLOAT8E4M3FN:
+ if qType == onnx_proto.TensorProto.FLOAT8E4M3FN:
raise NotImplementedError("This function is not implemented for float 8 as not needed.")
+
+ qrange = None
+
+ if reduce_range:
+ qrange = ONNX_INT_TYPE_REDUCED_RANGE.get(qType)
+ elif symmetric and qType in ONNX_INT_TYPE_SYMMETRIC_RANGE:
+ qrange = ONNX_INT_TYPE_SYMMETRIC_RANGE[qType]
else:
- raise ValueError(f"Unexpected data type {qType} requested. Only INT8 and UINT8 are supported.")
- return qmin, qmax
+ qrange = ONNX_INT_TYPE_RANGE.get(qType)
+
+ if not qrange:
+ raise ValueError(f"Unexpected data type {qType} requested. Only INT8, UINT8, INT16, and UINT16 are supported.")
+
+ return qrange
def get_qrange_for_qType(qType, reduce_range=False, symmetric=False): # noqa: N802
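
Putting the new range tables and the generalized rounding/clipping together, a small standalone sketch (mirroring, not importing, the helpers above) reproduces a few of the uint16 QuantizeLinear test values added later in this diff:

```python
import numpy as np

UINT16_RANGE = (0, 65535)  # ONNX_INT_TYPE_RANGE[TensorProto.UINT16] above

def quantize_uint16(values, scale, zero_point):
    qmin, qmax = UINT16_RANGE
    q = np.rint(np.asarray(values, dtype=np.float32) / scale) + zero_point  # rint rounds half to even
    return np.clip(q, qmin, qmax).astype(np.uint16)

# scale = 2.0, zero_point = 32767, as in the uint16 test case in quantize_ops_test.cc.
print(quantize_uint16([0.0, 3.0, 65536.0, -70000.0], 2.0, 32767))
# -> [32767 32769 65535     0]
```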
diff --git a/onnxruntime/python/tools/quantization/quantize.py b/onnxruntime/python/tools/quantization/quantize.py
index 6b1646aec9679..706047fe32400 100644
--- a/onnxruntime/python/tools/quantization/quantize.py
+++ b/onnxruntime/python/tools/quantization/quantize.py
@@ -240,6 +240,11 @@ def check_static_quant_arguments(quant_format: QuantFormat, activation_type: Qua
f"weight_type={weight_type}!=QuantType.QFLOAT8E4M3FN"
)
+ q16_types = [QuantType.QInt16, QuantType.QUInt16]
+
+ if (activation_type in q16_types or weight_type in q16_types) and quant_format != QuantFormat.QDQ:
+ raise ValueError("Only QuantFormat.QDQ supports 16-bit quantization types.")
+
if activation_type == QuantType.QInt8 and weight_type == QuantType.QInt8 and quant_format != QuantFormat.QDQ:
logging.warning(
"Please use QuantFormat.QDQ for activation type QInt8 and weight type QInt8. "
@@ -356,6 +361,11 @@ def quantize_static(
SmoothQuantFolding = True/False :
Default is True. It only works if SmoothQuant is True. If enabled, inserted Mul ops during
SmoothQuant will be folded into the previous op if the previous op is foldable.
+ UseQDQContribOps = True/False :
+ Default is False. If enabled, the inserted QuantizeLinear and DequantizeLinear ops will have the
+ `com.microsoft` domain, which forces use of ONNX Runtime's QuantizeLinear and DequantizeLinear
+ contrib op implementations. The contrib op implementations may support features not standardized
+ into the ONNX specification (e.g., 16-bit quantization types).
"""
if activation_type == QuantType.QFLOAT8E4M3FN or weight_type == QuantType.QFLOAT8E4M3FN:
if calibrate_method != CalibrationMethod.Distribution:
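
End to end, 16-bit static quantization can then be requested roughly as follows (a sketch, not taken from this patch; it assumes a model at `model.onnx` and a user-supplied `CalibrationDataReader` instance named `my_data_reader`):

```python
from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

quantize_static(
    "model.onnx",
    "model.qdq.u16.onnx",
    calibration_data_reader=my_data_reader,  # assumed to exist
    quant_format=QuantFormat.QDQ,            # 16-bit types require the QDQ format (see check above)
    activation_type=QuantType.QUInt16,
    weight_type=QuantType.QInt16,
    # The quantizer switches Q/DQ nodes to the com.microsoft domain automatically for
    # 16-bit types; UseQDQContribOps simply makes that choice explicit.
    extra_options={"UseQDQContribOps": True},
)
```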
diff --git a/onnxruntime/test/contrib_ops/quantize_ops_test.cc b/onnxruntime/test/contrib_ops/quantize_ops_test.cc
index af29f972a64cf..64a97ed4f945b 100644
--- a/onnxruntime/test/contrib_ops/quantize_ops_test.cc
+++ b/onnxruntime/test/contrib_ops/quantize_ops_test.cc
@@ -4,6 +4,7 @@
#include "gtest/gtest.h"
#include "test/common/tensor_op_test_utils.h"
#include "test/providers/provider_test_utils.h"
+#include "test/util/include/default_providers.h"
namespace onnxruntime {
namespace test {
@@ -40,7 +41,31 @@ TEST(DequantizeLinearOpTest, DequantizeLinear_per_tensor_float_int8) {
test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kTensorrtExecutionProvider});
}
-// Scalar zero & scale with int32
+// Test int16 com.microsoft.DequantizeLinear (per tensor)
+TEST(DequantizeLinearOpTest, DequantizeLinear_per_tensor_float_int16_cpu) {
+ OpTester test("DequantizeLinear", 1, onnxruntime::kMSDomain);
+  std::vector<int64_t> dims{4};
+  test.AddInput<int16_t>("x", dims, {-300, -30, -1025, 1270});
+  test.AddInput<float>("scale", {}, {2.0f}, true);
+  test.AddInput<int16_t>("zero_point", {}, {-1024}, true);
+  test.AddOutput<float>("y", dims, {1448.0f, 1988.0f, -2.0f, 4588.0f});
+ // Disable Tensorrt EP due to error: unsupported data type
+ test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kTensorrtExecutionProvider});
+}
+
+// Test uint16 com.microsoft.DequantizeLinear (per tensor)
+TEST(DequantizeLinearOpTest, DequantizeLinear_per_tensor_float_uint16_cpu) {
+ OpTester test("DequantizeLinear", 1, onnxruntime::kMSDomain);
+  std::vector<int64_t> dims{4};
+  test.AddInput<uint16_t>("x", dims, {30000, 31000, 32768, 33000});
+  test.AddInput<float>("scale", {}, {2.0f}, true);
+  test.AddInput<uint16_t>("zero_point", {}, {32767}, true);
+  test.AddOutput<float>("y", dims, {-5534.0f, -3534.0f, 2.0f, 466.0f});
+ // Disable Tensorrt EP due to error: unsupported data type
+ test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kTensorrtExecutionProvider});
+}
+
+// Test int32 DequantizeLinear with scalar zero-point & scale.
TEST(DequantizeLinearOpTest, DequantizeLinear_per_tensor_float_int32_cpu) {
OpTester test("DequantizeLinear", 1, onnxruntime::kMSDomain);
  std::vector<int64_t> dims{4};
@@ -256,6 +281,60 @@ TEST(QuantizeLinearContribOpTest, QuantizeLinear_per_tensor_float_int8) {
test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kTensorrtExecutionProvider});
}
+// Test uint16 com.microsoft.QuantizeLinear (per tensor)
+TEST(QuantizeLinearContribOpTest, QuantizeLinear_per_tensor_float_uint16) {
+ OpTester test("QuantizeLinear", 1, onnxruntime::kMSDomain);
+  std::vector<int64_t> dims{12};
+  test.AddInput<float>("x", dims, {
+ 0.f, -128.f, 3.f, -3.f, // rounding half to even
+ 2.9f, -2.9f, // round < .5
+ 3.1f, -3.1f, // round > .5
+ 65536.f, -65534.f, // critical point
+ 70000.f, -70000.f // saturate case
+ });
+  test.AddInput<float>("scale", {}, {2.0f}, true);
+  test.AddInput<uint16_t>("zero_point", {}, {32767}, true);
+  test.AddOutput<uint16_t>("y", dims,
+ {32767, 32703,
+ 32769, 32765,
+ 32768, 32766,
+ 32769, 32765,
+ 65535, 0,
+ 65535, 0});
+
+ // Disable Tensorrt EP due to error: unsupported data type
+ test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kTensorrtExecutionProvider});
+}
+
+// Test int16 com.microsoft.QuantizeLinear (per tensor)
+TEST(QuantizeLinearContribOpTest, QuantizeLinear_per_tensor_float_int16) {
+ OpTester test("QuantizeLinear", 1, onnxruntime::kMSDomain);
+  std::vector<int64_t> dims{16};
+ test.AddInput