
Is this pass flow possible for Stable Diffusion?: OrtTransformersOptimization → IncDynamicQuantization or IncStaticQuantization #852

Open
lshqqytiger opened this issue Jan 3, 2024 · 46 comments

Comments

@lshqqytiger

lshqqytiger commented Jan 3, 2024

Describe the bug and context
I'm trying to quantize an optimized Stable Diffusion model.
I learned that IncDynamicQuantization causes less of a slowdown in inference speed than OnnxDynamicQuantization.
However, I'm getting an IndexError during the UNet quantization pass.
The error comes from neural-compressor, but everything works normally when the optimization pass is skipped, so I suspect a compatibility issue with OrtTransformersOptimization.

To Reproduce

  1. Build and install neural-compressor from source with intel/neural-compressor#1512 ("Enhance the ORT node name checking").
  2. Set the passes below and run Olive.

*Note: neural-compressor from pip works with the text encoder, UNet, and VAE encoder, but the VAE decoder throws an error.

Expected behavior
UNet should be quantized.

Olive config
provider: DmlExecutionProvider
pass flow: ["optimize", "inc_quantize"]

text encoder passes:

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "clip",
    "opt_level": 0,
    "float16": true,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"],
    "force_fp16_inputs": { "GroupNorm": [0, 1, 2] }
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}

unet passes:
I disabled GroupNorm fusion because I got a NotImplemented error with fp16 and an onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException with no error description/message with fp32.
The NotImplemented error comes from neural-compressor because it creates the InferenceSession with CPUExecutionProvider, and fp16 GroupNorm is not implemented for CPU.

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "unet",
    "opt_level": 0,
    "float16": true,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": false,
      "enable_skip_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"],
    "force_fp16_inputs": { "GroupNorm": [0, 1, 2] }
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}

vae decoder passes:

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "vae",
    "opt_level": 0,
    "float16": true,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"],
    "force_fp16_inputs": { "GroupNorm": [0, 1, 2] }
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true,
    "recipes": {
      "first_conv_or_matmul_quantization": false,
      "last_conv_or_matmul_quantization": false
    }
  }
}

vae encoder passes:

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "vae",
    "opt_level": 0,
    "float16": true,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"],
    "force_fp16_inputs": { "GroupNorm": [0, 1, 2] }
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}

Olive logs
log.txt

Other information

  • OS: Windows 11 26016
  • Olive version: 0.4.0
  • ONNXRuntime package and version: onnxruntime==1.16.3 onnxruntime-directml==1.16.3
  • neural-compressor: ly/fix_ort
@jambayk
Contributor

jambayk commented Jan 3, 2024

Int8 quantization is normally applied to an fp32 model, not an fp16 model. If you look at our other examples, that's the only workflow we exercise: https://github.com/microsoft/Olive/blob/main/examples/llama2/llama2.py#L19 https://github.com/microsoft/Olive/blob/main/examples/whisper/prepare_whisper_configs.py#L33

I am not sure fp16 transformers optimization and int8 quantization are fully compatible. Could you try turning fp16 off in the transformers optimization and see whether the workflow works then?

With regard to the inc pass, @yuwenzho might have better insight.

@jambayk
Contributor

jambayk commented Jan 3, 2024

You can turn on debug logging for both INC and Olive by setting log_severity_level=0 under the engine section of the config JSON.
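For example, a minimal sketch of that setting, assuming the config is driven from Python via olive.workflows.run (the input_model and passes entries are assumed placeholders, fill them in as in the configs above):

import onnx  # noqa: F401  (only here to show the environment already has onnx/olive installed)
from olive.workflows import run as olive_run

# Sketch: enable verbose logging for Olive and the passes it drives.
olive_config = {
    "input_model": {"type": "ONNXModel", "config": {"model_path": "model.onnx"}},  # assumed placeholder
    "passes": {},  # the "optimize" / "inc_quantize" passes shown above
    "engine": {"log_severity_level": 0},  # 0 = most verbose, surfaces Olive and pass debug logs
}

# olive_run(olive_config)  # run once input_model/passes are filled in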

@guotuofeng
Collaborator

INC debug logging can be enabled by setting the environment variable LOGLEVEL=DEBUG.
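For example, from Python (a small sketch; setting the variable in the shell before launching Olive works the same way):

import os

# Enables neural-compressor debug logging; set this before neural_compressor is imported.
os.environ["LOGLEVEL"] = "DEBUG"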

@guotuofeng
Collaborator

I am not sure if fp16 transformers optimization and int8 quantization are fully compatible. Could you try turning fp16 off in the transformers optimization and see if the workflow works for it?

With regard to the inc pass, @yuwenzho might have better insight.

@lshqqytiger, you can try using Olive from the main branch, which includes a couple of fixes for INC logging support.

With regard to fp16 model support in INC: if the fp16 model can be loaded with the CPU EP, I suppose INC quantization should be able to run on CPU. If not, the current INC might have some issues. @yuwenzho should have more comments.

@guotuofeng
Collaborator

@lshqqytiger, would you please try setting backend (https://microsoft.github.io/Olive/api/passes.html#cmdoption-arg-backend) to "onnxrt_dml_ep" and check whether the INC quantization can be done with the DML EP?

@yuwenzho
Contributor

yuwenzho commented Jan 4, 2024

@lshqqytiger DmlExecutionProvider is now supported in INC for FP32 input models.
Please follow the comments above: 1. turn fp16 off in the transformers optimization, and 2. set backend to "onnxrt_dml_ep" in the IncDynamicQuantization config.
With this setup, the 'fp16 group norm is not implemented for cpu' error you mentioned in the unet passes should also be avoided, since INC will create the InferenceSession with DmlExecutionProvider.
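A sketch of the suggested pass config, written as a Python dict ("backend" comes from the Olive pass options linked above; the remaining fields are copied from the configs posted earlier in this issue):

# Sketch: IncDynamicQuantization configured so INC creates its sessions on the DML EP.
inc_quantize_pass = {
    "type": "IncDynamicQuantization",
    "disable_search": True,
    "config": {
        "backend": "onnxrt_dml_ep",      # per the Olive pass docs linked above
        "save_as_external_data": False,
        "all_tensors_to_one_file": True,
    },
}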

@lshqqytiger
Author

lshqqytiger commented Jan 4, 2024

Thank you for all your help.
With the main branch of Olive, the "onnxrt_dml_ep" backend, and "float16": false, I got the following while quantizing the text encoder.
It is telling me the "onnxrt_dml_ep" backend requires an NPU.

[2024-01-04 14:19:10,018] [WARNING] [onnxruntime]-[inc_quantization.py:430:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressorquantization with accuracy aware tuning.
2024-01-04 14:19:13 [INFO] Start auto tuning.
2024-01-04 14:19:13 [INFO] Quantize model without tuning!
2024-01-04 14:19:13 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-01-04 14:19:13 [INFO] Adaptor has 5 recipes.
2024-01-04 14:19:13 [INFO] 0 recipes specified by user.
2024-01-04 14:19:13 [INFO] 3 recipes require future tuning.
2024-01-04 14:19:13 [WARNING] Backend `onnxrt_dml_ep` requires a NPU device. Reset device to 'npu'.
2024-01-04 14:19:13 [INFO] *** Initialize auto tuning
Exception in thread Thread-40:
2024-01-04 14:19:13 [INFO] {
Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\threading.py", line 1016, in _bootstrap_inner
2024-01-04 14:19:13 [INFO]     'PostTrainingQuantConfig': {
2024-01-04 14:19:13 [INFO]         'AccuracyCriterion': {
2024-01-04 14:19:13 [INFO]             'criterion': 'relative',
2024-01-04 14:19:13 [INFO]             'higher_is_better': True,
2024-01-04 14:19:13 [INFO]             'tolerable_loss': 0.01,
2024-01-04 14:19:13 [INFO]             'absolute': None,
2024-01-04 14:19:13 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000002D325135D20>>,
2024-01-04 14:19:13 [INFO]             'relative': 0.01
2024-01-04 14:19:13 [INFO]         },
2024-01-04 14:19:13 [INFO]         'approach': 'post_training_dynamic_quant',
2024-01-04 14:19:13 [INFO]         'backend': 'onnxrt_dml_ep',
2024-01-04 14:19:13 [INFO]         'calibration_sampling_size': [
2024-01-04 14:19:13 [INFO]             100
2024-01-04 14:19:13 [INFO]         ],
2024-01-04 14:19:13 [INFO]         'device': 'cpu',
2024-01-04 14:19:13 [INFO]         'diagnosis': False,
2024-01-04 14:19:13 [INFO]         'domain': 'auto',
2024-01-04 14:19:13 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-01-04 14:19:13 [INFO]         'excluded_precisions': [
2024-01-04 14:19:13 [INFO]         ],
2024-01-04 14:19:13 [INFO]         'framework': 'onnxruntime',
2024-01-04 14:19:13 [INFO]         'inputs': [
2024-01-04 14:19:13 [INFO]         ],
2024-01-04 14:19:13 [INFO]         'model_name': '',
2024-01-04 14:19:13 [INFO]         'ni_workload_name': 'quantization',
2024-01-04 14:19:13 [INFO]         'op_name_dict': None,
2024-01-04 14:19:13 [INFO]         'op_type_dict': None,
2024-01-04 14:19:13 [INFO]         'outputs': [
2024-01-04 14:19:13 [INFO]         ],
2024-01-04 14:19:13 [INFO]         'quant_format': 'default',
2024-01-04 14:19:13 [INFO]         'quant_level': 'auto',
2024-01-04 14:19:13 [INFO]         'recipes': {
2024-01-04 14:19:13 [INFO]             'smooth_quant': False,
2024-01-04 14:19:13 [INFO]             'smooth_quant_args': {
2024-01-04 14:19:13 [INFO]             },
2024-01-04 14:19:13 [INFO]             'layer_wise_quant': False,
2024-01-04 14:19:13 [INFO]             'layer_wise_quant_args': {
2024-01-04 14:19:13 [INFO]             },
2024-01-04 14:19:13 [INFO]             'fast_bias_correction': False,
2024-01-04 14:19:13 [INFO]             'weight_correction': False,
2024-01-04 14:19:13 [INFO]             'gemm_to_matmul': True,
2024-01-04 14:19:13 [INFO]             'graph_optimization_level': None,
2024-01-04 14:19:13 [INFO]             'first_conv_or_matmul_quantization': True,
2024-01-04 14:19:13 [INFO]             'last_conv_or_matmul_quantization': True,
2024-01-04 14:19:13 [INFO]             'pre_post_process_quantization': True,
2024-01-04 14:19:13 [INFO]             'add_qdq_pair_to_weight': False,
2024-01-04 14:19:13 [INFO]             'optypes_to_exclude_output_quant': [
2024-01-04 14:19:13 [INFO]             ],
2024-01-04 14:19:13 [INFO]             'dedicated_qdq_pair': False,
2024-01-04 14:19:13 [INFO]             'rtn_args': {
2024-01-04 14:19:13 [INFO]             },
2024-01-04 14:19:13 [INFO]             'awq_args': {
2024-01-04 14:19:13 [INFO]             },
2024-01-04 14:19:13 [INFO]             'gptq_args': {
2024-01-04 14:19:13 [INFO]             },
2024-01-04 14:19:13 [INFO]             'teq_args': {
2024-01-04 14:19:13 [INFO]             }
2024-01-04 14:19:13 [INFO]         },
2024-01-04 14:19:13 [INFO]         'reduce_range': False,
2024-01-04 14:19:13 [INFO]         'TuningCriterion': {
2024-01-04 14:19:13 [INFO]             'max_trials': 100,
2024-01-04 14:19:13 [INFO]             'objective': [
2024-01-04 14:19:13 [INFO]                 'performance'
2024-01-04 14:19:13 [INFO]             ],
2024-01-04 14:19:13 [INFO]             'strategy': 'basic',
2024-01-04 14:19:13 [INFO]             'strategy_kwargs': None,
2024-01-04 14:19:13 [INFO]             'timeout': 0
2024-01-04 14:19:13 [INFO]         },
2024-01-04 14:19:13 [INFO]         'use_bf16': True
2024-01-04 14:19:13 [INFO]     }
2024-01-04 14:19:13 [INFO] }
    self.run()
  File "D:\miniconda3\envs\olivedml\lib\threading.py", line 1376, in run
    self.finished.wait(self.interval)
  File "D:\miniconda3\envs\olivedml\lib\threading.py", line 607, in wait
    signaled = self._cond.wait(timeout)
  File "D:\miniconda3\envs\olivedml\lib\threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
OverflowError: timeout value is too large
2024-01-04 14:19:14 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-01-04 14:19:14 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-01-04 14:19:14 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2024-01-04 14:19:16 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-01-04 14:19:16 [INFO] Quantize the model with default config.
2024-01-04 14:19:17 [INFO] |******Mixed Precision Statistics******|
2024-01-04 14:19:17 [INFO] +-----------------+----------+---------+
2024-01-04 14:19:17 [INFO] |     Op Type     |  Total   |   FP32  |
2024-01-04 14:19:17 [INFO] +-----------------+----------+---------+
2024-01-04 14:19:17 [INFO] |       Add       |   112    |   112   |
2024-01-04 14:19:17 [INFO] |     Sigmoid     |    12    |    12   |
2024-01-04 14:19:17 [INFO] |       Mul       |    49    |    49   |
2024-01-04 14:19:17 [INFO] |     Softmax     |    12    |    12   |
2024-01-04 14:19:17 [INFO] |      MatMul     |    96    |    96   |
2024-01-04 14:19:17 [INFO] |      Concat     |    76    |    76   |
2024-01-04 14:19:17 [INFO] |    Transpose    |    60    |    60   |
2024-01-04 14:19:17 [INFO] |     Squeeze     |    1     |    1    |
2024-01-04 14:19:17 [INFO] +-----------------+----------+---------+
2024-01-04 14:19:17 [INFO] Pass quantize model elapsed time: 920.16 ms
2024-01-04 14:19:17 [INFO] Save tuning history to E:\Stable Diffusion for Radeon\automatic\nc_workspace\2024-01-04_14-19-07\./history.snapshot.
2024-01-04 14:19:17 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-01-04 14:19:17 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-01-04 14:19:17 [INFO] Save deploy yaml to E:\Stable Diffusion for Radeon\automatic\nc_workspace\2024-01-04_14-19-07\deploy.yaml
[2024-01-04 14:19:18,617] [WARNING] [onnxruntime]-[engine.py:359:run_accelerator] Failed to run Olive on gpu-dml: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key
Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 348, in run_accelerator
    return self.run_search(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 518, in run_search
    should_prune, signal, model_ids = self._run_passes(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 837, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 1024, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\systems\local.py", line 49, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 225, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 143, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 779, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 525, in _evaluate_onnx_latency
    session = model.prepare_session(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\model\handler\onnx.py", line 109, in prepare_session
    session = get_ort_inference_session(self.model_path, inference_settings, self.use_ort_extensions)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\common\ort_inference.py", line 69, in get_ort_inference_session
    return ort.InferenceSession(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key
[2024-01-04 14:19:19,958] [INFO] [onnxruntime]-[engine.py:296:run] No packaging config provided, skip packaging artifacts

@guotuofeng
Collaborator

Would you please provide the arguments passed to sess.initialize_session, including providers/provider_options?

It seems to fail when creating the InferenceSession.

@guotuofeng
Collaborator

It seems similar to microsoft/onnxruntime#18885

@yuwenzho
Contributor

yuwenzho commented Jan 4, 2024

@lshqqytiger Don't worry about that warning log; INC automatically resets 'device' to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log, the INC quantization has completed.

@guotuofeng
Collaborator

@PatriceVignola, do you have any clue about the exception that happens in the DML EP?

@lshqqytiger
Author

Because I don't know about initialize_session, I got arguments by adding prints.

print(providers)
print(provider_options)
print(disabled_optimizers)
sess.initialize_session(providers, provider_options, disabled_optimizers)

out:

['DmlExecutionProvider']
[{}]
set()

@lshqqytiger
Author

@lshqqytiger Don't worry about that warning log, INC will automatically reset ‘device’ to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log info, the quantization of INC has been completed.

Got it. Thanks.

@guotuofeng
Collaborator

Because I don't know about initialize_session, I got arguments by adding prints.

print(providers)
print(provider_options)
print(disabled_optimizers)
sess.initialize_session(providers, provider_options, disabled_optimizers)

out:

['DmlExecutionProvider']
[{}]
set()

Thanks for the info. What's your model size? Is it possible to share it so we can take a look?

@lshqqytiger
Author

The invalid unordered_map<K, T> key error still occurs even though I removed the optimization pass, so I don't think it is related to that pass.
The model is runwayml/stable-diffusion-v1-5, converted to ONNX with the OnnxConversion pass.

@yuwenzho
Contributor

yuwenzho commented Jan 4, 2024

@lshqqytiger Don't worry about that warning log, INC will automatically reset ‘device’ to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log info, the quantization of INC has been completed.

Got it. Thanks.

Sorry, I just double-checked your logs and noticed that none of the operations were quantized; this is because DmlExecutionProvider in INC is currently only available for static quantization.

@lshqqytiger
Author

@lshqqytiger Don't worry about that warning log, INC will automatically reset ‘device’ to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log info, the quantization of INC has been completed.

Got it. Thanks.

Sorry, I just double checked your logs and I noticed that none of the operations are quantized, this is because DmlExecutionProvider in INC is currently only available for static quantization.

Okay. I removed "backend": "onnxrt_dml_ep" and now the text encoder quantizes without problems.

@lshqqytiger
Author

Now I'm getting this during UNet quantization.

Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\quantization.py", line 234, in fit
    strategy.traverse()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
    super().traverse()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 484, in traverse
    self._prepare_tuning()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 380, in _prepare_tuning
    self.capability = self.capability or self.adaptor.query_fw_capability(self.model)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 1225, in query_fw_capability
    self._pre_optimize(model)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 985, in _pre_optimize
    sess = ort.InferenceSession(model.model_path, sess_options, providers=[self.backend])
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for GroupNorm(1) node with name 'GroupNorm_0'

UNet passes:

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "unet",
    "opt_level": 0,
    "float16": false,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"]
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}

Should I disable group norm optimization?

@yuwenzho
Contributor

yuwenzho commented Jan 4, 2024

Should I disable group norm optimization?

Yes

@lshqqytiger
Author

With "enable_group_norm": false,, another error with UNet quantization.

[2024-01-04 15:19:25,865] [WARNING] [onnxruntime]-[engine.py:359:run_accelerator] Failed to run Olive on gpu-dml:
Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 348, in run_accelerator
    return self.run_search(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 518, in run_search
    should_prune, signal, model_ids = self._run_passes(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 837, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 1024, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\systems\local.py", line 49, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 225, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 143, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 779, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 525, in _evaluate_onnx_latency
    session = model.prepare_session(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\model\handler\onnx.py", line 109, in prepare_session
    session = get_ort_inference_session(self.model_path, inference_settings, self.use_ort_extensions)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\common\ort_inference.py", line 69, in get_ort_inference_session
    return ort.InferenceSession(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException

@yuwenzho
Contributor

yuwenzho commented Jan 4, 2024

It seems that something went wrong before reaching the INC quantization pass. Could you please provide some help? @guotuofeng

@lshqqytiger
Author

lshqqytiger commented Jan 4, 2024

2024-01-04 15:33:34.4898900 [E:onnxruntime:, inference_session.cc:1799 onnxruntime::InferenceSession::Initialize::<lambda_23a60f0e139c64fee3d9b96327699aaf>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\optimizer\initializer.cc:43 onnxruntime::Initializer::Initializer [ONNXRuntimeError] : 1 : FAIL : GetFileLength for cache\models\3_IncDynamicQuantization-2-c0850d79b40412102eb7e18807d5a62b-gpu-dml\output_model\weights.pb failed:open file weights.pb fail, errcode = 2 - ?

I think I found another error, which is probably the cause of the previous one. There is no weights.pb, only model.onnx, which is 840,906 KB. Why is it looking for weights.pb?
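One way to check whether the saved model still points at an external data file is the diagnostic sketch below (run it in the output_model directory; it assumes the onnx Python package is installed):

import onnx
from onnx.external_data_helper import uses_external_data

# Load only the graph structure, not tensor data, so a missing weights.pb does not abort the load.
model = onnx.load("model.onnx", load_external_data=False)
external = [init.name for init in model.graph.initializer if uses_external_data(init)]
print(f"{len(external)} initializers still reference external data")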

@guotuofeng
Collaborator

guotuofeng commented Jan 4, 2024

It seems all the passes finished and the exception happens when evaluating the resulting model. The failure again occurs when creating the InferenceSession.

Could you double-check the output model of IncDynamicQuantization for the weights?
Or you can clean your cache and rerun to verify.

@lshqqytiger
Author

lshqqytiger commented Jan 4, 2024

I think the output model has problems.

>>> diffusers.OnnxRuntimeModel.from_pretrained(".", provider="DmlExecutionProvider")
2024-01-04 15:47:03.1123015 [E:onnxruntime:, inference_session.cc:1799 onnxruntime::InferenceSession::Initialize::<lambda_23a60f0e139c64fee3d9b96327699aaf>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\optimizer\initializer.cc:43 onnxruntime::Initializer::Initializer [ONNXRuntimeError] : 1 : FAIL : GetFileLength for .\weights.pb failed:open file weights.pb fail, errcode = 2 - ?Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    return fn(*args, **kwargs)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 208, in from_pretrained
    return cls._from_pretrained(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 174, in _from_pretrained
    model = OnnxRuntimeModel.load_model(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
    return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException
>>> diffusers.OnnxRuntimeModel.load_model("./model.onnx", provider="DmlExecutionProvider")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
    return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException

After renaming model.onnx to weights.pb and copying nc_workspace\[date]\Optimized_model.onnx to model.onnx:

>>> diffusers.OnnxRuntimeModel.from_pretrained(".", provider="DmlExecutionProvider")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    return fn(*args, **kwargs)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 208, in from_pretrained
    return cls._from_pretrained(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 174, in _from_pretrained
    model = OnnxRuntimeModel.load_model(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
    return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException
>>> diffusers.OnnxRuntimeModel.load_model("./model.onnx", provider="DmlExecutionProvider")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
    return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException

The onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException doesn't seem to be related to the open file weights.pb fail error.

@PatriceVignola
Contributor

invalid unordered_map<K, T> key is a generic error that means the DML graph doesn't support some nodes in the model (we should certainly output a better error message here and fail early instead). I'll try to root-cause it locally.

@guotuofeng
Collaborator

guotuofeng commented Jan 4, 2024

The output model for IncDynamicQuantization is under cache\models\3_IncDynamicQuantization-hashvalue\output_model. The model file under nc_workspace is used by IncDynamicQuantization internally.

@yuwenzho
Contributor

yuwenzho commented Jan 4, 2024

@lshqqytiger The missing weights.pb error seems to be a bug in the INC quantization pass. I will let you know once I fix it.

@lshqqytiger
Author

Okay, thanks. I disabled GELU optimization and now it works. But I don't want to disable GroupNorm, so I will try again with static quantization and the "onnxrt_dml_ep" backend.
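For reference, a rough sketch of what that next attempt could look like as a Python dict ("backend" follows the Olive pass docs linked above; the calibration-data fields, user_script and dataloader_func, are assumptions here, since static quantization needs calibration data; check the IncStaticQuantization options for your Olive version):

# Sketch only: static quantization with the DML backend.
# The calibration fields below are assumed/hypothetical placeholders.
inc_static_quantize_pass = {
    "type": "IncStaticQuantization",
    "disable_search": True,
    "config": {
        "backend": "onnxrt_dml_ep",
        "user_script": "user_script.py",               # hypothetical script defining the dataloader
        "dataloader_func": "unet_calibration_reader",  # hypothetical calibration dataloader function
        "save_as_external_data": False,
        "all_tensors_to_one_file": True,
    },
}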

@yuwenzho
Contributor

yuwenzho commented Jan 4, 2024

@lshqqytiger I created PR #857 with a fix. Feel free to test with that branch at your convenience.

@lshqqytiger
Author

lshqqytiger commented Jan 4, 2024

Now I'm getting this exception with IncStaticQuantization and "onnxrt_dml_ep" backend.

Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\quantization.py", line 234, in fit
    strategy.traverse()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
    super().traverse()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 505, in traverse
    q_model = self.adaptor.quantize(copy.deepcopy(tune_cfg), self.model, self.calib_dataloader, self.q_func)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\utils\utility.py", line 304, in fi
    res = func(*args, **kwargs)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 401, in quantize
    quantize_params, _ = self._get_quantize_params(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 772, in _get_quantize_params
    self.min_max = augment.dump_minmax(quantize_config)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\ox_utils\calibration.py", line 477, in dump_minmax
    node_output_names, output_dicts = self.get_intermediate_outputs(q_config)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\ox_utils\calibration.py", line 252, in get_intermediate_outputs
    onnxruntime.InferenceSession(self.augmented_model.SerializeToString(), so, providers=[backend])
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyFromHost(1) node with name 'Memcpy'

Is Memcpy not implemented for DmlExecutionProvider? @PatriceVignola

@lshqqytiger lshqqytiger changed the title Is this pass flow possible for Stable Diffusion?: OrtTransformersOptimization → IncDynamicQuantization Is this pass flow possible for Stable Diffusion?: OrtTransformersOptimization → IncDynamicQuantization or IncStaticQuantization Jan 4, 2024
@guotuofeng
Collaborator

@lshqqytiger I created a fixing PR #857. Feel free to test with the fixing branch at your convenience.

@lshqqytiger, did you try the fix? If it fixes your error, I will merge it once CI passes.

@lshqqytiger
Author

I did and now the same error doesn't seem to happen anymore.

guotuofeng pushed a commit that referenced this issue Jan 4, 2024
## Describe your changes

Fix weight loading of large models in the INC quantization pass. Reload weights
for models > 2GB to prevent missing weight files.

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.

## (Optional) Issue link

#852

---------

Signed-off-by: yuwenzho <[email protected]>
@guotuofeng
Collaborator

@yuwenzho, the PR is merged.

@guotuofeng
Collaborator

I did and now the same error doesn't seem to happen anymore.

@lshqqytiger, so from my understanding, your current issue is the MemcpyFromHost not implemented one?

@lshqqytiger
Author

lshqqytiger commented Jan 5, 2024

Yes. But that's not all.

OrtTransformersOptimization -> IncDynamicQuantization

  1. default backend (CPU) with the example config: GroupNorm NotImplemented. (AFAIK GroupNorm is implemented for fp32, but this error appears even though I disabled float16.)
  2. default backend without GroupNorm (unet, vae) optimization: onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException during the quantization pass.
  3. default backend without GroupNorm (unet, vae) and GELU (unet) optimization: works fine.
  4. onnxrt_dml_ep backend (DML EP) without GroupNorm (unet, vae) and GELU (unet) optimization: onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key. According to yuwenzho's comment, only static quantization is supported for onnxrt_dml_ep. @yuwenzho Is there any plan to support the DML EP for dynamic quantization?

OrtTransformersOptimization -> IncStaticQuantization

  1. onnxrt_dml_ep backend: Memcpy NotImplemented while quantizing text encoder.

@guotuofeng
Collaborator

As @yuwenzho said, onnxrt_dml_ep only supports static quantization. Do you mean dynamic quantization or static quantization?

@lshqqytiger
Author

as @yuwenzho said, onnxrt_dml_ep only support static quantization. do you mean dynamic quantization or static quantization?

I tried both and wrote the results in my previous comment. Dynamic quantization + CPU backend + no GELU & GroupNorm optimization is the only combination that has no problems.

@yuwenzho
Contributor

yuwenzho commented Jan 5, 2024

@lshqqytiger The 'Memcpy NotImplemented' error seems to be a bug in INC; I am checking it and will let you know of any updates.

@yuwenzho
Contributor

Hi @lshqqytiger, I fixed it in intel/neural-compressor#1526. Please try again with that branch.

@lshqqytiger
Author

lshqqytiger commented Jan 11, 2024

Thank you for the fix! But now I'm getting onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key. This is exactly the same error as when I ran IncDynamicQuantization with DmlExecutionProvider. DmlExecutionProvider seems to be incompatible with both IncDynamicQuantization and IncStaticQuantization.

@lshqqytiger
Author

Does anyone know why it says GroupNorm is not implemented even though I disabled float16, when I run the IncDynamicQuantization pass after the OrtTransformersOptimization pass?

@jambayk
Contributor

jambayk commented Jan 11, 2024

I checked the onnxruntime source code. GroupNorm is a contrib op that is only implemented for the CUDA, ROCm, and DML EPs. So if you are running on CPU, this fusion and operator are not supported.

@lshqqytiger
Author

lshqqytiger commented Jan 11, 2024

Thank you! Then why can I load UNet models with CPUExecutionProvider without any NotImplemented error?

Python 3.10.12 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 19:01:18) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import diffusers
>>> diffusers.OnnxRuntimeModel.load_model("./model.onnx", provider="CPUExecutionProvider")
<onnxruntime.capi.onnxruntime_inference_collection.InferenceSession object at 0x000001EACC0D4D90>
>>> diffusers.OnnxRuntimeModel.from_pretrained(".", provider="CPUExecutionProvider")
<diffusers.pipelines.onnx_utils.OnnxRuntimeModel object at 0x000001EACC0A3FD0>

@jambayk
Contributor

jambayk commented Jan 11, 2024

Is this the transformers-optimized UNet model with GroupNorm fusion enabled? Do you only get the GroupNorm not-implemented error with the VAE models?

@lshqqytiger
Author

It is not an optimized one, just the ONNX-converted model. Okay, it may not have the GroupNorm op; that makes sense.
Then I have to wait until GroupNorm is implemented for CPUExecutionProvider in onnxruntime, or until neural-compressor supports onnxrt_dml_ep for the IncDynamicQuantization pass.
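A quick way to confirm whether a given exported model actually contains the contrib GroupNorm op (a sketch; point it at the model.onnx in question):

import onnx

# Count GroupNorm nodes; the plain ONNX-converted model typically has none,
# they only appear after the transformers-optimization GroupNorm fusion.
model = onnx.load("model.onnx", load_external_data=False)
group_norm_nodes = [n.name for n in model.graph.node if n.op_type == "GroupNorm"]
print(f"GroupNorm nodes: {len(group_norm_nodes)}")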

@lshqqytiger
Author

lshqqytiger commented Jan 11, 2024

Does anyone know why I get onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException with no error message or description during the IncDynamicQuantization pass when I disable GroupNorm and enable GELU (which is also enabled in the example) for UNet optimization?
