Is this pass flow possible for Stable Diffusion?: OrtTransformersOptimization → IncDynamicQuantization or IncStaticQuantization #852
Comments
Int8 quantization is normally applied to an fp32 model, not an fp16 model. If you look at our other examples, that's the only workflow we try: https://github.com/microsoft/Olive/blob/main/examples/llama2/llama2.py#L19 and https://github.com/microsoft/Olive/blob/main/examples/whisper/prepare_whisper_configs.py#L33. I am not sure whether fp16 transformers optimization and int8 quantization are fully compatible. Could you try turning fp16 off in the transformers optimization and see if the workflow works? With regard to the INC pass, @yuwenzho might have better insight.
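For reference, a minimal sketch of what turning fp16 off could look like in the optimization pass config, written here as a Python dict that mirrors the JSON pass configs shown later in this thread (the model_type value is just an example):

# Hedged sketch: keep the transformers-optimized model in fp32 so that
# int8 quantization runs on an fp32 input, as recommended above.
optimize_pass = {
    "type": "OrtTransformersOptimization",
    "config": {
        "model_type": "unet",  # example value; depends on the model being optimized
        "float16": False,      # disable fp16 conversion before int8 quantization
    },
}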
You can turn on debug logging for both INC and Olive by setting the corresponding log-level settings.
INC debug logging can be enabled by setting the environment variable LOGLEVEL=DEBUG.
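For example, the variable can be set from Python before the workflow starts (a minimal sketch; exporting it in the shell works just as well):

import os

# Enable neural-compressor (INC) debug logging for the Olive run.
os.environ["LOGLEVEL"] = "DEBUG"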
@lshqqytiger, you can try using Olive from the main branch, which includes a couple of fixes for INC logging support. With regard to fp16 model support in INC: if the fp16 model can be loaded using the CPU EP, I suppose INC quantization should support running on CPU. If not, the current INC might have some issues. @yuwenzho should have more comments.
@lshqqytiger, would you please try setting https://microsoft.github.io/Olive/api/passes.html#cmdoption-arg-backend to "onnxrt_dml_ep" and check whether the INC quantization can be done by the DML EP?
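A hedged sketch of what setting that backend could look like, again as a Python dict mirroring the JSON pass configs in this thread (the pass type and surrounding fields are assumptions):

# Hedged sketch: route the INC quantization pass to the DML EP backend.
inc_quantize_pass = {
    "type": "IncStaticQuantization",  # or IncDynamicQuantization; the static variant is what DML requires later in this thread
    "config": {
        "backend": "onnxrt_dml_ep",   # from the linked Olive pass documentation
    },
}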
@lshqqytiger DmlExecutionProvider is supported in INC for FP32 input models now.
Thank you for all your help.
Would you please help provide the arguments for sess.initialize_session, including providers/provider_options? It seems to fail when creating the InferenceSession.
It seems similar to microsoft/onnxruntime#18885.
@lshqqytiger Don't worry about that warning log; INC will automatically reset 'device' to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log info, the INC quantization has been completed.
@PatriceVignola, do you have any clue about the exception that happens in the DML EP?
Because I don't know about
out:
Got it. Thanks.
Thanks for the info. What's your model size? Is it possible to share it so that we can take a look?
Sorry, I just double-checked your logs and noticed that none of the operations are quantized. This is because DmlExecutionProvider in INC is currently only available for static quantization.
Okay. I removed
Now I'm getting this during UNet quantization.
UNet passes:
"optimize": {
"type": "OrtTransformersOptimization",
"disable_search": true,
"config": {
"model_type": "unet",
"opt_level": 0,
"float16": false,
"use_gpu": true,
"keep_io_types": false,
"optimization_options": {
"enable_gelu": true,
"enable_layer_norm": true,
"enable_attention": true,
"use_multi_head_attention": true,
"enable_skip_layer_norm": false,
"enable_embed_layer_norm": true,
"enable_bias_skip_layer_norm": false,
"enable_bias_gelu": true,
"enable_gelu_approximation": false,
"enable_qordered_matmul": false,
"enable_shape_inference": true,
"enable_gemm_fast_gelu": false,
"enable_nhwc_conv": false,
"enable_group_norm": true,
"enable_bias_splitgelu": false,
"enable_packed_qkv": true,
"enable_packed_kv": true,
"enable_bias_add": false,
"group_norm_channels_last": false
},
"force_fp32_ops": ["RandomNormalLike"]
}
},
"inc_quantize": {
"type": "IncDynamicQuantization",
"disable_search": true,
"config": {
"save_as_external_data": false,
"all_tensors_to_one_file": true
}
}
Should I disable group norm optimization?
Yes.
With
It seems that something went wrong before reaching the INC quantization pass. Could you please provide some help? @guotuofeng
I think I found another error, which is the cause of the previous one. There's no weights.pb, only model.onnx, which is 840,906 KB. Why is it looking for weights.pb?
It seems all passes finished running and the exception happens when evaluating the result model. The failure also happens when creating the InferenceSession. Could you double-check the output model of IncDynamicQuantization for the weights?
I think the output model has problems.
After renaming
The output model for IncDynamicQuantization is under cache\models\3_IncDynamicQuantization-hashvalue\output_model. The model file under nc_workspace is used by IncDynamicQuantization internally.
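As a quick sanity check on that output model, something like the following could be tried (a minimal sketch; the hash in the path is a placeholder and the file name is assumed):

import onnxruntime as ort

# Try to create an InferenceSession directly on the pass output to see whether
# the quantized model and any external weight files load cleanly.
model_path = r"cache\models\3_IncDynamicQuantization-hashvalue\output_model\model.onnx"  # placeholder path
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
print([inp.name for inp in sess.get_inputs()])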
@lshqqytiger The "no weights.pb" error seems to be a bug in the INC quantization pass. I will let you know once I fix it.
Okay. Thanks. I disabled GELU optimization and now it works. But I don't want to disable GroupNorm, so I will try again with static quantization and the "onnxrt_dml_ep" backend.
@lshqqytiger I created a fix in PR #857. Feel free to test with the fix branch at your convenience.
Now I'm getting this exception with IncStaticQuantization and the "onnxrt_dml_ep" backend.
Is Memcpy not implemented for DmlExecutionProvider? @PatriceVignola
@lshqqytiger, did you try the fix? If it fixes your error, I will merge it once the CI passes.
I did, and now the same error doesn't seem to happen anymore.
## Describe your changes
Fix weight loading of large models in the INC quantization pass. Reload weights for models > 2GB to prevent missing weight files.
## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
## (Optional) Issue link
#852
Signed-off-by: yuwenzho <[email protected]>
@yuwenzho, the PR is merged.
@lshqqytiger, so from my understanding, your current issue is the "MemcpyFromHost not implemented" one?
Yes. But that's not all.
OrtTransformersOptimization -> IncDynamicQuantization
OrtTransformersOptimization -> IncStaticQuantization
As @yuwenzho said, onnxrt_dml_ep only supports static quantization. Do you mean dynamic quantization or static quantization?
I tried both and wrote the results in my previous comment. Dynamic + CPU backend + no GELU & GroupNorm optimization is the only combination that has no problems.
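For clarity, that working combination corresponds roughly to a pass pair like the sketch below (a Python dict mirroring the JSON configs above; everything beyond the fields discussed in this thread is omitted or assumed):

# Hedged sketch of the only combination reported to work so far:
# fp32 transformers optimization with GELU/GroupNorm fusion disabled,
# followed by INC dynamic quantization on the default CPU backend.
working_passes = {
    "optimize": {
        "type": "OrtTransformersOptimization",
        "config": {
            "float16": False,
            "optimization_options": {
                "enable_gelu": False,        # disabled per the discussion above
                "enable_group_norm": False,  # disabled per the discussion above
            },
        },
    },
    "inc_quantize": {
        "type": "IncDynamicQuantization",
    },
}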
@lshqqytiger The "Memcpy NotImplemented" error seems to be a bug in INC; I am checking it and will let you know of any updates.
Hi @lshqqytiger, I fixed it in intel/neural-compressor#1526. Please use the fix branch to try again.
Thank you for the fix! But I'm getting
Does anyone know why it is saying GroupNorm is not implemented although I disabled
I checked the onnxruntime source code. GroupNorm is a contrib op that is only implemented for the CUDA, ROCm, and DML EPs. So if you are running on CPU, this fusion and operator are not supported.
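One way to check whether a given ONNX model actually contains the GroupNorm contrib op (a minimal sketch using the onnx package; the path is a placeholder):

import onnx

# Count GroupNorm nodes in the graph; if there are none, the model can still be
# loaded on CPUExecutionProvider even though the op has no CPU kernel.
model = onnx.load("./model.onnx")  # placeholder path
group_norm_nodes = [n for n in model.graph.node if n.op_type == "GroupNorm"]
print(f"GroupNorm nodes found: {len(group_norm_nodes)}")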
Thank you! Then, why can I load UNet models on CPUExecutionProvider without any NotImplemented error?
Python 3.10.12 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 19:01:18) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import diffusers
>>> diffusers.OnnxRuntimeModel.load_model("./model.onnx", provider="CPUExecutionProvider")
<onnxruntime.capi.onnxruntime_inference_collection.InferenceSession object at 0x000001EACC0D4D90>
>>> diffusers.OnnxRuntimeModel.from_pretrained(".", provider="CPUExecutionProvider")
<diffusers.pipelines.onnx_utils.OnnxRuntimeModel object at 0x000001EACC0A3FD0>
Is this the transformers-optimized UNet model with GroupNorm fusion enabled? Do you only get the GroupNorm not-implemented error with the VAE models?
It is not an optimized one, just an ONNX-converted model. Okay, it may not have the GroupNorm op. That makes sense.
Does anyone know why I get
Describe the bug and context
I'm trying to quantize an optimized Stable Diffusion model.
I got to know that IncDynamicQuantization has less reduction in inference speed than OnnxDynamicQuantization. But I'm getting IndexError during the UNet quantization pass. The error belongs to neural-compressor, but except for the optimization pass it works normally, so I think this would be a compatibility issue with OrtTransformersOptimization.
To Reproduce
Build neural-compressor from source, including "Enhance the ORT node name checking" (intel/neural-compressor#1512).*
* neural-compressor from pip will work with the text encoder, UNet, and VAE encoder, but the VAE decoder throws an error.
Expected behavior
UNet should be quantized.
Olive config
provider: DmlExecutionProvider
pass flow: ["optimize", "inc_quantize"]
text encoder passes:
unet passes:
I disabled group norm because I got a NotImplemented error with fp16, and with fp32 I got onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException without any error description/message. The NotImplemented error is from neural-compressor because it tries to create an InferenceSession with CPUExecutionProvider (fp16 GroupNorm is not implemented for CPU).
vae decoder passes:
vae encoder passes:
Olive logs
log.txt
Other information