For context, I am using the AMD RDNA 3.5 architecture (Strix Point processor).
Under `Olive\examples\stable_diffusion` I ran the following command: `python stable_diffusion.py --model_id stabilityai/stable-diffusion-2-1 --optimize --clean_cache`
I encountered an error while the unet submodel was being optimized.
The same error occurred when I ran `python stable_diffusion_xl.py --model_id stabilityai/sdxl-turbo --optimize --clean_cache` under `Olive\examples\directml\stable_diffusion_xl`.
The error is: onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(371)\onnxruntime_pybind11_state.pyd!00007FFA29BD2C5E: (caller: 00007FFA29BB9864) Exception(1) tid(35c4) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.
I have scrolled through the issue list; multiple approaches have been suggested, and I have tried them (see below), but none worked.
I found the exact same problem reported in this issue: #517
Other information
OS: Windows
Olive version: 0.6.0 (git clone main branch on 21 May Singapore Time)
ONNXRuntime package and version: onnxruntime-directml 1.18.0
For full error log:
Optimizing unet
[2024-05-22 15:44:40,729] [INFO] [run.py:279:run] Loading Olive module configuration from: C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\olive_config.json
[2024-05-22 15:44:40,734] [DEBUG] [olive_evaluator.py:1153:validate_metrics] No priority is specified, but only one sub type metric is specified. Use rank 1 for single for this metric.
[2024-05-22 15:44:40,734] [DEBUG] [run.py:173:run_engine] Registering pass OnnxConversion
[2024-05-22 15:44:40,734] [DEBUG] [run.py:173:run_engine] Registering pass OrtTransformersOptimization
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:130:_fill_accelerators] The accelerator device and execution providers are specified, skipping deduce.
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:169:_check_execution_providers] Supported execution providers for device gpu: ['DmlExecutionProvider', 'CPUExecutionProvider']
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:199:create_accelerators] Initial accelerators and execution providers: {'gpu': ['DmlExecutionProvider']}
[2024-05-22 15:44:40,734] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OnnxConversion already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OpenVINOConversion already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OrtTransformersOptimization already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OrtTransformersOptimization already registered
[2024-05-22 15:44:40,734] [INFO] [engine.py:107:initialize] Using cache directory: cache
[2024-05-22 15:44:40,734] [INFO] [engine.py:263:run] Running Olive on accelerator: gpu-dml
[2024-05-22 15:44:40,734] [INFO] [engine.py:1075:_create_system] Creating target system ...
[2024-05-22 15:44:40,734] [DEBUG] [engine.py:1071:create_system] create native OliveSystem SystemType.Local
[2024-05-22 15:44:40,742] [INFO] [engine.py:1078:_create_system] Target system created in 0.007994 seconds
[2024-05-22 15:44:40,742] [INFO] [engine.py:1087:_create_system] Creating host system ...
[2024-05-22 15:44:40,742] [DEBUG] [engine.py:1071:create_system] create native OliveSystem SystemType.Local
[2024-05-22 15:44:40,742] [INFO] [engine.py:1090:_create_system] Host system created in 0.000000 seconds
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:709:_cache_model] Cached model 9c464b7b to cache\models\9c464b7b.json
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:336:run_accelerator] Running Olive in no-search mode ...
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:428:run_no_search] Running ['convert', 'optimize'] with no search ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:865:_run_pass] Running pass convert:OnnxConversion
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:884:_run_pass] Loading model from cache ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:899:_run_pass] Loaded model from cache: 6_OnnxConversion-9c464b7b-89c11e05 from cache\runs
[2024-05-22 15:44:40,764] [INFO] [engine.py:865:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:884:_run_pass] Loading model from cache ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:899:_run_pass] Loaded model from cache: 12_OrtTransformersOptimization-6-b768c232-gpu-dml from cache\runs
[2024-05-22 15:44:40,764] [INFO] [engine.py:843:_run_passes] Run model evaluation for the final model...
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:1016:_evaluate_model] Evaluating model ...
[2024-05-22 15:44:40,764] [DEBUG] [resource_path.py:156:create_resource_path] Resource path C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\cache\models\12_OrtTransformersOptimization-6-b768c232-gpu-dml\output_model is inferred to be of type folder.
[2024-05-22 15:44:40,764] [DEBUG] [resource_path.py:156:create_resource_path] Resource path C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\cache\models\12_OrtTransformersOptimization-6-b768c232-gpu-dml\output_model is inferred to be of type folder.
[2024-05-22 15:44:40,779] [DEBUG] [olive_evaluator.py:238:generate_metric_user_config_with_model_io] Model input shapes are not static. Cannot use inferred input shapes for creating dummy data. This will cause an error when creating dummy data for tuning.
[2024-05-22 15:44:40,779] [DEBUG] [ort_inference.py:72:get_ort_inference_session] inference_settings: {'execution_provider': ['DmlExecutionProvider'], 'provider_options': None}
[2024-05-22 15:44:40,779] [DEBUG] [ort_inference.py:111:get_ort_inference_session] Normalized providers: ['DmlExecutionProvider'], provider_options: [{}]
[2024-05-22 15:44:57,498] [WARNING] [engine.py:358:run_accelerator] Failed to run Olive on gpu-dml.
Traceback (most recent call last):
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 337, in run_accelerator
    output_footprint = self.run_no_search(
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 429, in run_no_search
    should_prune, signal, model_ids = self._run_passes(
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 844, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 1042, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\systems\local.py", line 47, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 205, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 123, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 762, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 543, in _evaluate_onnx_latency
    latencies = session.time_run(
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\common\ort_inference.py", line 334, in time_run
    self.session.run(input_feed=input_feed, output_names=None)
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(371)\onnxruntime_pybind11_state.pyd!00007FFA29BD2C5E: (caller: 00007FFA29BB9864) Exception(1) tid(35c4) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.
[2024-05-22 15:44:58,009] [INFO] [engine.py:280:run] Run history for gpu-dml:
[2024-05-22 15:44:58,009] [INFO] [engine.py:570:dump_run_history] Please install tabulate for better run history output
[2024-05-22 15:44:58,009] [INFO] [engine.py:295:run] No packaging config provided, skip packaging artifacts
Traceback (most recent call last):
  File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 433, in <module>
    main()
  File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 370, in main
    optimize(common_args.model_id, common_args.provider, unoptimized_model_dir, optimized_model_dir)
  File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 253, in optimize
    save_optimized_onnx_submodel(submodel_name, provider, model_info)
  File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\sd_utils\ort.py", line 59, in save_optimized_onnx_submodel
    with footprints_file_path.open("r") as footprint_file:
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\wy-te\\OneDrive\\Desktop\\Projects\\Olive\\examples\\stable_diffusion\\footprints\\unet_gpu-dml_footprints.json'
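As far as I can tell, the FileNotFoundError at the end is a downstream symptom rather than a separate bug: the footprints JSON only gets written when the run succeeds, so the read in sd_utils/ort.py fails after the DML error. A guarded read along these lines (my sketch, not the actual Olive code; the path is taken from the error above) would surface the real failure:

```python
from pathlib import Path

# Hypothetical guard around the footprint read; path taken from the error above.
footprints_file_path = Path("footprints") / "unet_gpu-dml_footprints.json"
if footprints_file_path.exists():
    footprints = footprints_file_path.read_text()
    status = "loaded"
else:
    # A missing footprints file means the optimization pass failed before
    # writing it, so point back at the earlier error instead of raising ENOENT.
    status = f"{footprints_file_path} missing; check the optimization error above"
print(status)
```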
Could this error be related to DXGI_ERROR_DEVICE_HUNG? (887A0006 is the HRESULT for DXGI_ERROR_DEVICE_HUNG.)
One of the approaches I tried was setting `"save_as_external_data": true` (suggested in Whisper-medium conversion failed #1023), but none of these worked.
@jstoecker, @guotuofeng Would love to hear insights from y'all, thanks!