
[Bug]: Optimization of Unet fails - AMD RDNA3.5 Strix Point Processor #1170

Open · woonyee28 opened this issue May 22, 2024 · 2 comments
Labels: DirectML

@woonyee28

Describe the bug

For context, I am using AMD RDNA3.5 architecture, Strix Point Processor.

Under Olive\examples\stable_diffusion I ran the following command:
python stable_diffusion.py --model_id stabilityai/stable-diffusion-2-1 --optimize --clean_cache
I encountered an error while optimizing the unet. The same error occurred when I ran python stable_diffusion_xl.py --model_id stabilityai/sdxl-turbo --optimize --clean_cache under Olive\examples\directml\stable_diffusion_xl.

The error is: onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(371)\onnxruntime_pybind11_state.pyd!00007FFA29BD2C5E: (caller: 00007FFA29BB9864) Exception(1) tid(35c4) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.
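To check whether the hang is in the DML inference session itself rather than in Olive, here is a minimal repro sketch (the model path and the dummy-input shapes are my assumptions and may need adjusting for the actual cache layout):

import numpy as np
import onnxruntime as ort

# Hypothetical path to the optimized unet in the Olive cache; adjust to your layout.
model_path = r"cache\models\12_OrtTransformersOptimization-6-b768c232-gpu-dml\output_model\model.onnx"

sess = ort.InferenceSession(model_path, providers=["DmlExecutionProvider"])

# Build all-zero dummy feeds; symbolic dims are filled with 1, which may need
# realistic values (e.g. latent height/width) for the unet to actually run.
dtype_map = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feeds[inp.name] = np.zeros(shape, dtype=dtype_map.get(inp.type, np.float32))

print(sess.run(None, feeds))  # the same session.run call that fails in the log below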

I have scrolled through the issue list, and multiple workarounds have been suggested. For example:

  1. Downgrading Python to 3.10 (I was previously using Python 3.12), as suggested in Whisper-medium conversion failed #1023.
  2. Setting "save_as_external_data": true (suggested in the same issue, #1023; see the config sketch below).
  3. Setting --temp-dir . (also suggested in #1023).

None of these worked.
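For reference, this is roughly where I added the setting for workaround 2, in the unet pass config (a sketch based on the #1023 suggestion; the target_opset value is illustrative, whatever the example config already uses):

"passes": {
  "convert": {
    "type": "OnnxConversion",
    "config": {
      "target_opset": 14,
      "save_as_external_data": true,
      "all_tensors_to_one_file": true
    }
  }
}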

I found the exact same problem reported in issue #517.

Other information

  • OS: Windows
  • Olive version: 0.6.0 (git clone main branch on 21 May Singapore Time)
  • ONNXRuntime package and version: onnxruntime-directml 1.18.0

Full error log:

Optimizing unet
[2024-05-22 15:44:40,729] [INFO] [run.py:279:run] Loading Olive module configuration from: C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\olive_config.json
[2024-05-22 15:44:40,734] [DEBUG] [olive_evaluator.py:1153:validate_metrics] No priority is specified, but only one sub type  metric is specified. Use rank 1 for single for this metric.
[2024-05-22 15:44:40,734] [DEBUG] [run.py:173:run_engine] Registering pass OnnxConversion
[2024-05-22 15:44:40,734] [DEBUG] [run.py:173:run_engine] Registering pass OrtTransformersOptimization
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:130:_fill_accelerators] The accelerator device and execution providers are specified, skipping deduce.
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:169:_check_execution_providers] Supported execution providers for device gpu: ['DmlExecutionProvider', 'CPUExecutionProvider']
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:199:create_accelerators] Initial accelerators and execution providers: {'gpu': ['DmlExecutionProvider']}
[2024-05-22 15:44:40,734] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OnnxConversion already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OpenVINOConversion already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OrtTransformersOptimization already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OrtTransformersOptimization already registered
[2024-05-22 15:44:40,734] [INFO] [engine.py:107:initialize] Using cache directory: cache
[2024-05-22 15:44:40,734] [INFO] [engine.py:263:run] Running Olive on accelerator: gpu-dml
[2024-05-22 15:44:40,734] [INFO] [engine.py:1075:_create_system] Creating target system ...
[2024-05-22 15:44:40,734] [DEBUG] [engine.py:1071:create_system] create native OliveSystem SystemType.Local
[2024-05-22 15:44:40,742] [INFO] [engine.py:1078:_create_system] Target system created in 0.007994 seconds
[2024-05-22 15:44:40,742] [INFO] [engine.py:1087:_create_system] Creating host system ...
[2024-05-22 15:44:40,742] [DEBUG] [engine.py:1071:create_system] create native OliveSystem SystemType.Local
[2024-05-22 15:44:40,742] [INFO] [engine.py:1090:_create_system] Host system created in 0.000000 seconds
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:709:_cache_model] Cached model 9c464b7b to cache\models\9c464b7b.json
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:336:run_accelerator] Running Olive in no-search mode ...
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:428:run_no_search] Running ['convert', 'optimize'] with no search ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:865:_run_pass] Running pass convert:OnnxConversion
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:884:_run_pass] Loading model from cache ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:899:_run_pass] Loaded model from cache: 6_OnnxConversion-9c464b7b-89c11e05 from cache\runs
[2024-05-22 15:44:40,764] [INFO] [engine.py:865:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:884:_run_pass] Loading model from cache ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:899:_run_pass] Loaded model from cache: 12_OrtTransformersOptimization-6-b768c232-gpu-dml from cache\runs       
[2024-05-22 15:44:40,764] [INFO] [engine.py:843:_run_passes] Run model evaluation for the final model...
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:1016:_evaluate_model] Evaluating model ...
[2024-05-22 15:44:40,764] [DEBUG] [resource_path.py:156:create_resource_path] Resource path C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\cache\models\12_OrtTransformersOptimization-6-b768c232-gpu-dml\output_model is inferred to be of type folder.
[2024-05-22 15:44:40,764] [DEBUG] [resource_path.py:156:create_resource_path] Resource path C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\cache\models\12_OrtTransformersOptimization-6-b768c232-gpu-dml\output_model is inferred to be of type folder.
[2024-05-22 15:44:40,779] [DEBUG] [olive_evaluator.py:238:generate_metric_user_config_with_model_io] Model input shapes are not static. Cannot use inferred input shapes for creating dummy data. This will cause an error when creating dummy data for tuning.
[2024-05-22 15:44:40,779] [DEBUG] [ort_inference.py:72:get_ort_inference_session] inference_settings: {'execution_provider': ['DmlExecutionProvider'], 'provider_options': None}
[2024-05-22 15:44:40,779] [DEBUG] [ort_inference.py:111:get_ort_inference_session] Normalized providers: ['DmlExecutionProvider'], provider_options: [{}]   
[2024-05-22 15:44:57,498] [WARNING] [engine.py:358:run_accelerator] Failed to run Olive on gpu-dml.
Traceback (most recent call last):
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 337, in run_accelerator
    output_footprint = self.run_no_search(
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 429, in run_no_search
    should_prune, signal, model_ids = self._run_passes(
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 844, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 1042, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\systems\local.py", line 47, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 205, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 123, in _evaluate_latency        
    latencies = self._evaluate_raw_latency(
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 762, in _evaluate_raw_latency    
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 543, in _evaluate_onnx_latency   
    latencies = session.time_run(
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\common\ort_inference.py", line 334, in time_run
    self.session.run(input_feed=input_feed, output_names=None)
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run    
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(371)\onnxruntime_pybind11_state.pyd!00007FFA29BD2C5E: (caller: 00007FFA29BB9864) Exception(1) tid(35c4) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.

[2024-05-22 15:44:58,009] [INFO] [engine.py:280:run] Run history for gpu-dml:
[2024-05-22 15:44:58,009] [INFO] [engine.py:570:dump_run_history] Please install tabulate for better run history output
[2024-05-22 15:44:58,009] [INFO] [engine.py:295:run] No packaging config provided, skip packaging artifacts
Traceback (most recent call last):
  File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 433, in <module>
    main()
  File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 370, in main
    optimize(common_args.model_id, common_args.provider, unoptimized_model_dir, optimized_model_dir)
  File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 253, in optimize
    save_optimized_onnx_submodel(submodel_name, provider, model_info)
  File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\sd_utils\ort.py", line 59, in save_optimized_onnx_submodel
    with footprints_file_path.open("r") as footprint_file:
  File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\wy-te\\OneDrive\\Desktop\\Projects\\Olive\\examples\\stable_diffusion\\footprints\\unet_gpu-dml_footprints.json'

Could this error be related to DXGI_ERROR_DEVICE_HUNG? (The HRESULT 887A0006 in the message corresponds to DXGI_ERROR_DEVICE_HUNG.)

@jstoecker, @guotuofeng, would love to hear your insights, thanks!

@guotuofeng (Collaborator)

@PatriceVignola, do you have any idea?

@devang-ml added the DirectML label on Jun 3, 2024
@Jay19751103

Set the registry value TdrLevel = 0.
SDXL needs a paging file of around 150 GB, and the optimization may trigger a TDR timeout event.
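For reference, a sketch of how to set that value (this is the standard TDR registry location; it needs an elevated prompt and a reboot to take effect, and with TDR disabled a genuinely hung GPU can freeze the whole desktop):

reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrLevel /t REG_DWORD /d 0 /f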
