# update sdxl demo (#18889)
### Description
(1) Support importing models from Olive.
(2) Add a Torch backend engine (eager and compile modes) to the demo.
(3) Use fp16 in most places.
(4) Remove some old pipeline scripts that are no longer useful; they have been replaced by the demo.
(5) Remove outdated benchmark results.
(6) Include PIL image conversion in the end-to-end latency measurement, for a fair comparison with diffusers since its default output type is PIL (see the sketch after this list).
(7) Remove seldom-used options such as force-rebuild-engine, hf-token, and refit.
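
Below is a minimal sketch of the measurement described in (6), not the demo's actual code: it times a diffusers-style pipeline call with `output_type="pil"` so that tensor-to-PIL conversion counts toward end-to-end latency. The `pipeline` object, its call signature, and the function name are assumptions for illustration.

```python
import time

def measure_end_to_end_latency(pipeline, prompt, batch_size=1, num_runs=5):
    """Return average end-to-end latency in seconds, including PIL conversion."""
    latencies = []
    for _ in range(num_runs):
        start = time.perf_counter()
        # Request PIL output so image conversion is part of the timed region,
        # matching the diffusers default output type.
        images = pipeline([prompt] * batch_size, output_type="pil").images
        latencies.append(time.perf_counter() - start)
        assert len(images) == batch_size
    return sum(latencies) / len(latencies)
```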

tianleiwu authored Dec 20, 2023 · 1 parent 9a61388 · commit 2d6e2e2
Showing 21 changed files with 1,305 additions and 2,524 deletions.
@@ -21,7 +21,7 @@ These optimizations are firstly carried out on CUDA EP. They may not work on oth
| [demo_txt2img.py](./demo_txt2img.py) | Demo of text to image generation using Stable Diffusion models except XL. |
| [optimize_pipeline.py](./optimize_pipeline.py) | Optimize Stable Diffusion ONNX models exported from Huggingface diffusers or optimum |
| [benchmark.py](./benchmark.py) | Benchmark latency and memory of OnnxRuntime, xFormers or PyTorch 2.0 on stable diffusion. |
-| [benchmark_turbo.py](./benchmark_controlnet.py)| Benchmark latency of PyTorch or Stable-Fast with canny control net. |
+| [benchmark_controlnet.py](./benchmark_controlnet.py)| Benchmark latency of canny control net. |

## Run demo with docker

@@ -379,97 +379,6 @@ Common settings for below test results:
| ------------------------------ | ---------------------- | ------ | ----- | ----- | ----------- | ----------- |
| runwayml/stable-diffusion-v1-5 | TRUE | 512 | 512 | 50 | 5 | 1 |

#### Results of RTX 3060 (Windows 11)

| engine      | version                  | provider              | batch size | average latency (s) | first run memory (MB) | second run memory (MB) |
| ----------- | ----------------------- | --------------------- | ---------- | --------------- | ------------------- | -------------------- |
| onnxruntime | 1.14.1 | CUDA | 1 | 4.8 | 4,117 | 4,625 |
| torch | 2.0.0+cu117 | default | 1 | 5.6 | 4,325 | 4,047 |
| torch | 1.13.1+cu117 | xformers | 1 | 6.0 | 9,124 | 9,130 |
| onnxruntime | 1.14.1 | CUDA | 4 | 17.7 | 6,659 | 6,659 |
| torch | 2.0.0+cu117 | default | 4 | 20.1 | 6,421 | 6,907 |
| torch | 1.13.1+cu117 | xformers | 4 | 21.6 | 10,407 | 10,409 |
| onnxruntime | 1.14.1 | CUDA | 8 | 33.5 | 6,663 | 6,663 |
| torch | 2.0.0+cu117 | default | 8 | 39.5 | 10,767 | 10,813 |
| torch | 1.13.1+cu117 | xformers | 8 | 41.1 | 10,825 | 9,255 |


#### Results of A100-SXM4-40GB (Ubuntu 20.04)
| engine      | version                  | provider              | batch size | average latency (s) | first run memory (MB) | second run memory (MB) |
| ----------- | ----------------------- | --------------------- | ---------- | --------------- | ------------------- | -------------------- |
| onnxruntime | 1.14.1 | CUDA | 1 | 1.1 | 6,883 | 7,395 |
| torch | 2.0.0+cu117 | default | 1 | 1.5 | 13,828 | 4,400 |
| torch | 2.0.0+cu117 | compile | 1 | 1.8 | 13,892 | 4,386 |
| onnxruntime | 1.14.1 | CUDA | 4 | 3.7 | 7,381 | 7,381 |
| torch | 2.0.0+cu117 | default | 4 | 3.9 | 31,278 | 6,870 |
| torch | 2.0.0+cu117 | compile | 4 | 3.4 | 31,364 | 6,880 |
| onnxruntime | 1.14.1 | CUDA | 8 | 6.9 | 7,411 | 7,411 |
| torch | 2.0.0+cu117 | default | 8 | 7.6 | 31,660 | 10,122 |
| torch | 2.0.0+cu117 | compile | 8 | 6.5 | 31,800 | 10,308 |
| onnxruntime | 1.14.1 | CUDA | 16 | 13.6 | 11,479 | 11,479 |
| torch | 2.0.0+cu117 | default | 16 | 14.8 | 32,306 | 16,520 |
| torch | 2.0.0+cu117 | compile | 16 | 12.6 | 32,636 | 16,898 |

#### Results of A100-PCIE-80GB (Ubuntu 20.04)
| engine      | version                  | provider              | batch size | average latency (s) | first run memory (MB) | second run memory (MB) |
| ----------- | ----------------------- | --------------------- | ---------- | --------------- | ------------------- | -------------------- |
| tensorrt | 8.6.1 | default | 1 | 1.00 | 9,056 | 9,056 |
| onnxruntime | 1.16.0 nightly | tensorrt | 1 | 1.09 | 11,250 | 11,250 |
| onnxruntime | 1.16.0 nightly | tensorrt (cuda graph) | 1 | 0.96 | 11,382 | 11,382 |
| onnxruntime | 1.16.0 nightly | cuda | 1 | 1.11 | 4,760 | 5,144 |
| onnxruntime | 1.16.0 nightly | cuda (cuda graph) | 1 | 1.04 | 5,230 | 5,390 |
| tensorrt | 8.6.1 | default | 4 | 3.39 | 9,072 | 9,072 |
| onnxruntime | 1.16.0 nightly | tensorrt | 4 | 3.60 | 11,266 | 11,266 |
| onnxruntime | 1.16.0 nightly | tensorrt (cuda graph) | 4 | 3.43 | 11,428 | 11,428 |

#### Results of V100-PCIE-16GB (Ubuntu 20.04)

Results are from a Standard_NC6s_v3 Azure virtual machine:

| engine      | version                  | provider              | batch size | average latency (s) | first run memory (MB) | second run memory (MB) |
| ----------- | ----------------------- | --------------------- | ---------- | --------------- | ------------------- | -------------------- |
| onnxruntime | 1.14.1 | CUDA | 1 | 2.7 | 12,646 | 7,152 |
| torch | 2.0.0+cu117 | compile | 1 | 3.2 | 13,317 | 3,909 |
| torch | 2.0.0+cu117 | default | 1 | 2.7 | 13,343 | 3,921 |
| torch | 1.13.1+cu117 | xformers | 1 | 3.5 | 14,979 | 10,449 |
| onnxruntime | 1.14.1 | CUDA | 4 | 8.4 | 7,114 | 7,114 |
| torch | 2.0.0+cu117 | compile | 4 | 8.0 | 13,897 | 6,821 |
| torch | 2.0.0+cu117 | default | 4 | 8.7 | 13,873 | 6,607 |
| torch | 1.13.1+cu117 | xformers | 4 | 9.1 | 12,969 | 8,421 |
| onnxruntime | 1.14.1 | CUDA | 8 | 15.9 | 7,120 | 7,120 |
| torch | 2.0.0+cu117 | compile | 8 | 15.5 | 14,669 | 10,355 |
| torch | 2.0.0+cu117 | default | 8 | 17.0 | 14,469 | 9,657 |
| torch | 1.13.1+cu117 | xformers | 8 | 17.4 | 15,593 | 9,133 |

#### Results of T4 (Ubuntu 20.04)

To make the results stable and the comparison fair, we lock the frequency of the T4 GPU with a command like `sudo nvidia-smi --lock-gpu-clocks=990`. See the [nvidia blog](https://developer.nvidia.com/blog/advanced-api-performance-setstablepowerstate/) for more information. Note that performance might be slightly better without locking the frequency.
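
A hedged sketch of automating the lock-and-restore around a benchmark run, not part of the repository: it assumes `nvidia-smi` is on PATH and sudo is available, and `run_with_locked_clocks` and `benchmark_fn` are illustrative names.

```python
import subprocess

def run_with_locked_clocks(benchmark_fn, clock_mhz=990):
    """Run benchmark_fn with the GPU clock pinned, then restore defaults."""
    # Pin the GPU clock for stable measurements (same command as above).
    subprocess.run(["sudo", "nvidia-smi", f"--lock-gpu-clocks={clock_mhz}"], check=True)
    try:
        return benchmark_fn()
    finally:
        # Undo the lock so the GPU returns to default clock management.
        subprocess.run(["sudo", "nvidia-smi", "--reset-gpu-clocks"], check=True)
```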

Results are from a Standard_NC4as_T4_v3 Azure virtual machine:

| engine      | version                  | provider              | batch size | average latency (s) | first run memory (MB) | second run memory (MB) |
| ----------- | ----------------------- | --------------------- | ---------- | --------------- | ------------------- | -------------------- |
| onnxruntime | 1.14.1 | CUDA | 1 | 5.6 | 4,925 | 4,925 |
| onnxruntime | 1.15.1 | CUDA | 1 | 5.5 | 3,738 | 4,250 |
| onnxruntime | 1.15.1 (tensorrt 8.6.1) | Tensorrt | 1 | 4.8 | 10,710 | 10,710 |
| onnxruntime | 1.16.0 nightly | Tensorrt (cuda graph) | 1 | 4.7 | 11,746 | 10,746 |
| tensorrt | 8.6.1 | default | 1 | 5.0 | 8,530 | 8,530 |
| torch | 1.13.1+cu117 | xformers | 1 | 6.9 | 14,845 | 10,317 |
| torch | 2.0.0+cu117 | compile | 1 | 6.0 | 12,989 | 3,841 |
| torch | 2.0.0+cu117 | default | 1 | 6.4 | 12,987 | 3,841 |
| onnxruntime | 1.14.1 | CUDA | 4 | 23.0 | 6,977 | 6,977 |
| onnxruntime | 1.15.1 | CUDA | 4 | 22.6 | 6,298 | 6,298 |
| onnxruntime | 1.15.1 (tensorrt 8.6.1) | Tensorrt | 4 | 21.8 | 10,746 | 10,746 |
| tensorrt | 8.6.1 | default | 4 | 22.2 | 8,542 | 8,542 |
| torch | 1.13.1+cu117 | xformers | 4 | 25.8 | 12,819 | 8,269 |
| torch | 2.0.0+cu117 | compile | 4 | 22.2 | 14,637 | 6,583 |
| torch | 2.0.0+cu117 | default | 4 | 25.2 | 14,409 | 6,355 |
| onnxruntime | 1.14.1 | CUDA | 8 | 46.4 | 6,779 | 6,779 |
| torch | 1.13.1+cu117 | xformers | 8 | 51.4 | 14,827 | 9,001 |
| torch | 2.0.0+cu117 | compile | 8 | 46.5 | 12,595 | 10,171 |
| torch | 2.0.0+cu117 | default | 8 | 50.7 | 11,955 | 9,531 |

#### Results of MI250X, 1 GCD (Ubuntu 20.04)

| engine      | version                  | provider              | batch size | average latency (s) | first run memory (MB) | second run memory (MB) |