update sdxl demo

microsoft · Dec 20, 2023 · d0c0478
1 parent 8931854
commit d0c0478

Showing 21 changed files with 1,290 additions and 2,492 deletions.
diff --git a/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md b/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md
@@ -21,7 +21,7 @@ These optimizations are firstly carried out on CUDA EP. They may not work on oth
 | [demo_txt2img.py](./demo_txt2img.py)           | Demo of text to image generation using Stable Diffusion models except XL.                 |
 | [optimize_pipeline.py](./optimize_pipeline.py) | Optimize Stable Diffusion ONNX models exported from Huggingface diffusers or optimum      |
 | [benchmark.py](./benchmark.py)                 | Benchmark latency and memory of OnnxRuntime, xFormers or PyTorch 2.0 on stable diffusion. |
-| [benchmark_turbo.py](./benchmark_controlnet.py)| Benchmark latency of PyTorch or Stable-Fast with canny control net.                       |
+| [benchmark_controlnet.py](./benchmark_controlnet.py)| Benchmark latency of canny control net.                                              |
 
 ## Run demo with docker
 
@@ -379,97 +379,6 @@ Common settings for below test results:
 | ------------------------------ | ---------------------- | ------ | ----- | ----- | ----------- | ----------- |
 | runwayml/stable-diffusion-v1-5 | TRUE                   | 512    | 512   | 50    | 5           | 1           |
 
-#### Results of RTX 3060 (Windows 11)
-
-| engine      | version                 | provider              | batch size | average latency | first run memory MB | second run memory MB |
-| ----------- | ----------------------- | --------------------- | ---------- | --------------- | ------------------- | -------------------- |
-| onnxruntime | 1.14.1                  | CUDA                  | 1          | 4.8             | 4,117               | 4,625                |
-| torch       | 2.0.0+cu117             | default               | 1          | 5.6             | 4,325               | 4,047                |
-| torch       | 1.13.1+cu117            | xformers              | 1          | 6.0             | 9,124               | 9,130                |
-| onnxruntime | 1.14.1                  | CUDA                  | 4          | 17.7            | 6,659               | 6,659                |
-| torch       | 2.0.0+cu117             | default               | 4          | 20.1            | 6,421               | 6,907                |
-| torch       | 1.13.1+cu117            | xformers              | 4          | 21.6            | 10,407              | 10,409               |
-| onnxruntime | 1.14.1                  | CUDA                  | 8          | 33.5            | 6,663               | 6,663                |
-| torch       | 2.0.0+cu117             | default               | 8          | 39.5            | 10,767              | 10,813               |
-| torch       | 1.13.1+cu117            | xformers              | 8          | 41.1            | 10,825              | 9,255                |
-
-
-#### Results of A100-SXM4-40GB (Ubuntu 20.04)
-| engine      | version                 | provider              | batch size | average latency | first run memory MB | second run memory MB |
-| ----------- | ----------------------- | --------------------- | ---------- | --------------- | ------------------- | -------------------- |
-| onnxruntime | 1.14.1                  | CUDA                  | 1          | 1.1             | 6,883               | 7,395                |
-| torch       | 2.0.0+cu117             | default               | 1          | 1.5             | 13,828              | 4,400                |
-| torch       | 2.0.0+cu117             | compile               | 1          | 1.8             | 13,892              | 4,386                |
-| onnxruntime | 1.14.1                  | CUDA                  | 4          | 3.7             | 7,381               | 7,381                |
-| torch       | 2.0.0+cu117             | default               | 4          | 3.9             | 31,278              | 6,870                |
-| torch       | 2.0.0+cu117             | compile               | 4          | 3.4             | 31,364              | 6,880                |
-| onnxruntime | 1.14.1                  | CUDA                  | 8          | 6.9             | 7,411               | 7,411                |
-| torch       | 2.0.0+cu117             | default               | 8          | 7.6             | 31,660              | 10,122               |
-| torch       | 2.0.0+cu117             | compile               | 8          | 6.5             | 31,800              | 10,308               |
-| onnxruntime | 1.14.1                  | CUDA                  | 16         | 13.6            | 11,479              | 11,479               |
-| torch       | 2.0.0+cu117             | default               | 16         | 14.8            | 32,306              | 16,520               |
-| torch       | 2.0.0+cu117             | compile               | 16         | 12.6            | 32,636              | 16,898               |
-
-#### Results of A100-PCIE-80GB (Ubuntu 20.04)
-| engine      | version                 | provider              | batch size | average latency | first run memory MB | second run memory MB |
-| ----------- | ----------------------- | --------------------- | ---------- | --------------- | ------------------- | -------------------- |
-| tensorrt    | 8.6.1                   | default               | 1          | 1.00            | 9,056               | 9,056                |
-| onnxruntime | 1.16.0 nightly          | tensorrt              | 1          | 1.09            | 11,250              | 11,250               |
-| onnxruntime | 1.16.0 nightly          | tensorrt (cuda graph) | 1          | 0.96            | 11,382              | 11,382               |
-| onnxruntime | 1.16.0 nightly          | cuda                  | 1          | 1.11            | 4,760               | 5,144                |
-| onnxruntime | 1.16.0 nightly          | cuda (cuda graph)     | 1          | 1.04            | 5,230               | 5,390                |
-| tensorrt    | 8.6.1                   | default               | 4          | 3.39            | 9,072               | 9,072                |
-| onnxruntime | 1.16.0 nightly          | tensorrt              | 4          | 3.60            | 11,266              | 11,266               |
-| onnxruntime | 1.16.0 nightly          | tensorrt (cuda graph) | 4          | 3.43            | 11,428              | 11,428               |
-
-#### Results of V100-PCIE-16GB (Ubuntu 20.04)
-
-Results from Standard_NC6s_v3 Azure virtual machine:
-
-| engine      | version                 | provider              | batch size | average latency | first run memory MB | second run memory MB |
-| ----------- | ----------------------- | --------------------- | ---------- | --------------- | ------------------- | -------------------- |
-| onnxruntime | 1.14.1                  | CUDA                  | 1          | 2.7             | 12,646              | 7,152                |
-| torch       | 2.0.0+cu117             | compile               | 1          | 3.2             | 13,317              | 3,909                |
-| torch       | 2.0.0+cu117             | default               | 1          | 2.7             | 13,343              | 3,921                |
-| torch       | 1.13.1+cu117            | xformers              | 1          | 3.5             | 14,979              | 10,449               |
-| onnxruntime | 1.14.1                  | CUDA                  | 4          | 8.4             | 7,114               | 7,114                |
-| torch       | 2.0.0+cu117             | compile               | 4          | 8.0             | 13,897              | 6,821                |
-| torch       | 2.0.0+cu117             | default               | 4          | 8.7             | 13,873              | 6,607                |
-| torch       | 1.13.1+cu117            | xformers              | 4          | 9.1             | 12,969              | 8,421                |
-| onnxruntime | 1.14.1                  | CUDA                  | 8          | 15.9            | 7,120               | 7,120                |
-| torch       | 2.0.0+cu117             | compile               | 8          | 15.5            | 14,669              | 10,355               |
-| torch       | 2.0.0+cu117             | default               | 8          | 17.0            | 14,469              | 9,657                |
-| torch       | 1.13.1+cu117            | xformers              | 8          | 17.4            | 15,593              | 9,133                |
-
-#### Results of T4 (Ubuntu 20.04)
-
-To make the result stable, we lock the frequency of T4 GPU like
-`sudo nvidia-smi --lock-gpu-clocks=990` for fair comparison. See [nvidia blog](https://developer.nvidia.com/blog/advanced-api-performance-setstablepowerstate/) for more information. Note that performance might be slightly better without locking frequency.
-
-Results are from Standard_NC4as_T4_v3 Azure virtual machine:
-
-| engine      | version                 | provider              | batch size | average latency | first run memory MB | second run memory MB |
-| ----------- | ----------------------- | --------------------- | ---------- | --------------- | ------------------- | -------------------- |
-| onnxruntime | 1.14.1                  | CUDA                  | 1          | 5.6             | 4,925               | 4,925                |
-| onnxruntime | 1.15.1                  | CUDA                  | 1          | 5.5             | 3,738               | 4,250                |
-| onnxruntime | 1.15.1 (tensorrt 8.6.1) | Tensorrt              | 1          | 4.8             | 10,710              | 10,710               |
-| onnxruntime | 1.16.0 nightly          | Tensorrt (cuda graph) | 1          | 4.7             | 11,746              | 10,746               |
-| tensorrt    | 8.6.1                   | default               | 1          | 5.0             | 8,530               | 8,530                |
-| torch       | 1.13.1+cu117            | xformers              | 1          | 6.9             | 14,845              | 10,317               |
-| torch       | 2.0.0+cu117             | compile               | 1          | 6.0             | 12,989              | 3,841                |
-| torch       | 2.0.0+cu117             | default               | 1          | 6.4             | 12,987              | 3,841                |
-| onnxruntime | 1.14.1                  | CUDA                  | 4          | 23.0            | 6,977               | 6,977                |
-| onnxruntime | 1.15.1                  | CUDA                  | 4          | 22.6            | 6,298               | 6,298                |
-| onnxruntime | 1.15.1 (tensorrt 8.6.1) | Tensorrt              | 4          | 21.8            | 10,746              | 10,746               |
-| tensorrt    | 8.6.1                   | default               | 4          | 22.2            | 8,542               | 8,542                |
-| torch       | 1.13.1+cu117            | xformers              | 4          | 25.8            | 12,819              | 8,269                |
-| torch       | 2.0.0+cu117             | compile               | 4          | 22.2            | 14,637              | 6,583                |
-| torch       | 2.0.0+cu117             | default               | 4          | 25.2            | 14,409              | 6,355                |
-| onnxruntime | 1.14.1                  | CUDA                  | 8          | 46.4            | 6,779               | 6,779                |
-| torch       | 1.13.1+cu117            | xformers              | 8          | 51.4            | 14,827              | 9,001                |
-| torch       | 2.0.0+cu117             | compile               | 8          | 46.5            | 12,595              | 10,171               |
-| torch       | 2.0.0+cu117             | default               | 8          | 50.7            | 11,955              | 9,531                |
-
 #### Results of MI250X, 1 GCD (Ubuntu 20.04)
 
 | engine      | version                 | provider              | batch size | average latency | first run memory MB | second run memory MB |