forked from microsoft/onnxruntime
Commit
Add CUDA EP in StableDiffusion demo (microsoft#17788)
Add CUDA EP to the Stable Diffusion demo.

### A100 Performance Test

| Configuration | Engine Property | Batch Size | TRT Latency (ms) | ORT_TRT Latency (ms) | ORT_CUDA Latency (ms) | TORCH Latency (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| SD 1.5, 50 steps, 512x512 | Static Input Shape | 1 | 861 | 851 | 861 | N/A |
| SD 1.5, 50 steps, 512x512 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 1 | 974 | 1079 | 928 | 1222 |
| SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 1 | 2492 | OOM | 1901 | 1971 |
| SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 4 | 9091 | OOM | 6785 | 6700 |

ORT_CUDA is the most robust engine for handling dynamic input shapes: it is faster than TRT in every dynamic-shape row, and ORT_TRT runs out of memory at 768x768. PyTorch could be a good choice for large batch sizes, where it is slightly faster than ORT_CUDA (6700 ms vs. 6785 ms at batch size 4).

The above results are from one A100-SXM4-80GB GPU (in a Standard_ND96amsr_A100_v4 Azure VM), running 50 steps to generate 512x512 or 768x768 images with Stable Diffusion 1.5.

onnxruntime-gpu is built from source, and the following packages and libraries are used in this test:

* tensorrt==8.6.1.post1
* torch==2.2.0.dev20230920+cu121
* transformers==4.31.0
* diffusers==0.19.3
* onnx==1.14.1
* onnx-graphsurgeon==0.3.27
* polygraphy==0.47.1
* protobuf==3.20.2
* onnxruntime-gpu==1.17.0 (built from source of main branch)
* CUDA 12.2.2
* cuDNN 8.9.5.29
* python 3.10.13

For static input shape, the engine is built with a static batch size and a static image shape, and CUDA graph is enabled.

For dynamic input shape, the engine is built to support dynamic batch size and dynamic image shape, and CUDA graph is disabled. The TensorRT engine is built for batch sizes 1 to 4 and image sizes 256x256 to 1024x1024, and the optimized image size is 512x512.
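To make the comparison in the A100 table easier to check, here is a minimal Python sketch that copies the latency figures from the table and reports the fastest engine per row, plus the TRT-to-ORT_CUDA latency ratio. The dictionary layout and the helper names `fastest` and `speedup` are illustrative assumptions and are not part of the demo scripts.

```python
# Latencies in ms, copied from the A100 table; None marks OOM or N/A entries.
# Keys are (image size, shape mode, batch size). The layout is an assumption
# made for this sketch, not a format used by the demo.
results = {
    ("512x512", "static", 1): {"TRT": 861, "ORT_TRT": 851, "ORT_CUDA": 861, "TORCH": None},
    ("512x512", "dynamic", 1): {"TRT": 974, "ORT_TRT": 1079, "ORT_CUDA": 928, "TORCH": 1222},
    ("768x768", "dynamic", 1): {"TRT": 2492, "ORT_TRT": None, "ORT_CUDA": 1901, "TORCH": 1971},
    ("768x768", "dynamic", 4): {"TRT": 9091, "ORT_TRT": None, "ORT_CUDA": 6785, "TORCH": 6700},
}


def fastest(row):
    """Return (engine, latency_ms) for the lowest measured latency in a row."""
    measured = {engine: ms for engine, ms in row.items() if ms is not None}
    engine = min(measured, key=measured.get)
    return engine, measured[engine]


def speedup(row, baseline, candidate):
    """Return baseline/candidate latency; > 1 means the candidate is faster.

    Returns None when either engine has no measurement (OOM or N/A).
    """
    base_ms, cand_ms = row.get(baseline), row.get(candidate)
    if base_ms is None or cand_ms is None:
        return None
    return base_ms / cand_ms


for key, row in results.items():
    engine, ms = fastest(row)
    ratio = speedup(row, "TRT", "ORT_CUDA")
    print(key, "fastest:", engine, ms, "ms;", "TRT/ORT_CUDA:", round(ratio, 2))
```

Treating OOM and N/A as missing values, rather than as zero or infinity, keeps them out of the minimum and makes the ratio undefined instead of misleading.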
The scripts to test static and dynamic input shapes are like the following:

```
prompt="a cute magical flying dog, fantasy art drawn by disney concept artists, highly detailed, digital paintining"
for e in TRT ORT_TRT ORT_CUDA
do
  python demo_txt2img.py --engine $e "$prompt"
  python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "$prompt"
  python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape --height 768 --width 768 "$prompt"
done
```

PyTorch performance is measured with commands like the following:

```
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 512 --width 512
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 768 --width 768
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 4 --height 768 --width 768
```
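The engine loop above can also be expressed as a small Python sketch that builds the same three `demo_txt2img.py` command lines per engine as argument lists. It uses only the flags shown in the shell script; the function name `demo_commands` and the constants are hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex

# Prompt copied verbatim from the shell script above (including its spelling).
PROMPT = (
    "a cute magical flying dog, fantasy art drawn by disney concept artists, "
    "highly detailed, digital paintining"
)
ENGINES = ["TRT", "ORT_TRT", "ORT_CUDA"]
# Flags used by the dynamic-shape runs in the shell script.
DYNAMIC_FLAGS = ["--disable-cuda-graph", "--build-dynamic-batch", "--build-dynamic-shape"]


def demo_commands(engine, prompt=PROMPT):
    """Return the three argv lists the shell loop runs for one engine."""
    base = ["python", "demo_txt2img.py", "--engine", engine]
    return [
        base + [prompt],                                   # static shape, 512x512
        base + DYNAMIC_FLAGS + [prompt],                   # dynamic shape, 512x512
        base + DYNAMIC_FLAGS + ["--height", "768", "--width", "768", prompt],
    ]


for engine in ENGINES:
    for argv in demo_commands(engine):
        # shlex.join quotes the prompt so each line can be pasted into a shell.
        print(shlex.join(argv))
```

Keeping each command as a list means the prompt stays a single argument if the list is later passed to `subprocess.run` without a shell.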
Showing 13 changed files with 341 additions and 53 deletions.