Add CUDA EP in StableDiffusion demo #17788
Merged
Conversation
Resolved review thread on onnxruntime/python/tools/transformers/models/stable_diffusion/demo_utils.py
kunal-vaishnavi approved these changes on Oct 5, 2023
tianleiwu added a commit that referenced this pull request on Oct 31, 2023:
Add CUDA EP to the demo of stable diffusion.
tianleiwu removed the triage:approved (Approved for cherrypicks for release) and release:1.16.2 labels on Nov 1, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request on Mar 22, 2024:
Add CUDA EP to the demo of stable diffusion.
Description
Add the CUDA EP to the Stable Diffusion demo.
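For context, ORT_CUDA in the demo means running the exported ONNX models on ONNX Runtime's CUDA execution provider (selected with `--engine ORT_CUDA` in the commands further below). A minimal sketch of what that looks like in the Python API, assuming onnxruntime-gpu is installed; the model path is a placeholder, not a file shipped with the demo:

```
# Minimal sketch, not the demo code itself: "model.onnx" is a placeholder path.
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# With the GPU build installed, CUDAExecutionProvider should appear first here.
print(sess.get_providers())
```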
A100 Performance

|  | Engine Property | Batch Size | TRT Latency (ms) | ORT_TRT Latency (ms) | ORT_CUDA Latency (ms) | TORCH Latency (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| SD 1.5, 50 steps, 512x512 | Static Input Shape | 1 | 861 | 851 | 861 | N/A |
| SD 1.5, 50 steps, 512x512 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 1 | 974 | 1079 | 928 | 1222 |
| SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 1 | 2492 | OOM | 1901 | 1971 |
| SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 4 | 9091 | OOM | 6785 | 6700 |
ORT_CUDA is the most robust engine for handling dynamic input shapes. PyTorch can be a good choice when running large batch sizes.
The results above are from one A100-SXM4-80GB GPU (in a Standard_ND96amsr_A100_v4 Azure VM), generating 512x512 or 768x768 images with 50 steps using Stable Diffusion 1.5. onnxruntime-gpu is built from source, and the following packages and libraries are used in this test:
* tensorrt==8.6.1.post1
* torch==2.2.0.dev20230920+cu121
* transformers==4.31.0
* diffusers==0.19.3
* onnx==1.14.1
* onnx-graphsurgeon==0.3.27
* polygraphy==0.47.1
* protobuf==3.20.2
* onnxruntime-gpu==1.17.0 (built from source of main branch)
* CUDA 12.2.2
* cuDNN 8.9.5.29
* python 3.10.13
For static input shape, the engine is built with a static batch size and static image shape, and CUDA graph is enabled.
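As a rough illustration (not the demo's exact wiring), CUDA graph capture can be requested through the CUDA EP's `enable_cuda_graph` provider option. The sketch below assumes a placeholder model with fixed input shapes, since graph replay requires static shapes and stable input buffer addresses (typically arranged via I/O binding):

```
# Hedged sketch: enable CUDA graph capture on the CUDA EP. "model.onnx" is a
# placeholder; real usage also binds inputs/outputs to fixed GPU buffers so the
# captured graph can be replayed against the same addresses.
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("CUDAExecutionProvider", {"enable_cuda_graph": "1"}),
        "CPUExecutionProvider",
    ],
)
```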
For dynamic input shape, the engine is built to support dynamic batch size and dynamic image shape, and CUDA graph is disabled. The TensorRT engine is built for batch sizes 1 to 4 and image sizes 256x256 to 1024x1024, with 512x512 as the optimized image size.
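For reference, a dynamic-shape TensorRT engine is built around a min/opt/max optimization profile like the one described above. A hedged sketch using the ORT TensorRT EP's shape-profile provider options (the option names come from recent ORT releases, and `sample` plus the model path are placeholder names, not the demo's actual inputs; the demo builds its engines through its own pipeline):

```
# Hedged sketch: give the TensorRT EP a shape profile covering batch 1-4 and
# 256x256-1024x1024 images (latent dims are image size / 8), optimized for 512x512.
# "model.onnx" and "sample" are placeholders.
import onnxruntime as ort

trt_options = {
    "trt_profile_min_shapes": "sample:1x4x32x32",
    "trt_profile_opt_shapes": "sample:1x4x64x64",
    "trt_profile_max_shapes": "sample:4x4x128x128",
}
sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
    ],
)
```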
The script used to test static and dynamic input shapes is like the following:

```
prompt="a cute magical flying dog, fantasy art drawn by disney concept artists, highly detailed, digital paintining"
for e in TRT ORT_TRT ORT_CUDA
do
  python demo_txt2img.py --engine $e "$prompt"
  python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "$prompt"
  python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape --height 768 --width 768 "$prompt"
done
```
Performance of PyTorch is measured with commands like the following:

```
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 512 --width 512
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 768 --width 768
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 4 --height 768 --width 768
```
Motivation and Context