Add CUDA EP in StableDiffusion demo #17788

Merged
merged 3 commits into from
Oct 5, 2023
Conversation

@tianleiwu (Contributor) commented Oct 4, 2023

Description

Add the CUDA EP to the Stable Diffusion demo.

A100 Performance

| Test | Engine Property | Batch Size | TRT Latency (ms) | ORT_TRT Latency (ms) | ORT_CUDA Latency (ms) | TORCH Latency (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| SD 1.5, 50 steps, 512x512 | Static Input Shape | 1 | 861 | 851 | 861 | N/A |
| SD 1.5, 50 steps, 512x512 | Dynamic Input Shape, optimized for batch size 1 and image size 512x512 | 1 | 974 | 1079 | 928 | 1222 |
| SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, optimized for batch size 1 and image size 512x512 | 1 | 2492 | OOM | 1901 | 1971 |
| SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, optimized for batch size 1 and image size 512x512 | 4 | 9091 | OOM | 6785 | 6700 |

We can see that ORT_CUDA is the most robust engine for handling dynamic input shapes. PyTorch can be a good choice for large batch sizes.
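To make the comparison concrete, the relative speedups implied by the table above can be computed directly from the reported latencies (a small illustrative script; the dictionary layout and helper name are mine, the numbers are from the table):

```python
# Latencies in ms from the A100 table above; None marks N/A or OOM entries.
latencies = {
    "SD 1.5 512x512 static, batch 1":  {"TRT": 861,  "ORT_TRT": 851,  "ORT_CUDA": 861,  "TORCH": None},
    "SD 1.5 512x512 dynamic, batch 1": {"TRT": 974,  "ORT_TRT": 1079, "ORT_CUDA": 928,  "TORCH": 1222},
    "SD 1.5 768x768 dynamic, batch 1": {"TRT": 2492, "ORT_TRT": None, "ORT_CUDA": 1901, "TORCH": 1971},
    "SD 1.5 768x768 dynamic, batch 4": {"TRT": 9091, "ORT_TRT": None, "ORT_CUDA": 6785, "TORCH": 6700},
}

def latency_ratio(row, baseline, candidate):
    """Return baseline latency / candidate latency, or None if either entry is N/A or OOM."""
    b, c = row[baseline], row[candidate]
    return None if b is None or c is None else b / c

for test, row in latencies.items():
    ratio = latency_ratio(row, "TRT", "ORT_CUDA")
    if ratio is not None:
        print(f"{test}: TRT / ORT_CUDA latency ratio = {ratio:.2f}")
```

For the 768x768 dynamic-shape cases this gives ratios above 1.3, which is where the "most robust for dynamic input shapes" observation comes from; for the static 512x512 case the two are essentially tied.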

The above results are from one A100-SXM4-80GB GPU (in a Standard_ND96amsr_A100_v4 Azure VM), with 50 steps used to generate 512x512 or 768x768 images with Stable Diffusion 1.5. onnxruntime-gpu is built from source, and the following packages and libraries are used in this test:

  • tensorrt==8.6.1.post1
  • torch==2.2.0.dev20230920+cu121
  • transformers==4.31.0
  • diffusers==0.19.3
  • onnx==1.14.1
  • onnx-graphsurgeon==0.3.27
  • polygraphy==0.47.1
  • protobuf==3.20.2
  • onnxruntime-gpu==1.17.0 (built from source of main branch)
  • CUDA 12.2.2
  • cuDNN 8.9.5.29
  • python 3.10.13

For static input shape, the engine is built with a static batch size and static image shape, and CUDA graph is enabled.

For dynamic input shape, the engine is built to support dynamic batch size and dynamic image shape, and CUDA graph is disabled. The TensorRT engine is built for batch sizes 1 to 4 and image sizes 256x256 to 1024x1024, optimized for image size 512x512.
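For reference, these two modes map onto the ONNX Runtime CUDA EP's `enable_cuda_graph` provider option, which the demo's `--disable-cuda-graph` flag controls. A minimal sketch of the corresponding provider configurations (the model path in the comment is a placeholder, not a file from this PR):

```python
# Provider configurations for the two modes described above.
# Static shapes: CUDA graph capture can be enabled.
static_shape_providers = [
    ("CUDAExecutionProvider", {"enable_cuda_graph": "1"}),
]
# Dynamic shapes: CUDA graph must be disabled, since a captured graph
# is replayed with fixed tensor shapes.
dynamic_shape_providers = [
    ("CUDAExecutionProvider", {"enable_cuda_graph": "0"}),
]

# With onnxruntime-gpu installed, a session would be created like:
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=static_shape_providers)
```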

The scripts used to test static and dynamic input shapes are like the following:

```
prompt="a cute magical flying dog, fantasy art drawn by disney concept artists, highly detailed, digital paintining"
for e in TRT ORT_TRT ORT_CUDA
do
  python demo_txt2img.py --engine $e "$prompt"
  python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "$prompt"
  python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape --height 768 --width 768 "$prompt"
done
```

Performance of PyTorch is from commands like the following:

```
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 512 --width 512
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 768 --width 768
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 4 --height 768 --width 768
```

Motivation and Context

@tianleiwu tianleiwu merged commit d6dad96 into main Oct 5, 2023
91 checks passed
@tianleiwu tianleiwu deleted the tlwu/sdxl_demo_cuda branch October 5, 2023 15:19
@faxu faxu added triage:approved Approved for cherrypicks for release sdxl_llama labels Oct 25, 2023
tianleiwu added a commit that referenced this pull request Oct 31, 2023
@tianleiwu tianleiwu removed triage:approved Approved for cherrypicks for release release:1.16.2 labels Nov 1, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024