
StableDiffusion XL with TensorRT EP #17748

Merged
merged 7 commits into from Oct 4, 2023
Conversation

@tianleiwu tianleiwu (Contributor) commented Sep 30, 2023

Description

Accelerate StableDiffusion XL with the TensorRT EP. It is adapted from the TensorRT demo diffusion, and we updated the design so that the pipeline works with different backend engines.

Performance

The following results are from an A100 80GB GPU generating 1024x1024 images, using either 30 steps of the Base model alone, or 30 steps of Base plus 30 steps of Refiner. The engine is built with static input shapes, and CUDA graph is enabled. onnxruntime-gpu is built from source, and the following packages and libraries are used in this test:

  • tensorrt==8.6.1.post1
  • torch==2.2.0.dev20230920+cu121
  • transformers==4.31.0
  • diffusers==0.19.3
  • onnx==1.14.1
  • onnx-graphsurgeon==0.3.27
  • polygraphy==0.47.1
  • protobuf==3.20.2
  • onnxruntime-gpu==1.17.0 (built from source of main branch)
  • CUDA 12.2.2
  • cuDNN 8.9.5.29
  • python 3.10.13
| | Batch Size | TRT Latency (ms) | ORT_TRT Latency (ms) | Diff |
| -- | -- | -- | -- | -- |
| Base | 1 | 2714 | 2679 | -1.3% |
| Base & Refiner | 1 | 3593 | 3530 | -1.8% |

Diff is (ORT_TRT − TRT) / TRT, so negative values mean the TensorRT EP in ONNX Runtime is faster than the TensorRT demo baseline.

Motivation and Context

@tianleiwu tianleiwu changed the title from "SD XL with TensorRT EP" to "StableDiffusion XL with TensorRT EP" on Sep 30, 2023
        self.torch_models = {}

    def teardown(self):
        for engine in self.engines.values():

Check failure

Code scanning / CodeQL

Suspicious unused loop iteration variable

For loop variable 'engine' is deleted, but not used, in the loop body.
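One way to resolve this alert, sketched below under the assumption that `self.engines` maps names to engine objects and nothing else holds references to them. Deleting the loop variable only unbinds the local name on each iteration, which is why CodeQL flags it; clearing the dict drops every reference in one step.

```python
def teardown(self):
    # Clearing the dicts releases all engine and model references at once,
    # instead of del-ing the loop variable (a no-op on the dict itself).
    self.engines.clear()
    self.torch_models.clear()
```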
        image = image.repeat(batch_size, 1, 1, 1)
        init_images.append(image)
        if self.nvtx_profile:
            nvtx.end_range(nvtx_image_preprocess)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable

Local variable 'nvtx_image_preprocess' may be used before it is initialized.
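A minimal sketch of the usual fix, which applies equally to the identical `nvtx_clip` and `nvtx_vae` alerts below: bind the range handle unconditionally before the guarded `start_range`, so the analyzer can prove it is initialized when `end_range` runs. The `message`/`color` arguments are assumed from the nvtx API usage in these snippets.

```python
nvtx_image_preprocess = None  # always bound, even when profiling is off
if self.nvtx_profile:
    nvtx_image_preprocess = nvtx.start_range(message="image_preprocess", color="pink")

image = image.repeat(batch_size, 1, 1, 1)
init_images.append(image)

if self.nvtx_profile:
    # Guard mirrors the start_range guard, so the handle is non-None here.
    nvtx.end_range(nvtx_image_preprocess)
```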

cudart.cudaEventRecord(self.events["clip-stop"], 0)
if self.nvtx_profile:
nvtx.end_range(nvtx_clip)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable

Local variable 'nvtx_clip' may be used before it is initialized.
        init_latents = self.run_engine("vae_encoder", {"images": init_image})["latent"]
        cudart.cudaEventRecord(self.events["vae_encoder-stop"], 0)
        if self.nvtx_profile:
            nvtx.end_range(nvtx_vae)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable

Local variable 'nvtx_vae' may be used before it is initialized.
        images = self.backend.vae_decode(latents)
        cudart.cudaEventRecord(self.events["vae-stop"], 0)
        if self.nvtx_profile:
            nvtx.end_range(nvtx_vae)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable

Local variable 'nvtx_vae' may be used before it is initialized.

if not args.disable_cuda_graph:
    # inference once to get cuda graph
    _image, _latency = run_inference(warmup=True)

Check warning

Code scanning / CodeQL

Variable defined multiple times

This assignment to '_image' is unnecessary, as it is redefined (at two later sites) before this value is used. The same alert is raised for '_latency'.

print("[I] Warming up ..")
for _ in range(args.num_warmup_runs):
_image, _latency = run_inference(warmup=True)

Check warning

Code scanning / CodeQL

Variable defined multiple times

This assignment to '_image' is unnecessary as it is [redefined](1) before this value is used.

print("[I] Warming up ..")
for _ in range(args.num_warmup_runs):
_image, _latency = run_inference(warmup=True)

Check warning

Code scanning / CodeQL

Variable defined multiple times

This assignment to '_latency' is unnecessary as it is [redefined](1) before this value is used.

if not args.disable_cuda_graph:
    # inference once to get cuda graph
    _image, _latency = run_inference(warmup=True)

Check notice

Code scanning / CodeQL

Unused global variable

The global variables '_image' and '_latency' are not used.

print("[I] Warming up ..")
for _ in range(args.num_warmup_runs):
_image, _latency = run_inference(warmup=True)

Check notice

Code scanning / CodeQL

Unused global variable

The global variable '_image' is not used.

print("[I] Warming up ..")
for _ in range(args.num_warmup_runs):
_image, _latency = run_inference(warmup=True)

Check notice

Code scanning / CodeQL

Unused global variable

The global variable '_latency' is not used.
print("[I] Running StableDiffusion pipeline")
if args.nvtx_profile:
cudart.cudaProfilerStart()
_image, _latency = run_inference(warmup=False)

Check notice

Code scanning / CodeQL

Unused global variable

The global variable '_image' is not used.
print("[I] Running StableDiffusion pipeline")
if args.nvtx_profile:
cudart.cudaProfilerStart()
_image, _latency = run_inference(warmup=False)

Check notice

Code scanning / CodeQL

Unused global variable

The global variable '_latency' is not used.
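All of the `_image`/`_latency` warnings and notices above stem from binding return values that are never read. A hedged sketch of one cleanup, assuming `run_inference` returns an `(image, latency)` pair as the snippets suggest: discard the warmup results entirely and keep names only for the final, measured run.

```python
if not args.disable_cuda_graph:
    # Inference once to capture the CUDA graph; the outputs are not needed.
    run_inference(warmup=True)

print("[I] Warming up ..")
for _ in range(args.num_warmup_runs):
    # Warmup outputs are likewise discarded, silencing both CodeQL families.
    run_inference(warmup=True)

print("[I] Running StableDiffusion pipeline")
if args.nvtx_profile:
    cudart.cudaProfilerStart()
images, latency = run_inference(warmup=False)  # consumed afterwards (saved / reported)
```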
@tianleiwu tianleiwu merged commit a05580e into main Oct 4, 2023
@tianleiwu tianleiwu deleted the tlwu/sdxl_trt branch October 4, 2023 15:01
@faxu faxu added the triage:approved (Approved for cherrypicks for release) and sdxl_llama labels on Oct 25, 2023
tianleiwu added a commit that referenced this pull request Oct 31, 2023
@tianleiwu tianleiwu removed the triage:approved (Approved for cherrypicks for release) and release:1.16.2 labels on Nov 1, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024