
StableDiffusion XL with TensorRT EP #17748

Merged
merged 7 commits into from Oct 4, 2023
Conversation

@tianleiwu tianleiwu (Contributor) commented Sep 30, 2023

Description

Accelerate StableDiffusion XL with the TensorRT EP. It is adapted from the TensorRT demo diffusion, and we updated the design so that the pipeline works with different backend engines.

Performance

The following results are from an A100 80GB GPU generating 1024x1024 images, using either 30 steps of the Base model alone, or 30 steps of Base plus 30 steps of Refiner. The engine is built with static input shapes, and CUDA graph is enabled. onnxruntime-gpu is built from source, and the following packages and libraries are used in this test:

  • tensorrt==8.6.1.post1
  • torch==2.2.0.dev20230920+cu121
  • transformers==4.31.0
  • diffusers==0.19.3
  • onnx==1.14.1
  • onnx-graphsurgeon==0.3.27
  • polygraphy==0.47.1
  • protobuf==3.20.2
  • onnxruntime-gpu==1.17.0 (built from source of main branch)
  • CUDA 12.2.2
  • cuDNN 8.9.5.29
  • python 3.10.13
| | Batch Size | TRT Latency (ms) | ORT_TRT Latency (ms) | Diff |
| -- | -- | -- | -- | -- |
| Base | 1 | 2714 | 2679 | -1.3% |
| Base & Refiner | 1 | 3593 | 3530 | -1.8% |

Diff is (ORT_TRT − TRT) / TRT, so negative values mean the TensorRT EP in ONNX Runtime is faster than the TensorRT demo baseline.

Motivation and Context

@tianleiwu tianleiwu changed the title from "SD XL with TensorRT EP" to "StableDiffusion XL with TensorRT EP" on Sep 30, 2023
        self.torch_models = {}

    def teardown(self):
        for engine in self.engines.values():

Check failure

Code scanning / CodeQL

Suspicious unused loop iteration variable

For loop variable 'engine' is deleted, but not used, in the loop body.
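One way to resolve this alert, sketched below under the assumption that `self.engines` maps names to engine objects and nothing else holds references to them. Deleting the loop variable only unbinds the local name on each iteration, which is why CodeQL flags it; clearing the dict drops every reference in one step.

```python
def teardown(self):
    # Clearing the dicts releases all engine and model references at once,
    # instead of del-ing the loop variable (a no-op on the dict itself).
    self.engines.clear()
    self.torch_models.clear()
```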
        image = image.repeat(batch_size, 1, 1, 1)
        init_images.append(image)
        if self.nvtx_profile:
            nvtx.end_range(nvtx_image_preprocess)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable

Local variable 'nvtx_image_preprocess' may be used before it is initialized.
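A minimal sketch of the usual fix, which applies equally to the identical `nvtx_clip` and `nvtx_vae` alerts below: bind the range handle unconditionally before the guarded `start_range`, so the analyzer can prove it is initialized when `end_range` runs. The `message`/`color` arguments are assumed from the nvtx API usage in these snippets.

```python
nvtx_image_preprocess = None  # always bound, even when profiling is off
if self.nvtx_profile:
    nvtx_image_preprocess = nvtx.start_range(message="image_preprocess", color="pink")

image = image.repeat(batch_size, 1, 1, 1)
init_images.append(image)

if self.nvtx_profile:
    # Guard mirrors the start_range guard, so the handle is non-None here.
    nvtx.end_range(nvtx_image_preprocess)
```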

cudart.cudaEventRecord(self.events["clip-stop"], 0)
if self.nvtx_profile:
nvtx.end_range(nvtx_clip)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable

Local variable 'nvtx_clip' may be used before it is initialized.
        init_latents = self.run_engine("vae_encoder", {"images": init_image})["latent"]
        cudart.cudaEventRecord(self.events["vae_encoder-stop"], 0)
        if self.nvtx_profile:
            nvtx.end_range(nvtx_vae)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable

Local variable 'nvtx_vae' may be used before it is initialized.
        images = self.backend.vae_decode(latents)
        cudart.cudaEventRecord(self.events["vae-stop"], 0)
        if self.nvtx_profile:
            nvtx.end_range(nvtx_vae)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable

Local variable 'nvtx_vae' may be used before it is initialized.

if not args.disable_cuda_graph:
    # inference once to get cuda graph
    _image, _latency = run_inference(warmup=True)

Check warning

Code scanning / CodeQL

Variable defined multiple times

This assignment to '_image' is unnecessary, as it is redefined (at two later sites) before this value is used. The same alert is raised for '_latency'.

print("[I] Warming up ..")
for _ in range(args.num_warmup_runs):
_image, _latency = run_inference(warmup=True)

Check warning

Code scanning / CodeQL

Variable defined multiple times

This assignment to '_image' is unnecessary as it is [redefined](1) before this value is used.

print("[I] Warming up ..")
for _ in range(args.num_warmup_runs):
_image, _latency = run_inference(warmup=True)

Check warning

Code scanning / CodeQL

Variable defined multiple times

This assignment to '_latency' is unnecessary as it is [redefined](1) before this value is used.

if not args.disable_cuda_graph:
    # inference once to get cuda graph
    _image, _latency = run_inference(warmup=True)

Check notice

Code scanning / CodeQL

Unused global variable

The global variables '_image' and '_latency' are not used.

print("[I] Warming up ..")
for _ in range(args.num_warmup_runs):
_image, _latency = run_inference(warmup=True)

Check notice

Code scanning / CodeQL

Unused global variable

The global variable '_image' is not used.

print("[I] Warming up ..")
for _ in range(args.num_warmup_runs):
_image, _latency = run_inference(warmup=True)

Check notice

Code scanning / CodeQL

Unused global variable

The global variable '_latency' is not used.
print("[I] Running StableDiffusion pipeline")
if args.nvtx_profile:
cudart.cudaProfilerStart()
_image, _latency = run_inference(warmup=False)

Check notice

Code scanning / CodeQL

Unused global variable

The global variable '_image' is not used.
print("[I] Running StableDiffusion pipeline")
if args.nvtx_profile:
cudart.cudaProfilerStart()
_image, _latency = run_inference(warmup=False)

Check notice

Code scanning / CodeQL

Unused global variable

The global variable '_latency' is not used.
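All of the `_image`/`_latency` warnings and notices above stem from binding return values that are never read. A hedged sketch of one cleanup, assuming `run_inference` returns an `(image, latency)` pair as the snippets suggest: discard the warmup results entirely and keep names only for the final, measured run.

```python
if not args.disable_cuda_graph:
    # Inference once to capture the CUDA graph; the outputs are not needed.
    run_inference(warmup=True)

print("[I] Warming up ..")
for _ in range(args.num_warmup_runs):
    # Warmup outputs are likewise discarded, silencing both CodeQL families.
    run_inference(warmup=True)

print("[I] Running StableDiffusion pipeline")
if args.nvtx_profile:
    cudart.cudaProfilerStart()
images, latency = run_inference(warmup=False)  # consumed afterwards (saved / reported)
```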
@tianleiwu tianleiwu merged commit a05580e into main Oct 4, 2023
@tianleiwu tianleiwu deleted the tlwu/sdxl_trt branch October 4, 2023 15:01
@faxu faxu added the triage:approved (Approved for cherrypicks for release) and sdxl_llama labels on Oct 25, 2023
tianleiwu added a commit that referenced this pull request Oct 31, 2023
@tianleiwu tianleiwu removed the triage:approved (Approved for cherrypicks for release) and release:1.16.2 labels on Nov 1, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024