[Performance] Severe performance penalty with transformer model and DirectML #20983

Open
andrea-cimatoribus-pix4d opened this issue Jun 10, 2024 · 6 comments
Labels
ep:CUDA issues related to the CUDA execution provider ep:DML issues related to the DirectML execution provider performance issues related to performance regressions platform:windows issues related to the Windows platform

andrea-cimatoribus-pix4d commented Jun 10, 2024

Describe the issue

I am testing Meta's Segment Anything (SAM) encoder model, both on Linux (CUDA) and on Windows (DirectML). When testing the model on the two platforms, using identical hardware (Intel i9-9900, NVIDIA RTX A2000 12GB), I see very different runtimes (median over 115 images):

  • On Linux+CUDA, model loading takes ~2s and encoding takes ~370ms per image.
  • On Windows+DirectML, model loading takes ~14s and encoding takes ~780ms per image.
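
For readers who want to reproduce this kind of measurement, here is a minimal sketch of a per-image timing loop and a median, using only the C++ standard library. It is not the custom code used for the numbers above. The names `median_ms` and `time_calls` are hypothetical, and the `work` callback is an assumed stand-in for whatever is being timed (for example image preparation plus the inference call).

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Median of a set of per-image runtimes, in milliseconds.
// Takes the vector by value because sorting reorders it.
double median_ms(std::vector<double> samples) {
    if (samples.empty()) throw std::invalid_argument("no samples");
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return n % 2 == 1 ? samples[n / 2]
                      : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Calls `work(i)` once per image index and records the wall-clock time
// of each call in milliseconds. `work` is a hypothetical placeholder for
// the code under test; it is not part of onnxruntime.
std::vector<double> time_calls(const std::function<void(std::size_t)>& work,
                               std::size_t count) {
    std::vector<double> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work(i);
        const auto t1 = std::chrono::steady_clock::now();
        out.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return out;
}
```

Calling `median_ms(time_calls(run_one_image, 115))` with a suitable `run_one_image` callback would give a single figure comparable across platforms. The median is less sensitive than the mean to a slow first call.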

I got these numbers using the C++ API v1.14.1 with some custom code, but I got comparable results with more recent versions (including the latest, 1.18.0), with different hardware, and with the Python bindings. I therefore tried profiling the model execution. Comparing the Linux+CUDA and Windows+DirectML profiles, the longer runtime on Windows+DirectML seems to be related to the time spent in Memcpy_token_..._kernel_time. Why would DirectML need to make copies when CUDA doesn't? Can that really be related to the specific execution provider? [Note: a very hacky test using CUDA on Windows suggests that the CUDA EP may suffer from a similar issue on Windows, but I cannot tell for sure.]
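
To quantify how much of a profile is spent in such copies, one option is to sum the kernel-time events by name. The sketch below assumes the profile has already been parsed from the profiler's JSON output into a list of records with a name and a duration in microseconds. The `ProfileEvent` and `Split` types and the `split_memcpy` function are hypothetical helpers, not part of onnxruntime. Only events whose name ends in `_kernel_time` are counted, on the assumption that broader events (such as a whole-run entry) would otherwise be counted twice.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One profile record, reduced to the two fields used here.
// Assumption: loaded elsewhere from the profiler's JSON output.
struct ProfileEvent {
    std::string name;
    std::int64_t dur_us;
};

// Total kernel time split into memcpy and non-memcpy parts.
struct Split {
    std::int64_t memcpy_us = 0;
    std::int64_t other_us = 0;
};

// True if `s` ends with `suffix`.
bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Sums the duration of every "*_kernel_time" event, separating events
// whose name contains "Memcpy" from all other kernels.
Split split_memcpy(const std::vector<ProfileEvent>& events) {
    Split result;
    for (const ProfileEvent& e : events) {
        if (!ends_with(e.name, "_kernel_time")) continue;  // skip non-kernel events
        if (e.name.find("Memcpy") != std::string::npos) {
            result.memcpy_us += e.dur_us;
        } else {
            result.other_us += e.dur_us;
        }
    }
    return result;
}
```

Running this on both profiles and comparing `memcpy_us` against `other_us` would show whether the copies account for the whole gap or only part of it.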

I am now wondering whether the issue I see is due to some mistake on my side, e.g. in the model export, or whether it is actually related to some limitation of DirectML or Windows with this model. Other models (in particular, models without attention layers) do not show comparable platform-dependent differences. I also wonder whether the optimizations suggested for transformer models might have an impact, but I don't think that SAM or ViT transformers are supported, or at least I did not understand how to apply the optimizations.

I am running out of ideas, at least given the time and hardware that I have available, so I am writing to find out whether anybody has experienced similar issues, or whether anybody understands what is going on. Thanks.

Linux+CUDA profiling: https://drive.google.com/file/d/19NykxOWKMxZebQn3UQ9oOOs2atDv7O_8/view?usp=drive_link
Windows+DirectML profiling: https://drive.google.com/file/d/1mTCB1CzbQVj1EysXJ-hJ1wSGF077cAhV/view?usp=drive_link

To reproduce

The onnx exported model is available here.

For CUDA on Linux, the EP is created with the following options:

            OrtCUDAProviderOptions cuda_options;
            cuda_options.device_id = 0;
            cuda_options.arena_extend_strategy = 0;
            cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearch::OrtCudnnConvAlgoSearchDefault;
            cuda_options.gpu_mem_limit = 0;
            cuda_options.do_copy_in_default_stream = true;
            cuda_options.has_user_compute_stream = false;
            cuda_options.default_memory_arena_cfg = nullptr;
            session_options.AppendExecutionProvider_CUDA(cuda_options);

For DirectML on Windows, this is the set-up (based on this):

            session_options.DisableMemPattern();
            session_options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);

--- EDIT 10/06/24 ---
It turns out that the two options above don't seem to be required any more. Removing them has a positive impact on the Windows+DirectML runtime (~750ms per image), which however remains very far from the Linux+CUDA one.
--- END EDIT ---

In both cases I call session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED); and append the CPU provider, as suggested in the documentation.

--- EDIT 11/06/24 ---
Note that image preparation (resizing, normalization, padding), which is done outside of the inference call to onnxruntime, is included in the runtimes reported above. However, it cannot explain the differences observed (~55ms on Linux, ~60ms on Windows).
--- END EDIT ---

Urgency

This might be an important issue for DirectML-based inference on Windows.

Platform

Windows

OS Version

11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

Onnxruntime default DirectML version

Model File

Meta's Segment Anything (SAM) model exported with default settings, opset v17, constant folding optimization enabled, no dynamic input axes. Exported model available here.

Is this a quantized model?

No

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:DML issues related to the DirectML execution provider platform:windows issues related to the Windows platform labels Jun 10, 2024
@sophies927
Contributor

@smk2007

@sophies927 sophies927 added performance issues related to performance regressions and removed ep:CUDA issues related to the CUDA execution provider labels Jun 13, 2024
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Jun 14, 2024
@andrea-cimatoribus-pix4d
Author

I ran some extra experiments with dynamic axes. I can confirm that removing all dynamic axes provides a small speed-up (a few % with DirectML), but once again it does not close the gap between Windows+DirectML and Linux+CUDA.

@yuslepukhin
Member

Any reason not to compare Windows CUDA with Linux CUDA?

@andrea-cimatoribus-pix4d
Author

> Any reason not to compare Windows CUDA with Linux CUDA?

The reason is that I don't have a reliable way to build onnxruntime with CUDA on Windows; cudart/cudnn distribution for Windows has been sketchy at best in recent times (at least up to 11.7, which is what I currently use). So I cannot do the measurements on the same infrastructure I use for the other cases. A hacky test, however, suggests that Windows+CUDA suffers from a performance penalty similar to that of Windows+DirectML.

@andrea-cimatoribus-pix4d
Author

@yuslepukhin

I collected some more runtime measurements using the publicly available Python bindings (v1.18.1). The results are interesting: the DirectML bindings are the fastest, at 1.5s per image, and CUDA is at least as slow. If I run the same model with PyTorch+CUDA on Windows, I get ~0.5s per image, which is the same runtime I get on Linux on identical hardware. On Linux, our C++ onnxruntime integration is faster than PyTorch by ~20%. It seems that the model has some issue on Windows in onnxruntime that is not clearly related to the execution provider. As in my original message, the issue seems to come from some extra memcpy, but I don't really understand what is causing them.

@andrea-cimatoribus-pix4d
Author

andrea-cimatoribus-pix4d commented Aug 28, 2024

I was able to test 1.19.0 with DirectML on Windows, and the runtime is improved by ~20% with respect to the previous best results I had on Windows (v1.14.1 + DirectML). That is still pretty far from the Linux speed, but clearly an improvement. The variance of the per-image runtime that I measure is also cut by a factor of around 2. I could not run the profiler on this build yet, but this latter result might suggest that fewer copies are made between CPU and GPU (if the memcpy I observed in previous profiling runs is really due to that). On the other hand, model loading is still much slower on Windows than on Linux.
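
For anyone comparing builds in the same way, a minimal sketch of the mean and variance of per-image runtimes is given below. It is an illustration only, not the code used for the figures above. The `RuntimeStats` type and `runtime_stats` function are hypothetical names, and the choice of the unbiased sample variance (dividing by n - 1) is an assumption.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Summary of a set of per-image runtimes (milliseconds).
struct RuntimeStats {
    double mean = 0.0;
    double variance = 0.0;  // unbiased sample variance, divides by n - 1
};

// Two-pass computation: first the mean, then the squared deviations.
RuntimeStats runtime_stats(const std::vector<double>& samples_ms) {
    if (samples_ms.size() < 2) {
        throw std::invalid_argument("need at least two samples");
    }
    double sum = 0.0;
    for (double s : samples_ms) sum += s;
    const double mean = sum / static_cast<double>(samples_ms.size());

    double squared_dev = 0.0;
    for (double s : samples_ms) squared_dev += (s - mean) * (s - mean);

    RuntimeStats stats;
    stats.mean = mean;
    stats.variance = squared_dev / static_cast<double>(samples_ms.size() - 1);
    return stats;
}
```

Dividing the variance measured on one build by the variance measured on another gives the kind of ratio quoted above.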
