
CUDAExecutionProvider doesn't seem to be used during inference of transformers exported model to ONNX runtime GPU #22325

Open
cooper-a opened this issue Oct 4, 2024 · 3 comments
Labels
- ep:CUDA (issues related to the CUDA execution provider)
- model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)
- stale (issues that have not been addressed in a while; categorized by a bot)

Comments

@cooper-a commented Oct 4, 2024

Describe the issue

We are seeing an issue with a Transformer model that was exported using torch.onnx.export and then optimized with optimum's ORTOptimizer: inference appears to run only on the CPU, not the GPU.
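For context, here is a minimal sketch of the export-and-optimize flow described above. The exact script was not shared in this issue, so the checkpoint name, save paths, opset, and optimization level below are placeholders:

```python
# Hypothetical reconstruction of the export/optimize flow; the actual
# script is not included in this issue.
import os

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "microsoft/deberta-v3-base"  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Export to ONNX with dynamic batch/sequence axes.
os.makedirs("onnx_model", exist_ok=True)
dummy = tokenizer("hello world", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "onnx_model/model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=17,  # opset is an assumption
)

# Optimize the exported graph with optimum's ORTOptimizer.
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# The directory must also contain the model's config.json.
optimizer = ORTOptimizer.from_pretrained("onnx_model")
optimizer.optimize(
    save_dir="onnx_model_optimized",
    optimization_config=OptimizationConfig(optimization_level=2),  # level is an assumption
)
```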

The model was exported on a CPU machine using ONNX 1.16.0. We see the following warnings when starting the inference session:

```
[W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 36 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-10-04 20:35:10.514629537 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
```

### To reproduce

The model was deployed in an Ubuntu 20.04 Docker container on the Azure SKU [NCasT4_v3](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ncast4v3-series?tabs=sizebasic) with the following versions:

- onnxruntime-gpu 1.18.0
- CUDA 11.8
- cuDNN 8.9.6.50

**Inferencing Code**

```python
import os

import onnxruntime as ort
import torch
from transformers import DebertaV2Tokenizer


class UnifiedModelOnnx(BaseModel):  # BaseModel comes from our codebase
    def __init__(self, model_root, gpu_mem_limit=12, device=None):
        self.model_path = os.path.join(model_root, "model.onnx")

        self.tokenizer = DebertaV2Tokenizer.from_pretrained(model_root)
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") if device is None else device
        print(f"Using device {self.device}")
        providers = [
            (
                "CUDAExecutionProvider",
                {
                    "device_id": 0,
                    "arena_extend_strategy": "kNextPowerOfTwo",
                    "gpu_mem_limit": gpu_mem_limit * 1024 * 1024 * 1024,
                    "cudnn_conv_algo_search": "EXHAUSTIVE",
                    "do_copy_in_default_stream": True,
                },
            )
        ]
        ort_session_options = ort.SessionOptions()
        ort_session_options.enable_cpu_mem_arena = False
        # Pass the SessionOptions explicitly; otherwise they are silently ignored.
        self.ort_session = ort.InferenceSession(
            self.model_path, sess_options=ort_session_options, providers=providers
        )
        # add_run_config_entry returns None, so configure the RunOptions
        # object in a separate statement rather than chaining.
        self.run_config = ort.RunOptions()
        self.run_config.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
```

We also ran `ort.get_available_providers()`, which returned `['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'AzureExecutionProvider', 'CPUExecutionProvider']`,

and `self.ort_session.get_providers()`, which returned `['CUDAExecutionProvider', 'CPUExecutionProvider']`.
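To confirm which execution provider actually runs each node, ORT's built-in profiler records a per-kernel trace. A minimal sketch; the input names and shapes are assumptions based on a typical DeBERTa export:

```python
import numpy as np
import onnxruntime as ort

# Enable profiling; the resulting Chrome-trace JSON records, for every
# kernel event, which execution provider executed it (look for
# "CUDAExecutionProvider" vs "CPUExecutionProvider" in the event args).
so = ort.SessionOptions()
so.enable_profiling = True

session = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy inputs; names and shapes depend on the exported model.
inputs = {
    "input_ids": np.ones((1, 16), dtype=np.int64),
    "attention_mask": np.ones((1, 16), dtype=np.int64),
}
session.run(None, inputs)

profile_path = session.end_profiling()  # writes the trace file to disk
print(f"Profile written to {profile_path}")
```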

### Urgency

Internal Microsoft team. Contact aliases: [email protected] or [email protected].

### Platform

Linux

### OS Version

Ubuntu 20.04

### ONNX Runtime Installation

Released Package

### ONNX Runtime Version or Commit ID

1.18.0

### ONNX Runtime API

Python

### Architecture

X86

### Execution Provider

CUDA

### Execution Provider Library Version

CUDA 11.8, cuDNN 8.9.6.50
@github-actions github-actions bot added the model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. label Oct 4, 2024
@tianleiwu (Contributor) commented:

Memcpy nodes copy data between devices (like GPU and CPU), so the session is using both the GPU and the CPU.

Could you share some information to reproduce this (e.g. the transformers/optimum/pytorch versions, and the Python script or command line used for the ONNX export and optimization)?
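For reference, device-to-host copies at the graph boundary can be reduced with I/O binding, sketched below; the input and output names are assumptions. Note this does not remove Memcpy nodes that ORT inserts inside the graph for ops that fall back to CPU:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# Place the input on the GPU up front so ORT does not need a
# host-to-device copy at graph entry.
input_ids = ort.OrtValue.ortvalue_from_numpy(
    np.ones((1, 16), dtype=np.int64), "cuda", 0
)

binding = session.io_binding()
binding.bind_ortvalue_input("input_ids", input_ids)  # name is an assumption
binding.bind_output("logits", "cuda")                # keep the output on GPU too

session.run_with_iobinding(binding)
outputs = binding.copy_outputs_to_cpu()  # list of numpy arrays
```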

@sophies927 sophies927 added the ep:CUDA issues related to the CUDA execution provider label Oct 10, 2024
github-actions bot commented:

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Nov 10, 2024
@AdamsLee commented Jan 1, 2025

I'm seeing the same issue, with:

- torch 2.4.0
- torchaudio 2.4.0
- transformers 4.40.1
- optimum 1.22.0
- onnxruntime-gpu 1.19.0
