
CUDAExecutionProvider doesn't seem to be used during inference of transformers exported model to ONNX runtime GPU #22325

Open
cooper-a opened this issue Oct 4, 2024 · 3 comments
Labels
- ep:CUDA (issues related to the CUDA execution provider)
- model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)
- stale (issues that have not been addressed in a while; categorized by a bot)

Comments

@cooper-a commented Oct 4, 2024

Describe the issue

We are seeing an issue with a Transformer model that was exported using torch.onnx.export and then optimized with optimum's ORTOptimizer: inference appears to run only on the CPU, not the GPU.
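For context, here is a minimal sketch of the export-and-optimize flow described above. The exact script was not shared in this issue, so the checkpoint name, save paths, opset, and optimization level below are placeholders:

```python
# Hypothetical reconstruction of the export/optimize flow; the actual
# script is not included in this issue.
import os

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "microsoft/deberta-v3-base"  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Export to ONNX with dynamic batch/sequence axes.
os.makedirs("onnx_model", exist_ok=True)
dummy = tokenizer("hello world", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "onnx_model/model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=17,  # opset is an assumption
)

# Optimize the exported graph with optimum's ORTOptimizer.
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# The directory must also contain the model's config.json.
optimizer = ORTOptimizer.from_pretrained("onnx_model")
optimizer.optimize(
    save_dir="onnx_model_optimized",
    optimization_config=OptimizationConfig(optimization_level=2),  # level is an assumption
)
```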

The model was exported on a CPU machine using ONNX 1.16.0. We see the following warnings when starting the inference session:

```
[W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 36 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-10-04 20:35:10.514629537 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
```

### To reproduce

The model was deployed in an Ubuntu 20.04 Docker container on the Azure SKU [NCasT4_v3](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ncast4v3-series?tabs=sizebasic) with the following versions:

- onnxruntime-gpu 1.18.0
- CUDA 11.8
- cuDNN 8.9.6.50

**Inferencing Code**

```python
import os

import onnxruntime as ort
import torch
from transformers import DebertaV2Tokenizer


class UnifiedModelOnnx(BaseModel):  # BaseModel comes from our codebase
    def __init__(self, model_root, gpu_mem_limit=12, device=None):
        self.model_path = os.path.join(model_root, "model.onnx")

        self.tokenizer = DebertaV2Tokenizer.from_pretrained(model_root)
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") if device is None else device
        print(f"Using device {self.device}")
        providers = [
            (
                "CUDAExecutionProvider",
                {
                    "device_id": 0,
                    "arena_extend_strategy": "kNextPowerOfTwo",
                    "gpu_mem_limit": gpu_mem_limit * 1024 * 1024 * 1024,
                    "cudnn_conv_algo_search": "EXHAUSTIVE",
                    "do_copy_in_default_stream": True,
                },
            )
        ]
        ort_session_options = ort.SessionOptions()
        ort_session_options.enable_cpu_mem_arena = False
        # Pass the SessionOptions explicitly; otherwise they are silently ignored.
        self.ort_session = ort.InferenceSession(
            self.model_path, sess_options=ort_session_options, providers=providers
        )
        # add_run_config_entry returns None, so configure the RunOptions
        # object in a separate statement rather than chaining.
        self.run_config = ort.RunOptions()
        self.run_config.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
```

We also ran `ort.get_available_providers()`, which returned `['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'AzureExecutionProvider', 'CPUExecutionProvider']`,

and `self.ort_session.get_providers()`, which returned `['CUDAExecutionProvider', 'CPUExecutionProvider']`.
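To confirm which execution provider actually runs each node, ORT's built-in profiler records a per-kernel trace. A minimal sketch; the input names and shapes are assumptions based on a typical DeBERTa export:

```python
import numpy as np
import onnxruntime as ort

# Enable profiling; the resulting Chrome-trace JSON records, for every
# kernel event, which execution provider executed it (look for
# "CUDAExecutionProvider" vs "CPUExecutionProvider" in the event args).
so = ort.SessionOptions()
so.enable_profiling = True

session = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy inputs; names and shapes depend on the exported model.
inputs = {
    "input_ids": np.ones((1, 16), dtype=np.int64),
    "attention_mask": np.ones((1, 16), dtype=np.int64),
}
session.run(None, inputs)

profile_path = session.end_profiling()  # writes the trace file to disk
print(f"Profile written to {profile_path}")
```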

### Urgency

Internal Microsoft team. Contact aliases: [email protected] or [email protected].

### Platform

Linux

### OS Version

Ubuntu 20.04

### ONNX Runtime Installation

Released Package

### ONNX Runtime Version or Commit ID

1.18.0

### ONNX Runtime API

Python

### Architecture

X86

### Execution Provider

CUDA

### Execution Provider Library Version

CUDA 11.8, cuDNN 8.9.6.50
@github-actions github-actions bot added the model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. label Oct 4, 2024
@tianleiwu (Contributor) commented:

Memcpy nodes copy data between devices (like GPU and CPU), so the session is using both the GPU and the CPU.

Could you share some information to reproduce this (e.g. the transformers/optimum/pytorch versions, and the Python script or command line used for the ONNX export and optimization)?
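For reference, device-to-host copies at the graph boundary can be reduced with I/O binding, sketched below; the input and output names are assumptions. Note this does not remove Memcpy nodes that ORT inserts inside the graph for ops that fall back to CPU:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# Place the input on the GPU up front so ORT does not need a
# host-to-device copy at graph entry.
input_ids = ort.OrtValue.ortvalue_from_numpy(
    np.ones((1, 16), dtype=np.int64), "cuda", 0
)

binding = session.io_binding()
binding.bind_ortvalue_input("input_ids", input_ids)  # name is an assumption
binding.bind_output("logits", "cuda")                # keep the output on GPU too

session.run_with_iobinding(binding)
outputs = binding.copy_outputs_to_cpu()  # list of numpy arrays
```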

@sophies927 sophies927 added the ep:CUDA issues related to the CUDA execution provider label Oct 10, 2024
github-actions bot commented:

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Nov 10, 2024
@AdamsLee commented Jan 1, 2025

I'm seeing the same issue, with:

- torch 2.4.0
- torchaudio 2.4.0
- transformers 4.40.1
- optimum 1.22.0
- onnxruntime-gpu 1.19.0
