[Performance] High CUDA memory usage with ONNX Runtime and inconsistent memory release #22297
Labels
ep:CUDA (issues related to the CUDA execution provider)
memory
performance (issues related to performance regressions)
After converting my PyTorch model to ONNX format, I noticed an issue with CUDA memory management. When processing a large input, the CUDA memory usage spikes as expected. However, for subsequent smaller inputs, the memory usage does not decrease, and the high CUDA memory allocation persists.
To mitigate this, I attempted to configure the ONNX Runtime session options as follows:
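Roughly the following (the model path and exact option values here are illustrative; the settings are the standard `SessionOptions` and CUDA EP memory knobs):

```python
import onnxruntime as ort

# Disable arena/memory-pattern planning, hoping ONNX Runtime would
# stop holding on to the peak allocation between runs.
sess_options = ort.SessionOptions()
sess_options.enable_cpu_mem_arena = False
sess_options.enable_mem_pattern = False
sess_options.enable_mem_reuse = False

providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 0,
            # Grow the CUDA arena only by the requested amount
            # instead of the default power-of-two doubling.
            "arena_extend_strategy": "kSameAsRequested",
        },
    ),
    "CPUExecutionProvider",
]

session = ort.InferenceSession("model.onnx", sess_options, providers=providers)
```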
These settings increased the inference time but did not improve CUDA memory usage; the memory was still not released after processing smaller inputs.
Expected Behavior:
CUDA memory usage should decrease after processing smaller inputs, releasing the previously allocated memory.
Observed Behavior:
CUDA memory is not released when the model receives a shorter input sequence after processing a longer one, whereas the PyTorch model releases memory in that case.
Is there any way to optimize CUDA memory usage in ONNX Runtime for this case?
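One option I have seen in the ONNX Runtime docs but have not verified for this model is per-run arena shrinkage, which asks the allocator to return unused arena chunks after a `run` call. A self-contained sketch (model path, input name, and shapes are placeholders):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

run_options = ort.RunOptions()
# Ask the memory arena on GPU 0 to shrink back after this run.
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

small_input = np.random.rand(1, 32).astype(np.float32)
outputs = session.run(None, {"input": small_input}, run_options=run_options)
```

Would this be the recommended approach here, and is there a cost to enabling it on every run?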
To reproduce
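A minimal sketch of the pattern that triggers the behavior (placeholder model path, input name, and shapes; the real model has a dynamic sequence axis):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# Long input first: CUDA memory spikes, as expected.
long_input = np.random.rand(1, 4096).astype(np.float32)
session.run(None, {"input": long_input})

# Shorter input afterwards: memory usage stays at the earlier peak.
short_input = np.random.rand(1, 32).astype(np.float32)
session.run(None, {"input": short_input})
```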
Urgency
Platform
Linux
OS Version
Ubuntu 22.04.4 LTS
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.19.2
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
cuda_12.1.r12.1
Model File
No response
Is this a quantized model?
No