[Feature request] API for controlling memory residency of inference sessions #18142
Comments
Hi @axodox, sorry to intrude. Can I ask you how to destroy the sessions after use? I've never managed to do it so far. Thanks!
Hi @MatteoPagliani, I just destroy the Ort::Session object and that does it - I am using the C++ API.
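For reference, a minimal C++ sketch of that approach, assuming the standard ONNX Runtime C++ API and a placeholder model path: the session's resources, including its device memory, are released when the Ort::Session object is destroyed.

```cpp
#include <onnxruntime_cxx_api.h>
#include <optional>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "demo"};
  Ort::SessionOptions options;

  // Wrapping the session in std::optional makes the destruction explicit;
  // simply letting the object go out of scope works just as well.
  std::optional<Ort::Session> session;
  session.emplace(env, ORT_TSTR("model.onnx"), options);

  // ... run inference with *session ...

  // Destroying the Ort::Session releases the session and its device memory.
  session.reset();
}
```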
Oh okay, I thought you were using the Python API. Do you know how to do that in Python?
Hi, can anyone help me with destroying an inference session in the JavaScript API? I tried using dispose, but it didn't work.
I have no experience with ONNX Runtime in Python or JavaScript, but based on the C and C++ APIs I would look for a method like CloseSession, session.Close, or similar.
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.
Original issue description

When working with more complex AI pipelines that involve multiple sessions and very large models (e.g. Stable Diffusion), it is a frequent situation that each session fits into device memory (in most cases VRAM) on its own, but they do not all fit at once.
We can of course try to optimize the models to use less memory, but this is not always feasible. Otherwise, as far as I can see, we have two choices: keep all sessions alive and risk running out of device memory, or destroy and recreate sessions as we switch between them and pay the session creation cost every time.
Based on this, I would like to propose adding a new API to manage the memory residency of sessions. Using the API, the user could mark a session as evictable, which would notify the related execution providers that they may page out the session's memory when there is memory pressure from other sessions. A rough sketch of what such an API could look like is shown below.
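This is a hypothetical C++ sketch only: SetSessionEvictable is a placeholder name invented for illustration and is not an existing ONNX Runtime function.

```cpp
#include <onnxruntime_cxx_api.h>

// Placeholder declaration for the proposed API (not part of ONNX Runtime today):
// an evictable session's device memory may be paged out under memory pressure
// and is paged back in by the execution provider before the session runs again.
void SetSessionEvictable(Ort::Session& session, bool evictable);

void SwitchActiveModel(Ort::Session& previous, Ort::Session& next) {
  // The previous session's memory may now be paged out if another session needs the space...
  SetSessionEvictable(previous, true);
  // ...while the next session should stay resident while it is in use.
  SetSessionEvictable(next, false);
}
```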
I have already created a prototype implementation PR with DirectML and tried it in my app Unpaint; with that, I could completely eliminate the need for inference session recreation and achieve full performance even when not all models fit into VRAM at the same time, while the system remained fully responsive. With the change, instead of running out of video memory, the runtime quickly moves unused GPU memory pages to system memory and back as I switch between the sessions, so inference can run at full speed. A usage sketch follows this paragraph.
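As a usage illustration, again with the placeholder SetSessionEvictable from the sketch above, a Stable Diffusion-style pipeline could mark each session evictable as soon as its stage has finished (Run calls elided):

```cpp
void GenerateImage(Ort::Session& textEncoder, Ort::Session& unet, Ort::Session& vae) {
  SetSessionEvictable(textEncoder, false);
  // ... run the text encoder on the prompt ...
  SetSessionEvictable(textEncoder, true);  // its memory may be paged out now

  SetSessionEvictable(unet, false);
  // ... run the denoising loop ...
  SetSessionEvictable(unet, true);

  SetSessionEvictable(vae, false);
  // ... decode the latents into the final image ...
  SetSessionEvictable(vae, true);
}
```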