
[Feature request] API for controlling memory residency of inference sessions #18142

Closed
axodox opened this issue Oct 28, 2023 · 7 comments
Labels
  • ep:DML: issues related to the DirectML execution provider
  • platform:windows: issues related to the Windows platform
  • stale: issues that have not been addressed in a while; categorized by a bot

Comments


axodox commented Oct 28, 2023

When working with complex AI pipelines that involve multiple sessions with very large models (e.g. Stable Diffusion), it frequently happens that each session fits into device memory (which is in most cases VRAM) on its own, but they do not all fit at once.

We can of course try to optimize the models to use less memory, but this is not always feasible. Otherwise, as far as I can see, we have two choices:

  • We can destroy the sessions after use and instantiate them one at a time, so we never allocate that much memory at once. This can indeed work, but there is a significant problem: creating a session with a large model can be slow. For example, a freshly Olive-optimized 4.5 GB Stable Diffusion XL u-net model takes around 17 seconds to load on my machine with DirectML, even after it has been loaded previously and all files are cached in RAM, as processing the very complex graph takes long (and this is with all graph optimizations disabled).
  • We can let the memory be over-committed and risk running out of it. In this case the OS will try to shuffle memory pages around to free up enough memory to do the work. This can still be a lot faster than recreating the session, but the user experience and the inference performance can degrade greatly. For example, when Windows runs out of video memory the OS becomes unresponsive for several seconds: the UI stops updating, the mouse cursor hitches, and a YouTube video playing in the background skips, interrupting audio playback as well.

For reference, the above was measured on a machine with an AMD 3900X 12-core CPU, an Nvidia 3080 Ti GPU with 12 GB VRAM, 64 GB RAM, and a 4 TB Samsung 990 Pro NVMe SSD.

Based on this, I would like to propose adding a new API to manage the memory residency of sessions. Using the API, the user could mark a session as evictable, which would notify the related execution providers that they may page out the session's memory when there is memory pressure from other sessions.
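
To illustrate the idea, here is a hypothetical sketch of what such an API could look like in the C++ bindings; SetMemoryResidency, OrtMemoryResidency, and RunUNetSteps are invented names for this example, not existing ONNX Runtime APIs:

```cpp
// Hypothetical sketch only: SetMemoryResidency and OrtMemoryResidency
// are invented names, not part of the actual ONNX Runtime API.
Ort::Session textEncoder{env, L"text_encoder.onnx", options};
Ort::Session unet{env, L"unet.onnx", options};

// The text encoder has produced its embeddings for this generation;
// allow the execution provider to page its memory out under pressure.
textEncoder.SetMemoryResidency(OrtMemoryResidency::Evictable);

// Run the u-net denoising loop at full speed; the provider may now
// reclaim the text encoder's VRAM if it needs to.
RunUNetSteps(unet);

// Before the next prompt, request residency again so the text
// encoder's first Run is not stalled by page-ins.
textEncoder.SetMemoryResidency(OrtMemoryResidency::Resident);
```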

I have already created a prototype implementation PR with DirectML and integrated it into my app Unpaint. With it I could completely eliminate the need to recreate inference sessions and achieve full performance even when not all models fit into VRAM at the same time, while the system remained fully responsive. With the change, instead of running out of video memory, the runtime quickly moves unused GPU memory pages to system memory and back as I switch between sessions, so inference can run at full speed.

@github-actions github-actions bot added the ep:DML and platform:windows labels Oct 28, 2023
@axodox axodox changed the title API for controlling memory residency of inference sessions [Feature request] API for controlling memory residency of inference sessions Oct 28, 2023
@MatteoPagliani

Hi @axodox, sorry to jump in. Can I ask how you destroy the sessions after use? I've never managed to do it so far. Thanks!
By the way, +1 for your feature request. It would be useful indeed.


axodox commented Oct 29, 2023

Hi @MatteoPagliani, I just destroy the Ort::Session object and that does it - I am using the C++ API.
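
In case a concrete example helps, a minimal sketch of that pattern (the model path and session options are placeholders):

```cpp
#include <onnxruntime_cxx_api.h>

Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "demo"};
Ort::SessionOptions options;
// ... append the DirectML execution provider etc. here ...

{
  // Creating the session loads the model and allocates device memory.
  Ort::Session session{env, L"model.onnx", options};
  // ... session.Run(...) as usual ...
} // Destructor runs here; the session and its memory are released.
```

If you need to destroy the session at an arbitrary point rather than at the end of a scope, holding it in a std::optional<Ort::Session> and calling reset() works the same way.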

@MatteoPagliani

Oh okay, I thought you were using the Python API. Do you know how to do that in Python?

@prithivi1

Hi, can anyone help me with how to destroy an inference session in the JavaScript API? I tried using dispose, but it didn't work.


axodox commented Oct 30, 2023

I have no experience with ONNX Runtime in Python or JavaScript, but based on the C and C++ APIs I would look for a method like CloseSession, session.Close, etc.
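
For reference, in the C API the equivalent is ReleaseSession; a minimal sketch (session creation elided):

```cpp
// C API: sessions are destroyed explicitly with ReleaseSession.
const OrtApi* api = OrtGetApiBase()->GetApi(ORT_API_VERSION);
OrtSession* session = nullptr;
// ... api->CreateSession(env, modelPath, options, &session) ...

api->ReleaseSession(session); // frees the session and its device memory
session = nullptr;            // avoid a dangling pointer
```

In Python I would expect dropping the last reference (e.g. del session) and letting garbage collection run to have the same effect, though I have not verified this.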

github-actions bot commented Nov 29, 2023

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Nov 29, 2023

github-actions bot commented Jan 3, 2024

This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.

@github-actions github-actions bot closed this as not planned Jan 3, 2024