[Feature request] API for controlling memory residency of inference sessions #18142
Comments
Hi @axodox, sorry to intrude. Can I ask you how to destroy the sessions after use? I've never managed to do it so far. Thanks!
Hi @MatteoPagliani, I just destroy the Ort::Session object and that does it - I am using the C++ API.
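For reference, a minimal C++ sketch of that approach, assuming the standard ONNX Runtime C++ API and a placeholder model path: the session's resources, including its device memory, are released when the Ort::Session object is destroyed.

```cpp
#include <onnxruntime_cxx_api.h>
#include <optional>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "demo"};
  Ort::SessionOptions options;

  // Wrapping the session in std::optional makes the destruction explicit;
  // simply letting the object go out of scope works just as well.
  std::optional<Ort::Session> session;
  session.emplace(env, ORT_TSTR("model.onnx"), options);

  // ... run inference with *session ...

  // Destroying the Ort::Session releases the session and its device memory.
  session.reset();
}
```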
Oh okay, I thought you were using the Python API. Do you know how to do that in Python?
Hi, can anyone help me with destroying an inference session in the JavaScript API? I tried using dispose, but it didn't work.
I have no experience with ONNX Runtime in Python or JavaScript, but based on the C and C++ APIs I would look for a method like CloseSession, session.Close, or similar.
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.
Original issue description

When working with more complex AI pipelines that involve multiple sessions and very large models (e.g. Stable Diffusion), it is a frequent situation that each session fits into device memory (in most cases VRAM) on its own, but they do not all fit at once.
We can of course try to optimize the models to use less memory, but this is not always feasible. Otherwise, as far as I can see, we have two choices: keep all sessions alive and risk running out of device memory, or destroy and recreate sessions as we switch between them and pay the session creation cost every time.
Based on this, I would like to propose adding a new API to manage the memory residency of sessions. Using the API, the user could mark a session as evictable, which would notify the related execution providers that they may page out the session's memory when there is memory pressure from other sessions. A rough sketch of what such an API could look like is shown below.
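This is a hypothetical C++ sketch only: SetSessionEvictable is a placeholder name invented for illustration and is not an existing ONNX Runtime function.

```cpp
#include <onnxruntime_cxx_api.h>

// Placeholder declaration for the proposed API (not part of ONNX Runtime today):
// an evictable session's device memory may be paged out under memory pressure
// and is paged back in by the execution provider before the session runs again.
void SetSessionEvictable(Ort::Session& session, bool evictable);

void SwitchActiveModel(Ort::Session& previous, Ort::Session& next) {
  // The previous session's memory may now be paged out if another session needs the space...
  SetSessionEvictable(previous, true);
  // ...while the next session should stay resident while it is in use.
  SetSessionEvictable(next, false);
}
```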
I have already created a prototype implementation PR with DirectML and tried it in my app Unpaint; with that, I could completely eliminate the need for inference session recreation and achieve full performance even when not all models fit into VRAM at the same time, while the system remained fully responsive. With the change, instead of running out of video memory, the runtime quickly moves unused GPU memory pages to system memory and back as I switch between the sessions, so inference can run at full speed. A usage sketch follows this paragraph.
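As a usage illustration, again with the placeholder SetSessionEvictable from the sketch above, a Stable Diffusion-style pipeline could mark each session evictable as soon as its stage has finished (Run calls elided):

```cpp
void GenerateImage(Ort::Session& textEncoder, Ort::Session& unet, Ort::Session& vae) {
  SetSessionEvictable(textEncoder, false);
  // ... run the text encoder on the prompt ...
  SetSessionEvictable(textEncoder, true);  // its memory may be paged out now

  SetSessionEvictable(unet, false);
  // ... run the denoising loop ...
  SetSessionEvictable(unet, true);

  SetSessionEvictable(vae, false);
  // ... decode the latents into the final image ...
  SetSessionEvictable(vae, true);
}
```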