[Performance] Java API lacks functionality to control allocator settings. #18845
I'm trying to use ONNX Runtime for stream processing with Apache Flink in a low-latency, high-throughput, and memory-constrained setting. See this paper comparing ONNX Runtime and alternatives for this use case, with source code similar to my own usage.
I think a bunch of those are possible for CUDA, as we expose an add method on the CUDA EP options, but you're right that we don't expose memory allocators at all for CPUs. It's not straightforward to design an API which exposes the allocators: at the moment there's a single default allocator used everywhere, and it isn't exposed in any of the value construction methods, so it would be a substantial effort to build an API around that.
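For reference, a minimal sketch of the CUDA path mentioned above, assuming a CUDA-enabled ONNX Runtime build recent enough to ship `OrtCUDAProviderOptions` (1.11+); the model path and option values are placeholders:

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import ai.onnxruntime.providers.OrtCUDAProviderOptions;

public class CudaArenaOptions {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // The CUDA EP options accept string key/value pairs, so arena
        // settings can be forwarded without a dedicated allocator API.
        OrtCUDAProviderOptions cudaOpts = new OrtCUDAProviderOptions(0); // device 0
        cudaOpts.add("arena_extend_strategy", "kSameAsRequested");
        cudaOpts.add("gpu_mem_limit", "2147483648"); // 2 GiB cap, placeholder value
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            opts.addCUDA(cudaOpts);
            // "model.onnx" is a placeholder path.
            try (OrtSession session = env.createSession("model.onnx", opts)) {
                // run inference here
            }
        }
    }
}
```

Nothing equivalent exists for the default CPU arena, which is the gap this issue tracks.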
It seems like you could follow the same patterns as the Python API and just translate some of the implementation. Let me know if you're able to add this to your backlog, and the timeline. Otherwise I will be looking for a workaround or an alternative like onnx-scala.
Python is a little easier as it doesn't have to deal with concurrency, so they can get away with a laxer API. I'll scope out the amount of work in the new year.
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Keep this issue open; it can track CPU allocator settings.
Any updates? Also, if I set the number of inter-op and intra-op threads to 1 and share a session object across many threads, would each thread calling run execute the model concurrently?
No updates, I'm waiting for this PR (#18556) to be merged before starting on more memory-management-related issues. I believe the thread you send in to ORT is used for compute, so if you have concurrent requesting threads then those threads will concurrently execute the model.
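To illustrate the point about caller threads driving compute, here's a minimal sketch of sharing one session across threads; the model path and input name are placeholders, and it relies on OrtSession.run being safe to call concurrently, as the reply above suggests:

```java
import ai.onnxruntime.*;

import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedSessionExample {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            // Pin ORT's own thread pools to one thread each; all parallelism
            // then comes from the caller threads below.
            opts.setIntraOpNumThreads(1);
            opts.setInterOpNumThreads(1);
            // "model.onnx" and the input name "input" are placeholders.
            try (OrtSession session = env.createSession("model.onnx", opts)) {
                ExecutorService pool = Executors.newFixedThreadPool(4);
                for (int i = 0; i < 4; i++) {
                    pool.submit(() -> {
                        try (OnnxTensor input =
                                 OnnxTensor.createTensor(env, new float[][] {{1f, 2f, 3f}});
                             OrtSession.Result result = session.run(Map.of("input", input))) {
                            // Each submitting thread drives its own compute.
                            System.out.println(result.get(0).getValue());
                        } catch (OrtException e) {
                            e.printStackTrace();
                        }
                    });
                }
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.MINUTES);
            }
        }
    }
}
```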
Any updates? I have my fingers crossed that some work on this will get planned.
Not yet. |
Describe the issue
The Java API is very limited, with no way to control the arena allocator settings (e.g. setting "arena_extend_strategy" to "kSameAsRequested", or configuring "max_mem", "max_dead_bytes_per_chunk", or "initial_chunk_size_bytes").
This of course means that memory is wasted and startup cannot be optimized. Also, if there is a memory leak, the entire container gets OOMKilled instead of producing a reasonable error message (as it would with a sensible max_mem).
I've tried looking for any way to configure it but found nothing. It seems like it would be really easy to forward some configuration options to the underlying C implementation, as sketched below.
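A hypothetical sketch of what forwarding those settings could look like on the Java side; none of these classes or setters exist today, they simply mirror the C API's OrtArenaCfg keys listed above:

```java
// Hypothetical API sketch: OrtArenaConfig and setCpuArenaConfig do NOT
// exist in the Java API today; the names mirror the C API's OrtArenaCfg.
OrtArenaConfig arenaConfig = new OrtArenaConfig.Builder()
        .arenaExtendStrategy("kSameAsRequested")
        .maxMem(512L * 1024 * 1024)        // fail with an error instead of an OOMKill
        .maxDeadBytesPerChunk(64 * 1024)
        .initialChunkSizeBytes(1024 * 1024)
        .build();

try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
    opts.setCpuArenaConfig(arenaConfig);   // hypothetical setter
    // createSession(...) as usual
}
```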
To reproduce
Use the Java API.
Urgency
It's causing problems for me at work.
Platform
Linux
OS Version
Docker image on AKS, based on the Mariner image
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.16.2
ONNX Runtime API
Java
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
No