
[Performance] Java API lacks functionality to control allocator settings. #18845

Open

ivanthewebber opened this issue Dec 15, 2023 · 10 comments

Labels: api:Java (issues related to the Java API)

Comments

@ivanthewebber

Describe the issue

The Java API provides no way to control the arena allocator settings (e.g. setting "arena_extend_strategy" to "kSameAsRequested", or configuring "max_mem", "max_dead_bytes_per_chunk", and "initial_chunk_size_bytes").

This means memory is wasted and startup cannot be tuned. It also means that a memory leak gets the entire container OOMKilled instead of producing a reasonable error message (as it would with a sensible "max_mem").

I've looked for a way to configure this and found nothing. It seems like it would be straightforward to forward these settings to the underlying C implementation.
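
For context, these knobs exist in the C API as an OrtArenaCfg built from key/value pairs via CreateArenaCfgV2. Below is a purely hypothetical sketch of what forwarding them through Java might look like; the OrtArenaCfg class and setCpuArenaCfg method are imagined here and are not part of ai.onnxruntime:

```java
import java.util.Map;

public final class ArenaConfigSketch {
    public static void main(String[] args) {
        // The key names and the integer encoding of the extend strategy mirror
        // the C API's CreateArenaCfgV2 (0 == kNextPowerOfTwo, 1 == kSameAsRequested).
        Map<String, Long> arenaConfig = Map.of(
                "arena_extend_strategy", 1L,           // kSameAsRequested
                "max_mem", 512L * 1024 * 1024,         // hard cap: fail cleanly instead of an OOMKill
                "initial_chunk_size_bytes", 1L << 20,  // smaller first chunk for faster startup
                "max_dead_bytes_per_chunk", 128L << 10);

        // HYPOTHETICAL surface, loosely following the Python OrtArenaCfg pattern;
        // neither OrtArenaCfg nor setCpuArenaCfg exists in the Java API today:
        // OrtArenaCfg cfg = new OrtArenaCfg(arenaConfig);
        // sessionOptions.setCpuArenaCfg(cfg);
        System.out.println(arenaConfig);
    }
}
```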

To reproduce

Use the Java API.

Urgency

It's causing problems for me at work.

Platform

Linux

OS Version

AKS Docker image based on Mariner

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.2

ONNX Runtime API

Java

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

github-actions bot added the api:Java label on Dec 15, 2023
@ivanthewebber (Author)

I'm trying to use ONNX Runtime for stream processing with Apache Flink in a low-latency, high-throughput, and memory-constrained setting.

See this paper comparing ONNX Runtime and alternatives for this use case; its source code is similar to my own usage.

@Craigacp (Contributor)

I think a bunch of those are possible for CUDA, as we expose an add method on the CUDA EP options, but you're right that we don't expose memory allocators at all for CPUs.

It's not straightforward to design an API which exposes the allocators. At the moment there's a single default allocator used everywhere, and it's not exposed in any of the value construction methods, so it would be a substantial effort to build an API around that, OrtMemoryInfo, and OrtArenaCfg. It's on the todo list, as it will enable direct allocation of GPU memory, which can be useful, but it needs careful design.
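
For the CUDA path mentioned above, a rough sketch of that existing pass-through follows; the model path and memory values are placeholders, the key names follow the CUDA EP's documented options, and a CUDA-enabled ORT build is assumed:

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import ai.onnxruntime.providers.OrtCUDAProviderOptions;

public final class CudaArenaSketch {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();

        // The CUDA EP options expose a generic add(key, value) pass-through,
        // so its memory settings can be reached even without a typed API.
        OrtCUDAProviderOptions cudaOpts = new OrtCUDAProviderOptions(0); // device 0
        cudaOpts.add("arena_extend_strategy", "kSameAsRequested");
        cudaOpts.add("gpu_mem_limit", String.valueOf(2L * 1024 * 1024 * 1024));

        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            opts.addCUDA(cudaOpts);
            try (OrtSession session = env.createSession("model.onnx", opts)) {
                // run inference...
            }
        }
    }
}
```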

@ivanthewebber (Author) commented Dec 18, 2023

It seems like you could follow the same patterns as the Python API and just translate some of the implementation. Let me know if you're able to add this to your backlog and what the timeline would be. Otherwise I'll look for a workaround or an alternative like onnx-scala.

@Craigacp (Contributor)

Python is a little easier as it doesn't have to deal with concurrency, so they can get away with a laxer API. I'll scope out the amount of work in the new year.

@github-actions (bot)

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added the stale label (issues that have not been addressed in a while; categorized by a bot) on Jan 18, 2024
@Craigacp (Contributor)

Keep this issue open; it can track CPU allocator settings.

github-actions bot removed the stale label on Jan 21, 2024
@ivanthewebber (Author)

Any updates? Also, if I set the number of inter-op and intra-op threads to 1 and share a session object across many threads, would each thread calling run be able to execute in parallel, or would the ONNX thread's affinity be tied to a single CPU?

@Craigacp (Contributor)

No updates. I'm waiting for this PR (#18556) to be merged before starting on more memory-management-related issues.

I believe the thread you send into ORT is used for compute, so if you have concurrent requesting threads then those threads will concurrently execute the model.
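
A minimal sketch of that sharing pattern, assuming a placeholder model path and input name: with intra-op and inter-op threads set to 1, each caller thread drives its own compute, and run may be invoked concurrently on the shared session.

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class SharedSessionSketch {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            opts.setIntraOpNumThreads(1); // compute runs on the calling thread
            opts.setInterOpNumThreads(1);
            try (OrtSession session = env.createSession("model.onnx", opts)) {
                ExecutorService pool = Executors.newFixedThreadPool(4);
                for (int i = 0; i < 4; i++) {
                    pool.submit(() -> {
                        // run() is safe to call concurrently on a shared session;
                        // "input" is a placeholder for the model's input name.
                        try (OnnxTensor input = OnnxTensor.createTensor(env, new float[][] {{1f, 2f, 3f}});
                             OrtSession.Result result = session.run(Map.of("input", input))) {
                            // consume result...
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    });
                }
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.MINUTES);
            }
        }
    }
}
```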

@ivanthewebber (Author)

Any updates? I have my fingers crossed that some work on this will get planned.

@Craigacp (Contributor)

Not yet.
