[Mobile] Memory crash after repeated inference with dynamic shape input #22520
Comments
How are you constructing the tensors? For best performance you should be using a cache of direct `ByteBuffer`s.
I created an MRE in this repo. Regarding your question about constructing the tensors, the relevant code is here:

```kotlin
val env = OrtEnvironment.getEnvironment()
var inputTensorShape: LongArray = longArrayOf(1, 112, 112, 3)
when (modelType) {
    ModelType.ClipImageEncoder -> {
        inputTensorShape = inputShapeArray!!.map { it.toLong() }.toLongArray()
    }
    ModelType.YOLOv5Face -> {
        inputTensorShape = inputShapeArray!!.map { it.toLong() }.toLongArray()
    }
}
var buffer: ByteBuffer = ByteBuffer.allocate(0)
if (inputUint8DataArray != null) {
    buffer = ByteBuffer.wrap(inputUint8DataArray)
}
val inputTensor = OnnxTensor.createTensor(env, buffer, inputTensorShape, OnnxJavaType.UINT8)
val inputs = mutableMapOf<String, OnnxTensor>()
inputs["input"] = inputTensor
val outputs = session.run(inputs)
val outputTensor = (outputs[0].value as Array<FloatArray>)
val flatList = outputTensor.flattenToFloatArray()
withContext(Dispatchers.Main) {
    result.success(flatList)
}
outputs.close()
inputTensor.close()
buffer.clear()
```

I'm not used to writing Kotlin code, so I might be missing something obvious. If so, any pointers on how to solve this would be appreciated! If not, then it's probably a memory management issue in ORT.
You should use `ByteBuffer.allocateDirect` rather than wrapping a heap array. If your inputs are always of the same size (or a small set of sizes) then keep around a cache of the buffers; you can rewrite the entries (via `put`) and reuse them across inference calls.
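A minimal sketch of that buffer-reuse pattern, assuming the input arrives as a `ByteArray` (the `bufferCache` map and `tensorFor` helper are illustrative names, not part of the ORT API):

```kotlin
import ai.onnxruntime.OnnxJavaType
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import java.nio.ByteBuffer

// Hypothetical cache of direct buffers keyed by input size, so repeated calls with
// same-sized inputs reuse native memory instead of allocating fresh buffers each time.
private val bufferCache = mutableMapOf<Int, ByteBuffer>()

fun tensorFor(env: OrtEnvironment, data: ByteArray, shape: LongArray): OnnxTensor {
    val buffer = bufferCache.getOrPut(data.size) { ByteBuffer.allocateDirect(data.size) }
    buffer.clear()   // reset position/limit before rewriting the contents
    buffer.put(data)
    buffer.flip()    // expose exactly the bytes just written to ORT
    return OnnxTensor.createTensor(env, buffer, shape, OnnxJavaType.UINT8)
}
```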
Thanks for pointing this out! I have changed it to the following:

```kotlin
val buffer: ByteBuffer = ByteBuffer.allocateDirect(inputUint8DataArray!!.size)
buffer.put(inputUint8DataArray)
buffer.flip()
```

Despite this change I'm still seeing a lot of GC work in the logs. So I'm afraid this change alone doesn't solve the issue.
I'm not sure I understand this comment. As far as I understand, the ORT Java API only takes byte buffers for creating tensors of this type of data (uint8). Is there some other way of creating the ONNX tensor that I'm not aware of?

Unfortunately the inputs are very dynamic and can be of any size, so I don't think this would help.
Lots of GC work just means you're creating a lot of garbage. If you keep passing in large bitmaps allocated in fresh objects then it'll necessarily have to create garbage. You can try to modify your code so you write directly to the buffer from the image source rather than having intermediate arrays, but I don't know what the rest of your codebase looks like.
Yeah, there's no way to create a uint8 input aside from a buffer, but if you have other inputs you should not use arrays to create those.
If there's an upper bound on the size then you can allocate buffers of that size, set the limit on them as appropriate for the image you've got, and pass it in to tensor construction. ORT doesn't care if the buffer has other stuff in it provided you've set the position and limit correctly.
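As a rough sketch of that upper-bound approach (the maximum size below is only an illustration, not a recommendation), the key point is that `flip()` sets the limit to the number of bytes actually written:

```kotlin
import ai.onnxruntime.OnnxJavaType
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import java.nio.ByteBuffer

// One direct buffer sized for the largest expected input, allocated once and reused.
private val maxInputBuffer: ByteBuffer = ByteBuffer.allocateDirect(5000 * 5000 * 4)

fun uint8TensorFrom(env: OrtEnvironment, data: ByteArray, shape: LongArray): OnnxTensor {
    maxInputBuffer.clear()    // position = 0, limit = capacity
    maxInputBuffer.put(data)  // write only this image's bytes
    maxInputBuffer.flip()     // position = 0, limit = bytes written
    // ORT reads from position up to limit, so the unused tail of the buffer is ignored.
    return OnnxTensor.createTensor(env, maxInputBuffer, shape, OnnxJavaType.UINT8)
}
```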
First of all, I really appreciate all the pointers, thank you so much for your help @Craigacp! 🙏

This will be tricky, since the app is in fact a Flutter app where I'm writing a platform plugin to access the Java API. Unfortunately the data I have in Flutter/Dart can only be passed to Kotlin as a `ByteArray`.

This is a great idea, thanks for the suggestion! Unfortunately I cannot seem to get it to work, or at least I'm not able to reduce the GC calls through this strategy. To make sure I'm not making a dumb mistake, here is the code I'm using. First I initialize permanent direct buffers inside the class I'm using:

```kotlin
private val yoloBuffer = ByteBuffer.allocateDirect(5000 * 5000 * 4)
private val clipBuffer = ByteBuffer.allocateDirect(5000 * 5000 * 4)
```

Then in my `predict` function:

```kotlin
private fun predict(modelType: ModelType, sessionAddress: Int, inputUint8DataArray: ByteArray? = null, inputShapeArray: IntArray? = null, result: Result) {
    scope.launch {
        val modelState = sessionMap[modelType]
        val session = modelState?.sessionAddresses?.get(sessionAddress)
        if (session == null) {
            withContext(Dispatchers.Main) {
                result.error("SESSION_NOT_FOUND", "Session not found for address: $sessionAddress", null)
            }
            return@launch
        }
        try {
            val env = OrtEnvironment.getEnvironment()
            var inputTensorShape: LongArray = longArrayOf(1, 112, 112, 3)
            var inputTensor: OnnxTensor? = null
            when (modelType) {
                ModelType.ClipImageEncoder -> {
                    inputTensorShape = inputShapeArray!!.map { it.toLong() }.toLongArray()
                    clipBuffer.clear()
                    clipBuffer.put(inputUint8DataArray!!)
                    clipBuffer.flip()
                    inputTensor = OnnxTensor.createTensor(env, clipBuffer, inputTensorShape, OnnxJavaType.UINT8)
                }
                ModelType.YOLOv5Face -> {
                    inputTensorShape = inputShapeArray!!.map { it.toLong() }.toLongArray()
                    yoloBuffer.clear()
                    yoloBuffer.put(inputUint8DataArray!!)
                    yoloBuffer.flip()
                    inputTensor = OnnxTensor.createTensor(env, yoloBuffer, inputTensorShape, OnnxJavaType.UINT8)
                }
            }
            val inputs = mutableMapOf<String, OnnxTensor>()
            inputs["input"] = inputTensor!!
            val outputs = session.run(inputs)
            val outputTensor = (outputs[0].value as Array<FloatArray>)
            val flatList = outputTensor.flattenToFloatArray()
            withContext(Dispatchers.Main) {
                result.success(flatList)
            }
            outputs.close()
            inputTensor?.close()
        } catch (e: OrtException) {
            withContext(Dispatchers.Main) {
                result.error("PREDICTION_ERROR", "Error during prediction: ${e.message} ${e.stackTraceToString()}", null)
            }
        } catch (e: Exception) {
            Log.e(TAG, "Error during prediction: ${e.message}", e)
            withContext(Dispatchers.Main) {
                result.error("UNHANDLED_ERROR", "Error during prediction: ${e.message}", null)
            }
        }
    }
}
```

To be honest I'm slowly starting to lose faith that I'll ever be able to get rid of the memory issues, but I'm still motivated to try potential fixes out. So if you have any other ideas please let me know :)
You can supply a buffer as the output tensor too, assuming you know the size of the output. That will prevent ORT from allocating memory to hold the output, and also prevent the Java code from allocating a float array to store it. The input side of things looks ok in your example.
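A rough sketch of what that could look like, reusing the `session` and `inputs` from the code above, and assuming a made-up output shape of `[1, 512]`, an output named `"output"`, and an ORT release whose `OrtSession.run` accepts pre-created ("pinned") output tensors (check the javadoc for the exact overload in the version you use):

```kotlin
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Assumed output shape, for illustration only; use the model's real output shape.
val outputShape = longArrayOf(1, 512)
val outputElements = outputShape.reduce { a, b -> a * b }.toInt()

// A direct, native-ordered FloatBuffer allocated once; the output tensor is backed by it.
val outputBuffer = ByteBuffer.allocateDirect(outputElements * 4)
    .order(ByteOrder.nativeOrder())
    .asFloatBuffer()

val env = OrtEnvironment.getEnvironment()
val outputTensor = OnnxTensor.createTensor(env, outputBuffer, outputShape)

// Passing the pre-created tensor as a pinned output lets ORT write straight into
// outputBuffer rather than allocating a new output tensor (and a fresh Java float
// array) on every call.
session.run(inputs, mapOf("output" to outputTensor)).use {
    outputBuffer.rewind()
    // read the results directly from outputBuffer here
}
```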
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
We recently altered our ONNX model used in production in our mobile app to include the preprocessing steps, which were previously done separately prior to inference. Because it is an image model, this means that the model now takes as input an array of raw RGBA bytes of an image, which tends to be a lot of data. We've found that since this change, memory consumption climbs continually as the app performs more inference runs, eventually resulting in a crash.

I was wondering, is there anything we can do in our Java/Kotlin code to make sure memory is getting properly cleared, aside from the `outputs.close()` and `inputTensor.close()` calls that we already have? It seems like GC is not able to keep up with continued inference runs right now. Please see below for the crash logs. Thank you in advance for any and all help!
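For context, a minimal sketch of the pattern we're aiming for: create the tensor, run, copy the output out, and close everything even if inference throws. Kotlin's `use` on the AutoCloseable ORT objects is one way to guarantee the closes; the `runOnce` helper and its parameters are illustrative, not our real plugin code:

```kotlin
import ai.onnxruntime.OnnxJavaType
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import java.nio.ByteBuffer

fun runOnce(env: OrtEnvironment, session: OrtSession, inputBuffer: ByteBuffer, shape: LongArray): FloatArray {
    // use {} closes the tensor and the Result in a finally block, even on exceptions.
    OnnxTensor.createTensor(env, inputBuffer, shape, OnnxJavaType.UINT8).use { inputTensor ->
        session.run(mapOf("input" to inputTensor)).use { outputs ->
            // Copy the output out before the Result is closed, since closing
            // releases the native memory backing the output tensors.
            val output = outputs[0].value as Array<FloatArray>
            return output.flatMap { it.asIterable() }.toFloatArray()
        }
    }
}
```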
To reproduce
Urgency
Urgent, as this issue is happening in production, causing crashes and inconvenience for our mobile customers.
Platform
Android
OS Version
Android 14
ONNX Runtime Installation
Released Package
Compiler Version (if 'Built from Source')
No response
Package Name (if 'Released Package')
onnxruntime-android
ONNX Runtime Version or Commit ID
1.18
ONNX Runtime API
Java/Kotlin
Architecture
ARM64
Execution Provider
Default CPU
Execution Provider Library Version
No response