
[Mobile] Android Whisper ONNX model gives out-of-memory error during runtime #19514

Closed
suyash-narain opened this issue Feb 14, 2024 · 12 comments
Labels: platform:mobile (issues related to ONNX Runtime mobile; typically submitted using template)

@suyash-narain commented Feb 14, 2024

Describe the issue

I am using the Android local example sourced from https://github.com/microsoft/onnxruntime-inference-examples/tree/main/mobile/examples/whisper/local/android

I created the whisper-base ONNX model following the steps here: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/mobile/examples/whisper/local/android#generate-the-model

I created the whisper-base_cpu_int8.onnx model with --no_audio_decoder enabled.

I am using an aarch64 Android mobile device running Android 12 with 2 GB of RAM.

When I use the Android app example with the whisper_tiny_cpu_int8_model.onnx provided by default, it works without issues. The default model is around 74 MB.

When I replace this model with whisper_base_cpu_int8_model.onnx, which is 140 MB, it fails with an out-of-memory error, as it tries to allocate from the Java heap instead of native memory.
All I did was change the model; I did not change the code anywhere. While running the app I still have 500 MB of free RAM available, yet I still get the out-of-memory error.

The error log is below:

02-13 20:54:39.039 1973 2038 I le.whisperLoca: Starting a blocking GC Alloc
02-13 20:54:39.039 1973 2038 I le.whisperLoca: Starting a blocking GC Alloc
02-13 20:54:39.063 1973 2038 I le.whisperLoca: Alloc young concurrent copying GC freed 2084(104KB) AllocSpace objects, 2(1488KB) LOS objects, 3% free, 138MB/143MB, paused 137us,47us total 23.358ms
02-13 20:54:39.063 1973 2038 I le.whisperLoca: Forcing collection of SoftReferences for 133MB allocation
02-13 20:54:39.064 1973 2038 I le.whisperLoca: Starting a blocking GC Alloc
02-13 20:54:39.140 1973 2038 I le.whisperLoca: Alloc concurrent copying GC freed 399(27KB) AllocSpace objects, 2(134MB) LOS objects, 49% free, 3879KB/7758KB, paused 128us,39us total 75.682ms
02-13 20:54:41.492 1973 2038 I le.whisperLoca: Starting a blocking GC Alloc
02-13 20:54:41.492 1973 2038 I le.whisperLoca: Starting a blocking GC Alloc
02-13 20:54:41.507 1973 2038 I le.whisperLoca: Alloc young concurrent copying GC freed 49(47KB) AllocSpace objects, 0(0B) LOS objects, 4% free, 137MB/143MB, paused 148us,35us total 15.306ms
02-13 20:54:41.508 1973 2038 I le.whisperLoca: Forcing collection of SoftReferences for 133MB allocation
02-13 20:54:41.508 1973 2038 I le.whisperLoca: Starting a blocking GC Alloc
02-13 20:54:41.534 1973 2038 I le.whisperLoca: Alloc concurrent copying GC freed 738(47KB) AllocSpace objects, 0(0B) LOS objects, 4% free, 137MB/143MB, paused 298us,37us total 26.083ms
02-13 20:54:41.535 1973 2038 W le.whisperLoca: Throwing OutOfMemoryError "Failed to allocate a 140086368 byte allocation with 6291456 free bytes and 118MB until OOM, target footprint 150295640, growth limit 268435456" (VmSize 14958388 kB)
02-13 20:54:41.535 1973 2038 I le.whisperLoca: Starting a blocking GC Alloc
02-13 20:54:41.535 1973 2038 I le.whisperLoca: Starting a blocking GC Alloc
02-13 20:54:41.549 1973 2038 I le.whisperLoca: Alloc young concurrent copying GC freed 4(31KB) AllocSpace objects, 0(0B) LOS objects, 4% free, 137MB/143MB, paused 128us,40us total 13.928ms
02-13 20:54:41.550 1973 2038 I le.whisperLoca: Forcing collection of SoftReferences for 133MB allocation
02-13 20:54:41.550 1973 2038 I le.whisperLoca: Starting a blocking GC Alloc
02-13 20:54:41.575 1973 2038 I le.whisperLoca: Alloc concurrent copying GC freed 34(16KB) AllocSpace objects, 0(0B) LOS objects, 4% free, 137MB/143MB, paused 128us,40us total 25.049ms
02-13 20:54:41.576 1973 2038 W le.whisperLoca: Throwing OutOfMemoryError "Failed to allocate a 140086368 byte allocation with 6291456 free bytes and 118MB until OOM, target footprint 150294824, growth limit 268435456" (VmSize 14958388 kB)

How can I proceed to debug this? I should not be getting a memory error here. The part of the app that should use the native heap is using the Java heap instead. How can I fix this?

To reproduce

model is here: https://drive.google.com/file/d/1mrtEbq4PGcfppTVfGK2wMqRfniBivtB5/view?usp=sharing

Build the application in Android Studio using https://github.com/microsoft/onnxruntime-inference-examples/tree/main/mobile/examples/whisper/local/android, replace the tiny model in the android/app/src/main/res/raw directory with the base model, rebuild, and execute on the device.
It will give an OutOfMemoryError.

Urgency

urgent

Platform

Android

OS Version

Android 12

ONNX Runtime Installation

Released Package

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

onnxruntime-android

ONNX Runtime Version or Commit ID

1.17.0

ONNX Runtime API

Java/Kotlin

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response

suyash-narain added the platform:mobile label on Feb 14, 2024
@skottmckay (Contributor)

There's nothing in the implementation that allocates memory differently based on the model. The ORT code that runs the model is all C++ and knows nothing about the Java heap; the ORT Java bindings are a thin layer that calls this C++ API.

The memory usage for a model is not 1:1 with the model size either. Data needs to be converted from the stored format to its in-memory representation. Per-node optimizations may use more memory to change the layout of data so execution is more efficient. Running the model is most likely going to require more memory as well.

Add to that it will take a lot more processing to run a model 2x as large.

02-13 20:54:41.535 1973 2038 W le.whisperLoca: Throwing OutOfMemoryError "Failed to allocate a 140086368 byte allocation with 6291456 free bytes and 118MB until OOM, target footprint 150295640, growth limit 268435456" (VmSize 14958388 kB)

The 140086368-byte allocation seems to match the model size, so possibly it's running out of memory while loading the model on the Java side. Does the "118MB until OOM" indicate you only have 118 MB available?

@suyash-narain (Author)

Thank you for your reply.
Here is my dumpsys meminfo output for the app using the base model:

dumpsys meminfo ai.onnxruntime.example.whisperLocal
Applications Memory Usage (in Kilobytes):
Uptime: 6028627 Realtime: 6028627

** MEMINFO in pid 3487 [ai.onnxruntime.example.whisperLocal] **
                   Pss  Private  Private  SwapPss      Rss     Heap     Heap     Heap
                 Total    Dirty    Clean    Dirty    Total     Size    Alloc     Free
                ------   ------   ------   ------   ------   ------   ------   ------
  Native Heap    25735    25640        0        4    28204    53608    15868    32798
  Dalvik Heap   140025   139660        0        0   148148     6254     3127     3127
 Dalvik Other     2939     2684        0        0     4568
        Stack      604      604        0        0      616
       Ashmem       25        0        0        0      536
    Other dev       22        0       20        0      288
     .so mmap     7989      312     3360        4    63780
    .jar mmap     2553        0      108        0    30908
    .apk mmap     7185      376     6396        0     9812
    .ttf mmap       83        0        0        0      568
    .dex mmap    10379    10368        0        0    10836
    .oat mmap      216        0        0        0     4140
    .art mmap     7155     6088      536       21    19236
   Other mmap       61        8        0        0     1172
      Unknown      603      560        0        1     1144
        TOTAL   205604   186300    10420       30   323956    59862    18995    35925

 App Summary
                       Pss(KB)                        Rss(KB)
                        ------                         ------
           Java Heap:   146284                         167384
         Native Heap:    25640                          28204
                Code:    20940                         120564
               Stack:      604                            616
            Graphics:        0                              0
       Private Other:     3252
              System:     8884
             Unknown:                                    7188

           TOTAL PSS:   205604            TOTAL RSS:   323956       TOTAL SWAP PSS:       30

 Objects
               Views:       24         ViewRootImpl:        1
         AppContexts:        6           Activities:        1
              Assets:        2        AssetManagers:        0
       Local Binders:       11        Proxy Binders:       32
       Parcel memory:        3         Parcel count:       12
    Death Recipients:        0      OpenSSL Sockets:        0
            WebViews:        0

 SQL
         MEMORY_USED:        0
  PAGECACHE_OVERFLOW:        0          MALLOC_SIZE:        0

And this is the dumpsys output for the tiny model:

dumpsys meminfo ai.onnxruntime.example.whisperLocal
Applications Memory Usage (in Kilobytes):
Uptime: 6963380 Realtime: 6963380

** MEMINFO in pid 3865 [ai.onnxruntime.example.whisperLocal] **
                   Pss  Private  Private  SwapPss      Rss     Heap     Heap     Heap
                 Total    Dirty    Clean    Dirty    Total     Size    Alloc     Free
                ------   ------   ------   ------   ------   ------   ------   ------
  Native Heap   272332   272288        0       49   273648   414856   404699     4175
  Dalvik Heap     3468     3112        4       12    11300     6236     3118     3118
 Dalvik Other     2554     2344        0        2     3840
        Stack      952      952        0        0      960
       Ashmem       25        0        0        0      536
    Other dev       14        0       12        0      292
     .so mmap     5085      224      256       80    59816
    .jar mmap     2312        0       64        0    29456
    .apk mmap    14613      628    13592        0    17172
    .ttf mmap       83        0        0        0      568
    .dex mmap     9867     9856        0        0    10312
    .oat mmap      180        0        0        0     3564
    .art mmap     6579     6052        0       44    18676
   Other mmap       60        8        0        0     1172
      Unknown      559      516        0        4      984
        TOTAL   318874   295980    13928      191   432296   421092   407817     7293

 App Summary
                       Pss(KB)                        Rss(KB)
                        ------                         ------
           Java Heap:     9164                          29976
         Native Heap:   272288                         273648
                Code:    24640                         121328
               Stack:      952                            960
            Graphics:        0                              0
       Private Other:     2864
              System:     8966
             Unknown:                                    6384

           TOTAL PSS:   318874            TOTAL RSS:   432296       TOTAL SWAP PSS:      191

 Objects
               Views:       24         ViewRootImpl:        1
         AppContexts:        6           Activities:        1
              Assets:        2        AssetManagers:        0
       Local Binders:       11        Proxy Binders:       32
       Parcel memory:        3         Parcel count:       12
    Death Recipients:        0      OpenSSL Sockets:        0
            WebViews:        0

 SQL
         MEMORY_USED:        0
  PAGECACHE_OVERFLOW:        0          MALLOC_SIZE:        0

I still can't understand why tiny uses just ~6 MB of Java heap but ~270 MB of native memory, whereas base uses ~140 MB of Java heap but only ~25 MB of native memory. The tiny model is 70 MB, and I still have 500 MB of free RAM available even with the app using ~300 MB of RAM. Something doesn't seem right to me.
Some help here would be amazing and appreciated.
Thanks.

@skottmckay (Contributor)

I would suggest stepping through it in a debugger to find where it runs out of memory. Is it loading the model bytes in Java here?

https://github.com/microsoft/onnxruntime-inference-examples/blob/33c9f0fab885c0ebbb220e3ef04b3dc4dd49402c/mobile/examples/whisper/local/android/app/src/main/java/ai/onnxruntime/example/whisperLocal/MainActivity.kt#L24

Or when attempting to create the inference session?

https://github.com/microsoft/onnxruntime-inference-examples/blob/33c9f0fab885c0ebbb220e3ef04b3dc4dd49402c/mobile/examples/whisper/local/android/app/src/main/java/ai/onnxruntime/example/whisperLocal/SpeechRecognizer.kt#L18

From the look of it, the model bytes are loaded in Java first; then an InferenceSession is created, which allocates native memory for the model (it parses the input bytes to load the model) and whatever else is required to execute it; and I assume after that the model bytes in Java are freed when they go out of scope.

If you were only able to load the model bytes in Java and the inference session creation failed, it would make sense that you see 140 MB on the Java heap and not much native memory, since the Java-allocated model bytes haven't been freed yet.
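To make those two steps concrete, the load path looks roughly like this (a minimal sketch, not the verbatim example code; the resource name is assumed from this issue). Note also that the "growth limit 268435456" in the OOM log is the per-app Dalvik heap cap (256 MB here), so a single ~140 MB byte[] cannot fit next to the ~137 MB already allocated, regardless of how much device RAM is free.

```kotlin
import ai.onnxruntime.OrtEnvironment

// Sketch of the example's load path (simplified; see the linked Kotlin files).
// Step 1: read the whole model into a byte[] -- one contiguous ~140 MB
// allocation on the Java/Dalvik heap for the base model.
val modelBytes = resources.openRawResource(R.raw.whisper_base_cpu_int8_model).readBytes()

// Step 2: create the session -- ORT parses the bytes and builds its own
// representation in native memory, so both copies exist at least briefly.
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes)
```

Breakpoints on each of the two steps (in MainActivity.kt and SpeechRecognizer.kt respectively) should show which one throws.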

@suyash-narain (Author)

It seems to me that it's failing at the "load model bytes in Java" step itself, and even when the garbage collector tries to free up space, that much heap is simply not available.
This is interesting because I have the same app using a TFLite model instead of ONNX, which is much bigger (~300 MB), but that doesn't fail and I can easily get outputs. With ONNX, a 140 MB file fails to load.

Is it because it loads the model bytes directly instead of memory-mapping the model?

@Craigacp (Contributor)

I'm not sure what the behaviour of the JNI interface methods is on Android, but they may copy the bytes from Java into JNI before handing them to the ORT session constructor (that's allowable under the spec, and it's not clear what OpenJDK does), so there may temporarily be three copies of the model: one in Java, one in JNI, and one in the ORT session constructor. TFLite may pass through a reference and thus only ever have a single copy.

@suyash-narain (Author)

Is there a way to have just one copy at any given point in time, so that it doesn't consume that much heap memory? What optimizations can I perform so it executes within the limited memory I have (~2 GB)? The final aim is to deploy this onto an edge device, so any suggested optimizations would be really helpful.

@Craigacp (Contributor)

If there's enough of a filesystem, you can ask ORT to load the model from a file path. I'm not familiar enough with Android to know if that works.

Otherwise we'd need code changes in ORT's Java layer to expose a byte-buffer endpoint that allows the model to be passed through to ORT without a copy on the JNI side. Using GetPrimitiveArrayCritical would avoid a JNI copy too, but it's not supposed to be used for longer-running operations (session construction can take some time), and I'm not sure it's supported on Android.

@suyash-narain (Author)

You mean asking ORT to load from a file path (passed as a string) instead of a byte array in createSession?

@Craigacp (Contributor)

Yes; ORT will do the load itself in that case, so Java never has a copy of the model.
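A minimal sketch of that approach, assuming the model still ships as an Android raw resource (raw resources have no filesystem path of their own, so one option is to stream the model into app-private storage once and hand ORT the resulting path; the file name below is arbitrary):

```kotlin
import ai.onnxruntime.OrtEnvironment
import java.io.File

// Stream the raw resource to app-private storage once; streaming avoids ever
// holding the whole model in memory during the copy.
val modelFile = File(filesDir, "whisper_base_cpu_int8.onnx")
if (!modelFile.exists()) {
    resources.openRawResource(R.raw.whisper_base_cpu_int8_model).use { input ->
        modelFile.outputStream().use { output -> input.copyTo(output) }
    }
}

// Passing a path string lets ORT read the file in native code, so no copy of
// the model bytes ever lands on the Java heap.
val session = OrtEnvironment.getEnvironment().createSession(modelFile.absolutePath)
```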

@suyash-narain (Author) commented Feb 14, 2024

I am facing an interesting error here.
My ONNX model is int8, generated using Olive, and the only change I made is to load the model from my assets folder directly via a file path instead of a byte array. I kept the audio file loading as-is, where the tensor is created from a byte array.
But now I get an error:

'unexpected input data type. Actual: (tensor(int8)), expected: (tensor(uint8))'

Do I need to change the way the audio file is passed to ORT as an input?
As far as I know, onnxruntime doesn't have anything specific for uint8, so I'm not sure why it's expecting uint8 inputs.
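(As an aside, ORT's Java API can represent uint8 tensors via OnnxJavaType.UINT8. If a model genuinely expects uint8 audio bytes, a tensor can be built from a ByteBuffer; a rough sketch follows, where the input name "audio_stream" and the audio source are assumptions, not taken from the example. As the resolution below shows, the actual root cause here was the exported model itself.)

```kotlin
import ai.onnxruntime.OnnxJavaType
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import java.io.File
import java.nio.ByteBuffer

// Hypothetical sketch: wrap raw audio bytes as a uint8 tensor of shape [1, N].
val env = OrtEnvironment.getEnvironment()
val audioBytes = File(filesDir, "audio.wav").readBytes() // placeholder audio source
val shape = longArrayOf(1, audioBytes.size.toLong())
OnnxTensor.createTensor(env, ByteBuffer.wrap(audioBytes), shape, OnnxJavaType.UINT8).use { tensor ->
    // "audio_stream" is an assumed input name; check session.inputNames for the real one.
    session.run(mapOf("audio_stream" to tensor)).use { results ->
        // read outputs from `results` here
    }
}
```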

@Craigacp (Contributor)

I looked through the example code and it appears to make float tensors, so I'm a bit confused as to how this worked in the first place if the model expects a uint8 input.

@suyash-narain (Author)

My bad, I was using a model that had been generated with the --no_audio_decoder option disabled (i.e., with the audio decoder included). I was able to make it work after following your suggestion.
Thank you very much for your assistance. I'll close this issue now.
