[Mobile] How to run model inference with ARM GPU on Android device? #18224
Labels
platform:mobile, quantization, stale
Describe the issue
We are currently developing a system that involves deploying Large Language Models (LLMs) on Android smartphones. To date, we've managed to execute inference tasks using ONNX Runtime with the CPU Execution Provider, but the process is regrettably slow. Our goal is to leverage the built-in hardware accelerators, such as the GPU, to expedite the inference process. The specific GPU integrated into our Android devices is the ARM Mali-G710 MP7.
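For context, here is a minimal sketch of our current CPU-only setup; the model path and thread count are placeholders rather than our exact configuration:

```cpp
#include <onnxruntime_cxx_api.h>

// Minimal sketch of the current setup: a plain session on the default CPU EP.
// "model.onnx" and the intra-op thread count are placeholders.
int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "llm-on-android");

  Ort::SessionOptions session_options;
  session_options.SetIntraOpNumThreads(4);
  session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

  // With no execution provider appended, ONNX Runtime falls back to the default CPU EP.
  Ort::Session session(env, "model.onnx", session_options);
  return 0;
}
```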
In an attempt to use the device's GPU, we've experimented with ONNX Runtime in conjunction with NNAPI. Unfortunately, NNAPI appears to default to dispatching inference to the edgetpu device, which ONNX Runtime does not currently support. We have also checked our model's compatibility with ORT Mobile, NNAPI, and CoreML using the onnxruntime.tools.check_onnx_model_mobile_usability tool.
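For reference, this is roughly how we append the NNAPI EP via the C/C++ API; the flag combination shown is just one of the variants we have been experimenting with, not a recommendation:

```cpp
#include <cstdint>
#include <onnxruntime_cxx_api.h>
#include <nnapi_provider_factory.h>

// Append the NNAPI EP to the session options before creating the session.
void EnableNnapi(Ort::SessionOptions& session_options) {
  uint32_t nnapi_flags = 0;
  nnapi_flags |= NNAPI_FLAG_USE_FP16;        // allow fp16 relaxation on the accelerator
  // nnapi_flags |= NNAPI_FLAG_CPU_DISABLED; // optionally exclude the nnapi-reference CPU device
  Ort::ThrowOnError(
      OrtSessionOptionsAppendExecutionProvider_Nnapi(session_options, nnapi_flags));
}
```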
Below is the log corresponding to our attempts:
Based on the log, it's apparent that our current model is a poor fit for hardware acceleration through NNAPI or CoreML on Android devices. Despite our efforts to validate the model and partition the operations, a substantial number of nodes remain unsupported, and the model is split into numerous partitions, which hurts performance. The presence of dynamic shapes and unsupported operators such as 'ConstantOfShape', 'CumSum', and 'MatMulInteger' complicates hardware acceleration further.
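One mitigation we are evaluating for the dynamic-shape issue is pinning the free dimensions at session-creation time (the offline onnxruntime.tools.make_dynamic_shape_fixed helper appears to be the equivalent ahead-of-time route). The dimension names below are hypothetical and would need to match the symbolic dimensions in our exported model:

```cpp
#include <onnxruntime_cxx_api.h>

// Pin the model's symbolic dimensions to fixed values so EPs that require
// static shapes (e.g. NNAPI) can take more nodes. "batch_size" and
// "sequence_length" are hypothetical names; they must match the model's dim params.
void PinFreeDimensions(Ort::SessionOptions& session_options) {
  session_options.AddFreeDimensionOverrideByName("batch_size", 1);
  session_options.AddFreeDimensionOverrideByName("sequence_length", 128);
}
```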
To proceed, we are considering the following steps and would appreciate any guidance or suggestions:
Model Optimization: We plan to revisit our model architecture and optimization strategies. Our goal is to minimize unsupported operations and dynamic shapes, as well as reduce the number of partitions when using NNAPI or CoreML. We would greatly benefit from any tips or best practices in optimizing models for these execution providers.
Alternative Execution Providers: Given the limitations we've encountered with NNAPI and CoreML, we are open to exploring other execution providers or acceleration frameworks that might be more compatible with our model and the ARM Mali-G710 MP7 GPU. If there are known providers or frameworks that have shown success with similar setups, we'd be keen to explore those (one candidate we are looking at is sketched after this list).
Custom Operators: For the operators that are not supported out of the box by NNAPI or CoreML, is it feasible and advisable to implement custom operators? We understand this could be a complex endeavor, but it might be a necessary step to achieve the performance we desire (a registration sketch is included after this list).
Direct GPU Inference: Is there a pathway to bypass high-level frameworks and use the GPU more directly for inference? We realize this might involve significant low-level programming and optimization, but if there are established approaches or libraries that can help with this, we would be interested in learning more.
Model Partitioning Strategy: The current partitioning does not seem to be beneficial. Would manual partitioning or a different strategy for partitioning the model be more effective in optimizing performance with hardware acceleration?
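Regarding point 2, one candidate we have found is the XNNPACK EP. A minimal sketch of how it seems to be enabled is below; as far as we can tell it targets the Arm CPU (NEON) rather than the Mali GPU and requires an ONNX Runtime build with XNNPACK enabled, so it is a fallback idea rather than a GPU path, and the thread count is a placeholder:

```cpp
#include <string>
#include <unordered_map>
#include <onnxruntime_cxx_api.h>

// Append the XNNPACK EP via the generic provider-options interface.
void EnableXnnpack(Ort::SessionOptions& session_options) {
  std::unordered_map<std::string, std::string> xnnpack_options{
      {"intra_op_num_threads", "4"}};
  session_options.AppendExecutionProvider("XNNPACK", xnnpack_options);
}
```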
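Regarding point 3, to make the question concrete, here is a sketch of the custom-op registration mechanics as we understand them from the C++ API. The "MyIdentity" op, the "my.domain" domain, and the float-only pass-through kernel are purely hypothetical placeholders, and we realize a CPU custom op would not, by itself, move work onto NNAPI or the GPU:

```cpp
#include <algorithm>
#include <onnxruntime_cxx_api.h>

// Hypothetical "MyIdentity" op, used only to illustrate registration mechanics.
struct IdentityKernel {
  void Compute(OrtKernelContext* context) {
    Ort::KernelContext ctx(context);
    auto input = ctx.GetInput(0);
    auto shape_info = input.GetTensorTypeAndShapeInfo();
    auto shape = shape_info.GetShape();
    auto output = ctx.GetOutput(0, shape.data(), shape.size());
    const float* in = input.GetTensorData<float>();
    float* out = output.GetTensorMutableData<float>();
    std::copy(in, in + shape_info.GetElementCount(), out);  // pass-through
  }
};

struct IdentityOp : Ort::CustomOpBase<IdentityOp, IdentityKernel> {
  void* CreateKernel(const OrtApi& /*api*/, const OrtKernelInfo* /*info*/) const {
    return new IdentityKernel();
  }
  const char* GetName() const { return "MyIdentity"; }
  size_t GetInputTypeCount() const { return 1; }
  ONNXTensorElementDataType GetInputType(size_t /*index*/) const {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
  }
  size_t GetOutputTypeCount() const { return 1; }
  ONNXTensorElementDataType GetOutputType(size_t /*index*/) const {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
  }
};

// Register the custom op domain on the session options before creating the session.
void RegisterCustomOps(Ort::SessionOptions& session_options) {
  static IdentityOp identity_op;
  static Ort::CustomOpDomain domain("my.domain");
  domain.Add(&identity_op);
  session_options.Add(domain);
}
```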
Thanks!
To reproduce
N/A.
Urgency
Not very urgent.
Platform
Android
OS Version
13
ONNX Runtime Installation
Built from Source
Compiler Version (if 'Built from Source')
No response
Package Name (if 'Released Package')
onnxruntime-android
ONNX Runtime Version or Commit ID
1.15
ONNX Runtime API
C++/C
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response