Demonstration of running a native Large Language Model (LLM) on Android devices. Currently supported models include:
- Qwen2.5-Instruct: 0.5B, 1.5B
- Qwen2VL: 2B
- MiniCPM-DPO/SFT: 1B, 2.7B
- Gemma2-it: 2B
- Phi3.5-mini-instruct: 3.8B
- Llama-3.2-Instruct: 1B

- Download Models:
  - Demo models are available on Google Drive.
  - Alternatively, use Baidu Cloud with the extraction code: `dake`.
- Setup Instructions:
  - Place the downloaded model files into the `assets` folder.
  - Decompress the `*.so` files stored in the `libs/arm64-v8a` folder.
- Model Notes:
  - Demo models are converted from HuggingFace or ModelScope and optimized for extreme execution speed.
  - Inputs and outputs may differ slightly from the original models.
- ONNX Export Considerations:
  - Dynamic axes were not used during export, to better suit ONNX Runtime on Android; the exported ONNX models may therefore not be optimal for x86_64 systems (a fixed-shape export sketch follows this list).
  - The `tokenizer.cpp` and `tokenizer.hpp` files are sourced from the mnn-llm repository.
  - Navigate to the `Export_ONNX` folder.
  - Follow the comments in the Python scripts to set the folder paths.
  - Execute the `***_Export.py` script to export the model.
  - Quantize or optimize the ONNX model manually.
  - Use `onnxruntime.tools.convert_onnx_models_to_ort` to convert models to the `*.ort` format. Note that this process automatically adds `Cast` operators that change FP16 multiplication to FP32 (see the conversion sketch below).
  - The quantization methods are detailed in the `Do_Quantize` folder (see the quantization sketch below).
  - The `q4` (uint4) quantization method is not recommended, because the `MatMulNBits` operator performs poorly in ONNX Runtime.
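
As a reference for the fixed-shape export mentioned above, here is a minimal, self-contained sketch. `TinyDecoder`, the file name, and the 1024-token shape are illustrative stand-ins, not the contents of the repo's `***_Export.py` scripts.

```python
import torch

# Tiny stand-in decoder so the sketch runs end-to-end; the real export
# scripts trace full LLM graphs with KV caches and many more inputs.
class TinyDecoder(torch.nn.Module):
    def __init__(self, vocab=32000, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

model = TinyDecoder().eval()
input_ids = torch.zeros((1, 1024), dtype=torch.long)  # fixed 1024-token context

torch.onnx.export(
    model,
    (input_ids,),
    "llm_fixed_shape.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    opset_version=17,
    # dynamic_axes is deliberately omitted: every dimension is baked into
    # the graph, which suits mobile ONNX Runtime but pins the model to one
    # shape (and is why the export may be suboptimal on x86_64).
)
```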
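The `*.ort` conversion is a stock ONNX Runtime tool normally invoked from the command line; the snippet below simply wraps that documented invocation so the whole pipeline stays in Python. The input file name carries over from the sketch above.

```python
import subprocess
import sys

# Equivalent to: python -m onnxruntime.tools.convert_onnx_models_to_ort <model>
# As noted above, the converter may insert Cast operators that promote FP16
# multiplications to FP32, so inspect the resulting graph if FP16 matters.
subprocess.run(
    [
        sys.executable, "-m",
        "onnxruntime.tools.convert_onnx_models_to_ort",
        "llm_fixed_shape.onnx",
    ],
    check=True,
)
```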
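For the quantization step, the scripts in `Do_Quantize` are the authoritative recipe; purely as an illustration of what a q8f32 layout means (uint8 weights, FP32 activations), ONNX Runtime's stock dynamic quantizer can produce something comparable:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="llm_fixed_shape.onnx",
    model_output="llm_q8f32.onnx",
    weight_type=QuantType.QUInt8,  # uint8 weights; activations stay FP32
)
# uint4 (q4) weight quantization is deliberately skipped here: it lowers to
# the MatMulNBits operator, which performs poorly in ONNX Runtime (see above).
```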
- 2024/11/04: Added support for Qwen2VL-2B (Vision LLM).
- Explore more projects: DakeQQ Projects

| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Qwen2VL-2B q8f32 | 15 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Qwen2VL-2B q8f32 | 9 token/s |

| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Qwen2-1.5B-Instruct q8f32 | 20 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Qwen2-1.5B-Instruct q8f32 | 13 token/s |
| Harmony 3 | Honor 20S | Kirin_810-CPU (2*A76) | Qwen2-1.5B-Instruct q8f32 | 7 token/s |

| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | MiniCPM-2.7B q8f32 | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | MiniCPM-2.7B q8f32 | 6 token/s |
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | MiniCPM-1.3B q8f32 | 16.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | MiniCPM-1.3B q8f32 | 11 token/s |

| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Yuan2.0-2B-Mars-hf q8f32 | 12 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Yuan2.0-2B-Mars-hf q8f32 | 6.5 token/s |

| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Gemma1.1-it-2B q8f32 | 16 token/s |

| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | StableLM2-1.6B-Chat q8f32 | 17.8 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | StableLM2-1.6B-Chat q8f32 | 11 token/s |
| Harmony 3 | Honor 20S | Kirin_810-CPU (2*A76) | StableLM2-1.6B-Chat q8f32 | 5.5 token/s |

| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Phi2-2B-Orange-V2 q8f32 | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Phi2-2B-Orange-V2 q8f32 | 5.8 token/s |

| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Llama3.2-1B-Instruct q8f32 | 25 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Llama3.2-1B-Instruct q8f32 | 16 token/s |
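
For context on how such q8f32 models are driven at inference time, here is a minimal CPU-only session sketch in Python; the file name and thread count are illustrative, and the demo app itself runs ONNX Runtime through the native libraries in `libs/arm64-v8a` rather than Python.

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 2  # e.g. match the two A76 cores on a Kirin 990

session = ort.InferenceSession(
    "llm_q8f32.ort",  # *.ort file from the conversion step above
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
print([i.name for i in session.get_inputs()])  # inspect the expected inputs
```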