
Check what to do to bring in LLM models #940

Open
freedomtan opened this issue Dec 24, 2024 · 4 comments
Comments

@freedomtan
Contributor

What can we do for the default backend?

@farook-edev

farook-edev commented Jan 13, 2025

I've looked into building each runtime for Android and here's what I found out:

TFLite

Google's warning about AI-Edge-Torch being experimental is quite the understatement. After much struggling, I ended up using the nightly versions of ai-edge-torch and ai-edge-quantizer that were released the same day as the last edit to the llama example; that was the only way to get the thing to actually function.
Once conversion started, it used up all 11 GB of memory my GPU had and promptly failed. Trying to get it to use the CPU instead ended with a C++ protobuf error.
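For reference, the generic ai-edge-torch conversion path is only a few lines; here's a minimal sketch with a toy module standing in for the real model (the llama example actually goes through ai-edge-torch's generative re-authoring API, which is where the trouble was):

```python
import torch
import ai_edge_torch  # the nightly build was what eventually worked

# Toy stand-in for the real model; the llama example instead uses
# ai-edge-torch's own re-authored transformer classes.
class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x)

model = Toy().eval()
sample_inputs = (torch.randn(1, 16),)  # tracing inputs; shapes are fixed

# Trace the module and export a TFLite flatbuffer.
edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("toy.tflite")
```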
So I gave up, downloaded pre-converted models from HuggingFace, and attempted to use them with MediaPipe. The Android app compiled but gave a PERMISSION_DENIED error when attempting to load the model.

Edit

I managed to get a model loaded by using the /data/ directory instead of /sdcard/; it seems Android restricts apps' access to shared storage. Loading worked, but the model did not respond to my prompt properly, instead giving Python examples...

ExecuTorch

ExecuTorch seemed to have the most support and the most robust examples behind it. Unfortunately, I couldn't even install the library's dependencies because of a CMake error regarding gflags. I want to look into it further to see if I can set it up locally, since it's quite promising.

Edit

I was able to get ExecuTorch working. The CMake error was resolved by removing my local gflags package; it seems it was conflicting with ExecuTorch's gflags in the venv.

ExecuTorch has quite a few steps to get the .aar library compiled for Android, which caused me trouble when using an NDK version lower than 25. Their main guide for the llama example does mention that only NDK version 27 is supported, but this isn't mentioned in the Android demo app guide. I was also able to compile the library using NDK version 28.

Getting the models wasn't very straightforward either, since they needed to be converted from .pth to .pte. Thankfully, ExecuTorch's converter worked without issue (unlike AI-Edge-Torch); the basic flow is sketched below.
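The llama conversion itself goes through ExecuTorch's export script; underneath, the generic .pth-to-.pte flow from the ExecuTorch docs looks roughly like this (a minimal sketch with a toy module in place of the real checkpoint):

```python
import torch
from executorch.exir import to_edge

# Toy module standing in for the model loaded from the .pth checkpoint.
class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

model = Toy().eval()
example_inputs = (torch.randn(1, 8),)

exported = torch.export.export(model, example_inputs)  # PyTorch 2 export
et_program = to_edge(exported).to_executorch()

with open("toy.pte", "wb") as f:  # the .pte file the Android app loads
    f.write(et_program.buffer)
```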

Running the app required upgrading Gradle to 8.5, which Android Studio thankfully made quite straightforward. Once that was done, the app compiled and ran without issue.

I attempted to use the unquantized model first. Upon pressing load in the Android app, my Galaxy S8 completely froze for 5 minutes until I restarted it, possibly because it ran out of RAM. Loading the quantized model did not cause any issues, and the model worked flawlessly on the device.

This was done using the XNNPACK backend; ExecuTorch also supports the Qualcomm and MediaTek AI engines, which I did not test.
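For completeness, delegating to XNNPACK happens at lowering time. A hedged sketch based on the XNNPACK backend docs (import paths may have shifted between ExecuTorch releases; the Qualcomm and MediaTek partitioners should slot into the same place):

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

# Partition supported subgraphs out to the XNNPACK delegate while lowering.
et_program = to_edge_transform_and_lower(
    torch.export.export(Toy().eval(), (torch.randn(1, 8),)),
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("toy_xnnpack.pte", "wb") as f:
    f.write(et_program.buffer)
```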

llama.cpp

Working with llama.cpp has been the most straightforward. The app compiled and launched without issue, a couple of line changes allowed me to download whichever custom GGUF model I wanted, and the models actually functioned (they were, however, extremely slow on my Galaxy S8). There was an issue where models I copied onto the phone manually ended up unreadable, but that was fixed by having the phone download the models itself instead of transferring them from my PC.
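As a desktop sanity check for the same engine, the llama-cpp-python bindings can run one of those pre-converted GGUF files in a few lines (a sketch; the model filename is a placeholder for whichever GGUF you pull from HuggingFace, and the Android demo app itself drives the C++ library through JNI instead):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Any GGUF file works here, e.g. a pre-converted one from HuggingFace.
llm = Llama(model_path="models/model.Q4_K_M.gguf", n_ctx=2048)

out = llm("Q: Name the planets in the solar system. A:",
          max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```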

Conclusion

My priority was TFLite because it's what we use, but it seems that both MediaPipe and AI-Edge-Torch are still in their infancy and could cause problems when we attempt to integrate them into the mobile app.
It's a similar story with ExecuTorch, but in a different direction: I like the documentation and support, though it is quite complicated to get running, especially if you stray from the guides.
That leaves us with llama.cpp, for which models seem to be readily available. While I haven't attempted to convert my own (because HuggingFace already hosts GGUF versions of the models), having the models actually work without me digging through files upon files of code was a welcome change.

I'll look further into installing ExecuTorch for now, but until I can get it running, llama.cpp is the only runtime I could get to work without a struggle. That said, ExecuTorch's converter and Android app (when they didn't crash the entire machine) were quite impressive.
It's worth mentioning that while ExecuTorch is built around PyTorch and its models, llama.cpp supports quite a few LLMs other than Llama.

@freedomtan
Contributor Author

freedomtan commented Jan 14, 2025

For model conversion with AI-Edge-Torch, as I noted at google-ai-edge/ai-edge-torch#269 (comment):

  • you don't need a GPU (see the sketch after this list for one way to keep conversion off it);
  • you need a lot of DRAM, which is supposedly a bug / feature of the current AI-Edge-Torch implementation (I don't remember the exact amount needed, but 64 GiB is supposedly enough);
  • I used Colab to do it.
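A hedged sketch of that first point: CUDA_VISIBLE_DEVICES is a standard CUDA environment variable (nothing AI-Edge-Torch-specific), and hiding the GPU before any imports keeps the whole conversion on the CPU and host DRAM:

```python
import os

# Hide all CUDA devices before torch / ai_edge_torch are imported,
# so tracing and conversion run entirely on the CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import torch          # noqa: E402 — imported after the env var is set
import ai_edge_torch  # noqa: E402
```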

@farook-edev

I tried getting AI-Edge-Torch to work again, this time inside a Docker container. The code seemed to run without issue until the process was killed, seemingly due to lack of RAM (my machine only has 32 GB). But with this setup a local CPU can be used to convert the models; it's just extremely finicky and prone to breaking.

@freedomtan
Contributor Author

freedomtan commented Jan 21, 2025
