
Check what to do to bring in LLM models #940

Open
freedomtan opened this issue Dec 24, 2024 · 4 comments
Comments

@freedomtan
Contributor

What can we do for the default backend?

@farook-edev

farook-edev commented Jan 13, 2025

I've looked into building each runtime for Android and here's what I found out:

TFLite

Google's warning about AI-Edge-Torch being experimental is quite the understatement. After much struggling, I ended up using the nightly versions of ai-edge-torch and ai-edge-quantizer that were released the same day as the last edit to the llama example; that was the only way to get the thing to actually function.
Once conversion started, it used up all 11 GB of memory my GPU had and promptly failed. Trying to get it to use the CPU instead ended with a C++ protobuf error.
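For reference, the generic ai-edge-torch conversion path is only a few lines; here's a minimal sketch with a toy module standing in for the real model (the llama example actually goes through ai-edge-torch's generative re-authoring API, which is where the trouble was):

```python
import torch
import ai_edge_torch  # the nightly build was what eventually worked

# Toy stand-in for the real model; the llama example instead uses
# ai-edge-torch's own re-authored transformer classes.
class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x)

model = Toy().eval()
sample_inputs = (torch.randn(1, 16),)  # tracing inputs; shapes are fixed

# Trace the module and export a TFLite flatbuffer.
edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("toy.tflite")
```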
So I gave up, downloaded pre-converted models from HuggingFace, and attempted to use them with MediaPipe. The Android app compiled but gave a PERMISSION_DENIED error when attempting to load the model.

Edit

I managed to get a model loaded by using the /data/ directory instead of /sdcard/; it seems Android restricts apps' access to shared storage. Loading worked, but the model did not respond to my prompt properly, instead giving Python examples...

ExecuTorch

ExecuTorch seemed to have the most support and the most robust examples behind it. Unfortunately, I couldn't even install the library's dependencies because of a CMake error regarding gflags. I want to look into it further to see if I can set it up locally, since it's quite promising.

Edit

I was able to get ExecuTorch working. The CMake error was resolved by removing my local gflags package; it seems it was conflicting with ExecuTorch's gflags in the venv.

ExecuTorch has quite a few steps to get the .aar library compiled for Android, which caused me trouble when using an NDK version lower than 25. Their main guide for the llama example does mention that only NDK version 27 is supported, but this isn't mentioned in the Android demo app guide. I was also able to compile the library using NDK version 28.

Getting the models wasn't very straightforward either, since they needed to be converted from .pth to .pte. Thankfully, ExecuTorch's converter worked without issue (unlike AI-Edge-Torch); the basic flow is sketched below.
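The llama conversion itself goes through ExecuTorch's export script; underneath, the generic .pth-to-.pte flow from the ExecuTorch docs looks roughly like this (a minimal sketch with a toy module in place of the real checkpoint):

```python
import torch
from executorch.exir import to_edge

# Toy module standing in for the model loaded from the .pth checkpoint.
class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

model = Toy().eval()
example_inputs = (torch.randn(1, 8),)

exported = torch.export.export(model, example_inputs)  # PyTorch 2 export
et_program = to_edge(exported).to_executorch()

with open("toy.pte", "wb") as f:  # the .pte file the Android app loads
    f.write(et_program.buffer)
```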

Running the app required upgrading Gradle to 8.5, which Android Studio thankfully made quite straightforward. Once that was done, the app compiled and ran without issue.

I attempted to use the unquantized model first. Upon pressing load in the Android app, my Galaxy S8 completely froze for 5 minutes until I restarted it, possibly because it ran out of RAM. Loading the quantized model did not cause any issues, and the model worked flawlessly on the device.

This was done using the XNNPACK backend; ExecuTorch also supports the Qualcomm and MediaTek AI engines, which I did not test.
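For completeness, delegating to XNNPACK happens at lowering time. A hedged sketch based on the XNNPACK backend docs (import paths may have shifted between ExecuTorch releases; the Qualcomm and MediaTek partitioners should slot into the same place):

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

# Partition supported subgraphs out to the XNNPACK delegate while lowering.
et_program = to_edge_transform_and_lower(
    torch.export.export(Toy().eval(), (torch.randn(1, 8),)),
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("toy_xnnpack.pte", "wb") as f:
    f.write(et_program.buffer)
```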

llama.cpp

Working with llama.cpp has been the most straightforward. The app compiled and launched without issue, a couple of line changes allowed me to download whichever custom GGUF model I wanted, and the models actually functioned (they were, however, extremely slow on my Galaxy S8). There was an issue where models I copied onto the phone manually ended up unreadable, but that was fixed by having the phone download the models itself instead of transferring them from my PC.
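As a desktop sanity check for the same engine, the llama-cpp-python bindings can run one of those pre-converted GGUF files in a few lines (a sketch; the model filename is a placeholder for whichever GGUF you pull from HuggingFace, and the Android demo app itself drives the C++ library through JNI instead):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Any GGUF file works here, e.g. a pre-converted one from HuggingFace.
llm = Llama(model_path="models/model.Q4_K_M.gguf", n_ctx=2048)

out = llm("Q: Name the planets in the solar system. A:",
          max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```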

Conclusion

My priority was TFLite because it's what we use, but it seems that both MediaPipe and AI-Edge-Torch are still in their infancy and could cause problems when we attempt to integrate them into the mobile app.
It's a similar story with ExecuTorch, but in a different direction: I like the documentation and support, though it is quite complicated to get running, especially if you stray from the guides.
That leaves us with llama.cpp, for which models seem to be readily available. While I haven't attempted to convert my own (because HuggingFace already hosts GGUF versions of the models), having the models actually work without me digging through files upon files of code was a welcome change.

I'll look further into installing ExecuTorch for now, but until I can get it running, llama.cpp is the only runtime I could get to work without a struggle. That said, ExecuTorch's converter and Android app (when they didn't crash the entire machine) were quite impressive.
It's worth mentioning that while ExecuTorch is built around PyTorch and its models, llama.cpp supports quite a few LLMs other than Llama.

@freedomtan
Contributor Author

freedomtan commented Jan 14, 2025

For model conversion with AI-Edge-Torch, as I noted at google-ai-edge/ai-edge-torch#269 (comment):

  • you don't need a GPU (see the sketch after this list for one way to keep conversion off it);
  • you need a lot of DRAM, which is supposedly a bug / feature of the current AI-Edge-Torch implementation (I don't remember the exact amount needed, but 64 GiB is supposedly enough);
  • I used Colab to do it.
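A hedged sketch of that first point: CUDA_VISIBLE_DEVICES is a standard CUDA environment variable (nothing AI-Edge-Torch-specific), and hiding the GPU before any imports keeps the whole conversion on the CPU and host DRAM:

```python
import os

# Hide all CUDA devices before torch / ai_edge_torch are imported,
# so tracing and conversion run entirely on the CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import torch          # noqa: E402 — imported after the env var is set
import ai_edge_torch  # noqa: E402
```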

@farook-edev

I tried getting AI-Edge-Torch to work again, this time inside a Docker container. The code seemed to run without issue until the process was killed, seemingly due to lack of RAM (my machine only has 32 GB). But with this setup a local CPU can be used to convert the models; it's just extremely finicky and prone to breaking.

@freedomtan
Contributor Author

freedomtan commented Jan 21, 2025
