I wanted to see if it was possible to run a Large Language Model (LLM) on the ESP32. Surprisingly, it is possible, though probably not very useful.
The "Large" Language Model used is actually quite small. It is a 260K-parameter tinyllamas checkpoint trained on the TinyStories dataset.
The LLM implementation is done using llama2.c with minor optimizations to make it run faster on the ESP32.
LLMs require a great deal of memory; even this small one needs 1MB of RAM. I used the ESP32-S3FH4R2 because it has 2MB of embedded PSRAM.
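For reference, ESP-IDF can place a buffer in external PSRAM explicitly via `heap_caps_malloc` with the `MALLOC_CAP_SPIRAM` capability flag (a standard ESP-IDF API). The helper below is a hypothetical sketch of the idea, not code from this repo:

```c
#include <stddef.h>
#include "esp_heap_caps.h"

// The ~1MB of model state won't fit in internal SRAM, so it goes to
// external PSRAM instead. alloc_in_psram() is a hypothetical helper;
// the size is whatever the checkpoint actually needs.
static float *alloc_in_psram(size_t n_bytes) {
    float *buf = heap_caps_malloc(n_bytes, MALLOC_CAP_SPIRAM);
    // NULL here usually means PSRAM is disabled or misconfigured
    // in menuconfig.
    return buf;
}
```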
With the following changes to llama2.c, I am able to achieve 19.13 tok/s:
- Utilizing both cores of the ESP32 during math-heavy operations (see the sketch after this list).
- Using special dot-product functions from the ESP-DSP library that are designed for the ESP32-S3. These take advantage of the handful of SIMD instructions the ESP32-S3 has.
- Maxing out the CPU speed at 240 MHz and the PSRAM speed at 80 MHz, and increasing the instruction cache size (see the sdkconfig snippet below).
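As a rough illustration of how the first two changes could fit together in llama2.c's `matmul` (where most of the inference time is spent): split the output rows between the two cores, and let the ESP-DSP routine `dsps_dotprod_f32` handle each row's dot product. `dsps_dotprod_f32` is a real ESP-DSP function (it dispatches to S3-optimized assembly when built for that target), but the task structure below is my sketch, not the project's actual code:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/semphr.h"
#include "dsps_dot_prod.h"  // ESP-DSP dot-product routines

typedef struct {
    const float *w, *x;  // W is (d, n) row-major; x has length n
    float *out;          // out = W @ x, length d
    int n, row_start, row_end;
    SemaphoreHandle_t done;
} MatmulArgs;

// Worker task: compute a contiguous block of output rows, one
// SIMD-accelerated dot product per row, then signal completion.
static void matmul_rows(void *p) {
    MatmulArgs *a = (MatmulArgs *)p;
    for (int i = a->row_start; i < a->row_end; i++) {
        dsps_dotprod_f32(a->w + (size_t)i * a->n, a->x, &a->out[i], a->n);
    }
    xSemaphoreGive(a->done);
    vTaskDelete(NULL);
}

// out = W @ x with the rows of W split across both cores: the upper
// half runs in a task pinned to core 1, the lower half runs here.
void matmul(float *out, const float *x, const float *w, int n, int d) {
    MatmulArgs upper = { w, x, out, n, d / 2, d, xSemaphoreCreateBinary() };
    xTaskCreatePinnedToCore(matmul_rows, "matmul1", 4096, &upper, 5, NULL, 1);
    for (int i = 0; i < d / 2; i++) {
        dsps_dotprod_f32(w + (size_t)i * n, x, &out[i], n);
    }
    xSemaphoreTake(upper.done, portMAX_DELAY);  // wait for core 1
    vSemaphoreDelete(upper.done);
}
```

Spawning a task per call is shown for brevity; a real implementation would keep a persistent worker on core 1 to avoid task-creation overhead on every matmul.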
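The clock and cache settings live in `sdkconfig`. The option names below are my best guess at the relevant ones for an ESP32-S3 target; exact names vary between IDF versions, so verify them in `menuconfig`:

```
# Run the CPU at its maximum speed
CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y
# Clock the PSRAM at 80 MHz
CONFIG_SPIRAM_SPEED_80M=y
# Use the larger instruction cache
CONFIG_ESP32S3_INSTRUCTION_CACHE_32KB=y
```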
This requires the ESP-IDF toolchain to be installed:

```sh
idf.py build
idf.py -p /dev/{DEVICE_PORT} flash
```
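Once flashed, the model's output can be watched over serial with the standard IDF monitor:

```sh
idf.py -p /dev/{DEVICE_PORT} monitor
```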