llama.cpp is an open-source library, written primarily in C/C++, that performs inference on various Large Language Models (LLMs), such as Llama. Python bindings are also available for the library; they offer a high-level API for text completion and an OpenAI-compatible web server.
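As a quick illustration of those bindings, a minimal sketch of serving a GGUF model over an OpenAI-compatible endpoint (the model path is a placeholder, and this assumes llama-cpp-python is installed with its server extra):

```bash
# Install the Python bindings together with the optional server dependencies
pip install "llama-cpp-python[server]" -U

# Serve a local GGUF model over OpenAI-style HTTP endpoints (default: http://localhost:8000)
python -m llama_cpp.server --model ./path/to/your-model.gguf
```

Any OpenAI-style client can then be pointed at the local endpoint.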
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud. Key features include:
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2 and AVX512 support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
The Phi-3.5-Instruct model can be quantized with llama.cpp, but Phi-3.5-Vision and Phi-3.5-MoE are not supported yet. llama.cpp converts models to the GGUF format, which is also the most widely used quantization format.
A large number of quantized models in GGUF format are available on Hugging Face. AI Foundry, Ollama, and LlamaEdge all rely on llama.cpp, so GGUF models are widely used there as well.
GGUF is a binary format optimized for fast loading and saving of models, which makes it highly efficient for inference. It is designed for use with GGML and other executors. GGUF was developed by @ggerganov, who is also the developer of llama.cpp, a popular C/C++ LLM inference framework. Models initially developed in frameworks such as PyTorch can be converted to GGUF for use with these engines.
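Because a GGUF file carries both the weights and the model's metadata, it can be inspected without loading the model. A minimal sketch using the gguf Python package published from the llama.cpp repository (pip install gguf; attribute names may differ slightly between versions, and the file path is a placeholder):

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("your-model.gguf")  # placeholder path

# Metadata keys stored in the file header (architecture, context length, tokenizer, ...)
for key in reader.fields:
    print(key)

# First few tensors: name, shape, and the quantization type they are stored in
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type.name)
```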
ONNX is a traditional machine-learning/deep-learning format that is well supported across different AI frameworks and works well on edge devices. GGUF, by contrast, grew out of llama.cpp and can be considered a product of the GenAI era. The two serve similar purposes: if you want better performance on embedded hardware and in application layers, ONNX may be your choice; if you use llama.cpp and the frameworks derived from it, GGUF may be the better fit.
1. Environment Configuration
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j8
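Note that recent llama.cpp releases have moved from the Makefile to a CMake-based build; if `make` does not produce the `llama-*` binaries on your checkout, the following should work instead:

```bash
cmake -B build
cmake --build build --config Release -j 8
```

If you build this way, the binaries used below (llama-quantize, llama-cli) end up under build/bin/, so adjust the paths accordingly.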
2. Quantization
Use llama.cpp to convert Phi-3.5-Instruct to FP16 GGUF
./convert_hf_to_gguf.py <Your Phi-3.5-Instruct Location> --outfile phi-3.5-128k-mini_fp16.gguf
Quantizing Phi-3.5 to INT4
./llama.cpp/llama-quantize <Your phi-3.5-128k-mini_fp16.gguf location> ./gguf/phi-3.5-128k-mini_Q4_K_M.gguf Q4_K_M
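Q4_K_M is just one of several quantization presets; running llama-quantize without arguments should print its usage text, including the list of supported types, and comparing file sizes is a quick sanity check that the conversion worked:

```bash
# Print usage, including the supported quantization types (e.g. Q4_K_M, Q5_K_M, Q8_0, ...)
./llama.cpp/llama-quantize

# The Q4_K_M file should be a fraction of the FP16 file's size
# (roughly 4-5 bits per weight instead of 16)
ls -lh <Your phi-3.5-128k-mini_fp16.gguf location> ./gguf/phi-3.5-128k-mini_Q4_K_M.gguf
```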
3. Testing
Install llama-cpp-python
pip install llama-cpp-python -U
Note
If you use Apple Silicon, please install llama-cpp-python as follows
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python -U
Testing
llama.cpp/llama-cli --model <Your phi-3.5-128k-mini_Q4_K_M.gguf location> --prompt "<|user|>\nCan you introduce .NET<|end|>\n<|assistant|>\n" --gpu-layers 10
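Since llama-cpp-python is already installed, the same quantized file can also be loaded directly from Python. A minimal sketch (the path and parameter values are placeholders to adjust for your setup):

```python
from llama_cpp import Llama

# Load the quantized GGUF; n_gpu_layers offloads part of the model to the GPU
# (0 = CPU only), mirroring --gpu-layers in the llama-cli call above.
llm = Llama(
    model_path="./gguf/phi-3.5-128k-mini_Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,       # context window for this session
    n_gpu_layers=10,
)

# Recent llama-cpp-python versions pick up the chat template embedded in the GGUF,
# so the high-level chat API can be used instead of hand-writing <|user|> tags.
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Can you introduce .NET"}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
```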
- Learn more about llama.cpp: https://github.com/ggerganov/llama.cpp
- Learn more about GGUF: https://huggingface.co/docs/hub/en/gguf