
2.2.2 Backend: llama.cpp


Handle: llamacpp
URL: http://localhost:33831

License: MIT

LLM inference in C/C++. Allows you to bypass the Ollama release cycle when needed, for example to get access to the latest models or features.

Starting

The llamacpp Docker image is quite large due to its dependencies on CUDA and other libraries. You might want to pull it ahead of time.

# [Optional] Pull the llamacpp
# images ahead of starting the service
harbor pull llamacpp

Start Harbor with the llamacpp service:

harbor up llamacpp
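
Once the container is up, you can verify that the server responds at the default URL above. A minimal sketch, assuming the default port 33831 and llama.cpp server's built-in /health and OpenAI-compatible /v1/models endpoints:

# Check that the server is up and healthy
curl http://localhost:33831/health

# List the model currently served
curl http://localhost:33831/v1/models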

Models

You can find GGUF models to run on the HuggingFace Hub. After you find a model you want to run, grab the URL from the browser address bar and pass it to the Harbor config.

# Quick lookup for the models
harbor hf find gguf

# 1. With llama.cpp own cache:
#
# - Set the model to run, will be downloaded when llamacpp starts
#   Accepts a full URL to the GGUF file (from Browser address bar)
harbor llamacpp model https://huggingface.co/user/repo/file.gguf

# 2. Shared HuggingFace Hub cache, single file:
#
# - Locate the GGUF to download, for example:
#   https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/blob/main/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Download a single file: <user/repo> <file.gguf>
harbor hf download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Locate the file in the cache
harbor find Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Point llama.cpp to the GGUF file
#   "/app/models/hub" is where the HuggingFace cache is mounted in the container
#   "llamacpp.model.specifier" is the "raw" config key for model specifier
#   "-m /path/to/file.gguf" is llama.cpp native CLI argument to set the model to run
harbor config set llamacpp.model.specifier -m /app/models/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
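# - [Optional] Read the value back to double-check it
#   (assumes "harbor config get" is available as the counterpart of "config set")
harbor config get llamacpp.model.specifier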


# 3. Shared HuggingFace Hub cache, whole repo:
#
# - Locate and download the repo in its entirety
harbor hf download av-codes/Trinity-2-Codestral-22B-Q4_K_M-GGUF
# - Find the files from the repo
harbor find Trinity-2-Codestral-22B-Q4_K_M-GGUF
# - Point llama.cpp to the GGUF file
#   "/app/models/hub" is where the HuggingFace cache is mounted in the container
#   "llamacpp.model.specifier" is the "raw" config key for model specifier
#   "-m /path/to/file.gguf" is llama.cpp native CLI argument to set the model to run
harbor config set llamacpp.model.specifier -m /app/models/hub/models--av-codes--Trinity-2-Codestral-22B-Q4_K_M-GGUF/snapshots/c0a1f7283809423d193025e92eec6f287425ed59/trinity-2-codestral-22b-q4_k_m.gguf
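
As a sanity check, you can confirm that the downloaded files are visible inside the container at the mount point mentioned above (requires the service to be running):

# List the HuggingFace cache as seen from inside the llamacpp container
harbor exec llamacpp ls /app/models/hub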

Note

Please note that this procedure doesn't download the model. If the model is not found in the cache, it will be downloaded the next time the llamacpp service starts.

Downloaded models are stored in the global llama.cpp cache on your local machine (the same cache the native version uses). The server can only run one model at a time and must be restarted to switch models.
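
For example, switching to another model means pointing llamacpp at a different GGUF and restarting the service. A sketch, assuming harbor down is the counterpart of harbor up and using a placeholder model URL:

# Point llamacpp at a different GGUF (placeholder URL)
harbor llamacpp model https://huggingface.co/user/repo/another-file.gguf

# Restart the service so the new model is picked up
harbor down
harbor up llamacpp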

Configuration

You can provide additional arguments to the llama.cpp CLI via LLAMACPP_EXTRA_ARGS. It can be set either with the Harbor CLI or in the .env file.

# See llama.cpp server args
harbor run llamacpp --server --help

# Set the extra arguments
harbor llamacpp args '--max-tokens 1024 -ngl 100'

# Edit the .env file
HARBOR_LLAMACPP_EXTRA_ARGS="--max-tokens 1024 -ngl 100"
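
To confirm the extra arguments reached the server, you can inspect the service output after a restart. A sketch, assuming Harbor exposes a logs command for its services:

# Restart llamacpp and check the startup logs for the extra args
harbor up llamacpp
harbor logs llamacpp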

You can add llamacpp to default services in Harbor:

# Add llamacpp to the default services
# Will always start when running `harbor up`
harbor defaults add llamacpp

# Remove llamacpp from the default services
harbor defaults rm llamacpp

llama.cpp CLIs and scripts

llama.cpp comes with a lot of helper tools/CLIs, all of which can be accessed via the harbor exec llamacpp command (once the service is running).

# Show the list of available llama.cpp CLIs
harbor exec llamacpp ls

# See the help for one of the CLIs
harbor exec llamacpp ./scripts/llama-bench --help
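
For example, you could benchmark a GGUF that is already present in the shared cache. A sketch, reusing the container path from the examples above and the -m flag of llama-bench:

# Benchmark a GGUF from the mounted HuggingFace cache
harbor exec llamacpp ./scripts/llama-bench -m /app/models/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf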