2.2.2 Backend: llama.cpp
Handle: llamacpp
URL: http://localhost:33831
LLM inference in C/C++. Allows you to bypass the Ollama release cycle when needed, to get access to the latest models or features.
The llamacpp Docker image is quite large due to its dependency on CUDA and other libraries. You might want to pull it ahead of time.
# [Optional] Pull the llamacpp image
# ahead of starting the service
harbor pull llamacpp
Start Harbor with the llamacpp service:
harbor up llamacpp
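Once the service is up, you can optionally verify that the server is reachable. This is a minimal sanity check, assuming the default Harbor port shown above; the /health endpoint is provided by llama.cpp's built-in server.
# Optional sanity check: the llama.cpp server exposes a /health endpoint
curl http://localhost:33831/health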
You can find GGUF models to run on the HuggingFace Hub. After you find a model you want to run, grab the URL from the browser address bar and pass it to the Harbor config:
# Quick lookup for the models
harbor hf find gguf
# 1. With llama.cpp's own cache:
#
# - Set the model to run; it will be downloaded when llamacpp starts
#   Accepts a full URL to the GGUF file (from the browser address bar)
harbor llamacpp model https://huggingface.co/user/repo/file.gguf
# 2. Shared HuggingFace Hub cache, single file:
#
# - Locate the GGUF to download, for example:
# https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/blob/main/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Download a single file: <user/repo> <file.gguf>
harbor hf download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Locate the file in the cache
harbor find Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# - Set the GGUF to llama.cpp
# "/app/models/hub" is where the HuggingFace cache is mounted in the container
# "llamacpp.model.specifier" is the "raw" config key for model specifier
# "-m /path/to/file.gguf" is llama.cpp native CLI argument to set the model to run
harbor config set llamacpp.model.specifier -m /app/models/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
# 3. Shared HuggingFace Hub cache, whole repo:
#
# - Locate and download the repo in its entirety
harbor hf download av-codes/Trinity-2-Codestral-22B-Q4_K_M-GGUF
# - Find the files from the repo
harbor find Trinity-2-Codestral-22B-Q4_K_M-GGUF
# - Set the GGUF to llama.cpp
# "/app/models/hub" is where the HuggingFace cache is mounted in the container
# "llamacpp.model.specifier" is the "raw" config key for model specifier
# "-m /path/to/file.gguf" is llama.cpp native CLI argument to set the model to run
harbor config set llamacpp.model.specifier -m /app/models/hub/models--av-codes--Trinity-2-Codestral-22B-Q4_K_M-GGUF/snapshots/c0a1f7283809423d193025e92eec6f287425ed59/trinity-2-codestral-22b-q4_k_m.gguf
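To double-check which specifier is currently configured, you can read the value back. The sketch below assumes harbor config offers a get counterpart to the set command used above.
# Read back the raw config key (assumed "get" counterpart to "config set")
harbor config get llamacpp.model.specifier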
Note
Please note that this procedure doesn't download the model. If the model is not found in the cache, it will be downloaded on the next start of the llamacpp service.
Downloaded models are stored in the global llama.cpp cache on your local machine (the same one the native version uses). The server can only run one model at a time and must be restarted to switch models.
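As a sketch of what switching models looks like in practice (using only the commands shown above, with a placeholder GGUF URL): point llamacpp at the new file, then restart the service.
# Point llamacpp at a different GGUF (placeholder URL)
harbor llamacpp model https://huggingface.co/user/repo/another-file.gguf
# Restart so the server picks up the new model
harbor down && harbor up llamacpp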
You can provide additional arguments to the llama.cpp CLI via LLAMACPP_EXTRA_ARGS. It can be set either with the Harbor CLI or in the .env file.
# See llama.cpp server args
harbor run llamacpp --server --help
# Set the extra arguments
harbor llamacpp args '--max-tokens 1024 -ngl 100'
# Edit the .env file
HARBOR_LLAMACPP_EXTRA_ARGS="--max-tokens 1024 -ngl 100"
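If you want to confirm the value that will be applied, you can presumably read it back via the config interface; the key name below is assumed from the HARBOR_LLAMACPP_EXTRA_ARGS variable, following the same naming convention as llamacpp.model.specifier above.
# Read back the extra args (key name assumed from the env variable)
harbor config get llamacpp.extra.args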
You can add llamacpp to the default services in Harbor:
# Add llamacpp to the default services
# Will always start when running `harbor up`
harbor defaults add llamacpp
# Remove llamacpp from the default services
harbor defaults rm llamacpp
llama.cpp comes with a lot of helper tools/CLIs, all of which can be accessed via the harbor exec llamacpp command (once the service is running).
# Show the list of available llama.cpp CLIs
harbor exec llamacpp ls
# See the help for one of the CLIs
harbor exec llamacpp ./scripts/llama-bench --help
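For example, once a model is configured you could benchmark it in place. This is a sketch: it assumes the container path resolved earlier on this page, and that llama-bench accepts a model file via -m as in upstream llama.cpp.
# Benchmark the GGUF configured earlier (container path from the example above)
harbor exec llamacpp ./scripts/llama-bench -m /app/models/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf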