Quick setup for llama cpp python backend
The llama-cpp-python project itself has installation notes on its PyPI page, but with some bits missing.
Windows: "how to run llama.cpp on windows"
An M1/M2 Mac llama.cpp install recipe I found. After following it you should be able to install llama-cpp-python easily as a Python package. There are reports that this recipe also works with WSL2 on Windows.
llama-cpp-python comes with its own copy of llama.cpp, so you need to set the build flags and make sure the wheel gets rebuilt:
pip uninstall llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -U --no-cache-dir llama-cpp-python
You can then run the server, e.g. like so:
# Replace with whatever GGML model you've downloaded
HOST=0.0.0.0 python3 -m llama_cpp.server --n_gpu_layers=40 --model /opt/mlai/cache/huggingface/dl/TheBloke_Nous-Hermes-13B-GGML/nous-hermes-13b.ggmlv3.q4_K_M.bin
In my demo case --n_gpu_layers=40 uses under 8GB of VRAM. Tweak to taste.
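The same GPU offload setting applies if you use llama-cpp-python directly as a library rather than through the server. A minimal sketch, reusing the model path from the server example above (adjust to wherever your model lives):

```python
# Minimal sketch of loading the same model directly via the Python API,
# with the same GPU offload setting as the server example above.
from llama_cpp import Llama

llm = Llama(
    model_path="/opt/mlai/cache/huggingface/dl/TheBloke_Nous-Hermes-13B-GGML/nous-hermes-13b.ggmlv3.q4_K_M.bin",
    n_gpu_layers=40,  # number of layers to offload to the GPU; tweak to taste
)
print(llm("Q: What is the capital of Nigeria? A:", max_tokens=32)["choices"][0]["text"])
```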
Make sure you see BLAS=1 in the startup output, to confirm the GPU is being used.
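Once the server is up, you can do a quick end-to-end check from Python. This is a sketch assuming the server's default port of 8000 and its OpenAI-compatible /v1/completions endpoint:

```python
# Quick end-to-end check against the llama_cpp.server started above.
# Assumes the default port 8000; pass --port to the server command to change it.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Q: Name the planets in the solar system. A: ", "max_tokens": 64},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```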