Important:
- For a simple automatic install, use the one-click installers provided in the original repo.
- This tech is absolutely bleeding edge; methods and tools change on a daily basis. Consider this page outdated as soon as it is updated - things break regularly.
- Look for more recent tutorials on YouTube and in Reddit comments, but those will also eventually be outdated again.
# 1 install WSL2 on Windows 11, then:
sudo apt update
sudo apt-get install build-essential
sudo apt install git -y
# optional: install a better terminal experience, otherwise skip to step 4
# 2 install brew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
(echo; echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"') >> /home/$USER/.bashrc
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
brew doctor
# 3 install oh-my-posh
brew install jandedobbeleer/oh-my-posh/oh-my-posh
ls "$(brew --prefix oh-my-posh)/themes"
# copy the themes path and add it to the second eval line below:
nano ~/.bashrc
# add this to the end:
# eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
# eval "$(oh-my-posh init bash --config '/home/linuxbrew/.linuxbrew/opt/oh-my-posh/themes/atomic.omp.json')"
# plugins=(
# git
# # other plugins
# )
# CTRL+X to end editing
# Y to save changes
# ENTER to finally exit
source ~/.bashrc
exec bash
# 4 install mamba instead of conda, because it's faster https://mamba.readthedocs.io/en/latest/installation.html
mkdir github
mkdir downloads
cd downloads
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-$(uname)-$(uname -m).sh
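# optional sanity check (assumes you let the installer initialize your shell); restart the shell first:
source ~/.bashrc
mamba --version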
# 5 install the correct cuda toolkit 11.7, not 12.x
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
sudo sh cuda_11.7.0_515.43.04_linux.run
nano ~/.bashrc
# add the following line to add the CUDA/WSL libraries to the LD_LIBRARY_PATH environment variable
# export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
# after the plugins=() code block, above conda initialize
# CTRL+X to end editing
# Y to save changes
# ENTER to finally exit
source ~/.bashrc
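# optional check that the toolkit installed (assumes the default /usr/local/cuda prefix; nvcc is only added to PATH later, in step 9):
/usr/local/cuda/bin/nvcc --version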
cd ..
# 6 install ooba's textgen
mamba create --name textgen python=3.10.9
mamba activate textgen
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio -f https://download.pytorch.org/whl/cu117/torch_stable.html
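# quick check that this torch build sees the GPU (should print the version and "True"):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"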
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
# 7 Install 4bit support through GPTQ-for-LLaMa
mkdir repositories
cd repositories
# choose ONE of the following:
# A) for fast triton https://www.reddit.com/r/LocalLLaMA/comments/13g8v5q/fastest_inference_branch_of_gptqforllama_and/
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b fastest-inference-4bit
# B) for triton
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b triton
# C) for newer cuda
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda
# D) for widely compatible old cuda
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
# groupsize, act-order, true-sequential
# --act-order (quantizing columns in order of decreasing activation size)
# --true-sequential (performing sequential quantization even within a single Transformer block)
# Those fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general.
# --groupsize
# Currently, groupsize and act-order do not work together; you must choose one of them.
# Ooba: there is a pytorch branch from qwop that allows you to use groupsize and act-order together.
# Models without group-size (better for the 7b model)
# Models with group-size (better from 13b upwards)
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install
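# optional: quantize a model yourself instead of downloading a pre-quantized one, using the flags discussed above.
# hedged sketch only - the paths are placeholders and the exact llama.py flags should be checked
# against the README of the branch you cloned:
# CUDA_VISIBLE_DEVICES=0 python llama.py /path/to/llama-7b-hf c4 --wbits 4 --true-sequential --groupsize 128 --save llama7b-4bit-128g.pt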
cd ..
cd ..
# 8 Test ooba with a 4bit GPTQ model
python download-model.py 4bit/WizardLM-13B-Uncensored-4bit-128g
python server.py --wbits 4 --model_type llama --groupsize 128 --chat
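# once the model has loaded, the UI is served on Gradio's default port; open http://127.0.0.1:7860 in your browser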
# 9 install llama.cpp
cd repositories
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
nano ~/.bashrc
# add the cuda bin folder to the path environment variable in order for make to find nvcc:
# export PATH=/usr/local/cuda/bin:$PATH
# after the export LD_LIBRARY_PATH line
# CTRL+X to end editing
# Y to save changes
# ENTER to finally exit
source ~/.bashrc
make LLAMA_CUBLAS=1
cd models
wget https://huggingface.co/TheBloke/WizardLM-13B-Uncensored-GGML/resolve/main/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin
cd ..
# 10 test llama.cpp with GPU support
./main -t 8 -m models/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: write a story about llamas ### Response:" --n-gpu-layers 30
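# flag reference for the command above: -t CPU threads, -m model path, -c context size,
# --temp sampling temperature, --repeat_penalty repetition penalty, -n -1 generate until the model stops,
# -p prompt, --n-gpu-layers layers offloaded to the GPU via cuBLAS (lower this if you run out of VRAM)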
cd ..
cd ..
# 11 prepare ooba's textgen for llama.cpp support, by compiling llama-cpp-python with cuda GPU support
pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
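# quick check that the cuBLAS build imports cleanly:
python -c "from llama_cpp import Llama; print('llama-cpp-python OK')"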
Installation guide from 2023-03-01 (outdated)
- Press the Windows key + X and click on "Windows PowerShell (Admin)" or "Windows Terminal (Admin)" to open PowerShell or Terminal with administrator privileges.
wsl --install
You may be prompted to restart your computer. If so, save your work and restart.
- Install Windows Terminal from the Windows Store
- Install Ubuntu from the Windows Store
- Choose the desired Ubuntu version (e.g., Ubuntu 20.04 LTS) and click "Get" or "Install" to download and install the Ubuntu app.
- Once the installation is complete, click "Launch" or search for "Ubuntu" in the Start menu and open the app.
- When you first launch the Ubuntu app, it will take a few minutes to set up. Be patient as it installs the necessary files and sets up your environment.
- Once the setup is complete, you will be prompted to create a new UNIX username and password. Choose a username and password, and make sure to remember them, as you will need them for future administrative tasks within the Ubuntu environment.
- If you prefer to use Windows Terminal from now on, close this console and start Windows Terminal then open a new Ubuntu console by clicking the drop down icon on top of Terminal and choose Ubuntu. Otherwise stay in the existing console window.
sudo apt update
sudo apt upgrade
sudo apt install git
sudo apt install wget
mkdir downloads
cd downloads/
wget https://repo.anaconda.com/archive/Anaconda3-2023.03-1-Linux-x86_64.sh
chmod +x ./Anaconda3-2023.03-1-Linux-x86_64.sh
./Anaconda3-2023.03-1-Linux-x86_64.sh
and follow the defaults
sudo apt install build-essential
cd ..
conda create -n textgen python=3.10.9
conda activate textgen
pip3 install torch torchvision torchaudio
mkdir github
cd github
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
pip install chardet cchardet
If you want to try the triton branch, skip to Newer GPTQ-Triton
- Works on Windows, Linux, WSL2.
- Supports 3 & 4 bit models
- Only supports no-act-order models
- Slower than triton
- Works best with --groupsize 128 --wbits 4 and no-act-order models
mkdir repositories
cd repositories
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
(or try the newer https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda build)
cd GPTQ-for-LLaMa
python -m pip install -r requirements.txt
python setup_cuda.py install
if this gives an error about g++, try installing the correct g++ version:
conda install -y -k gxx_linux-64=11.2.0
cd ../..
This triton branch or this one:
- Works on Linux and WSL2
- Supports 4 bit quantized models
- Is faster than cuda
- Works best with the --groupsize 128 --wbits 4 flags and act-order models
mkdir repositories
cd repositories
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
(or try https://github.com/fpgaminer/GPTQ-triton)
cd GPTQ-for-LLaMa
pip install -r requirements.txt
cd ../..
Alternatively you can try AutoGPTQ to install cuda, older llama-cuda, or triton variants:
- run one of these:
pip install auto-gptq
to install the cuda branch for newer models
pip install auto-gptq[llama]
if your transformers is outdated or you are using older models that don't support it
pip install auto-gptq[triton]
to install the triton branch for triton compatible models
cd ../..
If you want to open the webui from within your home network, enable port forwarding on your windows machine, with this command in an administrator terminal:
netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=7860 connectaddress=localhost connectport=7860
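To list or remove the forwarding rule later (standard netsh syntax):
netsh interface portproxy show all
netsh interface portproxy delete v4tov4 listenaddress=0.0.0.0 listenport=7860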
- Either always run
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib
before running the server.py below
- Or try to install
pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda113
Optional, allows for faster but non-deterministic inference:
pip install xformers
- then use the --xformers flag later, when running the server.py below
You're done with the Ubuntu / WSL2 installation, you can skip to Download models section.
- Download and install miniconda
- Download and install git for windows
- Open Anaconda Prompt (Miniconda 3) from the Start Menu
- It should load in C:\Users\yourusername>
mkdir github
cd github
conda create --name textgen python=3.10
conda activate textgen
conda install pip
conda install -y -k pytorch[version=2,build=py3.10_cuda11.7*] torchvision torchaudio pytorch-cuda=11.7 cuda-toolkit ninja git -c pytorch -c nvidia/label/cuda-11.7.0 -c nvidia
git clone https://github.com/oobabooga/text-generation-webui.git
python -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/raw/main/bitsandbytes-0.38.1-py3-none-any.whl
cd text-generation-webui
pip install -r requirements.txt --upgrade
mkdir repositories
cd repositories
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
python -m pip install -r requirements.txt
python setup_cuda.py install
might fail; continue with the next command if so
pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/main/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl
skip this command if the previous one didn't fail
cd ..\..
(go back to text-generation-webui)
pip install faust-cchardet
pip install chardet
- Still in your terminal, make sure you are in the /text-generation-webui/ folder and type
python download-model.py
- select other to download a custom model
- paste the huggingface user/directory, for example:
TheBloke/wizardLM-7B-GGML
and let it download the model files
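Alternatively, you can pass the repo directly instead of using the menu (the files end up in the models/ folder of text-generation-webui):
python download-model.py TheBloke/wizardLM-7B-GGML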
The base command to run. You have to add further flags, depending on the model and environment you want to run in:
- if you are on WSL2 Ubuntu, run
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib
always, before running the server.py
python server.py --model-menu --chat
- --model-menu allows changing models in the UI
- --chat loads the chat UI instead of the text completion UI
- --wbits 4 loads a 4-bit quantized model
- --groupsize 128 add this parameter if the model specifies a groupsize
- --model_type llama if the model name is unknown, specify its base model. If you run llama-derived models like vicuna, alpaca, gpt4-x, codecapybara or wizardLM you have to define it as llama. If you load OPT or GPT-J models, define the flag accordingly
- --xformers if you have properly installed xformers and want faster but nondeterministic answer generation
An example combining these flags follows below.
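Putting it together, a typical invocation for a 4-bit, 128-groupsize, llama-derived model on WSL2 Ubuntu (only the flags described above; adjust to your model):
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib
python server.py --model-menu --chat --wbits 4 --groupsize 128 --model_type llama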
If you get a "cuda lib not found" error, especially on Windows WSL2 Ubuntu, try executing
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib
before running the server.py above.
Also try:
pip install faust-cchardet
pip install chardet
or the other way around. Then try to start the server again.
On Windows Native, try:
pip uninstall bitsandbytes
pip install git+https://github.com/Keith-Hon/bitsandbytes-windows.git
- here are some discussions, but some solutions are for Windows WSL2, some for Windows native
Or try these prebuilt wheels on Windows:
- https://github.com/TimDettmers/bitsandbytes/files/11084955/bitsandbytes-0.37.2-py3-none-any.whl.zip
- https://github.com/acpopescu/bitsandbytes/releases/tag/v0.37.2-win.0
- And more help on windows support here and here
Still having problems? Try to manually copy the libraries.
On Linux or Windows WSL2 Ubuntu, try:
- make sure you run
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib
before running the server.py every time!
- alternatively, you can try
pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda113
and see if it works without the above command
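To avoid typing the export every time, you can append it to your ~/.bashrc (same approach as in the WSL2 guide above):
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib' >> ~/.bashrc
source ~/.bashrc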
pip install xformers==0.0.16rc425
Use llama.cpp, HN discussion
See an up to date list of most models you can run locally: awesome-ai open-models
See the awesome-ai LLM section for more tools, GUIs etc.