Dual 3090 AI Inference Workstation
- DUAL 3090 AI Inference Workstation - https://www.youtube.com/watch?v=3sdmkrcmZw0
- Power Supply: Corsair 1000W SFX-L *Affiliate Link
Warning: I'm getting a bit of coil whine under load with this PSU and graphics card combination.
- Motherboard: Minisforum BD790i
- CPU: AMD Ryzen 9 7945HX 5.4 GHz - 16 Cores 32 Threads *Built into motherboard
- RAM: Crucial 5600 96GB Kit *Affiliate Link
- Case: Geometric Future Model 8 *Affiliate Link
- Storage: Crucial T705 4TB Gen 5 NVMe *Affiliate Link
- GPU: 2x Nvidia 3090 Founders Edition *Purchased Used on eBay
- Network Adapters:
- 2.5Gbps Realtek NIC (built into motherboard)
- (Optional) 10Gbps AQC107 NIC in M.2 (NVMe) form factor
*No driver needed on Rocky Linux 9; it autodetects as Aquantia Ethernet
- Cooling:
- 5 x Noctua NF-A12x25 *Affiliate Link
- 3 x Noctua NF-A14 *Affiliate Link
- 1 x Noctua NF-A12x15 *Affiliate Link
- Screws:
- M2.5 Screws for Minisforum CPU Fan Bracket *Affiliate Link
- (Optional) Misc Assorted PC Screws *Affiliate Link
- Cables:
- 1 x JMT PCI-E 4.0 x16 1 to 2 PCIe Bifurcation *Affiliate Link
- 1 x PCIE 4.0 Extension Cable Length 250mm *Affiliate Link
- 1 x EZDIY-FAB Vertical GPU Mount with High-Speed PCIE 4.0 Riser Cable *Affiliate Link
- 2 x 6pin + 2pin PCIe Power Extension Cables *Affiliate Link
- 1 x USB 3.1 Type-E Motherboard Header to USB-C Adapter (Male to Female)
- 1 x USB C Extension Cable *Affiliate Link
- 1 x USB 2.0 9-Pin Header (Male) to External USB-A Extension Cable
Install the Nvidia driver and CUDA Toolkit
Reference: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
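A rough sketch of the network-repo route on Rocky Linux 9 (repo path and package names follow Nvidia's RHEL 9 instructions; check the guide above for your exact CUDA version):
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf clean expire-cache
sudo dnf module install nvidia-driver:latest-dkms
sudo dnf install cuda-toolkit-12-5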
Reboot
You will find that Xorg places processes on the Nvidia GPUs rather than using the iGPU built into the CPU. These processes reserve VRAM and can cause out-of-memory errors when running AI workloads.
To prevent this, comment out the Nvidia section of the Xorg configuration found within /etc/X11/xorg.conf.d/. With that section commented out, Xorg no longer binds to the nvidia driver and therefore won't use the cards for window management.
#Section "OutputClass"
# Identifier "nvidia"
# MatchDriver "nvidia-drm"
# Driver "nvidia"
# Option "AllowEmptyInitialConfiguration"
# Option "PrimaryGPU" "no"
# Option "SLI" "Auto"
# Option "BaseMosaic" "on"
#EndSection
Section "OutputClass"
Identifier "intel"
MatchDriver "i915"
Driver "modesetting"
EndSection
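After rebooting you can confirm nothing is running on the cards; the process list at the bottom of the output should be empty (no Xorg or gnome-shell entries) for both GPUs:
nvidia-smi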
Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
Reference: https://github.com/ggerganov/llama.cpp
Compile llama.cpp with Nvidia CUDA support
export PATH=$PATH:/usr/local/cuda-12.5/bin
make LLAMA_CUDA=1
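Note: more recent llama.cpp revisions have moved from the Makefile to CMake. If make LLAMA_CUDA=1 isn't accepted by your checkout, the rough CMake equivalent is:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
With the CMake route, binaries such as llama-server end up under build/bin/ instead of the repo root.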
Download Mixtral 8x7B Instruct GGUF quant
https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
Reference: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/tree/main
Download Dolphin Starcoder2 7B quant:
https://huggingface.co/bartowski/dolphincoder-starcoder2-7b-GGUF/resolve/main/dolphincoder-starcoder2-7b-Q6_K.gguf?download=true
Reference: https://huggingface.co/bartowski/dolphincoder-starcoder2-7b-GGUF/tree/main
Place the models into the llama.cpp models folder
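For example, downloading straight into the models folder with wget (file names assume the two quants linked above):
cd llama.cpp
wget -P models https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
wget -P models https://huggingface.co/bartowski/dolphincoder-starcoder2-7b-GGUF/resolve/main/dolphincoder-starcoder2-7b-Q6_K.gguf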
Start the Instruct Server
./llama-server --port 8080 -m models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -ngl 99
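-ngl 99 offloads all layers to VRAM; with two visible CUDA devices llama.cpp should split the layers across both 3090s by default, and the --tensor-split (-ts) flag can adjust the ratio if one card has less free memory.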
Start the Autocomplete Server
./llama-server --port 8081 -m models/dolphincoder-starcoder2-7b-Q6_K.gguf -ngl 99
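Before wiring up the editor you can sanity-check both servers; llama-server exposes /health and /completion endpoints:
curl http://localhost:8080/health
curl http://localhost:8081/health
curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{"prompt": "Write a haiku about GPUs", "n_predict": 64}'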
Install VSCode
Reference: https://code.visualstudio.com/
Install the VSCode extension "Continue"
https://github.com/continuedev/continue
Configure Continue
Open the Continue configuration file (config.json) using the VSCode Command Palette
Reference: https://docs.continue.dev/reference/Model%20Providers/llamacpp
{
  "models": [
    {
      "title": "Mixtral 8x7B",
      "provider": "llama.cpp",
      "model": "mistral-8x7b",
      "apiBase": "http://localhost:8080",
      "systemMessage": "You are an expert software developer. You give helpful and concise responses. If asked to write something like a function, comment or docblock, wrap it in code ticks for easy copy paste"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Dolphin Starcoder2",
    "provider": "llama.cpp",
    "model": "starcoder2:7b",
    "apiBase": "http://localhost:8081"
  },
  "tabAutocompleteOptions": {
    "useCopyBuffer": false,
    "maxPromptTokens": 4000,
    "prefixPercentage": 0.5,
    "multilineCompletions": "always",
    "debounceDelay": 150
  },
  "allowAnonymousTelemetry": false
}
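With both llama-server instances running and this config saved, Mixtral should appear as a selectable model in Continue's chat panel, and tab autocomplete requests will be sent to the StarCoder2 server on port 8081.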