
chore: llama multinode example
b0xtch committed Jun 28, 2024
1 parent 242e006 commit 5d2635e
Showing 6 changed files with 892 additions and 0 deletions.
5 changes: 5 additions & 0 deletions in candle-examples/Cargo.toml

@@ -36,6 +36,7 @@ serde_json = { workspace = true }
 symphonia = { version = "0.5.3", features = ["all"], optional = true }
 tokenizers = { workspace = true, features = ["onig"] }
 cpal= { version = "0.15.2", optional = true }
+futures = "0.3.30"
 
 [dev-dependencies]
 anyhow = { workspace = true }
@@ -73,6 +74,10 @@ depth_anything_v2 = ["palette", "enterpolation"]
 name = "llama_multiprocess"
 required-features = ["cuda", "nccl", "flash-attn"]
 
+[[example]]
+name = "llama_multinode"
+required-features = ["cuda", "nccl"]
+
 [[example]]
 name = "reinforcement-learning"
 required-features = ["pyo3"]
120 changes: 120 additions & 0 deletions in candle-examples/examples/llama_multinode/README.md

@@ -0,0 +1,120 @@
# Llama Multinode

This example implements multi-node, multi-GPU inference for the Llama language model using Rust, CUDA, and NCCL.

## TL;DR

To quickly set up and run the project on the master node, use this single command:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y && \
. "$HOME/.cargo/env" && \
git clone -b chore/llama_multinode https://github.com/b0xtch/candle.git && \
cd candle && \
git submodule update --init --recursive && \
pip install --upgrade huggingface_hub && \
echo '<hf_token>' | huggingface-cli login && \
source nccl_env_vars.sh && \
RUST_BACKTRACE=1 cargo run --example llama_multinode --release --features="cuda nccl" -- \
--num-nodes 2 \
--node-rank 0 \
--master-addr 10.0.10.30 \
--master-port 29500 \
--num-gpus-per-node 1 \
--model-id "meta-llama/Meta-Llama-3-8B" \
--dtype bf16 \
--prompt "Once upon a time"
```

Note: Replace `<hf_token>` with your Hugging Face access token and `10.0.10.30` with the private IP of your master node.

## Prerequisites

- CUDA-capable GPUs
- CUDA Toolkit 12.1 or later
- Rust toolchain
- Docker (for containerized setup)
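
A quick way to sanity-check these prerequisites on each node (assuming the CUDA toolkit, Rust, and Docker are already on your `PATH`):

```bash
# CUDA compiler version and driver / GPU visibility
nvcc --version
nvidia-smi

# Rust toolchain and (optional) Docker
rustc --version && cargo --version
docker --version
```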

## Setup on AWS Nodes

Follow these steps to set up and run the project on AWS nodes:

1. Install Rust:

   ```bash
   curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
   source "$HOME/.cargo/env"
   ```

2. Clone the repository and update submodules:

   ```bash
   git clone -b chore/llama_multinode https://github.com/b0xtch/candle.git
   cd candle
   git submodule update --init --recursive
   ```

3. Install and log in with the Hugging Face CLI (replace `<hf_token>` with your Hugging Face access token):

   ```bash
   pip install --upgrade huggingface_hub
   echo '<hf_token>' | huggingface-cli login
   ```

4. Set up the NCCL environment variables (typical contents of `nccl_env_vars.sh` are sketched below):

   ```bash
   source nccl_env_vars.sh
   ```
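
The exact values live in the `nccl_env_vars.sh` checked in alongside this example; as a rough sketch, a script of this kind usually just exports NCCL settings such as the logging level and the network interface to use (the interface name below is a placeholder, not the file's actual contents):

```bash
# Illustrative NCCL settings only -- consult the checked-in nccl_env_vars.sh
# for the values this example actually uses.
export NCCL_DEBUG=INFO          # verbose NCCL logging while bringing the cluster up
export NCCL_SOCKET_IFNAME=ens5  # NIC to use for inter-node traffic (placeholder name)
export NCCL_IB_DISABLE=1        # fall back to TCP sockets if InfiniBand is unavailable
```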

## Running the Distributed Llama Model

### On the Master Node

Run the following command, replacing `10.0.10.30` with the private IP of your master node:

```bash
RUST_BACKTRACE=1 cargo run --example llama_multinode --release --features="cuda nccl" -- \
--num-nodes 2 \
--node-rank 0 \
--master-addr 10.0.10.30 \
--master-port 29500 \
--num-gpus-per-node 1 \
--model-id "meta-llama/Meta-Llama-3-8B" \
--dtype bf16 \
--prompt "Once upon a time"
```

### On Worker Nodes

Run the following command on each worker node, replacing `10.0.10.30` with the private IP of your master node:

```bash
RUST_BACKTRACE=1 cargo run --example llama_multinode --release --features="cuda nccl" -- \
--num-nodes 2 \
--node-rank 1 \
--master-addr 10.0.10.30 \
--master-port 29500 \
--num-gpus-per-node 1 \
--model-id "meta-llama/Meta-Llama-3-8B" \
--dtype bf16 \
--prompt "Once upon a time"
```

Note: Increment `--node-rank` (2, 3, ...) for each additional worker node, and set `--num-nodes` on every node to the total number of nodes.

## Troubleshooting

- Ensure all nodes can reach each other over the specified port (29500 in this example); a quick connectivity check is sketched below.
- Check that the CUDA and NCCL versions are compatible across all nodes.
- Verify that the Hugging Face token has the necessary permissions to access the model.
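
As a quick connectivity check (assuming `netcat` is installed), each worker can probe the master's rendezvous port before launching:

```bash
# Run from a worker node; replace 10.0.10.30 with your master's private IP.
nc -zv 10.0.10.30 29500
```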

## Misc

> Optional: Run the NVIDIA CUDA container:
```bash
docker run -it --gpus all nvidia/cuda:12.1.1-devel-ubuntu20.04 /bin/bash
```

> Optional: Inside the container, install necessary dependencies:
```bash
apt-get -y update && \
apt-get -y install curl git pkg-config libssl-dev
```
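
> Optional: If the repository is already checked out on the host, one way to reuse it inside the container is to mount it (the mount path here is illustrative):
```bash
# Mount the host checkout into the container and start in that directory
docker run -it --gpus all \
  -v "$(pwd)":/workspace -w /workspace \
  nvidia/cuda:12.1.1-devel-ubuntu20.04 /bin/bash
```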
---