
vLLM deploy script and example client scripts #32

Merged · 9 commits · Nov 6, 2024
47 changes: 44 additions & 3 deletions vllm-tt-metal-llama3-70b/README.md
@@ -19,7 +19,6 @@ This implementation supports Llama 3.1 70B with vLLM at https://github.com/tenst

If first run setup has already been completed, start here. If it has not, see the instructions for [First run setup](#first-run-setup) below.


### Docker Run - vLLM llama3 inference server

Run the container from the project root at `tt-inference-server`:
@@ -40,12 +39,33 @@ docker run \
ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm:v0.0.1-tt-metal-685ef1303b5a-54b9157d852b
```

By default, the Docker container starts the entrypoint command wrapped in `src/run_vllm_api_server.py`.
If you override the container's default command with an interactive shell via `bash`, you can start the vLLM API server manually:
```bash
# run server manually
python src/run_vllm_api_server.py
```

The vLLM inference API server takes 3-5 minutes to start up (~40-60 minutes on first run while generating caches) and then begins serving requests. To send HTTP requests to the inference server, run the example scripts in a separate bash shell.
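If you want to sanity-check the raw HTTP interface before using the example clients below, here is a minimal `curl` sketch. It assumes the server exposes the OpenAI-compatible `/v1/completions` route (consistent with the OpenAI client used in the example scripts), that port 7000 is whatever port you mapped in `docker run` (placeholder only), and that `$AUTH_TOKEN` holds the bearer token your deployment expects (hypothetical variable name):

```bash
# single non-streaming completion request; adjust host, port, and token to your deployment
curl -s http://localhost:7000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-70B",
        "prompt": "What is Tenstorrent?",
        "max_tokens": 32
      }'
```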

### Example clients

You can use `docker exec -it <container-id> bash` to open a shell in the Docker container, or run the client scripts on the host, ensuring the correct port mappings and Python dependencies are available (see the host-side sketch after the Docker example below):

#### Run example clients from within the Docker container:
```bash
# one-liner to enter an interactive shell on the most recently started container
docker exec -it $(docker ps -q | head -n1) bash

# inside the interactive shell, run the example client script, making requests to the vLLM server:
cd ~/src
# this example runs a single request from alpaca eval, expecting and parsing the streaming response
python example_requests_client_alpaca_eval.py --stream True --n_samples 1 --num_full_iterations 1 --batch_size 1
# this example runs a full-dataset stress test with 32 simultaneous users making requests
python example_requests_client_alpaca_eval.py --stream True --n_samples 805 --num_full_iterations 1 --batch_size 32
```
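#### Run example clients from the host:

If you prefer to run the client scripts from the host (as noted above), the sketch below shows one way to do it. It assumes the repository is checked out on the host and the server port is reachable; the scripts resolve the base URL and auth token through their `get_api_base_url()` and `get_authorization()` helpers, so set whatever environment those helpers expect in your deployment:

```bash
# on the host: install the client dependencies, then run a single-request smoke test
pip install -r vllm-tt-metal-llama3-70b/requirements.txt
cd vllm-tt-metal-llama3-70b/src
python example_requests_client_alpaca_eval.py --stream True --n_samples 1 --num_full_iterations 1 --batch_size 1
```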


## First run setup

@@ -80,11 +100,32 @@ sudo cpupower frequency-set -g performance

### 4. Docker image

Either download the Docker image from the GitHub Container Registry or build it locally using the Dockerfile.

#### Option A: GitHub Container Registry

```bash
# pull image from GHCR
docker pull ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm:v0.0.1-tt-metal-685ef1303b5a-54b9157d852b
```

#### Option B: Build Docker Image

```bash
# build image
export TT_METAL_DOCKERFILE_VERSION=v0.53.0-rc27
export TT_METAL_COMMIT_SHA_OR_TAG=685ef1303b5abdfda63183fdd4fd6ed51b496833
export TT_METAL_COMMIT_DOCKER_TAG=${TT_METAL_COMMIT_SHA_OR_TAG:0:12}
export TT_VLLM_COMMIT_SHA_OR_TAG=54b9157d852b0fa219613c00abbaa5a35f221049
export TT_VLLM_COMMIT_DOCKER_TAG=${TT_VLLM_COMMIT_SHA_OR_TAG:0:12}
docker build \
-t ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm:v0.0.1-tt-metal-${TT_METAL_COMMIT_DOCKER_TAG}-${TT_VLLM_COMMIT_DOCKER_TAG} \
--build-arg TT_METAL_DOCKERFILE_VERSION=${TT_METAL_DOCKERFILE_VERSION} \
--build-arg TT_METAL_COMMIT_SHA_OR_TAG=${TT_METAL_COMMIT_SHA_OR_TAG} \
--build-arg TT_VLLM_COMMIT_SHA_OR_TAG=${TT_VLLM_COMMIT_SHA_OR_TAG} \
. -f vllm.llama3.src.base.inference.v0.52.0.Dockerfile
```
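Once the build finishes, you can confirm the tagged image is available locally, for example:

```bash
# list the locally built image; the tag should match the one used in docker run above
docker images "ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm"
```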

### 5. Automated Setup: environment variables and weights files

The script `vllm-tt-metal-llama3-70b/setup.sh` automates:
5 changes: 5 additions & 0 deletions vllm-tt-metal-llama3-70b/requirements.txt
@@ -0,0 +1,5 @@
# inference server requirements
pyjwt==2.7.0
requests==2.32.3
datasets==3.1.0
openai==1.53.1
112 changes: 112 additions & 0 deletions vllm-tt-metal-llama3-70b/src/example_openai_client_alpaca_eval.py
@@ -0,0 +1,112 @@
# SPDX-License-Identifier: Apache-2.0
#
# SPDX-FileCopyrightText: © 2024 Tenstorrent AI ULC

import threading
import logging
import time

from openai import OpenAI

from example_requests_client_alpaca_eval import (
    parse_args,
    get_api_base_url,
    load_dataset_samples,
    get_authorization,
    test_api_call_threaded_full_queue,
)

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# Thread-safe data collection
responses_lock = threading.Lock()
responses = []


def call_inference_api(prompt, response_idx, stream=True, headers=None, client=None):
    # set API prompt and optional parameters
    req_time = time.time()
    full_text = ""
    num_tokens = 0
    try:
        # Use OpenAI client to call API
        completion = client.completions.create(
            model="meta-llama/Meta-Llama-3.1-70B",
            prompt=prompt,
            temperature=1,
            max_tokens=2048,
            top_p=0.9,
            stop=["<|eot_id|>"],
            stream=stream,
        )
        if stream:
            for event in completion:
                if event.choices[0].finish_reason is not None:
                    break
                if num_tokens == 0:
                    first_token_time = time.time()
                    ttft = first_token_time - req_time
                num_tokens += 1
                content = event.choices[0].text
                full_text += content
        else:
            full_text = completion.choices[0].text
            # Assuming tokens were returned with response (using len to mock token length)
            num_tokens = len(full_text.split())
            first_token_time = req_time  # Simplify for non-stream
            ttft = time.time() - req_time
    except Exception as e:
        logger.error(f"Error calling API: {e}")
        elapsed_time = time.time() - req_time
        logger.error(
            f"Before error: elapsed_time={elapsed_time}, num_tokens: {num_tokens}, full_text: {full_text}"
        )
        full_text = "ERROR"
        num_tokens = 0
        first_token_time = time.time()
        ttft = 0.001

    num_tokens = max(num_tokens, 2)
    throughput_time = max(time.time() - first_token_time, 0.0001)
    response_data = {
        "response_idx": response_idx,
        "prompt": prompt,
        "response": full_text,
        "num_tokens": num_tokens,
        "tps": (num_tokens - 1) / throughput_time,
        "ttft": ttft,
    }

    with responses_lock:
        responses.append(response_data)
    return response_data


if __name__ == "__main__":
    logger.info(
        "Note: OpenAI API client adds additional latency of ~10 ms to the API call."
    )
    args = parse_args()
    prompts = load_dataset_samples(args.n_samples)
    headers = {"Authorization": f"Bearer {get_authorization()}"}
    base_url = get_api_base_url()
    logging.info(f"BASE_API_URL: {base_url}")
    client = OpenAI(
        base_url=base_url,
        api_key=get_authorization(),
    )
    test_api_call_threaded_full_queue(
        prompts=prompts,
        batch_size=args.batch_size,
        num_full_iterations=args.num_full_iterations,
        call_func=call_inference_api,
        call_func_kwargs={
            "stream": args.stream,
            "headers": headers,
            "client": client,
        },
    )
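As a usage sketch, this OpenAI-client variant reuses `parse_args` from `example_requests_client_alpaca_eval.py`, so it accepts the same flags and can be run from `~/src` inside the container in the same way:

```bash
# single streamed request through the OpenAI client wrapper
python example_openai_client_alpaca_eval.py --stream True --n_samples 1 --num_full_iterations 1 --batch_size 1
```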