
vLLM deploy script and example client scripts #32

Merged · 9 commits · Nov 6, 2024
47 changes: 44 additions & 3 deletions vllm-tt-metal-llama3-70b/README.md
@@ -19,7 +19,6 @@ This implementation supports Llama 3.1 70B with vLLM at https://github.com/tenst

If first run setup has already been completed, start here. If it has not, see the instructions for [First run setup](#first-run-setup) below.


### Docker Run - vLLM llama3 inference server

Run the container from the project root at `tt-inference-server`:
@@ -40,12 +39,33 @@ docker run \
ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm:v0.0.1-tt-metal-685ef1303b5a-54b9157d852b
```

By default, the Docker container starts the entrypoint command wrapped in `src/run_vllm_api_server.py`.
If you override the container's default command with an interactive shell via `bash`, you can start the vLLM API server manually:
```bash
# run server manually
python src/run_vllm_api_server.py
```

The vLLM inference API server takes 3-5 minutes to start up (~40-60 minutes on first run while generating caches) and then begins serving requests. To send HTTP requests to the inference server, run the example scripts in a separate bash shell.
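If you want to sanity-check the raw HTTP interface before using the example clients below, here is a minimal `curl` sketch. It assumes the server exposes the OpenAI-compatible `/v1/completions` route (consistent with the OpenAI client used in the example scripts), that port 7000 is whatever port you mapped in `docker run` (placeholder only), and that `$AUTH_TOKEN` holds the bearer token your deployment expects (hypothetical variable name):

```bash
# single non-streaming completion request; adjust host, port, and token to your deployment
curl -s http://localhost:7000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-70B",
        "prompt": "What is Tenstorrent?",
        "max_tokens": 32
      }'
```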

### Example clients

You can use `docker exec -it <container-id> bash` to open a shell in the Docker container, or run the client scripts on the host, ensuring the correct port mappings and Python dependencies are available (see the host-side sketch after the Docker example below):

#### Run example clients from within the Docker container:
```bash
# one-liner to enter an interactive shell on the most recently started container
docker exec -it $(docker ps -q | head -n1) bash

# inside the interactive shell, run the example client script, making requests to the vLLM server:
cd ~/src
# this example runs a single request from alpaca eval, expecting and parsing the streaming response
python example_requests_client_alpaca_eval.py --stream True --n_samples 1 --num_full_iterations 1 --batch_size 1
# this example runs a full-dataset stress test with 32 simultaneous users making requests
python example_requests_client_alpaca_eval.py --stream True --n_samples 805 --num_full_iterations 1 --batch_size 32
```
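#### Run example clients from the host:

If you prefer to run the client scripts from the host (as noted above), the sketch below shows one way to do it. It assumes the repository is checked out on the host and the server port is reachable; the scripts resolve the base URL and auth token through their `get_api_base_url()` and `get_authorization()` helpers, so set whatever environment those helpers expect in your deployment:

```bash
# on the host: install the client dependencies, then run a single-request smoke test
pip install -r vllm-tt-metal-llama3-70b/requirements.txt
cd vllm-tt-metal-llama3-70b/src
python example_requests_client_alpaca_eval.py --stream True --n_samples 1 --num_full_iterations 1 --batch_size 1
```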


## First run setup

@@ -80,11 +100,32 @@ sudo cpupower frequency-set -g performance

### 4. Docker image

Either download the Docker image from the GitHub Container Registry or build it locally using the Dockerfile.

#### Option A: GitHub Container Registry

```bash
# pull image from GHCR
docker pull ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm:v0.0.1-tt-metal-685ef1303b5a-54b9157d852b
```

#### Option B: Build Docker Image

```bash
# build image
export TT_METAL_DOCKERFILE_VERSION=v0.53.0-rc27
export TT_METAL_COMMIT_SHA_OR_TAG=685ef1303b5abdfda63183fdd4fd6ed51b496833
export TT_METAL_COMMIT_DOCKER_TAG=${TT_METAL_COMMIT_SHA_OR_TAG:0:12}
export TT_VLLM_COMMIT_SHA_OR_TAG=54b9157d852b0fa219613c00abbaa5a35f221049
export TT_VLLM_COMMIT_DOCKER_TAG=${TT_VLLM_COMMIT_SHA_OR_TAG:0:12}
docker build \
-t ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm:v0.0.1-tt-metal-${TT_METAL_COMMIT_DOCKER_TAG}-${TT_VLLM_COMMIT_DOCKER_TAG} \
--build-arg TT_METAL_DOCKERFILE_VERSION=${TT_METAL_DOCKERFILE_VERSION} \
--build-arg TT_METAL_COMMIT_SHA_OR_TAG=${TT_METAL_COMMIT_SHA_OR_TAG} \
--build-arg TT_VLLM_COMMIT_SHA_OR_TAG=${TT_VLLM_COMMIT_SHA_OR_TAG} \
. -f vllm.llama3.src.base.inference.v0.52.0.Dockerfile
```
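Once the build finishes, you can confirm the tagged image is available locally, for example:

```bash
# list the locally built image; the tag should match the one used in docker run above
docker images "ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm"
```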

### 5. Automated Setup: environment variables and weights files

The script `vllm-tt-metal-llama3-70b/setup.sh` automates:
5 changes: 5 additions & 0 deletions vllm-tt-metal-llama3-70b/requirements.txt
@@ -0,0 +1,5 @@
# inference server requirements
pyjwt==2.7.0
requests==2.32.3
datasets==3.1.0
openai==1.53.1
112 changes: 112 additions & 0 deletions vllm-tt-metal-llama3-70b/src/example_openai_client_alpaca_eval.py
@@ -0,0 +1,112 @@
# SPDX-License-Identifier: Apache-2.0
#
# SPDX-FileCopyrightText: © 2024 Tenstorrent AI ULC

import threading
import logging
import time

from openai import OpenAI

from example_requests_client_alpaca_eval import (
    parse_args,
    get_api_base_url,
    load_dataset_samples,
    get_authorization,
    test_api_call_threaded_full_queue,
)

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# Thread-safe data collection
responses_lock = threading.Lock()
responses = []


def call_inference_api(prompt, response_idx, stream=True, headers=None, client=None):
    # set API prompt and optional parameters
    req_time = time.time()
    full_text = ""
    num_tokens = 0
    try:
        # Use OpenAI client to call API
        completion = client.completions.create(
            model="meta-llama/Meta-Llama-3.1-70B",
            prompt=prompt,
            temperature=1,
            max_tokens=2048,
            top_p=0.9,
            stop=["<|eot_id|>"],
            stream=stream,
        )
        if stream:
            for event in completion:
                if event.choices[0].finish_reason is not None:
                    break
                if num_tokens == 0:
                    first_token_time = time.time()
                    ttft = first_token_time - req_time
                num_tokens += 1
                content = event.choices[0].text
                full_text += content
        else:
            full_text = completion.choices[0].text
            # Assuming tokens were returned with response (using len to mock token length)
            num_tokens = len(full_text.split())
            first_token_time = req_time  # Simplify for non-stream
            ttft = time.time() - req_time
    except Exception as e:
        logger.error(f"Error calling API: {e}")
        elapsed_time = time.time() - req_time
        logger.error(
            f"Before error: elapsed_time={elapsed_time}, num_tokens: {num_tokens}, full_text: {full_text}"
        )
        full_text = "ERROR"
        num_tokens = 0
        first_token_time = time.time()
        ttft = 0.001

    num_tokens = max(num_tokens, 2)
    throughput_time = max(time.time() - first_token_time, 0.0001)
    response_data = {
        "response_idx": response_idx,
        "prompt": prompt,
        "response": full_text,
        "num_tokens": num_tokens,
        "tps": (num_tokens - 1) / throughput_time,
        "ttft": ttft,
    }

    with responses_lock:
        responses.append(response_data)
    return response_data


if __name__ == "__main__":
    logger.info(
        "Note: OpenAI API client adds additional latency of ~10 ms to the API call."
    )
    args = parse_args()
    prompts = load_dataset_samples(args.n_samples)
    headers = {"Authorization": f"Bearer {get_authorization()}"}
    base_url = get_api_base_url()
    logging.info(f"BASE_API_URL: {base_url}")
    client = OpenAI(
        base_url=base_url,
        api_key=get_authorization(),
    )
    test_api_call_threaded_full_queue(
        prompts=prompts,
        batch_size=args.batch_size,
        num_full_iterations=args.num_full_iterations,
        call_func=call_inference_api,
        call_func_kwargs={
            "stream": args.stream,
            "headers": headers,
            "client": client,
        },
    )
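As a usage sketch, this OpenAI-client variant reuses `parse_args` from `example_requests_client_alpaca_eval.py`, so it accepts the same flags and can be run from `~/src` inside the container in the same way:

```bash
# single streamed request through the OpenAI client wrapper
python example_openai_client_alpaca_eval.py --stream True --n_samples 1 --num_full_iterations 1 --batch_size 1
```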