diff --git a/genai-perf/README.md b/genai-perf/README.md
index 8df0b009..84e7b4d4 100644
--- a/genai-perf/README.md
+++ b/genai-perf/README.md
@@ -128,7 +128,7 @@ the GPT-2 model running on Triton Inference Server with a TensorRT-LLM engine.
### Serve GPT-2 TensorRT-LLM model using Triton CLI
You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-trt-llm-model)
-on Triton CLI github repo to run GPT-2 model locally.
+in the Triton CLI GitHub repository to serve GPT-2 on Triton Inference Server with the TensorRT-LLM backend.
The full instructions are copied below for convenience:
```bash
@@ -139,12 +139,11 @@ docker run -ti \
--network=host \
--shm-size=1g --ulimit memlock=-1 \
-v /tmp:/tmp \
- -v ${HOME}/models:/root/models \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
# Install the Triton CLI
-pip install git+https://github.com/triton-inference-server/triton_cli.git@0.0.8
+pip install git+https://github.com/triton-inference-server/triton_cli.git@0.0.11
# Build TRT LLM engine and generate a Triton model repository pointing at it
triton remove -m all
@@ -156,48 +155,27 @@ triton start
### Running GenAI-Perf
-Now we can run GenAI-Perf from Triton Inference Server SDK container:
+Now we can run GenAI-Perf inside the Triton Inference Server SDK container:
```bash
-export RELEASE="24.08"
-
-docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
-
-# Run GenAI-Perf in the container:
-genai-perf profile \
- -m gpt2 \
- --service-kind triton \
- --backend tensorrtllm \
- --num-prompts 100 \
- --random-seed 123 \
- --synthetic-input-tokens-mean 200 \
- --synthetic-input-tokens-stddev 0 \
- --streaming \
- --output-tokens-mean 100 \
- --output-tokens-stddev 0 \
- --output-tokens-mean-deterministic \
- --tokenizer hf-internal-testing/llama-tokenizer \
- --concurrency 1 \
- --measurement-interval 4000 \
- --profile-export-file my_profile_export.json \
- --url localhost:8001
+genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming
```
Example output:
```
- LLM Metrics
-┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
-┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
-│ Time to first token (ms) │ 11.70 │ 9.88 │ 17.21 │ 14.35 │ 12.01 │ 11.87 │
-│ Inter token latency (ms) │ 1.46 │ 1.08 │ 1.89 │ 1.87 │ 1.62 │ 1.52 │
-│ Request latency (ms) │ 161.24 │ 153.45 │ 200.74 │ 200.66 │ 179.43 │ 162.23 │
-│ Output sequence length │ 103.39 │ 95.00 │ 134.00 │ 120.08 │ 107.30 │ 105.00 │
-│ Input sequence length │ 200.01 │ 200.00 │ 201.00 │ 200.13 │ 200.00 │ 200.00 │
-└──────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
-Output token throughput (per sec): 635.61
-Request throughput (per sec): 6.15
+ NVIDIA GenAI-Perf | LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
+┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
+│ Time to first token (ms) │ 16.26 │ 12.39 │ 17.25 │ 17.09 │ 16.68 │ 16.56 │
+│ Inter token latency (ms) │ 1.85 │ 1.55 │ 2.04 │ 2.02 │ 1.97 │ 1.92 │
+│ Request latency (ms) │ 499.20 │ 451.01 │ 554.61 │ 548.69 │ 526.13 │ 514.19 │
+│ Output sequence length │ 261.90 │ 256.00 │ 298.00 │ 296.60 │ 270.00 │ 265.00 │
+│ Input sequence length │ 550.06 │ 550.00 │ 553.00 │ 551.60 │ 550.00 │ 550.00 │
+│ Output token throughput (per sec) │ 520.87 │ N/A │ N/A │ N/A │ N/A │ N/A │
+│ Request throughput (per sec) │ 1.99 │ N/A │ N/A │ N/A │ N/A │ N/A │
+└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
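+
+The command above relies on GenAI-Perf's defaults for the synthetic workload. For finer
+control over prompt lengths, concurrency, and output location, you can pass additional
+options; the values below are only illustrative:
+
+```bash
+genai-perf profile \
+  -m gpt2 \
+  --service-kind triton \
+  --backend tensorrtllm \
+  --streaming \
+  --num-prompts 100 \
+  --random-seed 123 \
+  --synthetic-input-tokens-mean 200 \
+  --synthetic-input-tokens-stddev 0 \
+  --output-tokens-mean 100 \
+  --output-tokens-stddev 0 \
+  --concurrency 1 \
+  --measurement-interval 4000 \
+  --profile-export-file my_profile_export.json \
+  --url localhost:8001
+```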
See [Tutorial](docs/tutorial.md) for additional examples.
diff --git a/genai-perf/docs/tutorial.md b/genai-perf/docs/tutorial.md
index 1a31511d..1b5464de 100644
--- a/genai-perf/docs/tutorial.md
+++ b/genai-perf/docs/tutorial.md
@@ -28,192 +28,121 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Profile Large Language Models with GenAI-Perf
-- [Profile GPT2 running on Triton + TensorRT-LLM](#tensorrt-llm)
-- [Profile GPT2 running on Triton + vLLM](#triton-vllm)
-- [Profile GPT2 running on OpenAI Chat Completions API-Compatible Server](#openai-chat)
-- [Profile GPT2 running on OpenAI Completions API-Compatible Server](#openai-completions)
-
----
-
-## Profile GPT2 running on Triton + TensorRT-LLM
-
-### Run GPT2 on Triton Inference Server using TensorRT-LLM
-
-
-See instructions
-
-Run Triton Inference Server with TensorRT-LLM backend container:
+This tutorial demonstrates how to use GenAI-Perf to measure the performance of
+inference endpoints that implement widely used protocols such as the
+[KServe inference protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
+and the [OpenAI API](https://platform.openai.com/docs/api-reference/introduction).
-```bash
-export RELEASE="24.08"
+### Table of Contents
-docker run -it --net=host --gpus=all --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-trtllm-python-py3
+- [Profile GPT-2 running on Triton + TensorRT-LLM Backend](#tensorrt-llm)
+- [Profile GPT-2 running on Triton + vLLM Backend](#triton-vllm)
+- [Profile Zephyr-7B-Beta running on OpenAI Chat Completions API-Compatible Server](#openai-chat)
+- [Profile GPT-2 running on OpenAI Completions API-Compatible Server](#openai-completions)
-# Install Triton CLI (~5 min):
-pip install "git+https://github.com/triton-inference-server/triton_cli@0.0.8"
+
-# Download model:
-triton import -m gpt2 --backend tensorrtllm
+## Profile GPT-2 running on Triton + TensorRT-LLM
-# Run server:
-triton start
-```
-
-
+You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-trt-llm-model)
+in the Triton CLI GitHub repository to serve GPT-2 on Triton Inference Server with the TensorRT-LLM backend.
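+
+For reference, the serving steps from that guide look roughly like the following; the
+container tag and Triton CLI version shown here are illustrative, so check the guide for
+the currently supported combination:
+
+```bash
+export RELEASE="24.08"
+
+# Start the Triton container that ships the TensorRT-LLM backend:
+docker run -it --net=host --gpus=all --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-trtllm-python-py3
+
+# Install the Triton CLI (~5 min):
+pip install "git+https://github.com/triton-inference-server/triton_cli@0.0.11"
+
+# Build the TRT-LLM engine and generate a Triton model repository for GPT-2:
+triton import -m gpt2 --backend tensorrtllm
+
+# Run the server:
+triton start
+```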
### Run GenAI-Perf
-Run GenAI-Perf from Triton Inference Server SDK container:
+Run GenAI-Perf inside the Triton Inference Server SDK container:
```bash
-export RELEASE="24.08"
-
-docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
-
-# Run GenAI-Perf in the container:
genai-perf profile \
-m gpt2 \
--service-kind triton \
--backend tensorrtllm \
- --num-prompts 100 \
- --random-seed 123 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
- --streaming \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--output-tokens-mean-deterministic \
- --tokenizer hf-internal-testing/llama-tokenizer \
- --concurrency 1 \
- --measurement-interval 4000 \
- --profile-export-file my_profile_export.json \
- --url localhost:8001
+ --streaming
```
Example output:
```
- NVIDIA GenAI-Perf | LLM Metrics
-┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
-┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
-│ Time to first token (ns) │ 13,266,974 │ 11,818,732 │ 18,351,779 │ 16,513,479 │ 13,741,986 │ 13,544,376 │
-│ Inter token latency (ns) │ 2,069,766 │ 42,023 │ 15,307,799 │ 3,256,375 │ 3,020,580 │ 2,090,930 │
-│ Request latency (ns) │ 223,532,625 │ 219,123,330 │ 241,004,192 │ 238,198,306 │ 229,676,183 │ 224,715,918 │
-│ Output sequence length │ 104 │ 100 │ 129 │ 128 │ 109 │ 105 │
-│ Input sequence length │ 199 │ 199 │ 199 │ 199 │ 199 │ 199 │
-└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
-Output token throughput (per sec): 460.42
-Request throughput (per sec): 4.44
+ NVIDIA GenAI-Perf | LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
+┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
+│ Time to first token (ms) │ 13.68 │ 11.07 │ 21.50 │ 18.81 │ 14.29 │ 13.97 │
+│ Inter token latency (ms) │ 1.86 │ 1.28 │ 2.11 │ 2.11 │ 2.01 │ 1.95 │
+│ Request latency (ms) │ 203.70 │ 180.33 │ 228.30 │ 225.45 │ 216.48 │ 211.72 │
+│ Output sequence length │ 103.46 │ 95.00 │ 134.00 │ 122.96 │ 108.00 │ 104.75 │
+│ Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │
+│ Output token throughput (per sec) │ 504.02 │ N/A │ N/A │ N/A │ N/A │ N/A │
+│ Request throughput (per sec) │ 4.87 │ N/A │ N/A │ N/A │ N/A │ N/A │
+└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
-## Profile GPT2 running on Triton + vLLM
-
-### Run GPT2 on Triton Inference Server using vLLM
-
-
-See instructions
-
-Run Triton Inference Server with vLLM backend container:
-
-```bash
-export RELEASE="24.08"
-
-
-docker run -it --net=host --gpus=1 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-vllm-python-py3
+## Profile GPT-2 running on Triton + vLLM
-# Install Triton CLI (~5 min):
-pip install "git+https://github.com/triton-inference-server/triton_cli@0.0.8"
-
-# Download model:
-triton import -m gpt2 --backend vllm
-
-# Run server:
-triton start
-```
-
-
+You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-vllm-model)
+in the Triton CLI GitHub repository to serve GPT-2 on Triton Inference Server with the vLLM backend.
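+
+For reference, the serving steps from that guide look roughly like the following; the
+container tag and Triton CLI version shown here are illustrative, so check the guide for
+the currently supported combination:
+
+```bash
+export RELEASE="24.08"
+
+# Start the Triton container that ships the vLLM backend:
+docker run -it --net=host --gpus=1 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-vllm-python-py3
+
+# Install the Triton CLI (~5 min):
+pip install "git+https://github.com/triton-inference-server/triton_cli@0.0.11"
+
+# Generate a Triton model repository for GPT-2 with the vLLM backend:
+triton import -m gpt2 --backend vllm
+
+# Run the server:
+triton start
+```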
### Run GenAI-Perf
-Run GenAI-Perf from Triton Inference Server SDK container:
+Run GenAI-Perf inside the Triton Inference Server SDK container:
```bash
-export RELEASE="24.08"
-
-docker run -it --net=host --gpus=1 nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
-
-# Run GenAI-Perf in the container:
genai-perf profile \
-m gpt2 \
--service-kind triton \
--backend vllm \
- --num-prompts 100 \
- --random-seed 123 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
- --streaming \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--output-tokens-mean-deterministic \
- --tokenizer hf-internal-testing/llama-tokenizer \
- --concurrency 1 \
- --measurement-interval 4000 \
- --profile-export-file my_profile_export.json \
- --url localhost:8001
+ --streaming
```
Example output:
```
- NVIDIA GenAI-Perf | LLM Metrics
-┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
-┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
-│ Time to first token (ns) │ 15,786,560 │ 11,437,189 │ 49,550,549 │ 40,129,652 │ 21,248,091 │ 17,824,695 │
-│ Inter token latency (ns) │ 3,543,380 │ 591,898 │ 10,013,690 │ 6,152,260 │ 5,039,278 │ 4,060,982 │
-│ Request latency (ns) │ 388,415,721 │ 312,552,612 │ 528,229,817 │ 518,189,390 │ 484,281,365 │ 459,417,637 │
-│ Output sequence length │ 113 │ 105 │ 123 │ 122 │ 119 │ 115 │
-│ Input sequence length │ 199 │ 199 │ 199 │ 199 │ 199 │ 199 │
-└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
-Output token throughput (per sec): 290.24
-Request throughput (per sec): 2.57
+ NVIDIA GenAI-Perf | LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
+┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
+│ Time to first token (ms) │ 22.04 │ 14.00 │ 26.02 │ 25.73 │ 24.41 │ 24.06 │
+│ Inter token latency (ms) │ 4.58 │ 3.45 │ 5.34 │ 5.33 │ 5.11 │ 4.86 │
+│ Request latency (ms) │ 542.48 │ 468.10 │ 622.39 │ 615.67 │ 584.73 │ 555.90 │
+│ Output sequence length │ 115.15 │ 103.00 │ 143.00 │ 138.00 │ 120.00 │ 118.50 │
+│ Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │
+│ Output token throughput (per sec) │ 212.04 │ N/A │ N/A │ N/A │ N/A │ N/A │
+│ Request throughput (per sec) │ 1.84 │ N/A │ N/A │ N/A │ N/A │ N/A │
+└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
-## Profile Zephyr running on OpenAI Chat API-Compatible Server
-
-### Run Zephyr on [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)-compatible server
+## Profile Zephyr-7B-Beta running on OpenAI Chat Completions API-Compatible Server
-
-See instructions
-
-Run the vLLM inference server:
+Serve the model on the vLLM server, which exposes an [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)-compatible endpoint:
```bash
docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model HuggingFaceH4/zephyr-7b-beta --dtype float16
```
-
-
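+
+Before profiling, you can optionally confirm the endpoint is reachable with a small
+request; this assumes vLLM's default port of 8000, so adjust it if you changed the
+server configuration:
+
+```bash
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "HuggingFaceH4/zephyr-7b-beta", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
+```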
### Run GenAI-Perf
-Run GenAI-Perf from Triton Inference Server SDK container:
+Run GenAI-Perf inside the Triton Inference Server SDK container:
```bash
-export RELEASE="24.08"
-
-docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
-
-# Run GenAI-Perf in the container:
genai-perf profile \
-m HuggingFaceH4/zephyr-7b-beta \
--service-kind openai \
--endpoint-type chat \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
- --streaming \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
+ --streaming \
--tokenizer HuggingFaceH4/zephyr-7b-beta
```
@@ -234,54 +163,33 @@ Example output:
└───────────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
```
-## Profile GPT2 running on OpenAI Completions API-Compatible Server
-
-### Running GPT2 on [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions)-compatible server
+## Profile GPT-2 running on OpenAI Completions API-Compatible Server
-
-See instructions
-
-Run the vLLM inference server:
+Serve the model on the vLLM server, which exposes an [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions)-compatible endpoint:
```bash
docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model gpt2 --dtype float16 --max-model-len 1024
```
-
-
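+
+As with the chat endpoint, you can optionally send a quick request first to verify that
+the server is up (again assuming vLLM's default port of 8000):
+
+```bash
+curl -s http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gpt2", "prompt": "Hello, my name is", "max_tokens": 16}'
+```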
### Run GenAI-Perf
-Run GenAI-Perf from Triton Inference Server SDK container:
+Run GenAI-Perf inside the Triton Inference Server SDK container:
```bash
-export RELEASE="24.08"
-
-docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
-
-
-# Run GenAI-Perf in the container:
genai-perf profile \
-m gpt2 \
--service-kind openai \
- --endpoint v1/completions \
--endpoint-type completions \
- --num-prompts 100 \
- --random-seed 123 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 100 \
- --output-tokens-stddev 0 \
- --tokenizer hf-internal-testing/llama-tokenizer \
- --concurrency 1 \
- --measurement-interval 4000 \
- --profile-export-file my_profile_export.json \
- --url localhost:8000
+ --output-tokens-stddev 0
```
Example output:
```
- NVIDIA GenAI-Perf | LLM Metrics
+ NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
diff --git a/templates/genai-perf-templates/README_template b/templates/genai-perf-templates/README_template
index 2c742b22..eb88a141 100644
--- a/templates/genai-perf-templates/README_template
+++ b/templates/genai-perf-templates/README_template
@@ -128,7 +128,7 @@ the GPT-2 model running on Triton Inference Server with a TensorRT-LLM engine.
### Serve GPT-2 TensorRT-LLM model using Triton CLI
You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-trt-llm-model)
-on Triton CLI github repo to run GPT-2 model locally.
+in the Triton CLI GitHub repository to serve GPT-2 on Triton Inference Server with the TensorRT-LLM backend.
The full instructions are copied below for convenience:
```bash
@@ -139,7 +139,6 @@ docker run -ti \
--network=host \
--shm-size=1g --ulimit memlock=-1 \
-v /tmp:/tmp \
- -v ${HOME}/models:/root/models \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/tritonserver:{{ release }}-trtllm-python-py3
@@ -156,48 +155,27 @@ triton start
### Running GenAI-Perf
-Now we can run GenAI-Perf from Triton Inference Server SDK container:
+Now we can run GenAI-Perf inside the Triton Inference Server SDK container:
```bash
-export RELEASE="{{ release }}"
-
-docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
-
-# Run GenAI-Perf in the container:
-genai-perf profile \
- -m gpt2 \
- --service-kind triton \
- --backend tensorrtllm \
- --num-prompts 100 \
- --random-seed 123 \
- --synthetic-input-tokens-mean 200 \
- --synthetic-input-tokens-stddev 0 \
- --streaming \
- --output-tokens-mean 100 \
- --output-tokens-stddev 0 \
- --output-tokens-mean-deterministic \
- --tokenizer hf-internal-testing/llama-tokenizer \
- --concurrency 1 \
- --measurement-interval 4000 \
- --profile-export-file my_profile_export.json \
- --url localhost:8001
+genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming
```
Example output:
```
- LLM Metrics
-┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
-┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
-│ Time to first token (ms) │ 11.70 │ 9.88 │ 17.21 │ 14.35 │ 12.01 │ 11.87 │
-│ Inter token latency (ms) │ 1.46 │ 1.08 │ 1.89 │ 1.87 │ 1.62 │ 1.52 │
-│ Request latency (ms) │ 161.24 │ 153.45 │ 200.74 │ 200.66 │ 179.43 │ 162.23 │
-│ Output sequence length │ 103.39 │ 95.00 │ 134.00 │ 120.08 │ 107.30 │ 105.00 │
-│ Input sequence length │ 200.01 │ 200.00 │ 201.00 │ 200.13 │ 200.00 │ 200.00 │
-└──────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
-Output token throughput (per sec): 635.61
-Request throughput (per sec): 6.15
+ NVIDIA GenAI-Perf | LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
+┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
+│ Time to first token (ms) │ 16.26 │ 12.39 │ 17.25 │ 17.09 │ 16.68 │ 16.56 │
+│ Inter token latency (ms) │ 1.85 │ 1.55 │ 2.04 │ 2.02 │ 1.97 │ 1.92 │
+│ Request latency (ms) │ 499.20 │ 451.01 │ 554.61 │ 548.69 │ 526.13 │ 514.19 │
+│ Output sequence length │ 261.90 │ 256.00 │ 298.00 │ 296.60 │ 270.00 │ 265.00 │
+│ Input sequence length │ 550.06 │ 550.00 │ 553.00 │ 551.60 │ 550.00 │ 550.00 │
+│ Output token throughput (per sec) │ 520.87 │ N/A │ N/A │ N/A │ N/A │ N/A │
+│ Request throughput (per sec) │ 1.99 │ N/A │ N/A │ N/A │ N/A │ N/A │
+└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
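+
+The command above relies on GenAI-Perf's defaults for the synthetic workload. For finer
+control over prompt lengths, concurrency, and output location, you can pass additional
+options; the values below are only illustrative:
+
+```bash
+genai-perf profile \
+  -m gpt2 \
+  --service-kind triton \
+  --backend tensorrtllm \
+  --streaming \
+  --num-prompts 100 \
+  --random-seed 123 \
+  --synthetic-input-tokens-mean 200 \
+  --synthetic-input-tokens-stddev 0 \
+  --output-tokens-mean 100 \
+  --output-tokens-stddev 0 \
+  --concurrency 1 \
+  --measurement-interval 4000 \
+  --profile-export-file my_profile_export.json \
+  --url localhost:8001
+```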
See [Tutorial](docs/tutorial.md) for additional examples.
diff --git a/templates/genai-perf-templates/tutorial_template b/templates/genai-perf-templates/tutorial_template
index d36fc88b..43271fe9 100644
--- a/templates/genai-perf-templates/tutorial_template
+++ b/templates/genai-perf-templates/tutorial_template
@@ -28,192 +28,121 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Profile Large Language Models with GenAI-Perf
-- [Profile GPT2 running on Triton + TensorRT-LLM](#tensorrt-llm)
-- [Profile GPT2 running on Triton + vLLM](#triton-vllm)
-- [Profile GPT2 running on OpenAI Chat Completions API-Compatible Server](#openai-chat)
-- [Profile GPT2 running on OpenAI Completions API-Compatible Server](#openai-completions)
-
----
-
-## Profile GPT2 running on Triton + TensorRT-LLM
-
-### Run GPT2 on Triton Inference Server using TensorRT-LLM
-
-
-See instructions
-
-Run Triton Inference Server with TensorRT-LLM backend container:
+This tutorial demonstrates how to use GenAI-Perf to measure the performance of
+inference endpoints that implement widely used protocols such as the
+[KServe inference protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
+and the [OpenAI API](https://platform.openai.com/docs/api-reference/introduction).
-```bash
-export RELEASE="{{ release }}"
+### Table of Contents
-docker run -it --net=host --gpus=all --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-trtllm-python-py3
+- [Profile GPT-2 running on Triton + TensorRT-LLM Backend](#tensorrt-llm)
+- [Profile GPT-2 running on Triton + vLLM Backend](#triton-vllm)
+- [Profile Zephyr-7B-Beta running on OpenAI Chat Completions API-Compatible Server](#openai-chat)
+- [Profile GPT-2 running on OpenAI Completions API-Compatible Server](#openai-completions)
-# Install Triton CLI (~5 min):
-pip install "git+https://github.com/triton-inference-server/triton_cli@{{ triton_cli_version }}"
+
-# Download model:
-triton import -m gpt2 --backend tensorrtllm
+## Profile GPT-2 running on Triton + TensorRT-LLM
-# Run server:
-triton start
-```
-
-
+You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-trt-llm-model)
+in the Triton CLI GitHub repository to serve GPT-2 on Triton Inference Server with the TensorRT-LLM backend.
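+
+For reference, the serving steps from that guide look roughly like the following; the
+container tag and Triton CLI version shown here are illustrative, so check the guide for
+the currently supported combination:
+
+```bash
+export RELEASE="{{ release }}"
+
+# Start the Triton container that ships the TensorRT-LLM backend:
+docker run -it --net=host --gpus=all --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-trtllm-python-py3
+
+# Install the Triton CLI (~5 min):
+pip install "git+https://github.com/triton-inference-server/triton_cli@{{ triton_cli_version }}"
+
+# Build the TRT-LLM engine and generate a Triton model repository for GPT-2:
+triton import -m gpt2 --backend tensorrtllm
+
+# Run the server:
+triton start
+```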
### Run GenAI-Perf
-Run GenAI-Perf from Triton Inference Server SDK container:
+Run GenAI-Perf inside the Triton Inference Server SDK container:
```bash
-export RELEASE="{{ release }}"
-
-docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
-
-# Run GenAI-Perf in the container:
genai-perf profile \
-m gpt2 \
--service-kind triton \
--backend tensorrtllm \
- --num-prompts 100 \
- --random-seed 123 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
- --streaming \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--output-tokens-mean-deterministic \
- --tokenizer hf-internal-testing/llama-tokenizer \
- --concurrency 1 \
- --measurement-interval 4000 \
- --profile-export-file my_profile_export.json \
- --url localhost:8001
+ --streaming
```
Example output:
```
- NVIDIA GenAI-Perf | LLM Metrics
-┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
-┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
-│ Time to first token (ns) │ 13,266,974 │ 11,818,732 │ 18,351,779 │ 16,513,479 │ 13,741,986 │ 13,544,376 │
-│ Inter token latency (ns) │ 2,069,766 │ 42,023 │ 15,307,799 │ 3,256,375 │ 3,020,580 │ 2,090,930 │
-│ Request latency (ns) │ 223,532,625 │ 219,123,330 │ 241,004,192 │ 238,198,306 │ 229,676,183 │ 224,715,918 │
-│ Output sequence length │ 104 │ 100 │ 129 │ 128 │ 109 │ 105 │
-│ Input sequence length │ 199 │ 199 │ 199 │ 199 │ 199 │ 199 │
-└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
-Output token throughput (per sec): 460.42
-Request throughput (per sec): 4.44
+ NVIDIA GenAI-Perf | LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
+┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
+│ Time to first token (ms) │ 13.68 │ 11.07 │ 21.50 │ 18.81 │ 14.29 │ 13.97 │
+│ Inter token latency (ms) │ 1.86 │ 1.28 │ 2.11 │ 2.11 │ 2.01 │ 1.95 │
+│ Request latency (ms) │ 203.70 │ 180.33 │ 228.30 │ 225.45 │ 216.48 │ 211.72 │
+│ Output sequence length │ 103.46 │ 95.00 │ 134.00 │ 122.96 │ 108.00 │ 104.75 │
+│ Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │
+│ Output token throughput (per sec) │ 504.02 │ N/A │ N/A │ N/A │ N/A │ N/A │
+│ Request throughput (per sec) │ 4.87 │ N/A │ N/A │ N/A │ N/A │ N/A │
+└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
-## Profile GPT2 running on Triton + vLLM
-
-### Run GPT2 on Triton Inference Server using vLLM
-
-
-See instructions
-
-Run Triton Inference Server with vLLM backend container:
-
-```bash
-export RELEASE="{{ release }}"
-
-
-docker run -it --net=host --gpus=1 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-vllm-python-py3
+## Profile GPT-2 running on Triton + vLLM
-# Install Triton CLI (~5 min):
-pip install "git+https://github.com/triton-inference-server/triton_cli@0.0.8"
-
-# Download model:
-triton import -m gpt2 --backend vllm
-
-# Run server:
-triton start
-```
-
-
+You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-vllm-model)
+in the Triton CLI GitHub repository to serve GPT-2 on Triton Inference Server with the vLLM backend.
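+
+For reference, the serving steps from that guide look roughly like the following; the
+container tag and Triton CLI version shown here are illustrative, so check the guide for
+the currently supported combination:
+
+```bash
+export RELEASE="{{ release }}"
+
+# Start the Triton container that ships the vLLM backend:
+docker run -it --net=host --gpus=1 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-vllm-python-py3
+
+# Install the Triton CLI (~5 min):
+pip install "git+https://github.com/triton-inference-server/triton_cli@{{ triton_cli_version }}"
+
+# Generate a Triton model repository for GPT-2 with the vLLM backend:
+triton import -m gpt2 --backend vllm
+
+# Run the server:
+triton start
+```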
### Run GenAI-Perf
-Run GenAI-Perf from Triton Inference Server SDK container:
+Run GenAI-Perf inside the Triton Inference Server SDK container:
```bash
-export RELEASE="{{ release }}"
-
-docker run -it --net=host --gpus=1 nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
-
-# Run GenAI-Perf in the container:
genai-perf profile \
-m gpt2 \
--service-kind triton \
--backend vllm \
- --num-prompts 100 \
- --random-seed 123 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
- --streaming \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--output-tokens-mean-deterministic \
- --tokenizer hf-internal-testing/llama-tokenizer \
- --concurrency 1 \
- --measurement-interval 4000 \
- --profile-export-file my_profile_export.json \
- --url localhost:8001
+ --streaming
```
Example output:
```
- NVIDIA GenAI-Perf | LLM Metrics
-┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
-┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
-│ Time to first token (ns) │ 15,786,560 │ 11,437,189 │ 49,550,549 │ 40,129,652 │ 21,248,091 │ 17,824,695 │
-│ Inter token latency (ns) │ 3,543,380 │ 591,898 │ 10,013,690 │ 6,152,260 │ 5,039,278 │ 4,060,982 │
-│ Request latency (ns) │ 388,415,721 │ 312,552,612 │ 528,229,817 │ 518,189,390 │ 484,281,365 │ 459,417,637 │
-│ Output sequence length │ 113 │ 105 │ 123 │ 122 │ 119 │ 115 │
-│ Input sequence length │ 199 │ 199 │ 199 │ 199 │ 199 │ 199 │
-└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
-Output token throughput (per sec): 290.24
-Request throughput (per sec): 2.57
+ NVIDIA GenAI-Perf | LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
+┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
+│ Time to first token (ms) │ 22.04 │ 14.00 │ 26.02 │ 25.73 │ 24.41 │ 24.06 │
+│ Inter token latency (ms) │ 4.58 │ 3.45 │ 5.34 │ 5.33 │ 5.11 │ 4.86 │
+│ Request latency (ms) │ 542.48 │ 468.10 │ 622.39 │ 615.67 │ 584.73 │ 555.90 │
+│ Output sequence length │ 115.15 │ 103.00 │ 143.00 │ 138.00 │ 120.00 │ 118.50 │
+│ Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │
+│ Output token throughput (per sec) │ 212.04 │ N/A │ N/A │ N/A │ N/A │ N/A │
+│ Request throughput (per sec) │ 1.84 │ N/A │ N/A │ N/A │ N/A │ N/A │
+└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
-## Profile Zephyr running on OpenAI Chat API-Compatible Server
-
-### Run Zephyr on [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)-compatible server
+## Profile Zephyr-7B-Beta running on OpenAI Chat Completions API-Compatible Server
-
-See instructions
-
-Run the vLLM inference server:
+Serve the model on the vLLM server, which exposes an [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)-compatible endpoint:
```bash
docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model HuggingFaceH4/zephyr-7b-beta --dtype float16
```
-
-
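+
+Before profiling, you can optionally confirm the endpoint is reachable with a small
+request; this assumes vLLM's default port of 8000, so adjust it if you changed the
+server configuration:
+
+```bash
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "HuggingFaceH4/zephyr-7b-beta", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
+```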
### Run GenAI-Perf
-Run GenAI-Perf from Triton Inference Server SDK container:
+Run GenAI-Perf inside the Triton Inference Server SDK container:
```bash
-export RELEASE="{{ release }}"
-
-docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
-
-# Run GenAI-Perf in the container:
genai-perf profile \
-m HuggingFaceH4/zephyr-7b-beta \
--service-kind openai \
--endpoint-type chat \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
- --streaming \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
+ --streaming \
--tokenizer HuggingFaceH4/zephyr-7b-beta
```
@@ -234,48 +163,27 @@ Example output:
└───────────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
```
-## Profile GPT2 running on OpenAI Completions API-Compatible Server
-
-### Running GPT2 on [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions)-compatible server
+## Profile GPT-2 running on OpenAI Completions API-Compatible Server
-
-See instructions
-
-Run the vLLM inference server:
+Serve the model on the vLLM server, which exposes an [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions)-compatible endpoint:
```bash
docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model gpt2 --dtype float16 --max-model-len 1024
```
-
-
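+
+As with the chat endpoint, you can optionally send a quick request first to verify that
+the server is up (again assuming vLLM's default port of 8000):
+
+```bash
+curl -s http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gpt2", "prompt": "Hello, my name is", "max_tokens": 16}'
+```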
### Run GenAI-Perf
-Run GenAI-Perf from Triton Inference Server SDK container:
+Run GenAI-Perf inside the Triton Inference Server SDK container:
```bash
-export RELEASE="{{ release }}"
-
-docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
-
-
-# Run GenAI-Perf in the container:
genai-perf profile \
-m gpt2 \
--service-kind openai \
- --endpoint v1/completions \
--endpoint-type completions \
- --num-prompts 100 \
- --random-seed 123 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 100 \
- --output-tokens-stddev 0 \
- --tokenizer hf-internal-testing/llama-tokenizer \
- --concurrency 1 \
- --measurement-interval 4000 \
- --profile-export-file my_profile_export.json \
- --url localhost:8000
+ --output-tokens-stddev 0
```
Example output:
diff --git a/templates/template_vars.yaml b/templates/template_vars.yaml
index 12e88eb6..373d0fef 100644
--- a/templates/template_vars.yaml
+++ b/templates/template_vars.yaml
@@ -1,6 +1,6 @@
General:
release: 24.08
- triton_cli_version: 0.0.8
+ triton_cli_version: 0.0.11
genai_perf_version: 0.0.6dev
README:
@@ -46,4 +46,4 @@ tutorial:
version:
filename: __init__.py
template: genai-perf-templates/version_template
- output_dir: ../genai-perf/genai_perf/
\ No newline at end of file
+ output_dir: ../genai-perf/genai_perf/