diff --git a/genai-perf/README.md b/genai-perf/README.md index 8df0b009..84e7b4d4 100644 --- a/genai-perf/README.md +++ b/genai-perf/README.md @@ -128,7 +128,7 @@ the GPT-2 model running on Triton Inference Server with a TensorRT-LLM engine. ### Serve GPT-2 TensorRT-LLM model using Triton CLI You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-trt-llm-model) -on Triton CLI github repo to run GPT-2 model locally. +in the Triton CLI Github repository to serve GPT-2 on the Triton server with the TensorRT-LLM backend. The full instructions are copied below for convenience: ```bash @@ -139,12 +139,11 @@ docker run -ti \ --network=host \ --shm-size=1g --ulimit memlock=-1 \ -v /tmp:/tmp \ - -v ${HOME}/models:/root/models \ -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \ nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 # Install the Triton CLI -pip install git+https://github.com/triton-inference-server/triton_cli.git@0.0.8 +pip install git+https://github.com/triton-inference-server/triton_cli.git@0.0.11 # Build TRT LLM engine and generate a Triton model repository pointing at it triton remove -m all @@ -156,48 +155,27 @@ triton start ### Running GenAI-Perf -Now we can run GenAI-Perf from Triton Inference Server SDK container: +Now we can run GenAI-Perf inside the Triton Inference Server SDK container: ```bash -export RELEASE="24.08" - -docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk - -# Run GenAI-Perf in the container: -genai-perf profile \ - -m gpt2 \ - --service-kind triton \ - --backend tensorrtllm \ - --num-prompts 100 \ - --random-seed 123 \ - --synthetic-input-tokens-mean 200 \ - --synthetic-input-tokens-stddev 0 \ - --streaming \ - --output-tokens-mean 100 \ - --output-tokens-stddev 0 \ - --output-tokens-mean-deterministic \ - --tokenizer hf-internal-testing/llama-tokenizer \ - --concurrency 1 \ - --measurement-interval 4000 \ - --profile-export-file my_profile_export.json \ - --url localhost:8001 +genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming ``` Example output: ``` - LLM Metrics -┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ -┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ -│ Time to first token (ms) │ 11.70 │ 9.88 │ 17.21 │ 14.35 │ 12.01 │ 11.87 │ -│ Inter token latency (ms) │ 1.46 │ 1.08 │ 1.89 │ 1.87 │ 1.62 │ 1.52 │ -│ Request latency (ms) │ 161.24 │ 153.45 │ 200.74 │ 200.66 │ 179.43 │ 162.23 │ -│ Output sequence length │ 103.39 │ 95.00 │ 134.00 │ 120.08 │ 107.30 │ 105.00 │ -│ Input sequence length │ 200.01 │ 200.00 │ 201.00 │ 200.13 │ 200.00 │ 200.00 │ -└──────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘ -Output token throughput (per sec): 635.61 -Request throughput (per sec): 6.15 + NVIDIA GenAI-Perf | LLM Metrics +┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ +┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ +│ Time to first token (ms) │ 16.26 │ 12.39 │ 17.25 │ 17.09 │ 16.68 │ 16.56 │ +│ Inter token latency (ms) │ 1.85 │ 1.55 │ 2.04 │ 2.02 │ 1.97 │ 1.92 │ +│ Request latency (ms) │ 499.20 │ 451.01 │ 554.61 │ 548.69 │ 526.13 │ 514.19 │ +│ Output sequence length │ 261.90 │ 256.00 │ 298.00 │ 296.60 │ 270.00 │ 265.00 │ +│ Input sequence 
length │ 550.06 │ 550.00 │ 553.00 │ 551.60 │ 550.00 │ 550.00 │ +│ Output token throughput (per sec) │ 520.87 │ N/A │ N/A │ N/A │ N/A │ N/A │ +│ Request throughput (per sec) │ 1.99 │ N/A │ N/A │ N/A │ N/A │ N/A │ +└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘ ``` See [Tutorial](docs/tutorial.md) for additional examples. diff --git a/genai-perf/docs/tutorial.md b/genai-perf/docs/tutorial.md index 1a31511d..1b5464de 100644 --- a/genai-perf/docs/tutorial.md +++ b/genai-perf/docs/tutorial.md @@ -28,192 +28,121 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # Profile Large Language Models with GenAI-Perf -- [Profile GPT2 running on Triton + TensorRT-LLM](#tensorrt-llm) -- [Profile GPT2 running on Triton + vLLM](#triton-vllm) -- [Profile GPT2 running on OpenAI Chat Completions API-Compatible Server](#openai-chat) -- [Profile GPT2 running on OpenAI Completions API-Compatible Server](#openai-completions) - ---- - -## Profile GPT2 running on Triton + TensorRT-LLM - -### Run GPT2 on Triton Inference Server using TensorRT-LLM - -
-<summary>See instructions</summary>
-
-Run Triton Inference Server with TensorRT-LLM backend container:
+This tutorial demonstrates how to use GenAI-Perf to measure the performance of
+various inference endpoints, such as the
+[KServe inference protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
+and the [OpenAI API](https://platform.openai.com/docs/api-reference/introduction),
+that are widely used across the industry.
-```bash
-export RELEASE="24.08"
+### Table of Contents
-docker run -it --net=host --gpus=all --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-trtllm-python-py3
+- [Profile GPT2 running on Triton + TensorRT-LLM Backend](#tensorrt-llm)
+- [Profile GPT2 running on Triton + vLLM Backend](#triton-vllm)
+- [Profile GPT2 running on OpenAI Chat Completions API-Compatible Server](#openai-chat)
+- [Profile GPT2 running on OpenAI Completions API-Compatible Server](#openai-completions)
-# Install Triton CLI (~5 min):
-pip install "git+https://github.com/triton-inference-server/triton_cli@0.0.8"
-# Download model: -triton import -m gpt2 --backend tensorrtllm +## Profile GPT-2 running on Triton + TensorRT-LLM -# Run server: -triton start -``` - -
+You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-trt-llm-model) +in the Triton CLI Github repository to serve GPT-2 on the Triton server with the TensorRT-LLM backend. ### Run GenAI-Perf -Run GenAI-Perf from Triton Inference Server SDK container: +Run GenAI-Perf inside the Triton Inference Server SDK container: ```bash -export RELEASE="24.08" - -docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk - -# Run GenAI-Perf in the container: genai-perf profile \ -m gpt2 \ --service-kind triton \ --backend tensorrtllm \ - --num-prompts 100 \ - --random-seed 123 \ --synthetic-input-tokens-mean 200 \ --synthetic-input-tokens-stddev 0 \ - --streaming \ --output-tokens-mean 100 \ --output-tokens-stddev 0 \ --output-tokens-mean-deterministic \ - --tokenizer hf-internal-testing/llama-tokenizer \ - --concurrency 1 \ - --measurement-interval 4000 \ - --profile-export-file my_profile_export.json \ - --url localhost:8001 + --streaming ``` Example output: ``` - NVIDIA GenAI-Perf | LLM Metrics -┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ -┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ -│ Time to first token (ns) │ 13,266,974 │ 11,818,732 │ 18,351,779 │ 16,513,479 │ 13,741,986 │ 13,544,376 │ -│ Inter token latency (ns) │ 2,069,766 │ 42,023 │ 15,307,799 │ 3,256,375 │ 3,020,580 │ 2,090,930 │ -│ Request latency (ns) │ 223,532,625 │ 219,123,330 │ 241,004,192 │ 238,198,306 │ 229,676,183 │ 224,715,918 │ -│ Output sequence length │ 104 │ 100 │ 129 │ 128 │ 109 │ 105 │ -│ Input sequence length │ 199 │ 199 │ 199 │ 199 │ 199 │ 199 │ -└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘ -Output token throughput (per sec): 460.42 -Request throughput (per sec): 4.44 + NVIDIA GenAI-Perf | LLM Metrics +┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ +┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ +│ Time to first token (ms) │ 13.68 │ 11.07 │ 21.50 │ 18.81 │ 14.29 │ 13.97 │ +│ Inter token latency (ms) │ 1.86 │ 1.28 │ 2.11 │ 2.11 │ 2.01 │ 1.95 │ +│ Request latency (ms) │ 203.70 │ 180.33 │ 228.30 │ 225.45 │ 216.48 │ 211.72 │ +│ Output sequence length │ 103.46 │ 95.00 │ 134.00 │ 122.96 │ 108.00 │ 104.75 │ +│ Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ +│ Output token throughput (per sec) │ 504.02 │ N/A │ N/A │ N/A │ N/A │ N/A │ +│ Request throughput (per sec) │ 4.87 │ N/A │ N/A │ N/A │ N/A │ N/A │ +└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘ ``` -## Profile GPT2 running on Triton + vLLM - -### Run GPT2 on Triton Inference Server using vLLM - -
-<summary>See instructions</summary>
-
-Run Triton Inference Server with vLLM backend container:
-
-```bash
-export RELEASE="24.08"
-
-
-docker run -it --net=host --gpus=1 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-vllm-python-py3
+## Profile GPT-2 running on Triton + vLLM
-# Install Triton CLI (~5 min):
-pip install "git+https://github.com/triton-inference-server/triton_cli@0.0.8"
-
-# Download model:
-triton import -m gpt2 --backend vllm
-
-# Run server:
-triton start
-```
-
-
+You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-vllm-model) +in the Triton CLI Github repository to serve GPT-2 on the Triton server with the vLLM backend. ### Run GenAI-Perf -Run GenAI-Perf from Triton Inference Server SDK container: +Run GenAI-Perf inside the Triton Inference Server SDK container: ```bash -export RELEASE="24.08" - -docker run -it --net=host --gpus=1 nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk - -# Run GenAI-Perf in the container: genai-perf profile \ -m gpt2 \ --service-kind triton \ --backend vllm \ - --num-prompts 100 \ - --random-seed 123 \ --synthetic-input-tokens-mean 200 \ --synthetic-input-tokens-stddev 0 \ - --streaming \ --output-tokens-mean 100 \ --output-tokens-stddev 0 \ --output-tokens-mean-deterministic \ - --tokenizer hf-internal-testing/llama-tokenizer \ - --concurrency 1 \ - --measurement-interval 4000 \ - --profile-export-file my_profile_export.json \ - --url localhost:8001 + --streaming ``` Example output: ``` - NVIDIA GenAI-Perf | LLM Metrics -┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ -┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ -│ Time to first token (ns) │ 15,786,560 │ 11,437,189 │ 49,550,549 │ 40,129,652 │ 21,248,091 │ 17,824,695 │ -│ Inter token latency (ns) │ 3,543,380 │ 591,898 │ 10,013,690 │ 6,152,260 │ 5,039,278 │ 4,060,982 │ -│ Request latency (ns) │ 388,415,721 │ 312,552,612 │ 528,229,817 │ 518,189,390 │ 484,281,365 │ 459,417,637 │ -│ Output sequence length │ 113 │ 105 │ 123 │ 122 │ 119 │ 115 │ -│ Input sequence length │ 199 │ 199 │ 199 │ 199 │ 199 │ 199 │ -└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘ -Output token throughput (per sec): 290.24 -Request throughput (per sec): 2.57 + NVIDIA GenAI-Perf | LLM Metrics +┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ +┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ +│ Time to first token (ms) │ 22.04 │ 14.00 │ 26.02 │ 25.73 │ 24.41 │ 24.06 │ +│ Inter token latency (ms) │ 4.58 │ 3.45 │ 5.34 │ 5.33 │ 5.11 │ 4.86 │ +│ Request latency (ms) │ 542.48 │ 468.10 │ 622.39 │ 615.67 │ 584.73 │ 555.90 │ +│ Output sequence length │ 115.15 │ 103.00 │ 143.00 │ 138.00 │ 120.00 │ 118.50 │ +│ Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ +│ Output token throughput (per sec) │ 212.04 │ N/A │ N/A │ N/A │ N/A │ N/A │ +│ Request throughput (per sec) │ 1.84 │ N/A │ N/A │ N/A │ N/A │ N/A │ +└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘ ``` -## Profile Zephyr running on OpenAI Chat API-Compatible Server - -### Run Zephyr on [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)-compatible server +## Profile Zephyr-7B-Beta running on OpenAI Chat API-Compatible Server -
-<summary>See instructions</summary>
-
-Run the vLLM inference server:
+Serve the model on the vLLM server with the [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat) endpoint:
```bash
docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model HuggingFaceH4/zephyr-7b-beta --dtype float16
```
-
- ### Run GenAI-Perf -Run GenAI-Perf from Triton Inference Server SDK container: +Run GenAI-Perf inside the Triton Inference Server SDK container: ```bash -export RELEASE="24.08" - -docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk - -# Run GenAI-Perf in the container: genai-perf profile \ -m HuggingFaceH4/zephyr-7b-beta \ --service-kind openai \ --endpoint-type chat \ --synthetic-input-tokens-mean 200 \ --synthetic-input-tokens-stddev 0 \ - --streaming \ --output-tokens-mean 100 \ --output-tokens-stddev 0 \ + --streaming \ --tokenizer HuggingFaceH4/zephyr-7b-beta ``` @@ -234,54 +163,33 @@ Example output: └───────────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘ ``` -## Profile GPT2 running on OpenAI Completions API-Compatible Server - -### Running GPT2 on [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions)-compatible server +## Profile GPT-2 running on OpenAI Completions API-Compatible Server -
-<summary>See instructions</summary>
-
-Run the vLLM inference server:
+Serve the model on the vLLM server with the [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions) endpoint:
```bash
docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model gpt2 --dtype float16 --max-model-len 1024
```
-
- ### Run GenAI-Perf -Run GenAI-Perf from Triton Inference Server SDK container: +Run GenAI-Perf inside the Triton Inference Server SDK container: ```bash -export RELEASE="24.08" - -docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk - - -# Run GenAI-Perf in the container: genai-perf profile \ -m gpt2 \ --service-kind openai \ - --endpoint v1/completions \ --endpoint-type completions \ - --num-prompts 100 \ - --random-seed 123 \ --synthetic-input-tokens-mean 200 \ --synthetic-input-tokens-stddev 0 \ --output-tokens-mean 100 \ - --output-tokens-stddev 0 \ - --tokenizer hf-internal-testing/llama-tokenizer \ - --concurrency 1 \ - --measurement-interval 4000 \ - --profile-export-file my_profile_export.json \ - --url localhost:8000 + --output-tokens-stddev 0 ``` Example output: ``` - NVIDIA GenAI-Perf | LLM Metrics + NVIDIA GenAI-Perf | LLM Metrics ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ ┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ diff --git a/templates/genai-perf-templates/README_template b/templates/genai-perf-templates/README_template index 2c742b22..eb88a141 100644 --- a/templates/genai-perf-templates/README_template +++ b/templates/genai-perf-templates/README_template @@ -128,7 +128,7 @@ the GPT-2 model running on Triton Inference Server with a TensorRT-LLM engine. ### Serve GPT-2 TensorRT-LLM model using Triton CLI You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-trt-llm-model) -on Triton CLI github repo to run GPT-2 model locally. +in the Triton CLI Github repository to serve GPT-2 on the Triton server with the TensorRT-LLM backend. 
The full instructions are copied below for convenience: ```bash @@ -139,7 +139,6 @@ docker run -ti \ --network=host \ --shm-size=1g --ulimit memlock=-1 \ -v /tmp:/tmp \ - -v ${HOME}/models:/root/models \ -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \ nvcr.io/nvidia/tritonserver:{{ release }}-trtllm-python-py3 @@ -156,48 +155,27 @@ triton start ### Running GenAI-Perf -Now we can run GenAI-Perf from Triton Inference Server SDK container: +Now we can run GenAI-Perf inside the Triton Inference Server SDK container: ```bash -export RELEASE="{{ release }}" - -docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk - -# Run GenAI-Perf in the container: -genai-perf profile \ - -m gpt2 \ - --service-kind triton \ - --backend tensorrtllm \ - --num-prompts 100 \ - --random-seed 123 \ - --synthetic-input-tokens-mean 200 \ - --synthetic-input-tokens-stddev 0 \ - --streaming \ - --output-tokens-mean 100 \ - --output-tokens-stddev 0 \ - --output-tokens-mean-deterministic \ - --tokenizer hf-internal-testing/llama-tokenizer \ - --concurrency 1 \ - --measurement-interval 4000 \ - --profile-export-file my_profile_export.json \ - --url localhost:8001 +genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming ``` Example output: ``` - LLM Metrics -┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ -┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ -│ Time to first token (ms) │ 11.70 │ 9.88 │ 17.21 │ 14.35 │ 12.01 │ 11.87 │ -│ Inter token latency (ms) │ 1.46 │ 1.08 │ 1.89 │ 1.87 │ 1.62 │ 1.52 │ -│ Request latency (ms) │ 161.24 │ 153.45 │ 200.74 │ 200.66 │ 179.43 │ 162.23 │ -│ Output sequence length │ 103.39 │ 95.00 │ 134.00 │ 120.08 │ 107.30 │ 105.00 │ -│ Input sequence length │ 200.01 │ 200.00 │ 201.00 │ 200.13 │ 200.00 │ 200.00 │ -└──────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘ -Output token throughput (per sec): 635.61 -Request throughput (per sec): 6.15 + NVIDIA GenAI-Perf | LLM Metrics +┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ +┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ +│ Time to first token (ms) │ 16.26 │ 12.39 │ 17.25 │ 17.09 │ 16.68 │ 16.56 │ +│ Inter token latency (ms) │ 1.85 │ 1.55 │ 2.04 │ 2.02 │ 1.97 │ 1.92 │ +│ Request latency (ms) │ 499.20 │ 451.01 │ 554.61 │ 548.69 │ 526.13 │ 514.19 │ +│ Output sequence length │ 261.90 │ 256.00 │ 298.00 │ 296.60 │ 270.00 │ 265.00 │ +│ Input sequence length │ 550.06 │ 550.00 │ 553.00 │ 551.60 │ 550.00 │ 550.00 │ +│ Output token throughput (per sec) │ 520.87 │ N/A │ N/A │ N/A │ N/A │ N/A │ +│ Request throughput (per sec) │ 1.99 │ N/A │ N/A │ N/A │ N/A │ N/A │ +└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘ ``` See [Tutorial](docs/tutorial.md) for additional examples. diff --git a/templates/genai-perf-templates/tutorial_template b/templates/genai-perf-templates/tutorial_template index d36fc88b..43271fe9 100644 --- a/templates/genai-perf-templates/tutorial_template +++ b/templates/genai-perf-templates/tutorial_template @@ -28,192 +28,121 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
# Profile Large Language Models with GenAI-Perf -- [Profile GPT2 running on Triton + TensorRT-LLM](#tensorrt-llm) -- [Profile GPT2 running on Triton + vLLM](#triton-vllm) -- [Profile GPT2 running on OpenAI Chat Completions API-Compatible Server](#openai-chat) -- [Profile GPT2 running on OpenAI Completions API-Compatible Server](#openai-completions) - ---- - -## Profile GPT2 running on Triton + TensorRT-LLM - -### Run GPT2 on Triton Inference Server using TensorRT-LLM - -
-<summary>See instructions</summary>
-
-Run Triton Inference Server with TensorRT-LLM backend container:
+This tutorial demonstrates how to use GenAI-Perf to measure the performance of
+various inference endpoints, such as the
+[KServe inference protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
+and the [OpenAI API](https://platform.openai.com/docs/api-reference/introduction),
+that are widely used across the industry.
-```bash
-export RELEASE="{{ release }}"
+### Table of Contents
-docker run -it --net=host --gpus=all --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-trtllm-python-py3
+- [Profile GPT2 running on Triton + TensorRT-LLM Backend](#tensorrt-llm)
+- [Profile GPT2 running on Triton + vLLM Backend](#triton-vllm)
+- [Profile GPT2 running on OpenAI Chat Completions API-Compatible Server](#openai-chat)
+- [Profile GPT2 running on OpenAI Completions API-Compatible Server](#openai-completions)
-# Install Triton CLI (~5 min):
-pip install "git+https://github.com/triton-inference-server/triton_cli@{{ triton_cli_version }}"
-# Download model: -triton import -m gpt2 --backend tensorrtllm +## Profile GPT-2 running on Triton + TensorRT-LLM -# Run server: -triton start -``` - -
+You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-trt-llm-model) +in the Triton CLI Github repository to serve GPT-2 on the Triton server with the TensorRT-LLM backend. ### Run GenAI-Perf -Run GenAI-Perf from Triton Inference Server SDK container: +Run GenAI-Perf inside the Triton Inference Server SDK container: ```bash -export RELEASE="{{ release }}" - -docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk - -# Run GenAI-Perf in the container: genai-perf profile \ -m gpt2 \ --service-kind triton \ --backend tensorrtllm \ - --num-prompts 100 \ - --random-seed 123 \ --synthetic-input-tokens-mean 200 \ --synthetic-input-tokens-stddev 0 \ - --streaming \ --output-tokens-mean 100 \ --output-tokens-stddev 0 \ --output-tokens-mean-deterministic \ - --tokenizer hf-internal-testing/llama-tokenizer \ - --concurrency 1 \ - --measurement-interval 4000 \ - --profile-export-file my_profile_export.json \ - --url localhost:8001 + --streaming ``` Example output: ``` - NVIDIA GenAI-Perf | LLM Metrics -┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ -┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ -│ Time to first token (ns) │ 13,266,974 │ 11,818,732 │ 18,351,779 │ 16,513,479 │ 13,741,986 │ 13,544,376 │ -│ Inter token latency (ns) │ 2,069,766 │ 42,023 │ 15,307,799 │ 3,256,375 │ 3,020,580 │ 2,090,930 │ -│ Request latency (ns) │ 223,532,625 │ 219,123,330 │ 241,004,192 │ 238,198,306 │ 229,676,183 │ 224,715,918 │ -│ Output sequence length │ 104 │ 100 │ 129 │ 128 │ 109 │ 105 │ -│ Input sequence length │ 199 │ 199 │ 199 │ 199 │ 199 │ 199 │ -└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘ -Output token throughput (per sec): 460.42 -Request throughput (per sec): 4.44 + NVIDIA GenAI-Perf | LLM Metrics +┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ +┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ +│ Time to first token (ms) │ 13.68 │ 11.07 │ 21.50 │ 18.81 │ 14.29 │ 13.97 │ +│ Inter token latency (ms) │ 1.86 │ 1.28 │ 2.11 │ 2.11 │ 2.01 │ 1.95 │ +│ Request latency (ms) │ 203.70 │ 180.33 │ 228.30 │ 225.45 │ 216.48 │ 211.72 │ +│ Output sequence length │ 103.46 │ 95.00 │ 134.00 │ 122.96 │ 108.00 │ 104.75 │ +│ Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ +│ Output token throughput (per sec) │ 504.02 │ N/A │ N/A │ N/A │ N/A │ N/A │ +│ Request throughput (per sec) │ 4.87 │ N/A │ N/A │ N/A │ N/A │ N/A │ +└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘ ``` -## Profile GPT2 running on Triton + vLLM - -### Run GPT2 on Triton Inference Server using vLLM - -
-<summary>See instructions</summary>
-
-Run Triton Inference Server with vLLM backend container:
-
-```bash
-export RELEASE="{{ release }}"
-
-
-docker run -it --net=host --gpus=1 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-vllm-python-py3
+## Profile GPT-2 running on Triton + vLLM
-# Install Triton CLI (~5 min):
-pip install "git+https://github.com/triton-inference-server/triton_cli@0.0.8"
-
-# Download model:
-triton import -m gpt2 --backend vllm
-
-# Run server:
-triton start
-```
-
-
+You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-vllm-model) +in the Triton CLI Github repository to serve GPT-2 on the Triton server with the vLLM backend. ### Run GenAI-Perf -Run GenAI-Perf from Triton Inference Server SDK container: +Run GenAI-Perf inside the Triton Inference Server SDK container: ```bash -export RELEASE="{{ release }}" - -docker run -it --net=host --gpus=1 nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk - -# Run GenAI-Perf in the container: genai-perf profile \ -m gpt2 \ --service-kind triton \ --backend vllm \ - --num-prompts 100 \ - --random-seed 123 \ --synthetic-input-tokens-mean 200 \ --synthetic-input-tokens-stddev 0 \ - --streaming \ --output-tokens-mean 100 \ --output-tokens-stddev 0 \ --output-tokens-mean-deterministic \ - --tokenizer hf-internal-testing/llama-tokenizer \ - --concurrency 1 \ - --measurement-interval 4000 \ - --profile-export-file my_profile_export.json \ - --url localhost:8001 + --streaming ``` Example output: ``` - NVIDIA GenAI-Perf | LLM Metrics -┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ -┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ -│ Time to first token (ns) │ 15,786,560 │ 11,437,189 │ 49,550,549 │ 40,129,652 │ 21,248,091 │ 17,824,695 │ -│ Inter token latency (ns) │ 3,543,380 │ 591,898 │ 10,013,690 │ 6,152,260 │ 5,039,278 │ 4,060,982 │ -│ Request latency (ns) │ 388,415,721 │ 312,552,612 │ 528,229,817 │ 518,189,390 │ 484,281,365 │ 459,417,637 │ -│ Output sequence length │ 113 │ 105 │ 123 │ 122 │ 119 │ 115 │ -│ Input sequence length │ 199 │ 199 │ 199 │ 199 │ 199 │ 199 │ -└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘ -Output token throughput (per sec): 290.24 -Request throughput (per sec): 2.57 + NVIDIA GenAI-Perf | LLM Metrics +┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ +┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ +│ Time to first token (ms) │ 22.04 │ 14.00 │ 26.02 │ 25.73 │ 24.41 │ 24.06 │ +│ Inter token latency (ms) │ 4.58 │ 3.45 │ 5.34 │ 5.33 │ 5.11 │ 4.86 │ +│ Request latency (ms) │ 542.48 │ 468.10 │ 622.39 │ 615.67 │ 584.73 │ 555.90 │ +│ Output sequence length │ 115.15 │ 103.00 │ 143.00 │ 138.00 │ 120.00 │ 118.50 │ +│ Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ +│ Output token throughput (per sec) │ 212.04 │ N/A │ N/A │ N/A │ N/A │ N/A │ +│ Request throughput (per sec) │ 1.84 │ N/A │ N/A │ N/A │ N/A │ N/A │ +└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘ ``` -## Profile Zephyr running on OpenAI Chat API-Compatible Server - -### Run Zephyr on [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)-compatible server +## Profile Zephyr-7B-Beta running on OpenAI Chat API-Compatible Server -
-<summary>See instructions</summary>
-
-Run the vLLM inference server:
+Serve the model on the vLLM server with the [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat) endpoint:
```bash
docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model HuggingFaceH4/zephyr-7b-beta --dtype float16
```
-
- ### Run GenAI-Perf -Run GenAI-Perf from Triton Inference Server SDK container: +Run GenAI-Perf inside the Triton Inference Server SDK container: ```bash -export RELEASE="{{ release }}" - -docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk - -# Run GenAI-Perf in the container: genai-perf profile \ -m HuggingFaceH4/zephyr-7b-beta \ --service-kind openai \ --endpoint-type chat \ --synthetic-input-tokens-mean 200 \ --synthetic-input-tokens-stddev 0 \ - --streaming \ --output-tokens-mean 100 \ --output-tokens-stddev 0 \ + --streaming \ --tokenizer HuggingFaceH4/zephyr-7b-beta ``` @@ -234,48 +163,27 @@ Example output: └───────────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘ ``` -## Profile GPT2 running on OpenAI Completions API-Compatible Server - -### Running GPT2 on [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions)-compatible server +## Profile GPT-2 running on OpenAI Completions API-Compatible Server -
-<summary>See instructions</summary>
-
-Run the vLLM inference server:
+Serve the model on the vLLM server with the [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions) endpoint:
```bash
docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model gpt2 --dtype float16 --max-model-len 1024
```
-
- ### Run GenAI-Perf -Run GenAI-Perf from Triton Inference Server SDK container: +Run GenAI-Perf inside the Triton Inference Server SDK container: ```bash -export RELEASE="{{ release }}" - -docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk - - -# Run GenAI-Perf in the container: genai-perf profile \ -m gpt2 \ --service-kind openai \ - --endpoint v1/completions \ --endpoint-type completions \ - --num-prompts 100 \ - --random-seed 123 \ --synthetic-input-tokens-mean 200 \ --synthetic-input-tokens-stddev 0 \ --output-tokens-mean 100 \ - --output-tokens-stddev 0 \ - --tokenizer hf-internal-testing/llama-tokenizer \ - --concurrency 1 \ - --measurement-interval 4000 \ - --profile-export-file my_profile_export.json \ - --url localhost:8000 + --output-tokens-stddev 0 ``` Example output: diff --git a/templates/template_vars.yaml b/templates/template_vars.yaml index 12e88eb6..373d0fef 100644 --- a/templates/template_vars.yaml +++ b/templates/template_vars.yaml @@ -1,6 +1,6 @@ General: release: 24.08 - triton_cli_version: 0.0.8 + triton_cli_version: 0.0.11 genai_perf_version: 0.0.6dev README: @@ -46,4 +46,4 @@ tutorial: version: filename: __init__.py template: genai-perf-templates/version_template - output_dir: ../genai-perf/genai_perf/ \ No newline at end of file + output_dir: ../genai-perf/genai_perf/
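Once the command is shortened to `genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming`, the README and template examples rely on GenAI-Perf's defaults for prompt count, concurrency, measurement interval, export file, and endpoint URL. As a minimal sketch of how those settings can still be pinned explicitly, the command below combines only flags that already appear in the longer command variants shown earlier; the values are illustrative, not required defaults:

```bash
# Run from inside the Triton Inference Server SDK container, for example:
#   docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:24.08-py3-sdk
# Every flag below appears in the longer command variants shown earlier;
# the values are illustrative overrides of GenAI-Perf's defaults.
genai-perf profile \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --streaming \
  --num-prompts 100 \
  --random-seed 123 \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:8001
```

Pinning `--random-seed` and `--measurement-interval` keeps the synthetic workload and the measurement window constant when comparing runs across backends.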