Update GenAI-Perf README and tutorial doc #89

Merged (10 commits) on Sep 19, 2024
54 changes: 16 additions & 38 deletions genai-perf/README.md
@@ -128,7 +128,7 @@ the GPT-2 model running on Triton Inference Server with a TensorRT-LLM engine.
### Serve GPT-2 TensorRT-LLM model using Triton CLI

You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-trt-llm-model)
on Triton CLI github repo to run GPT-2 model locally.
in the Triton CLI GitHub repository to serve GPT-2 on the Triton server with the TensorRT-LLM backend.
The full instructions are copied below for convenience:

```bash
@@ -139,12 +139,11 @@ docker run -ti \
--network=host \
--shm-size=1g --ulimit memlock=-1 \
-v /tmp:/tmp \
-v ${HOME}/models:/root/models \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3

# Install the Triton CLI
pip install git+https://github.com/triton-inference-server/triton_cli.git@0.0.8
pip install git+https://github.com/triton-inference-server/triton_cli.git@0.0.11

# Build TRT LLM engine and generate a Triton model repository pointing at it
triton remove -m all
@@ -156,48 +155,27 @@ triton start

### Running GenAI-Perf

Now we can run GenAI-Perf from Triton Inference Server SDK container:
Now we can run GenAI-Perf inside the Triton Inference Server SDK container:

```bash
export RELEASE="24.08"

docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Run GenAI-Perf in the container:
genai-perf profile \
-m gpt2 \
--service-kind triton \
--backend tensorrtllm \
--num-prompts 100 \
--random-seed 123 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
--streaming \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--output-tokens-mean-deterministic \
--tokenizer hf-internal-testing/llama-tokenizer \
--concurrency 1 \
--measurement-interval 4000 \
--profile-export-file my_profile_export.json \
--url localhost:8001
genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming
```

Example output:

```
                                   LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time to first token (ms) │  11.70 │   9.88 │  17.21 │  14.35 │  12.01 │  11.87 │
│ Inter token latency (ms) │   1.46 │   1.08 │   1.89 │   1.87 │   1.62 │   1.52 │
│     Request latency (ms) │ 161.24 │ 153.45 │ 200.74 │ 200.66 │ 179.43 │ 162.23 │
│   Output sequence length │ 103.39 │  95.00 │ 134.00 │ 120.08 │ 107.30 │ 105.00 │
│    Input sequence length │ 200.01 │ 200.00 │ 201.00 │ 200.13 │ 200.00 │ 200.00 │
└──────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
Output token throughput (per sec): 635.61
Request throughput (per sec): 6.15
                              NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│          Time to first token (ms) │  16.26 │  12.39 │  17.25 │  17.09 │  16.68 │  16.56 │
│          Inter token latency (ms) │   1.85 │   1.55 │   2.04 │   2.02 │   1.97 │   1.92 │
│              Request latency (ms) │ 499.20 │ 451.01 │ 554.61 │ 548.69 │ 526.13 │ 514.19 │
│            Output sequence length │ 261.90 │ 256.00 │ 298.00 │ 296.60 │ 270.00 │ 265.00 │
│             Input sequence length │ 550.06 │ 550.00 │ 553.00 │ 551.60 │ 550.00 │ 550.00 │
│ Output token throughput (per sec) │ 520.87 │    N/A │    N/A │    N/A │    N/A │    N/A │
│      Request throughput (per sec) │   1.99 │    N/A │    N/A │    N/A │    N/A │    N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```

See [Tutorial](docs/tutorial.md) for additional examples.
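
Reviewer note: the averages in the updated example output can be cross-checked against each other. The sketch below is not part of GenAI-Perf; the function names `estimate_latency` and `estimate_token_throughput` are hypothetical, and it assumes that request latency is roughly time to first token plus inter token latency times (output tokens - 1), and that output token throughput is roughly request throughput times output sequence length. Because these relations are applied to averages, they are approximations only.

```bash
#!/usr/bin/env bash
# Rough consistency check for the averages in a GenAI-Perf summary table.
# Assumed relations (approximate, applied to averages):
#   request latency  ~= TTFT + ITL * (output tokens - 1)
#   token throughput ~= request throughput * output tokens

# estimate_latency TTFT_MS ITL_MS OUTPUT_TOKENS
# Prints the estimated request latency in milliseconds.
estimate_latency() {
  LC_ALL=C awk -v ttft="$1" -v itl="$2" -v osl="$3" \
    'BEGIN { printf "%.1f\n", ttft + itl * (osl - 1) }'
}

# estimate_token_throughput REQUESTS_PER_SEC OUTPUT_TOKENS
# Prints the estimated output token throughput in tokens per second.
estimate_token_throughput() {
  LC_ALL=C awk -v rps="$1" -v osl="$2" \
    'BEGIN { printf "%.1f\n", rps * osl }'
}

# Averages taken from the updated example output.
estimate_latency 16.26 1.85 261.90          # table reports 499.20 ms
estimate_token_throughput 1.99 261.90       # table reports 520.87 tokens/s
```

With the values above, the estimates come out to 498.9 ms and 521.2 tokens/s, close to the reported 499.20 ms and 520.87 tokens/s, which suggests the table was transcribed consistently.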