Update GenAI-Perf README and tutorial doc (#89)
* update doc

* remove duplicate commands

* make commands simpler for users

* Remove redundant sdk container docker commands

* add links

* update template

* fix typo

* add article and capitalize github

* set env directly
nv-hwoo authored Sep 19, 2024
1 parent 292cdd9 commit 942a5be
Showing 5 changed files with 140 additions and 368 deletions.
54 changes: 16 additions & 38 deletions genai-perf/README.md
@@ -128,7 +128,7 @@ the GPT-2 model running on Triton Inference Server with a TensorRT-LLM engine.
### Serve GPT-2 TensorRT-LLM model using Triton CLI

You can follow the [quickstart guide](https://github.com/triton-inference-server/triton_cli?tab=readme-ov-file#serving-a-trt-llm-model)
on Triton CLI github repo to run GPT-2 model locally.
in the Triton CLI GitHub repository to serve GPT-2 on the Triton server with the TensorRT-LLM backend.
The full instructions are copied below for convenience:

```bash
@@ -139,12 +139,11 @@ docker run -ti \
--network=host \
--shm-size=1g --ulimit memlock=-1 \
-v /tmp:/tmp \
-v ${HOME}/models:/root/models \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3

# Install the Triton CLI
pip install git+https://github.com/triton-inference-server/[email protected].8
pip install git+https://github.com/triton-inference-server/[email protected].11

# Build TRT LLM engine and generate a Triton model repository pointing at it
triton remove -m all
@@ -156,48 +155,27 @@ triton start
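Once `triton start` is running, you can confirm the server came up before profiling. A minimal check from another terminal, assuming Triton's default HTTP port 8000 (the readiness endpoint is part of the KServe protocol Triton implements):

```shell
# Returns HTTP 200 once all models have loaded; -sf keeps curl quiet and
# treats HTTP errors as failures.
if curl -sf localhost:8000/v2/health/ready >/dev/null; then
  READY="yes"
else
  READY="no"
fi
echo "server ready: ${READY}"
```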

### Running GenAI-Perf

Now we can run GenAI-Perf from Triton Inference Server SDK container:
Now we can run GenAI-Perf inside the Triton Inference Server SDK container:

```bash
export RELEASE="24.08"

docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Run GenAI-Perf in the container:
genai-perf profile \
-m gpt2 \
--service-kind triton \
--backend tensorrtllm \
--num-prompts 100 \
--random-seed 123 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
--streaming \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--output-tokens-mean-deterministic \
--tokenizer hf-internal-testing/llama-tokenizer \
--concurrency 1 \
--measurement-interval 4000 \
--profile-export-file my_profile_export.json \
--url localhost:8001
genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming
```
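The one-liner above leans on GenAI-Perf's defaults. To control the synthetic workload explicitly, the flags from the longer example can still be passed; a sketch (the values are illustrative, and the command assumes `genai-perf` is on `PATH`, as it is inside the SDK container):

```shell
MODEL="gpt2"
SERVER_URL="localhost:8001"   # Triton's default gRPC port

# Guard so the snippet degrades gracefully outside the SDK container.
if command -v genai-perf >/dev/null; then
  genai-perf profile \
    -m "${MODEL}" \
    --service-kind triton \
    --backend tensorrtllm \
    --streaming \
    --num-prompts 100 \
    --random-seed 123 \
    --synthetic-input-tokens-mean 200 \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 100 \
    --concurrency 1 \
    --measurement-interval 4000 \
    --url "${SERVER_URL}"
else
  echo "genai-perf not found; run this inside the SDK container"
fi
```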

Example output:

```
LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time to first token (ms) │ 11.70 9.88 │ 17.2114.3512.0111.87
│ Inter token latency (ms) │ 1.46 │ 1.081.891.87 │ 1.62 │ 1.52
│ Request latency (ms) │ 161.24153.45200.74200.66179.43162.23
│ Output sequence length │ 103.39 95.00 │ 134.00 │ 120.08107.30105.00 │
│ Input sequence length │ 200.01200.00 │ 201.00 │ 200.13200.00 │ 200.00 │
└──────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
Output token throughput (per sec): 635.61
Request throughput (per sec): 6.15
NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
Time to first token (ms) │ 16.2612.39 │ 17.2517.0916.6816.56
Inter token latency (ms) │ 1.85 │ 1.552.042.02 │ 1.97 │ 1.92
Request latency (ms) │ 499.20451.01554.61548.69526.13514.19
Output sequence length │ 261.90256.00 │ 298.00 │ 296.60270.00265.00 │
Input sequence length │ 550.06550.00 │ 553.00 │ 551.60550.00 │ 550.00 │
│ Output token throughput (per sec) │ 520.87 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request throughput (per sec) │ 1.99 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
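A quick sanity check on the numbers above: output token throughput should be roughly request throughput times the mean output sequence length (roughly, because both are averaged over wall-clock time):

```shell
# 1.99 requests/sec * 261.90 tokens/request, taken from the table above.
APPROX=$(awk 'BEGIN { printf "%.1f", 1.99 * 261.90 }')
echo "~${APPROX} tokens/sec, close to the reported 520.87"
```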

See [Tutorial](docs/tutorial.md) for additional examples.