From 70a7bf9c7aaa0752d59966cd4a4d30de661e9f33 Mon Sep 17 00:00:00 2001
From: Hyunjae Woo
Date: Tue, 10 Oct 2023 22:28:08 -0700
Subject: [PATCH] Update doc to use python script

---
 src/c++/perf_analyzer/docs/llm.md | 56 +++++++++----------------------
 1 file changed, 15 insertions(+), 41 deletions(-)

diff --git a/src/c++/perf_analyzer/docs/llm.md b/src/c++/perf_analyzer/docs/llm.md
index a6d4f54c6..c906370ab 100644
--- a/src/c++/perf_analyzer/docs/llm.md
+++ b/src/c++/perf_analyzer/docs/llm.md
@@ -127,55 +127,29 @@ python profile.py -m vllm --prompt-size-range 100 500 200 --max-tokens 256 --ign
 # Prompt size: 500, Average first-token latency: 0.0400 sec, Average token-token latency: 0.0070 sec
 ```
 
-### Benchmark 3: Profiling Continuous Batch Size
+## Benchmark 3: Profiling Continuous Batch Size
 
 In this benchmarking scenario, we want to measure the effect of continuous
 batch size on token-to-token latency. We systematically issue requests to the
 server of fixed input sizes and request the model to compute a fixed amount of
 tokens in order to increase the continuous batching size over time.
 
-#### 1. Generate prompts input data JSON
-
-```bash
-# open a new shell in the same directory you were in when running the above command
-echo '
-{
-    "data": [
-        {
-            "PROMPT": [
-                "Hello, my name is"
-            ],
-            "STREAM": [
-                true
-            ],
-            "SAMPLING_PARAMETERS": [
-                "{\"max_tokens\":16,\"ignore_eos\":true}"
-            ]
-        }
-    ]
-}
-' > prompts.json
-```
+#### Example
 
-#### 2. Run Perf Analyzer
+In this benchmark, we are interested in how the continuous batch size affects token-to-token latency
+by increasing the number of concurrent requests to the model.
+Perf Analyzer runs in [periodic concurrency mode](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/docs/inference_load_modes.md#periodic-concurrency-mode),
+which periodically launches new concurrent requests to the model using the `--periodic-concurrency-range START END STEP` option.
+In this example, Perf Analyzer starts with a single request and keeps launching new ones until the total number of requests reaches 30.
+You can also control when the new requests are launched: for example, setting `--request-period` to 50 makes
+Perf Analyzer wait until all of the requests have received 50 responses before it launches the next requests.
 
 ```bash
-perf_analyzer \
-    -m vllm \
-    -i grpc \
-    --async \
-    --streaming \
-    --input-data=prompts.json \
-    --profile-export-file=profile_export.json \
-    --periodic-concurrency-range=1:20:1
-    --request-period=10
-```
+python profile.py -m vllm --periodic-concurrency-range 1 30 1 --request-period 50 --max-tokens 256 --ignore-eos
 
-#### 3. Calculate average token-to-token latency
-
-```bash
-python3 examples/calculate_avg_token_to_token_latency.py
-# Average token-to-token latency: 0.003090155677419355 s
+# Sample output
+# [ Benchmark Summary ]
+# Prompt size: 100, Average first-token latency: 0.0397 sec, Average token-token latency: 0.0105 sec
+# Prompt size: 300, Average first-token latency: 0.0269 sec, Average token-token latency: 0.0094 sec
+# Prompt size: 500, Average first-token latency: 0.0357 sec, Average token-token latency: 0.0119 sec
 ```
-
-#### 4. Repeat steps 1-3 with different period concurrency range start/end/step and different request period to measure effects of continuous batch size on token-to-token latency (generation).
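
For reference, the new `profile.py` example maps onto the same Perf Analyzer options that the removed manual steps invoked directly. A rough raw-flag equivalent is sketched below; it assumes `profile.py` simply forwards `--periodic-concurrency-range` and `--request-period` to Perf Analyzer and generates `prompts.json` internally, which is an assumption about the script rather than documented behavior.

```bash
# Rough raw-flag equivalent of the profile.py example above (assumption:
# profile.py forwards these options and generates prompts.json itself).
perf_analyzer \
    -m vllm \
    -i grpc \
    --async \
    --streaming \
    --input-data=prompts.json \
    --profile-export-file=profile_export.json \
    --periodic-concurrency-range=1:30:1 \
    --request-period=50
```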
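
The sweep described by the removed step 4 (varying the periodic concurrency range start/end/step and the request period to see how continuous batch size shifts token-to-token latency) still applies with the new script. The values below are hypothetical and only illustrate one such rerun.

```bash
# Hypothetical sweep values: lengthen the concurrency ramp and shorten the
# request period, then compare the reported token-token latencies.
python profile.py -m vllm --periodic-concurrency-range 1 50 1 --request-period 25 --max-tokens 256 --ignore-eos
```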