From fada8bf264f22842afcf2cd4b261887ea67e7d5a Mon Sep 17 00:00:00 2001
From: Andy Dai
Date: Wed, 21 Aug 2024 14:57:52 -0700
Subject: [PATCH] Add goodput tutorial

---
 genai-perf/docs/goodput.md | 132 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 genai-perf/docs/goodput.md

diff --git a/genai-perf/docs/goodput.md b/genai-perf/docs/goodput.md
new file mode 100644
index 00000000..92487627
--- /dev/null
+++ b/genai-perf/docs/goodput.md

# Context

Goodput is the number of completed requests per second that meet specified
constraints. It provides an enhanced measure of AI serving performance by
accounting for both cost efficiency and user satisfaction.

# Tutorials

This tutorial shows how to benchmark models using goodput constraints.

## Examples

- [LLM Examples](#profile-llm-goodput)

- [Embedding Model Examples](#profile-embeddings-model-goodput)

## Profile LLM Goodput

### Run GPT2 on an OpenAI Chat Completions API-Compatible Server

```bash
docker run -it --net=host --rm --gpus=all vllm/vllm-openai:latest --model gpt2 --dtype float16 --max-model-len 1024
```

### Run GenAI-Perf with Goodput Constraints

```bash
genai-perf profile \
  -m gpt2 \
  --service-kind openai \
  --endpoint-type chat \
  --measurement-interval 1000 \
  --streaming \
  --goodput time_to_first_token:4.35 inter_token_latency:1.1
```

Here a request counts toward goodput only if its time to first token is at
most 4.35 ms and its inter-token latency is at most 1.1 ms, which is why the
reported goodput is lower than the raw request throughput.

Example output:

```
                              LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic                ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time to first token (ms) │   4.44 │   3.63 │  23.85 │  13.30 │   5.15 │   4.20 │
│ Inter token latency (ms) │   1.03 │   0.76 │   1.92 │   1.63 │   1.18 │   1.08 │
│ Request latency (ms)     │  22.31 │   7.67 │  45.83 │  41.11 │  25.18 │  21.97 │
│ Output sequence length   │  18.54 │   5.00 │  24.00 │  22.41 │  21.00 │  20.00 │
│ Input sequence length    │ 550.06 │ 550.00 │ 553.00 │ 551.82 │ 550.00 │ 550.00 │
└──────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
Output token throughput (per sec): 827.09
Request throughput (per sec): 44.62
Request goodput (per sec): 30.95
```

## Profile Embeddings Model Goodput

### Create a Sample Embeddings Input File

To create a sample embeddings input file, use the following command:

```bash
echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl
```

This generates a file named `embeddings.jsonl` with the following content:

```jsonl
{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}
```

### Start an OpenAI Embeddings-Compatible Server

To start an OpenAI embeddings-compatible server, run the following command:

```bash
docker run -it --net=host --rm --gpus=all vllm/vllm-openai:latest --model intfloat/e5-mistral-7b-instruct --dtype float16 --max-model-len 1024
```

### Run GenAI-Perf with Goodput Constraints

```bash
genai-perf profile \
  -m intfloat/e5-mistral-7b-instruct \
  --service-kind openai \
  --endpoint-type embeddings \
  --batch-size 2 \
  --input-file embeddings.jsonl \
  --measurement-interval 1000 \
  --goodput request_latency:22.5
```

Here a request counts toward goodput only if its request latency is at most
22.5 ms.

Example output:

```
                    Embeddings Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Statistic            ┃   avg ┃   min ┃   max ┃   p99 ┃   p90 ┃   p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ Request latency (ms) │ 22.23 │ 21.67 │ 31.96 │ 22.90 │ 22.48 │ 22.31 │
└──────────────────────┴───────┴───────┴───────┴───────┴───────┴───────┘
Request throughput (per sec): 44.73
Request goodput (per sec): 40.28
```
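
For reference, the relationship between throughput and goodput can be sketched
in a few lines of Python. This is an illustrative sketch only, not the
GenAI-Perf implementation; the function name, field names, and example numbers
below are invented for this example. A request counts toward goodput when every
measured metric meets its constraint, and the count is divided by the benchmark
duration:

```python
def goodput_per_sec(requests, constraints, duration_s):
    """Count requests whose metrics all meet their constraints
    (metric value <= threshold), divided by benchmark duration in seconds.

    requests: list of dicts mapping metric name -> measured value (ms)
    constraints: dict mapping metric name -> maximum allowed value (ms)
    duration_s: benchmark duration in seconds
    """
    good = sum(
        1
        for metrics in requests
        if all(metrics[name] <= limit for name, limit in constraints.items())
    )
    return good / duration_s


# Hypothetical example: three requests measured over a 0.1 s window, using
# the same constraint values as the LLM example above (values in ms).
requests = [
    {"time_to_first_token": 4.20, "inter_token_latency": 1.08},  # meets both
    {"time_to_first_token": 4.44, "inter_token_latency": 1.03},  # TTFT too high
    {"time_to_first_token": 3.63, "inter_token_latency": 0.76},  # meets both
]
constraints = {"time_to_first_token": 4.35, "inter_token_latency": 1.1}

print(goodput_per_sec(requests, constraints, 0.1))  # → 20.0
```

Only two of the three requests satisfy every constraint, so goodput (20.0/sec
in this toy window) is lower than raw request throughput (30.0/sec), mirroring
the gap between the throughput and goodput lines in the outputs above.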