diff --git a/src/c++/perf_analyzer/genai-perf/README.md b/src/c++/perf_analyzer/genai-perf/README.md index b114364d7..d9f288996 100644 --- a/src/c++/perf_analyzer/genai-perf/README.md +++ b/src/c++/perf_analyzer/genai-perf/README.md @@ -373,7 +373,7 @@ model config to not echo the input tokens in the output. (default: tensorrtllm) Set a custom endpoint that differs from the OpenAI defaults. (default: `None`) -##### `--endpoint-type {chat,completions}` +##### `--endpoint-type {chat,completions,embeddings}` The endpoint-type to send requests to on the server. This is only used with the `openai` service-kind. (default: `None`) @@ -398,7 +398,8 @@ URL of the endpoint to target for benchmarking. (default: `None`) ##### `--batch-size ` The batch size of the requests GenAI-Perf should send. -This is currently only supported with the embeddings endpoint type. +This is currently only supported with the +[embeddings endpoint type](docs/embeddings.md). (default: `1`) ##### `--extra-inputs ` diff --git a/src/c++/perf_analyzer/genai-perf/docs/embeddings.md b/src/c++/perf_analyzer/genai-perf/docs/embeddings.md new file mode 100644 index 000000000..e61c397d9 --- /dev/null +++ b/src/c++/perf_analyzer/genai-perf/docs/embeddings.md @@ -0,0 +1,93 @@ + + +# Profiling Embeddings Models with GenAI-Perf + +GenAI-Perf allows you to profile embedding models running on an +[OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)-compatible server. 

## Creating a Sample Embeddings Input File

To create a sample embeddings input file, use the following command:

```bash
echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl
```

This will generate a file named `embeddings.jsonl` with the following content:

```jsonl
{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}
```

## Starting an OpenAI Embeddings-Compatible Server

To start an OpenAI embeddings-compatible server, run the following command:

```bash
docker run -it --net=host --rm --gpus=all \
  vllm/vllm-openai:latest \
  --model intfloat/e5-mistral-7b-instruct \
  --dtype float16 \
  --max-model-len 1024
```

## Running GenAI-Perf

To profile embeddings models using GenAI-Perf, use the following command:

```bash
genai-perf \
  -m intfloat/e5-mistral-7b-instruct \
  --service-kind openai \
  --endpoint-type embeddings \
  --batch-size 2 \
  --input-file embeddings.jsonl
```

This will use default values for optional arguments. You can also pass in
additional arguments with the `--extra-inputs` [flag](../README.md#input-options).
+For example, you could use this command: + +```bash +genai-perf \ + -m intfloat/e5-mistral-7b-instruct \ + --service-kind openai \ + --endpoint-type embeddings \ + --extra-inputs user:sample_user +``` + +Example output: + +``` + Embeddings Metrics +┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓ +┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ +┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩ +│ Request latency (ms) │ 42.21 │ 28.18 │ 318.61 │ 56.50 │ 49.21 │ 43.07 │ +└──────────────────────┴───────┴───────┴────────┴───────┴───────┴───────┘ +Request throughput (per sec): 23.63 +``` \ No newline at end of file
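
## Validating the Input File

Before benchmarking, it can help to confirm that every line of `embeddings.jsonl` is a standalone JSON object containing a `text` field, since that is the shape the sample file above uses. The following is a minimal Python sketch for such a check; it is not part of GenAI-Perf, and the inlined `SAMPLE` string simply mirrors the file created earlier:

```python
import json

# Mirrors the contents of the embeddings.jsonl file created above.
SAMPLE = """{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}"""


def validate_jsonl_lines(lines):
    """Parse each non-empty line as JSON and require a 'text' field."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # raises if the line is not valid JSON
        if "text" not in record:
            raise ValueError(f"line {lineno}: missing 'text' field")
        records.append(record)
    return records


records = validate_jsonl_lines(SAMPLE.splitlines())
print(len(records))  # 4
```

To check a real file instead, pass `open("embeddings.jsonl")` to `validate_jsonl_lines`, since iterating a file object yields its lines.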