Document how to profile embeddings models (#717)
dyastremsky authored Jun 28, 2024
1 parent e423332 commit 90c60a6
Showing 2 changed files with 96 additions and 2 deletions.
5 changes: 3 additions & 2 deletions src/c++/perf_analyzer/genai-perf/README.md
@@ -373,7 +373,7 @@ model config to not echo the input tokens in the output. (default: tensorrtllm)

Set a custom endpoint that differs from the OpenAI defaults. (default: `None`)

- ##### `--endpoint-type {chat,completions}`
+ ##### `--endpoint-type {chat,completions,embeddings}`

The endpoint-type to send requests to on the server. This is only used with the
`openai` service-kind. (default: `None`)
@@ -398,7 +398,8 @@ URL of the endpoint to target for benchmarking. (default: `None`)
##### `--batch-size <int>`

The batch size of the requests GenAI-Perf should send.
- This is currently only supported with the embeddings endpoint type.
+ This is currently only supported with the
+ [embeddings endpoint type](docs/embeddings.md).
(default: `1`)

##### `--extra-inputs <str>`
93 changes: 93 additions & 0 deletions src/c++/perf_analyzer/genai-perf/docs/embeddings.md
@@ -0,0 +1,93 @@
<!--
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Profiling Embeddings Models with GenAI-Perf

GenAI-Perf allows you to profile embedding models running on an
[OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)-compatible server.

## Creating a Sample Embeddings Input File

To create a sample embeddings input file, use the following command:

```bash
echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl
```

This will generate a file named `embeddings.jsonl` with the following content:
```jsonl
{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}
```

## Starting an OpenAI Embeddings-Compatible Server

To start an OpenAI embeddings-compatible server, run the following command:
```bash
docker run -it --net=host --rm --gpus=all vllm/vllm-openai:latest --model intfloat/e5-mistral-7b-instruct --dtype float16 --max-model-len 1024
```
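
Once the container is up, you can optionally send a quick request to confirm the server is serving embeddings before benchmarking. This is a minimal sanity check, assuming the vLLM OpenAI-compatible server is listening on its default port 8000 and exposes the `/v1/embeddings` route:

```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "intfloat/e5-mistral-7b-instruct",
        "input": ["What was the first car ever driven?"]
      }'
```

A healthy server should respond with a JSON body containing an `embedding` vector for the input text.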

## Running GenAI-Perf

To profile embeddings models using GenAI-Perf, use the following command:

```bash
genai-perf \
-m intfloat/e5-mistral-7b-instruct \
--service-kind openai \
--endpoint-type embeddings \
--batch-size 2 \
--input-file embeddings.jsonl
```
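
With `--batch-size 2`, GenAI-Perf bundles two texts from `embeddings.jsonl` into each request. As a rough sketch of what a single OpenAI-style embeddings request body then looks like (illustrative only; the exact payload is constructed by GenAI-Perf and may include additional fields):

```json
{
  "model": "intfloat/e5-mistral-7b-instruct",
  "input": [
    "What was the first car ever driven?",
    "Who served as the 5th President of the United States of America?"
  ]
}
```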

The command above will use default values for optional arguments. You can also pass in
additional arguments with the `--extra-inputs` [flag](../README.md#input-options).
For example, you could use this command:

```bash
genai-perf \
-m intfloat/e5-mistral-7b-instruct \
--service-kind openai \
--endpoint-type embeddings \
--extra-inputs user:sample_user
```
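
Assuming GenAI-Perf forwards each extra input verbatim into the request body (a sketch of the intent, not a guarantee of the exact payload layout), every request would then carry the additional `user` field, roughly:

```json
{
  "model": "intfloat/e5-mistral-7b-instruct",
  "input": ["What was the first car ever driven?"],
  "user": "sample_user"
}
```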

Example output:

```
Embeddings Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ Request latency (ms) │ 42.21 │ 28.18 │ 318.61 │ 56.50 │ 49.21 │ 43.07 │
└──────────────────────┴───────┴───────┴────────┴───────┴───────┴───────┘
Request throughput (per sec): 23.63
```
