Add doc generation script with doc templates #49

Merged · 4 commits · Aug 22, 2024
567 changes: 567 additions & 0 deletions templates/genai-perf-templates/README_template

Large diffs are not rendered by default.

250 changes: 250 additions & 0 deletions templates/genai-perf-templates/compare_template
@@ -0,0 +1,250 @@
<!--
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# GenAI-Perf Compare Subcommand

There are two ways to use the `compare` subcommand to create plots across
multiple runs: pass the profile export files directly with the `--files` option,
or supply a YAML configuration file with the `--config` option.

## Running initially with `--files` option

If the user does not have a YAML configuration file,
they can run the `compare` subcommand with the `--files` option to generate a
set of default plots as well as a pre-filled YAML config file for the plots.

```bash
genai-perf compare --files profile1.json profile2.json profile3.json
```

This generates the default plots, comparing across the three runs.
GenAI-Perf also generates an initial YAML configuration file, `config.yaml`,
pre-filled with the plot configurations shown below:

```yaml
plot1:
  title: Time to First Token
  x_metric: ''
  y_metric: time_to_first_tokens
  x_label: Time to First Token (ms)
  y_label: ''
  width: 1200
  height: 700
  type: box
  paths:
  - profile1.json
  - profile2.json
  - profile3.json
  output: compare
plot2:
  title: Request Latency
  x_metric: ''
  y_metric: request_latencies
  x_label: Request Latency (ms)
  y_label: ''
  width: 1200
  height: 700
  type: box
  paths:
  - profile1.json
  - profile2.json
  - profile3.json
  output: compare
plot3:
  title: Distribution of Input Sequence Lengths to Output Sequence Lengths
  x_metric: input_sequence_lengths
  y_metric: output_sequence_lengths
  x_label: Input Sequence Length
  y_label: Output Sequence Length
  width: 1200
  height: 450
  type: heatmap
  paths:
  - profile1.json
  - profile2.json
  - profile3.json
  output: compare
plot4:
  title: Time to First Token vs Input Sequence Lengths
  x_metric: input_sequence_lengths
  y_metric: time_to_first_tokens
  x_label: Input Sequence Length
  y_label: Time to First Token (ms)
  width: 1200
  height: 700
  type: scatter
  paths:
  - profile1.json
  - profile2.json
  - profile3.json
  output: compare
plot5:
  title: Token-to-Token Latency vs Output Token Position
  x_metric: token_positions
  y_metric: inter_token_latencies
  x_label: Output Token Position
  y_label: Token-to-Token Latency (ms)
  width: 1200
  height: 700
  type: scatter
  paths:
  - profile1.json
  - profile2.json
  - profile3.json
  output: compare
```

Once the user has the YAML configuration file, they can iterate by editing the
config file and re-running with the `--config` option to regenerate the plots.

```bash
# edit
vi config.yaml

# re-generate the plots
genai-perf compare --config config.yaml
```

## Running directly with `--config` option

If the user would like to create a custom plot (beyond the default ones provided),
they can build their own YAML configuration file describing the plots they would
like to generate.
For instance, to see how the inter-token latencies change with the number of
output tokens, which is not part of the default plots, they could add the
following YAML block to the file:

```yaml
plot1:
  title: Inter Token Latency vs Output Tokens
  x_metric: num_output_tokens
  y_metric: inter_token_latencies
  x_label: Num Output Tokens
  y_label: Avg ITL (ms)
  width: 1200
  height: 450
  type: scatter
  paths:
  - <path-to-profile-export-file>
  - <path-to-profile-export-file>
  output: compare
```

After adding the lines, the user can run the following command to generate the
plots specified in the configuration file (in this case, `config.yaml`):

```bash
genai-perf compare --config config.yaml
```

The user can check the generated plots under the output directory:
```
compare/
├── inter_token_latency_vs_output_tokens.jpeg
└── ...
```

## YAML Schema

Here are more details about the YAML configuration file and its structure.
The general YAML schema for the plot configuration is as follows:

```yaml
plot1:
  title: [str]
  x_metric: [str]
  y_metric: [str]
  x_label: [str]
  y_label: [str]
  width: [int]
  height: [int]
  type: [scatter,box,heatmap]
  paths:
  - [str]
  - ...
  output: [str]

plot2:
  title: [str]
  x_metric: [str]
  y_metric: [str]
  x_label: [str]
  y_label: [str]
  width: [int]
  height: [int]
  type: [scatter,box,heatmap]
  paths:
  - [str]
  - ...
  output: [str]

# add more plots
```

The user can add as many plot blocks as they would like by appending them to the
configuration file (the keys follow a `plot<#>` pattern, but that is not
required; the user can set them to any arbitrary string).
For each plot block, the user can specify the following configurations
(a minimal example follows the note below):
- `title`: The title of the plot.
- `x_metric`: The name of the metric to be used on the x-axis.
- `y_metric`: The name of the metric to be used on the y-axis.
- `x_label`: The x-axis label (or description).
- `y_label`: The y-axis label (or description).
- `width`: The width of the entire plot.
- `height`: The height of the entire plot.
- `type`: The type of the plot. It must be one of the three: `scatter`, `box`,
or `heatmap`.
- `paths`: List of paths to the profile export files to compare.
- `output`: The path to the output directory that stores all the plots and the
YAML configuration file.

> [!NOTE]
> The user *MUST* provide at least one valid path to a profile export file.
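
For example, a plot block with an arbitrary key name could look like the sketch
below. The key name, title, and file paths are placeholders for illustration,
not part of the default output; the metric names are taken from the default
configuration shown earlier:

```yaml
ttft_comparison:
  title: Time to First Token Across Runs
  x_metric: ''
  y_metric: time_to_first_tokens
  x_label: Time to First Token (ms)
  y_label: ''
  width: 1200
  height: 700
  type: box
  paths:
  - run_a_profile_export.json
  - run_b_profile_export.json
  output: compare
```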



## Example Plots

Here is the list of sample plots that are created by default when running the
`compare` subcommand:

### Distribution of Input Sequence Lengths to Output Sequence Lengths
<img src="assets/distribution_of_input_sequence_lengths_to_output_sequence_lengths.jpeg" width="800" height="300" />

### Request Latency Analysis
<img src="assets/request_latency.jpeg" width="800" height="300" />

### Time to First Token Analysis
<img src="assets/time_to_first_token.jpeg" width="800" height="300" />

### Time to First Token vs. Input Sequence Lengths
<img src="assets/time_to_first_token_vs_input_sequence_lengths.jpeg" width="800" height="300" />

### Token-to-Token Latency vs. Output Token Position
<img src="assets/token-to-token_latency_vs_output_token_position.jpeg" width="800" height="300" />
106 changes: 106 additions & 0 deletions templates/genai-perf-templates/embeddings_template
@@ -0,0 +1,106 @@
<!--
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Profile Embeddings Models with GenAI-Perf

GenAI-Perf allows you to profile embedding models running on an
[OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)-compatible server.

## Create a Sample Embeddings Input File

To create a sample embeddings input file, use the following command:

```bash
echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl
```

This will generate a file named `embeddings.jsonl` with the following content:
```jsonl
{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}
```
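
If you want a larger input file, a short shell loop can produce one JSON line
per prompt. This is only a sketch: the prompts are arbitrary, and it assumes
the texts contain no double quotes or backslashes, since no JSON escaping is
performed.

```bash
# Write one JSON line per prompt into embeddings.jsonl (overwriting any existing file).
# Note: assumes prompts contain no characters that need JSON escaping.
rm -f embeddings.jsonl
for text in \
  "What was the first car ever driven?" \
  "Who served as the 5th President of the United States of America?" \
  "Is the Sydney Opera House located in Australia?" \
  "In what state did they film Shrek 2?"
do
  printf '{"text": "%s"}\n' "$text" >> embeddings.jsonl
done
```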

## Start an OpenAI Embeddings-Compatible Server
To start an OpenAI embeddings-compatible server, run the following command:
```bash
docker run -it --net=host --rm --gpus=all vllm/vllm-openai:latest --model intfloat/e5-mistral-7b-instruct --dtype float16 --max-model-len 1024
```
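
Optionally, you can check that the server is responding before profiling. The
request below is only a sketch: it assumes vLLM's default port (8000) and the
OpenAI-style `/v1/embeddings` route.

```bash
# Hypothetical smoke test against the local server (assumes vLLM's default port 8000)
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "intfloat/e5-mistral-7b-instruct", "input": "Hello, world!"}'
```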

## Run GenAI-Perf
To profile embeddings models using GenAI-Perf, use the following command:

```bash
genai-perf profile \
  -m intfloat/e5-mistral-7b-instruct \
  --service-kind openai \
  --endpoint-type embeddings \
  --batch-size 2 \
  --input-file embeddings.jsonl
```

* `-m intfloat/e5-mistral-7b-instruct` specifies the model to run
(`intfloat/e5-mistral-7b-instruct`)
* `--service-kind openai` specifies that the server is OpenAI-API compatible
* `--endpoint-type embeddings` specifies that requests should be formatted to
follow the [embeddings
API](https://platform.openai.com/docs/api-reference/embeddings/create)
* `--batch-size 2` specifies that each request will contain the inputs for 2
individual inferences, making a batch size of 2
* `--input-file embeddings.jsonl` specifies the input data to be used for
inferencing

This will use default values for optional arguments. You can also pass in
additional arguments with the `--extra-inputs` [flag](../README.md#input-options).
For example, you could use this command:

```bash
genai-perf profile \
  -m intfloat/e5-mistral-7b-instruct \
  --service-kind openai \
  --endpoint-type embeddings \
  --extra-inputs user:sample_user
```

Example output:

```
Embeddings Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ Request latency (ms) │ 42.21 │ 28.18 │ 318.61 │ 56.50 │ 49.21 │ 43.07 │
└──────────────────────┴───────┴───────┴────────┴───────┴───────┴───────┘
Request throughput (per sec): 23.63
```
