Merge branch 'main' of https://github.com/triton-inference-server/client into lkomali-fix-pa-test-failure
triton-inference-server · Jul 23, 2024 · ff51454 · ff51454
2 parents b977ce5 + 3e2d54f
commit ff51454
Showing 23 changed files with 946 additions and 115 deletions.
diff --git a/src/c++/perf_analyzer/genai-perf/docs/multi_modal.md b/src/c++/perf_analyzer/genai-perf/docs/multi_modal.md
@@ -0,0 +1,122 @@
+<!--
+Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions
+are met:
+ * Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+ * Neither the name of NVIDIA CORPORATION nor the names of its
+   contributors may be used to endorse or promote products derived
+   from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+-->
+
+# Profile Vision-Language Models with GenAI-Perf
+
+GenAI-Perf allows you to profile Vision-Language Models (VLMs) running on an
+[OpenAI Chat Completions API](https://platform.openai.com/docs/guides/chat-completions)-compatible server
+by sending [multi-modal content](https://platform.openai.com/docs/guides/vision) to the server.
+Currently, you can send multi-modal content with GenAI-Perf using one of the following two approaches:
+1. The synthetic data generation approach, where GenAI-Perf generates the multi-modal data for you.
+2. The Bring Your Own Data (BYOD) approach, where you provide GenAI-Perf with the data to send.
+
+Before we dive into the two approaches,
+you can start an OpenAI API-compatible server with a VLM using the following command:
+
+```bash
+docker run --runtime nvidia --gpus all \
+    -p 8000:8000 --ipc=host \
+    vllm/vllm-openai:latest \
+    --model llava-hf/llava-v1.6-mistral-7b-hf --dtype float16
+```
+
+
+## Approach 1: Synthetic Multi-Modal Data Generation
+
+GenAI-Perf can generate synthetic multi-modal data, such as text or images, using
+the parameters provided by the user through the CLI.
+
+```bash
+genai-perf profile \
+    -m llava-hf/llava-v1.6-mistral-7b-hf \
+    --service-kind openai \
+    --endpoint-type vision \
+    --image-width-mean 512 \
+    --image-width-stddev 30 \
+    --image-height-mean 512 \
+    --image-height-stddev 30 \
+    --image-format png \
+    --synthetic-input-tokens-mean 100 \
+    --synthetic-input-tokens-stddev 0 \
+    --streaming
+```
+
+> [!Note]
+> Under the hood, GenAI-Perf generates synthetic images using a few source images
+> under the `llm_inputs/source_images` directory.
+> To add, remove, or edit the source images,
+> modify the files in that directory directly.
+> GenAI-Perf will pick up the images in the directory automatically when
+> generating the synthetic images.
+
+
+## Approach 2: Bring Your Own Data (BYOD)
+
+Instead of letting GenAI-Perf create the synthetic data,
+you can also provide GenAI-Perf with your own data using the
+[`--input-file`](../README.md#--input-file-path) CLI option.
+The file needs to be in JSONL format, and each line should contain both the prompt and
+the file path to the image to send.
+
+For instance, an input file named `input.jsonl` might look like the following:
+```json
+{"text_input": "What is in this image?", "image": "path/to/image1.png"}
+{"text_input": "What is the color of the dog?", "image": "path/to/image2.jpeg"}
+{"text_input": "Describe the scene in the picture.", "image": "path/to/image3.png"}
+...
+```
+
+After you create the file, you can run GenAI-Perf using the following command:
+
+```bash
+genai-perf profile \
+    -m llava-hf/llava-v1.6-mistral-7b-hf \
+    --service-kind openai \
+    --endpoint-type vision \
+    --input-file input.jsonl \
+    --streaming
+```
+
+Running GenAI-Perf using either approach produces output similar to
+the following:
+
+```bash
+                                         LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
+┃                Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
+│ Time to first token (ms) │   321.05 │   291.30 │   537.07 │   497.88 │   318.46 │   317.35 │
+│ Inter token latency (ms) │    12.28 │    11.44 │    12.88 │    12.87 │    12.81 │    12.53 │
+│     Request latency (ms) │ 1,866.23 │ 1,044.70 │ 2,832.22 │ 2,779.63 │ 2,534.64 │ 2,054.03 │
+│   Output sequence length │   126.68 │    59.00 │   204.00 │   200.58 │   177.80 │   147.50 │
+│    Input sequence length │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │
+└──────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
+Output token throughput (per sec): 67.40
+Request throughput (per sec): 0.53
+```
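For the BYOD approach documented above, the input file can also be produced programmatically rather than by hand. The following is a minimal sketch, assuming each record needs only the `text_input` and `image` fields shown in the example file; the helper name `write_byod_input` is hypothetical and is not part of GenAI-Perf.

```python
import json
from pathlib import Path


def write_byod_input(samples, output_path):
    """Write (prompt, image_path) pairs as JSONL, one JSON object per line.

    Assumption: the field names "text_input" and "image" mirror the
    example input file in the documentation; this helper is illustrative
    and not provided by GenAI-Perf itself.
    """
    output_path = Path(output_path)
    with output_path.open("w", encoding="utf-8") as f:
        for prompt, image_path in samples:
            record = {"text_input": prompt, "image": str(image_path)}
            # JSONL has no comment syntax, so only JSON objects are written.
            f.write(json.dumps(record) + "\n")
    return output_path


if __name__ == "__main__":
    # Placeholder prompts and paths taken from the documentation example.
    samples = [
        ("What is in this image?", "path/to/image1.png"),
        ("What is the color of the dog?", "path/to/image2.jpeg"),
    ]
    write_byod_input(samples, "input.jsonl")
```

The resulting `input.jsonl` can then be passed to `genai-perf profile` through the `--input-file` option, as in the BYOD command above.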
diff --git a/src/c++/perf_analyzer/genai-perf/docs/tutorial.md b/src/c++/perf_analyzer/genai-perf/docs/tutorial.md
@@ -71,7 +71,6 @@ export RELEASE="yy.mm" # e.g. export RELEASE="24.06"
 docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
 
 # Run GenAI-Perf in the container:
-```bash
 genai-perf profile \
   -m gpt2 \
   --service-kind triton \
@@ -145,7 +144,6 @@ export RELEASE="yy.mm" # e.g. export RELEASE="24.06"
 docker run -it --net=host --gpus=1 nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
 
 # Run GenAI-Perf in the container:
-```bash
 genai-perf profile \
   -m gpt2 \
   --service-kind triton \
@@ -207,7 +205,6 @@ export RELEASE="yy.mm" # e.g. export RELEASE="24.06"
 docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
 
 # Run GenAI-Perf in the container:
-```bash
 genai-perf profile \
   -m gpt2 \
   --service-kind openai \
@@ -270,7 +267,6 @@ docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-
 
 
 # Run GenAI-Perf in the container:
-```bash
 genai-perf profile \
   -m gpt2 \
   --service-kind openai \