Updating documentation for LLM metric support
nv-braf committed Apr 9, 2024
1 parent 8298d83 commit ad399c0
Showing 3 changed files with 51 additions and 11 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -19,12 +19,13 @@ limitations under the License.
# Triton Model Analyzer

> [!Warning]
>
> ##### LATEST RELEASE
>
> You are currently on the `main` branch which tracks under-development progress towards the next release. <br>
> The latest release of the Triton Model Analyzer is 1.38.0 and is available on branch
> [r24.03](https://github.com/triton-inference-server/model_analyzer/tree/r24.03).

Triton Model Analyzer is a CLI tool which can help you find a more optimal configuration, on a given piece of hardware, for single, multiple, ensemble, or BLS models running on a [Triton Inference Server](https://github.com/triton-inference-server/server/). Model Analyzer will also generate reports to help you better understand the trade-offs of the different configurations along with their compute and memory requirements.
<br><br>

@@ -55,6 +56,9 @@ Triton Model Analyzer is a CLI tool which can help you find a more optimal confi
- [Multi-Model Search](docs/config_search.md#multi-model-search-mode): Model Analyzer can help you
find the optimal settings when profiling multiple concurrent models, utilizing the [Quick Search](docs/config_search.md#quick-search-mode) algorithm

- [LLM Search](docs/config_search.md#llm-search): Model Analyzer can help you
find the optimal settings when profiling large language models, utilizing the [Quick Search](docs/config_search.md#quick-search-mode) algorithm

### Other Features

- [Detailed and summary reports](docs/report.md): Model Analyzer is able to generate
26 changes: 16 additions & 10 deletions docs/config.md
Expand Up @@ -303,6 +303,9 @@ cpu_only_composing_models: <comma-delimited-string-list>
# Allows custom configuration of perf analyzer instances used by model analyzer
[ perf_analyzer_flags: <dict> ]
# Allows custom configuration of GenAI-perf instances used by model analyzer
[ genai_perf_flags: <dict> ]
# Allows custom configuration of the environment variables for tritonserver instances
# launched by model analyzer
[ triton_server_environment: <dict> ]
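
For illustration, a minimal sketch of how these dictionaries might be populated (the flag names and values below are assumptions for the sake of example, not recommendations):

```yaml
profile_models:
  - my_model                 # hypothetical model name
perf_analyzer_flags:
  percentile: 95             # forwarded to perf_analyzer (illustrative flag/value)
genai_perf_flags:
  backend: vllm              # forwarded to GenAI-Perf
  streaming: true
triton_server_environment:
  LD_PRELOAD: /path/to/custom_library.so   # example environment variable for tritonserver
```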
@@ -375,7 +378,7 @@ of the types of constraints allowed:
| `perf_throughput` | inf / sec | min | Specify minimum desired throughput. |
| `perf_latency_p99` | ms | max | Specify maximum tolerable latency or latency budget. |
| `output_token_throughput` | tok / sec | min | Specify minimum desired output token throughput. |
| `inter_token_latency_p99` | ms | max | Specify maximum tolerable input token latency. |
| `inter_token_latency_p99` | ms | max | Specify maximum tolerable inter-token latency. |
| `time_to_first_token_p99` | ms | max | Specify maximum tolerable time to first token latency. |
| `gpu_used_memory` | MB | max | Specify maximum GPU memory used by model. |
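
As a sketch, assuming the new LLM metrics follow the same `min`/`max` constraint syntax as the existing metrics, a constraints block might look like this (values are illustrative only):

```yaml
constraints:
  output_token_throughput:
    min: 100        # tok/sec; illustrative value
  inter_token_latency_p99:
    max: 50         # ms; illustrative value
  time_to_first_token_p99:
    max: 200        # ms; illustrative value
```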

@@ -457,15 +460,18 @@ profile_models:
Objectives specify the sorting criteria for the final results. The fields below
are supported under this object type:

| Option Name | Description |
| :----------------- | :----------------------------------------------------- |
| `perf_throughput` | Use throughput as the objective. |
| `perf_latency_p99` | Use latency as the objective. |
| `gpu_used_memory` | Use GPU memory used by the model as the objective. |
| `gpu_free_memory` | Use GPU memory not used by the model as the objective. |
| `gpu_utilization` | Use the GPU utilization as the objective. |
| `cpu_used_ram` | Use RAM used by the model as the objective. |
| `cpu_free_ram` | Use RAM not used by the model as the objective. |
| Option Name | Description |
| :------------------------ | :----------------------------------------------------- |
| `perf_throughput` | Use throughput as the objective. |
| `perf_latency_p99` | Use latency as the objective. |
| `gpu_used_memory` | Use GPU memory used by the model as the objective. |
| `gpu_free_memory` | Use GPU memory not used by the model as the objective. |
| `gpu_utilization` | Use the GPU utilization as the objective. |
| `cpu_used_ram` | Use RAM used by the model as the objective. |
| `cpu_free_ram` | Use RAM not used by the model as the objective. |
| `output_token_throughput` | Use output token throughput as the objective. |
| `inter_token_latency_p99` | Use inter-token latency as the objective.               |
| `time_to_first_token_p99` | Use time to first token latency as the objective. |
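
For instance, assuming the new metrics can be listed the same way as the existing ones, an `objectives` sketch that sorts results by the LLM metrics might look like:

```yaml
profile_models:
  my_llm_model:        # hypothetical model name
    objectives:
      - output_token_throughput
      - inter_token_latency_p99
```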

An example `objectives` that will sort the results by throughput looks like
below:
30 changes: 30 additions & 0 deletions docs/config_search.md
@@ -24,6 +24,7 @@ limitations under the License.
- [Quick Search Mode](#quick-search-mode)
- [Ensemble Model Search](#ensemble-model-search)
- [BLS Model Search](#bls-model-search)
- [LLM Search](#llm-search)
- [Multi-Model Search Mode](#multi-model-search-mode)

<br>
@@ -303,6 +304,35 @@ After Model Analyzer has found the best config(s), it will then sweep the top-N

---

## LLM Search

_This mode has the following limitations:_

- Summary/Detailed reports do not include the new metrics

LLMs can be optimized using either Quick or Brute search mode by setting `--model-type LLM`. You can specify CLI options to the GenAI-Perf tool using `genai_perf_flags`. See the [GenAI-Perf CLI](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/genai-perf/README.md#cli) documentation for a list of the flags that can be specified.

_An example model analyzer YAML config for an LLM:_

```yaml
model_repository: /path/to/model/repository/
model_type: LLM
client_protocol: grpc
genai_perf_flags:
backend: vllm
streaming: true
```

For LLMs, three new metrics are reported: **Inter-token Latency**, **Time to First Token Latency**, and **Output Token Throughput**.

These new metrics can be specified as either objectives or constraints.

_**NOTE: To enable these new metrics, you must set `streaming: true` in `genai_perf_flags` and set `client_protocol` to `grpc`.**_
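
Putting it together, a sketch of an LLM profile that uses the new metrics as objectives and constraints could look like the following (the model name and all numeric values are assumptions for illustration):

```yaml
model_repository: /path/to/model/repository/
model_type: LLM
client_protocol: grpc

genai_perf_flags:
  backend: vllm
  streaming: true     # required for the new LLM metrics

profile_models:
  my_llm:                               # hypothetical model name
    objectives:
      - output_token_throughput
    constraints:
      time_to_first_token_p99:
        max: 200                        # ms; illustrative value
      inter_token_latency_p99:
        max: 50                         # ms; illustrative value
```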

---

## Multi-Model Search Mode

_This mode has the following limitations:_
