The steps below will guide you through using Model Analyzer in Docker mode to profile and analyze a simple BLS model: bls.
1. Create a new directory and enter it
mkdir <new_dir> && cd <new_dir>
2. Start a git repository
git init && git remote add -f origin https://github.com/triton-inference-server/model_analyzer.git
3. Enable sparse checkout, and download the examples directory, which contains the bls and add models
git config core.sparseCheckout true && \
echo 'examples' >> .git/info/sparse-checkout && \
git pull origin main
1. Pull the SDK container:
docker pull nvcr.io/nvidia/tritonserver:24.11-py3-sdk
2. Run the SDK container
docker run -it --gpus 1 \
--shm-size 2G \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $(pwd)/examples/quick-start:$(pwd)/examples/quick-start \
--net=host nvcr.io/nvidia/tritonserver:24.11-py3-sdk
Important: The example above uses a single GPU. If you are running on multiple GPUs, you may need to increase the shared memory size accordingly.
The examples/quick-start directory is an example Triton Model Repository that contains the BLS model bls, which calculates the sum of two inputs using the add model.
An example Model Analyzer YAML config that performs a BLS model search:
model_repository: <path-to-examples-quick-start>
profile_models:
  - bls
bls_composing_models: add
perf_analyzer_flags:
  input-data: <path-to-examples-quick-start>/bls_input_data.json
triton_launch_mode: docker
triton_docker_shm_size: 2G
output_model_repository_path: <path-to-output-model-repo>/<output_dir>
export_path: profile_results
Important: You must specify an <output_dir> subdirectory. output_model_repository_path cannot point directly to <path-to-output-model-repo>.
Important: If you already ran this earlier in the container, you can overwrite the earlier results by adding the override_output_model_repository: true field to the YAML file.
Important: All models must be in the same repository
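The same-repository requirement can be checked before profiling. The following is a minimal sketch, assuming the usual Triton layout in which each model is a subdirectory of the model repository; the helper name missing_models is hypothetical and is not part of Model Analyzer.

```python
from pathlib import Path
import tempfile


def missing_models(model_repository, model_names):
    """Return the model names that have no subdirectory in the repository."""
    repo = Path(model_repository)
    return [name for name in model_names if not (repo / name).is_dir()]


# Demo with a throwaway directory standing in for examples/quick-start.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bls").mkdir()  # only bls is present in this stand-in
    demo_missing = missing_models(tmp, ["bls", "add"])
    print(demo_missing)  # prints ['add']
```

Against the real repository, calling missing_models with the model_repository path and ["bls", "add"] should return an empty list.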
Important: The bls model takes "MODEL_NAME" as one of its inputs. For this example to work, the input data JSON file must set "MODEL_NAME" to "add". Otherwise, Perf Analyzer generates random data for "MODEL_NAME", and the inferences fail.
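A quick way to catch a missing "MODEL_NAME" before profiling is to inspect the input data file. The sketch below assumes the Perf Analyzer JSON layout with a top-level "data" list of entries, where each input is either a flat list or an object with a "content" key. The helper name and the INPUT0 tensor in the sample are hypothetical, and nested multi-stream data is not handled.

```python
import json


def entries_missing_model_name(input_data, expected="add"):
    """Return indices of data entries whose MODEL_NAME is not `expected`."""
    bad = []
    for i, entry in enumerate(input_data.get("data", [])):
        value = entry.get("MODEL_NAME")
        # Inputs may also be written as {"content": [...], "shape": [...]}.
        if isinstance(value, dict):
            value = value.get("content")
        # Tolerate a bare string as well as a one-element list.
        if isinstance(value, str):
            value = [value]
        if value != [expected]:
            bad.append(i)
    return bad


# Illustrative data only; the real file is bls_input_data.json.
sample = json.loads('{"data": [{"MODEL_NAME": ["add"]}, {"INPUT0": [1.0]}]}')
print(entries_missing_model_name(sample))  # prints [1]
```

An empty list means every entry names the add model; any index returned points at an entry for which Perf Analyzer would fall back to random data.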
Run the Model Analyzer profile subcommand inside the container with:
model-analyzer profile -f /path/to/config.yml
Model Analyzer uses the Quick Search algorithm to profile the BLS model. After the quick search completes, Model Analyzer sweeps concurrencies for the top three configurations and then creates a summary report and CSV outputs.
Here is an example result summary, run on a Tesla V100 GPU:
You will note that the top model configuration has a higher throughput than the other configurations.
The measured data and summary report will be placed inside the ./profile_results directory. The directory will be structured as follows.
$HOME
|-- model_analyzer
    |-- profile_results
        |-- perf_analyzer_error.log
        |-- plots
        |   |-- detailed
        |   |   |-- bls_config_7
        |   |   |   `-- latency_breakdown.png
        |   |   |-- bls_config_8
        |   |   |   `-- latency_breakdown.png
        |   |   `-- bls_config_9
        |   |       `-- latency_breakdown.png
        |   `-- simple
        |       |-- bls
        |       |   |-- gpu_mem_v_latency.png
        |       |   `-- throughput_v_latency.png
        |       |-- bls_config_7
        |       |   |-- cpu_mem_v_latency.png
        |       |   |-- gpu_mem_v_latency.png
        |       |   |-- gpu_power_v_latency.png
        |       |   `-- gpu_util_v_latency.png
        |       |-- bls_config_8
        |       |   |-- cpu_mem_v_latency.png
        |       |   |-- gpu_mem_v_latency.png
        |       |   |-- gpu_power_v_latency.png
        |       |   `-- gpu_util_v_latency.png
        |       `-- bls_config_9
        |           |-- cpu_mem_v_latency.png
        |           |-- gpu_mem_v_latency.png
        |           |-- gpu_power_v_latency.png
        |           `-- gpu_util_v_latency.png
        |-- reports
        |   |-- detailed
        |   |   |-- bls_config_7
        |   |   |   `-- detailed_report.pdf
        |   |   |-- bls_config_8
        |   |   |   `-- detailed_report.pdf
        |   |   `-- bls_config_9
        |   |       `-- detailed_report.pdf
        |   `-- summaries
        |       `-- bls
        |           `-- result_summary.pdf
        `-- results
            |-- metrics-model-gpu.csv
            |-- metrics-model-inference.csv
            `-- metrics-server-only.csv
Note: The configurations above (bls_config_7, bls_config_8, and bls_config_9) were generated as the top configurations when profiling on a single Tesla V100 GPU. Running on multiple GPUs or on different GPU models may produce different top configurations.
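Because the top configuration names vary between machines, it can help to discover the generated reports rather than hard-code them. This is a minimal sketch that relies only on the reports/detailed/<config>/detailed_report.pdf layout shown in the listing above; the helper name find_detailed_reports is hypothetical.

```python
from pathlib import Path
import tempfile


def find_detailed_reports(export_path):
    """Map each config name to its detailed_report.pdf under export_path."""
    root = Path(export_path) / "reports" / "detailed"
    return {p.parent.name: p for p in sorted(root.glob("*/detailed_report.pdf"))}


# Demo against a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("bls_config_7", "bls_config_8"):
        report_dir = Path(tmp) / "reports" / "detailed" / name
        report_dir.mkdir(parents=True)
        (report_dir / "detailed_report.pdf").touch()
    demo_reports = sorted(find_detailed_reports(tmp))
    print(demo_reports)  # prints ['bls_config_7', 'bls_config_8']
```

After a real run, passing the export_path from the config (profile_results in this example) would return whichever configurations Model Analyzer ranked highest on that hardware.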