The steps below will guide you on how to start using Perf Analyzer.
Start a Triton Server container, which will serve the model that Perf Analyzer measures:

```bash
export RELEASE=<yy.mm> # e.g. for the release from the end of February of 2023, use: export RELEASE=23.02

docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3

docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3
```
Inside the Triton container, download the example `simple` model and place it in a model repository:

```bash
# inside triton container
git clone --depth 1 https://github.com/triton-inference-server/server

mkdir model_repository ; cp -r server/docs/examples/model_repository/simple model_repository
```
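If you want to confirm the copy worked, you can list the repository. As a rough sketch (the exact files shipped with the `simple` example may vary by release), a Triton model repository contains one directory per model, holding a `config.pbtxt` and numbered version subdirectories:

```bash
# inside triton container
# List the model repository; expect something along the lines of:
#   model_repository/simple/config.pbtxt   (model configuration)
#   model_repository/simple/1/...          (version 1 of the model)
ls -R model_repository
```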
Still inside the Triton container, start Triton Server with that model repository and confirm it is ready:

```bash
# inside triton container
tritonserver --model-repository $(pwd)/model_repository &> server.log &

# confirm server is ready, look for 'HTTP/1.1 200 OK'
curl -v localhost:8000/v2/health/ready
```
Detach from the Triton container (CTRL-p CTRL-q), leaving the server running, then start the Triton SDK container, which includes Perf Analyzer:

```bash
docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
```
Inside the SDK container, run Perf Analyzer against the `simple` model:

```bash
# inside sdk container
perf_analyzer -m simple
```
You should see output similar to the following:

```
$ perf_analyzer -m simple
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 25348
    Throughput: 1407.84 infer/sec
    Avg latency: 708 usec (standard deviation 663 usec)
    p50 latency: 690 usec
    p90 latency: 881 usec
    p95 latency: 926 usec
    p99 latency: 1031 usec
    Avg HTTP time: 700 usec (send/recv 102 usec + response wait 598 usec)
  Server:
    Inference count: 25348
    Execution count: 25348
    Successful request count: 25348
    Avg request latency: 382 usec (overhead 41 usec + queue 41 usec + compute input 26 usec + compute infer 257 usec + compute output 16 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1407.84 infer/sec, latency 708 usec
```
The output shows that the model completed 1407.84 inferences per second, with an average latency of 708 microseconds per inference request. A request concurrency of 1 means that Perf Analyzer tried to keep exactly one request in flight at all times.
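To see how throughput and latency change as client load increases, you can have Perf Analyzer sweep over a range of concurrency values with the `--concurrency-range` flag. The sketch below uses arbitrary start, end, and step values chosen for illustration:

```bash
# inside sdk container
# Sweep request concurrency from 1 to 4 in steps of 1; Perf Analyzer reports
# throughput and latency for each concurrency level, which helps locate the
# point where added load stops improving throughput and only adds latency.
perf_analyzer -m simple --concurrency-range 1:4:1
```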