
Benchmarking VQA Model with Large Base64-Encoded Input Using perf_analyzer #736

Open
pigeonsoup opened this issue Jul 5, 2024 · 2 comments
Labels
question Further information is requested

Comments

pigeonsoup commented Jul 5, 2024

Hello,

I've been deploying my VQA (Visual Question Answering) model on Triton Server and using the perf_analyzer tool for benchmarking. However, feeding the VQA model random data leads to undefined behavior, so it is crucial to use real input data, which is challenging to construct. Below is the command I used with perf_analyzer:

perf_analyzer -m <model_name> --request-rate-range=10 --measurement-interval=30000 --string-data '{"imageBase64Str": "/9j/4AAQS...D//Z", "textPrompt": "\u8bf7\u5e2e\...\u3002"}'

The model expects a single JSON-formatted string as input, with two fields: 'imageBase64Str', which contains base64-encoded image data, and 'textPrompt', which is the text prompt.

Fortunately, this method works. However, the achieved throughput is disappointingly low: requests average about 500 ms each, even though I set --request-rate-range=10. I also see the following warning:

[WARNING] Perf Analyzer was not able to keep up with the desired request rate. 100.00% of the requests were delayed.

I'm having difficulty benchmarking my model effectively, since it isn't receiving enough requests at present. I suspect that the large base64 payload passed via '--string-data' is contributing to the slowdown. Is there a faster or better way to send requests that would give me a more accurate benchmark?
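One documented alternative to passing a huge inline '--string-data' argument is to put the real request payload in a JSON file and pass it with perf_analyzer's `--input-data` flag. The sketch below builds such a file; it assumes the model's input tensor is a BYTES tensor of shape [1] named `INPUT0` (substitute your actual tensor name), and `build_input_data` is a hypothetical helper, not part of any Triton API:

```python
import base64
import json


def build_input_data(image_path: str, prompt: str,
                     tensor_name: str = "INPUT0",
                     out_path: str = "input_data.json") -> dict:
    """Write a perf_analyzer --input-data JSON file with one real request.

    Assumes the model takes a single BYTES tensor of shape [1] whose value
    is a JSON string with 'imageBase64Str' and 'textPrompt' fields.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    # The model expects both fields packed into one JSON string.
    request_payload = json.dumps(
        {"imageBase64Str": image_b64, "textPrompt": prompt},
        ensure_ascii=False,
    )

    # perf_analyzer's input-data format: a "data" list of request dicts,
    # each mapping tensor name -> list of (string) element values.
    doc = {"data": [{tensor_name: [request_payload]}]}
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(doc, f, ensure_ascii=False)
    return doc
```

The benchmark would then be run with something like `perf_analyzer -m <model_name> --input-data input_data.json ...`, keeping the large base64 string out of the command line entirely.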

Best regards,

@pigeonsoup pigeonsoup changed the title perf_analyzer make request for multi-modal input Benchmarking VQA Model with Large Base64-Encoded Input Using perf_analyzer Jul 5, 2024
@dyastremsky dyastremsky added the question Further information is requested label Jul 24, 2024
@dyastremsky (Contributor)

Do you want to try the async client (--async mode)? It is likely to work better for you. Note that there is also a fix in 24.07 that helps the async client send requests at exactly the rate you request (in some cases, it can fall behind). If you want to stay with the synchronous client, you can also set a higher concurrency value or use batching (-b).

CC: @matthewkotila in case I'm missing why a synchronous client is having trouble achieving the requested rate. I would assume there'd be enough workers (16) to send all the requests for this request rate (-b and --num-workers were not specified), but Matt knows the details of how PA works far better than I do.

@matthewkotila (Contributor)

Async seems like a good approach to try. I can't remember the default behavior, but you'll want to make sure that --max-threads is at least as large as your desired request rate.
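A back-of-envelope check with Little's law (concurrency = throughput × latency) shows why a client with too few in-flight requests falls behind. This is only an illustrative sketch; `min_concurrent_requests` is a hypothetical helper, not part of perf_analyzer:

```python
import math


def min_concurrent_requests(target_rate_hz: float, latency_s: float) -> int:
    """Little's law: to sustain target_rate_hz when each request takes
    latency_s seconds, at least rate * latency requests must be in flight
    at once (rounded up)."""
    return math.ceil(target_rate_hz * latency_s)


# At ~500 ms per request, a target of 10 req/s needs at least 5
# requests in flight; a strictly serial client tops out at ~2 req/s.
print(min_concurrent_requests(10, 0.5))
```

This suggests that with ~500 ms latency, any configuration allowing fewer than 5 concurrent requests (whether via threads, workers, or sync blocking) cannot reach the requested 10 req/s.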
