How many instances can Triton support for parallel inference at most? #7641

Open
wwdok opened this issue Sep 22, 2024 · 0 comments

Comments


wwdok commented Sep 22, 2024

Suppose I have a GPU with 24 GB of memory, an A30, and my model is a 200 MB wav2lip model. If I choose TensorRT as the inference framework, can we estimate the maximum number of model instances that can run in parallel under Triton? For example, if the Triton context consumes x GB of GPU memory and each TensorRT workspace occupies 2 GB, is the maximum number of parallel instances (24 - x) / 2? If so, what is x? Currently I am using GPU virtualization to run concurrent model inference, but that solution supports at most 5 parallel instances. I want to know whether Triton could be an alternative that supports more instances, as this is very important for cost savings.
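
For context, here is a minimal config.pbtxt sketch of how I imagine the model would be deployed; the model name, batch size, and instance count are placeholders rather than a working configuration:

```
# models/wav2lip/config.pbtxt -- hypothetical layout, all values are placeholders
name: "wav2lip"
platform: "tensorrt_plan"   # serialized TensorRT engine (model.plan)
max_batch_size: 4           # 4 video frames per request

instance_group [
  {
    count: 10               # number of parallel model instances on this GPU;
    kind: KIND_GPU          # this count is the value I am asking how to bound, e.g. (24 - x) / 2
    gpus: [ 0 ]
  }
]
```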

My use case is to use wav2lip to generate a new lip-synced video from audio. During inference, I take 4 frames of the video as a batch for a single inference, so multiple inferences are needed to produce the complete synthesized video. The simplified workflow is shown below:

[image: simplified wav2lip inference workflow]
How do you think Triton could be integrated into this workflow, and which Triton features could be used?
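
To illustrate the kind of integration I have in mind, here is a rough Python client sketch that sends 4-frame batches to Triton over HTTP; the input/output tensor names and shapes are assumptions I made up, not the real wav2lip interface:

```python
# Hypothetical client loop: send 4-frame batches of a video to Triton for wav2lip inference.
# Tensor names ("frames", "mel", "generated_frames") and shapes are assumptions, not the real model I/O.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def infer_batch(frames: np.ndarray, mel: np.ndarray) -> np.ndarray:
    """Run one 4-frame batch through the 'wav2lip' model and return the generated frames."""
    inputs = [
        httpclient.InferInput("frames", list(frames.shape), "FP32"),
        httpclient.InferInput("mel", list(mel.shape), "FP32"),
    ]
    inputs[0].set_data_from_numpy(frames)
    inputs[1].set_data_from_numpy(mel)
    outputs = [httpclient.InferRequestedOutput("generated_frames")]
    result = client.infer(model_name="wav2lip", inputs=inputs, outputs=outputs)
    return result.as_numpy("generated_frames")

# Iterate over a video in 4-frame chunks and collect the lip-synced output.
video = np.random.rand(120, 3, 96, 96).astype(np.float32)  # placeholder face frames
mels = np.random.rand(120, 1, 80, 16).astype(np.float32)   # placeholder mel chunks
synthesized = [
    infer_batch(video[i:i + 4], mels[i:i + 4])
    for i in range(0, len(video), 4)
]
```

This sketch batches on the client side; I am not sure whether Triton's dynamic batching would be a better fit here, which is part of my question.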
