How many instances can Triton support for parallel inference at most? #7641

Open
wwdok opened this issue Sep 22, 2024 · 0 comments

Comments


wwdok commented Sep 22, 2024

Suppose I have a GPU with 24 GB of memory, an A30, and my model is a 200 MB wav2lip model. If I choose TensorRT as the inference framework, can we estimate the maximum number of model instances that can run in parallel under Triton? For example, if the Triton context consumes x GB of GPU memory and each TensorRT workspace occupies 2 GB, is the maximum number of parallel instances (24 - x) / 2? If so, what is x? Currently I am using GPU virtualization to run concurrent model inference, but that solution supports at most 5 parallel instances. I want to know whether Triton could be an alternative that supports more instances, as this is very important for cost savings.
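
For context, here is a minimal config.pbtxt sketch of how I imagine the model would be deployed; the model name, batch size, and instance count are placeholders rather than a working configuration:

```
# models/wav2lip/config.pbtxt -- hypothetical layout, all values are placeholders
name: "wav2lip"
platform: "tensorrt_plan"   # serialized TensorRT engine (model.plan)
max_batch_size: 4           # 4 video frames per request

instance_group [
  {
    count: 10               # number of parallel model instances on this GPU;
    kind: KIND_GPU          # this count is the value I am asking how to bound, e.g. (24 - x) / 2
    gpus: [ 0 ]
  }
]
```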

My use case is to use wav2lip to generate a new lip-synced video from audio. During inference, I take 4 frames of the video as a batch for a single inference, so multiple inferences are needed to produce the complete synthesized video. The simplified workflow is shown below:

[image: simplified wav2lip inference workflow]
How do you think Triton could be integrated into this workflow, and which Triton features could be used?
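
To illustrate the kind of integration I have in mind, here is a rough Python client sketch that sends 4-frame batches to Triton over HTTP; the input/output tensor names and shapes are assumptions I made up, not the real wav2lip interface:

```python
# Hypothetical client loop: send 4-frame batches of a video to Triton for wav2lip inference.
# Tensor names ("frames", "mel", "generated_frames") and shapes are assumptions, not the real model I/O.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def infer_batch(frames: np.ndarray, mel: np.ndarray) -> np.ndarray:
    """Run one 4-frame batch through the 'wav2lip' model and return the generated frames."""
    inputs = [
        httpclient.InferInput("frames", list(frames.shape), "FP32"),
        httpclient.InferInput("mel", list(mel.shape), "FP32"),
    ]
    inputs[0].set_data_from_numpy(frames)
    inputs[1].set_data_from_numpy(mel)
    outputs = [httpclient.InferRequestedOutput("generated_frames")]
    result = client.infer(model_name="wav2lip", inputs=inputs, outputs=outputs)
    return result.as_numpy("generated_frames")

# Iterate over a video in 4-frame chunks and collect the lip-synced output.
video = np.random.rand(120, 3, 96, 96).astype(np.float32)  # placeholder face frames
mels = np.random.rand(120, 1, 80, 16).astype(np.float32)   # placeholder mel chunks
synthesized = [
    infer_batch(video[i:i + 4], mels[i:i + 4])
    for i in range(0, len(video), 4)
]
```

This sketch batches on the client side; I am not sure whether Triton's dynamic batching would be a better fit here, which is part of my question.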
