How to Control the Number of Model Replicas in ModelMesh Serving #500
Comments
Hi @michael-nammi, have you found a solution?
I found this doc from the model-mesh repository. Hope this will help. |
I'm also trying to understand how to set replicas to a specific, fixed number. It seems model autoscaling is on by default, so I'm not sure if that is possible. Maybe it's only possible by setting |
Unfortunately, there is no way to set a fixed number of replicas for a given model; you can only control it indirectly via the concurrency settings of the serving runtime servers. As far as I know, there are two model-scaling mechanisms:
So, there's no way to stick with one replica!? I'm observing that most of the time I have 2 replicas of the model (MM v0.12.0). That's not great when deploying LLMs. Setting |
It turns out I had 2 Prometheus jobs collecting the same metrics, so the aggregated results were doubled. I actually have 7 models and 7 copies.
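For anyone who runs into the same double counting, here is a minimal sketch of the effect. It is not ModelMesh or Prometheus client code: the sample format (a label dict plus a value), the `pod` and `job` label names, and both function names are assumptions made up for illustration. It only shows why summing over series that differ solely by the scrape job inflates the count, and how collapsing that label first avoids it.

```python
def naive_total(samples):
    """Sum every sample value, counting each scrape job's copy separately."""
    return sum(value for _labels, value in samples)


def deduped_total(samples, dedupe_label="job"):
    """Sum sample values after collapsing series that differ only by one label.

    `samples` is a list of (labels, value) pairs, where `labels` is a dict.
    Series that are identical once `dedupe_label` is ignored are treated as
    duplicates of the same measurement, and only one value is kept.
    """
    unique = {}
    for labels, value in samples:
        # Build a hashable key from all labels except the one being ignored.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != dedupe_label))
        # Duplicates should report the same value; keep the larger to be safe.
        unique[key] = max(value, unique.get(key, value))
    return sum(unique.values())


# Hypothetical data: two scrape jobs report the same pod's loaded-model count.
samples = [
    ({"pod": "runtime-pod-1", "job": "scrape-a"}, 7.0),
    ({"pod": "runtime-pod-1", "job": "scrape-b"}, 7.0),
]
```

With this hypothetical data, `naive_total(samples)` adds both copies together, while `deduped_total(samples)` counts the pod once.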
The idea here is that ModelMesh controls the number of replicas of the serving runtimes, not of the models. You can definitely set the The fact that your model scaled up means there was available capacity in your serving runtimes, so scaling up in that case makes sense to me. If you want to prevent that scaling up, I think that you can set the By the way, I have only just deployed ModelMesh in my company's staging environment (we are going to deploy it to production soon), so perhaps I am missing something about how ModelMesh controls the number of replicas (that part baffled me a lot).
Description
I am working with ModelMesh Serving deployed on a Kubernetes cluster, and I am looking for a way to control the number of replicas of a specific model. My setup includes a Triton runtime with two pods, and I am serving a model named mobilenet. I want to be able to set the number of model replicas to a specific value.
Cluster State:
The state of pods in my cluster is as follows:
Inference service status
The InferenceService for mobilenet (example-mobilenet-isvc) has minReplicas set to 2, as shown in the description below:
ETCD Keys and Values:
Relevant data from ETCD suggests only one replica is active for the model as per the instanceIds and count:
Question:
How can one ensure that ModelMesh Serving adheres to the minReplicas configuration for a specific model? The documentation does not seem to discuss in depth how individual model replicas are scaled across the serving pods. Is there a way to control the number of model replicas in ModelMesh Serving?
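The check I am doing by hand (comparing minReplicas with the number of instances listed in the etcd record) can be sketched roughly as follows. This is only an illustration: it assumes the record is JSON with an `instanceIds` field holding one entry per instance that has the model loaded, and the function names and sample record are hypothetical, not part of ModelMesh.

```python
import json


def loaded_copies(record_json):
    """Count the instances listed under `instanceIds` in a model record.

    Assumes (hypothetically) that the record is a JSON object and that
    `instanceIds` holds one entry per instance with the model loaded.
    """
    record = json.loads(record_json)
    return len(record.get("instanceIds", {}))


def meets_min_replicas(record_json, min_replicas):
    """Return True if the record lists at least `min_replicas` instances."""
    return loaded_copies(record_json) >= min_replicas


# Hypothetical record with a single instance entry (placeholder ID and value).
sample_record = '{"instanceIds": {"instance-a": 1700000000000}}'
```

For the hypothetical record above, `meets_min_replicas(sample_record, 2)` is false, which matches what I am seeing: one loaded copy against a minReplicas of 2.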