[Feature] The cache-max-entry-count working off percentages makes it difficult to set up multiple servers #2732
Comments
I wish a single API server could serve multiple models and share the kv cache between them. That would be the best solution to this.
We will not do multiple models sharing a kv cache.
It would serve two models, each model owning its own kv cache, and each cache holding at most 10,000 tokens.
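For illustration only, here is what that per-model absolute budget might look like through lmdeploy's Python API. Note that `cache_max_tokens` is an invented parameter name used to sketch the idea; today `TurbomindEngineConfig` only exposes the fractional `cache_max_entry_count`:

```python
# Hypothetical sketch: `cache_max_tokens` is an INVENTED name, not a
# real lmdeploy option. Today TurbomindEngineConfig exposes only the
# fractional `cache_max_entry_count`.
from lmdeploy import pipeline, TurbomindEngineConfig

# Two models on one machine, each with its own kv cache capped at an
# absolute 10,000 tokens, regardless of launch order.
pipe_a = pipeline('model-a',
                  backend_config=TurbomindEngineConfig(cache_max_tokens=10_000))
pipe_b = pipeline('model-b',
                  backend_config=TurbomindEngineConfig(cache_max_tokens=10_000))
```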
I think that would be better, though I'd prefer to specify it in absolute byte amounts, or even as a percentage taken before (and not after) the model is loaded. Right now, the way the fractions are used is very awkward. If I put in 1.0 or 0.99, I get an out-of-memory error. If I put in 0.96, it silently freezes and doesn't load the API server properly (at least on the …). I can't use multiple models without resorting to containers and allocating each one the specific GPU ids that I want it to use (a sketch of that workaround follows below).
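For reference, the container workaround can also be approximated at the process level by pinning each server to its own GPUs with `CUDA_VISIBLE_DEVICES`. A rough sketch, assuming the current lmdeploy CLI flags; the model names and ports below are placeholders:

```python
import os
import subprocess

def launch(model: str, gpus: str, port: int, frac: float) -> subprocess.Popen:
    """Start an api_server pinned to a subset of GPUs so the two
    servers' fractional cache settings cannot interact."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    return subprocess.Popen(
        ['lmdeploy', 'serve', 'api_server', model,
         '--tp', str(len(gpus.split(','))),
         '--cache-max-entry-count', str(frac),
         '--server-port', str(port)],
        env=env,
    )

launch('small-model', '0,1,2,3', 23333, 0.5)
launch('large-model', '4,5,6,7', 23334, 0.5)
```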
On review, I think your suggestion would be the most precise, but if you commit to it, do give users a way of finding out how much memory each token takes (see the sketch below).
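A minimal sketch of that calculation, using the standard transformer kv-cache arithmetic (one K and one V vector per layer, per token); the config numbers below are hypothetical, roughly 70B-class:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """One K and one V vector per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical 70B-class config: 80 layers, 8 kv heads (GQA),
# head_dim 128, fp16 (2 bytes per element).
per_token = kv_cache_bytes_per_token(80, 8, 128)
print(per_token / 1024)            # 320.0 KiB per token
print(10_000 * per_token / 2**30)  # ~3.05 GiB for a 10,000-token budget
```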
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
Motivation
I have a DGX machine, and I want to run multiple models on it. If I launched multiple servers concurrently (with tp=8), it would be ambiguous how much memory each would actually take up. To see why, consider what happens if I spin up a 1 GB model with this set to 0.5 and then spin up a 100 GB model with this set to 0.5.
The 1 GB model would get half of the 640 GB of memory that the machine has, which is ~320 GB, but the large model would only be able to allocate ~110 GB for its kv cache.
If I launched them in the opposite order, the large model would get 270 GB, and the smaller one would get ~135 GB.
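To make the arithmetic concrete, here is a small sketch that reproduces those numbers, assuming the fraction is applied to whatever GPU memory is still free when each server starts:

```python
def kv_alloc(total_gb: float, used_gb: float, fraction: float) -> float:
    """kv cache granted at launch: `fraction` of the memory still free."""
    return fraction * (total_gb - used_gb)

TOTAL = 640.0  # 8 x 80 GB DGX

# Small (1 GB) model first, then large (100 GB) model, both at 0.5:
small_kv = kv_alloc(TOTAL, 1, 0.5)                   # 319.5 -> ~320 GB
large_kv = kv_alloc(TOTAL, 1 + small_kv + 100, 0.5)  # ~110 GB

# Opposite launch order:
large_kv2 = kv_alloc(TOTAL, 100, 0.5)                  # 270 GB
small_kv2 = kv_alloc(TOTAL, 100 + large_kv2 + 1, 0.5)  # ~135 GB
```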
For that reason, it would be more convenient to be able to specify cache sizes in absolute amounts.