
[Feature] The cache-max-entry-count working off percentages makes it difficult to set up multiple servers #2732

Open
mrakgr opened this issue Nov 9, 2024 · 4 comments


mrakgr commented Nov 9, 2024

Motivation

  --cache-max-entry-count CACHE_MAX_ENTRY_COUNT
                        The percentage of free gpu memory occupied by the k/v cache, excluding weights . Default: 0.8. Type: float

I have a DGX machine, and I want to run multiple models on it. If I launch multiple servers concurrently (with tp=8), it is ambiguous how much memory each one will actually take up. To see why, consider what would happen if I spun up a 1 GB model with this set to 0.5, and then spun up a 100 GB model with this set to 0.5.

The 1 GB model would get half of the 640 GB of memory that the machine has, which is 320 GB, but the large model would only be able to allocate about 110 GB for its kv cache.

If I launched them in the opposite order, the large model would get 270 GB, and the smaller one would get ~135 GB (see the sketch below).

For that reason, it would be more convenient to be able to specify cache sizes in absolute amounts.
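
A back-of-the-envelope sketch of that arithmetic, assuming 640 GB of total GPU memory and ignoring activation and runtime overhead; the numbers are illustrative, not measured:

    # Illustrative only: with a percentage-of-*free*-memory setting, the
    # resulting kv cache sizes depend on the order the servers are launched in.
    TOTAL_GB = 640  # assumed: 8 x 80 GB

    def launch_order(models):
        """models: list of (name, weights_gb, cache_fraction) tuples."""
        free = TOTAL_GB
        for name, weights, frac in models:
            free -= weights              # load the weights
            cache = frac * free          # take the fraction of what is left
            free -= cache
            print(f"{name}: kv cache ~{cache:.0f} GB, free left ~{free:.0f} GB")

    launch_order([("1 GB model", 1, 0.5), ("100 GB model", 100, 0.5)])
    # 1 GB model: kv cache ~320 GB; 100 GB model: kv cache ~110 GB
    launch_order([("100 GB model", 100, 0.5), ("1 GB model", 1, 0.5)])
    # 100 GB model: kv cache ~270 GB; 1 GB model: kv cache ~135 GB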

Related resources

No response

Additional context

No response


mrakgr commented Nov 9, 2024

I wish a single api server could serve multiple models and share the kv cache between them. That would be the best solution to this.

lvhan028 self-assigned this Nov 13, 2024
lvhan028 (Collaborator) commented

We will not have multiple models share a kv cache.
Instead, we will change --cache-max-entry-count to refer to the number of cached kv tokens shared by the inference instances of a model.
For example,

lmdeploy serve api_server InternLM/internlm2_5-7b-chat --cache-max-entry-count 10000
lmdeploy serve api_server Qwen/Qwen2.5-7B-Instruct --cache-max-entry-count 10000

This serves two models; each model owns its own kv cache, which stores at most 10000 tokens.


mrakgr commented Nov 13, 2024

I think that would be better, though I'd prefer to specify it in absolute byte amounts, or even as a percentage taken before (not after) the model is loaded. Right now, the way the fraction is used is very awkward. If I put in 1.0 or 0.99, I get an out-of-memory error. If I put in 0.96, it silently freezes and never brings the API server up properly (at least with nvidia/Llama-3.1-Nemotron-70B-Instruct-HF on the HGX machine that I am testing on). With 0.95 it works.

I can't use multiple models without resorting to containers and pinning each one to the specific GPU ids I want it to use (a sketch of what I mean is below).
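
As a sketch of that workaround without containers, each server can be pinned to its own GPUs via CUDA_VISIBLE_DEVICES, so each one computes its percentage against only the GPUs it can see; the flag names other than --cache-max-entry-count, and the ports, are assumptions about the installed lmdeploy CLI:

    # Sketch: launch two api_servers on disjoint GPU sets so that their
    # percentage-based cache settings no longer interact.
    import os
    import subprocess

    def launch(model, gpus, port, cache_frac):
        """Start one api_server that only sees the GPUs listed in `gpus`."""
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
        cmd = [
            "lmdeploy", "serve", "api_server", model,
            "--tp", str(len(gpus.split(","))),           # one rank per visible GPU
            "--server-port", str(port),
            "--cache-max-entry-count", str(cache_frac),
        ]
        return subprocess.Popen(cmd, env=env)

    small = launch("InternLM/internlm2_5-7b-chat", "0,1,2,3", 23333, 0.5)
    large = launch("nvidia/Llama-3.1-Nemotron-70B-Instruct-HF", "4,5,6,7", 23334, 0.5)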

This serves two models; each model owns its own kv cache, which stores at most 10000 tokens.

On review, I think your suggestion would be the most precise, but if you commit to it, please give the user a way of finding out how much memory each token takes.
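
For context, a rough way to estimate that from a model's config.json; the config values below are assumed for internlm2_5-7b-chat (32 layers, 8 kv heads, head dim 128, 16-bit cache), and real engines add block/padding overhead, so treat the result as a lower bound:

    # Per-token kv cache size = key + value, per layer, per kv head, per head dim.
    def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
        return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 = K and V

    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    print(per_token)                  # 131072 bytes, i.e. 128 KiB per token
    print(10000 * per_token / 2**30)  # ~1.22 GiB for a 10000-token cache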

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
