
[Feature] The cache-max-entry-count working off percentages makes it difficult to set up multiple servers #2732

Open
mrakgr opened this issue Nov 9, 2024 · 4 comments


mrakgr commented Nov 9, 2024

Motivation

  --cache-max-entry-count CACHE_MAX_ENTRY_COUNT
                        The percentage of free gpu memory occupied by the k/v cache, excluding weights . Default: 0.8. Type: float

I have a DGX machine, and I want to run multiple models on it. If I launch multiple servers concurrently (with tp=8), it is ambiguous how much memory each one will actually take up. To see why, consider what would happen if I spun up a 1 GB model with this set to 0.5, and then spun up a 100 GB model with this set to 0.5.

The 1 GB model would get half of the 640 GB of memory that the machine has, which is 320 GB, but the large model would only be able to allocate about 110 GB for its kv cache.

If I launched them in the opposite order, the large model would get 270 GB, and the smaller one would get ~135 GB (see the sketch below).

For that reason, it would be more convenient to be able to specify cache sizes in absolute amounts.
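
A back-of-the-envelope sketch of that arithmetic, assuming 640 GB of total GPU memory and ignoring activation and runtime overhead; the numbers are illustrative, not measured:

    # Illustrative only: with a percentage-of-*free*-memory setting, the
    # resulting kv cache sizes depend on the order the servers are launched in.
    TOTAL_GB = 640  # assumed: 8 x 80 GB

    def launch_order(models):
        """models: list of (name, weights_gb, cache_fraction) tuples."""
        free = TOTAL_GB
        for name, weights, frac in models:
            free -= weights              # load the weights
            cache = frac * free          # take the fraction of what is left
            free -= cache
            print(f"{name}: kv cache ~{cache:.0f} GB, free left ~{free:.0f} GB")

    launch_order([("1 GB model", 1, 0.5), ("100 GB model", 100, 0.5)])
    # 1 GB model: kv cache ~320 GB; 100 GB model: kv cache ~110 GB
    launch_order([("100 GB model", 100, 0.5), ("1 GB model", 1, 0.5)])
    # 100 GB model: kv cache ~270 GB; 1 GB model: kv cache ~135 GB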

Related resources

No response

Additional context

No response


mrakgr commented Nov 9, 2024

I wish a single api server could serve multiple models and share the kv cache between them. That would be the best solution to this.

lvhan028 self-assigned this Nov 13, 2024
lvhan028 (Collaborator) commented

We will not have multiple models share a kv cache.
Instead, we will change --cache-max-entry-count to refer to the number of cached kv tokens shared by the inference instances of a model.
For example,

lmdeploy serve api_server InternLM/internlm2_5-7b-chat --cache-max-entry-count 10000
lmdeploy serve api_server Qwen/Qwen2.5-7B-Instruct --cache-max-entry-count 10000

This serves two models; each model owns its own kv cache, which stores at most 10000 tokens.


mrakgr commented Nov 13, 2024

I think that would be better, though I'd prefer to specify it in absolute byte amounts, or even as a percentage taken before (not after) the model is loaded. Right now, the way the fraction is used is very awkward. If I put in 1.0 or 0.99, I get an out-of-memory error. If I put in 0.96, it silently freezes and never brings the API server up properly (at least with nvidia/Llama-3.1-Nemotron-70B-Instruct-HF on the HGX machine that I am testing on). With 0.95 it works.

I can't use multiple models without resorting to containers and pinning each one to the specific GPU ids I want it to use (a sketch of what I mean is below).
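
As a sketch of that workaround without containers, each server can be pinned to its own GPUs via CUDA_VISIBLE_DEVICES, so each one computes its percentage against only the GPUs it can see; the flag names other than --cache-max-entry-count, and the ports, are assumptions about the installed lmdeploy CLI:

    # Sketch: launch two api_servers on disjoint GPU sets so that their
    # percentage-based cache settings no longer interact.
    import os
    import subprocess

    def launch(model, gpus, port, cache_frac):
        """Start one api_server that only sees the GPUs listed in `gpus`."""
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
        cmd = [
            "lmdeploy", "serve", "api_server", model,
            "--tp", str(len(gpus.split(","))),           # one rank per visible GPU
            "--server-port", str(port),
            "--cache-max-entry-count", str(cache_frac),
        ]
        return subprocess.Popen(cmd, env=env)

    small = launch("InternLM/internlm2_5-7b-chat", "0,1,2,3", 23333, 0.5)
    large = launch("nvidia/Llama-3.1-Nemotron-70B-Instruct-HF", "4,5,6,7", 23334, 0.5)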

This serves two models; each model owns its own kv cache, which stores at most 10000 tokens.

On review, I think your suggestion would be the most precise, but if you commit to it, please give the user a way of finding out how much memory each token takes.
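
For context, a rough way to estimate that from a model's config.json; the config values below are assumed for internlm2_5-7b-chat (32 layers, 8 kv heads, head dim 128, 16-bit cache), and real engines add block/padding overhead, so treat the result as a lower bound:

    # Per-token kv cache size = key + value, per layer, per kv head, per head dim.
    def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
        return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 = K and V

    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    print(per_token)                  # 131072 bytes, i.e. 128 KiB per token
    print(10000 * per_token / 2**30)  # ~1.22 GiB for a 10000-token cache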

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
