Add prefix cache stats to usage #2018

Open · wants to merge 5 commits into main
Conversation

ispobock (Contributor)

Motivation

#1942

Modification

Log prefix cache statistics.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

src/turbomind/models/llama/SequenceManager.cc (review thread on an outdated diff)

// match prefix cache
for (int i = 0; i < sequences.size(); i++) {
    if (!sequences[i]->prompt.empty() && sequences[i]->blocks.empty()) {
        auto& seq = const_cast<Sequence&>(*sequences[i]);
        block_trie_->match(seq);
        seq.cache_len = seq.blocks.size() * block_seq_len_;
        if (rank_ == 0) {
            // ... (excerpt truncated; the per-sequence logging discussed below follows here)
Collaborator:
Printing for each sequence may be costly. Could we do a complete count and then print only once? What is your suggestion?

ispobock (Contributor Author):

When tp > 1, the count is the same on every device, so we don't need to print it from all of them.

Collaborator:

My point is that the statistics could be calculated and printed once per batch rather than per sequence; this has nothing to do with tp.

ispobock (Contributor Author):

Do you mean we can calculate and log an average hit_rate for a batch of sequences in one iteration?

Collaborator:

I think that would be acceptable. The volume of logs decreases significantly, and at the same time we can still intuitively see the hit rate at the batch level. Do you have any suggestions? @lzhangzz @irexyc @lvhan028
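
For illustration only, a batch-level aggregation along these lines could look roughly like the sketch below. It reuses the loop from the diff excerpt above; the accumulator names (matched_tokens, prompt_tokens) and the exact TM_LOG_INFO message are assumptions, not the PR's actual code.

// Sketch: accumulate prefix-cache hits across the whole batch, then log once on rank 0.
size_t matched_tokens = 0;
size_t prompt_tokens  = 0;
for (int i = 0; i < sequences.size(); i++) {
    if (!sequences[i]->prompt.empty() && sequences[i]->blocks.empty()) {
        auto& seq = const_cast<Sequence&>(*sequences[i]);
        block_trie_->match(seq);
        seq.cache_len = seq.blocks.size() * block_seq_len_;
        matched_tokens += seq.cache_len;                // tokens served from the prefix cache
        prompt_tokens  += sequences[i]->prompt.size();  // total prompt tokens in the batch
    }
}
if (rank_ == 0 && prompt_tokens > 0) {
    TM_LOG_INFO("[prefix cache] batch hit rate: %.2f (%zu / %zu tokens)",
                static_cast<float>(matched_tokens) / prompt_tokens,
                matched_tokens, prompt_tokens);
}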

josephrocca commented Jul 14, 2024:

@zhyncs For my use case, what would be most useful is to simply add the number of cached prefix tokens to the API response, e.g. in the last EventStream object. Otherwise I need to write my own parser for the logs just to get this info for each request.

E.g. something like this would be great as the final EventStream item:

{
  "id": "15964",
  "object": "text_completion",
  "created": 1720934754,
  "model": "repo/modelname", 
  "choices": [{"index":0,"text":"\n","logprobs":null,"finish_reason":"stop"}],
  "request_summary": {
    "input_tokens": 2542,
    "output_tokens": 230,
    "prefix_cached_tokens": 1740,
    "prefill_time": 0.4,
    "decoding_time": 2.3
  }
}

I've added some other example info there that would be useful to me, just to show how request_summary (or usage or similar) could be extended in the future; a couple of the fields were inspired by this issue.

There is precedent for adding extra info to the final EventStream object, e.g. the (full) generated_text in the final TGI EventStream item, and usage in the final OpenAI API EventStream item:

An optional field that will only be present when you set stream_options: {"include_usage": true} in your request. When present, it contains a null value except for the last chunk which contains the token usage statistics for the entire request.

The reason this would be beneficial for me is that I collect statistics at my load-balancing server, keyed by GPU type, hosting provider, etc., and if I only needed to inspect the request_summary field of the final EventStream object, that would make things very easy for me.

However, if this is not possible, I will certainly benefit from the current logging approach. Thanks for your work on this @ispobock!

ispobock (Contributor Author):

Hi @josephrocca, adding prefix_cached_tokens to the usage of the final EventStream item is technically feasible, but it will involve API changes. Currently, only non-streaming responses have a usage field, which contains request-level information such as prompt_tokens, completion_tokens, and total_tokens. @AllentDan may help confirm this. Whether to add prefix_cached_tokens to usage and whether to attach usage to the streaming response still need to be discussed.
cc: @lvhan028 @lzhangzz @irexyc @zhyncs

Collaborator:

@ispobock I think the streaming response returns the "usage" when the last token is generated.

zhyncs (Collaborator) commented Jul 13, 2024:

@zhulinJulia24 The unit-test and pr_ete_test failed because of "No space left on device". Could you help fix this? Thanks.

zhyncs (Collaborator) commented Jul 15, 2024:

@zhulinJulia24 The unit-test and pr_ete_test failed because of "No space left on device". Could you help fix this? Thanks.

cc @zhulinJulia24

zhyncs (Collaborator) commented Jul 15, 2024:

/usr/bin/ld: final link failed: No space left on device

ispobock (Contributor Author):

@lvhan028 @AllentDan Could you help review this PR? Do you have any suggestions for the API changes mentioned in #2018 (comment)?

lvhan028 (Collaborator):

Hi, @ispobock
I don't think adding logs meets the requirement of @josephrocca.
LMDeploy returns usage to the client when the generation is completed, including "prompt_tokens", "completion_tokens", and "total_tokens".
Perhaps we can extend the output tensors by adding statistics about prefix caching.

std::unordered_map<std::string, ft::Tensor> output_tensors = std::unordered_map<std::string, ft::Tensor>{
      {"output_ids",
       ft::Tensor{ft::MEMORY_CPU,
                  ft::TYPE_UINT32,
                  std::vector<size_t>{request_batch_size, beam_width, (size_t)instance_->session_len},
                  d_output_ids_}},
      {"sequence_length",
       ft::Tensor{ft::MEMORY_CPU,
                  ft::TYPE_UINT32,
                  std::vector<size_t>{request_batch_size, beam_width},
                  d_sequence_lengths_}}};
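
A minimal sketch of that extension, assuming the engine fills a per-request host buffer with the number of prefix-cached tokens (the tensor name "prefix_cached_tokens" and the buffer d_prefix_cached_tokens_ are hypothetical, not existing identifiers):

// Sketch: an additional output tensor carrying per-request prefix-cache statistics.
output_tensors.insert({"prefix_cached_tokens",
                       ft::Tensor{ft::MEMORY_CPU,
                                  ft::TYPE_UINT32,
                                  std::vector<size_t>{request_batch_size},
                                  d_prefix_cached_tokens_}});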

ispobock changed the title from "Add log info for prefix cache" to "Add prefix cache stats to usage" on Jul 24, 2024.
ispobock (Contributor Author):

@lvhan028 @josephrocca prefix_cached_tokens has been added to usage (see the following example). Please help check and review.

server:

lmdeploy serve api_server /workdir/llama2_13b_chat --server-name 0.0.0.0 --server-port 23333 --tp 1 --enable-prefix-caching

request:

curl -X POST http://0.0.0.0:23333/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/workdir/llama2_13b_chat",
"prompt": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you do not know the answer to a question, please do not share false information. Please introduce yourself.",
"max_tokens": 50,
"temperature": 1
}'

response:

{"id":"4","object":"text_completion","created":1721867570,"model":"/workdir/llama2_13b_chat","choices":[{"index":0,"text":"\n\n Hello! I'm just an AI, I'm here to help answer any questions you may have. My name is LLaMA, and I'm here to assist you in a helpful, respectful, and honest manner.","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":117,"total_tokens":168,"completion_tokens":51,"prefix_cached_tokens":64}}

josephrocca:
That's great @ispobock! Thank you! I use stream:true for my use case - is the same usage data added to the final event stream JSON line too, like with OpenAI's API?

ispobock (Contributor Author):

@josephrocca Added usage to the final stream response, please check again.
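
(Illustrative only: with stream: true, the final SSE chunk would then carry the same usage block as the non-streaming example above, roughly along these lines; this is not captured output.)

data: {"id":"4","object":"text_completion","created":1721867570,"model":"/workdir/llama2_13b_chat","choices":[{"index":0,"text":"","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":117,"total_tokens":168,"completion_tokens":51,"prefix_cached_tokens":64}}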

lvhan028 (Collaborator) commented Aug 9, 2024:

Hi all, given that we are at a breaking point with #2090, I am going to postpone this PR until #2090 is merged. I am sorry for the inconvenience.

akai-shuuichi:

(Quoting the comment above:) Hi all, given that we are at a breaking point with #2090, I will postpone this PR until #2090 is merged. I am sorry for the inconvenience.

Is there a plan to release this?
