Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache wrong code #34232

Open · mdy666 opened this issue Oct 18, 2024 · 13 comments · May be fixed by #34746

@mdy666
mdy666 commented Oct 18, 2024

System Info

Although this method is rarely used, there is a small bug in it.
[screenshot of the method in question]

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

nan

Expected behavior

fix

@mdy666 mdy666 added the bug label Oct 18, 2024
@zucchini-nlp
Member

Hey! Can you please elaborate on what is wrong in this method? It is used when we do beam search and contrastive generation, afaik.

cc @gante

@mdy666
Author

mdy666 commented Oct 18, 2024

Hey! Can you please elaborate on what is wrong in this method? It is used when we do beam search and contrastive generation, afaik.

cc @gante

Maybe it should be "value_cache" rather than "key_cache", but I don't know this code well.
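
To illustrate what I mean, here's a toy sketch of that kind of copy-paste slip (not the actual transformers source; the class and method names are made up):

# Toy sketch only -- not the real transformers code.
class ToyCache:
    def __init__(self):
        self.key_cache = []    # one tensor per layer
        self.value_cache = []  # one tensor per layer

    def split_by_batch(self, full_batch_size, split_size):
        """Split the cached tensors along the batch dimension."""
        splits = []
        for i in range(0, full_batch_size, split_size):
            part = ToyCache()
            part.key_cache = [t[i : i + split_size] for t in self.key_cache]
            # The kind of slip being reported: this line slices key_cache
            # where it should slice value_cache.
            part.value_cache = [t[i : i + split_size] for t in self.key_cache]
            splits.append(part)
        return splits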

@zucchini-nlp
Member

Oh right, didn't notice it! Yes, that needs to be fixed, and it's weird that no tests caught it. Feel free to open a PR if you are willing to 😄 and tag @gante for review. If you don't have the bandwidth, we'll make sure to fix it soon.

Thanks for reporting!

@mearcstapa-gqz

mearcstapa-gqz commented Oct 18, 2024

@zucchini-nlp Hi, I want to use this thread to ask a somewhat related question. I basically want to extend the "Re-use Cache to continue generation" tutorial (https://huggingface.co/docs/transformers/en/kv_cache#re-use-cache-to-continue-generation) to the batched case, but the model gives erroneous output. From preliminary debugging, I suspect it's because of the default left padding, so the cache positions are not aligned correctly (not sure whether the bug from this issue contributes as well). Is there any existing code I can refer to? Thanks!

@zucchini-nlp
Member

@mearcstapa-gqz do you mean using a batched cache from the pre-fill stage in batched generation, or using the same pre-fill prompt and then continuing generation with multiple texts at once? Please share your minimal code and I'll see what the error might be; expanding to batched generation should be straightforward unless I'm missing something.

@mearcstapa-gqz

mearcstapa-gqz commented Oct 18, 2024

@zucchini-nlp Thanks! On second look, I noticed that the example at https://huggingface.co/docs/transformers/en/kv_cache#re-use-cache-to-continue-generation is indeed batched; I got it wrong when I saw "max_batch_size=1" in the StaticCache arguments. The example code uses a for-loop over the prompts.

My use case is the same as the example code. I have
texts=[f"<|im_start|>system\n{SOME_SHARED_SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{SOME_SHARED_USER_PREFIX}{query}<|im_end|>\n<|im_start|>assistant\n" for query in queries]
and I want to cache the f"<|im_start|>system\n{SOME_SHARED_SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{SOME_SHARED_USER_PREFIX}" part.

I'll try to debug it myself first; should it fail, I'll provide a minimal example and ask for help again.
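
For reference, the pattern from the linked docs looks roughly like this (sketched here with a DynamicCache; the tutorial itself uses a StaticCache with max_batch_size=1, and the exact model and prompts may differ). What I'd like is to replace the per-prompt loop with a single batched generate call:

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

INITIAL_PROMPT = "You are a helpful assistant. "
prompts = ["Help me to write a blogpost about travelling.", "What is the capital of France?"]

# Pre-fill the cache once with the shared prefix (batch size 1, no grad so it can be copied).
prompt_cache = DynamicCache()
inputs_initial_prompt = tokenizer(INITIAL_PROMPT, return_tensors="pt").to(model.device)
with torch.no_grad():
    prompt_cache = model(**inputs_initial_prompt, past_key_values=prompt_cache).past_key_values

# Then generate for one prompt at a time, deep-copying the pre-filled cache each iteration.
responses = []
for prompt in prompts:
    new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to(model.device)
    past_key_values = copy.deepcopy(prompt_cache)
    outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
    responses.append(tokenizer.batch_decode(outputs)[0])
print(responses)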

@mearcstapa-gqz

mearcstapa-gqz commented Oct 21, 2024

@zucchini-nlp
The example code uses a for-loop over the prompts; I can't figure out how to set up past_key_values to make it work like normal batch inference. Here's a minimal example:

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache, StaticCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
# Curiously, setting tokenizer.padding_side = "right" yields a coherent result for both
# get_output(inputs) and get_output(inputs, past_key_values=copy.deepcopy(prompt_cache))
# (though if I switch the model to "Qwen/Qwen2-VL-2B-Instruct", right padding also produces gibberish).
# But then there's a warning: "A decoder-only architecture is being used, but right-padding was
# detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer."
# What are the implications?
# https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side

INITIAL_PROMPT = '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n'
prompts = ["Help me to write a blogpost about travelling.", "What is the capital of France?"]

inputs = tokenizer([INITIAL_PROMPT + prompt + '<|im_end|>\n<|im_start|>assistant\n' for prompt in prompts], return_tensors="pt", padding=True).to("cuda")


def get_output(inputs, past_key_values=None):
    generated_ids = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=20)

    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_texts = tokenizer.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    for o in output_texts:
        print(o)

get_output(inputs) # normal batch inference


prompt_cache = DynamicCache()
inputs_initial_prompt = tokenizer([INITIAL_PROMPT] * 2, return_tensors="pt").to("cuda")

with torch.no_grad():
    prompt_cache = model(**inputs_initial_prompt, past_key_values=prompt_cache).past_key_values

get_output(inputs, past_key_values=copy.deepcopy(prompt_cache)) # incoherent output, how to set past_key_values properly?

@zucchini-nlp
Member

Hmm, you're right: when we do batching, the padding will not be set correctly, because the initial prompt has no padding while the subsequent calls are padded on the left. So we end up with sequences as follows:

INITIAL_PROMPT [PAD] [PAD] [PAD] [PAD] INPUT-TEXT

I don't see an easy way to overcome this unless we start supporting nested tensors. Also cc @gante in case you have any ideas; otherwise maybe we add this to our TODO list.
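
Concretely, the attention-mask layout ends up like this (a toy illustration of the shapes involved, not tied to any particular model):

# Toy illustration of why the concatenated layout is problematic.
# The shared prefix is cached without padding, but the new batch is left-padded,
# so the pad tokens land in the middle of each row instead of at the far left.
prefix_mask = [
    [1, 1, 1, 1],          # row 0: INITIAL_PROMPT, no padding
    [1, 1, 1, 1],          # row 1: INITIAL_PROMPT, no padding
]
suffix_mask = [
    [0, 0, 1, 1, 1, 1],    # row 0: shorter prompt, left-padded
    [1, 1, 1, 1, 1, 1],    # row 1: longest prompt, no padding
]
combined = [p + s for p, s in zip(prefix_mask, suffix_mask)]
# row 0: [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]  -> INITIAL_PROMPT [PAD] [PAD] INPUT-TEXT
# row 1: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  -> INITIAL_PROMPT INPUT-TEXT
# generate() assumes left padding, so the cache/position bookkeeping for row 0 is off.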

@mearcstapa-gqz

@zucchini-nlp
May I ask why something like this won't work?

Actually, I tried to make the input look like
INITIAL_PROMPT [PAD] [PAD] [PAD] [PAD] INPUT-TEXT
with

from transformers import BatchFeature  # not imported in the snippet above

# `processor`, `model`, and `inputs_initial_prompt` come from the surrounding setup
# (a processor is used here in place of the tokenizer, e.g. for Qwen2-VL).
texts = [f"{query}<|im_end|>\n<|im_start|>assistant\n" for query in ["Help me to write a blogpost about travelling.", "What is the capital of France?"]]

inputs = processor(
    text=texts, images=None, padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)

# Manually prepend the cached prefix's ids and mask so each row looks like
# INITIAL_PROMPT [PAD] ... [PAD] INPUT-TEXT.
inputs = BatchFeature(data={
    'input_ids': torch.concat([inputs_initial_prompt.input_ids, inputs.input_ids], -1),
    'attention_mask': torch.concat([inputs_initial_prompt.attention_mask, inputs.attention_mask], -1),
})

Is it because the attention_mask passed is actually generated inside the model?

@zucchini-nlp
Member

Please see: #25420 (comment) for why padding-side/batching matters when generating
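
In short (paraphrasing the linked discussion and the padding-side section of the LLM tutorial), decoder-only generation continues from the last position of each row, so with right padding that position is a pad token:

# Toy illustration of why right padding breaks decoder-only generation.
right_padded = ["What", "is", "the", "capital", "of", "France", "?", "[PAD]", "[PAD]"]
left_padded  = ["[PAD]", "[PAD]", "What", "is", "the", "capital", "of", "France", "?"]
# With right padding, the next token is predicted after "[PAD]" (garbage context);
# with left padding, it is predicted after "?", which is what we want.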

@lzl-mt

lzl-mt commented Nov 8, 2024

I have a question: the KV cache is already computed for INITIAL_PROMPT, so why does the current input still need to include INITIAL_PROMPT? Wouldn't this compute the initial prompt twice?
See inputs_initial_prompt = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda") in https://huggingface.co/docs/transformers/kv_cache#re-use-cache-to-continue-generation
Thanks! @zucchini-nlp @mearcstapa-gqz

@mayankagarwals
Contributor

Hi 👋
I'd like to raise a PR for this issue if it's still open.

@zucchini-nlp zucchini-nlp linked a pull request Nov 15, 2024 that will close this issue
@zucchini-nlp
Member

@lzl-mt it comes from the way generate() works: currently we crop the input ids and remove all previous ids that are already in the cache, so nothing is computed twice :)

@mayankagarwals sorry, I didn't see your comment. Since the person who opened the issue has been away for a few weeks, I opened the fix myself in #34746.
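
Roughly, the cropping amounts to something like the following (a simplified sketch of the idea, not the exact transformers implementation):

# Simplified sketch of the cropping described above -- not the actual transformers code.
def crop_to_uncached_tokens(input_ids, past_key_values):
    """Drop the leading tokens that are already represented in the cache."""
    cached_len = past_key_values.get_seq_length() if past_key_values is not None else 0
    if 0 < cached_len < input_ids.shape[1]:
        input_ids = input_ids[:, cached_len:]
    return input_ids

So even though INITIAL_PROMPT appears again in the new input_ids, only the uncached tail actually goes through the model.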
