
Qwen2VL exhibits significant performance differences under different attention implementations. #35749

masn1310 opened this issue Jan 17, 2025 · 2 comments

@masn1310

System Info

transformers=4.47.1
pytorch=2.3.0
flash-attn=2.7.2
python=3.10

Who can help?

@amyeroberts @qubvel @zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm using the lmms-eval framework to evaluate Qwen2VL models on various benchmarks.

Here is the script:

```shell
python3 -m accelerate.commands.launch \
    --main_process_port=28175 \
    --mixed_precision=bf16 \
    --num_processes=2 \
    -m lmms_eval \
    --model qwen2_vl_with_kvcache \
    --model_args pretrained=/share/home/models/Qwen2-VL-7B-Instruct,use_flash_attention_2=true \
    --tasks chartqa \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix chartqa \
    --output_path ./logs/qwen2vl/chatqa/
```
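
For reference, the comparison should also be reproducible outside lmms-eval with plain transformers by only changing `attn_implementation`. A minimal sketch (the checkpoint path, image, and question are placeholders, not the actual evaluation code):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"  # or the local checkpoint path

# Swap "eager" / "sdpa" / "flash_attention_2" here to compare implementations.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

image = Image.open("chart.png")  # any ChartQA-style image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the value of the largest bar?"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens.
print(processor.batch_decode(output[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```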

Expected behavior

Recently, I've been using Qwen2VL-7B for evaluation under the lmms-eval framework and discovered some confusing phenomena.

Taking the ChartQA task as an example: when both the vision module and the LLM use flash-attention2, I get a score of 81.56. However, when both use eager attention, the score drops significantly to 72.64.

To explore further, I ran additional experiments and found that as long as the LLM uses flash-attention2, the score stays around 82 regardless of which attention implementation the vision module uses.
However, when the vision module uses flash-attention2 while the LLM uses eager attention, the score drops to just 0.0008 and the model loses its generative ability, endlessly repeating one or two words.

| LLM attention | Vision: Flash | Vision: Eager |
| --- | --- | --- |
| Flash | 81.56 | 82.00 |
| Eager | 0.0008 | 72.64 |
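
To confirm which implementation each submodule actually ends up with, one can inspect the instantiated attention classes after loading. A small sketch (the attribute names are taken from the Qwen2-VL modeling code in transformers and may differ across versions; `model` is the object loaded in the repro sketch above):

```python
# Vision tower lives in `model.visual`, the language model in `model.model`
# in the transformers Qwen2-VL implementation (assumed attribute names).
print("vision attention class:", type(model.visual.blocks[0].attn).__name__)
print("LLM attention class:   ", type(model.model.layers[0].self_attn).__name__)
print("config flags:          ",
      model.config._attn_implementation,
      model.config.vision_config._attn_implementation)
```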

The model's responses under the 0.0008 setting:
"The value of the the the the the the the the the the the the the"
"````````````````````````````````````````````````"
"A is a person assistant. A is a person assistant. A is a person"
"The following are the the the the the the the the the the the the the"

The above results are all based on BF16 precision.
I also ran a check on precision: with all modules using eager attention, I cast Q/K/V to float32 so that the attention computation in the forward pass ran in FP32. Unfortunately, the final result was the same as with BF16 (72.64).
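
For reference, the FP32 check amounted to running the eager attention math in float32 and casting back, along these lines (a standalone sketch of the idea, not the actual patch to the modeling code):

```python
import torch

def eager_attention_fp32(q, k, v, attention_mask=None):
    """Eager scaled-dot-product attention with the math done in FP32.

    q, k, v: [batch, heads, seq, head_dim] tensors (e.g. bf16);
    attention_mask: additive mask broadcastable to the score shape, or None.
    """
    orig_dtype = q.dtype
    q, k, v = q.float(), k.float(), v.float()
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    if attention_mask is not None:
        scores = scores + attention_mask.float()
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v).to(orig_dtype)
```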

masn1310 added the bug label Jan 17, 2025
@Rocketknight1
Member

Definitely an interesting bug if it reproduces, cc @zucchini-nlp

@zucchini-nlp
Member

zucchini-nlp commented Jan 17, 2025

Will definitely look at it later next week. Afaik we had a bug with the Qwen2 text-only LM returning NaN values with eager attention and float16, so this might be loosely related to that.

Btw, do you know if this used to work better when the model was released?
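
A quick way to check whether the eager path is hitting that kind of NaN/Inf issue is to register forward hooks and flag the first non-finite activations during a single generate call; a rough sketch (assumes a loaded `model` as in the repro sketch above):

```python
import torch

def install_nan_hooks(model):
    """Register forward hooks that report modules producing NaN/Inf outputs."""
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, tuple) else (output,)
            for t in tensors:
                if isinstance(t, torch.Tensor) and not torch.isfinite(t).all():
                    print(f"non-finite output in {name} ({type(module).__name__})")
                    break
        return hook
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]

# handles = install_nan_hooks(model)
# model.generate(**inputs, max_new_tokens=16)
# for h in handles:
#     h.remove()
```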
