How to mitigate Qwen2-VL running out of GPU memory #2606

jianliao · 2024-10-15T19:27:31Z

jianliao
Oct 15, 2024

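Hi everyone,

I have recently been trying the 2B and 7B versions of Qwen2-VL and have run into a problem: the model runs out of GPU memory (OOM) very easily. Everything below uses the 2B model as the example.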

Test environment and commands

Test image

Server launch command

lmdeploy serve api_server Qwen/Qwen2-VL-2B-Instruct --tp 2

Test results

On my system, GPU memory is always exhausted after running the command above.

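Reproducing the problem and attempted solutions

Following the Qwen2-VL HF model card, I was able to reproduce the problem. Fortunately, the model card also offers two workarounds:

Scale images proportionally, down or up, to a uniform resolution: this reduces the amount of input image data and therefore GPU memory usage.
Use flash_attention_2: this saves memory and improves performance.

Both methods work, but the second does not modify the image and so loses no detail, which makes it the more convenient of the two.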
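To make the first method concrete, here is a minimal sketch of scaling an image size into a pixel budget while keeping the aspect ratio. It assumes one visual token per 28x28 pixel patch, which is how the model card expresses a 256-1280 token range as `256*28*28` to `1280*28*28` pixels. `fit_to_pixel_budget` and `visual_tokens` are hypothetical helper names for illustration, not the resizing code that the processor or `qwen_vl_utils` actually runs.

```python
import math

PATCH = 28  # assumed side length, in pixels, of the patch behind one visual token


def fit_to_pixel_budget(width, height,
                        min_pixels=256 * PATCH * PATCH,
                        max_pixels=1280 * PATCH * PATCH):
    """Return (new_width, new_height), both multiples of PATCH, with an area
    inside [min_pixels, max_pixels] and roughly the original aspect ratio."""
    # Snap each side to the nearest multiple of PATCH (at least one patch).
    w = max(PATCH, round(width / PATCH) * PATCH)
    h = max(PATCH, round(height / PATCH) * PATCH)
    if w * h > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down
        # so the result never exceeds the budget.
        scale = math.sqrt(width * height / max_pixels)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
    elif w * h < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (width * height))
        w = math.ceil(width * scale / PATCH) * PATCH
        h = math.ceil(height * scale / PATCH) * PATCH
    return w, h


def visual_tokens(width, height):
    """Number of PATCH x PATCH patches covering an already-snapped size."""
    return (width // PATCH) * (height // PATCH)


# Example: a 4K screenshot squeezed into the 1280-token budget.
new_w, new_h = fit_to_pixel_budget(3840, 2160)
print(new_w, new_h, visual_tokens(new_w, new_h))
```

The resulting size could then be passed to something like Pillow's `Image.resize((new_w, new_h))` before the image is handed to the model. Lowering `max_pixels` trades image detail for a smaller token count.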

Below is the code for the second method:

import torch

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
# )

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                # "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                # "image": "/home/jliao/Repos/qwen2-vl/htop_screen.png",
                "image": "/home/jliao/Repos/qwen2-vl/large.png",
                # "image": "/home/jliao/Repos/qwen2-vl/small.png",
                # "image": "/home/jliao/Repos/qwen2-vl/demo.jpeg"
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

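Asking for help

To enable these two solutions in lmdeploy, I went through the relevant documentation:

For the first method (scaling images proportionally), I found an offline inference example based on the pipeline API, but unfortunately no server-based example.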

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('Qwen/Qwen2-VL-2B-Instruct', log_level='INFO')

min_pixels = 64 * 28 * 28
max_pixels = 64 * 28 * 28
messages = [
    dict(role='user', content=[
        dict(type='text', text='Describe the two images in detail.'),
        dict(type='image_url', image_url=dict(min_pixels=min_pixels, max_pixels=max_pixels, url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Beijing_Small.jpeg')),
        dict(type='image_url', image_url=dict(min_pixels=min_pixels, max_pixels=max_pixels, url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Chongqing_Small.jpeg'))
    ])
]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

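For the second method (using flash_attention_2), I have no leads. I do not know whether it can only be done by writing a custom model extension.

Is there a way to enable these two solutions in lmdeploy? Any relevant examples or documentation links would be much appreciated!

Thanks, everyone, for your help!

P.S. Possibly related issues: #2565 #2590 #2582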

Answered by irexyc

lvhan028 · 2024-10-17T07:19:27Z

lvhan028
Oct 17, 2024
Maintainer

@irexyc

1 reply

lvhan028 Oct 17, 2024
Maintainer

related issue: #2565

jianliao · 2024-10-19T04:07:49Z

jianliao
Oct 19, 2024
Author

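@Titan-p provided a solution for the first method (limiting the pixel size); I have tested it and it works.

Is there a way to set this directly on the server side, for example through a configuration option or some kind of extension?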

lvhan028 · 2024-10-25T08:39:49Z

lvhan028
Oct 25, 2024
Maintainer

cc @irexyc

irexyc · 2024-10-25T09:41:37Z

irexyc
Oct 25, 2024
Collaborator

@jianliao

In the pipeline example, the message format is already the OpenAI format, so when using the server you can pass the same message as-is.
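As a rough illustration of that point, the sketch below builds the same kind of message as a request body for the server. It assumes the api_server exposes an OpenAI-compatible `/v1/chat/completions` route on port 23333 and that the served model name matches the Hugging Face id; both are assumptions to check against your own deployment. `build_chat_request` and `send_chat_request` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: address and port of a running `lmdeploy serve api_server` instance.
BASE_URL = "http://localhost:23333"


def build_chat_request(image_url, prompt, min_pixels, max_pixels,
                       model="Qwen/Qwen2-VL-2B-Instruct"):
    """Build an OpenAI-format request body with per-image pixel limits,
    mirroring the image_url dicts used in the pipeline example."""
    return {
        "model": model,  # assumption: the name the server registered the model under
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": image_url,
                               "min_pixels": min_pixels,
                               "max_pixels": max_pixels}},
            ],
        }],
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to the chat completions route and return the parsed JSON."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request(
    "https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Beijing_Small.jpeg",
    "Describe this image.",
    min_pixels=64 * 28 * 28,
    max_pixels=64 * 28 * 28,
)
# With a server running, the call would look like:
# reply = send_chat_request(payload)
# print(reply["choices"][0]["message"]["content"])
```

Nothing is sent until `send_chat_request` is called, so the payload can be inspected or logged first.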

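By server-side configuration, do you mean setting global maximum and minimum pixel values? That feature does not exist yet; for now it can only be controlled by modifying the code. The specific location is here; you can add a line below it, for example: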

if 'max_pixels' not in item:
    item.update(dict(max_pixels=64 * 28 * 28))
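Where patching the server code is not an option, the same default could be applied on the client before each request is sent. This is a minimal sketch, assuming the pixel limits live inside each `image_url` dict as in the pipeline example; `with_default_max_pixels` is a hypothetical helper and not part of lmdeploy.

```python
import copy


def with_default_max_pixels(messages, default_max_pixels=64 * 28 * 28):
    """Return a copy of OpenAI-format messages in which every image_url item
    that lacks 'max_pixels' receives the default; explicit values are kept."""
    patched = copy.deepcopy(messages)  # leave the caller's messages untouched
    for message in patched:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no images
        for part in content:
            if part.get("type") != "image_url":
                continue
            item = part["image_url"]
            if "max_pixels" not in item:
                item["max_pixels"] = default_max_pixels
    return patched


messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe the two images in detail."},
        {"type": "image_url", "image_url": {"url": "https://example.com/a.jpeg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/b.jpeg",
                                            "max_pixels": 256 * 28 * 28}},
    ]},
    {"role": "assistant", "content": "Both images show a city skyline."},
]
patched = with_default_max_pixels(messages)
```

The first image picks up the default limit, while the second keeps the value it already carried. The example URLs are placeholders.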
