
How to avoid OOM when running inference with Qwen2-VL 7B at batch=2? #2496

Open
YSF-A opened this issue Nov 25, 2024 · 3 comments


YSF-A commented Nov 25, 2024

I tried to run inference with Qwen2-VL 7B at batch = 2 on a 4090 (24 GB) and got an OOM error. How can I avoid it? In my opinion, batch = 2 is not that large for a 4090 (24 GB).

I also found that the LLM engine is 15 GB, but after building ModelRunnerCpp, memory usage increases by almost 20 GB. Is that expected?

@gujiewen

Have you set --kv_cache_free_gpu_memory_fraction to be smaller? By default, it's 0.9.
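
For example, something like this (a sketch only; the script path and engine/model directories are placeholders, and the exact flags may differ across TensorRT-LLM versions):

```bash
# Lower the fraction of free GPU memory reserved for the KV cache (default 0.9).
python3 examples/multimodal/run.py \
    --hf_model_dir /path/to/Qwen2-VL-7B-Instruct \
    --visual_engine_dir /path/to/visual_engine \
    --llm_engine_dir /path/to/llm_engine \
    --batch_size 2 \
    --kv_cache_free_gpu_memory_fraction 0.7
```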


YSF-A commented Nov 26, 2024

> Have you set --kv_cache_free_gpu_memory_fraction to be smaller? By default, it's 0.9.

@gujiewen thanks for your reply. I set --kv_cache_free_gpu_memory_fraction to 0.8 and it works.
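
For reference, the setting maps roughly to this when constructing the runner in Python (a sketch; the engine path is a placeholder and the keyword argument name may differ across TensorRT-LLM versions):

```python
from tensorrt_llm.runtime import ModelRunnerCpp

# Reserve 80% of the free GPU memory for the KV cache instead of the default
# 0.9, leaving more headroom for the vision encoder and activations.
runner = ModelRunnerCpp.from_dir(
    engine_dir="/path/to/llm_engine",  # placeholder path
    kv_cache_free_gpu_memory_fraction=0.8,
)
```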

But the output is empty. I checked, and the output_ids and output_token_ids are full of EOS tokens.

hello-11 added the triaged (Issue has been triaged by maintainers) label on Dec 2, 2024
@sunnyqgg
Collaborator

Hi @YSF-A, please use the latest code; you can run it with INT4/INT8/FP8 quantization, referring to examples/qwen. Please let me know if you have any other questions about it.

Thanks.
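
The weight-only INT8 path via examples/qwen looks roughly like this (a sketch; paths are placeholders and the exact flags should be checked against the examples/qwen README for your TensorRT-LLM version):

```bash
# Convert the HF checkpoint with INT8 weight-only quantization.
python3 examples/qwen/convert_checkpoint.py \
    --model_dir /path/to/Qwen2-VL-7B-Instruct \
    --output_dir ./ckpt_int8 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8

# Build the TensorRT engine from the converted checkpoint.
trtllm-build --checkpoint_dir ./ckpt_int8 \
             --output_dir ./engine_int8 \
             --gemm_plugin float16
```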
