
How to avoid OOM when running inference with Qwen2-VL 7B at batch=2? #2496

Open
YSF-A opened this issue Nov 25, 2024 · 3 comments


YSF-A commented Nov 25, 2024

I tried to run inference with Qwen2-VL 7B at batch = 2 on a 4090 (24 GB) and got an OOM error. How can I avoid it? In my opinion, batch = 2 is not that large for a 4090 (24 GB).

I also found that the LLM engine is 15 GB, but after building ModelRunnerCpp, memory usage increases by almost 20 GB. Is that expected?

@gujiewen

Have you set --kv_cache_free_gpu_memory_fraction to be smaller? By default, it's 0.9.
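
For example, something like this (a sketch only; the script path and engine/model directories are placeholders, and the exact flags may differ across TensorRT-LLM versions):

```bash
# Lower the fraction of free GPU memory reserved for the KV cache (default 0.9).
python3 examples/multimodal/run.py \
    --hf_model_dir /path/to/Qwen2-VL-7B-Instruct \
    --visual_engine_dir /path/to/visual_engine \
    --llm_engine_dir /path/to/llm_engine \
    --batch_size 2 \
    --kv_cache_free_gpu_memory_fraction 0.7
```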


YSF-A commented Nov 26, 2024

> Have you set --kv_cache_free_gpu_memory_fraction to be smaller? By default, it's 0.9.

@gujiewen thanks for your reply. I set --kv_cache_free_gpu_memory_fraction to 0.8 and it works.
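
For reference, the setting maps roughly to this when constructing the runner in Python (a sketch; the engine path is a placeholder and the keyword argument name may differ across TensorRT-LLM versions):

```python
from tensorrt_llm.runtime import ModelRunnerCpp

# Reserve 80% of the free GPU memory for the KV cache instead of the default
# 0.9, leaving more headroom for the vision encoder and activations.
runner = ModelRunnerCpp.from_dir(
    engine_dir="/path/to/llm_engine",  # placeholder path
    kv_cache_free_gpu_memory_fraction=0.8,
)
```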

But the output is empty. I checked, and the output_ids and output_token_ids are full of EOS tokens.

hello-11 added the triaged (Issue has been triaged by maintainers) label on Dec 2, 2024
@sunnyqgg
Collaborator

Hi @YSF-A, please use the latest code; you can run it with INT4/INT8/FP8 quantization, referring to examples/qwen. Please let me know if you have any other questions about it.

Thanks.
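
The weight-only INT8 path via examples/qwen looks roughly like this (a sketch; paths are placeholders and the exact flags should be checked against the examples/qwen README for your TensorRT-LLM version):

```bash
# Convert the HF checkpoint with INT8 weight-only quantization.
python3 examples/qwen/convert_checkpoint.py \
    --model_dir /path/to/Qwen2-VL-7B-Instruct \
    --output_dir ./ckpt_int8 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8

# Build the TensorRT engine from the converted checkpoint.
trtllm-build --checkpoint_dir ./ckpt_int8 \
             --output_dir ./engine_int8 \
             --gemm_plugin float16
```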
