Generation speed suddenly became very slow, what might be the cause? #8886
Unanswered
zhuwenfei-wintech
asked this question in
Q&A
Replies: 0 comments
The models I deployed suddenly became very slow without any warning; generation throughput dropped to around 1 token/s.
Here are the details:
I deployed Qwen2-72B-Instruct-GPTQ-Int4 with vLLM v0.5.1. The deployment uses Docker Compose with the official Docker image v0.5.1.
I deployed the same model with the same settings on two machines. One is a production machine with 4x 4090 GPUs, 256 GB memory, and a 36-core Xeon 8352V. The other is a development machine with 2x H100 GPUs (only one is used for the model), 256 GB memory, and an AMD EPYC 9354 32-Core.
The models on both machines had been running for over a month without any problem, but then both suddenly became very slow (not at the same time, but within 24 hours of each other). There were no errors in the log, and VRAM usage, GPU utilization, CPU, and memory all looked normal. I couldn't find a cause.
Here is part of the log; you can see the generation throughput suddenly decrease.
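For anyone who wants to pin down when such a drop started, here is a minimal sketch that scans log lines for generation throughput values and reports the first sharp fall. It assumes the log lines contain text of the form `generation throughput: <number> tokens/s`; the sample lines, the function names, and the `window` and `ratio` thresholds are all made up for illustration and are not taken from the log in this post.

```python
import re

# Assumption: each metrics line contains "generation throughput: <float> tokens/s".
THROUGHPUT_RE = re.compile(
    r"generation throughput:\s*([0-9]+(?:\.[0-9]+)?)\s*tokens/s",
    re.IGNORECASE,
)

# Hypothetical sample lines for illustration only (not real output).
SAMPLE_LOG = [
    "INFO metrics: Avg generation throughput: 30.1 tokens/s, Running: 2 reqs",
    "INFO metrics: Avg generation throughput: 29.8 tokens/s, Running: 2 reqs",
    "INFO metrics: Avg generation throughput: 0.0 tokens/s, Running: 0 reqs",
    "INFO metrics: Avg generation throughput: 31.0 tokens/s, Running: 2 reqs",
    "INFO metrics: Avg generation throughput: 1.1 tokens/s, Running: 2 reqs",
    "INFO metrics: Avg generation throughput: 0.9 tokens/s, Running: 2 reqs",
]


def parse_throughputs(lines):
    """Return the generation throughput values found in the lines, in order."""
    values = []
    for line in lines:
        match = THROUGHPUT_RE.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


def find_drop(values, window=3, ratio=0.2):
    """Return the index of the first nonzero sample that falls below
    ratio * (mean of the previous `window` nonzero samples), or None."""
    recent = []
    for i, value in enumerate(values):
        if value <= 0:
            continue  # skip idle intervals so they are not counted as a drop
        if len(recent) == window and value < ratio * (sum(recent) / window):
            return i
        recent.append(value)
        if len(recent) > window:
            recent.pop(0)
    return None


if __name__ == "__main__":
    samples = parse_throughputs(SAMPLE_LOG)
    print("drop at sample index:", find_drop(samples))
```

Zero-throughput samples are skipped on the assumption that they represent idle intervals rather than a slowdown; adjust that if the server is never idle.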
After a clean restart of the containers, the problem seems to be gone so far.
Any idea what might be the cause?