
Value Error when loading quantized Qwen2-72B INT4 model using vLLM on multiple GPUs #907

Open
venki-lfc opened this issue Sep 11, 2024 · 1 comment


When I try to load the model on vLLM using

python -m vllm.entrypoints.openai.api_server --dtype auto --api-key token-abc123 --tensor-parallel-size 2 --enforce-eager --host 0.0.0.0 --port 8005 --model ./downloads/hub/models--neuralmagic--Qwen2-72B-Instruct-quantized.w4a16/

I get the following error

.
.
.
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 178, in create_weights
    verify_marlin_supports_shape(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 106, in verify_marlin_supports_shape
    raise ValueError(f"Weight input_size_per_partition = "
ValueError: Weight input_size_per_partition = 14784 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
[rank0]:[W911 15:08:56.043475996 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
ERROR 09-11 15:09:01 api_server.py:186] RPCServer process died before responding to readiness probe

This error does not occur when I use smaller models on a single GPU, and Llama 3.1 does not produce such an error on multiple GPUs either. So could you please look into this issue? I really enjoy using Qwen2 for my projects and find this model to be really good with RAG, which is why I am asking you to solve the quantization issue :)
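
For context, the failing check is a plain divisibility test on the sharded weight dimension: Qwen2-72B's MLP intermediate size of 29568 splits into 14784 per GPU at `--tensor-parallel-size 2`, which is not a multiple of the Marlin kernel's 128-element tile. A minimal sketch of that arithmetic (constants are taken from the traceback above, not copied from the vLLM source):

```python
# Hedged sketch: mirrors the divisibility check reported in the traceback.
INTERMEDIATE_SIZE = 29568  # Qwen2-72B MLP intermediate size (matches 2 * 14784)
MIN_THREAD_K = 128         # minimum K tile required by the GPTQ-Marlin kernel

for tp_size in (1, 2, 4, 8):
    per_partition = INTERMEDIATE_SIZE // tp_size
    status = "ok" if per_partition % MIN_THREAD_K == 0 else f"not divisible by {MIN_THREAD_K}"
    print(f"tensor_parallel_size={tp_size}: input_size_per_partition={per_partition} -> {status}")
```

Only `tensor_parallel_size=1` passes, which is consistent with the error message's suggestion to either reduce `tensor_parallel_size` or fall back to `--quantization gptq`.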

jklj077 (Collaborator) commented Sep 12, 2024

Hi, if you quantized the 72B model on your own, please refer to the troubleshooting section of our docs and pad the parameters before quantization. Also be aware that the padding size depends on your quantization settings (in particular, group_size or block_size) and your inference settings (some efficient kernels require a specific group_size or block_size).
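
As a rough illustration of that dependence, the padded size has to keep every tensor-parallel shard a multiple of the quantization group size. A sketch under those assumptions (the helper name and the 8-way tensor-parallel bound are illustrative, not taken from the Qwen docs):

```python
import math

def padded_intermediate_size(intermediate_size: int,
                             group_size: int = 128,
                             max_tp_size: int = 8) -> int:
    """Round intermediate_size up so each of up to max_tp_size tensor-parallel
    shards remains a multiple of group_size (illustrative helper only)."""
    multiple = group_size * max_tp_size
    return math.ceil(intermediate_size / multiple) * multiple

# Qwen2-72B: 29568 is padded up to 29696 for group_size=128 and up to 8-way TP.
print(padded_intermediate_size(29568))  # 29696
```

The idea is that the added dimensions are zero-filled before quantization, so the model's outputs are unchanged while every shard stays kernel-friendly.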
