
Value Error when loading quantized Qwen2-72B INT4 model using vLLM on multiple GPUs #907

Open
venki-lfc opened this issue Sep 11, 2024 · 1 comment


When I try to load the model on vLLM using

python -m vllm.entrypoints.openai.api_server --dtype auto --api-key token-abc123 --tensor-parallel-size 2 --enforce-eager --host 0.0.0.0 --port 8005 --model ./downloads/hub/models--neuralmagic--Qwen2-72B-Instruct-quantized.w4a16/

I get the following error

.
.
.
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 178, in create_weights
    verify_marlin_supports_shape(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 106, in verify_marlin_supports_shape
    raise ValueError(f"Weight input_size_per_partition = "
ValueError: Weight input_size_per_partition = 14784 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
[rank0]:[W911 15:08:56.043475996 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
ERROR 09-11 15:09:01 api_server.py:186] RPCServer process died before responding to readiness probe

This error does not occur when I use smaller models on a single GPU, and Llama 3.1 does not produce such an error on multiple GPUs either. So could you please look into this issue? I really enjoy using Qwen2 for my projects and find this model to be really good with RAG, which is why I am asking you to solve the quantization issue :)
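
For context, the failing check is a plain divisibility test on the sharded weight dimension: Qwen2-72B's MLP intermediate size of 29568 splits into 14784 per GPU at `--tensor-parallel-size 2`, which is not a multiple of the Marlin kernel's 128-element tile. A minimal sketch of that arithmetic (constants are taken from the traceback above, not copied from the vLLM source):

```python
# Hedged sketch: mirrors the divisibility check reported in the traceback.
INTERMEDIATE_SIZE = 29568  # Qwen2-72B MLP intermediate size (matches 2 * 14784)
MIN_THREAD_K = 128         # minimum K tile required by the GPTQ-Marlin kernel

for tp_size in (1, 2, 4, 8):
    per_partition = INTERMEDIATE_SIZE // tp_size
    status = "ok" if per_partition % MIN_THREAD_K == 0 else f"not divisible by {MIN_THREAD_K}"
    print(f"tensor_parallel_size={tp_size}: input_size_per_partition={per_partition} -> {status}")
```

Only `tensor_parallel_size=1` passes, which is consistent with the error message's suggestion to either reduce `tensor_parallel_size` or fall back to `--quantization gptq`.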

jklj077 (Collaborator) commented Sep 12, 2024

Hi, if you quantized the 72B model on your own, please refer to the troubleshooting section of our docs and pad the parameters before quantization. Also be aware that the padding size depends on your quantization settings (in particular, group_size or block_size) and your inference settings (some efficient kernels require a specific group_size or block_size).
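
As a rough illustration of that dependence, the padded size has to keep every tensor-parallel shard a multiple of the quantization group size. A sketch under those assumptions (the helper name and the 8-way tensor-parallel bound are illustrative, not taken from the Qwen docs):

```python
import math

def padded_intermediate_size(intermediate_size: int,
                             group_size: int = 128,
                             max_tp_size: int = 8) -> int:
    """Round intermediate_size up so each of up to max_tp_size tensor-parallel
    shards remains a multiple of group_size (illustrative helper only)."""
    multiple = group_size * max_tp_size
    return math.ceil(intermediate_size / multiple) * multiple

# Qwen2-72B: 29568 is padded up to 29696 for group_size=128 and up to 8-way TP.
print(padded_intermediate_size(29568))  # 29696
```

The idea is that the added dimensions are zero-filled before quantization, so the model's outputs are unchanged while every shard stays kernel-friendly.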
