
[Bug]: Unexpected results when running Qwen2.5 72B GPTQ-Int8 inference on NVIDIA L20 #1006

Open
4 tasks done
renne444 opened this issue Oct 8, 2024 · 3 comments
Comments


renne444 commented Oct 8, 2024

Model Series

Qwen2

What are the models used?

Qwen2.5-72B-Instruct-GPTQ-Int8 and Qwen2-72B-Instruct-GPTQ-Int8

What is the scenario where the problem happened?

transformers

Is this a known issue?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find an answer there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

  • Python: 3.10
  • GPUs: 8 x NVIDIA L20
  • NVIDIA driver: 535.161.07
  • CUDA compiler: cuda_12.6

Package Version Editable project location


accelerate 0.34.2
aiohappyeyeballs 2.4.2
aiohttp 3.10.8
aiosignal 1.3.1
async-timeout 4.0.3
attrs 24.2.0
auto_gptq 0.8.0.dev0+cu121 /mnt/download/AutoGPTQ
certifi 2024.8.30
charset-normalizer 3.3.2
coloredlogs 15.0.1
datasets 3.0.1
dill 0.3.8
filelock 3.16.1
frozenlist 1.4.1
fsspec 2024.6.1
gekko 1.2.1
huggingface-hub 0.25.1
humanfriendly 10.0
idna 3.10
Jinja2 3.1.4
MarkupSafe 2.1.5
mpmath 1.3.0
multidict 6.1.0
multiprocess 0.70.16
networkx 3.3
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.68
nvidia-nvtx-cu12 12.1.105
optimum 1.22.0
packaging 24.1
pandas 2.2.3
peft 0.13.0
pip 24.2
protobuf 5.28.2
psutil 6.0.0
pyarrow 17.0.0
python-dateutil 2.9.0.post0
pytz 2024.2
PyYAML 6.0.2
regex 2024.9.11
requests 2.32.3
rouge 1.0.1
safetensors 0.4.5
sentencepiece 0.2.0
setuptools 75.1.0
six 1.16.0
sympy 1.13.3
threadpoolctl 3.5.0
tokenizers 0.19.1
torch 2.4.1
tqdm 4.66.5
transformers 4.44.2
triton 3.0.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
wheel 0.44.0
xxhash 3.5.0
yarl 1.13.1

Log output

Loading checkpoint shards: 100%|██████████| 20/20 [05:54<00:00, 17.71s/it]
Traceback (most recent call last):
  File "/xxx/qwen_inference_loss_hugging_face.py", line 35, in <module>
    generated_ids = model.generate(
  File "/root/.conda/envs/gptq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/.conda/envs/gptq/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/root/.conda/envs/gptq/lib/python3.10/site-packages/transformers/generation/utils.py", line 3020, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Description

Steps to reproduce

I followed the sample here and tested Qwen2 and Qwen2.5 with transformers, and got an unexpected error. The detailed source code is below.

I reproduced the sample on a machine with 8 x L20 GPUs, and both the Qwen2 and Qwen2.5 GPTQ-Int8 models fail in the same way. I have also verified that the bf16 model works as expected.

The code is as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

# model_path = get_default_model_path("bf16")
model_path = "/mnt/Qwen/Qwen2.5-72B-Instruct-GPTQ-Int8"

device = "cuda"

max_memory = {0: "24GiB", 1: "34GiB", 2: "34GiB", 3: "34GiB", 4: "34GiB", 5: "34GiB", 6: "34GiB", 7: "34GiB"}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    max_memory=max_memory,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
attention_mask = model_inputs['attention_mask'].to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    attention_mask=attention_mask
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
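
For reference, a minimal diagnostic sketch (an illustrative addition, not part of the original report) that runs a single forward pass with the same inputs and inspects the raw logits before any sampling happens; if nan or inf already shows up here, the problem is in the quantized forward pass rather than in the sampling step:

import torch

# Single forward pass with the inputs prepared above; no sampling involved.
with torch.no_grad():
    out = model(**model_inputs)

# Logits for the next-token position.
logits = out.logits[:, -1, :]
print("any nan:", torch.isnan(logits).any().item())
print("any inf:", torch.isinf(logits).any().item())
print("min/max:", logits.min().item(), logits.max().item())

Calling model.generate(..., do_sample=False) is another quick check: greedy decoding bypasses torch.multinomial and therefore avoids the exception, but if the logits themselves contain nan the output will still be broken.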
renne444 changed the title from "[Bug]: Unexpected results when running Qwen2.5 GPTQ-Int8 inference on NVIDIA L20" to "[Bug]: Unexpected results when running Qwen2.5 72B GPTQ-Int8 inference on NVIDIA L20" on Oct 8, 2024
jklj077 (Collaborator) commented Oct 9, 2024

Please first try upgrading the driver. We have had similar reports from users running multiple RTX 4090s (also Ada Lovelace cards).

In addition, I'm not sure auto_gptq works with torch 2.4.1.
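
As a quick sanity check of the installed combination, a small sketch (an illustrative addition, assuming the packages import cleanly) that prints the relevant versions:

import torch
from importlib.metadata import version

# Build and runtime versions that matter for the GPTQ kernels.
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
for pkg in ("transformers", "optimum", "auto_gptq"):
    print(pkg, version(pkg))
print("GPU:", torch.cuda.get_device_name(0))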

renne444 (Author) commented

@jklj077

Thank you very much for the support!

Following the version information in the Dockerfile, I installed the matching dependencies, including torch 2.2.2. On both A100 and L20 machines I ran the 0.5B, 7B, and 72B Qwen2.5 models with the sample mentioned earlier. In every case the models run fine under vllm but fail with the provided transformers example, and this happens on both single-GPU and multi-GPU setups.

Our A100 machines use driver version 525.105.17 and our L20 machines use 535.161.07. In our cloud environment, upgrading the driver would be very costly.

Is it possible to run the provided sample on Ampere or Ada Lovelace GPUs with the current driver versions?

Environment:

Package                  Version
------------------------ -----------
accelerate               1.0.0
aiohappyeyeballs         2.4.3
aiohttp                  3.10.9
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    24.2.0
auto_gptq                0.7.1
autoawq                  0.2.5
autoawq_kernels          0.0.6
certifi                  2024.8.30
charset-normalizer       3.4.0
coloredlogs              15.0.1
datasets                 3.0.1
dill                     0.3.8
einops                   0.8.0
filelock                 3.16.1
frozenlist               1.4.1
fsspec                   2024.6.1
gekko                    1.2.1
huggingface-hub          0.25.2
humanfriendly            10.0
idna                     3.10
Jinja2                   3.1.4
MarkupSafe               3.0.1
mkl_fft                  1.3.10
mkl_random               1.2.7
mkl-service              2.4.0
mpmath                   1.3.0
multidict                6.1.0
multiprocess             0.70.16
networkx                 3.3
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.19.3
nvidia-nvjitlink-cu12    12.6.77
nvidia-nvtx-cu12         12.1.105
optimum                  1.20.0
packaging                24.1
pandas                   2.2.3
peft                     0.13.1
pillow                   10.4.0
pip                      24.2
propcache                0.2.0
protobuf                 5.28.2
psutil                   6.0.0
pyarrow                  17.0.0
python-dateutil          2.9.0.post0
pytz                     2024.2
PyYAML                   6.0.2
regex                    2024.9.11
requests                 2.32.3
rouge                    1.0.1
safetensors              0.4.5
scipy                    1.14.1
sentencepiece            0.2.0
setuptools               72.1.0
six                      1.16.0
sympy                    1.13.3
tiktoken                 0.8.0
tokenizers               0.19.1
torch                    2.2.2
torchaudio               2.2.2
torchvision              0.17.2
tqdm                     4.66.5
transformers             4.40.2
triton                   2.2.0
typing_extensions        4.12.2
tzdata                   2024.2
urllib3                  2.2.3
wheel                    0.44.0
xxhash                   3.5.0
yarl                     1.14.0
zstandard                0.23.0

jklj077 (Collaborator) commented Oct 10, 2024

Could you try the solution to the second issue at https://qwen.readthedocs.io/en/v2.0/quantization/gptq.html#troubleshooting? (We have not seen this issue with Qwen2.5 yet.)

In general, if vllm works but auto_gptq doesn't, we recommend using vllm.
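
For reference, a minimal vllm sketch for the same prompt (an illustrative addition; the model path and tensor_parallel_size are taken from the setup described in this issue, and a reasonably recent vllm release is assumed):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "/mnt/Qwen/Qwen2.5-72B-Instruct-GPTQ-Int8"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# vllm picks up the GPTQ quantization from the model config;
# tensor_parallel_size=8 matches the 8-GPU machine in this report.
llm = LLM(model=model_path, tensor_parallel_size=8)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language model."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)
outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)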
