llama_inference 4bits error #270

Open · gjm441 opened this issue Jun 26, 2023 · 0 comments

gjm441 commented Jun 26, 2023

When I run the quantization script:
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --eval --save llama7b-4bit-128g.pt &>baseline.txt &

I get the same perplexity as reported in the README. But when I run inference with the saved int4 weights:
CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"
I get the following error:


Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
0%| | 0/12 [00:00<?, ?it/s]/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
0%| | 0/12 [00:00<?, ?it/s]
Traceback (most recent call last):
File "", line 21, in matmul_248_kernel
KeyError: ('2-.-0-.-0-37ce7529e37ca1a0b8a47b63bc5fd4b0-d6252949da17ceb5f3a278a70250af13-3b85c7bef5f0a641282f3b73af50f599-2d732a2488b7ed996facc3e641ee56bf-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.int32, torch.float16, torch.float16, torch.int32, torch.int32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (16, 256, 32, 8), (True, True, True, True, True, True, (False, True), (True, False), (True, False), (False, False), (False, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jingming.guo/Mars/GPTQ-for-LLaMa/llama.py", line 483, in
model = load_quant(args.model, args.load, args.wbits, args.groupsize)
File "/home/jingming.guo/Mars/GPTQ-for-LLaMa/llama.py", line 319, in load_quant
quant.autotune_warmup_linear(model, transpose=not (eval))
File "/home/jingming.guo/Mars/GPTQ-for-LLaMa/quant/quant_linear.py", line 419, in autotune_warmup_linear
matmul248(a, qweight, scales, qzeros, g_idx, bits, maxq)
File "/home/jingming.guo/Mars/GPTQ-for-LLaMa/quant/quant_linear.py", line 267, in matmul248
matmul_248_kernel[grid](input, qweight, output, scales, qzeros, g_idx, input.shape[0], qweight.shape[1], input.shape[1], bits, maxq, input.stride(0), input.stride(1), qweight.stride(0),
File "/home/jingming.guo/Mars/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in run
timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
File "/home/jingming.guo/Mars/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in
timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
File "/home/jingming.guo/Mars/GPTQ-for-LLaMa/quant/custom_autotune.py", line 72, in _bench
return triton.testing.do_bench(kernel_call, percentiles=(0.5, 0.2, 0.8), rep=40)
File "/home/anaconda3/envs/py39_ptq_jingming/lib/python3.9/site-packages/triton/testing.py", line 143, in do_bench
fn()
File "/home/jingming.guo/Mars/GPTQ-for-LLaMa/quant/custom_autotune.py", line 67, in kernel_call
self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
File "", line 41, in matmul_248_kernel
File "/home/anaconda3/envs/py39_ptq_jingming/lib/python3.9/site-packages/triton/compiler.py", line 1588, in compile
so_path = make_stub(name, signature, constants)
File "/home/anaconda3/envs/py39_ptq_jingming/lib/python3.9/site-packages/triton/compiler.py", line 1477, in make_stub
so = _build(name, src_path, tmpdir)
File "/home/anaconda3/envs/py39_ptq_jingming/lib/python3.9/site-packages/triton/compiler.py", line 1392, in _build
ret = subprocess.check_call(cc_cmd)
File "/home/anaconda3/envs/py39_ptq_jingming/lib/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpaxcp747n/main.c', '-O3', '-I/usr/local/cuda/include', '-I/home/anaconda3/envs/py39_ptq_jingming/include/python3.9', '-I/tmp/tmpaxcp747n', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpaxcp747n/matmul_248_kernel.cpython-39-x86_64-linux-gnu.so', '-L/usr/share/man/man7']' returned non-zero exit status 1.
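
The actual failure is the linker step, not the kernel itself: Triton compiles a small C stub for each kernel and links it with `-lcuda`, and here `ld` cannot find the CUDA driver library (note the odd `-L/usr/share/man/man7` search path in the failing command, which suggests Triton's lookup for the libcuda directory resolved to the wrong place). A minimal diagnostic sketch, assuming a Linux machine with the NVIDIA driver installed, to check whether `libcuda.so` is visible to the system linker at all:

```python
# Diagnostic sketch (assumption: "cannot find -lcuda" means the host linker
# cannot locate libcuda.so, which Triton links its compiled stub against).
import ctypes.util
import subprocess

# find_library consults the same caches/paths the dynamic linker uses.
print("libcuda via ctypes:", ctypes.util.find_library("cuda"))

# ldconfig -p lists every shared library the linker cache knows about;
# filter for libcuda.so to see which directory (if any) provides it.
listing = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
for line in listing.splitlines():
    if "libcuda.so" in line:
        print(line.strip())
```

If neither check turns up `libcuda.so`, a common workaround in similar reports is to make the driver library visible to the linker, for example by adding its directory (often `/usr/lib/x86_64-linux-gnu`, or the CUDA toolkit's `lib64/stubs`) to the linker search path before rerunning `llama_inference.py`.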
