
Syntax changed in triton.testing.do_bench() causing error when running llama_inference.py #285

Open
prasanna opened this issue Dec 10, 2023 · 1 comment


@prasanna

Got this error when running llama_inference.py:

$ CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
  0%|                                                                                                             | 0/12 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 72, in _bench
    return triton.testing.do_bench(kernel_call, percentiles=(0.5, 0.2, 0.8), rep=40)
TypeError: do_bench() got an unexpected keyword argument 'percentiles'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/llama_inference.py", line 110, in <module>
    model = load_quant(args.model, args.load, args.wbits, args.groupsize, fused_mlp=args.fused_mlp)
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/llama_inference.py", line 66, in load_quant
    quant.autotune_warmup_linear(model, transpose=not (eval))
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/quant_linear.py", line 419, in autotune_warmup_linear
    matmul248(a, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/quant_linear.py", line 267, in matmul248
    matmul_248_kernel[grid](input, qweight, output, scales, qzeros, g_idx, input.shape[0], qweight.shape[1], input.shape[1], bits, maxq, input.stride(0), input.stride(1), qweight.stride(0),
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 73, in _bench
    except triton.compiler.OutOfResources:
AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'

The issue is in quant/custom_autotune.py:72: the `percentiles` keyword argument of `triton.testing.do_bench()` was renamed to `quantiles` in newer Triton releases. (The secondary `AttributeError` in the traceback occurs because the `except triton.compiler.OutOfResources:` handler on the next line also references an attribute that no longer exists at that location.)

@SimWangArizona

Maybe you can try this fix: huggingface/text-generation-inference@773aabd @prasanna
