
[feature request] lm_head quantization #2550

Open

youki-sada opened this issue Dec 9, 2024 · 0 comments
Recently, vocab_size has been growing, and the lm_head weight exceeds 10 GB in some LLMs.
However, there is currently no way to quantize lm_head: modelopt.torch.export.postprocess.update_lm_head_quantization ignores a manually supplied quant_cfg and disables lm_head quantization anyway.
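For reference, this is roughly how a user would try to request lm_head quantization today. This is a minimal sketch, not verified against modelopt 0.19.0: the config keys follow modelopt's wildcard-style quant_cfg format, and the model name and calibration loop are placeholders.

import copy

import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Built-in configs ship with a rule along the lines of {"*lm_head*": {"enable": False}};
# dropping that rule is the natural way to ask for lm_head to be quantized as well.
cfg = copy.deepcopy(mtq.INT4_AWQ_CFG)
cfg["quant_cfg"].pop("*lm_head*", None)

def calibrate_loop(m):
    # Minimal stand-in for a real calibration pass over a calibration dataset.
    inputs = tokenizer("calibration sample", return_tensors="pt").to(m.device)
    with torch.no_grad():
        m(**inputs)

model = mtq.quantize(model, cfg, forward_loop=calibrate_loop)
# Any quantizers attached to lm_head here are turned back off by
# update_lm_head_quantization() during TRT-LLM export (excerpt below).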

Excerpt from update_lm_head_quantization (modelopt==0.19.0):

    # Disable lm_head quantization for TRTLLM
    if get_quantization_format(lm_head) != QUANTIZATION_NONE:
        disable_lm_head_quantization = True

    ...

    # disable quantizer
    if disable_lm_head_quantization:
        if hasattr(input_quantizer, "_pre_quant_scale"):
            disable_pre_quant_scale_and_resmooth(lm_head, delete_pre_quant_scale=True)

        for quantizer in SequentialQuantizer.tensor_quantizer_iterator(lm_head.weight_quantizer):
            quantizer.disable()

        input_quantizer.disable()
        print("Disable lm_head quantization for TRT-LLM export due to deployment limitations.")

Related issue: #1394
