
[feature request] lm_head quantization #2550

Open

youki-sada opened this issue Dec 9, 2024 · 0 comments
Recently, vocab_size has been growing, and the lm_head weight exceeds 10 GB in some LLMs.
However, there is currently no way to quantize lm_head: modelopt.torch.export.postprocess.update_lm_head_quantization ignores a manually supplied quant_cfg and disables lm_head quantization anyway.
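For reference, this is roughly how a user would try to request lm_head quantization today. This is a minimal sketch, not verified against modelopt 0.19.0: the config keys follow modelopt's wildcard-style quant_cfg format, and the model name and calibration loop are placeholders.

import copy

import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Built-in configs ship with a rule along the lines of {"*lm_head*": {"enable": False}};
# dropping that rule is the natural way to ask for lm_head to be quantized as well.
cfg = copy.deepcopy(mtq.INT4_AWQ_CFG)
cfg["quant_cfg"].pop("*lm_head*", None)

def calibrate_loop(m):
    # Minimal stand-in for a real calibration pass over a calibration dataset.
    inputs = tokenizer("calibration sample", return_tensors="pt").to(m.device)
    with torch.no_grad():
        m(**inputs)

model = mtq.quantize(model, cfg, forward_loop=calibrate_loop)
# Any quantizers attached to lm_head here are turned back off by
# update_lm_head_quantization() during TRT-LLM export (excerpt below).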

Excerpt from update_lm_head_quantization (modelopt==0.19.0):

    # Disable lm_head quantization for TRTLLM
    if get_quantization_format(lm_head) != QUANTIZATION_NONE:
        disable_lm_head_quantization = True

    ...

    # disable quantizer
    if disable_lm_head_quantization:
        if hasattr(input_quantizer, "_pre_quant_scale"):
            disable_pre_quant_scale_and_resmooth(lm_head, delete_pre_quant_scale=True)

        for quantizer in SequentialQuantizer.tensor_quantizer_iterator(lm_head.weight_quantizer):
            quantizer.disable()

        input_quantizer.disable()
        print("Disable lm_head quantization for TRT-LLM export due to deployment limitations.")

Related issue: #1394
