Recently, vocab_size has been growing, and the lm_head weight exceeds 10 GB in some LLMs.
However, there is currently no way to quantize lm_head: `modelopt.torch.export.postprocess.update_lm_head_quantization` ignores a manual quant_cfg and disables lm_head quantization during export.
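For illustration, here is a minimal sketch of the user-side attempt that gets overridden. It assumes modelopt 0.19.0's published config style (`mtq.FP8_DEFAULT_CFG`, the `"*lm_head*"` wildcard entry, and `mtq.quantize(model, config, forward_loop)`); the tiny model and calibration loop are placeholders, not part of this issue.

```python
# Hedged sketch: try to opt lm_head back into quantization via the quant_cfg.
import copy

import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq


class TinyLM(nn.Module):
    """Stand-in for a real LLM; only the lm_head module name matters here."""

    def __init__(self, hidden=64, vocab=1000):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, x):
        return self.lm_head(self.backbone(x))


model = TinyLM()

quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
# Default configs ship an entry like {"*lm_head*": {"enable": False}};
# overriding it is the natural way to request lm_head quantization.
quant_cfg["quant_cfg"]["*lm_head*"] = {"enable": True}


def forward_loop(m):
    # Minimal calibration pass over random data (placeholder).
    m(torch.randn(8, 64))


model = mtq.quantize(model, quant_cfg, forward_loop)

# Even with this override, update_lm_head_quantization() in the TRT-LLM export
# path (shown below) detects a quantized lm_head and disables it again, so the
# manual setting never reaches the exported checkpoint.
```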
`update_lm_head_quantization` (modelopt==0.19.0):

```python
# postprocess.py, lines 638-671 (middle elided)

# Disable lm_head quantization for TRTLLM
if get_quantization_format(lm_head) != QUANTIZATION_NONE:
    disable_lm_head_quantization = True

...

# disable quantizer
if disable_lm_head_quantization:
    if hasattr(input_quantizer, "_pre_quant_scale"):
        disable_pre_quant_scale_and_resmooth(lm_head, delete_pre_quant_scale=True)

    for quantizer in SequentialQuantizer.tensor_quantizer_iterator(lm_head.weight_quantizer):
        quantizer.disable()

    input_quantizer.disable()
    print("Disable lm_head quantization for TRT-LLM export due to deployment limitations.")
```
Related issue: #1394