I am working to deploy the new tensorrt_llm.llmapi for a decoder-heavy task: input sequences of ~128 tokens, outputs of around 4k to 10k tokens, and a small batch size of ~6 to 8 samples per batch.
I am trying to build a few formats of the Qwen/Qwen-QwQ-32B-Preview model to test throughput on a 4 x L4 instance. vLLM with w4a16 is already benchmarked, and I would like to see whether tensorrt_llm can go faster.
I understand that loading the model with bfloat16 is not supported for a tensorrt_llm build with fp8 or w4a8, so I manually changed the Hugging Face model config to contain "torch_dtype": "float16".
Building w4a16_awq and w4a8_awq worked successfully. I would now like to try w8a8 with an fp8 kv_cache, but the build fails with an error after ~2 hours, see below.
Also, do you have any recommendations for maximising throughput on this task? I have seen that using MAX_UTILIZATION is good; is there anything in the setup below, or in how I call the model, that you would recommend changing? Thank you 🙏🏼
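To make the question concrete, this is roughly how I intend to drive generation once an engine is built. It is a minimal sketch rather than my exact script: the engine path, KV-cache memory fraction and sampling values are illustrative, and the exact LLM-API kwargs may differ slightly in 0.15.

```python
# Minimal sketch of the generation side (not the exact script).
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="engines/qwq-32b-w4a16_awq",  # hypothetical path to a prebuilt engine
    tensor_parallel_size=4,             # 4 x L4
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.9),
)

# ~128-token prompts, 6-8 per batch, up to ~10k generated tokens each
prompts = ["<prompt text>"] * 8
sampling = SamplingParams(max_tokens=10000, temperature=0.0)

for out in llm.generate(prompts, sampling):
    print(out.outputs[0].text)
```

The failing w8a8 build produces the log below.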
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
Loading checkpoint shards: 100%|██████████| 17/17 [01:19<00:00, 4.69s/it]
Inserted 1347 quantizers: 100%|██████████| 17/17 [01:19<00:00, 4.51s/it]
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_quant.py:132: DeprecationWarning: forward_loop should take model as argument, but got forward_loop without any arguments. This usage will be deprecated in future versions.
return calibrate(model, config["algorithm"], forward_loop=forward_loop)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 462, in export_tensorrt_llm_checkpoint
for (
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 264, in torch_to_tensorrt_llm_checkpoint
layer_config = build_decoder_config(
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 1452, in build_decoder_config
attention_config = build_attention_config(
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 825, in build_attention_config
config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 784, in build_linear_config
config.activation_scaling_factor = get_activation_scaling_factor(module)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 318, in get_activation_scaling_factor
get_scaling_factor(module.input_quantizer) if hasattr(module, "input_quantizer") else None
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 298, in get_scaling_factor
amax = quantizer.export_amax()
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py", line 622, in export_amax
self._validate_amax(amax)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py", lineroot@2ccbcb131650:/workspace/trtllm/kaggle-aimo2#
System Info
Snippet of the Python script building the model.
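In outline it does the following. This is a minimal sketch assuming the LLM API quantization path, not the verbatim snippet; the QuantConfig / CalibConfig arguments, BuildConfig limits and output path are my own placeholders.

```python
# Minimal sketch of the quantize-and-build step via the LLM API
# (not the original snippet; values below are placeholders/assumptions).
from tensorrt_llm import LLM, BuildConfig
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo, CalibConfig

quant_config = QuantConfig(
    quant_algo=QuantAlgo.W4A16_AWQ,     # or W4A8_AWQ; W8A8_SQ_PER_CHANNEL for the failing w8a8 attempt
    kv_cache_quant_algo=QuantAlgo.FP8,  # fp8 kv_cache
)

llm = LLM(
    model="Qwen/Qwen-QwQ-32B-Preview",  # config.json patched to "torch_dtype": "float16"
    dtype="float16",
    tensor_parallel_size=4,
    quant_config=quant_config,
    calib_config=CalibConfig(),         # default calibration dataset/settings
    build_config=BuildConfig(
        max_input_len=256,
        max_seq_len=12288,              # ~128-token prompt + up to ~10k output
        max_batch_size=8,
    ),
)

llm.save("engines/qwq-32b")             # persist the built engine
```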
Python & packages
nvcc --version
uname -m -> x86_64
free -h
nvidia-smi
allocated by the build
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
On the above machine with 4 x L4, run the below.
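(For completeness, the "torch_dtype": "float16" override mentioned in the description amounts to roughly the following sketch; the local checkpoint path is a placeholder.)

```python
# Sketch of the torch_dtype override (assumes a local copy of the checkpoint;
# the path is a placeholder).
from transformers import AutoConfig

ckpt = "models/Qwen-QwQ-32B-Preview"
config = AutoConfig.from_pretrained(ckpt)
config.torch_dtype = "float16"   # originally "bfloat16"
config.save_pretrained(ckpt)     # rewrites config.json in place
```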
Expected behavior
.
actual behavior
Full error
additional notes
.