I am working to deploy the new tensorrt_llm.llmapi for a decoder-heavy task: input sequences of ~128 tokens, outputs of around 4k to 10k tokens, and a small batch size of ~6 to 8 samples per batch.
I am trying to build a few formats of the Qwen/Qwen-QwQ-32B-Preview model to test throughput on a 4 x L4 instance. vLLM with w4a16 is already benchmarked, and I would like to see whether tensorrt_llm can go faster.
I understand that loading the model with bfloat16 is not supported for a tensorrt_llm build with fp8 or w4a8, so I manually changed the Hugging Face model config to contain "torch_dtype": "float16".
Building w4a16_awq and w4a8_awq worked successfully. I would now like to try w8a8 with an fp8 kv_cache, but the build fails with an error after ~2 hours, see below.
Also, do you have any recommendations for maximising throughput on this task? I have seen that using MAX_UTILIZATION is good; is there anything in the setup below, or in how I call the model, that you would recommend changing? Thank you 🙏🏼
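To make the question concrete, this is roughly how I intend to drive generation once an engine is built. It is a minimal sketch rather than my exact script: the engine path, KV-cache memory fraction and sampling values are illustrative, and the exact LLM-API kwargs may differ slightly in 0.15.

```python
# Minimal sketch of the generation side (not the exact script).
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="engines/qwq-32b-w4a16_awq",  # hypothetical path to a prebuilt engine
    tensor_parallel_size=4,             # 4 x L4
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.9),
)

# ~128-token prompts, 6-8 per batch, up to ~10k generated tokens each
prompts = ["<prompt text>"] * 8
sampling = SamplingParams(max_tokens=10000, temperature=0.0)

for out in llm.generate(prompts, sampling):
    print(out.outputs[0].text)
```

The failing w8a8 build produces the log below.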
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
Loading checkpoint shards: 100%|██████████| 17/17 [01:19<00:00, 4.69s/it]
Inserted 1347 quantizers: 100%|██████████| 17/17 [01:19<00:00, 4.51s/it]
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_quant.py:132: DeprecationWarning: forward_loop should take model as argument, but got forward_loop without any arguments. This usage will be deprecated in future versions.
return calibrate(model, config["algorithm"], forward_loop=forward_loop)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 462, in export_tensorrt_llm_checkpoint
for (
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 264, in torch_to_tensorrt_llm_checkpoint
layer_config = build_decoder_config(
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 1452, in build_decoder_config
attention_config = build_attention_config(
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 825, in build_attention_config
config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 784, in build_linear_config
config.activation_scaling_factor = get_activation_scaling_factor(module)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 318, in get_activation_scaling_factor
get_scaling_factor(module.input_quantizer) if hasattr(module, "input_quantizer") else None
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 298, in get_scaling_factor
amax = quantizer.export_amax()
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py", line 622, in export_amax
self._validate_amax(amax)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py", lineroot@2ccbcb131650:/workspace/trtllm/kaggle-aimo2#
System Info
Snippet of the Python script building the model.
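In outline it does the following. This is a minimal sketch assuming the LLM API quantization path, not the verbatim snippet; the QuantConfig / CalibConfig arguments, BuildConfig limits and output path are my own placeholders.

```python
# Minimal sketch of the quantize-and-build step via the LLM API
# (not the original snippet; values below are placeholders/assumptions).
from tensorrt_llm import LLM, BuildConfig
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo, CalibConfig

quant_config = QuantConfig(
    quant_algo=QuantAlgo.W4A16_AWQ,     # or W4A8_AWQ; W8A8_SQ_PER_CHANNEL for the failing w8a8 attempt
    kv_cache_quant_algo=QuantAlgo.FP8,  # fp8 kv_cache
)

llm = LLM(
    model="Qwen/Qwen-QwQ-32B-Preview",  # config.json patched to "torch_dtype": "float16"
    dtype="float16",
    tensor_parallel_size=4,
    quant_config=quant_config,
    calib_config=CalibConfig(),         # default calibration dataset/settings
    build_config=BuildConfig(
        max_input_len=256,
        max_seq_len=12288,              # ~128-token prompt + up to ~10k output
        max_batch_size=8,
    ),
)

llm.save("engines/qwq-32b")             # persist the built engine
```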
Python & packages
nvcc --version
uname -m -> x86_64
free -h
nvidia-smi
allocated by the build
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
On the above machine with 4 x L4, run the below.
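(For completeness, the "torch_dtype": "float16" override mentioned in the description amounts to roughly the following sketch; the local checkpoint path is a placeholder.)

```python
# Sketch of the torch_dtype override (assumes a local copy of the checkpoint;
# the path is a placeholder).
from transformers import AutoConfig

ckpt = "models/Qwen-QwQ-32B-Preview"
config = AutoConfig.from_pretrained(ckpt)
config.torch_dtype = "float16"   # originally "bfloat16"
config.save_pretrained(ckpt)     # rewrites config.json in place
```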
Expected behavior
.
actual behavior
Full error
additional notes
.