Build fails on w8a8 with kv_cache_dtype FP8 #2559

Closed
darraghdog opened this issue Dec 10, 2024 · 1 comment
Labels: bug (Something isn't working)

darraghdog commented Dec 10, 2024

System Info

I am working to deploy the new tensorrt_llm.llmapi for a decoder-heavy task: input sequences of ~128 tokens, outputs of around 4k to 10k tokens, and a small batch size of ~6 to 8 samples per batch.
I am trying to build a few quantized formats of the Qwen/Qwen-QwQ-32B-Preview model to compare throughput on a 4 x L4 instance. vLLM with w4a16 is already benchmarked, and I would like to try to get it faster with tensorrt_llm.
I understand that loading the model with bfloat16 is not supported for a tensorrt_llm build with fp8 or w4a8, so I have manually changed the huggingface model config to contain "torch_dtype": "float16".
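
For reference, the manual change is just an edit to the checkpoint's config.json; a minimal sketch (the path is the one used in the build script below, and the original dtype is assumed to be the bfloat16 the checkpoint ships with):

import json

config_path = "/workspace/hf/Qwen-QwQ-32B-Preview/config.json"

with open(config_path) as f:
    config = json.load(f)

# Override the shipped "torch_dtype": "bfloat16" so the fp8 / w4a8 builds
# accept the model.
config["torch_dtype"] = "float16"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)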

Building w4a16_awq & w4a8_awq worked successfully. I would now like to try w8a8 with an fp8 kv_cache, but the build fails with an error after ~2 hours; see below.

Also, do you have any recommendations for maximising throughput on my task? I have seen that using MAX_UTILIZATION helps; is there anything in the build settings below, or in how I call the model, that you would recommend?

Thank you 🙏🏼

Snippet of the Python script building the model:

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import CalibConfig, QuantAlgo, QuantConfig, BuildConfig
from tensorrt_llm.llmapi import KvCacheConfig, CapacitySchedulerPolicy

...
    # FP8 weights/activations with an FP8 KV cache
    quant_config = QuantConfig(quant_algo=QuantAlgo.FP8, has_zero_point=False)
    quant_config.kv_cache_quant_algo = QuantAlgo.FP8
    calib_config = CalibConfig(calib_dataset='my_dataset_folder',
                               calib_batches=512,
                               calib_max_seq_length=1024)
    build_config = BuildConfig(max_batch_size=8)
    llm = LLM(model="/workspace/hf/Qwen-QwQ-32B-Preview/",
              dtype="auto",
              tensor_parallel_size=4,
              quant_config=quant_config,
              calib_config=calib_config,
              enable_tqdm=True,
              build_config=build_config)
    llm.save(".tmp/Qwen-QwQ-32B-Preview_FP8_KVFP8_AWQ_tp4/")

Python & packages

Python 3.10.12
tensorrt_llm==0.15.0
nvidia-modelopt==0.19.0

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

uname -m -> x86_64

free -h

               total        used        free      shared  buff/cache   available
Mem:           503Gi        35Gi       116Gi       138Mi       351Gi       463Gi
Swap:             0B          0B          0B

nvidia-smi showing memory allocated by the build:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:41:00.0 Off |                    0 |
| N/A   41C    P0             27W /   72W |   16273MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      On  |   00000000:42:00.0 Off |                    0 |
| N/A   40C    P0             27W /   72W |   17190MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      On  |   00000000:82:00.0 Off |                    0 |
| N/A   43C    P0             28W /   72W |   17190MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      On  |   00000000:C2:00.0 Off |                    0 |
| N/A   42C    P0             27W /   72W |   14956MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

On the above machine with 4 x L4, run the Python snippet shown above under System Info.


Expected behavior

.

Actual behavior

Full error

[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
Loading checkpoint shards: 100%|██████████| 17/17 [01:19<00:00,  4.69s/it]
Inserted 1347 quantizers: 100%|██████████| 17/17 [01:19<00:00,  4.51s/it]
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_quant.py:132: DeprecationWarning: forward_loop should take model as argument, but got forward_loop without any arguments. This usage will be deprecated in future versions.
  return calibrate(model, config["algorithm"], forward_loop=forward_loop)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 462, in export_tensorrt_llm_checkpoint
    for (
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 264, in torch_to_tensorrt_llm_checkpoint
    layer_config = build_decoder_config(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 1452, in build_decoder_config
    attention_config = build_attention_config(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 825, in build_attention_config
    config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 784, in build_linear_config
    config.activation_scaling_factor = get_activation_scaling_factor(module)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 318, in get_activation_scaling_factor
    get_scaling_factor(module.input_quantizer) if hasattr(module, "input_quantizer") else None
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 298, in get_scaling_factor
    amax = quantizer.export_amax()
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py", line 622, in export_amax
    self._validate_amax(amax)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py", lineroot@2ccbcb131650:/workspace/trtllm/kaggle-aimo2# 

Additional notes

.

darraghdog added the bug label on Dec 10, 2024

darraghdog (Author) commented:

Archiving... my volume filled up when it was writing to disk 🤦‍♂
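
For anyone hitting the same truncated traceback: a quick free-space check before the export would have caught this. A minimal sketch (the 100 GiB threshold is an illustrative placeholder, not a measured engine size):

import os
import shutil

target = ".tmp/Qwen-QwQ-32B-Preview_FP8_KVFP8_AWQ_tp4/"
os.makedirs(target, exist_ok=True)

# Fail fast if the destination volume has too little headroom for the
# engine files that llm.save() is about to write.
free_gib = shutil.disk_usage(target).free / 1024**3
if free_gib < 100:
    raise RuntimeError(f"Only {free_gib:.1f} GiB free under {target}; "
                       "the engine export may fill the volume.")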
