Performance issue with long context #2548

Open · ShuaiShao93 opened this issue Dec 6, 2024 · 0 comments
Labels: bug Something isn't working

System Info

x86_64, Debian 11, L4/A100 GPU

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. On L4 or A100 GPU
  2. Install trtllm 0.15.0
  3. Build Llama 3.1 8B:
python TensorRT-LLM/examples/quantization/quantize.py --model_dir ./Meta-Llama-3.1-8B-Instruct --dtype float16 --qformat int4_awq  --awq_block_size 128  --output_dir ./tllm_checkpoint_1gpu_int4_awq   --calib_size 32 --batch_size 8

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu  --gpt_attention_plugin auto  --gemm_plugin auto --max_num_tokens 128000 --max_batch_size 8 --logits_dtype=float32 --gather_generation_logits --kv_cache_type=paged
  4. Benchmark with different input lengths (2k, 15k, 100k tokens):
python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --max_input_length=1000000 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file 2k-tokens.txt

python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --max_input_length=1000000 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file 15k-tokens.txt

python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --max_input_length=1000000 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file 100k-tokens.txt
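
(The 2k/15k/100k input files above are just plain-text prompts of roughly the target token counts. In case it helps reproduce, here is a hypothetical way to generate them by repetition; the repeat counts are approximate, since exact token counts depend on the tokenizer:)

```bash
# Hypothetical helper to create ~2k/15k/100k-token plain-text inputs by
# repeating a short phrase; with the Llama 3 tokenizer "hello world " is
# roughly 2 tokens, so the repeat counts below are only approximate.
for spec in 1000:2k 7500:15k 50000:100k; do
  reps=${spec%%:*}; name=${spec##*:}
  python3 -c "import sys; print('hello world ' * int(sys.argv[1]))" "$reps" > "${name}-tokens.txt"
done
```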

Expected behavior

Prefill time should scale roughly linearly with input length, so the 15k input should take around 7.5x as long as the 2k input, and the 100k input around 7x as long as the 15k input.

actual behavior

L4 2k: 580ms
L4 15k: 10.3s
L4 100k: 163s

A100 2k: 155ms
A100 15k: 1.75s
A100 100k: 36.5s

The L4 15k, L4 100k, and A100 100k timings are far above that expected scaling, which is unacceptably slow.
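
For concreteness, a quick check of the observed ratios against the roughly-linear expectation (timings in seconds, copied from above):

```bash
# Observed prefill scaling vs. the roughly-linear expectation.
# Timings (seconds) copied from the measurements above.
awk 'BEGIN {
  printf "L4:   15k/2k = %.1fx (expected ~7.5x), 100k/15k = %.1fx (expected ~7x)\n", 10.3 / 0.580, 163 / 10.3;
  printf "A100: 15k/2k = %.1fx (expected ~7.5x), 100k/15k = %.1fx (expected ~7x)\n", 1.75 / 0.155, 36.5 / 1.75;
}'
```

This prints roughly 17.8x and 15.8x on L4 and 11.3x and 20.9x on A100, i.e. well above linear at every step.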

additional notes

There may be some way to configure trtllm-build to accelerate this, but I can't find good documentation on optimizing for long-context inputs.
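
As an illustration of the kind of knobs I have been guessing at (untested, and I am not sure these are the right flags for long context, which is exactly what I would like documented): `--use_paged_context_fmha` and `--multiple_profiles` appear in the trtllm-build help, so a variation like the following is what I have been considering.

```bash
# Untested variation on the build command above. --use_paged_context_fmha
# enables the paged context FMHA kernel (a prerequisite for chunked
# context), and --multiple_profiles lets TensorRT tune tactics across
# input lengths. Whether these help long-context latency is the question.
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq \
  --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu \
  --gpt_attention_plugin auto --gemm_plugin auto \
  --max_num_tokens 128000 --max_batch_size 8 \
  --logits_dtype=float32 --gather_generation_logits \
  --kv_cache_type=paged \
  --use_paged_context_fmha enable \
  --multiple_profiles enable
```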
