Performance issue with long context #2548

Open · ShuaiShao93 opened this issue Dec 6, 2024 · 0 comments
Labels: bug Something isn't working

System Info

x86_64, Debian 11, L4/A100 GPU

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. On L4 or A100 GPU
  2. Install trtllm 0.15.0
  3. Build Llama 3.1 8B:
python TensorRT-LLM/examples/quantization/quantize.py --model_dir ./Meta-Llama-3.1-8B-Instruct --dtype float16 --qformat int4_awq  --awq_block_size 128  --output_dir ./tllm_checkpoint_1gpu_int4_awq   --calib_size 32 --batch_size 8

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu  --gpt_attention_plugin auto  --gemm_plugin auto --max_num_tokens 128000 --max_batch_size 8 --logits_dtype=float32 --gather_generation_logits --kv_cache_type=paged
  4. Benchmark with different input lengths (2k, 15k, 100k tokens):
python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --max_input_length=1000000 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file 2k-tokens.txt

python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --max_input_length=1000000 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file 15k-tokens.txt

python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --max_input_length=1000000 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file 100k-tokens.txt
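
(The 2k/15k/100k input files above are just plain-text prompts of roughly the target token counts. In case it helps reproduce, here is a hypothetical way to generate them by repetition; the repeat counts are approximate, since exact token counts depend on the tokenizer:)

```bash
# Hypothetical helper to create ~2k/15k/100k-token plain-text inputs by
# repeating a short phrase; with the Llama 3 tokenizer "hello world " is
# roughly 2 tokens, so the repeat counts below are only approximate.
for spec in 1000:2k 7500:15k 50000:100k; do
  reps=${spec%%:*}; name=${spec##*:}
  python3 -c "import sys; print('hello world ' * int(sys.argv[1]))" "$reps" > "${name}-tokens.txt"
done
```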

Expected behavior

Prefill time should scale roughly linearly with input length, so the 15k input should take around 7.5x as long as the 2k input, and the 100k input around 7x as long as the 15k input.

actual behavior

L4 2k: 580ms
L4 15k: 10.3s
L4 100k: 163s

A100 2k: 155ms
A100 15k: 1.75s
A100 100k: 36.5s

The L4 15k, L4 100k, and A100 100k timings are far above that expected scaling, which is unacceptably slow.
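
For concreteness, a quick check of the observed ratios against the roughly-linear expectation (timings in seconds, copied from above):

```bash
# Observed prefill scaling vs. the roughly-linear expectation.
# Timings (seconds) copied from the measurements above.
awk 'BEGIN {
  printf "L4:   15k/2k = %.1fx (expected ~7.5x), 100k/15k = %.1fx (expected ~7x)\n", 10.3 / 0.580, 163 / 10.3;
  printf "A100: 15k/2k = %.1fx (expected ~7.5x), 100k/15k = %.1fx (expected ~7x)\n", 1.75 / 0.155, 36.5 / 1.75;
}'
```

This prints roughly 17.8x and 15.8x on L4 and 11.3x and 20.9x on A100, i.e. well above linear at every step.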

additional notes

There may be some way to configure trtllm-build to accelerate this, but I can't find good documentation on optimizing for long-context inputs.
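
As an illustration of the kind of knobs I have been guessing at (untested, and I am not sure these are the right flags for long context, which is exactly what I would like documented): `--use_paged_context_fmha` and `--multiple_profiles` appear in the trtllm-build help, so a variation like the following is what I have been considering.

```bash
# Untested variation on the build command above. --use_paged_context_fmha
# enables the paged context FMHA kernel (a prerequisite for chunked
# context), and --multiple_profiles lets TensorRT tune tactics across
# input lengths. Whether these help long-context latency is the question.
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq \
  --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu \
  --gpt_attention_plugin auto --gemm_plugin auto \
  --max_num_tokens 128000 --max_batch_size 8 \
  --logits_dtype=float32 --gather_generation_logits \
  --kv_cache_type=paged \
  --use_paged_context_fmha enable \
  --multiple_profiles enable
```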
