python TensorRT-LLM/examples/quantization/quantize.py --model_dir ./Meta-Llama-3.1-8B-Instruct --dtype float16 --qformat int4_awq --awq_block_size 128 --output_dir ./tllm_checkpoint_1gpu_int4_awq --calib_size 32 --batch_size 8

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu --gpt_attention_plugin auto --gemm_plugin auto --max_num_tokens 128000 --max_batch_size 8 --logits_dtype=float32 --gather_generation_logits --kv_cache_type=paged
python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --max_input_length=1000000 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file 2k-tokens.txt

python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --max_input_length=1000000 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file 15k-tokens.txt

python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --max_input_length=1000000 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file 100k-tokens.txt
System Info
x86_64, Debian 11, L4/A100 GPU
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction

The reproduction steps are the quantize.py, trtllm-build, and run.py commands listed above.
Expected behavior
If prefill latency scaled roughly linearly with input length, 15k tokens should be about 7.5x slower than 2k, and 100k should be about 7x slower than 15k.
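The expected factors are simply the token-count ratios, assuming prefill latency grows linearly with the number of input tokens:

```python
# Expected slowdown factors under a linear-scaling assumption:
# latency ratio == token-count ratio.
print(f"15k vs 2k:   {15_000 / 2_000:.1f}x")    # 7.5x
print(f"100k vs 15k: {100_000 / 15_000:.1f}x")  # ~6.7x, i.e. roughly 7x
```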
Actual behavior
L4 2k: 580ms
L4 15k: 10.3s
L4 100k: 163s
A100 2k: 155ms
A100 15k: 1.75s
A100 100k: 36.5s
The L4 15k, L4 100k, and A100 100k latencies are far beyond linear scaling and unacceptable for our use case (e.g. L4 100k is ~280x slower than L4 2k for an input only 50x longer).
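For what it's worth, the A100 numbers are consistent with a latency model that includes a quadratic attention term, t(n) = a*n + b*n^2, rather than a purely linear one. The sketch below fits such a model to the 15k and 100k measurements and checks it against the 2k point; the coefficients are illustrative, derived only from the numbers reported above:

```python
# Fit t(n) = a*n + b*n**2 to two (tokens, seconds) measurements,
# then check the model against a third measurement.

def fit_linear_plus_quadratic(p1, p2):
    """Solve a*n + b*n**2 = t exactly for two (n, t) points."""
    (n1, t1), (n2, t2) = p1, p2
    r = n2 / n1  # scale factor used to eliminate the linear coefficient
    b = (t2 - r * t1) / (n2**2 - r * n1**2)
    a = (t1 - b * n1**2) / n1
    return a, b

# Reported A100 latencies: 15k tokens -> 1.75 s, 100k tokens -> 36.5 s.
a, b = fit_linear_plus_quadratic((15_000, 1.75), (100_000, 36.5))
pred_2k = a * 2_000 + b * 2_000**2
print(f"predicted 2k latency: {pred_2k * 1000:.0f} ms (measured: 155 ms)")
```

The model predicts ~157 ms at 2k tokens, within a few percent of the measured 155 ms, so super-linear growth itself is expected from attention's quadratic prefill cost; the open question is whether the absolute constants can be reduced.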
Additional notes
There may be a way to configure trtllm-build to speed up long-context prefill, but I can't find good documentation on optimizing for long contexts.