
Correctly setting the --max_encoder_input_len when using trtllm-build with mllama models #2558

here4dadata commented Dec 10, 2024

Hi,

I'd like to better understand how to set this flag correctly.
In the example from the TensorRT-LLM backend project, max_encoder_input_len is set to 8200, whereas in the TensorRT-LLM example page it is set to 4100. The batch size in one example is half that of the other, so I can understand the change from 8200 to 4100; however, I do not really understand how the number of image features is calculated.
The Llama-3.2-11B-Vision-Instruct and Llama-3.2-11B-Vision models also have different image sizes: 560x560 vs. 448x448, respectively.

I initially set these numbers incorrectly when building the visual and text engines for Llama-3.2-11B-Vision-Instruct; only during the runtime run.py test was I warned that encoder_max_input_length should be 6404 (for a batch size of 2). How was this 6404 arrived at?

My (quite possibly wrong) take: $560 / 14 = 40$ patches per side, and there are 4 tiles, so $40 \cdot 40 \cdot 4 = 6400$. Maybe I get to 6404 because there are four extra position tokens? How does this take into account the batch size? Was it only telling me 6404 because I sent in a single picture? My guess is sketched below.
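
To make the guess concrete, here is the arithmetic I'm assuming. The patch size of 14, the 4 tiles, and the 4 extra tokens are my assumptions, not something I've confirmed in the TensorRT-LLM code:

# sketch of my current guess for Llama-3.2-11B-Vision-Instruct
image_size = 560        # per-side input resolution
patch_size = 14         # assumed ViT patch size
num_tiles = 4           # assumed maximum number of image tiles

patches_per_tile = (image_size // patch_size) ** 2   # 40 * 40 = 1600
features_per_image = num_tiles * patches_per_tile    # 4 * 1600 = 6400
features_per_image += 4                              # + 4 extra (position?) tokens
print(features_per_image)                            # 6404

# open question: should --max_encoder_input_len be this value,
# or this value multiplied by --max_batch_size?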

Thanks for any clarification!!

The following commands were used w/ TensorRT-LLM v0.15.0:

# create vis engine
python /app/tensorrt_llm/examples/multimodal/build_visual_engine.py \
  --model_type mllama \
  --model_path /hf-models/Llama-3.2-11B-Vision-Instruct \
  --output_dir /llama-vis-engine

# convert weights
python /app/tensorrt_llm/examples/mllama/convert_checkpoint.py \
  --model_dir /hf-models/Llama-3.2-11B-Vision-Instruct \
  --output_dir /chkpt \
  --dtype bfloat16

# build engine 
trtllm-build \
  --checkpoint_dir /chkpt/ \
  --output_dir /llama-engine/ \
  --gemm_plugin auto \
  --max_batch_size 2 \
  --max_seq_len 2048 \
  --max_num_tokens 4096 \
  --max_encoder_input_len 6404 

# test 
python /app/tensorrt_llm/examples/multimodal/run.py \
  --visual_engine_dir /llama-vis-engine \
  --visual_engine_name visual_encoder.engine \
  --llm_engine_dir /llama-engine/ \
  --hf_model_dir /hf-models/Llama-3.2-11B-Vision-Instruct \
  --image_path <some-path> \
  --batch_size 2 \
  --max_new_tokens 64 \
  --input_text "<|image|>What is this picture about?"