
Correctly setting the --max_encoder_input_len when using trtllm-build with mllama models #2558

here4dadata commented Dec 10, 2024

Hi,

I'd like to better understand how to set this flag correctly.
In the example from the TensorRT-LLM backend project, max_encoder_input_len is set to 8200, whereas in the TensorRT-LLM example page it is set to 4100. The batch size in one example is half that of the other, so I can understand the change from 8200 to 4100; however, I do not really understand how the number of image features is calculated.
The Llama-3.2-11B-Vision-Instruct and Llama-3.2-11B-Vision models also have different image sizes: 560x560 vs. 448x448, respectively.

I initially set these numbers incorrectly when building the visual and text engines for Llama-3.2-11B-Vision-Instruct; only during the runtime run.py test was I warned that encoder_max_input_length should be 6404 (for a batch size of 2). How was this 6404 arrived at?

My (quite possibly wrong) take: $560 / 14 = 40$ patches per side, and there are 4 tiles, so $40 \cdot 40 \cdot 4 = 6400$. Maybe I get to 6404 because there are four extra position tokens? How does this take into account the batch size? Was it only telling me 6404 because I sent in a single picture? My guess is sketched below.
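
To make the guess concrete, here is the arithmetic I'm assuming. The patch size of 14, the 4 tiles, and the 4 extra tokens are my assumptions, not something I've confirmed in the TensorRT-LLM code:

# sketch of my current guess for Llama-3.2-11B-Vision-Instruct
image_size = 560        # per-side input resolution
patch_size = 14         # assumed ViT patch size
num_tiles = 4           # assumed maximum number of image tiles

patches_per_tile = (image_size // patch_size) ** 2   # 40 * 40 = 1600
features_per_image = num_tiles * patches_per_tile    # 4 * 1600 = 6400
features_per_image += 4                              # + 4 extra (position?) tokens
print(features_per_image)                            # 6404

# open question: should --max_encoder_input_len be this value,
# or this value multiplied by --max_batch_size?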

Thanks for any clarification!!

The following commands were used w/ TensorRT-LLM v0.15.0:

# create vis engine
python /app/tensorrt_llm/examples/multimodal/build_visual_engine.py \
  --model_type mllama \
  --model_path /hf-models/Llama-3.2-11B-Vision-Instruct \
  --output_dir /llama-vis-engine

# convert weights
python /app/tensorrt_llm/examples/mllama/convert_checkpoint.py \
  --model_dir /hf-models/Llama-3.2-11B-Vision-Instruct \
  --output_dir /chkpt \
  --dtype bfloat16

# build engine 
trtllm-build \
  --checkpoint_dir /chkpt/ \
  --output_dir /llama-engine/ \
  --gemm_plugin auto \
  --max_batch_size 2 \
  --max_seq_len 2048 \
  --max_num_tokens 4096 \
  --max_encoder_input_len 6404 

# test 
python /app/tensorrt_llm/examples/multimodal/run.py \
  --visual_engine_dir /llama-vis-engine \
  --visual_engine_name visual_encoder.engine \
  --llm_engine_dir /llama-engine/ \
  --hf_model_dir /hf-models/Llama-3.2-11B-Vision-Instruct \
  --image_path <some-path> \
  --batch_size 2 \
  --max_new_tokens 64 \
  --input_text "<|image|>What is this picture about?"