
Support LLMs With No Image Placeholder Embedding in LLava-based Models #35683

Open · alex-jw-brooks opened this issue Jan 14, 2025 · 3 comments
Labels: Feature request, Multimodal, VLM
alex-jw-brooks commented Jan 14, 2025

Feature request

Currently, llava-based models, e.g., llava-next, throw IndexError: index out of range in self from the LLM's embedding layer if the language model's embedding table has no entry for the image placeholder token. However, at inference time the embedding value for that token is never actually used, because the positions corresponding to the image token are overwritten by the projected image features.

Other inference engines, e.g., vLLM, separately mask out the text and multimodal embeddings and merge them together (e.g., here). This prevents such indexing errors if the image token is only part of the tokenizer vocabulary, and not part of the encapsulated language model's embedding vocab.

This can be worked around on the model side by resizing the token embeddings so the LLM gains an entry for the image token, but it would be nice for transformers to take a similar masking approach, so that models whose LLM embedding vocabulary lacks the image token can be used as-is.
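To make the proposed behavior concrete, here is a minimal framework-agnostic sketch of the mask-and-merge idea using NumPy. The table sizes, `IMAGE_TOKEN_ID`, and `embed_with_placeholder` are all illustrative assumptions, not transformers APIs:

```python
import numpy as np

# Illustrative sizes: a tiny LLM embedding table that does NOT
# contain the image placeholder token.
VOCAB_SIZE, HIDDEN = 10, 4
rng = np.random.default_rng(0)
embedding_table = rng.random((VOCAB_SIZE, HIDDEN))

# Tokenizer-side id for the image placeholder, out of range for the table.
IMAGE_TOKEN_ID = 32000

def embed_with_placeholder(input_ids, image_features):
    """Embed text ids, then overwrite image-token positions with image features."""
    input_ids = np.asarray(input_ids)
    image_mask = input_ids == IMAGE_TOKEN_ID
    # Replace out-of-vocab placeholder ids with a safe id (0) before the
    # lookup, so indexing the embedding table cannot go out of range.
    safe_ids = np.where(image_mask, 0, input_ids)
    inputs_embeds = embedding_table[safe_ids]
    # The dummy embeddings at placeholder positions are never used:
    # they are overwritten here by the (projected) image features.
    inputs_embeds[image_mask] = image_features
    return inputs_embeds

out = embed_with_placeholder([1, IMAGE_TOKEN_ID, 2], np.ones((1, HIDDEN)))
```

The text positions come straight from the embedding table, while the placeholder position carries only the image features, so the LLM never needs a real embedding row for the image token.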

Motivation

Fixing this will allow the use of llava-based models that don't have an embedding for the placeholder image token 😄

Your contribution

I am happy to submit a PR for this if the team is open to it!

lcxrocks commented Jan 14, 2025

Exactly. Taking llava-next as an example, this feature was introduced in v4.41.0:

# 1. Extract the input embeddings
# In case image_token_index is not in the embeddings (it is an extra token
# the embedding table doesn't have)
for_inputs_embeds_ids = input_ids.clone()
for_inputs_embeds_ids[(input_ids == self.config.image_token_index)] = 0
inputs_embeds = self.get_input_embeddings()(for_inputs_embeds_ids)

However, it was removed starting from v4.45.0.
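For reference, the failure that this guard prevented is a plain out-of-range lookup. A minimal NumPy analogue of PyTorch's "index out of range in self" (sizes and ids are illustrative):

```python
import numpy as np

VOCAB_SIZE, HIDDEN = 10, 4
embedding_table = np.zeros((VOCAB_SIZE, HIDDEN))
IMAGE_TOKEN_ID = 32000  # present in the tokenizer vocab, absent from the table

try:
    # Direct lookup without first masking the placeholder id: the analogue
    # of torch.nn.Embedding raising "IndexError: index out of range in self".
    _ = embedding_table[np.array([1, IMAGE_TOKEN_ID, 2])]
    raised = False
except IndexError:
    raised = True
```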

zucchini-nlp (Member) commented:
Totally agreed that it would be a useful feature to have, a PR is welcome 🤗

alex-jw-brooks (Author) commented:
Great! I will work on this and open a PR in the near future 😄
