
Support LLMs With No Image Placeholder Embedding in LLava-based Models #35683

Open · alex-jw-brooks opened this issue Jan 14, 2025 · 3 comments
Labels: Feature request, Multimodal, VLM
alex-jw-brooks commented Jan 14, 2025

Feature request

Currently, llava-based models, e.g., llava-next, throw IndexError: index out of range in self from the LLM's embedding layer if the language model's embedding table has no entry for the image placeholder token. However, at inference time the embedding value for that token is never actually used, because the positions corresponding to the image token are overwritten by the projected image features.

Other inference engines, e.g., vLLM, separately mask out the text and multimodal embeddings and merge them together (e.g., here). This prevents such indexing errors if the image token is only part of the tokenizer vocabulary, and not part of the encapsulated language model's embedding vocab.

This can be worked around on the model side by resizing the token embeddings so the LLM gains an entry for the image token, but it would be nice for transformers to take a similar masking approach, so that models whose LLM embedding vocabulary lacks the image token can be used as-is.
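To make the proposed behavior concrete, here is a minimal framework-agnostic sketch of the mask-and-merge idea using NumPy. The table sizes, `IMAGE_TOKEN_ID`, and `embed_with_placeholder` are all illustrative assumptions, not transformers APIs:

```python
import numpy as np

# Illustrative sizes: a tiny LLM embedding table that does NOT
# contain the image placeholder token.
VOCAB_SIZE, HIDDEN = 10, 4
rng = np.random.default_rng(0)
embedding_table = rng.random((VOCAB_SIZE, HIDDEN))

# Tokenizer-side id for the image placeholder, out of range for the table.
IMAGE_TOKEN_ID = 32000

def embed_with_placeholder(input_ids, image_features):
    """Embed text ids, then overwrite image-token positions with image features."""
    input_ids = np.asarray(input_ids)
    image_mask = input_ids == IMAGE_TOKEN_ID
    # Replace out-of-vocab placeholder ids with a safe id (0) before the
    # lookup, so indexing the embedding table cannot go out of range.
    safe_ids = np.where(image_mask, 0, input_ids)
    inputs_embeds = embedding_table[safe_ids]
    # The dummy embeddings at placeholder positions are never used:
    # they are overwritten here by the (projected) image features.
    inputs_embeds[image_mask] = image_features
    return inputs_embeds

out = embed_with_placeholder([1, IMAGE_TOKEN_ID, 2], np.ones((1, HIDDEN)))
```

The text positions come straight from the embedding table, while the placeholder position carries only the image features, so the LLM never needs a real embedding row for the image token.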

Motivation

Fixing this will allow the use of llava-based models that don't have an embedding for the placeholder image token 😄

Your contribution

I am happy to submit a PR for this if the team is open to it!

lcxrocks commented Jan 14, 2025

Exactly. Taking llava-next as an example, this feature was introduced in v4.41.0:

# 1. Extract the input embeddings
# In case image_token_index is not in the embeddings (it is an extra token
# the embedding table doesn't have)
for_inputs_embeds_ids = input_ids.clone()
for_inputs_embeds_ids[(input_ids == self.config.image_token_index)] = 0
inputs_embeds = self.get_input_embeddings()(for_inputs_embeds_ids)

However, it was removed starting from v4.45.0.
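For reference, the failure that this guard prevented is a plain out-of-range lookup. A minimal NumPy analogue of PyTorch's "index out of range in self" (sizes and ids are illustrative):

```python
import numpy as np

VOCAB_SIZE, HIDDEN = 10, 4
embedding_table = np.zeros((VOCAB_SIZE, HIDDEN))
IMAGE_TOKEN_ID = 32000  # present in the tokenizer vocab, absent from the table

try:
    # Direct lookup without first masking the placeholder id: the analogue
    # of torch.nn.Embedding raising "IndexError: index out of range in self".
    _ = embedding_table[np.array([1, IMAGE_TOKEN_ID, 2])]
    raised = False
except IndexError:
    raised = True
```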

zucchini-nlp (Member) commented:
Totally agreed that it would be a useful feature to have, a PR is welcome 🤗

alex-jw-brooks (Author) commented:
Great! I will work on this and open a PR in the near future 😄
