Feature request
Currently, llava-based models, e.g., llava-next, will throw `IndexError: index out of range in self` from the LLM's embedding layer if the LLM's embedding vocabulary does not contain an entry for the image token. However, at inference time that embedding value isn't actually used, because the positions corresponding to the image token are overwritten by the image features.

Other inference engines, e.g., vLLM, separately mask out the text and multimodal embeddings and merge them together (e.g., here). This prevents such indexing errors when the image token is only part of the tokenizer vocabulary and not part of the encapsulated language model's embedding vocabulary.
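For illustration, here is a minimal sketch of what such a masked merge could look like; the function name, arguments, and the choice of placeholder id are hypothetical and not existing transformers internals:

```python
import torch

def merge_image_features(input_ids, image_features, embed_tokens, image_token_id):
    # Replace image-token positions with a known-safe id before the embedding
    # lookup, so out-of-vocab ids never reach the LLM's embedding table.
    image_mask = input_ids == image_token_id
    safe_ids = input_ids.masked_fill(image_mask, 0)
    inputs_embeds = embed_tokens(safe_ids)

    # Overwrite the placeholder positions with the projected image features,
    # which is what already happens today after the (currently failing) lookup.
    inputs_embeds[image_mask] = image_features.to(inputs_embeds.dtype)
    return inputs_embeds
```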
This can be fixed on the model side by resizing the token embeddings to add the image token to the LLM (see the snippet below), but it would be nice to take a similar masking approach in transformers so that models without the image token in the LLM embedding vocabulary can be used as-is.
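As a rough sketch of that model-side workaround: the checkpoint name below is purely illustrative (the official llava-hf checkpoints already include the image token); the point is growing the LLM's embedding table for a model whose language model vocabulary lacks it.

```python
from transformers import AutoProcessor, LlavaNextForConditionalGeneration

# Example checkpoint; substitute the llava-based model that hits the IndexError.
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

# Grow the language model's embedding table so it covers the full tokenizer
# vocabulary, including the placeholder <image> token.
model.resize_token_embeddings(len(processor.tokenizer))
```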
Motivation
Fixing this will allow the use of llava-based models that don't have an embedding for the placeholder image token 😄
Your contribution
I am happy to submit a PR for this if the team is open to it!