
Add resize_token_embeddings feature #1965

Open
ccdv-ai opened this issue Oct 11, 2024 · 5 comments

@ccdv-ai

ccdv-ai commented Oct 11, 2024

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

Add the option to resize the token embeddings.
PreTrainedModel already exposes this method.

✔️ Solution

from transformers import AutoModelForCausalLM

# model_id and num_tokens are placeholders for the checkpoint to load
# and the desired embedding-table size
model = AutoModelForCausalLM.from_pretrained(model_id)
model.resize_token_embeddings(num_tokens)

❓ Alternatives

No response

📝 Additional Context

No response

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.
ccdv-ai added the enhancement (New feature or request) label on Oct 11, 2024
@NanoCode012
Collaborator

May I ask what the use case is? We currently resize to the tokenizer's length (or to the next multiple of 32 of it, if enabled).

@ccdv-ai
Author

ccdv-ai commented Oct 30, 2024

This can be useful when there is a mismatch between a tokenizer (or custom tokenizer) and the model's vocab_size.
For instance, there is a mismatch for all Qwen 2.5 models, where the embedding layer was padded for distributed training.
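
A minimal sketch of how to see the mismatch, assuming the Hub id Qwen/Qwen2.5-7B purely as an illustrative checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B"  # illustrative checkpoint for the mismatch described above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The embedding table has more rows than the tokenizer has tokens,
# because it was padded to a distributed-training-friendly size.
print(len(tokenizer))
print(model.get_input_embeddings().num_embeddings)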

@NanoCode012
Collaborator

NanoCode012 commented Oct 30, 2024

@ccdv-ai , just to clarify, would you want a config that lets you specify the new tokenizer vocab size or just to resize?

Axolotl does the latter under the hood when you add new tokens:

embeddings_len = (
    math.ceil(len(self.tokenizer) / 32) * 32
    if self.cfg.resize_token_embeddings_to_32x
    else len(self.tokenizer)
)
if (
    hasattr(self.model, "get_input_embeddings")
    and self.model.get_input_embeddings().num_embeddings < embeddings_len
):
    resize_kwargs = {}
    if self.cfg.mean_resizing_embeddings is not None:
        resize_kwargs["mean_resizing"] = self.cfg.mean_resizing_embeddings
    self.model.resize_token_embeddings(embeddings_len, **resize_kwargs)
else:
    self.model.tie_weights()

If you enable resize_token_embeddings_to_32x: true, it will resize to the next multiple of 32.
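
As a rough worked example of that rounding, using 151665 (the Qwen 2.5 tokenizer length mentioned below) as an illustrative input:

import math

# 151665 tokens rounded up to the next multiple of 32
embeddings_len = math.ceil(151665 / 32) * 32
print(embeddings_len)  # 151680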

@ccdv-ai
Author

ccdv-ai commented Oct 30, 2024

@NanoCode012 Only the option to resize the token embeddings to an arbitrary value.
For example, the Qwen 2.5 7B tokenizer has 151665 tokens but the embedding layer has 152064 rows.
resize_token_embeddings_to: 151665 should be possible.

if self.cfg.resize_token_embeddings_to < len(self.tokenizer):
    # warn or stop: the embedding table would be smaller than the tokenizer
    raise ValueError("resize_token_embeddings_to is smaller than the tokenizer length")
self.model.resize_token_embeddings(self.cfg.resize_token_embeddings_to, **resize_kwargs)
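
A rough sketch of what that would do for the Qwen 2.5 7B example, assuming resize_token_embeddings is used to shrink the padded matrix:

from transformers import AutoModelForCausalLM

# illustrative: trim the padded rows back down to the tokenizer's 151665 tokens
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
model.resize_token_embeddings(151665)
print(model.get_input_embeddings().num_embeddings)  # 151665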

@NanoCode012
Collaborator

@ccdv-ai thanks for clarifying. To add to your point, we already resize to the tokenizer's length.

# above code but summarized
embeddings_len = len(self.tokenizer)

if (
    self.model.get_input_embeddings().num_embeddings < embeddings_len
):
    self.model.resize_token_embeddings(embeddings_len)

For resizing to another value (!= len(self.tokenizer)), I'm not sure I understand the use case, as the tokenizer would then mismatch the embedding length and cause an error during training.
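
A toy sketch of that failure mode, assuming the embedding table is shrunk below the tokenizer's id range:

import torch
import torch.nn as nn

# a 100-row table standing in for a token embedding resized too small
embedding = nn.Embedding(num_embeddings=100, embedding_dim=8)
token_ids = torch.tensor([5, 42, 150])  # 150 is out of range for 100 rows
embedding(token_ids)  # raises IndexError: index out of range in self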
