
Add resize_token_embeddings feature #1965

Open
ccdv-ai opened this issue Oct 11, 2024 · 5 comments

@ccdv-ai

ccdv-ai commented Oct 11, 2024

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

Add the option to resize the token embeddings.
PreTrainedModel already exposes this method.

✔️ Solution

from transformers import AutoModelForCausalLM

# model_id and num_tokens are placeholders for the checkpoint to load
# and the desired embedding-table size
model = AutoModelForCausalLM.from_pretrained(model_id)
model.resize_token_embeddings(num_tokens)

❓ Alternatives

No response

📝 Additional Context

No response

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.
ccdv-ai added the enhancement (New feature or request) label on Oct 11, 2024
@NanoCode012
Collaborator

May I ask what the use case is? We currently resize to the tokenizer's length (or to the next multiple of 32 of it, if enabled).

@ccdv-ai
Author

ccdv-ai commented Oct 30, 2024

This can be useful when there is a mismatch between a tokenizer (or custom tokenizer) and the model's vocab_size.
For instance, there is a mismatch for all Qwen 2.5 models, where the embedding layer was padded for distributed training.
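
A minimal sketch of how to see the mismatch, assuming the Hub id Qwen/Qwen2.5-7B purely as an illustrative checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B"  # illustrative checkpoint for the mismatch described above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The embedding table has more rows than the tokenizer has tokens,
# because it was padded to a distributed-training-friendly size.
print(len(tokenizer))
print(model.get_input_embeddings().num_embeddings)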

@NanoCode012
Collaborator

NanoCode012 commented Oct 30, 2024

@ccdv-ai , just to clarify, would you want a config that lets you specify the new tokenizer vocab size or just to resize?

Axolotl does the latter under the hood when you add new tokens:

embeddings_len = (
    math.ceil(len(self.tokenizer) / 32) * 32
    if self.cfg.resize_token_embeddings_to_32x
    else len(self.tokenizer)
)
if (
    hasattr(self.model, "get_input_embeddings")
    and self.model.get_input_embeddings().num_embeddings < embeddings_len
):
    resize_kwargs = {}
    if self.cfg.mean_resizing_embeddings is not None:
        resize_kwargs["mean_resizing"] = self.cfg.mean_resizing_embeddings
    self.model.resize_token_embeddings(embeddings_len, **resize_kwargs)
else:
    self.model.tie_weights()

If you enable resize_token_embeddings_to_32x: true, it will resize to the next multiple of 32.
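
As a rough worked example of that rounding, using 151665 (the Qwen 2.5 tokenizer length mentioned below) as an illustrative input:

import math

# 151665 tokens rounded up to the next multiple of 32
embeddings_len = math.ceil(151665 / 32) * 32
print(embeddings_len)  # 151680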

@ccdv-ai
Author

ccdv-ai commented Oct 30, 2024

@NanoCode012 Only the option to resize the token embeddings to an arbitrary value.
For example, the Qwen 2.5 7B tokenizer has 151665 tokens but the embedding layer has 152064 rows.
resize_token_embeddings_to: 151665 should be possible.

if self.cfg.resize_token_embeddings_to < len(self.tokenizer):
    # warn or stop: the embedding table would be smaller than the tokenizer
    raise ValueError("resize_token_embeddings_to is smaller than the tokenizer length")
self.model.resize_token_embeddings(self.cfg.resize_token_embeddings_to, **resize_kwargs)
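
A rough sketch of what that would do for the Qwen 2.5 7B example, assuming resize_token_embeddings is used to shrink the padded matrix:

from transformers import AutoModelForCausalLM

# illustrative: trim the padded rows back down to the tokenizer's 151665 tokens
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
model.resize_token_embeddings(151665)
print(model.get_input_embeddings().num_embeddings)  # 151665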

@NanoCode012
Collaborator

@ccdv-ai thanks for clarifying. To add to your point, we already resize to the tokenizer's length.

# above code but summarized
embeddings_len = len(self.tokenizer)

if (
    self.model.get_input_embeddings().num_embeddings < embeddings_len
):
    self.model.resize_token_embeddings(embeddings_len)

For resizing to another value (!= len(self.tokenizer)), I'm not sure I understand the use case, as the tokenizer would then mismatch the embedding length and cause an error during training.
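
A toy sketch of that failure mode, assuming the embedding table is shrunk below the tokenizer's id range:

import torch
import torch.nn as nn

# a 100-row table standing in for a token embedding resized too small
embedding = nn.Embedding(num_embeddings=100, embedding_dim=8)
token_ids = torch.tensor([5, 42, 150])  # 150 is out of range for 100 rows
embedding(token_ids)  # raises IndexError: index out of range in self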
