Consistency issue with max_length flag in tokenizer #274

Open
le1nux opened this issue Dec 4, 2024 · 0 comments
Labels
bug Something isn't working

Comments

le1nux commented Dec 4, 2024

System Info

all versions

🐛 Describe the bug

Currently, we set a max_length attribute in our Hugging Face tokenizer wrapper, whose docstring describes it as "Maximum length of the tokenization output. Defaults to None.".

from typing import Optional


class PreTrainedHFTokenizer(TokenizerWrapper):
    """Wrapper for pretrained Hugging Face tokenizers."""

    def __init__(
        self,
        pretrained_model_name_or_path: str,
        truncation: Optional[bool] = False,
        padding: Optional[bool | str] = False,
        max_length: Optional[int] = None,
        special_tokens: Optional[dict[str, str]] = None,
    ) -> None:
        """Initializes the PreTrainedHFTokenizer."""

We pass this flag on when calling the tokenizer's __call__ function:

tokens = self.tokenizer.__call__(
    text,
    max_length=self.max_length,
    padding=self.padding,
    truncation=self.truncation,
)["input_ids"]
return tokens

However, as per the Hugging Face documentation, if max_length is set to None, the tokenizer falls back to the maximum input length of the corresponding model.

Hugging Face documentation on the __call__ function:

max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
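
To illustrate this fallback concretely, here is a minimal sketch using a plain Hugging Face tokenizer (gpt2 is only an example checkpoint; its model_max_length is 1024):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

long_text = "hello " * 5000  # tokenizes to far more than 1024 tokens

# max_length is left as None, so truncation falls back to model_max_length (1024).
ids_default = tokenizer(long_text, truncation=True)["input_ids"]

# An explicit max_length overrides the model maximum (a warning may be emitted).
ids_explicit = tokenizer(long_text, truncation=True, max_length=2048)["input_ids"]

print(len(ids_default), len(ids_explicit))  # 1024 2048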

Proposal:
We allow the user to specify max_length in the config (just like before). If it is set to None, we set max_length to a very large integer, e.g., int(1e30), in the constructor of the tokenizer. This way, tokenization always behaves the same, irrespective of the model_max_length attribute.
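
A minimal sketch of the proposed constructor behavior (the AutoTokenizer backend and the attribute assignments are assumptions based on the snippet above; int(1e30) is the sentinel suggested here):

from typing import Optional

from transformers import AutoTokenizer  # assumption: the wrapper is backed by AutoTokenizer


class PreTrainedHFTokenizer(TokenizerWrapper):
    def __init__(
        self,
        pretrained_model_name_or_path: str,
        truncation: Optional[bool] = False,
        padding: Optional[bool | str] = False,
        max_length: Optional[int] = None,
        special_tokens: Optional[dict[str, str]] = None,
    ) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
        self.truncation = truncation
        self.padding = padding
        # Proposed change: replace a missing max_length with a very large sentinel,
        # so __call__ never silently falls back to the model's model_max_length.
        self.max_length = max_length if max_length is not None else int(1e30)
        # special_tokens handling stays as before and is omitted in this sketch.

This mirrors the very large sentinel that transformers itself falls back to when a tokenizer has no known model maximum.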

le1nux added the bug label on Dec 4, 2024