Consistency issue with max_length flag in tokenizer #274

Open
le1nux opened this issue Dec 4, 2024 · 0 comments
Labels
bug Something isn't working

Comments

le1nux commented Dec 4, 2024

System Info

all versions

🐛 Describe the bug

Currently, we set a max_length attribute in our Hugging Face tokenizer wrapper, whose docstring describes it as "Maximum length of the tokenization output. Defaults to None.".

from typing import Optional


class PreTrainedHFTokenizer(TokenizerWrapper):
    """Wrapper for pretrained Hugging Face tokenizers."""

    def __init__(
        self,
        pretrained_model_name_or_path: str,
        truncation: Optional[bool] = False,
        padding: Optional[bool | str] = False,
        max_length: Optional[int] = None,
        special_tokens: Optional[dict[str, str]] = None,
    ) -> None:
        """Initializes the PreTrainedHFTokenizer."""

We pass this flag on when calling the tokenizer's __call__ function:

tokens = self.tokenizer.__call__(
    text,
    max_length=self.max_length,
    padding=self.padding,
    truncation=self.truncation,
)["input_ids"]
return tokens

However, as per the Hugging Face documentation, if max_length is set to None, the tokenizer falls back to the maximum input length of the corresponding model.

Hugging Face documentation on the __call__ function:

max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
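
To illustrate this fallback concretely, here is a minimal sketch using a plain Hugging Face tokenizer (gpt2 is only an example checkpoint; its model_max_length is 1024):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

long_text = "hello " * 5000  # tokenizes to far more than 1024 tokens

# max_length is left as None, so truncation falls back to model_max_length (1024).
ids_default = tokenizer(long_text, truncation=True)["input_ids"]

# An explicit max_length overrides the model maximum (a warning may be emitted).
ids_explicit = tokenizer(long_text, truncation=True, max_length=2048)["input_ids"]

print(len(ids_default), len(ids_explicit))  # 1024 2048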

Proposal:
We allow the user to specify max_length in the config (just like before). If it is set to None, we set max_length to a very large integer, e.g., int(1e30), in the constructor of the tokenizer. This way, tokenization always behaves the same, irrespective of the model_max_length attribute.
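
A minimal sketch of the proposed constructor behavior (the AutoTokenizer backend and the attribute assignments are assumptions based on the snippet above; int(1e30) is the sentinel suggested here):

from typing import Optional

from transformers import AutoTokenizer  # assumption: the wrapper is backed by AutoTokenizer


class PreTrainedHFTokenizer(TokenizerWrapper):
    def __init__(
        self,
        pretrained_model_name_or_path: str,
        truncation: Optional[bool] = False,
        padding: Optional[bool | str] = False,
        max_length: Optional[int] = None,
        special_tokens: Optional[dict[str, str]] = None,
    ) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
        self.truncation = truncation
        self.padding = padding
        # Proposed change: replace a missing max_length with a very large sentinel,
        # so __call__ never silently falls back to the model's model_max_length.
        self.max_length = max_length if max_length is not None else int(1e30)
        # special_tokens handling stays as before and is omitted in this sketch.

This mirrors the very large sentinel that transformers itself falls back to when a tokenizer has no known model maximum.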

le1nux added the bug label on Dec 4, 2024