System Info
all versions
🐛 Describe the bug
Currently, we set the max_length attribute in the huggingface tokenizer, which is described as "Maximum length of the tokenization output. Defaults to None.":

modalities/src/modalities/tokenization/tokenizer_wrapper.py
Lines 66 to 77 in 09c666b
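The embedded snippet does not render in this view; as a minimal hypothetical sketch of such a wrapper (class and parameter names are assumptions for illustration, not copied from the modalities code), the constructor stores the setting roughly like this:

```python
from typing import Optional

from transformers import AutoTokenizer


class PreTrainedHFTokenizer:
    """Hypothetical sketch of a HF tokenizer wrapper, for illustration only."""

    def __init__(
        self,
        pretrained_model_name_or_path: str,
        max_length: Optional[int] = None,
        truncation: bool = False,
        padding: bool = False,
    ):
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
        # Stored here and forwarded to the tokenizer's __call__ later;
        # a None value is what triggers the fallback described below.
        self.max_length = max_length
        self.truncation = truncation
        self.padding = padding
```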
We set this flag when calling the __call__ function on the tokenizer:

modalities/src/modalities/tokenization/tokenizer_wrapper.py
Lines 133 to 139 in 09c666b
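Continuing the assumed sketch above (the real lines 133 to 139 are likewise not shown here), the forwarding looks roughly like:

```python
    def tokenize(self, text: str) -> list:
        # The stored settings are passed straight through to huggingface's
        # __call__; max_length=None is therefore forwarded as-is.
        return self.tokenizer(
            text,
            max_length=self.max_length,
            truncation=self.truncation,
            padding=self.padding,
        )["input_ids"]
```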
However, as per the huggingface documentation, if max_length is set to None, the tokenizer falls back to the maximum input length of the corresponding model.

Huggingface documentation on the __call__ function:

max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
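The fallback is easy to reproduce (gpt2 is chosen here as an arbitrary example model, not taken from the issue):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # model_max_length == 1024

long_text = "hello " * 5000  # tokenizes to far more than 1024 tokens

# Truncation is requested but max_length is None, so huggingface
# silently truncates to the model's predefined maximum length:
ids = tokenizer(long_text, max_length=None, truncation=True)["input_ids"]
print(len(ids))  # 1024
```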
Proposal:

We allow the user to specify max_length in the config (just like before). If it is set to None, we set max_length to a very large integer, e.g. int(1e30), in the constructor of the tokenizer. This way, truncation to the model maximum is effectively disabled, and tokenization always proceeds irrespective of the model_max_length attribute.
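A minimal sketch of the proposed constructor change (names assumed for illustration; not a patch against the actual file):

```python
from typing import Optional

from transformers import AutoTokenizer

# Assumed sentinel; any integer far beyond realistic sequence lengths works.
LARGE_MAX_LENGTH = int(1e30)


class PreTrainedHFTokenizer:
    def __init__(
        self,
        pretrained_model_name_or_path: str,
        max_length: Optional[int] = None,
    ):
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
        # Substitute a huge bound for None so huggingface never falls back
        # to model_max_length; truncation to the model maximum cannot occur.
        self.max_length = max_length if max_length is not None else LARGE_MAX_LENGTH
```

With this change, passing max_length: null in the config yields untruncated tokenization regardless of the underlying model's model_max_length.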