
Increase model_max_length to avoid tokenizer warnings during packing #275

Open
flxst opened this issue Dec 4, 2024 · 0 comments


flxst commented Dec 4, 2024

The config of the tokenizer used during packing / tokenization of the data (`modalities data pack_encoded_data`) has an attribute `model_max_length`, see here.

While this attribute is not used in any way during packing and therefore has no effect on the results (i.e. the `.pbin` file), it triggers many warnings of the form "Token indices sequence length is longer than the specified maximum sequence length for this model (1757 > 1024)".

We should suppress these warnings by setting `model_max_length` to a very large value (e.g. in the Tokenizer object).
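A minimal sketch of the proposed fix, assuming a HuggingFace-style tokenizer as used for packing (the tiny word-level vocabulary below is purely hypothetical, so the example runs without downloading a real tokenizer):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Hypothetical stand-in for the packing tokenizer: a tiny in-memory
# word-level vocabulary wrapped as a transformers tokenizer.
vocab = {"hello": 0, "world": 1, "[UNK]": 2}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

# A small model_max_length (here 4, standing in for 1024) is what makes
# encoding longer sequences log "Token indices sequence length is longer
# than the specified maximum sequence length for this model".
tok = PreTrainedTokenizerFast(tokenizer_object=backend, model_max_length=4)

# The proposed fix: raise model_max_length to a very large sentinel
# (transformers itself uses int(1e30) as "effectively unlimited") so the
# length check never fires during packing.
tok.model_max_length = int(1e30)

# Encoding a sequence far longer than the original limit now stays silent.
ids = tok("hello world " * 10)["input_ids"]  # 20 tokens, no warning
print(len(ids))
```

Since `model_max_length` is not consulted anywhere else in the packing path, overriding it this way should not change the produced `.pbin` file.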
