
Increase model_max_length to avoid tokenizer warnings during packing #275

Open
flxst opened this issue Dec 4, 2024 · 0 comments


flxst commented Dec 4, 2024

The config of the tokenizer used during packing / tokenization of the data (`modalities data pack_encoded_data`) has an attribute `model_max_length`, see here.

While this attribute is not used in any way during packing and therefore has no effect on the results (i.e. the `.pbin` file), it triggers many warnings of the form "Token indices sequence length is longer than the specified maximum sequence length for this model (1757 > 1024)".

We should suppress these warnings by setting `model_max_length` to a very large value (e.g. in the Tokenizer object).
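A minimal sketch of the proposed fix, assuming a HuggingFace-style tokenizer as used for packing (the tiny word-level vocabulary below is purely hypothetical, so the example runs without downloading a real tokenizer):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Hypothetical stand-in for the packing tokenizer: a tiny in-memory
# word-level vocabulary wrapped as a transformers tokenizer.
vocab = {"hello": 0, "world": 1, "[UNK]": 2}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

# A small model_max_length (here 4, standing in for 1024) is what makes
# encoding longer sequences log "Token indices sequence length is longer
# than the specified maximum sequence length for this model".
tok = PreTrainedTokenizerFast(tokenizer_object=backend, model_max_length=4)

# The proposed fix: raise model_max_length to a very large sentinel
# (transformers itself uses int(1e30) as "effectively unlimited") so the
# length check never fires during packing.
tok.model_max_length = int(1e30)

# Encoding a sequence far longer than the original limit now stays silent.
ids = tok("hello world " * 10)["input_ids"]  # 20 tokens, no warning
print(len(ids))
```

Since `model_max_length` is not consulted anywhere else in the packing path, overriding it this way should not change the produced `.pbin` file.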
