Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenizer clean_up_tokenization_spaces #65

Open
Nickwiz opened this issue Aug 21, 2024 · 5 comments
Open

Tokenizer clean_up_tokenization_spaces #65

Nickwiz opened this issue Aug 21, 2024 · 5 comments

Comments

@Nickwiz
Copy link

Nickwiz commented Aug 21, 2024

FYI

Changes in transformers tokenizer gives deprecation warning.

/xxx/dltranslate/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601:
FutureWarning: clean_up_tokenization_spaces was not set. It will be set
to True by default. This behavior will be depracted in transformers
v4.45, and will be then set to False by default. For more details check
this issue: huggingface/transformers#31884

Something like this can be used:

mt = dlt.TranslationModel(
        tokenizer_options = {
            "clean_up_tokenization_spaces": True # Add this
        }
)

Have not found any difference in result by using True vs False, but then again I just started looking at this project.

@xhluca
Copy link
Owner

xhluca commented Aug 26, 2024

I'm a bit confused, why does it say that it is true by default, then immediately after it is false by default?

@Nickwiz
Copy link
Author

Nickwiz commented Aug 26, 2024

I'm a bit confused, why does it say that it is true by default, then immediately after it is false by default?

As I read it; it is set to True by default as of now, under patches to current version 4.44, as the code as-is in effect gives a result as if it were.

When they move to version 4.45 it is going to be set to False by default as that would make the tokenized string reversible into the original.

@xhluca
Copy link
Owner

xhluca commented Aug 26, 2024

Would True or False be better in this case? Perhaps it is useful to have reversible string, but that would be bad if it affects the final results (regression)

@Nickwiz
Copy link
Author

Nickwiz commented Sep 19, 2024

I do not know the code good enough to answer that. What I found, from some simple tests, is that formatting is not preserved. E.g.

Input text:

Text with extra spaces     and line    terminators.
End.

mt.translate(text, source="en", target="en")

Output:

Text with extra spaces and line terminators. End.

Setting "clean_up_tokenization_spaces" to False does not change result.

As the project is for now, I do not see any feature for reverting and not sure if that is something that would be of need. Perhaps in some sub-component?

Preserving formatting on the other hand would be nice, but a rather huge challenge when it comes to the nature of translations. E.g:

Input: line     terminators
no: linjeavslutninger
fi: linjapäätteet
ru: терминаторы линии (terminator lines)
...

Where if anywhere would one add the extra white-space etc.; And a rather big beast to handle - unless this is something already a feature or being worked on in the models.

@xhluca
Copy link
Owner

xhluca commented Sep 29, 2024

Hmm I see. In this case, i'm happy to merge a PR if you are interested in creating one!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants