Preserve all non-whitespace characters in the input sentence #12
base: master
Conversation
I spoke to @arysin on that matter. For example, the corresponding code for the regex you are altering is here: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/java/org/languagetool/tokenizers/uk/UkrainianWordTokenizer.java#L39 (the tests were also updated: languagetool-org/languagetool@95b83d7). His idea is that matching all non-space characters is far too broad and might cause unwanted consequences on other samples not covered by tests. I'll ask him to comment here too.
I have an extensive set of tests (unit tests and multimillion-token tests), but they use LanguageTool as mentioned above.
Hello, thank you for your comments and sorry for the late response. The manual enumeration has the not-so-nice property that any character not in the list will be missing from the tokenized output. This includes quite common characters such as the euro sign €, the degree sign °, or the multiplication sign ×, all of which were present in my own testing data.
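For illustration, here is a minimal sketch of the effect with a hypothetical, much-simplified enumeration pattern (not this project's actual WORD_TOKENIZATION_RULES): any symbol outside the enumerated classes simply never appears in the output.

```python
import re

# Hypothetical, simplified enumeration-style pattern (NOT the real
# WORD_TOKENIZATION_RULES): word characters plus listed punctuation only.
ENUMERATED_RULES = re.compile(r"[\w'’-]+|[.,;:!?«»()]")

text = "ставка 5 € за 1 кг, кут 90°"
print(ENUMERATED_RULES.findall(text))
# ['ставка', '5', 'за', '1', 'кг', ',', 'кут', '90']
# The € and ° symbols match neither alternative, so they silently disappear.
```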
Sorry, it's hard for me to comment on this change, as the original regexp in this code is very different from the tokenizer in LanguageTool that I maintain. I agree that symbols should not be dropped from tokenized text, and from what I tried, the regex in LanguageTool already passes all the new tests above.
BTW, there's a little Python wrapper for the LT/nlp_uk modules: https://github.com/brown-uk/nlp_uk/tree/master/src/main/python
The tokenizer omits some characters not covered by the WORD_TOKENIZATION_RULES regex. An example is the € character. The sentence "за ставкою € 1." gets tokenized as:
["за", "ставкою", "1", "."]
and the € character is left out completely. This pull request fixes this problem by covering all non-whitespace characters in the WORD_TOKENIZATION_RULES regex.
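As a rough sketch of the idea (using a simplified stand-in pattern, not the repository's actual WORD_TOKENIZATION_RULES), adding a final catch-all alternative for non-whitespace characters guarantees that every symbol survives tokenization:

```python
import re

# Simplified stand-in for WORD_TOKENIZATION_RULES with a trailing catch-all:
# words, listed punctuation, and finally any other non-whitespace character,
# which becomes a token of its own instead of being dropped.
FIXED_RULES = re.compile(r"[\w'’-]+|[.,;:!?«»()]|\S")

print(FIXED_RULES.findall("за ставкою € 1."))
# ['за', 'ставкою', '€', '1', '.']  (the € symbol is now preserved)
```

The ordering matters: the catch-all must be the last alternative so that the more specific word and punctuation rules are tried first.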