Preserve all non-whitespace characters in the input sentence #12

Open · kukas wants to merge 2 commits into master

Conversation

@kukas commented Nov 1, 2023

The tokenizer omits some characters that are not covered by the WORD_TOKENIZATION_RULES regex. One example is the € character.
The sentence "за ставкою € 1." gets tokenized as ["за", "ставкою", "1", "."], and the € character is left out completely.

This pull request fixes this problem by covering all non-whitespace characters in the WORD_TOKENIZATION_RULES regex.
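
For reference, a minimal reproduction sketch. It assumes the package is importable as tokenize_uk and exposes a tokenize_words function; adjust the import if the actual entry point differs.

```python
# Minimal reproduction, assuming tokenize_uk exposes tokenize_words.
from tokenize_uk import tokenize_words

print(tokenize_words("за ставкою € 1."))
# Before this change: ['за', 'ставкою', '1', '.']        (the '€' is silently dropped)
# After this change:  ['за', 'ставкою', '€', '1', '.']
```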

@dchaplinsky (Contributor) commented

I spoke to @arysin on that matter.
First of all, it would be great if you could align the current Java implementation with the Python port.

For example, the corresponding code for the regex you are altering is here: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/java/org/languagetool/tokenizers/uk/UkrainianWordTokenizer.java#L39

(the tests were also updated: languagetool-org/languagetool@95b83d7)

His idea is that making it match all non-space characters is far too broad and might cause unwanted consequences on other samples (not covered by tests). I'll ask him to comment here too.

@arysin commented Nov 2, 2023

I have an extensive set of tests (unit tests and multi-million-token tests), but they run against LanguageTool, as mentioned above.
If I could see the change based on the latest Java code, I could run these tests and provide more detailed feedback.

@kukas (Author) commented Nov 24, 2023

Hello, thank you for your comments and sorry for the late response.
Sadly, I do not have the capacity to set up a Java development environment, make the corresponding changes in the Java code, and test them there as well.
But I would like to address the concern about the broadness of the edited regex.
The regex WORD_TOKENIZATION_RULES is made up of nine subexpressions joined by disjunction. I edited the subexpression that matches single-character substrings only; after the change it is [^\s], and it still consumes exactly one non-whitespace character at a time. Because the regex tries to match the longest possible substring, this change cannot affect matches longer than one character, which are covered by the other subexpressions. For single-character matches, the modified subexpression merely broadens the set of characters that end up as one-character tokens in the resulting tokenization. So instead of relying on the previous manual enumeration of allowed one-character tokens (["#$%&*+,/:;<=.....]), we now match any one-character token that is not whitespace ([^\s]).

The manual enumeration has the unfortunate property that any character not in the list is simply absent from the tokenized output. This includes fairly common characters such as the euro sign €, the degree sign °, or the multiplication sign ×, all of which appear in my own test data.
The change I propose keeps these characters as one-character tokens in the output and does not affect the multi-character tokens produced by the other regex subexpressions.
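
To illustrate the argument on a toy scale, the snippet below uses a simplified stand-in for WORD_TOKENIZATION_RULES (not the actual rule): one multi-character alternative plus a single-character fallback. Broadening only that fallback changes which single characters survive, while the longer matches stay identical.

```python
import re

# Toy stand-in for WORD_TOKENIZATION_RULES: a word-like alternative plus a
# single-character fallback. This is not the real rule, only the same shape.
OLD = re.compile(r"[\w']+|[#$%&*+,/:;<=>@^_.!?]")  # enumerated fallback, no '€'
NEW = re.compile(r"[\w']+|[^\s]")                  # proposed fallback

text = "за ставкою € 1."
print(OLD.findall(text))  # ['за', 'ставкою', '1', '.']       -> '€' dropped
print(NEW.findall(text))  # ['за', 'ставкою', '€', '1', '.']  -> every non-space kept
```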

@arysin commented Dec 15, 2023

Sorry, it's hard for me to comment on this change, as the original regexp in this code is very different from the tokenizer in LanguageTool that I maintain. I agree that symbols should not be dropped from tokenized text, and from what I tried, the regex in LanguageTool already passes all the new tests above.
I'd say somebody who can run the Python module on a significant amount of text should do the testing and comment.

@arysin commented Dec 15, 2023

BTW, there is a small Python wrapper for the LT/nlp_uk modules: https://github.com/brown-uk/nlp_uk/tree/master/src/main/python
I know it's not very kosher, as it invokes Groovy/Java, but it should provide very high-quality tokenization and tagging (with lots of options).
