Preserve all non-whitespace characters in the input sentence #12

Open · kukas wants to merge 2 commits into master

Conversation

@kukas commented Nov 1, 2023

The tokenizer omits some characters that are not covered by the WORD_TOKENIZATION_RULES regex. One example is the € character.
The sentence "за ставкою € 1." gets tokenized as ["за", "ставкою", "1", "."], and the € character is left out completely.

This pull request fixes this problem by covering all non-whitespace characters in the WORD_TOKENIZATION_RULES regex.
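
For reference, a minimal reproduction sketch. It assumes the package is importable as tokenize_uk and exposes a tokenize_words function; adjust the import if the actual entry point differs.

```python
# Minimal reproduction, assuming tokenize_uk exposes tokenize_words.
from tokenize_uk import tokenize_words

print(tokenize_words("за ставкою € 1."))
# Before this change: ['за', 'ставкою', '1', '.']        (the '€' is silently dropped)
# After this change:  ['за', 'ставкою', '€', '1', '.']
```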

@dchaplinsky (Contributor) commented

I spoke to @arysin on that matter.
First of all, it would be great if you could align the current Java implementation with the Python port.

For example, the corresponding code for the regex you are altering is here: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/java/org/languagetool/tokenizers/uk/UkrainianWordTokenizer.java#L39

(the tests were also updated: languagetool-org/languagetool@95b83d7)

His idea is that making it match all non-space characters is far too broad and might cause unwanted consequences on other samples (not covered by tests). I'll ask him to comment here too.

@arysin commented Nov 2, 2023

I have an extensive set of tests (unit tests and multi-million-token tests), but they run against LanguageTool, as mentioned above.
If I could see the change based on the latest Java code, I could run these tests and provide more detailed feedback.

@kukas (Author) commented Nov 24, 2023

Hello, thank you for your comments and sorry for the late response.
Sadly, I do not have the capacity to set up a Java development environment, make the corresponding changes in the Java code, and test them there as well.
But I would like to address the concern about the broadness of the edited regex.
The regex WORD_TOKENIZATION_RULES is made up of nine subexpressions joined by disjunction. I edited the subexpression that matches single-character substrings only; after the change it is [^\s], and it still consumes exactly one non-whitespace character at a time. Because the regex tries to match the longest possible substring, this change cannot affect matches longer than one character, which are covered by the other subexpressions. For single-character matches, the modified subexpression merely broadens the set of characters that end up as one-character tokens in the resulting tokenization. So instead of relying on the previous manual enumeration of allowed one-character tokens (["#$%&*+,/:;<=.....]), we now match any one-character token that is not whitespace ([^\s]).

The manual enumeration has the unfortunate property that any character not in the list is simply absent from the tokenized output. This includes fairly common characters such as the euro sign €, the degree sign °, or the multiplication sign ×, all of which appear in my own test data.
The change I propose keeps these characters as one-character tokens in the output and does not affect the multi-character tokens produced by the other regex subexpressions.
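
To illustrate the argument on a toy scale, the snippet below uses a simplified stand-in for WORD_TOKENIZATION_RULES (not the actual rule): one multi-character alternative plus a single-character fallback. Broadening only that fallback changes which single characters survive, while the longer matches stay identical.

```python
import re

# Toy stand-in for WORD_TOKENIZATION_RULES: a word-like alternative plus a
# single-character fallback. This is not the real rule, only the same shape.
OLD = re.compile(r"[\w']+|[#$%&*+,/:;<=>@^_.!?]")  # enumerated fallback, no '€'
NEW = re.compile(r"[\w']+|[^\s]")                  # proposed fallback

text = "за ставкою € 1."
print(OLD.findall(text))  # ['за', 'ставкою', '1', '.']       -> '€' dropped
print(NEW.findall(text))  # ['за', 'ставкою', '€', '1', '.']  -> every non-space kept
```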

@arysin commented Dec 15, 2023

Sorry, it's hard for me to comment on this change, as the original regexp in this code is very different from the tokenizer in LanguageTool that I maintain. I agree that symbols should not be dropped from tokenized text, and from what I tried, the regex in LanguageTool already passes all the new tests above.
I'd say somebody who can run the Python module on a significant amount of text should do the testing and comment.

@arysin commented Dec 15, 2023

BTW, there is a small Python wrapper for the LT/nlp_uk modules: https://github.com/brown-uk/nlp_uk/tree/master/src/main/python
I know it's not very kosher, as it invokes Groovy/Java, but it should provide very high-quality tokenization and tagging (with lots of options).
