Tokenization plans #1

Open
foxik opened this issue Sep 5, 2022 · 0 comments
Labels
planning Planning of future features

Comments

foxik commented Sep 5, 2022

The tokenization of MorphoDiTa and UDPipe-1 needs to be considerably improved:

  • Both MorphoDiTa and UDPipe-1: Currently, we are splitting grapheme clusters (legacy and extended)
  • MorphoDiTa: Numbers are tokenized weirdly: `1.2.3` is tokenized as `1.2 . 3`, `1.2.3.4` as `1.2.3.4` (because it looks like an IP address, meh), and `1.2.3.4.5.6.7` as `1.2.3.4 . 5.6 . 7`
  • UDPipe-1: The purely data-driven tokenizer does not consider Unicode variants (ASCII vs. Unicode hyphens, different variants of quotes, full stops, etc.), does not know what to do with characters not present in the training data, and sometimes the results are weird and surprise users.
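
The number cases above are exactly what UAX #29 word-boundary rules WB11/WB12 handle: a mid-number character such as `.` or `,` does not break a word when it has digits on both sides, so `1.2.3` and `1.2.3.4.5.6.7` each stay one token. A minimal Python sketch of just that one rule (a regex approximation, not a full UAX #29 implementation):

```python
import re

# WB11/WB12 approximation: a run of digits may be joined by a single
# mid-number character ('.' or ',') only when digits follow it, so the
# whole dotted sequence is matched as one token.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

for s in ["1.2.3", "1.2.3.4", "1.2.3.4.5.6.7"]:
    assert NUMBER.fullmatch(s)  # each stays a single token
assert not NUMBER.fullmatch("1.2.")  # trailing '.' is not absorbed
```
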

The plan is to provide a new implementation based on Unicode Text Segmentation (UAX #29).

  • We will use it for a rule-based tokenizer, possibly with language-specific tweaking (to support MorfFlex 2.0 tokenization in Czech)
  • For a trainable tokenizer, we will
    • allow tokens only on the borders of extended grapheme clusters
    • use the clusters, words, and sentences as hints to the model being trained, so that only the differences with respect to UAX #29 need to be learned
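
For illustration, the "token boundaries only on the borders of extended grapheme clusters" constraint can be sketched in Python using combining-class data from the standard library. This is only an approximation of UAX #29 extended grapheme clusters (which also cover e.g. Hangul jamo and emoji ZWJ sequences); the function name is hypothetical:

```python
import unicodedata

def is_cluster_boundary(text: str, i: int) -> bool:
    """Simplified check of whether position i is a legal token-split point.

    A split is illegal if it would separate a base character from a
    following combining mark; real UAX #29 cluster boundaries cover
    more cases than this sketch does.
    """
    if i <= 0 or i >= len(text):
        return True
    return unicodedata.combining(text[i]) == 0

# 'café' in decomposed (NFD) form: 'e' + U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301"
assert not is_cluster_boundary(word, 4)  # would split 'e' from its accent
assert is_cluster_boundary(word, 3)      # between 'f' and 'e' is fine
```

A trainable tokenizer would then only be allowed to propose splits at positions where such a check succeeds.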
@foxik foxik added the planning Planning of future features label Sep 5, 2022