Tokenization plans #1

Open
foxik opened this issue Sep 5, 2022 · 0 comments
Labels
planning Planning of future features

Comments

foxik commented Sep 5, 2022

The tokenization of MorphoDiTa and UDPipe-1 needs to be considerably improved:

  • Both MorphoDiTa and UDPipe-1: Currently, we are splitting grapheme clusters (legacy and extended)
  • MorphoDiTa: Numbers are tokenized weirdly: `1.2.3` is tokenized as `1.2 . 3`, `1.2.3.4` as `1.2.3.4` (because it looks like an IP address, meh), and `1.2.3.4.5.6.7` as `1.2.3.4 . 5.6 . 7`
  • UDPipe-1: The purely data-driven tokenizer does not consider Unicode variants (ASCII vs. Unicode hyphens, different variants of quotes, full stops, etc.), does not know what to do with characters not present in the training data, and sometimes the results are weird and surprise users.
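
The number cases above are exactly what UAX #29 word-boundary rules WB11/WB12 handle: a mid-number character such as `.` or `,` does not break a word when it has digits on both sides, so `1.2.3` and `1.2.3.4.5.6.7` each stay one token. A minimal Python sketch of just that one rule (a regex approximation, not a full UAX #29 implementation):

```python
import re

# WB11/WB12 approximation: a run of digits may be joined by a single
# mid-number character ('.' or ',') only when digits follow it, so the
# whole dotted sequence is matched as one token.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

for s in ["1.2.3", "1.2.3.4", "1.2.3.4.5.6.7"]:
    assert NUMBER.fullmatch(s)  # each stays a single token
assert not NUMBER.fullmatch("1.2.")  # trailing '.' is not absorbed
```
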

The plan is to provide a new implementation based on Unicode Text Segmentation (UAX #29).

  • We will use it for a rule-based tokenizer, possibly with language-specific tweaking (to support MorfFlex 2.0 tokenization in Czech)
  • For a trainable tokenizer, we will
    • allow tokens only on the borders of extended grapheme clusters
    • use the clusters, words, and sentences as hints to the model being trained, so that only the differences with respect to UAX #29 need to be learned
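
For illustration, the "token boundaries only on the borders of extended grapheme clusters" constraint can be sketched in Python using combining-class data from the standard library. This is only an approximation of UAX #29 extended grapheme clusters (which also cover e.g. Hangul jamo and emoji ZWJ sequences); the function name is hypothetical:

```python
import unicodedata

def is_cluster_boundary(text: str, i: int) -> bool:
    """Simplified check of whether position i is a legal token-split point.

    A split is illegal if it would separate a base character from a
    following combining mark; real UAX #29 cluster boundaries cover
    more cases than this sketch does.
    """
    if i <= 0 or i >= len(text):
        return True
    return unicodedata.combining(text[i]) == 0

# 'café' in decomposed (NFD) form: 'e' + U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301"
assert not is_cluster_boundary(word, 4)  # would split 'e' from its accent
assert is_cluster_boundary(word, 3)      # between 'f' and 'e' is fine
```

A trainable tokenizer would then only be allowed to propose splits at positions where such a check succeeds.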
@foxik foxik added the planning Planning of future features label Sep 5, 2022