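The tokenization of MorphoDiTa and UDPipe-1 needs to be considerably improved:

MorphoDiTa: Numbers are tokenized inconsistently: `1.2.3` is tokenized as `1.2 . 3`, `1.2.3.4` as `1.2.3.4` (because it looks like an IP address, meh), and `1.2.3.4.5.6.7` as `1.2.3.4 . 5.6 . 7`.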
UDPipe-1: The purely data-driven tokenizer does not consider Unicode variants (ASCII vs. Unicode hyphens, different variants of quotes, full stops, etc.), does not know what to do with characters not present in the training data, and sometimes produces odd results that surprise users.
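To illustrate the Unicode-variant problem, here is a minimal sketch of how characters could be mapped to coarse classes by their Unicode general category, so that an unseen hyphen or quote variant falls into the same class as the ASCII one. The function name `coarse_class` and the class labels are hypothetical; this is not how UDPipe-1 or any planned implementation works, only one possible way to generalize over variants.

```python
import unicodedata


def coarse_class(ch: str) -> str:
    """Map a single character to a coarse class (hypothetical labels)."""
    cat = unicodedata.category(ch)
    if cat == "Pd":                            # dash punctuation: '-', U+2010, U+2013, ...
        return "DASH"
    if cat in ("Pi", "Pf") or ch in "\"'":     # typographic quotes plus ASCII quotes
        return "QUOTE"
    if cat == "Nd":                            # decimal digits
        return "DIGIT"
    if cat[0] == "L":                          # any letter
        return "LETTER"
    if cat[0] == "Z" or ch.isspace():          # separators and other whitespace
        return "SPACE"
    if cat[0] == "P":                          # remaining punctuation, e.g. full stops
        return "PUNCT"
    return "OTHER"


# ASCII hyphen-minus and Unicode HYPHEN (U+2010) end up in the same class.
print(coarse_class("-"), coarse_class("\u2010"))
```

Note that some characters do not fit this simple scheme (for example, low quotation marks have an opening-punctuation category), so a real implementation would need explicit tables.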
The plan is to provide a new implementation based on Unicode Text Segmentation (UAX #29).
We will use it for a rule-based tokenizer, possibly with language-specific tweaks (to support MorfFlex 2.0 tokenization in Czech).
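For the dotted-number cases, the UAX #29 word-boundary rules WB11 and WB12 forbid a break inside a sequence of digits joined by a single `.` or `,`, so `1.2.3` and `1.2.3.4.5.6.7` would each stay one word. The sketch below is a heavily simplified regular-expression approximation of that behavior, not a UAX #29 implementation; the names `TOKEN` and `tokenize` are hypothetical.

```python
import re

# Assumption: a simplified approximation of UAX #29 rules WB8, WB11 and WB12.
#   1st alternative: digits, optionally continued by '.' or ',' followed by digits
#   2nd alternative: a run of word characters
#   3rd alternative: any single non-space, non-word character
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into tokens, keeping dotted numbers together."""
    return TOKEN.findall(text)


print(tokenize("1.2.3"))            # ['1.2.3']
print(tokenize("1.2.3.4.5.6.7"))    # ['1.2.3.4.5.6.7']
print(tokenize("Version 1.2.3."))   # ['Version', '1.2.3', '.']
```

A trailing full stop is still split off, because a `.` is only kept when a digit follows it.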
For a trainable tokenizer, we will:
- allow token boundaries only at the borders of extended grapheme clusters
- use the clusters, words, and sentences as hints to the model being trained, so that only the differences with respect to UAX #29 need to be learned
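The boundary restriction for the trainable tokenizer can be sketched as follows. Python's standard library has no UAX #29 segmenter, so `approx_clusters` is a simplified stand-in that only attaches combining marks to the preceding character; real extended grapheme clusters cover more cases (emoji sequences, Hangul, and so on). Both function names are hypothetical.

```python
import unicodedata


def approx_clusters(text: str) -> list[str]:
    """Simplified stand-in for extended grapheme clusters (assumption):
    combining marks (Mn, Mc, Me) are attached to the preceding character."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def candidate_boundaries(text: str) -> list[int]:
    """Character offsets at which a model would be allowed to place a token boundary."""
    offsets, pos = [], 0
    for cluster in approx_clusters(text):
        pos += len(cluster)
        offsets.append(pos)
    return offsets


# "e" followed by COMBINING ACUTE ACCENT (U+0301) forms one cluster,
# so no boundary is offered between the letter and its accent.
print(candidate_boundaries("re\u0301sume\u0301"))   # [1, 3, 4, 5, 6, 8]
```

A model trained on top of this would only score the listed offsets, which rules out splits inside a cluster by construction.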