We used the tagger and tokenizer of UDPipe. Some of our files contained the line separator character '\u2028', which was not recognized as a newline. This caused errors in other programs in our pipeline, and also tokenization problems in UDPipe itself:
For example:
17 out. What out. what PRON WP PronType=Int _ _ _ _
Here '\u2028' is placed directly after the end of the sentence.
So it would be really cool if you could add this character to the list of newline characters.
Good catch, the tokenizer does not consider '\u2028' to be a newline character. Furthermore, we do not recognize '\u2029' either -- we should fix both.
We might even consider adding new escape sequences to SpacesAfter: even though the CoNLL-U documentation states that only LF is used as a line separator, some tools might split on \u202[89] regardless. But maybe not... I will think about it.
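To illustrate why some tools might split on these characters: the short Python sketch below compares splitting on LF only with Python's built-in `str.splitlines()`, which also treats U+2028 and U+2029 as line boundaries. The sample string is made up for illustration, and this is not a claim about how any particular parser reads its input.

```python
# Made-up sample text containing LF, U+2028 and U+2029.
sample = "first\u2028second\nthird\u2029fourth"

# Splitting on LF only, which is what a strict CoNLL-U reader would do.
lf_only = sample.split("\n")

# str.splitlines() also breaks on U+2028 and U+2029 (among other characters).
unicode_aware = sample.splitlines()

print(len(lf_only))        # number of lines when only LF counts
print(len(unicode_aware))  # number of lines when Unicode separators count too
```

A tool that reads lines the second way would see extra line breaks in the middle of a CoNLL-U row whenever a token or a MISC value contains one of these characters.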
Thank you for your reply!
Yes, we actually used the UUParser and it does split on '\u2028' and crashes. So escaping would definitely help in that regard.
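As a possible stopgap on the user side, the input text could be normalized before it reaches the tokenizer. The sketch below is only an assumption about what such a preprocessing step might look like: `normalize_line_separators` is a hypothetical helper, not part of UDPipe, and whether LF or a plain space is the better replacement depends on how the rest of the pipeline treats line breaks.

```python
def normalize_line_separators(text: str, replacement: str = "\n") -> str:
    """Replace U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR).

    Hypothetical helper for preprocessing raw text before tokenization.
    """
    return text.replace("\u2028", replacement).replace("\u2029", replacement)


# Example with made-up text: both separators become LF by default.
cleaned = normalize_line_separators("One sentence.\u2028Another one.\u2029Last.")
print(cleaned.splitlines())
```

Passing `replacement=" "` instead keeps everything on one line, which may be preferable if the separators occur inside running text.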