'\u2028' not recognized in SpacesAfter #103

maxtrem · 2019-06-12T11:57:46Z

We used the tagger and tokenizer of UDPipe. Some of our files contained the newline character '\u2028' (LINE SEPARATOR), which wasn't recognized as a newline. This led to errors in other programs in our pipeline, and also to tokenization problems in UDPipe itself.
For example:

17 out. What out. what PRON WP PronType=Int _ _ _ _

Here '\u2028' comes right after the end of the sentence, between "out." and "What".

So it would be really cool if you could add this character to the list of newline characters.

foxik · 2019-06-12T12:41:13Z

Good catch, the tokenizer does not consider '\u2028' to be a newline character. Furthermore, we do not recognize '\u2029' either -- we should fix both.

We might even consider adding new escape sequences to SpacesAfter: even though the CoNLL-U documentation states that only LF is used as a line separator, some tools might split on \u202[89] regardless. But maybe not... I will think about it.
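
To illustrate why a tool might split on these characters regardless, here is a minimal Python sketch. It is not taken from UDPipe or any other tool in this thread; the token line is modeled on the example above, assuming the gap inside the form and lemma is a raw U+2028. Python's built-in `str.splitlines()` treats U+2028 and U+2029 as line boundaries, while splitting on `"\n"` does not.

```python
# A CoNLL-U-style token line whose FORM and LEMMA contain a raw U+2028
# (assumption: this mirrors the example token from the report).
line = "17\tout.\u2028What\tout.\u2028what\tPRON\tWP\tPronType=Int\t_\t_\t_\t_"
text = line + "\n"

# Splitting strictly on LF keeps the token line intact.
lf_lines = text.split("\n")[:-1]

# str.splitlines() also breaks on U+2028 and U+2029,
# so the same token line falls apart into several pieces.
unicode_lines = text.splitlines()

print(len(lf_lines))       # 1
print(len(unicode_lines))  # 3
```

A reader that relies on `splitlines()`, or on an equivalent Unicode-aware line iterator, would see a malformed line with too few columns here.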

maxtrem · 2019-06-13T10:19:53Z

Thank you for your reply!
Yes, we actually used UUParser, and it does split on '\u2028' and then crashes. So escaping would definitely help in that regard.

foxik · 2019-06-13T11:07:57Z

Thanks for the feedback, escaping it is then :-)
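
As a rough sketch of what such escaping could look like, the Python below replaces raw U+2028 and U+2029 with textual escape sequences and reverses the mapping. The function names `escape_spaces` and `unescape_spaces` and the escape strings `\u2028` and `\u2029` are hypothetical choices made for illustration; the thread does not specify which sequences UDPipe would actually use.

```python
# Hypothetical escape table: not UDPipe's actual scheme.
# The backslash itself is escaped so the mapping stays reversible.
ESCAPES = {
    "\\": "\\\\",
    "\u2028": "\\u2028",
    "\u2029": "\\u2029",
}
UNESCAPES = {escaped: raw for raw, escaped in ESCAPES.items()}


def escape_spaces(value: str) -> str:
    """Replace each character in the table with its escape sequence."""
    return "".join(ESCAPES.get(ch, ch) for ch in value)


def unescape_spaces(value: str) -> str:
    """Invert escape_spaces by scanning left to right for escape sequences."""
    out = []
    i = 0
    while i < len(value):
        for escaped, raw in UNESCAPES.items():
            if value.startswith(escaped, i):
                out.append(raw)
                i += len(escaped)
                break
        else:
            # No escape sequence starts here: copy the character as-is.
            out.append(value[i])
            i += 1
    return "".join(out)


escaped = escape_spaces("\u2028")
print(escaped)                                 # \u2028 (six ASCII characters)
print(unescape_spaces(escaped) == "\u2028")    # True
```

With the separators written as plain ASCII, the value stays on a single line even for tools that split on Unicode line separators.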

foxik mentioned this issue Jan 19, 2023

SpacesAfter= for unbreakable spaces etc. UniversalDependencies/docs#917

Closed
