Fused words in Universal Dependencies #17
I agree we need a better (less wild) API for fused (aka multi-word) tokens in Treex. I am not sure how it will solve the problem in KonText, which probably can display either only tokens or only words. There are scripts distributed with UD (e.g. conllu-w2t.py) for converting the CoNLL-U word-indexed format to other formats. See also |
From ufal/lindat-corpora-conversions#3 (comment):
I think we need a better representation of fused tokens in Treex. Currently it is only sketched using wild attributes, but a proper representation will probably be needed in the future, since fused tokens are part of the UD guidelines. So we need a less wild solution. Once we have it, we could try to implement directly in Treex the heuristics that collapse fused words whenever desirable. And once that is in place, we should probably apply it before exporting data for KonText, because the surface form matters there.
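To make the token/word distinction concrete, here is a minimal Python sketch of collapsing fused words back to surface tokens when reading CoNLL-U. It is not the Treex API and not the `conllu-w2t.py` script; the names `surface_tokens`, `syntactic_words`, and `row` are hypothetical and exist only for this illustration. It assumes standard CoNLL-U lines: tab-separated columns with the ID first and the FORM second, range IDs such as `1-2` for fused tokens, and decimal IDs such as `8.1` for empty nodes.

```python
def row(tok_id, form, lemma="_"):
    """Build one 10-column CoNLL-U line (hypothetical helper for the demo)."""
    return "\t".join([tok_id, form, lemma] + ["_"] * 7)


def surface_tokens(sentence_lines):
    """Return surface tokens: a fused token replaces the words it covers."""
    tokens = []
    covered_until = 0  # last word ID covered by the most recent range line
    for line in sentence_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "." in tok_id:
            continue  # empty node, has no surface form
        if "-" in tok_id:
            # Range line such as "1-2": keep the fused form once
            covered_until = int(tok_id.split("-")[1])
            tokens.append(form)
        elif int(tok_id) > covered_until:
            tokens.append(form)  # ordinary word outside any fused range
    return tokens


def syntactic_words(sentence_lines):
    """Return syntactic words: range lines and empty nodes are dropped."""
    words = []
    for line in sentence_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if cols[0].isdigit():
            words.append(cols[1])
    return words


SAMPLE = [
    "# text = vámonos al mar",
    row("1-2", "vámonos"),
    row("1", "vamos", "ir"),
    row("2", "nos", "nosotros"),
    row("3-4", "al"),
    row("3", "a", "a"),
    row("4", "el", "el"),
    row("5", "mar", "mar"),
]

print(surface_tokens(SAMPLE))
print(syntactic_words(SAMPLE))
```

For the sample sentence, the first call yields the three surface tokens and the second the five syntactic words. A tool that can show only one of the two layers would have to pick one of these views at export time.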