Fused words in Universal Dependencies #17

Open
dan-zeman opened this issue Aug 31, 2015 · 1 comment

@dan-zeman
Member

From ufal/lindat-corpora-conversions#3 (comment):

I think we need a better representation of fused tokens in Treex. Right now it is only sketched using wild attributes, but a proper representation will probably be needed in the future, since fused tokens are part of the UD guidelines. So we need a less wild solution. Once we have it, we could try to implement directly in Treex the heuristics that collapse fused words whenever desirable. And once we have that, we should probably apply it before exporting data for KonText, because the surface form matters there.

@dan-zeman self-assigned this on Aug 31, 2015
@martinpopel
Member

I agree we need a better (less wild) API for fused (aka multi-word) tokens in Treex.

I am not sure how this will solve the problem in KonText, which can probably display either only tokens or only words. There are scripts distributed with UD (e.g. conllu-w2t.py) for converting the word-indexed CoNLL-U format to other formats.
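
To make the words-versus-tokens distinction concrete, here is a minimal Python sketch of collapsing a word-indexed CoNLL-U sentence to its surface tokens. It is not the conllu-w2t.py script and not Treex code; the function name `surface_tokens` and the sample sentence are made up for illustration. It only assumes the CoNLL-U convention that a multi-word token is listed on a range line (e.g. `1-2`) before the words it covers.

```python
def surface_tokens(conllu_sentence: str) -> list[str]:
    """Return the surface tokens of one CoNLL-U sentence.

    Hypothetical helper: words covered by a range line such as "1-2"
    are replaced by the single surface form given on that range line.
    """
    tokens = []
    covered_until = 0  # last word ID covered by the most recent range line
    for line in conllu_sentence.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        word_id, form = cols[0], cols[1]
        if "-" in word_id:
            # Multi-word token: keep its surface form, remember its span.
            _, end = word_id.split("-")
            tokens.append(form)
            covered_until = int(end)
        elif "." in word_id:
            continue  # empty nodes have no surface form
        elif int(word_id) > covered_until:
            tokens.append(form)  # ordinary word outside any fused span
    return tokens


# Illustrative sample: Czech "abych" is one token split into two words.
EMPTY = "\t".join(["_"] * 8)
SAMPLE = "\n".join([
    "# text = abych přišel",
    "1-2\tabych\t" + EMPTY,
    "1\taby\t" + EMPTY,
    "2\tbych\t" + EMPTY,
    "3\tpřišel\t" + EMPTY,
])

print(surface_tokens(SAMPLE))
```

A tool that can show only one layer would need either this token view or the plain word view (all non-range lines), which is the choice discussed above.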

See also
http://universaldependencies.github.io/docs/cs/overview/tokenization.html
http://universaldependencies.github.io/docs/u/overview/tokenization.html
http://universaldependencies.github.io/docs/format.html#words-and-tokens
