Morphological dictionary and multi-word tokens #99
Currently that would be non-trivial to do (simply because of how the implementation works). There are two parts to the problem you mention:
I have no suggestions for how the dictionary should look. In the future, I would prefer to allow specifying morphological analyses for the words themselves (so that any morphological analysis system can be used, not just a flat plain-text file), so I am not planning to extend the flat morphological file myself.
(First of all, congrats on UDPipe, it's a pleasure to use!)
I've built a morphological generator for an endangered language, and I'm having it save its output in the tab-separated FORM,LEMMA,UPOS,XPOS,FEATS format so that I can also use it with UDPipe. Is there any way of supporting multi-word tokens in that format, such that UDPipe will take them into account? Taking French as an example, I mean a way to specify that au should be split and tagged like so:
If not, I will probably need to extend this format on my own. Do you have any suggestions for a way to do this which could be backwards compatible with the format used by UDPipe?
I was thinking of something like the following:
It does seem awfully verbose though...
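As a rough illustration of the question (not the format proposed above, and not something UDPipe is known to support), here is a minimal Python sketch that reads the five-column FORM, LEMMA, UPOS, XPOS, FEATS lines and renders a contraction in CoNLL-U multi-word-token style. The `SPLITS` side table and the helper names `parse_line` and `to_conllu` are hypothetical, and the tags for au follow the usual UD French analysis (à as ADP, le as DET).

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str


def parse_line(line: str) -> Entry:
    """Parse one FORM<TAB>LEMMA<TAB>UPOS<TAB>XPOS<TAB>FEATS dictionary line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    return Entry(*fields)


# Hypothetical side table (an assumption, not a UDPipe feature):
# surface form -> the syntactic words it should be split into.
SPLITS = {
    "au": [
        parse_line("à\tà\tADP\t_\t_"),
        parse_line(
            "le\tle\tDET\t_\t"
            "Definite=Def|Gender=Masc|Number=Sing|PronType=Art"
        ),
    ],
}


def to_conllu(form: str, start: int = 1) -> List[str]:
    """Render a multi-word token as a CoNLL-U range line plus one line per word."""
    words = SPLITS[form]
    end = start + len(words) - 1
    # Range line: ID and FORM are filled, the other eight columns are "_".
    rows = ["\t".join([f"{start}-{end}", form] + ["_"] * 8)]
    for idx, w in enumerate(words, start):
        # HEAD, DEPREL, DEPS and MISC are left unspecified ("_").
        rows.append(
            "\t".join([str(idx), w.form, w.lemma, w.upos, w.xpos, w.feats]
                      + ["_"] * 4)
        )
    return rows


if __name__ == "__main__":
    print("\n".join(to_conllu("au")))
```

Keeping the splits in a separate table leaves the existing five-column lines untouched, which is one way to stay backwards compatible; whether that suits UDPipe's training pipeline is an open question.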