Morphological dictionary and multi-word tokens #99

Open
jeanm opened this issue Jun 5, 2019 · 2 comments
jeanm commented Jun 5, 2019

(First of all, congrats on UDPipe, it's a pleasure to use!)

I've built a morphological generator for an endangered language, and it saves its output in the tab-separated FORM, LEMMA, UPOS, XPOS, FEATS format so that I can also use it with UDPipe.

Is there any way of supporting multi-word tokens in that format, such that UDPipe will take them into account? For example, in French, I would like a way to specify that au should be split and tagged like so:

1-2 au _ _ _ _
1 à à ADP ADP _
2 le le DET DET Definite=Def|Gender=Masc|Number=Sing|PronType=Art

If not, I will probably need to extend this format on my own. Do you have any suggestions for a way to do this which could be backwards compatible with the format used by UDPipe?

I was thinking of something like the following:

au   _   _   _   _   SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art

It does seem awfully verbose though...
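
A minimal sketch of how the encoding proposed above could be expanded back into CoNLL-U-style rows. This is only an illustration of the proposal, not something UDPipe supports. The function name `expand_split_entry` is hypothetical, the Split* specification is assumed to sit in the last tab-separated column, and XPOS is left as `_` because the proposal has no field for it.

```python
def expand_split_entry(line):
    """Expand one line in the proposed Split* encoding into CoNLL-U-style rows.

    Returns a list of (ID, FORM, LEMMA, UPOS, XPOS, FEATS) tuples: one range
    row for the multi-word token, followed by one row per syntactic word.
    """
    fields = line.rstrip("\n").split("\t")
    form, split_spec = fields[0], fields[-1]

    # Key=value pairs are separated by "|"; per-word values by "/".
    spec = dict(item.split("=", 1) for item in split_spec.split("|"))
    forms = spec["SplitForm"].split("/")
    lemmas = spec["SplitLemma"].split("/")
    upos_tags = spec["SplitUPos"].split("/")
    # Inside SplitFeats, "," stands in for the usual "|" feature separator.
    feats = [f.replace(",", "|") for f in spec["SplitFeats"].split("/")]

    if not (len(forms) == len(lemmas) == len(upos_tags) == len(feats)):
        raise ValueError("Split* fields describe different numbers of words")

    rows = [(f"1-{len(forms)}", form, "_", "_", "_", "_")]
    for i, (f, l, u, ft) in enumerate(zip(forms, lemmas, upos_tags, feats), 1):
        # Assumption: XPOS is not part of the proposal, so it stays "_".
        rows.append((str(i), f, l, u, "_", ft))
    return rows


entry = ("au\t_\t_\t_\t_\t"
         "SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|"
         "SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art")
for row in expand_split_entry(entry):
    print("\t".join(row))
```

Because FORM and the next four columns stay in place, a reader that only looks at the first five columns would still see a well-formed line, which is one way to read "backwards compatible" here.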

foxik (Member) commented Jun 13, 2019

Currently that would be non-trivial to do (just because of how the implementation works).

There are two parts of the mentioned problem:

  1. The au multi-word token must be split into two words, à and le. Currently UDPipe does that in a very old-fashioned way, by having a dictionary with rules for how the multi-word tokens are split. It would not be difficult to allow adding additional rules, either during training or during inference.
  2. Run morphological analysis on the resulting words. UDPipe currently does not distinguish tokens and multi-word tokens, so the analyses for à are the same regardless of whether it was a token or a part of a multi-word token, but of course that could be modified.
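
The two parts above can be sketched as follows. This is a toy illustration under stated assumptions, not UDPipe's actual implementation: `SPLIT_RULES`, `ANALYSES`, `split_token` and `analyze` are hypothetical names, and the dictionaries hold only the French example from this thread.

```python
# Part 1 (hypothetical): split rules mapping a multi-word token to its words.
SPLIT_RULES = {"au": ["à", "le"]}

# Part 2 (hypothetical): morphological dictionary, FORM -> [(LEMMA, UPOS, FEATS)].
ANALYSES = {
    "à": [("à", "ADP", "_")],
    "le": [("le", "DET", "Definite=Def|Gender=Masc|Number=Sing|PronType=Art")],
}


def split_token(token, rules=SPLIT_RULES):
    """Return the words a token is split into, or the token itself if no rule applies."""
    return rules.get(token, [token])


def analyze(token, rules=SPLIT_RULES, analyses=ANALYSES):
    """Split the token, then look up each resulting word in the dictionary.

    The lookup does not know whether a word came from a multi-word token,
    so "à" receives the same analyses in both situations.
    """
    return [(word, analyses.get(word, [])) for word in split_token(token, rules)]


# Additional rules can be supplied by extending the rule table.
extra_rules = dict(SPLIT_RULES, du=["de", "le"])

print(analyze("au"))
print(analyze("à"))
print(split_token("du", extra_rules))
```

Making the analyses depend on context would mean passing extra information (for example, the surrounding multi-word token) into the lookup, which this sketch deliberately does not do.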

I have no suggestions as to what the dictionary should look like. In the future, I would prefer to allow specifying morphological analyses for the words themselves (so that any morphological analysis system can be used, not just a flat plain text file), so I am not planning to extend the flat morphological file myself.

ftyers commented May 22, 2020

There are some other issues that relate to this: #63 and #50
