Lemmatization pipeline moved to new repo, researched use of Token.tag in lemmatizer #6

Open
kowaalczyk opened this issue Sep 23, 2019 · 3 comments
Labels
enhancement New feature or request

Comments

@kowaalczyk
Collaborator

kowaalczyk commented Sep 23, 2019

Clean up the existing pipeline and move it into the new repository.
Convert the lemmatizer rules to the proposed spaCy JSON format and update the PR.

@kowaalczyk kowaalczyk added the bug Something isn't working label Sep 23, 2019
@Gizzio Gizzio changed the title Fix bugs in lemmatizer rules generation Rewrite lemmatization algorithm to use 'tag' attribute Sep 23, 2019
@Gizzio Gizzio added enhancement New feature or request and removed bug Something isn't working labels Sep 23, 2019
@MateuszOlko
Contributor

Screenshot from 2019-09-24 09-39-32
spaCy internals pass only the POS tag to the lemmatizer.

To use richer tags we would have to write our own pipeline component.
Suggestion: drop the idea and use POS.

@MateuszOlko MateuszOlko added the wontfix This will not be worked on label Sep 24, 2019
@kowaalczyk
Collaborator Author

While I agree this may be too much work, I'd like to know whether the NKJP information we're trying to use could fit into the morphology parameter. The spaCy docs say that this should also be in UD format, which is described here.

Given that most of these features are inflection-related, I believe we should be able to extract at least some of them from NKJP tags. Using more than just the UD POS should give us better speed (a shorter tree to search), and if our rule-to-tag assignment is correct, it would also likely give us better accuracy. That is one issue; the other is the complexity of our change:

We'd have to introduce a class deriving from the base lemmatizer, since the base one calls spacy.lemmatizer.lemmatize, which unfortunately doesn't receive the morphology parameter from spacy.lemmatizer.Lemmatizer.__call__. The right way to do this is to define a PolishLemmatizer class deriving from spacy.lemmatizer.Lemmatizer and just reimplement the __call__ method.
Then, in spacy.lang.pl.PolishDefaults, override the create_lemmatizer classmethod inherited from spacy.language.BaseDefaults.
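
To make the shape of that change concrete, here is a rough, self-contained sketch. Everything in it is a stand-in written for illustration: `BaseLemmatizer`, `BaseDefaults` and the module-level `lemmatize` only mimic the structure described above and are not the spaCy classes, and the rule key format (`POS|Feature=Value|...`) and the toy rules are assumptions. The point is only that the subclass's `__call__` uses `morphology` to pick a narrower rule list and falls back to the POS-only list.

```python
# Stand-ins that mimic the structure discussed above. These are NOT spaCy
# classes; names, data and the rule-key format are illustrative assumptions.

def lemmatize(string, index, exceptions, rules):
    """Try exceptions first, then suffix rules; keep candidates found in the index."""
    if string in exceptions:
        return list(exceptions[string])
    forms = []
    for old, new in rules:
        if string.endswith(old):
            candidate = string[: len(string) - len(old)] + new
            if candidate in index and candidate not in forms:
                forms.append(candidate)
    return forms or [string]


class BaseLemmatizer:
    def __init__(self, index=None, exceptions=None, rules=None):
        self.index = index or {}
        self.exc = exceptions or {}
        self.rules = rules or {}

    def __call__(self, string, univ_pos, morphology=None):
        # The base class accepts `morphology` but only forwards POS-keyed data.
        return lemmatize(
            string,
            self.index.get(univ_pos, set()),
            self.exc.get(univ_pos, {}),
            self.rules.get(univ_pos, []),
        )


class PolishLemmatizer(BaseLemmatizer):
    def __call__(self, string, univ_pos, morphology=None):
        # Hypothetical key format: "NOUN|Case=Gen|Number=Sing".
        key = univ_pos
        if morphology:
            feats = "|".join(f"{k}={v}" for k, v in sorted(morphology.items()))
            key = f"{univ_pos}|{feats}"
        # Prefer the narrower, feature-specific rules; fall back to POS-only rules.
        rules = self.rules.get(key, self.rules.get(univ_pos, []))
        return lemmatize(
            string,
            self.index.get(univ_pos, set()),
            self.exc.get(univ_pos, {}),
            rules,
        )


class BaseDefaults:
    @classmethod
    def create_lemmatizer(cls, nlp=None):
        return BaseLemmatizer()


# Toy data: "kota" (genitive) and "kotem" (instrumental) are forms of "kot".
TOY_INDEX = {"NOUN": {"kot"}}
TOY_RULES = {
    "NOUN": [("a", ""), ("em", "")],
    "NOUN|Case=Gen|Number=Sing": [("a", "")],
}


class PolishDefaults(BaseDefaults):
    @classmethod
    def create_lemmatizer(cls, nlp=None):
        return PolishLemmatizer(index=TOY_INDEX, rules=TOY_RULES)


lemmatizer = PolishDefaults.create_lemmatizer()
print(lemmatizer("kota", "NOUN", {"Case": "Gen", "Number": "Sing"}))
print(lemmatizer("kotem", "NOUN"))
```

With real rule sets, the feature-specific lists would be much shorter than the POS-only ones, which is where the expected speedup would come from.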

So, the questions I'm asking are:

  • What would it take to convert NKJP into UD, and include all inflectional information that we can map to UD?
  • Do we want to take a shot at doing this, to really get the best possible lemmatizer, or just leave it on the table for someone else to try?
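
As a starting point for the first question, a minimal sketch of what a tag conversion could look like. The mapping tables below are a small, hand-picked subset covering only some noun and adjective values and should be treated as assumptions to verify against the NKJP tagset and the UD guidelines; the function name `nkjp_to_ud` is made up for this example. Values it cannot map are returned separately so that gaps are visible instead of silently dropped.

```python
# Partial, illustrative mapping from colon-separated NKJP tags to UD POS + features.
# Only a handful of values are covered; treat the tables as assumptions to verify.

NKJP_POS = {
    "subst": "NOUN",
    "adj": "ADJ",
}

NKJP_FEATS = {
    "sg": {"Number": "Sing"},
    "pl": {"Number": "Plur"},
    "nom": {"Case": "Nom"},
    "gen": {"Case": "Gen"},
    "dat": {"Case": "Dat"},
    "acc": {"Case": "Acc"},
    "inst": {"Case": "Ins"},
    "loc": {"Case": "Loc"},
    "voc": {"Case": "Voc"},
    "m1": {"Gender": "Masc", "Animacy": "Hum"},
    "m2": {"Gender": "Masc", "Animacy": "Nhum"},
    "m3": {"Gender": "Masc", "Animacy": "Inan"},
    "f": {"Gender": "Fem"},
    "n": {"Gender": "Neut"},
}


def nkjp_to_ud(tag):
    """Return (ud_pos, ud_feats, unmapped_values) for a tag like 'subst:sg:gen:m2'."""
    parts = tag.split(":")
    ud_pos = NKJP_POS.get(parts[0], "X")  # "X" marks a class this sketch does not cover
    feats, unmapped = {}, []
    for value in parts[1:]:
        if value in NKJP_FEATS:
            feats.update(NKJP_FEATS[value])
        else:
            unmapped.append(value)
    return ud_pos, feats, unmapped


def format_feats(feats):
    """Render features in UD style: alphabetically sorted, '|'-separated."""
    return "|".join(f"{k}={v}" for k, v in sorted(feats.items()))


pos, feats, unmapped = nkjp_to_ud("subst:sg:gen:m2")
print(pos, format_feats(feats), unmapped)
```

Counting how often `unmapped` is non-empty over the corpus would give a rough estimate of how much inflectional information the conversion still loses.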

@kowaalczyk
Collaborator Author

Update: we're leaving the full NKJP POS tags unused for now, but let's move the pipeline to the new repo anyway, so that it can be extended when necessary.

@kowaalczyk kowaalczyk removed the wontfix This will not be worked on label Sep 26, 2019
@kowaalczyk kowaalczyk changed the title Rewrite lemmatization algorithm to use 'tag' attribute Lemmatization pipeline moved to new repo, researched use of Token.tag in lemmatizer Sep 26, 2019
@kowaalczyk kowaalczyk assigned Gizzio and unassigned MateuszOlko Oct 2, 2019