Lemmatization pipeline moved to new repo, researched use of Token.tag in lemmatizer #6

kowaalczyk · 2019-09-23T20:14:46Z

Clean up the existing pipeline and move it into the new repository.
Convert the lemmatizer rules to the proposed spaCy JSON format and update the PR.

MateuszOlko · 2019-09-24T07:56:21Z

spaCy internals pass only the POS tag to the lemmatizer.

To use richer tags, we would have to write our own pipeline component.
Suggestion: drop the idea and use POS.

kowaalczyk · 2019-09-24T13:33:24Z

While I agree this may be too much work, I'd like to know whether the NKJP information we're trying to include could fit into the morphology parameter. The spaCy docs say this should also be in UD format, which is described here.

Since most of these features are inflection-related, I believe we should be able to extract at least some of them from NKJP tags. Using more than just the UD POS should improve speed (a shorter tree to search) and, if our rule-to-tag assignment is correct, would likely improve accuracy as well. That is one issue; the other is the complexity of our change:

We'd have to introduce a class deriving from the base lemmatizer, since the base one calls spacy.lemmatizer.lemmatize, which unfortunately doesn't receive the morphology parameter from spacy.lemmatizer.Lemmatizer.__call__. The right way to do this is to define a PolishLemmatizer class deriving from spacy.lemmatizer.Lemmatizer and reimplement only the __call__ method.
Then, in spacy.lang.pl.PolishDefaults, override the create_lemmatizer classmethod inherited from spacy.language.BaseDefaults.
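
To make the shape of that change concrete, here is a minimal sketch. It deliberately does not import spaCy: `BaseLemmatizer`, `BaseDefaults` and `lemmatize` below are simplified stand-ins for `spacy.lemmatizer.Lemmatizer`, `spacy.language.BaseDefaults` and `spacy.lemmatizer.lemmatize`, and their signatures are assumptions rather than spaCy's actual ones. The rule format and the `(POS, morphology)` lookup key are also hypothetical.

```python
def lemmatize(string, rules):
    """Apply the first matching suffix rule; fall back to the input form."""
    for old, new in rules:
        if string.endswith(old):
            return [string[: len(string) - len(old)] + new]
    return [string]


class BaseLemmatizer:
    """Stand-in for the base lemmatizer (not spaCy's real class)."""

    def __init__(self, rules):
        # rules: {lookup_key: [(old_suffix, new_suffix), ...]}
        self.rules = rules

    def __call__(self, string, univ_pos, morphology=None):
        # Rules are looked up by POS only; morphology is accepted but ignored.
        return lemmatize(string, self.rules.get(univ_pos, []))


class PolishLemmatizer(BaseLemmatizer):
    def __call__(self, string, univ_pos, morphology=None):
        # Hypothetical keying scheme: prefer rules stored under
        # (POS, sorted morphology items), then fall back to POS-only rules.
        if morphology:
            key = (univ_pos, tuple(sorted(morphology.items())))
            if key in self.rules:
                return lemmatize(string, self.rules[key])
        return super().__call__(string, univ_pos, morphology)


class BaseDefaults:
    """Stand-in for the language defaults class (not spaCy's real class)."""

    @classmethod
    def create_lemmatizer(cls, rules):
        return BaseLemmatizer(rules)


class PolishDefaults(BaseDefaults):
    @classmethod
    def create_lemmatizer(cls, rules):
        # The only change: hand out the morphology-aware subclass.
        return PolishLemmatizer(rules)


# Toy rules, for illustration only.
RULES = {
    "NOUN": [("y", "")],
    ("NOUN", (("Case", "Gen"), ("Number", "Plur"))): [("ów", "")],
}
lemmatizer = PolishDefaults.create_lemmatizer(RULES)
```

With these toy rules, calling `lemmatizer("kotów", "NOUN", {"Number": "Plur", "Case": "Gen"})` goes through the morphology-specific rule, while the same call without morphology only sees the POS-level rules. In the real code base the override would have to match whatever signature the installed spaCy version uses for `__call__` and `create_lemmatizer`.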

So, the questions I'm asking are:

What would it take to convert NKJP into UD, and include all inflectional information that we can map to UD?
Do we want to take a shot at doing this, to really get the best possible lemmatizer, or just leave it on the table for someone else to try?
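
As a rough idea of what the first question involves, the sketch below maps a few colon-separated NKJP tag positions to UD features. The mapping table is partial and illustrative, not a verified conversion: the function names are made up for this example, and the choice of `Animacy=Hum/Nhum/Inan` for `m1/m2/m3` is an assumption based on how the Polish UD treebanks encode masculine subgenders. Anything not in the table is reported as unmapped instead of being guessed.

```python
# Partial, illustrative mapping from NKJP tag values to UD features.
NKJP_TO_UD = {
    "sg": {"Number": "Sing"},
    "pl": {"Number": "Plur"},
    "nom": {"Case": "Nom"},
    "gen": {"Case": "Gen"},
    "dat": {"Case": "Dat"},
    "acc": {"Case": "Acc"},
    "inst": {"Case": "Ins"},
    "loc": {"Case": "Loc"},
    "voc": {"Case": "Voc"},
    "m1": {"Gender": "Masc", "Animacy": "Hum"},   # assumed UD encoding
    "m2": {"Gender": "Masc", "Animacy": "Nhum"},  # assumed UD encoding
    "m3": {"Gender": "Masc", "Animacy": "Inan"},  # assumed UD encoding
    "f": {"Gender": "Fem"},
    "n": {"Gender": "Neut"},
}


def nkjp_to_ud_feats(tag):
    """Split an NKJP tag like 'subst:sg:nom:m1' into
    (nkjp_class, ud_features, unmapped_values)."""
    parts = tag.split(":")
    feats = {}
    unmapped = []
    for value in parts[1:]:
        mapped = NKJP_TO_UD.get(value)
        if mapped is None:
            unmapped.append(value)  # keep track of what we could not convert
        else:
            feats.update(mapped)
    return parts[0], feats, unmapped


def format_feats(feats):
    """Render features in the UD 'Name=Value|Name=Value' style, sorted by name."""
    return "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
```

For example, `nkjp_to_ud_feats("subst:sg:nom:m1")` yields the class `subst` plus number, case, gender and animacy features, and a tag carrying values outside the table (such as aspect) returns them in the unmapped list, which gives a quick measure of how much of NKJP a given table covers.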

kowaalczyk · 2019-09-26T15:31:47Z

Update: we're leaving the full NKJP POS tags unused for now, but let's move the pipeline to the new repo anyway so that it can be extended when necessary.

kowaalczyk added the bug Something isn't working label Sep 23, 2019

Gizzio changed the title ~~Fix bugs in lemmatizer rules generation~~ Rewrite lemmatization algorithm to use 'tag' attribute Sep 23, 2019

Gizzio added enhancement New feature or request and removed bug Something isn't working labels Sep 23, 2019

Gizzio assigned MateuszOlko Sep 23, 2019

MateuszOlko added the wontfix This will not be worked on label Sep 24, 2019

kowaalczyk removed the wontfix This will not be worked on label Sep 26, 2019

kowaalczyk changed the title ~~Rewrite lemmatization algorithm to use 'tag' attribute~~ Lemmatization pipeline moved to new repo, researched use of Token.tag in lemmatizer Sep 26, 2019

kowaalczyk assigned Gizzio and unassigned MateuszOlko Oct 2, 2019
