Lemmatization pipeline moved to new repo, researched use of Token.tag in lemmatizer #6
While I agree this may be too much work, I'd like to know whether the NKJP information we're trying to put in there is something we could fit into the morphology parameter. The spaCy docs say that this should also be in UD format, which is described here. Given that most of these features are inflection-related, I believe we should be able to extract at least some of them from NKJP tags. By using more than just the UD POS we should get better speed (a shorter tree to search), and if our rule-to-tag assignment is correct, this would likely also give us better accuracy.
That is one issue; the other is the complexity of our change: we'd have to introduce a class deriving from the base lemmatizer, since the base one uses So, the questions I'm asking are:
Update: we're leaving the full NKJP POS tags unused for now, but let's move the pipeline to the new repo anyway, so that it can be extended when necessary.
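For reference, if the pipeline is later extended to use the full NKJP tags, the NKJP-to-UD feature extraction discussed above might look roughly like the sketch below. The mapping table is partial and hand-written for illustration, the function names are hypothetical, and none of this is code from the pipeline itself; it only assumes colon-separated NKJP tags such as `subst:sg:nom:m1`.

```python
# Partial, illustrative mapping from NKJP tag values to UD features.
# Assumption: only number, case and gender are covered here.
NKJP_VALUE_TO_UD = {
    "sg": {"Number": "Sing"},
    "pl": {"Number": "Plur"},
    "nom": {"Case": "Nom"},
    "gen": {"Case": "Gen"},
    "dat": {"Case": "Dat"},
    "acc": {"Case": "Acc"},
    "inst": {"Case": "Ins"},
    "loc": {"Case": "Loc"},
    "voc": {"Case": "Voc"},
    "m1": {"Gender": "Masc", "Animacy": "Hum"},
    "m2": {"Gender": "Masc", "Animacy": "Nhum"},
    "m3": {"Gender": "Masc", "Animacy": "Inan"},
    "f": {"Gender": "Fem"},
    "n": {"Gender": "Neut"},
}


def nkjp_to_ud_feats(tag):
    """Collect UD features from the values that follow the NKJP class."""
    feats = {}
    # The first field is the grammatical class (e.g. "subst"); skip it.
    for value in tag.split(":")[1:]:
        # Values without an entry in the table are silently ignored.
        feats.update(NKJP_VALUE_TO_UD.get(value, {}))
    return feats


def format_feats(feats):
    """Render features as a UD-style string, or "_" when there are none."""
    return "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"


print(format_feats(nkjp_to_ud_feats("subst:sg:nom:m1")))
```

Values not present in the table are dropped rather than raising an error, so the table can be filled in gradually as more NKJP categories are needed.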
Clean up the existing pipeline and move it into the new repository.
Convert lemmatizer rules to the proposed spaCy JSON format and update the PR.