Russian pos tagging/lemmatization/morphological analysis fails with diacritics #12530
Comments
Thanks for the note, we'll take a look!
The suggestion for the lemmatizer is included in #12554.
For the poor tagging, etc. that the statistical models produce for tokens with diacritics, I think the best option would be to configure a custom language subclass: https://spacy.io/usage/linguistic-features#language-subclass
The defaults would be extended similar to this: spaCy/spacy/lang/ru/__init__.py, lines 13 to 23 in 8e6a3d5
Wonderful! Thank you for the quick PR and suggestions. I'm a noob when it comes to spaCy. I'm using it to generate tags on Anki flashcards to study Russian. But, if I understand you correctly, the model I use should be trained with diacritics. Is that correct? I ask because I tried making a custom language and the results were still unsatisfactory (even with a patch similar to #12554).

    # Imports were not part of the posted snippet; these are the assumed ones.
    import re

    import ru_core_news_lg
    import spacy
    import spacy.lang.ru
    from spacy import attrs
    from spacy.lang.char_classes import COMBINING_DIACRITICS
    from spacy.lang.ru import Russian

    DIACRITICS_RE = re.compile(f'[{COMBINING_DIACRITICS}]')

    def norm(s: str) -> str:
        return DIACRITICS_RE.sub('', s.lower())

    def prefix(s: str) -> str:
        return DIACRITICS_RE.sub('', s.lower())[0]

    def suffix(s: str) -> str:
        return DIACRITICS_RE.sub('', s.lower())[-3:]

    # Note: this updates the module-level LEX_ATTRS dict in place.
    ATTR_GETTERS = spacy.lang.ru.LEX_ATTRS
    ATTR_GETTERS.update({
        attrs.NORM: norm,
        attrs.PREFIX: prefix,
        attrs.SUFFIX: suffix,
    })

    class CustomRussianDefaults(Russian.Defaults):
        lex_attr_getters = ATTR_GETTERS

    @spacy.registry.languages("custom_ru")
    class CustomRussian(Russian):
        lang = "custom_ru"
        Defaults = CustomRussianDefaults

    nlp = ru_core_news_lg.load()
    # omitted the patching of _pymorphy_lemmatize
    nlp.lang = 'custom_ru'

Test

    >>> nlp('Я ви́жу му́жа и жену́')[-1].morph
    Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing
    >>> nlp('Я вижу мужа и жену')[-1].morph
    Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing
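
One detail of the getters above may be worth a sketch: `prefix` indexes with `[0]`, which raises `IndexError` if a token is empty once its diacritics are removed (for example, a token consisting only of a combining mark). The standalone version below uses slicing instead. It does not touch spaCy at all; the `_plain` helper and the hard-coded U+0300..U+036F range are assumptions made for illustration, standing in for the `COMBINING_DIACRITICS` pattern used in the thread.

```python
import re

# Assumption: stress marks are combining characters in U+0300..U+036F.
DIACRITICS_RE = re.compile("[\u0300-\u036f]")


def _plain(s: str) -> str:
    """Lowercase `s` and drop combining diacritics (hypothetical helper)."""
    return DIACRITICS_RE.sub("", s.lower())


def norm(s: str) -> str:
    return _plain(s)


def prefix(s: str) -> str:
    # Slicing yields "" for an empty string, whereas indexing with [0] raises.
    return _plain(s)[:1]


def suffix(s: str) -> str:
    return _plain(s)[-3:]


# Print norm, prefix and suffix for two stressed words and a bare accent.
for word in ("жену\u0301", "Му\u0301жа", "\u0301"):
    print(repr(norm(word)), repr(prefix(word)), repr(suffix(word)))
```

These functions have the same signatures as the ones in the snippet above, so they could be registered under `attrs.NORM`, `attrs.PREFIX` and `attrs.SUFFIX` in the same way.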
The language and language defaults really need to be set before the pipeline is loaded at all, but you can test this a bit by modifying the pipeline on-the-fly instead. (A few things may already be cached, so it might not work 100%.)

    nlp = spacy.load("ru_core_news_lg")
    nlp.vocab.lex_attr_getters.update(...)

A cleaner version would basically make a copy of
I had the same problem and discovered at least a workaround: It's half as fast, but it does work.
It seems that while there is support for tokenization with diacritics in spaCy, the project doesn't lemmatize/morph/pos tag correctly when they are used.
How to reproduce the behaviour
If the text is changed to remove the diacritics, all is well.
pymorphy3/pymorphy2 doesn't handle diacritics
It seems pymorphy3/2 doesn't handle diacritics, so perhaps diacritics should be removed before parse is called.
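
A minimal sketch of the suggestion above, stripping stress marks from a string before it is handed to an analyzer. It uses only the standard library; the function name `strip_diacritics` and the U+0300..U+036F range are assumptions for illustration, and the actual call to pymorphy's `parse` is not shown.

```python
import re
import unicodedata

# Assumption: the marks to remove are combining characters in U+0300..U+036F.
COMBINING_MARKS_RE = re.compile("[\u0300-\u036f]")


def strip_diacritics(text: str) -> str:
    """Remove combining marks (such as stress accents) from `text`."""
    # Compose first (NFC) so that letters such as "й" and "ё" are stored as
    # single code points and are not damaged when combining marks are removed.
    composed = unicodedata.normalize("NFC", text)
    return COMBINING_MARKS_RE.sub("", composed)


stressed = "Я ви\u0301жу му\u0301жа и жену\u0301"
print(strip_diacritics(stressed))
```

The NFC step is a design choice rather than something the thread prescribes: without it, input that arrives in decomposed form would lose the breve of "й" and the diaeresis of "ё" along with the stress marks. Accents that do have a precomposed Cyrillic form (for example "ѝ") are left in place by this sketch.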