Russian pos tagging/lemmatization/morphological analysis fails with diacritics #12530

mtak- · 2023-04-16T12:58:21Z

It seems that while there is support for tokenization with diacritics in spaCy, the project doesn't lemmatize/morph/pos tag correctly when they are used.

How to reproduce the behaviour

import ru_core_news_lg
nlp = ru_core_news_lg.load()
doc = nlp('Я ви́жу му́жа и жену́')
print(doc[-1].pos_) # PROPN (incorrect; it is a common noun, so NOUN)
print(doc[-1].lemma_) # жену́ (incorrect; should be жена)
print(doc[-1].morph) # prints nothing (incorrect; the morphological features are missing)

If the diacritics are removed first, all is well:

import re
from spacy.lang.char_classes import COMBINING_DIACRITICS
diacritics_re = re.compile(f'[{COMBINING_DIACRITICS}]')
doc = nlp(diacritics_re.sub('', 'Я ви́жу му́жа и жену́'))

print(doc[-1].pos_) # NOUN
print(doc[-1].lemma_) # жена
print(doc[-1].morph) # Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing

pymorphy3/pymorphy2 doesn't handle diacritics

It seems pymorphy3/pymorphy2 doesn't handle diacritics, so perhaps they should be removed before parse is called:

diacritics_re = re.compile(f'[{COMBINING_DIACRITICS}]')
text = diacritics_re.sub('', token.text)
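A minimal standalone sketch of the stripping step, using only the standard library. It assumes the stress marks fall in U+0300 to U+036F, which is the range I believe spaCy's COMBINING_DIACRITICS covers; the name strip_combining is made up for illustration. One thing worth noting: the text should not be NFD-normalized first, because й and ё would then be split into a base letter plus a combining mark and lose it too.

```python
import re
import unicodedata

# Assumption: same range as spaCy's COMBINING_DIACRITICS (U+0300-U+036F).
COMBINING_RE = re.compile("[\u0300-\u036f]")


def strip_combining(text: str) -> str:
    """Remove combining marks that are present as separate code points."""
    return COMBINING_RE.sub("", text)


# Stress marks written as U+0301 escapes so they are unambiguous.
stressed = "Я ви\u0301жу му\u0301жа и жену\u0301"
plain = strip_combining(stressed)
print(plain)

# Precomposed й (U+0439) and ё (U+0451) carry no separate combining mark,
# so they pass through unchanged.
precomposed = "\u043c\u043e\u0439 \u0451\u0436"
kept = strip_combining(precomposed)

# NFD splits й into и (U+0438) + U+0306, so stripping after NFD corrupts it.
damaged = strip_combining(unicodedata.normalize("NFD", "\u0439"))
```

Applying the substitution to the text as written, rather than to a decomposed form, keeps the precomposed Cyrillic letters intact.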


adrianeboyd · 2023-04-17T08:38:51Z

Thanks for the note, we'll take a look!

adrianeboyd · 2023-04-20T09:22:48Z

The suggestion for the lemmatizer is included in #12554.

For the poor tagging, etc. with statistical models for the tokens with diacritics, I think the best option would be to configure custom NORM, PREFIX, and SUFFIX features for ru and uk that strip diacritics. If you want to try this out with the current spaCy release (v3.5), you could use a custom language to customize these methods, which are set as lex_attr_getters in the language defaults, similar to this:

https://spacy.io/usage/linguistic-features#language-subclass

The defaults would be extended similar to this:

spaCy/spacy/lang/ru/__init__.py

Lines 13 to 23 in 8e6a3d5

    
class RussianDefaults(BaseDefaults):
    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
    lex_attr_getters = LEX_ATTRS
    stop_words = STOP_WORDS
    suffixes = COMBINING_DIACRITICS_TOKENIZER_SUFFIXES
    infixes = COMBINING_DIACRITICS_TOKENIZER_INFIXES


class Russian(Language):
    lang = "ru"
    Defaults = RussianDefaults

mtak- · 2023-04-20T14:44:04Z

Wonderful! Thank you for the quick PR and suggestions.

I'm a noob when it comes to spaCy. I'm using it to generate tags on Anki flashcards to study Russian. But, if I understand you correctly, the model I use should be trained with diacritics. Is that correct (i.e. ru_core_news_lg will not work)?

I ask because I tried making a custom language and the results were still unsatisfactory (even with a patch similar to #12554).

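import re

import ru_core_news_lg
import spacy
import spacy.lang.ru
from spacy import attrs
from spacy.lang.char_classes import COMBINING_DIACRITICS
from spacy.lang.ru import Russian

DIACRITICS_RE = re.compile(f'[{COMBINING_DIACRITICS}]')

# Lexical attribute getters that strip combining diacritics first
def norm(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())
def prefix(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())[0]
def suffix(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())[-3:]

# Not a copy: this is the same dict object that spacy.lang.ru exposes,
# so update() modifies it in place
ATTR_GETTERS = spacy.lang.ru.LEX_ATTRS
ATTR_GETTERS.update({
    attrs.NORM: norm,
    attrs.PREFIX: prefix,
    attrs.SUFFIX: suffix,
})

class CustomRussianDefaults(Russian.Defaults):
    lex_attr_getters = ATTR_GETTERS

@spacy.registry.languages("custom_ru")
class CustomRussian(Russian):
    lang = "custom_ru"
    Defaults = CustomRussianDefaults

nlp = ru_core_news_lg.load()
# omitted the patching of _pymorphy_lemmatize
nlp.lang = 'custom_ru'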

Test

>>> nlp('Я ви́жу му́жа и жену́')[-1].morph
Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing
>>> nlp('Я вижу мужа и жену')[-1].morph
Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing

The Animacy for жену́ is inanimate with diacritics, which is incorrect.

adrianeboyd · 2023-04-20T15:33:04Z

The language and language defaults really need to be set before the pipeline is loaded at all, but you can test this a bit by modifying the pipeline on-the-fly instead. (A few things may already be cached so it might not work 100%.)

nlp = spacy.load("ru_core_news_lg")
nlp.vocab.lex_attr_getters.update(...)

A cleaner version would basically make a copy of ru_core_news_lg where [nlp.lang] is edited to custom_ru. But with the above you should be able to test most things out. And keep in mind that the statistical models will still make mistakes, especially for ambiguous cases.

Vuizur · 2024-02-23T22:42:05Z

I had the same problem and discovered at least a workaround:
One can create two docs, one with the original stressed text, and one with the text with diacritics removed.
That way you can iterate through the docs in parallel, getting the correct (stressed) text from doc 1 while getting the grammatical information from doc 2.

It's half as fast, but it does work.
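A rough sketch of that two-pass idea, in case it helps. The function name pair_stressed_with_plain and the analyze parameter are made up for illustration; with spaCy one would pass the loaded pipeline (nlp) as analyze, and the sketch assumes both passes produce the same number of tokens. A whitespace split stands in for the pipeline here so the example runs without a model.

```python
import re
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Assumption: the stress marks fall in U+0300-U+036F.
COMBINING_RE = re.compile("[\u0300-\u036f]")


def pair_stressed_with_plain(
    text: str, analyze: Callable[[str], Sequence[T]]
) -> List[Tuple[T, T]]:
    """Analyze `text` with and without diacritics and pair tokens by position."""
    stressed = list(analyze(text))
    plain = list(analyze(COMBINING_RE.sub("", text)))
    if len(stressed) != len(plain):
        # Positional alignment only works if both passes tokenize identically.
        raise ValueError("tokenizations differ; cannot align by position")
    return list(zip(stressed, plain))


# Stand-in analyzer (whitespace split) instead of a spaCy pipeline.
pairs = pair_stressed_with_plain(
    "Я ви\u0301жу му\u0301жа и жену\u0301", str.split
)
for original, stripped in pairs:
    print(original, stripped)
```

With a real pipeline, the first element of each pair would supply the stressed text and the second the lemma, POS tag and morphology.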

mtak- changed the title ~~Russian lemmatization/morphological analysis fails with diacritics~~ Russian pos tagging/lemmatization/morphological analysis fails with diacritics Apr 16, 2023

adrianeboyd added the labels lang / ru (Russian language data and models) and lang / uk (Ukrainian language data and models) Apr 17, 2023

adrianeboyd linked a pull request that will close this issue Apr 20, 2023: Strip diacritics for pymorphy lemmatizer #12554 (draft, 3 tasks)

Vuizur mentioned this issue Feb 24, 2024: Fix word wise for stressed Russian epubs xxyzz/WordDumb#192 (closed)
