serbian_stemmer

The serbian_stemmer is an Elasticsearch token filter that provides stemming for Bosnian-Croatian-Montenegrin-Serbian (BCMS).

The stemmer converts tokens from Cyrillic script to Latin script, based on the Serbian alphabet mapping. Because it returns BCMS stems only in Latin script, a word indexed in one script can be matched by a query in the other, which allows for cross-script indexing and searching.
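The script conversion can be illustrated with a minimal Python sketch. This is not the plugin's code (the plugin is written in Java); `CYR_TO_LAT` and `to_latin` are hypothetical names used only for illustration, and the table is the standard 30-letter Serbian Cyrillic-to-Latin correspondence. Characters outside that alphabet are passed through unchanged.

```python
# Standard Serbian Cyrillic -> Latin correspondence (30 lowercase letters).
# Illustrative only; the names below are hypothetical, not the plugin's API.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def to_latin(token: str) -> str:
    """Transliterate Serbian Cyrillic letters; leave all other characters as-is."""
    out = []
    for ch in token:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            latin = CYR_TO_LAT[lower]
            # An uppercase Cyrillic letter becomes a capitalized Latin letter
            # or digraph (for example Љ -> Lj).
            out.append(latin.capitalize() if ch != lower else latin)
        else:
            # Not part of the Serbian Cyrillic alphabet: keep unchanged.
            out.append(ch)
    return "".join(out)


print(to_latin("Београд"))  # Beograd
print(to_latin("љубав"))    # ljubav
print(to_latin("щука"))     # щuka (щ is not in the Serbian alphabet)
```

The last call shows how a word containing non-Serbian Cyrillic characters ends up as a mixed-script token under this kind of mapping.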

Analyzer Notes

  • Mixed-script tokens: Cyrillic characters that are not part of the Serbian alphabet are not converted to Latin. This includes some characters from the Russian and Ukrainian alphabets, such as ґ, ё, і, ї, й, щ, ъ, ь, ю, and я. As a result, some Russian or Ukrainian words sent to the serbian_stemmer can generate mixed-script tokens.
  • Diacritics: Serbian dictionaries and encyclopedias often use diacritics (ácute, gràve, double grȁve, mācron, and inverted brȇve) as a pronunciation guide to a word's pitch accent. The serbian_stemmer doesn't currently handle these accents, and they can lead to poor stemming, so they should be removed before stemming.
  • Folding: If you use generic folding (ICU folding conveniently handles both combining and precomposed diacritics), be sure not to fold Ć/ć, Č/č, Đ/đ, Š/š, or Ž/ž, which should be kept distinct from C/c, D/d, S/s, and Z/z.
  • Folding of non-Serbian Cyrillic: Some non-Serbian Cyrillic characters can be folded to Serbian Cyrillic characters (ґ to г, ё to е, й to и), in which case they are then converted to the corresponding Serbian Latin characters.
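
One way to remove pitch-accent marks without folding Ć/ć, Č/č, Đ/đ, Š/š, or Ž/ž is sketched below in Python. It is an assumption-laden illustration, not part of the plugin: `strip_pitch_accents`, `ACCENT_MARKS`, and `ACCENT_BASES` are hypothetical names, and the sketch assumes accent marks only need to be removed from vowels and syllabic r. That restriction matters because ć decomposes to c plus a combining acute, so stripping every acute would also fold ć to c.

```python
import unicodedata

# Combining marks used as pitch-accent and length guides:
# grave, acute, macron, double grave, inverted breve.
ACCENT_MARKS = {"\u0300", "\u0301", "\u0304", "\u030F", "\u0311"}

# Assumption: accents are stripped only from vowels and syllabic r
# (the first six letters are Latin, the last six are Cyrillic).
ACCENT_BASES = set("aeiour" + "аеиоур")


def strip_pitch_accents(text: str) -> str:
    """Remove accent marks from vowels and r; keep ć, č, đ, š, ž intact."""
    out = []
    base = ""  # most recent non-combining character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if ch in ACCENT_MARKS and base.lower() in ACCENT_BASES:
                continue  # drop the accent mark
        else:
            base = ch
        out.append(ch)
    # Recompose so that letters such as ć and č are precomposed again.
    return unicodedata.normalize("NFC", "".join(out))


print(strip_pitch_accents("kȕća"))  # kuća
print(strip_pitch_accents("vòda"))  # voda
```

Because the input is decomposed first, the same function handles both precomposed and combining forms of the accents, in either script.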

Implementation History

  • The original Python implementation is "Simple stemmer for Croatian v0.1" by Nikola Ljubešić and Ivan Pandžić, based on a paper by Ljubešić et al.

  • The Python version was ported to Java (along with several other Serbian and Croatian stemmers in the collection "SCStemmers") by Vuk Batanović.

  • SCStemmers includes WEKA integration, and has a dependency on WEKA.

  • SCStemmers adds support for Cyrillic-to-Latin mapping for this originally Croatian stemmer, and better handling of non-BCMS characters. (Serbian is digraphic and uses both Cyrillic and Latin scripts, while Croatian is written only in Latin script. Stemming algorithms can work across BCMS varieties as long as the text is transliterated into the character set the stemmer expects.)

  • A WEKA-free version of the SCStemmers collection was forked by Trey Jones.

  • The WEKA-free version of just the Ljubešić-Pandžić stemmer was wrapped into this Elasticsearch plugin by Trey Jones to provide the serbian_stemmer filter.

  • Only the Ljubešić-Pandžić stemmer was ported because it performed the best on stemming corpora from the Serbian-language Wikipedia and Wiktionary projects.