A partial Python port of the Java SRX segmenter originally written by Jarek Lipski.
In a nutshell, it splits text into sentences. Because it is rule-based, you can use it to chop up anything textual.
It ships with segment.srx, a set of segmentation rules for different languages crafted by the great LanguageTool team.
pip3 install git+https://github.com/lang-uk/choppa.git
cat << EOF | python3 -m choppa
Жоден сучасний електронний прилад не обходиться без мікрочипів. Мікрочіп, інакше кажучи, мікросхема - це набір електронних схем на невеликому плоскому шматку кремнію.
EOF
See choppa/main.py for a Python usage example.
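To make the rule-based idea concrete, here is a minimal, self-contained sketch of how SRX-style break and no-break rules interact. It does not use choppa's API: the `Rule` tuple, the `RULES` list, and the `segment` function are hypothetical names made up for this illustration, and the two rules are toy examples rather than anything from segment.srx. It assumes the usual SRX convention that rules are tried in order and the first rule matching at a position decides whether to break there.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    is_break: bool  # True for a break rule, False for a no-break exception
    before: str     # regex that must match the text ending at the position
    after: str      # regex that must match the text starting at the position


# Hypothetical toy rules; exceptions come first so they win over the break rule.
RULES = [
    Rule(False, r"\b(?:Mr|Dr|etc)\.", r"\s"),  # no break after abbreviations
    Rule(True, r"[.?!]+", r"\s+"),             # break after final punctuation
]


def segment(text: str, rules: List[Rule] = RULES) -> List[str]:
    """Split text at every position where the first matching rule is a break."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for rule in rules:
            before_ok = re.search(f"(?:{rule.before})$", text[:pos])
            after_ok = re.match(rule.after, text[pos:])
            if before_ok and after_ok:
                if rule.is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides this position
    segments.append(text[start:])
    return segments


print(segment("Dr. Smith arrived. He was late."))
```

The sketch slices the string at every position, so it is quadratic and only suitable for short inputs. The whitespace after a break stays at the start of the next segment because the break position sits between the "before" and "after" patterns.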
The port currently covers:
- All structures (structures.py) necessary for the parser to operate (Rule, LanguageRule, LanguageMap)
- Abstract, Accurate (legacy), and SrxTextIterator iterators (iterators.py), which segment text into chunks according to the SRX rules
- Extra classes required for the SrxTextIterator (TextManager, RuleManager)
- Some utils (utils.py) for regex mangling
- A SAX-based parser (srx_parser.py) to read SRX rules from XML files (SRX 2.0 only)
- The SrxDocument class (again in srx_parser.py), which allows you to manage rules and cache regexes
- A partial implementation of the Java Matcher class, which is absent in Python
- Tests for everything above (and beyond)
- Additional tests from LanguageTool for the Ukrainian language
- Type hints
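As an illustration of the SAX-based approach to reading SRX 2.0 rules, the sketch below parses a tiny inline SRX document with the standard library's `xml.sax`. It is not the code from srx_parser.py: `SrxSketchHandler` and `SRX_SAMPLE` are hypothetical names, the sample is a trimmed document written for this example, and no schema validation is performed.

```python
import xml.sax
from typing import Dict, List, Optional, Tuple

# A trimmed, hand-written SRX 2.0 sample (raw bytes keep the regex backslashes).
SRX_SAMPLE = rb"""<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="yes"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\b(Mr|Dr)\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="en.*" languagerulename="English"/>
    </maprules>
  </body>
</srx>
"""


class SrxSketchHandler(xml.sax.ContentHandler):
    """Collects (is_break, beforebreak, afterbreak) tuples per language rule."""

    def __init__(self) -> None:
        super().__init__()
        self.rules: Dict[str, List[Tuple[bool, str, str]]] = {}
        self.maps: List[Tuple[str, str]] = []  # (language pattern, rule name)
        self._language = ""
        self._is_break = True
        self._parts = {"beforebreak": "", "afterbreak": ""}
        self._current: Optional[str] = None

    def startElement(self, name, attrs):
        if name == "languagerule":
            self._language = attrs["languagerulename"]
            self.rules[self._language] = []
        elif name == "rule":
            # In SRX, a rule breaks unless break="no" is given.
            self._is_break = attrs.get("break", "yes") != "no"
            self._parts = {"beforebreak": "", "afterbreak": ""}
        elif name in ("beforebreak", "afterbreak"):
            self._current = name
        elif name == "languagemap":
            self.maps.append(
                (attrs["languagepattern"], attrs["languagerulename"])
            )

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._current:
            self._parts[self._current] += content

    def endElement(self, name):
        if name in ("beforebreak", "afterbreak"):
            self._current = None
        elif name == "rule":
            self.rules[self._language].append(
                (self._is_break, self._parts["beforebreak"],
                 self._parts["afterbreak"])
            )


handler = SrxSketchHandler()
xml.sax.parseString(SRX_SAMPLE, handler)
print(handler.rules)
print(handler.maps)
```

A streaming SAX handler never builds the whole XML tree in memory; the handler only keeps the rule tuples it extracts.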
I also pythonized the code to some extent (by removing some setters/getters, snake_casing methods and variables, and adapting data structures).
First and foremost, I would like to thank Jarek for his work and code quality. My project is not original; it just brings the power of the SRX segmenter to the Python world, and it relies entirely on the work done by Jarek.
Please note that only the Accurate and Ultimate iterators are currently implemented (and I have no immediate plans to implement the rest). The Accurate iterator should work well on relatively small documents (i.e. do not use it on multi-GB plaintext corpora!), but it is known to have some bugs. The Ultimate iterator from the original library is also ported; it parses large documents efficiently at the cost of some accuracy (limited look-behind patterns, etc.). If you need other iterators or are keen to optimize that beast, I'm always open to pull requests. Similarly, I've only implemented the SAX reader for rules, and I'm using the xmlschema package for schema validation.
Also, I have no plans to port the UI at all. You can reuse one of the available UIs.
- Python port: Dmytro Chaplynskyi
- Original Java implementation: Jarek Lipski
- Segmentation rules: Daniel Naber, Jaume Ortolà et al (153 contributors!)
- Special thanks to Andriy Rysin, the driving force behind the Ukrainian language in LanguageTool