This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

rdt

RDT: Russian Distributional Thesaurus (Русский Дистрибутивный Тезаурус)

This package lets you efficiently use the word graph of the Russian Distributional Thesaurus.

Quickstart

  1. Download the pre-packed resource:
wget http://panchenko.me/data/russe/rdt.pkl
  1. Install dependencies, e.g.:
pip install -r requirements.txt
  1. Load the distributional thesaurus (specify the path to the downloaded 'rdt.pkl' file):
from dt import RDT, DistributionalThesaurus
rdt = RDT(dt_pkl_fpath="rdt.pkl")

Loading takes about 5 minutes, and the resulting structure occupies around 1.3 GB of RAM. This is nevertheless more efficient, in both time and memory, than parsing the CSV file into a dict. The implementation relies on a marisa trie for storing keys and on NumPy arrays for storing similarity scores.
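
To illustrate why this kind of layout is compact, here is a minimal sketch of the general idea: map each word to an integer id, then keep all neighbour ids and scores in flat typed arrays instead of nested Python objects. This is not the package's actual code. A sorted list with `bisect` stands in for the marisa trie, the standard-library `array` module stands in for NumPy, and the class `TinyThesaurus` and its example words and scores are all hypothetical.

```python
from array import array
from bisect import bisect_left


class TinyThesaurus:
    """Toy layout: sorted keys -> integer ids -> slices of flat arrays."""

    def __init__(self, neighbours):
        # neighbours: {word: [(neighbour, score), ...]}
        vocab = set(neighbours)
        for pairs in neighbours.values():
            vocab.update(w for w, _ in pairs)
        self.keys = sorted(vocab)       # stand-in for the trie: word -> id
        self.offsets = array("l", [0])  # block boundaries, one block per word
        self.ids = array("l")           # neighbour ids, all blocks concatenated
        self.scores = array("f")        # 32-bit scores, aligned with self.ids
        for word in self.keys:
            # Store each word's neighbours best-first.
            pairs = sorted(neighbours.get(word, []), key=lambda p: -p[1])
            for neighbour, score in pairs:
                self.ids.append(self._key_id(neighbour))
                self.scores.append(score)
            self.offsets.append(len(self.ids))

    def _key_id(self, word):
        # Binary search over the sorted keys; raises KeyError if absent.
        i = bisect_left(self.keys, word)
        if i == len(self.keys) or self.keys[i] != word:
            raise KeyError(word)
        return i

    def most_similar(self, word):
        i = self._key_id(word)
        start, end = self.offsets[i], self.offsets[i + 1]
        return [(self.keys[self.ids[j]], self.scores[j]) for j in range(start, end)]


# Hypothetical data; scores are chosen to be exactly representable in 32 bits.
toy = TinyThesaurus({
    "graph": [("network", 0.5), ("tree", 0.75)],
    "tree": [("graph", 0.75)],
})
print(toy.most_similar("graph"))  # [('tree', 0.75), ('network', 0.5)]
```

Each neighbour costs one integer and one 32-bit float here, whereas a dict of lists of tuples pays per-object overhead for every entry, which is the kind of saving the paragraph above describes.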

  1. Search for nearest neighbours:
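# w is a neighbouring word, s is its similarity score.
# Python 2 print statement; on Python 3 this would be print(w, s).
for w, s in rdt.most_similar(u"граф"):
    print w, s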
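
If only the strongest neighbours are of interest, the pairs can be post-processed with a small helper. The sketch below assumes only what the loop above shows, namely that `most_similar` yields (word, score) pairs. The helper `top_neighbours` is hypothetical and not part of the package, and the sample pairs are invented placeholders rather than real thesaurus output.

```python
def top_neighbours(pairs, k=5, min_score=0.0):
    """Return up to k (word, score) pairs with score >= min_score, best first."""
    kept = [(w, s) for w, s in pairs if s >= min_score]
    kept.sort(key=lambda p: p[1], reverse=True)
    return kept[:k]


# Placeholder pairs standing in for rdt.most_similar(...); words and scores are invented.
fake_results = [("network", 0.5), ("tree", 0.75), ("chart", 0.25)]
print(top_neighbours(fake_results, k=2, min_score=0.3))
# [('tree', 0.75), ('network', 0.5)]
```

With a loaded thesaurus, the call would look like `top_neighbours(rdt.most_similar(u"граф"), k=10)`.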