"fixedHeader": false,
"scrollX": true,
"scrollY": '80vh',
"class": "display"
});
} );
name | Meta: github | Corpora | Text processing: Splitting | Text processing: Parsing | Text processing: Coreference resolution | Text processing: Word inflection | Text processing: Pattern Matching | Text processing: X-grams | Text processing: Spelling correction | Text processing: WordNet | Text processing: stopwords | Text processing: statistics | Annotation: Tagger | Annotation: NER | Annotation: Sentiment analysis | ML: Classification | ML: Clustering | ML: Topic Modelling | ML: Vectorization (including embeddings) | visualization | Multilanguage: Translation | Multilanguage: Language Identification |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TextBlob | NLTK-tokenizers | based on `pattern` | singularize, pluralize, lemmatize | based on `pattern` | integration | Word and phrase frequencies | 1) POS based on `pattern`; 2) POS based on NLTK's TreeBank tagger; 3) NP based on Shlomi Babluki's implementation; 4) NP based on a tagger trained on the CoNLL 2000 corpus | PatternAnalyzer (based on `pattern`); NaiveBayesAnalyzer (an NLTK classifier trained on a movie reviews corpus) | Naive Bayes, Decision Tree | powered by the Google Translate API | powered by the Google Translate API | |||||||||||
textacy | ||||||||||||||||||||||
pattern | contains APIs (Google, Gmail, Bing, Twitter, Facebook, Wikipedia, Wiktionary, DBPedia, Flickr, ...), a robust HTML DOM parser and a web crawler | yes | by POS-tags | POS (NN, VB, JJ, DT); Chunks (NP) | Naive Bayes, Perceptron, k-NN, SVM | k-means, hierarchical | LSA | tf, df, idf, tf-idf, cosine similarity, infogain | graph.js on canvas | |||||||||||||
pymorphy2 | for Russian: singularize, pluralize, lemmatize | for Russian: morphology | ||||||||||||||||||||
PyNLPl | ||||||||||||||||||||||
glove | glove | |||||||||||||||||||||
MITIE | tokenizer | "bunch of different types of binary relation detector" | yes | yes | pretrained word_feature_extractor | |||||||||||||||||
gensim | tf, tf-idf, word2vec | |||||||||||||||||||||
NLTK | n-grams | |||||||||||||||||||||
stopwords | ||||||||||||||||||||||
colibri-core | n-grams, skipgrams, flexgrams | |||||||||||||||||||||
spaCy | Non-destructive tokenization; syntax-driven sentence segmentation | "fast and accurate syntactic dependency parser" | Rule-based matching | English and German tagging models with rule-based morphology | > 10 built-in types; stand-off format and token tags training | |||||||||||||||||
fastText | yes | skipgram, cbow | ||||||||||||||||||||
SyntaxNet | tokenizer | "transition-based dependency parser" | POS | |||||||||||||||||||
langid | pre-trained for 97 languages | |||||||||||||||||||||
CoreNLP | tokenizer | yes | "multi-pass sieve coreference resolution" | lemmatize | Pattern-based entity extraction | POS | NER with "CRF sequence models"; "Open information extraction" | |||||||||||||||
bllip-parser | "8 known unified parsing models", including models for web, news, PubMed texts | |||||||||||||||||||||
MBSP | Regex-based segmentation; regex-based tokenization | MBLEM-based lemmatization | POS (NN, JJ, VB); Chunks (NP, VP); Relations (SBJ, OBJ) |