distance metrics, better TextDoc/TextCorpus, and most discriminating terms

@bdewilde released this 14 Jul 18:19

Changes:

  • New features for TextDoc and TextCorpus classes
    • added .save() methods and .load() classmethods, which allow for fast serialization of parsed documents/corpora and associated metadata to/from disk — with an important caveat: if the spacy.Vocab object used to serialize is not the same as the one used to deserialize, there will be problems, making this format useful for short-term but not long-term storage
    • TextCorpus may now be instantiated with an already-loaded spaCy pipeline, which may or may not have all models loaded; it can still be instantiated using a language code string ('en', 'de') to load a spaCy pipeline that includes all models by default
    • TextDoc methods wrapping extract and keyterms functions now have full documentation rather than forwarding users to the wrapped functions themselves; more irritating on the dev side, but much less irritating on the user side :)
  • Added a distance.py module containing several document, set, and string distance metrics
    • word movers: document distance as the Word Mover's Distance between individual words represented by word2vec vectors, normalized
    • "word2vec": token, span, or document distance as cosine distance between (average) word2vec representations, normalized
    • jaccard: string or set(string) distance as 1 - (intersection / union), normalized, with optional fuzzy matching across set members
    • hamming: distance between two equal-length strings as the number of substitutions, optionally normalized
    • levenshtein: distance between two strings as the number of substitutions, deletions, and insertions, optionally normalized (and removed a redundant function from the still-orphaned math_utils.py module)
    • jaro-winkler: distance between two strings with variable prefix weighting, normalized
  • Added most_discriminating_terms() function to keyterms module to take a collection of documents split into two exclusive groups and compute the most discriminating terms for group1-and-not-group2 as well as group2-and-not-group1
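To make the string and set metrics concrete, here is a minimal pure-Python sketch of what three of them compute; these are illustrations of the definitions above, not textacy's actual implementations (which live in the new distance.py module):

```python
def hamming(s1, s2, normalize=False):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("hamming distance requires equal-length strings")
    dist = sum(c1 != c2 for c1, c2 in zip(s1, s2))
    return dist / len(s1) if (normalize and s1) else dist

def levenshtein(s1, s2, normalize=False):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn s1 into s2, via dynamic programming."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    dist = prev[-1]
    max_len = max(len(s1), len(s2))
    return dist / max_len if (normalize and max_len) else dist

def jaccard(set1, set2):
    """Set distance as 1 - (intersection / union); already in [0, 1]."""
    s1, s2 = set(set1), set(set2)
    if not s1 and not s2:
        return 0.0
    return 1.0 - len(s1 & s2) / len(s1 | s2)
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions plus one insertion), while `hamming("karolin", "kathrin")` is also 3 but only because the strings happen to be the same length.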
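The idea behind most_discriminating_terms() can be sketched with a toy scorer: rank terms by how much more often they occur in one group's documents than the other's. This uses a simple smoothed document-frequency ratio for illustration only; it is not textacy's actual scoring method, and the function name and signature here are stand-ins:

```python
from collections import Counter

def toy_discriminating_terms(group1_docs, group2_docs, top_n=3):
    """Toy illustration: each doc is a list of term strings. Terms are
    scored by the ratio of their (add-one smoothed) document frequencies
    across the two groups; high scores discriminate for group1, low
    scores for group2."""
    df1 = Counter(t for doc in group1_docs for t in set(doc))
    df2 = Counter(t for doc in group2_docs for t in set(doc))
    n1, n2 = len(group1_docs), len(group2_docs)
    terms = set(df1) | set(df2)
    score = {t: ((df1[t] + 1) / (n1 + 1)) / ((df2[t] + 1) / (n2 + 1))
             for t in terms}
    group1_terms = sorted(terms, key=lambda t: score[t], reverse=True)[:top_n]
    group2_terms = sorted(terms, key=lambda t: score[t])[:top_n]
    return group1_terms, group2_terms
```

Feeding it two small groups of tokenized documents returns the terms most characteristic of group1-and-not-group2 first, then the reverse.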

Bugfixes:

  • fixed variable name error in docs usage example (thanks to @licyeus, PR #23)