
Releases: chartbeat-labs/textacy

Custom pipelines and fewer dependencies

15 Nov 19:11

Changes:

  • Preliminary inclusion of custom spaCy pipelines (see the sketch after this list)
    • updated load_spacy() to include explicit path and create_pipeline kwargs, and removed the already-deprecated load_spacy_pipeline() function to avoid confusion around spaCy languages and pipelines
    • added spacy_pipelines module to hold implementations of custom spaCy pipelines, including a basic one that merges entities into single tokens
    • note: this necessarily bumped the minimum spaCy version to 1.1.0+ (see the spaCy announcement)
  • To reduce code bloat, made the matplotlib dependency optional and dropped the gensim dependency
    • to install matplotlib at the same time as textacy, do $ pip install textacy[viz]
    • bonus: backports.csv is now only installed for Py2 users
    • thanks to @mbatchkarov for the request
  • Improved performance of textacy.corpora.WikiReader().texts(); results should stream faster and have cleaner plaintext content than when they were produced by gensim
    • this should also fix a bug reported in Issue #51 by @baisk
  • Added a Corpus.vectors property that returns a matrix of shape (# documents, vector dim) containing the average word2vec-style vector representation of constituent tokens for all Docs
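
For orientation, here's a minimal sketch of how these pieces fit together. The merge_entities_pipeline factory name and the Corpus texts kwarg are assumptions for illustration, not documented API:

```python
import textacy
from textacy import spacy_pipelines

# create_pipeline takes a factory for a custom spaCy pipeline, e.g. one
# that merges entities into single tokens (factory name assumed here)
spacy_lang = textacy.load_spacy(
    'en', create_pipeline=spacy_pipelines.merge_entities_pipeline)

# 'texts' kwarg name is an assumption for illustration
corpus = textacy.Corpus(spacy_lang, texts=['Burton visited London in 2016.'])

# (# documents, vector dim) matrix of averaged word2vec-style token vectors
print(corpus.vectors.shape)
```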

v0.3.1

19 Oct 19:51

After a couple months' hiatus working on a different project, I'm back on textacy. Since spaCy just released a new v1.0 — and I'd like to take advantage of some new features — this seemed like a good time to ensure compatibility as well as set minimum version requirements for other dependencies.

Changes:

  • Updated spaCy dependency to the latest v1.0.1; set a floor on other dependencies' versions to make sure everyone's running reasonably up-to-date code

Bugfixes:

  • Fixed incorrect kwarg in sgrank's call to extract.ngrams() (@patcollis34, issue #44)
  • Fixed import for cachetools' hashkey, which changed in v2.0 (@gramonov, issue #45)

Refactor for Consistency, Convenience, and Simplicity

23 Aug 13:26

After several months of somewhat organic development, textacy had acquired some rough edges in the API and inconsistencies throughout the code base. This release breaks the existing API in a few (mostly minor) ways for the sake of consistency, user convenience, and simplicity. It also adds some new functionality and enhances existing functionality for a better overall experience and, I hope, more straightforward development moving forward.

Changes:

  • Refactored and streamlined TextDoc; changed name to Doc (see the first sketch after this list)
    • simplified init params: lang can now be a language code string or an equivalent spacy.Language object, and content is either a string or spacy.Doc; param values and their interactions are better checked for errors and inconsistencies
    • renamed and improved methods transforming the Doc; for example, .as_bag_of_terms() is now .to_bag_of_terms(), and terms can be returned as integer ids (default) or as strings with absolute, relative, or binary frequencies as weights
    • added performant .to_bag_of_words() method, at the cost of less customizability of what gets included in the bag (no stopwords or punctuation); words can be returned as integer ids (default) or as strings with absolute, relative, or binary frequencies as weights
    • removed methods wrapping extract functions, in favor of simply calling that function on the Doc (see below for updates to extract functions to make this more convenient); for example, TextDoc.words() is now extract.words(Doc)
    • removed .term_counts() method, which was redundant with Doc.to_bag_of_terms()
    • renamed .term_count() => .count(), and checking + caching results is now smarter and faster
  • Refactored and streamlined TextCorpus; changed name to Corpus (see the second sketch after this list)
    • added init params: can now initialize a Corpus with a stream of texts, spacy or textacy Docs, and optional metadatas, analogous to Doc; accordingly, removed .from_texts() class method
    • refactored, streamlined, bug-fixed, and made consistent the process of adding, getting, and removing documents from Corpus
      • getting/removing by index is now equivalent to the built-in list API: Corpus[:5] gets the first 5 Docs, and del Corpus[:5] removes the first 5, automatically keeping track of corpus statistics for total # docs, sents, and tokens
      • getting/removing by boolean function is now done via the .get() and .remove() methods, the latter of which now also correctly tracks corpus stats
      • adding documents is split across the .add_text(), .add_texts(), and .add_doc() methods for performance and clarity reasons
    • added .word_freqs() and .word_doc_freqs() methods for getting a mapping of word (int id or string) to global weight (absolute, relative, binary, or inverse frequency); akin to a vectorized representation (see: textacy.vsm) but in non-vectorized form, which can be useful
    • removed .as_doc_term_matrix() method, which was just wrapping another function; so, instead of corpus.as_doc_term_matrix((doc.as_terms_list() for doc in corpus)), do textacy.vsm.doc_term_matrix((doc.to_terms_list(as_strings=True) for doc in corpus))
  • Updated several extract functions
    • almost all now accept either a textacy.Doc or spacy.Doc as input
    • renamed and improved parameters for filtering for or against certain POS or NE types; for example, good_pos_tags is now include_pos, which accepts either a single POS tag as a string or a set of POS tags to filter for; the same goes for exclude_pos, and analogously for include_types and exclude_types
  • Updated corpora classes for consistency and added flexibility
    • enforced a consistent API: .texts() for a stream of plain text documents and .records() for a stream of dicts containing both text and metadata
    • added filtering options for RedditReader, e.g. by date or subreddit, consistent with other corpora (similar tweaks to WikiReader may come later, but it's slightly more complicated...)
    • added a nicer repr for RedditReader and WikiReader corpora, consistent with other corpora
  • Moved vsm.py and network.py into the top-level of textacy and thus removed the representations subpackage
    • renamed vsm.build_doc_term_matrix() => vsm.doc_term_matrix(), because the "build" part of it is obvious
  • Renamed distance.py => similarity.py; all returned values are now similarity metrics in the interval [0, 1], where higher values indicate higher similarity
  • Renamed regexes_etc.py => constants.py, without additional changes
  • Renamed fileio.utils.split_content_and_metadata() => fileio.utils.split_record_fields(), without further changes (except for tweaks to the docstring)
  • Added functions to read and write delimited file formats: fileio.read_csv() and fileio.write_csv(), where the delimiter can be any valid one-char string; gzip/bzip/lzma compression is handled automatically when available
  • Added better and more consistent docstrings and usage examples throughout the code base
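
To make the new shape of the API concrete, here's a sketch of the renamed Doc methods alongside the standalone extract functions. Kwarg names like as_strings and weighting, and the Doc init argument order, are assumptions where these notes don't spell them out:

```python
import textacy
from textacy import extract

doc = textacy.Doc('The quick brown fox jumps over the lazy dog.', lang='en')

# extract functions are now called on the Doc directly (no wrapper methods)
nouns = list(extract.words(doc, include_pos={'NOUN'}))  # a set of POS tags...
verbs = list(extract.words(doc, include_pos='VERB'))    # ...or a single tag

# bag-of-words: integer ids by default, or strings with a chosen weighting
bow = doc.to_bag_of_words(as_strings=True, weighting='freq')

n_fox = doc.count('fox')  # renamed from .term_count(), now smarter and faster
```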
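And a companion sketch of the list-like Corpus behavior plus the renamed doc_term_matrix function; the final call is quoted from the notes above, while the Corpus init kwargs, the .get() signature, the doc.n_tokens attribute, and the two-value return are assumptions:

```python
import textacy
import textacy.vsm

corpus = textacy.Corpus(
    'en', texts=['First document here.', 'Second document here.', 'Third.'])

first_two = corpus[:2]  # get Docs by index/slice, like a built-in list
del corpus[:1]          # remove Docs; n_docs/n_sents/n_tokens stay in sync

# get Docs matching a boolean function ('limit' kwarg is an assumption)
short_docs = corpus.get(lambda doc: doc.n_tokens < 100, limit=5)

doc_term_matrix, id2term = textacy.vsm.doc_term_matrix(
    (doc.to_terms_list(as_strings=True) for doc in corpus))
```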

New Corpora, Compressed FileIO

03 Aug 17:03

Changes:

  • Added two new corpora! (see the first sketch after this list)
    • the CapitolWords corpus: a collection of 11k speeches (~7M tokens) given by the main protagonists of the 2016 U.S. Presidential election who had previously served in the U.S. Congress — including Hillary Clinton, Bernie Sanders, Barack Obama, Ted Cruz, and John Kasich — from January 1996 through June 2016
    • the SupremeCourt corpus: a collection of 8.4k court cases (~71M tokens) decided by the U.S. Supreme Court from 1946 through 2016, with metadata on subject matter categories, ideology, and voting patterns
    • DEPRECATED: the Bernie and Hillary corpus, which is a small subset of CapitolWords that can be easily recreated by filtering CapitolWords by speaker_name={'Bernie Sanders', 'Hillary Clinton'}
  • Refactored and improved fileio subpackage (see the second sketch after this list)
    • moved shared (read/write) functions into separate fileio.utils module
    • almost all read/write functions now use fileio.utils.open_sesame(), enabling seamless fileio for uncompressed or gzip, bz2, and lzma compressed files; relative/user-home-based paths; and missing intermediate directories. NOTE: certain file mode / compression pairs simply don't work (this is Python's fault), so users may run into exceptions; in Python 3, you'll almost always want to use text mode ('wt' or 'rt'), but in Python 2, users can't read or write compressed files in text mode, only binary mode ('wb' or 'rb')
    • added options for writing json files (matching stdlib's json.dump()) that can help save space
    • fileio.utils.get_filenames() now matches for/against a regex pattern rather than just a contained substring; using the old params will now raise a deprecation warning
    • BREAKING: fileio.utils.split_content_and_metadata() now has itemwise=False by default, rather than itemwise=True, which means that splitting multi-document streams of content and metadata into parallel iterators is now the default action
    • added compression param to TextCorpus.save() and .load() to optionally write metadata json file in compressed form
    • moved fileio.write_conll() functionality to export.doc_to_conll(), which converts a spaCy doc into a CoNLL-U formatted string; writing that string to disk now requires a separate call to fileio.write_file()
  • Cleaned up deprecated/bad Py2/3 compat imports, and added better functionality for Py2/3 strings
    • compat.unicode_type is now used for text data, compat.bytes_type for binary data, and compat.string_types for when either will do
    • also added compat.unicode_to_bytes() and compat.bytes_to_unicode() functions, for converting between string types
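
Here's a rough sketch of fetching and filtering the new corpora, using the speaker_name filter from the deprecation note above; the import path, constructor, and .texts() kwargs are assumptions:

```python
from textacy.corpora import CapitolWords  # import path is an assumption

cw = CapitolWords()

# recreate the deprecated Bernie & Hillary corpus by filtering on speakers
# ('limit' kwarg is an assumption)
for text in cw.texts(
        speaker_name={'Bernie Sanders', 'Hillary Clinton'}, limit=10):
    print(text[:100])
```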
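And a sketch of the file mode caveat in practice; the call pattern is illustrative, not documented API:

```python
import sys
from textacy import fileio

# compression (gzip/bz2/lzma) is inferred from the path, '~' is expanded,
# and missing intermediate directories get created automatically
mode = 'wt' if sys.version_info[0] == 3 else 'wb'  # Py2: binary mode only
text = u'one document per line\n'
with fileio.utils.open_sesame('~/textacy_data/docs.txt.gz', mode=mode) as f:
    f.write(text if 't' in mode else text.encode('utf-8'))
```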

Bugfixes:

  • Fixed document(s) removal from TextCorpus objects, including correct decrementing of .n_docs, .n_sents, and .n_tokens attributes (@michelleful #29)
  • Fixed OSError being incorrectly raised in fileio.open_sesame() on missing files
  • lang parameter in TextDoc and TextCorpus can now be unicode or bytes; the previous type handling was bug-like

oops

15 Jul 00:02

Bugfixes:

  • Added (missing) pyemd and python-levenshtein dependencies to requirements and setup files
  • Fixed bug in data.load_depechemood() arising from the Py2 csv module's inability to take unicode as input (thanks to @robclewley, issue #25)

distance metrics, better TextDoc/TextCorpus, and most discriminating terms

14 Jul 18:19

Changes:

  • New features for TextDoc and TextCorpus classes
    • added .save() methods and .load() classmethods, which allow for fast serialization of parsed documents/corpora and associated metadata to/from disk — with an important caveat: if the spacy.Vocab object used to serialize is not the same as the one used to deserialize, there will be problems, making this format useful for short-term but not long-term storage
    • TextCorpus may now be instantiated with an already-loaded spaCy pipeline, which may or may not have all models loaded; it can still be instantiated using a language code string ('en', 'de') to load a spaCy pipeline that includes all models by default
    • TextDoc methods wrapping extract and keyterms functions now have full documentation rather than forwarding users to the wrapped functions themselves; more irritating on the dev side, but much less irritating on the user side :)
  • Added a distance.py module containing several document, set, and string distance metrics (see the sketch after this list)
    • word movers: document distance as distance between individual words represented by word2vec vectors, normalized
    • "word2vec": token, span, or document distance as cosine distance between (average) word2vec representations, normalized
    • jaccard: string or set(string) distance as intersection / overlap, normalized, with optional fuzzy-matching across set members
    • hamming: distance between two strings as number of substitutions, optionally normalized
    • levenshtein: distance between two strings as number of substitutions, deletions, and insertions, optionally normalized (and removed a redundant function from the still-orphaned math_utils.py module)
    • jaro-winkler: distance between two strings with variable prefix weighting, normalized
  • Added most_discriminating_terms() function to keyterms module to take a collection of documents split into two exclusive groups and compute the most discriminating terms for group1-and-not-group2 as well as group2-and-not-group1
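
A sketch of the new metrics in use, assuming the distance module exposes one function per metric listed above; all signatures here, including the inputs to most_discriminating_terms(), are assumptions:

```python
from textacy import distance, keyterms

# string distances, optionally normalized to [0, 1]
d_lev = distance.levenshtein('colour', 'color', normalize=True)
d_ham = distance.hamming('karolin', 'kathrin', normalize=True)

# set distance as intersection / overlap, with optional fuzzy matching
d_jac = distance.jaccard(['draw', 'a', 'sword'], ['draw', 'the', 'sword'])

# terms that best separate two exclusive groups of tokenized docs
top_terms = keyterms.most_discriminating_terms(
    [['tax', 'cuts'], ['health', 'care']], [True, False])
```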

Bugfixes:

  • fixed variable name error in docs usage example (thanks to @licyeus, PR #23)

Corpora Readers, Better Examples, and Fewer Bugs

20 Jun 16:54

Changes:

  • Added corpora.RedditReader() class for streaming Reddit comments from disk, with a .texts() method for a stream of plaintext comments and a .comments() method for a stream of structured comments as dicts; both support basic filtering by text length and limiting the number of comments returned (see the sketch after this list)
  • Refactored functions for streaming Wikipedia articles from disk into a corpora.WikiReader() class, with a .texts() method for a stream of plaintext articles and a .pages() method for a stream of structured pages as dicts; both support basic filtering by text length and limiting the number of pages returned
  • Updated README and docs with a more comprehensive — and correct — usage example; also added tests to ensure it doesn't get stale
  • Updated requirements to latest version of spaCy, as well as added matplotlib for viz
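
A sketch of the reader API described above; the constructor argument, the filter kwarg names (min_len, limit), and the dict fields are assumptions:

```python
from textacy.corpora import RedditReader

reader = RedditReader('/path/to/RC_2016-01.bz2')  # raw dump path, assumed

# stream plaintext comments, filtered by length and capped in number
for text in reader.texts(min_len=100, limit=50):
    print(text[:80])

# stream structured comments as dicts
for comment in reader.comments(limit=10):
    print(comment['subreddit'], comment['body'])  # field names assumed
```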

Bugfixes:

  • textacy.preprocess.preprocess_text() is now, once again, imported at the top level, so easily reachable via textacy.preprocess_text() (@bretdabaker #14)
  • viz subpackage now included in the docs' API reference
  • missing dependencies added into setup.py so pip install handles everything for folks

#DataViz #FeelTheBern #GermanNLP

05 May 20:11

0.2.2 (2016-05-05)

Changes:

  • Added a viz subpackage, with two types of plots (so far):
    • viz.draw_termite_plot(), typically used to evaluate and interpret topic models; conveniently accessible from the tm.TopicModel class
    • viz.draw_semantic_network() for visualizing networks such as those output by representations.network
  • Added a "Bernie & Hillary" corpus with 3000 congressional speeches made by Bernie Sanders and Hillary Clinton since 1996
    • the corpora.fetch_bernie_and_hillary() function automatically downloads this corpus to disk and loads it from there (see the sketch after this list)
  • Modified the data.load_depechemood() function, which now downloads data from the GitHub source if not found on disk
  • Removed resources/ directory from GitHub, hence all the downloadin'
  • Updated to spaCy v0.100.7
    • German is now supported! (although some functionality is English-only)
    • added textacy.load_spacy() function for loading spaCy packages, taking advantage of the new spacy.load() API; added a DeprecationWarning for textacy.data.load_spacy_pipeline()
    • proper nouns' and pronouns' .pos_ attributes are now correctly assigned 'PROPN' and 'PRON'; hence, modified regexes_etc.POS_REGEX_PATTERNS['en'] to include 'PROPN'
    • modified spacy_utils.preserve_case() to check for language-agnostic 'PROPN' POS rather than English-specific 'NNP' and 'NNPS' tags
  • Added text_utils.clean_terms() function for cleaning up a sequence of single- or multi-word strings by stripping leading/trailing junk chars, handling dangling parens and odd hyphenation, etc.
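
A quick sketch combining the new corpus fetcher and spaCy loader from this release; the structure of the returned records (a 'text' field) is an assumption:

```python
import textacy
from textacy import corpora

# downloads the corpus from the GitHub source on first call, then loads
# it from disk on subsequent calls
bnh = corpora.fetch_bernie_and_hillary()

# load_spacy() wraps the new spacy.load() API; 'de' now works, too
spacy_lang = textacy.load_spacy('en')
doc = spacy_lang(bnh[0]['text'])  # 'text' field name is an assumption
```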

Bugfixes:

  • textstats.readability_stats() now correctly gets the number of words in a doc from its generator function (@gryBox #8)
  • removed NLTK dependency, which wasn't actually required
  • text_utils.detect_language() now warns via logging rather than a print() statement
  • fileio.write_conll() documentation now correctly indicates that the filename param is not optional

VSM and Topic Modeling

11 Apr 16:03

Changes:

  • Added representations subpackage; includes modules for network and vector space model (VSM) document and corpus representations
    • Document-term matrix creation now takes documents represented as a list of terms (rather than as spaCy Docs); splits the tokenization step from vectorization for added flexibility
    • Some of this functionality was refactored from existing parts of the package
  • Added tm (topic modeling) subpackage, with a main TopicModel class for training, applying, persisting, and interpreting NMF, LDA, and LSA topic models through a single interface (see the sketch after this list)
  • Various improvements to TextDoc and TextCorpus classes
    • TextDoc can now be initialized from a spaCy Doc
    • Removed caching from TextDoc, because it was a pain and weird and probably not all that useful
    • extract-based methods are now generators, like the functions they wrap
    • Added .as_semantic_network() and .as_terms_list() methods to TextDoc
    • TextCorpus.from_texts() now takes advantage of multithreading via spaCy, if available, and document metadata can be passed in as a paired iterable of dicts
  • Added read/write functions for sparse scipy matrices
  • Added fileio.read.split_content_and_metadata() convenience function for splitting (text) content from associated metadata when reading data from disk into a TextDoc or TextCorpus
  • Renamed fileio.read.get_filenames_in_dir() to fileio.read.get_filenames() and added functionality for matching/ignoring files by their names, file extensions, and ignoring invisible files
  • Rewrote export.docs_to_gensim(), now significantly faster
  • Imports in __init__.py files for main and subpackages now explicit
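
A sketch of the TopicModel workflow on top of the new representations.vsm module; the init args, the return values of build_doc_term_matrix(), and the method names other than the class itself are assumptions:

```python
from textacy import tm
from textacy.representations import vsm  # subpackage layout per this release

# documents are passed as lists of terms, i.e. tokenization happens first
terms_lists = [['cat', 'dog'], ['dog', 'bird'], ['bird', 'cat']]
doc_term_matrix, id2term = vsm.build_doc_term_matrix(terms_lists)

model = tm.TopicModel('nmf', n_topics=2)  # also 'lda' or 'lsa'
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)
```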

Bugfixes:

  • textstats.readability_stats() no longer filters out stop words (@henningko #7)
  • Wikipedia article processing now recursively removes nested markup
  • extract.ngrams() now filters out ngrams with any space-only tokens
  • functions with include_nps kwarg changed to include_ncs, to match the renaming of the associated function from extract.noun_phrases() to extract.noun_chunks()