Releases: chartbeat-labs/textacy
Custom pipelines and fewer dependencies
Changes:
- Preliminary inclusion of custom spaCy pipelines
  - updated `load_spacy()` to include explicit `path` and `create_pipeline` kwargs, and removed the already-deprecated `load_spacy_pipeline()` function to avoid confusion around spaCy languages and pipelines (see the sketch at the end of this list)
  - added a `spacy_pipelines` module to hold implementations of custom spaCy pipelines, including a basic one that merges entities into single tokens
  - note: this necessarily bumped the minimum spaCy version to 1.1.0+ (see the announcement here)
- To reduce code bloat, made the `matplotlib` dependency optional and dropped the `gensim` dependency
  - to install `matplotlib` at the same time as textacy, do `$ pip install textacy[viz]`
  - bonus: `backports.csv` is now only installed for Py2 users; thanks to @mbatchkarov for the request
- Improved performance of `textacy.corpora.WikiReader().texts()`; results should stream faster and have cleaner plaintext content than when they were produced by `gensim`
- Added a `Corpus.vectors` property that returns a matrix of shape (# documents, vector dim) containing the average word2vec-style vector representation of constituent tokens for all `Doc`s
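A minimal sketch of how these additions fit together, assuming the `Corpus` API introduced in v0.3.0 (below) and a spaCy model that includes word vectors; the texts are placeholders:

```python
import textacy

# load a spaCy pipeline by language code; `path` and `create_pipeline` are the
# new optional kwargs on load_spacy(), omitted here for brevity
nlp = textacy.load_spacy('en')

# the new Corpus.vectors property averages word2vec-style token vectors per Doc
# (only meaningful if the loaded spaCy model includes word vectors)
corpus = textacy.Corpus(nlp, texts=['Burton visited the zoo.', 'The exhibit was closed.'])
print(corpus.vectors.shape)  # -> (# documents, vector dim)
```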
v0.3.1
After a couple months' hiatus working on a different project, I'm back on textacy. Since spaCy just released a new v1.0 — and I'd like to take advantage of some new features — this seemed like a good time to ensure compatibility as well as set minimum version requirements for other dependencies.
Changes:
- Updated spaCy dependency to the latest v1.0.1; set a floor on other dependencies' versions to make sure everyone's running reasonably up-to-date code
Bugfixes:
- Fixed incorrect kwarg in `sgrank`'s call to `extract.ngrams()` (@patcollis34, issue #44)
- Fixed import for `cachetools`' `hashkey`, which changed in v2.0 (@gramonov, issue #45)
Refactor for Consistency, Convenience, and Simplicity
After several months of somewhat organic development, textacy had acquired some rough edges in the API and inconsistencies throughout the code base. This release breaks the existing API in a few (mostly minor) ways for the sake of consistency, user convenience, and simplicity. It also adds some new functionality and enhances existing functionality for a better overall experience and, I hope, more straightforward development moving forward.
Changes:
- Refactored and streamlined `TextDoc`; changed name to `Doc` (a usage sketch covering these changes follows this list)
  - simplified init params: `lang` can now be a language code string or an equivalent `spacy.Language` object, and `content` is either a string or `spacy.Doc`; param values and their interactions are better checked for errors and inconsistencies
  - renamed and improved methods transforming the Doc; for example, `.as_bag_of_terms()` is now `.to_bag_of_terms()`, and terms can be returned as integer ids (default) or as strings, with absolute, relative, or binary frequencies as weights
  - added a performant `.to_bag_of_words()` method, at the cost of less customizability of what gets included in the bag (no stopwords or punctuation); words can be returned as integer ids (default) or as strings, with absolute, relative, or binary frequencies as weights
  - removed methods wrapping `extract` functions, in favor of simply calling that function on the Doc (see below for updates to `extract` functions that make this more convenient); for example, `TextDoc.words()` is now `extract.words(Doc)`
  - removed the `.term_counts()` method, which was redundant with `Doc.to_bag_of_terms()`
  - renamed `.term_count()` => `.count()`, and checking + caching results is now smarter and faster
- Refactored and streamlined `TextCorpus`; changed name to `Corpus`
  - added init params: can now initialize a `Corpus` with a stream of texts, spacy or textacy Docs, and optional metadatas, analogous to `Doc`; accordingly, removed the `.from_texts()` class method
  - refactored, streamlined, bug-fixed, and made consistent the process of adding, getting, and removing documents from a `Corpus`
    - getting/removing by index is now equivalent to the built-in `list` API: `Corpus[:5]` gets the first 5 `Doc`s, and `del Corpus[:5]` removes the first 5, automatically keeping track of corpus statistics for total # docs, sents, and tokens
    - getting/removing by boolean function is now done via the `.get()` and `.remove()` methods, the latter of which now also correctly tracks corpus stats
    - adding documents is split across the `.add_text()`, `.add_texts()`, and `.add_doc()` methods for performance and clarity reasons
  - added `.word_freqs()` and `.word_doc_freqs()` methods for getting a mapping of word (int id or string) to global weight (absolute, relative, binary, or inverse frequency); akin to a vectorized representation (see: `textacy.vsm`) but in non-vectorized form, which can be useful
  - removed the `.as_doc_term_matrix()` method, which was just wrapping another function; so, instead of `corpus.as_doc_term_matrix((doc.as_terms_list() for doc in corpus))`, do `textacy.vsm.doc_term_matrix((doc.to_terms_list(as_strings=True) for doc in corpus))`
- Updated several `extract` functions
  - almost all now accept either a `textacy.Doc` or `spacy.Doc` as input
  - renamed and improved parameters for filtering for or against certain POS or NE types; for example, `good_pos_tags` is now `include_pos`, and will accept either a single POS tag as a string or a set of POS tags to filter for; the same goes for `exclude_pos`, and analogously `include_types` and `exclude_types`
- Updated corpora classes for consistency and added flexibility
  - enforced a consistent API: `.texts()` for a stream of plain text documents and `.records()` for a stream of dicts containing both text and metadata
  - added filtering options for `RedditReader`, e.g. by date or subreddit, consistent with other corpora (similar tweaks to `WikiReader` may come later, but it's slightly more complicated...)
  - added a nicer `repr` for `RedditReader` and `WikiReader` corpora, consistent with other corpora
- Moved `vsm.py` and `network.py` into the top level of `textacy` and thus removed the `representations` subpackage
  - renamed `vsm.build_doc_term_matrix()` => `vsm.doc_term_matrix()`, because the "build" part of it is obvious
- Renamed `distance.py` => `similarity.py`; all returned values are now similarity metrics in the interval [0, 1], where higher values indicate higher similarity
- Renamed `regexes_etc.py` => `constants.py`, without additional changes
- Renamed `fileio.utils.split_content_and_metadata()` => `fileio.utils.split_record_fields()`, without further changes (except for tweaks to the docstring)
- Added functions to read and write delimited file formats: `fileio.read_csv()` and `fileio.write_csv()`, where the delimiter can be any valid one-char string; gzip/bzip/lzma compression is handled automatically when available
- Added better and more consistent docstrings and usage examples throughout the code base
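Taken together, the refactored `Doc`, `Corpus`, `extract`, and `vsm` APIs look roughly like the sketch below; kwarg names given in the notes above (`as_strings`, `include_pos`) are used as described, while anything else (e.g. `weighting='freq'`) is an assumption for illustration:

```python
import textacy
from textacy import extract, vsm

# Doc: init from a content string plus a language code (or a spacy.Language object)
doc = textacy.Doc('Burton lives in Berlin and works on textacy.', lang='en')

# bag-of-terms / bag-of-words; terms as strings, relative frequencies as weights
bot = doc.to_bag_of_terms(as_strings=True, weighting='freq')  # 'weighting' kwarg name is assumed
bow = doc.to_bag_of_words(as_strings=True)

# extract functions are called directly on the Doc instead of via wrapper methods
nouns = list(extract.words(doc, include_pos={'NOUN', 'PROPN'}))

# Corpus: list-like getting/removing, plus .get() / .remove() by boolean function
corpus = textacy.Corpus('en', texts=['First document.', 'Second document.', 'Third document.'])
first_two = corpus[:2]
short_docs = list(corpus.get(lambda d: d.n_tokens < 10))

# the document-term matrix builder now lives at textacy.vsm.doc_term_matrix()
dtm, id2term = vsm.doc_term_matrix(
    doc.to_terms_list(as_strings=True) for doc in corpus)
```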
New Corpora, Compressed FileIO
Changes:
- Added two new corpora!
  - the CapitolWords corpus: a collection of 11k speeches (~7M tokens) given by the main protagonists of the 2016 U.S. Presidential election who had previously served in the U.S. Congress — including Hillary Clinton, Bernie Sanders, Barack Obama, Ted Cruz, and John Kasich — from January 1996 through June 2016
  - the SupremeCourt corpus: a collection of 8.4k court cases (~71M tokens) decided by the U.S. Supreme Court from 1946 through 2016, with metadata on subject matter categories, ideology, and voting patterns
  - DEPRECATED: the Bernie and Hillary corpus, which is a small subset of CapitolWords that can be easily recreated by filtering CapitolWords by `speaker_name={'Bernie Sanders', 'Hillary Clinton'}`
- Refactored and improved the `fileio` subpackage (see the sketch at the end of this list)
  - moved shared (read/write) functions into a separate `fileio.utils` module
  - almost all read/write functions now use `fileio.utils.open_sesame()`, enabling seamless fileio for uncompressed or gzip, bz2, and lzma compressed files; relative/user-home-based paths; and missing intermediate directories. NOTE: certain file mode / compression pairs simply don't work (this is Python's fault), so users may run into exceptions; in Python 3, you'll almost always want to use text mode ('wt' or 'rt'), but in Python 2, users can't read or write compressed files in text mode, only binary mode ('wb' or 'rb')
  - added options for writing json files (matching stdlib's `json.dump()`) that can help save space
  - `fileio.utils.get_filenames()` now matches for/against a regex pattern rather than just a contained substring; using the old params will now raise a deprecation warning
  - BREAKING: `fileio.utils.split_content_and_metadata()` now has `itemwise=False` by default, rather than `itemwise=True`, which means that splitting multi-document streams of content and metadata into parallel iterators is now the default action
  - added a `compression` param to `TextCorpus.save()` and `.load()` to optionally write the metadata json file in compressed form
  - moved `fileio.write_conll()` functionality to `export.doc_to_conll()`, which converts a spaCy doc into a CoNLL-U formatted string; writing that string to disk would require a separate call to `fileio.write_file()`
- Cleaned up deprecated/bad Py2/3 `compat` imports, and added better functionality for Py2/3 strings
  - now `compat.unicode_type` used for text data, `compat.bytes_type` for binary data, and `compat.string_types` for when either will do
  - also added `compat.unicode_to_bytes()` and `compat.bytes_to_unicode()` functions, for converting between string types
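A short sketch of the reworked `fileio` and `compat` helpers; the file path is a placeholder, and compression is inferred from the file extension as described above:

```python
from textacy import compat, fileio

# open_sesame() transparently handles gzip/bz2/lzma by file extension, expands
# '~' and relative paths, and creates missing intermediate directories;
# on Python 3, text mode ('wt'/'rt') is almost always what you want
with fileio.open_sesame('~/textacy_data/notes.txt.gz', mode='wt') as f:
    f.write('This line is gzip-compressed on write.\n')

# Py2/3 string helpers for converting between text and bytes
as_bytes = compat.unicode_to_bytes('sömé tëxt')
as_text = compat.bytes_to_unicode(as_bytes)
```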
Bugfixes:
- Fixed document(s) removal from `TextCorpus` objects, including correct decrementing of `.n_docs`, `.n_sents`, and `.n_tokens` attributes (@michelleful #29)
- Fixed `OSError` being incorrectly raised in `fileio.open_sesame()` on missing files
- `lang` parameter in `TextDoc` and `TextCorpus` can now be unicode or bytes, which was bug-like
oops
distance metrics, better TextDoc/TextCorpus, and most discriminating terms
Changes:
- New features for `TextDoc` and `TextCorpus` classes
  - added `.save()` methods and `.load()` classmethods, which allow for fast serialization of parsed documents/corpora and associated metadata to/from disk — with an important caveat: if the `spacy.Vocab` object used to serialize and deserialize is not the same, there will be problems, making this format useful as short-term but not long-term storage
  - `TextCorpus` may now be instantiated with an already-loaded spaCy pipeline, which may or may not have all models loaded; it can still be instantiated using a language code string ('en', 'de') to load a spaCy pipeline that includes all models by default
  - `TextDoc` methods wrapping `extract` and `keyterms` functions now have full documentation rather than forwarding users to the wrapped functions themselves; more irritating on the dev side, but much less irritating on the user side :)
- Added a `distance.py` module containing several document, set, and string distance metrics (see the sketch after this list)
  - word movers: document distance as the distance between individual words represented by word2vec vectors, normalized
  - "word2vec": token, span, or document distance as cosine distance between (average) word2vec representations, normalized
  - jaccard: string or set(string) distance as intersection / overlap, normalized, with optional fuzzy-matching across set members
  - hamming: distance between two strings as the number of substitutions, optionally normalized
  - levenshtein: distance between two strings as the number of substitutions, deletions, and insertions, optionally normalized (and removed a redundant function from the still-orphaned `math_utils.py` module)
  - jaro-winkler: distance between two strings with variable prefix weighting, normalized
- Added a `most_discriminating_terms()` function to the `keyterms` module to take a collection of documents split into two exclusive groups and compute the most discriminating terms for group1-and-not-group2 as well as group2-and-not-group1
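A hedged sketch of the new string/set metrics; function names are assumed to match the bullets above (e.g. `distance.levenshtein`), and exact signatures may differ:

```python
from textacy import distance

# string distances; values may be normalized to [0, 1] depending on kwargs
print(distance.levenshtein('speling', 'spelling'))  # substitutions + deletions + insertions
print(distance.hamming('karolin', 'kathrin'))       # substitutions between equal-length strings
print(distance.jaro_winkler('martha', 'marhta'))    # prefix-weighted string distance

# set distance as intersection / overlap
print(distance.jaccard(['nlp', 'text', 'corpus'], ['nlp', 'corpus']))
```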
Bugfixes:
Corpora Readers, Better Examples, and Fewer Bugs
Changes:
- Added a `corpora.RedditReader()` class for streaming Reddit comments from disk, with a `.texts()` method for a stream of plaintext comments and a `.comments()` method for a stream of structured comments as dicts, with basic filtering by text length and limiting the number of comments returned (see the sketch after this list)
- Refactored functions for streaming Wikipedia articles from disk into a `corpora.WikiReader()` class, with a `.texts()` method for a stream of plaintext articles and a `.pages()` method for a stream of structured pages as dicts, with basic filtering by text length and limiting the number of pages returned
- Updated README and docs with a more comprehensive — and correct — usage example; also added tests to ensure it doesn't get stale
- Updated requirements to the latest version of spaCy, as well as added matplotlib for `viz`
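A sketch of the two readers; the dump-file paths are placeholders, and the filtering kwargs (`min_len`, `limit`) are assumptions based on the description above:

```python
from textacy import corpora

# stream plaintext Wikipedia articles from a downloaded database dump
wiki = corpora.WikiReader('path/to/enwiki-latest-pages-articles.xml.bz2')
for text in wiki.texts(min_len=300, limit=5):  # kwarg names are assumptions
    print(text[:100])

# stream structured Reddit comments as dicts from a downloaded comments dump
reddit = corpora.RedditReader('path/to/RC_2015-01.bz2')
for comment in reddit.comments(limit=5):
    print(sorted(comment.keys()))
```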
Bugfixes:
- `textacy.preprocess.preprocess_text()` is now, once again, imported at the top level, so easily reachable via `textacy.preprocess_text()` (@bretdabaker #14)
- `viz` subpackage now included in the docs' API reference
- missing dependencies added into `setup.py` so pip install handles everything for folks
#DataViz #FeelTheBern #GermanNLP
0.2.2 (2016-05-05)
Changes:
- Added a `viz` subpackage, with two types of plots (so far):
  - `viz.draw_termite_plot()`, typically used to evaluate and interpret topic models; conveniently accessible from the `tm.TopicModel` class
  - `viz.draw_semantic_network()` for visualizing networks such as those output by `representations.network`
- Added a "Bernie & Hillary" corpus with 3000 congressional speeches made by Bernie Sanders and Hillary Clinton since 1996
  - the `corpora.fetch_bernie_and_hillary()` function automatically downloads to and loads from disk this corpus
- Modified `data.load_depechemood` function, now downloads data from GitHub source if not found on disk
- Removed `resources/` directory from GitHub, hence all the downloadin'
- Updated to spaCy v0.100.7
  - German is now supported! although some functionality is English-only
  - added a `textacy.load_spacy()` function for loading spaCy packages, taking advantage of the new `spacy.load()` API; added a DeprecationWarning for `textacy.data.load_spacy_pipeline()` (see the sketch after this list)
  - proper nouns' and pronouns' `.pos_` attributes are now correctly assigned 'PROPN' and 'PRON'; hence, modified `regexes_etc.POS_REGEX_PATTERNS['en']` to include 'PROPN'
  - modified `spacy_utils.preserve_case()` to check for the language-agnostic 'PROPN' POS rather than the English-specific 'NNP' and 'NNPS' tags
- Added a `text_utils.clean_terms()` function for cleaning up a sequence of single- or multi-word strings by stripping leading/trailing junk chars, handling dangling parens and odd hyphenation, etc.
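A brief sketch of a few of these additions; the terms passed to `clean_terms()` are toy examples, and the corpus download happens automatically on first call as noted above:

```python
import textacy
from textacy import corpora, text_utils

# new-style spaCy loading; the old data.load_spacy_pipeline() now emits a DeprecationWarning
nlp = textacy.load_spacy('en')

# downloads (if needed) and loads the Bernie & Hillary speeches from disk
bnh = corpora.fetch_bernie_and_hillary()

# clean up a sequence of messy candidate terms
cleaned = list(text_utils.clean_terms([' -acid rain ', 'greenhouse gas)', 'carbon-  ']))
```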
Bugfixes:
- `textstats.readability_stats()` now correctly gets the number of words in a doc from its generator function (@gryBox #8)
- removed NLTK dependency, which wasn't actually required
- `text_utils.detect_language()` now warns via `logging` rather than a `print()` statement
- `fileio.write_conll()` documentation now correctly indicates that the filename param is not optional
VSM and Topic Modeling
Changes:
- Added `representations` subpackage; includes modules for network and vector space model (VSM) document and corpus representations
  - Document-term matrix creation now takes documents represented as a list of terms (rather than as spaCy Docs); splits the tokenization step from vectorization for added flexibility
  - Some of this functionality was refactored from existing parts of the package
- Added `tm` (topic modeling) subpackage, with a main `TopicModel` class for training, applying, persisting, and interpreting NMF, LDA, and LSA topic models through a single interface (see the sketch after this list)
- Various improvements to `TextDoc` and `TextCorpus` classes
  - `TextDoc` can now be initialized from a spaCy Doc
  - Removed caching from `TextDoc`, because it was a pain and weird and probably not all that useful
  - `extract`-based methods are now generators, like the functions they wrap
  - Added `.as_semantic_network()` and `.as_terms_list()` methods to `TextDoc`
  - `TextCorpus.from_texts()` now takes advantage of multithreading via spaCy, if available, and document metadata can be passed in as a paired iterable of dicts
- Added read/write functions for sparse scipy matrices
- Added `fileio.read.split_content_and_metadata()` convenience function for splitting (text) content from associated metadata when reading data from disk into a `TextDoc` or `TextCorpus`
- Renamed `fileio.read.get_filenames_in_dir()` to `fileio.read.get_filenames()` and added functionality for matching/ignoring files by their names, file extensions, and ignoring invisible files
- Rewrote `export.docs_to_gensim()`, now significantly faster
- Imports in `__init__.py` files for main and subpackages now explicit
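An end-to-end sketch of the new pieces, under a few stated assumptions: import paths are approximate, `build_doc_term_matrix()` is assumed to return the sparse matrix together with an id-to-term mapping, and kwargs like `weighting` and `n_topics` are illustrative names only:

```python
from textacy import TextCorpus
from textacy.representations import vsm
from textacy.tm import TopicModel

# parse a few toy documents into a TextCorpus
corpus = TextCorpus.from_texts('en', ['Cats chase mice.', 'Dogs chase cats.', 'Taxes fund roads.'])

# vectorize: documents as lists of terms, then a sparse doc-term matrix
terms_lists = (doc.as_terms_list() for doc in corpus)
doc_term_matrix, id2term = vsm.build_doc_term_matrix(terms_lists, weighting='tf')

# train and apply an NMF topic model through the single TopicModel interface
model = TopicModel('nmf', n_topics=2)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)
```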
Bugfixes:
- `textstats.readability_stats()` no longer filters out stop words (@henningko #7)
- Wikipedia article processing now recursively removes nested markup
- `extract.ngrams()` now filters out ngrams with any space-only tokens
- functions with `include_nps` kwarg changed to `include_ncs`, to match the renaming of the associated function from `extract.noun_phrases()` to `extract.noun_chunks()`