Releases: chartbeat-labs/textacy
Custom pipelines and fewer dependencies
Changes:
- Preliminary inclusion of custom spaCy pipelines
  - updated `load_spacy()` to include explicit `path` and `create_pipeline` kwargs, and removed the already-deprecated `load_spacy_pipeline()` function to avoid confusion around spaCy languages and pipelines (see the sketch at the end of this list)
  - added a `spacy_pipelines` module to hold implementations of custom spaCy pipelines, including a basic one that merges entities into single tokens
  - note: this necessarily bumped the minimum spaCy version to 1.1.0+ (see the announcement here)
- To reduce code bloat, made the `matplotlib` dependency optional and dropped the `gensim` dependency
  - to install `matplotlib` at the same time as textacy, do `$ pip install textacy[viz]`
  - bonus: `backports.csv` is now only installed for Py2 users; thanks to @mbatchkarov for the request
- Improved performance of `textacy.corpora.WikiReader().texts()`; results should stream faster and have cleaner plaintext content than when they were produced by `gensim`
- Added a `Corpus.vectors` property that returns a matrix of shape (# documents, vector dim) containing the average word2vec-style vector representation of constituent tokens for all `Doc`s
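A minimal sketch of how these additions fit together, assuming the `Corpus` API introduced in v0.3.0 (below) and a spaCy model that includes word vectors; the texts are placeholders:

```python
import textacy

# load a spaCy pipeline by language code; `path` and `create_pipeline` are the
# new optional kwargs on load_spacy(), omitted here for brevity
nlp = textacy.load_spacy('en')

# the new Corpus.vectors property averages word2vec-style token vectors per Doc
# (only meaningful if the loaded spaCy model includes word vectors)
corpus = textacy.Corpus(nlp, texts=['Burton visited the zoo.', 'The exhibit was closed.'])
print(corpus.vectors.shape)  # -> (# documents, vector dim)
```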
v0.3.1
After a couple months' hiatus working on a different project, I'm back on textacy. Since spaCy just released a new v1.0 — and I'd like to take advantage of some new features — this seemed like a good time to ensure compatibility as well as set minimum version requirements for other dependencies.
Changes:
- Updated spaCy dependency to the latest v1.0.1; set a floor on other dependencies' versions to make sure everyone's running reasonably up-to-date code
Bugfixes:
- Fixed incorrect kwarg in `sgrank`'s call to `extract.ngrams()` (@patcollis34, issue #44)
- Fixed import for `cachetools`' `hashkey`, which changed in v2.0 (@gramonov, issue #45)
Refactor for Consistency, Convenience, and Simplicity
After several months of somewhat organic development, textacy had acquired some rough edges in the API and inconsistencies throughout the code base. This release breaks the existing API in a few (mostly minor) ways for the sake of consistency, user convenience, and simplicity. It also adds some new functionality and enhances existing functionality for a better overall experience and, I hope, more straightforward development moving forward.
Changes:
- Refactored and streamlined `TextDoc`; changed name to `Doc` (a usage sketch covering these changes follows this list)
  - simplified init params: `lang` can now be a language code string or an equivalent `spacy.Language` object, and `content` is either a string or `spacy.Doc`; param values and their interactions are better checked for errors and inconsistencies
  - renamed and improved methods transforming the Doc; for example, `.as_bag_of_terms()` is now `.to_bag_of_terms()`, and terms can be returned as integer ids (default) or as strings, with absolute, relative, or binary frequencies as weights
  - added a performant `.to_bag_of_words()` method, at the cost of less customizability of what gets included in the bag (no stopwords or punctuation); words can be returned as integer ids (default) or as strings, with absolute, relative, or binary frequencies as weights
  - removed methods wrapping `extract` functions, in favor of simply calling that function on the Doc (see below for updates to `extract` functions that make this more convenient); for example, `TextDoc.words()` is now `extract.words(Doc)`
  - removed the `.term_counts()` method, which was redundant with `Doc.to_bag_of_terms()`
  - renamed `.term_count()` => `.count()`, and checking + caching results is now smarter and faster
- Refactored and streamlined `TextCorpus`; changed name to `Corpus`
  - added init params: can now initialize a `Corpus` with a stream of texts, spacy or textacy Docs, and optional metadatas, analogous to `Doc`; accordingly, removed the `.from_texts()` class method
  - refactored, streamlined, bug-fixed, and made consistent the process of adding, getting, and removing documents from a `Corpus`
    - getting/removing by index is now equivalent to the built-in `list` API: `Corpus[:5]` gets the first 5 `Doc`s, and `del Corpus[:5]` removes the first 5, automatically keeping track of corpus statistics for total # docs, sents, and tokens
    - getting/removing by boolean function is now done via the `.get()` and `.remove()` methods, the latter of which now also correctly tracks corpus stats
    - adding documents is split across the `.add_text()`, `.add_texts()`, and `.add_doc()` methods for performance and clarity reasons
  - added `.word_freqs()` and `.word_doc_freqs()` methods for getting a mapping of word (int id or string) to global weight (absolute, relative, binary, or inverse frequency); akin to a vectorized representation (see: `textacy.vsm`) but in non-vectorized form, which can be useful
  - removed the `.as_doc_term_matrix()` method, which was just wrapping another function; so, instead of `corpus.as_doc_term_matrix((doc.as_terms_list() for doc in corpus))`, do `textacy.vsm.doc_term_matrix((doc.to_terms_list(as_strings=True) for doc in corpus))`
- Updated several `extract` functions
  - almost all now accept either a `textacy.Doc` or `spacy.Doc` as input
  - renamed and improved parameters for filtering for or against certain POS or NE types; for example, `good_pos_tags` is now `include_pos`, and will accept either a single POS tag as a string or a set of POS tags to filter for; the same goes for `exclude_pos`, and analogously `include_types` and `exclude_types`
- Updated corpora classes for consistency and added flexibility
  - enforced a consistent API: `.texts()` for a stream of plain text documents and `.records()` for a stream of dicts containing both text and metadata
  - added filtering options for `RedditReader`, e.g. by date or subreddit, consistent with other corpora (similar tweaks to `WikiReader` may come later, but it's slightly more complicated...)
  - added a nicer `repr` for `RedditReader` and `WikiReader` corpora, consistent with other corpora
- Moved `vsm.py` and `network.py` into the top level of `textacy` and thus removed the `representations` subpackage
  - renamed `vsm.build_doc_term_matrix()` => `vsm.doc_term_matrix()`, because the "build" part of it is obvious
- Renamed `distance.py` => `similarity.py`; all returned values are now similarity metrics in the interval [0, 1], where higher values indicate higher similarity
- Renamed `regexes_etc.py` => `constants.py`, without additional changes
- Renamed `fileio.utils.split_content_and_metadata()` => `fileio.utils.split_record_fields()`, without further changes (except for tweaks to the docstring)
- Added functions to read and write delimited file formats: `fileio.read_csv()` and `fileio.write_csv()`, where the delimiter can be any valid one-char string; gzip/bzip/lzma compression is handled automatically when available
- Added better and more consistent docstrings and usage examples throughout the code base
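Taken together, the refactored `Doc`, `Corpus`, `extract`, and `vsm` APIs look roughly like the sketch below; kwarg names given in the notes above (`as_strings`, `include_pos`) are used as described, while anything else (e.g. `weighting='freq'`) is an assumption for illustration:

```python
import textacy
from textacy import extract, vsm

# Doc: init from a content string plus a language code (or a spacy.Language object)
doc = textacy.Doc('Burton lives in Berlin and works on textacy.', lang='en')

# bag-of-terms / bag-of-words; terms as strings, relative frequencies as weights
bot = doc.to_bag_of_terms(as_strings=True, weighting='freq')  # 'weighting' kwarg name is assumed
bow = doc.to_bag_of_words(as_strings=True)

# extract functions are called directly on the Doc instead of via wrapper methods
nouns = list(extract.words(doc, include_pos={'NOUN', 'PROPN'}))

# Corpus: list-like getting/removing, plus .get() / .remove() by boolean function
corpus = textacy.Corpus('en', texts=['First document.', 'Second document.', 'Third document.'])
first_two = corpus[:2]
short_docs = list(corpus.get(lambda d: d.n_tokens < 10))

# the document-term matrix builder now lives at textacy.vsm.doc_term_matrix()
dtm, id2term = vsm.doc_term_matrix(
    doc.to_terms_list(as_strings=True) for doc in corpus)
```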
New Corpora, Compressed FileIO
Changes:
- Added two new corpora!
  - the CapitolWords corpus: a collection of 11k speeches (~7M tokens) given by the main protagonists of the 2016 U.S. Presidential election who had previously served in the U.S. Congress — including Hillary Clinton, Bernie Sanders, Barack Obama, Ted Cruz, and John Kasich — from January 1996 through June 2016
  - the SupremeCourt corpus: a collection of 8.4k court cases (~71M tokens) decided by the U.S. Supreme Court from 1946 through 2016, with metadata on subject matter categories, ideology, and voting patterns
  - DEPRECATED: the Bernie and Hillary corpus, which is a small subset of CapitolWords that can be easily recreated by filtering CapitolWords by `speaker_name={'Bernie Sanders', 'Hillary Clinton'}`
- Refactored and improved the `fileio` subpackage (see the sketch at the end of this list)
  - moved shared (read/write) functions into a separate `fileio.utils` module
  - almost all read/write functions now use `fileio.utils.open_sesame()`, enabling seamless fileio for uncompressed or gzip, bz2, and lzma compressed files; relative/user-home-based paths; and missing intermediate directories. NOTE: certain file mode / compression pairs simply don't work (this is Python's fault), so users may run into exceptions; in Python 3, you'll almost always want to use text mode ('wt' or 'rt'), but in Python 2, users can't read or write compressed files in text mode, only binary mode ('wb' or 'rb')
  - added options for writing json files (matching stdlib's `json.dump()`) that can help save space
  - `fileio.utils.get_filenames()` now matches for/against a regex pattern rather than just a contained substring; using the old params will now raise a deprecation warning
  - BREAKING: `fileio.utils.split_content_and_metadata()` now has `itemwise=False` by default, rather than `itemwise=True`, which means that splitting multi-document streams of content and metadata into parallel iterators is now the default action
  - added a `compression` param to `TextCorpus.save()` and `.load()` to optionally write the metadata json file in compressed form
  - moved `fileio.write_conll()` functionality to `export.doc_to_conll()`, which converts a spaCy doc into a CoNLL-U formatted string; writing that string to disk would require a separate call to `fileio.write_file()`
- Cleaned up deprecated/bad Py2/3 `compat` imports, and added better functionality for Py2/3 strings
  - now `compat.unicode_type` used for text data, `compat.bytes_type` for binary data, and `compat.string_types` for when either will do
  - also added `compat.unicode_to_bytes()` and `compat.bytes_to_unicode()` functions, for converting between string types
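A short sketch of the reworked `fileio` and `compat` helpers; the file path is a placeholder, and compression is inferred from the file extension as described above:

```python
from textacy import compat, fileio

# open_sesame() transparently handles gzip/bz2/lzma by file extension, expands
# '~' and relative paths, and creates missing intermediate directories;
# on Python 3, text mode ('wt'/'rt') is almost always what you want
with fileio.open_sesame('~/textacy_data/notes.txt.gz', mode='wt') as f:
    f.write('This line is gzip-compressed on write.\n')

# Py2/3 string helpers for converting between text and bytes
as_bytes = compat.unicode_to_bytes('sömé tëxt')
as_text = compat.bytes_to_unicode(as_bytes)
```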
Bugfixes:
- Fixed document(s) removal from `TextCorpus` objects, including correct decrementing of `.n_docs`, `.n_sents`, and `.n_tokens` attributes (@michelleful #29)
- Fixed `OSError` being incorrectly raised in `fileio.open_sesame()` on missing files
- `lang` parameter in `TextDoc` and `TextCorpus` can now be unicode or bytes, which was bug-like
oops
distance metrics, better TextDoc/TextCorpus, and most discriminating terms
Changes:
- New features for `TextDoc` and `TextCorpus` classes
  - added `.save()` methods and `.load()` classmethods, which allow for fast serialization of parsed documents/corpora and associated metadata to/from disk — with an important caveat: if the `spacy.Vocab` object used to serialize and deserialize is not the same, there will be problems, making this format useful as short-term but not long-term storage
  - `TextCorpus` may now be instantiated with an already-loaded spaCy pipeline, which may or may not have all models loaded; it can still be instantiated using a language code string ('en', 'de') to load a spaCy pipeline that includes all models by default
  - `TextDoc` methods wrapping `extract` and `keyterms` functions now have full documentation rather than forwarding users to the wrapped functions themselves; more irritating on the dev side, but much less irritating on the user side :)
- Added a `distance.py` module containing several document, set, and string distance metrics (see the sketch after this list)
  - word movers: document distance as the distance between individual words represented by word2vec vectors, normalized
  - "word2vec": token, span, or document distance as cosine distance between (average) word2vec representations, normalized
  - jaccard: string or set(string) distance as intersection / overlap, normalized, with optional fuzzy-matching across set members
  - hamming: distance between two strings as the number of substitutions, optionally normalized
  - levenshtein: distance between two strings as the number of substitutions, deletions, and insertions, optionally normalized (and removed a redundant function from the still-orphaned `math_utils.py` module)
  - jaro-winkler: distance between two strings with variable prefix weighting, normalized
- Added a `most_discriminating_terms()` function to the `keyterms` module to take a collection of documents split into two exclusive groups and compute the most discriminating terms for group1-and-not-group2 as well as group2-and-not-group1
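A hedged sketch of the new string/set metrics; function names are assumed to match the bullets above (e.g. `distance.levenshtein`), and exact signatures may differ:

```python
from textacy import distance

# string distances; values may be normalized to [0, 1] depending on kwargs
print(distance.levenshtein('speling', 'spelling'))  # substitutions + deletions + insertions
print(distance.hamming('karolin', 'kathrin'))       # substitutions between equal-length strings
print(distance.jaro_winkler('martha', 'marhta'))    # prefix-weighted string distance

# set distance as intersection / overlap
print(distance.jaccard(['nlp', 'text', 'corpus'], ['nlp', 'corpus']))
```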
Bugfixes:
Corpora Readers, Better Examples, and Fewer Bugs
Changes:
- Added a `corpora.RedditReader()` class for streaming Reddit comments from disk, with a `.texts()` method for a stream of plaintext comments and a `.comments()` method for a stream of structured comments as dicts, with basic filtering by text length and limiting the number of comments returned (see the sketch after this list)
- Refactored functions for streaming Wikipedia articles from disk into a `corpora.WikiReader()` class, with a `.texts()` method for a stream of plaintext articles and a `.pages()` method for a stream of structured pages as dicts, with basic filtering by text length and limiting the number of pages returned
- Updated README and docs with a more comprehensive — and correct — usage example; also added tests to ensure it doesn't get stale
- Updated requirements to the latest version of spaCy, as well as added matplotlib for `viz`
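A sketch of the two readers; the dump-file paths are placeholders, and the filtering kwargs (`min_len`, `limit`) are assumptions based on the description above:

```python
from textacy import corpora

# stream plaintext Wikipedia articles from a downloaded database dump
wiki = corpora.WikiReader('path/to/enwiki-latest-pages-articles.xml.bz2')
for text in wiki.texts(min_len=300, limit=5):  # kwarg names are assumptions
    print(text[:100])

# stream structured Reddit comments as dicts from a downloaded comments dump
reddit = corpora.RedditReader('path/to/RC_2015-01.bz2')
for comment in reddit.comments(limit=5):
    print(sorted(comment.keys()))
```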
Bugfixes:
- `textacy.preprocess.preprocess_text()` is now, once again, imported at the top level, so easily reachable via `textacy.preprocess_text()` (@bretdabaker #14)
- `viz` subpackage now included in the docs' API reference
- missing dependencies added into `setup.py` so pip install handles everything for folks
#DataViz #FeelTheBern #GermanNLP
0.2.2 (2016-05-05)
Changes:
- Added a `viz` subpackage, with two types of plots (so far):
  - `viz.draw_termite_plot()`, typically used to evaluate and interpret topic models; conveniently accessible from the `tm.TopicModel` class
  - `viz.draw_semantic_network()` for visualizing networks such as those output by `representations.network`
- Added a "Bernie & Hillary" corpus with 3000 congressional speeches made by Bernie Sanders and Hillary Clinton since 1996
  - the `corpora.fetch_bernie_and_hillary()` function automatically downloads to and loads from disk this corpus
- Modified `data.load_depechemood` function, now downloads data from GitHub source if not found on disk
- Removed `resources/` directory from GitHub, hence all the downloadin'
- Updated to spaCy v0.100.7
  - German is now supported! although some functionality is English-only
  - added a `textacy.load_spacy()` function for loading spaCy packages, taking advantage of the new `spacy.load()` API; added a DeprecationWarning for `textacy.data.load_spacy_pipeline()` (see the sketch after this list)
  - proper nouns' and pronouns' `.pos_` attributes are now correctly assigned 'PROPN' and 'PRON'; hence, modified `regexes_etc.POS_REGEX_PATTERNS['en']` to include 'PROPN'
  - modified `spacy_utils.preserve_case()` to check for the language-agnostic 'PROPN' POS rather than the English-specific 'NNP' and 'NNPS' tags
- Added a `text_utils.clean_terms()` function for cleaning up a sequence of single- or multi-word strings by stripping leading/trailing junk chars, handling dangling parens and odd hyphenation, etc.
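A brief sketch of a few of these additions; the terms passed to `clean_terms()` are toy examples, and the corpus download happens automatically on first call as noted above:

```python
import textacy
from textacy import corpora, text_utils

# new-style spaCy loading; the old data.load_spacy_pipeline() now emits a DeprecationWarning
nlp = textacy.load_spacy('en')

# downloads (if needed) and loads the Bernie & Hillary speeches from disk
bnh = corpora.fetch_bernie_and_hillary()

# clean up a sequence of messy candidate terms
cleaned = list(text_utils.clean_terms([' -acid rain ', 'greenhouse gas)', 'carbon-  ']))
```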
Bugfixes:
- `textstats.readability_stats()` now correctly gets the number of words in a doc from its generator function (@gryBox #8)
- removed NLTK dependency, which wasn't actually required
- `text_utils.detect_language()` now warns via `logging` rather than a `print()` statement
- `fileio.write_conll()` documentation now correctly indicates that the filename param is not optional
VSM and Topic Modeling
Changes:
- Added `representations` subpackage; includes modules for network and vector space model (VSM) document and corpus representations
  - Document-term matrix creation now takes documents represented as a list of terms (rather than as spaCy Docs); splits the tokenization step from vectorization for added flexibility
  - Some of this functionality was refactored from existing parts of the package
- Added `tm` (topic modeling) subpackage, with a main `TopicModel` class for training, applying, persisting, and interpreting NMF, LDA, and LSA topic models through a single interface (see the sketch after this list)
- Various improvements to `TextDoc` and `TextCorpus` classes
  - `TextDoc` can now be initialized from a spaCy Doc
  - Removed caching from `TextDoc`, because it was a pain and weird and probably not all that useful
  - `extract`-based methods are now generators, like the functions they wrap
  - Added `.as_semantic_network()` and `.as_terms_list()` methods to `TextDoc`
  - `TextCorpus.from_texts()` now takes advantage of multithreading via spaCy, if available, and document metadata can be passed in as a paired iterable of dicts
- Added read/write functions for sparse scipy matrices
- Added `fileio.read.split_content_and_metadata()` convenience function for splitting (text) content from associated metadata when reading data from disk into a `TextDoc` or `TextCorpus`
- Renamed `fileio.read.get_filenames_in_dir()` to `fileio.read.get_filenames()` and added functionality for matching/ignoring files by their names, file extensions, and ignoring invisible files
- Rewrote `export.docs_to_gensim()`, now significantly faster
- Imports in `__init__.py` files for main and subpackages now explicit
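An end-to-end sketch of the new pieces, under a few stated assumptions: import paths are approximate, `build_doc_term_matrix()` is assumed to return the sparse matrix together with an id-to-term mapping, and kwargs like `weighting` and `n_topics` are illustrative names only:

```python
from textacy import TextCorpus
from textacy.representations import vsm
from textacy.tm import TopicModel

# parse a few toy documents into a TextCorpus
corpus = TextCorpus.from_texts('en', ['Cats chase mice.', 'Dogs chase cats.', 'Taxes fund roads.'])

# vectorize: documents as lists of terms, then a sparse doc-term matrix
terms_lists = (doc.as_terms_list() for doc in corpus)
doc_term_matrix, id2term = vsm.build_doc_term_matrix(terms_lists, weighting='tf')

# train and apply an NMF topic model through the single TopicModel interface
model = TopicModel('nmf', n_topics=2)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)
```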
Bugfixes:
- `textstats.readability_stats()` no longer filters out stop words (@henningko #7)
- Wikipedia article processing now recursively removes nested markup
- `extract.ngrams()` now filters out ngrams with any space-only tokens
- functions with `include_nps` kwarg changed to `include_ncs`, to match the renaming of the associated function from `extract.noun_phrases()` to `extract.noun_chunks()`