spaCy v2.0 compatibility and lots of cleanup
Changes:
- Bumped version requirement for spaCy from < 2.0 to >= 2.0; textacy no longer
  works with spaCy 1.x! It's worth the upgrade, though. v2.0's new features and
  API enabled (or required) a few changes on textacy's end:
  - `textacy.load_spacy()` takes the same inputs as the new `spacy.load()`,
    i.e. a package `name` string and an optional list of pipes to `disable`
  - textacy's `Doc` metadata and language string are now stored in `user_data`
    directly on the spaCy `Doc` object; although the API from a user's perspective
    is unchanged, this made the next change possible
  - `Doc` and `Corpus` classes are now de/serialized via pickle into a single
    file: no more side-car JSON files for metadata! Accordingly, the `.save()`
    and `.load()` methods on both classes have a simpler API: they take
    a single string specifying the file on disk where data is stored.
- Cleaned up docs, imports, and tests throughout the entire code base.
  - docstrings and https://textacy.readthedocs.io 's API reference are easier to
    read, with better cross-referencing and far fewer broken web links
  - namespaces are less cluttered, and textacy's source code is easier to follow
  - `import textacy` takes less than half the time from before
  - the full test suite also runs about twice as fast, and most tests are now
    more robust to changes in the performance of spaCy's models
  - consistent adherence to conventions eases users' cognitive load :)
- The module responsible for caching loaded data in memory was cleaned up and
  improved, as well as renamed: from `data.py` to `cache.py`, which is more
  descriptive of its purpose. Otherwise, you shouldn't notice much of a difference
  besides things working correctly.
  - All loaded data (e.g. spaCy language pipelines) is now cached together in a
    single LRU cache whose max size is set to 2GB, and the size of each element
    in the cache is now accurately computed. (tl;dr: `sys.getsizeof` does not
    work on non-built-in objects like, say, a `spacy.tokens.Doc`.)
  - Loading and downloading of the DepecheMood resource is now less hacky and
    weird, and much closer to how users already deal with textacy's various
    `Dataset`s. In fact, it can be downloaded in exactly the same way as the
    datasets via textacy's new CLI: `$ python -m textacy download depechemood`.
    P.S. A brief guide for using the CLI got added to the README.
- Several function/method arguments marked for deprecation have been removed.
  If you've been ignoring the warnings that print out when you use `lemmatize=True`
  instead of `normalize='lemma'` (etc.), now is the time to update your calls!
  - Of particular note: The `readability_stats()` function has been removed;
    use `TextStats(doc).readability_stats` instead.
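
For anyone curious about the `sys.getsizeof` remark in the caching notes above, here is a minimal stdlib-only sketch of the underlying problem. It is not textacy's actual implementation: `Pipeline` is a hypothetical stand-in for a large loaded resource, and `deep_getsizeof` is an illustrative helper that only follows plain Python containers and `__dict__` attributes. It would not see memory held inside C extension objects such as a real spaCy `Doc`.

```python
import sys


class Pipeline:
    """Hypothetical stand-in for a large loaded resource (not a textacy/spaCy class)."""

    def __init__(self, n_words):
        self.vocab = ["word%d" % i for i in range(n_words)]


def deep_getsizeof(obj, seen=None):
    """Rough size estimate that also visits the objects `obj` references."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # don't count shared objects twice
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)  # shallow size of this object only
    if isinstance(obj, dict):
        size += sum(
            deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
            for k, v in obj.items()
        )
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    elif hasattr(obj, "__dict__"):
        size += deep_getsizeof(vars(obj), seen)
    return size


small = Pipeline(10)
large = Pipeline(100000)

# The shallow size ignores the vocab list hanging off each instance,
# so it stays tiny no matter how much data the object holds.
shallow_small = sys.getsizeof(small)
shallow_large = sys.getsizeof(large)

# The recursive estimate grows with the amount of data actually held.
deep_small = deep_getsizeof(small)
deep_large = deep_getsizeof(large)

print("shallow:", shallow_small, shallow_large)
print("deep:   ", deep_small, deep_large)
```

A size-limited cache that relied on the shallow numbers would treat both objects as nearly free, which is why per-element sizes need to be computed some other way.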
Bugfixes:
- In certain situations, the text of a spaCy span was being returned without
  whitespace between tokens; that has been avoided in textacy, and the source bug
  in spaCy got fixed (by yours truly! explosion/spaCy#1621).
- When adding already-parsed `Doc`s to a `Corpus`, including `metadata`
  now correctly overwrites any existing metadata on those docs.
- Fixed a couple related issues involving the assignment of a 2-letter language
  string to the `.lang` attribute of `Doc` and `Corpus` objects.
- textacy's CLI wasn't correctly handling certain dataset kwargs in all cases;
  now, all kwargs get to their intended destinations.