v0.6.3
New:
- Added a proper contributing guide and code of conduct, as well as separate
GitHub issue templates for different user situations. This should help folks
contribute to the project more effectively, and make maintaining it a bit easier,
too. [Issue #212]
- Gave the documentation a new look, using a template popularized by `requests`.
Added documentation on dealing with multi-lingual datasets. [Issue #233]
- Made some minor adjustments to package dependencies, the way they're specified,
and the Travis CI setup, making for a faster and better development experience.
- Confirmed and enabled compatibility with v2.1+ of `spacy`. 💫
Changed:
- Improved the `Wikipedia` dataset class in a variety of ways: it can now read
Wikinews db dumps; access records in namespaces other than the usual "0"
(such as category pages in namespace "14"); parse and extract category pages
in several languages, including in the case of bad wiki markup; and filter out
section headings from the accompanying text via an `include_headings` kwarg.
[PR #219, #220, #223, #224, #231]
- Removed the `transliterate_unicode()` preprocessing function that transliterated
non-ascii text into a reasonable ascii approximation, for technical and
philosophical reasons. Also removed its GPL-licensed `unidecode` dependency,
for legal-ish reasons. [Issue #203]
- Added convention-abiding `exclude` argument to the function that writes
`spacy` docs to disk, to limit which pipeline annotations are serialized.
Replaced the existing but non-standard `include_tensor` arg.
- Deprecated the `n_threads` argument in `Corpus.add_texts()`, which had not
been working in `spacy.pipe` for some time and, as of v2.1, is defunct.
- Made many tests model- and python-version agnostic and thus less likely to break
when `spacy` releases new and improved models.
- Auto-formatted the entire code base using `black`; the results aren't always
more readable, but they are pleasingly consistent.
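
For code that relied on the removed `transliterate_unicode()`, one rough stand-in is Unicode decomposition from the Python standard library. This is only a sketch, not what this package shipped: `ascii_fold` is a hypothetical helper name, and unlike `unidecode` it silently drops characters that have no ASCII decomposition (for example `ø`, or text in non-Latin scripts).

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Approximate non-ASCII text in ASCII (hypothetical helper, lossy).

    NFKD normalization splits accented characters into a base character
    plus combining marks; encoding with errors="ignore" then discards the
    marks along with any character that has no ASCII equivalent.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

For example, `ascii_fold("café naïve")` gives `"cafe naive"`. If dropped characters are a problem, a dedicated transliteration library is the better choice.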
Fixed:
- Fixed bad behavior of `key_terms_from_semantic_network()`, where an error
would be raised if no suitable key terms could be found; now, an empty list
is returned instead. [Issue #211]
- Fixed variable name typo so `GroupVectorizer.fit()` actually works. [Issue #215]
- Fixed a minor typo in the quick-start docs. [PR #217]
- Check for and filter out any named entities that are entirely whitespace,
seemingly caused by an issue in `spacy`.
- Fixed an undefined variable error when merging spans. [Issue #225]
- Fixed a unicode/bytes issue in experimental function for deserializing
`spacy` docs in "binary" format. [Issue #228, PR #229]
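
The whitespace-entity fix above amounts to a simple filter. Here is a minimal sketch of the idea, assuming each entity exposes its text through a `.text` attribute; `Entity` and `drop_whitespace_entities` are hypothetical names used for illustration, not this package's actual implementation.

```python
from typing import Iterable, List, NamedTuple


class Entity(NamedTuple):
    # Hypothetical stand-in for a named-entity span; only `.text` is used.
    text: str
    label: str


def drop_whitespace_entities(ents: Iterable[Entity]) -> List[Entity]:
    """Keep only entities whose text has at least one non-whitespace character."""
    # str.strip() returns "" for empty or all-whitespace text, which is falsy.
    return [ent for ent in ents if ent.text.strip()]
```

The same predicate (`ent.text.strip()`) should work on any object with a string `.text` attribute.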
Contributors:
Many thanks to @abevieiramota, @ckot, @Jude188, and @digest0r for their help!