Releases: chartbeat-labs/textacy

packaging upgrades, faster language id, bug fixes

02 Apr 22:46

Took a (longer than expected) break from NLP, so this release is mostly just maintenance and bug fixes — but in anticipation of more interesting updates to come.

  • upgraded built-in language identification model (PR #375)
    • replaced v2 thinc/cld3 model with v3 floret/fasttext model, which has much faster predictions and comparable but more consistent performance
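    • A minimal usage sketch (function names per the existing lang_id interface, which is unchanged by the model swap; output values are illustrative):

      >>> from textacy import lang_id
      >>> lang_id.identify_lang("Ceci n'est pas une pipe.")
      'fr'
      >>> lang_id.identify_topn_langs("Ceci n'est pas une pipe.", topn=2)
      [('fr', 0.99), ('en', 0.01)]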
  • modernized and improved Python packaging for faster, simpler installation and testing (PR #368 and #369)
    • all package metadata and configuration moved into a single pyproject.toml file
    • code formatting and linting updated to use ruff plus newer versions of mypy and black, and their use in GitHub Actions CI has been consolidated
    • bumped supported Python versions range from 3.8–3.10 to 3.9–3.11 (PR #369)
    • added full CI testing matrix for PY 3.9/3.10/3.11 x Linux/macOS/Windows, and removed extraneous AppVeyor integration
  • updated and improved type hints throughout, reducing number of mypy complaints by ~80% (PR #372)

Fixed

  • fixed ReDoS bugs in regex patterns (PR #371)
  • fixed breaking API issues with newer networkx/scikit-learn versions (PR #367)
  • improved dev workflow documentation and code to better incorporate language data (PR #363)
  • updated caching code with a fix from the upstream pysize library, resolving a bug that prevented the Russian-language spaCy model from loading properly (PR #358)

Contributors

Big thanks to @jonwiggins, @Hironsan, and @kevinbackhouse for the fixes!

more text stats, consistent doc extensions, better packaging

06 Dec 14:59

New and Changed

  • Refactored and extended text statistics functionality (PR #350)
    • Added functions for computing measures of lexical diversity, such as the classic Type-Token Ratio and the modern Hypergeometric Distribution Diversity
    • Added functions for counting token-level attributes, including morphological features and parts-of-speech, in a convenient form
    • Refactored all text stats functions to accept a Doc as their first positional arg, suitable for use as custom doc extensions (see below)
    • Deprecated the TextStats class, since the other means of accessing the underlying functionality were made more accessible and convenient, and there's no longer a need for a third method.
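    • For example, a hedged sketch of the function-based usage (module paths assumed from the refactor described above):

      >>> import textacy
      >>> from textacy import text_stats
      >>> doc = textacy.make_spacy_doc("This is a short test document.", lang="en_core_web_sm")
      >>> text_stats.diversity.ttr(doc)  # classic Type-Token Ratio
      >>> text_stats.counts.pos(doc)  # counts of part-of-speech tags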
  • Standardized functionality for getting/setting/removing doc extensions (PR #352)
    • Now, custom extensions are accessed by name, and users have more control over the process:

      >>> import textacy
      >>> from textacy import extract, text_stats
      >>> textacy.set_doc_extensions("extract")
      >>> textacy.set_doc_extensions("text_stats.readability")
      >>> textacy.remove_doc_extensions("extract.matches")
      >>> textacy.make_spacy_doc("This is a test.", "en_core_web_sm")._.flesch_reading_ease()
      118.17500000000001
    • Moved top-level extensions into spacier.core and extract.bags

    • Standardized extract and text_stats subpackage extensions to use the new setup, and made them more customizable

  • Improved package code, tests, and docs
    • Fixed outdated code and comments in the "Quickstart" guide, then renamed it "Walkthrough" since it wasn't actually quick; added a new and, yes, quick "Quickstart" guide to fill the gap (PR #353)
    • Added a pytest conftest file to improve maintainability and consistency of unit test suite (PR #353)
    • Improved quality and consistency of type annotations, everywhere (PR #349)
    • Note: Bumped Python version support from 3.7–3.9 to 3.8–3.10 in order to take advantage of new typing features in PY3.8 and formally support the current major version (PR #348)
    • Modernized and streamlined package builds and configuration (PR #347)
      • Removed deprecated setup.py and switched from setuptools to build for builds
      • Consolidated tool configuration in pyproject.toml
      • Extended and tidied up dev-oriented Makefile
      • Addressed some CI/CD issues

Fixed

  • Added missing import and args in TextStats docs (PR #331, Issue #334)
  • Fixed normalization in YAKE keyword extraction (PR #332)
  • Fixed text encoding issue when loading ConceptNet data on Windows systems (Issue #345)

Contributors

Thanks to @austinjp, @scarroll32, @mirkolenz for their help!

big refactor, improved functionality, and spaCy v3

12 Apr 15:49

This is probably the largest single update in textacy's history. The changes necessary for upgrading to spaCy v3 prompted a cascade of additional updates, quality-of-life improvements, expansions and retractions of scope, and general package cleanup to better align textacy with its primary dependency and set it up for future updates. Note that this version includes a number of breaking changes; most are minor and have easy fixes, but some represent actual shifts in functionality. Read on for details!

  • Refactored, standardized, and extended several areas of functionality
    • text preprocessing (textacy.preprocessing)
      • Added functions for normalizing bullet points in lists (normalize.bullet_points()), removing HTML tags (remove.html_tags()), and removing bracketed contents such as in-line citations (remove.brackets()).
      • Added make_pipeline() function for combining multiple preprocessors into a single callable that applies them sequentially to input text.
      • Renamed functions for flexibility and clarity of use; in most cases, this entails replacing an underscore with a period, e.g. preprocessing.normalize_whitespace() => preprocessing.normalize.whitespace().
      • Renamed and standardized some funcs' args; for example, all "replace" functions had their (optional) second argument renamed from replace_with => repl, and remove.punctuation(text, marks=".?!") => remove.punctuation(text, only=[".", "?", "!"]).
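      • Putting these together, a small sketch of a composed preprocessor (output illustrative):

        >>> from textacy import preprocessing
        >>> preproc = preprocessing.make_pipeline(
        ...     preprocessing.remove.html_tags,
        ...     preprocessing.replace.urls,
        ...     preprocessing.normalize.whitespace,
        ... )
        >>> preproc("<p>Visit   https://example.com   today!</p>")
        'Visit _URL_ today!'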
    • structured information extraction (textacy.extract)
      • Consolidated and restructured functionality previously spread across the extract.py and text_utils.py modules and ke subpackage. For the latter two, imports have changed:
        • from textacy import ke; ke.textrank() => from textacy import extract; extract.keyterms.textrank()
        • from textacy import text_utils; text_utils.keywords_in_context() => from textacy import extract; extract.keywords_in_context()
      • Added new extraction functions:
        • extract.regex_matches(): For matching regex patterns in a document's text that cross spaCy token boundaries, with various options for aligning matches back to tokens.
        • extract.acronyms(): For extracting acronym-like tokens, without looking around for related definitions.
        • extract.terms(): For flexibly combining n-grams, entities, and noun chunks into a single collection, with optional deduplication.
      • Improved the generality and quality of extracted "triples" such as Subject-Verb-Objects, and changed the structure of returned objects accordingly. Previously, only contiguous spans were permitted for each element, but this was overly restrictive: A sentence like "I did not really like the movie." would produce an SVO of ("I", "like", "movie") which is... misleading. The new approach uses lists of tokens that need not be adjacent; in this case, it produces (["I"], ["did", "not", "like"], ["movie"]). For convenience, triple results are all named tuples, so elements may be accessed by name or index (e.g. svo.subject == svo[0]).
      • Changed extract.keywords_in_context() to always yield results, with optional padding of contexts, leaving printing of contexts up to users; also extended it to accept Doc or str objects as input.
      • Removed deprecated extract.pos_regex_matches() function, which is superseded by the more powerful extract.token_matches().
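      • A brief sketch of the consolidated interface (illustrative; import paths per the notes above):

        >>> import textacy
        >>> from textacy import extract
        >>> doc = textacy.make_spacy_doc("I did not really like the movie.", lang="en_core_web_sm")
        >>> list(extract.subject_verb_object_triples(doc))
        >>> extract.keyterms.textrank(doc, topn=5)
        >>> list(extract.keywords_in_context(doc, "movie", window_width=20))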
    • string and sequence similarity metrics (textacy.similarity)
      • Refactored top-level similarity.py module into a subpackage, with metrics split out into categories: edit-, token-, and sequence-based approaches, as well as hybrid metrics.
      • Added several similarity metrics:
        • edit-based Jaro (similarity.jaro())
        • token-based Cosine (similarity.cosine()), Bag (similarity.bag()), and Tversky (similarity.tversky())
        • sequence-based Matching Subsequences Ratio (similarity.matching_subsequences_ratio())
        • hybrid Monge-Elkan (similarity.monge_elkan())
      • Removed a couple similarity metrics: Word Movers Distance relied on a troublesome external dependency, and Word2Vec+Cosine is available in spaCy via Doc.similarity.
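      • For instance (function locations assumed from the new subpackage layout):

        >>> from textacy import similarity
        >>> similarity.jaro("color", "colour")
        >>> similarity.cosine(["color", "of", "the", "sky"], ["colour", "of", "the", "sea"])
        >>> similarity.monge_elkan(["color", "sky"], ["colour", "sea"])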
    • network- and vector-based document representations (textacy.representations)
      • Consolidated and reworked networks functionality in representations.network module
        • Added build_cooccurrence_network() function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes for each unique string and edges to other strings that co-occurred.
        • Added build_similarity_network() function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes as top-level elements and edges to all others weighted by pairwise similarity.
        • Removed obsolete network.py module and duplicative extract.keyterms.graph_base.py module.
      • Refined vectorizer initialization, and moved from vsm.vectorizers to representations.vectorizers module.
        • For both Vectorizer and GroupVectorizer, applying global inverse document frequency weights is now handled by a single arg: idf_type: Optional[str], rather than a combination of apply_idf: bool, idf_type: str; similarly, applying document-length weight normalizations is handled by dl_type: Optional[str] instead of apply_dl: bool, dl_type: str
      • Added representations.sparse_vec module for higher-level access to document vectorization via build_doc_term_matrix() and build_grp_term_matrix() functions, for cases when a single fit+transform is all you need.
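      • A hedged sketch of the higher-level entry points (arg names assumed):

        >>> from textacy.representations import network, sparse_vec
        >>> tokenized_docs = [["the", "sky", "is", "blue"], ["the", "sea", "is", "blue"]]
        >>> graph = network.build_cooccurrence_network(tokenized_docs, window_size=2)
        >>> dtm, vocab = sparse_vec.build_doc_term_matrix(tokenized_docs, tf_type="linear")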
    • automatic language identification (textacy.lang_id)
      • Moved functionality from lang_utils.py module into a subpackage, and added the primary user interface (identify_lang() and identify_topn_langs()) as package-level imports.
      • Implemented and trained a more accurate thinc-based language identification model that's closer to the original CLD3 inspiration, replacing the simpler sklearn-based pipeline.
  • Updated interface with spaCy for v3, and better leveraged the new functionality
    • Restricted textacy.load_spacy_lang() to only accept full spaCy language pipeline names or paths, in accordance with v3's removal of pipeline aliases and general tightening-up on this front. Unfortunately, textacy can no longer play fast and loose with automatic language identification => pipeline loading...
    • Extended textacy.make_spacy_doc() to accept a chunk_size arg that splits input text into chunks, processes each individually, then joins them into a single Doc; supersedes spacier.utils.make_doc_from_text_chunks(), which is now deprecated.
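    • For example, a minimal sketch of the chunked entry point (the chunking granularity here is an assumption):

      >>> import textacy
      >>> very_long_text = " ".join(["All work and no play."] * 100_000)  # hypothetical huge input
      >>> doc = textacy.make_spacy_doc(very_long_text, lang="en_core_web_sm", chunk_size=100_000)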
    • Moved core Doc extensions into a top-level extensions.py module, and improved/streamlined the collection
      • Refactored and improved performance of Doc._.to_bag_of_words() and Doc._.to_bag_of_terms(), leveraging related functionality in extract.words() and extract.terms()
      • Removed redundant/awkward extensions:
        • Doc._.lang => use Doc.lang_
        • Doc._.tokens => use iter(Doc)
        • Doc._.n_tokens => len(Doc)
        • Doc._.to_terms_list() => extract.terms(doc) or Doc._.extract_terms()
        • Doc._.to_tagged_text() => NA, this was an old holdover that's not used in practice anymore
        • Doc._.to_semantic_network() => NA, use a function in textacy.representations.networks
    • Added Doc extensions for textacy.extract functions (see above for details), with most functions having direct analogues; for example, to extract acronyms, use either textacy.extract.acronyms(doc) or doc._.extract_acronyms(). Keyterm extraction functions share a single extension: textacy.extract.keyterms.textrank(doc) <> doc._.extract_keyterms(method="textrank")
    • Leveraged spaCy's new DocBin for efficiently saving/loading Docs in binary format, with corresponding arg changes in io.write_spacy_docs() and Corpus.save()+.load()
  • Improved package documentation, tests, dependencies, and type annotations
    • Added two beginner-oriented tutorials to documentation, showing how to use various aspects of the package in the context of specific tasks.
    • Reorganized API reference docs to put like functionality together and more consistently provide summary tables up top
    • Updated dependencies list and package versions
      • Removed: pyemd and srsly
      • Un-capped max versions: numpy and scikit-learn
      • Bumped min versions: cytoolz, jellyfish, matplotlib, pyphen, and spacy (v3.0+ only!)
    • Bumped min Python version from 3.6 => 3.7, and added PY3.9 support
    • Removed textacy.export module, which had functions for exporting spaCy docs into other external formats; this was a soft dependency on gensim and CoNLL-U that wasn't enforced or guaranteed, so better to remove.
    • Added types.py module for shared types, and used them everywhere. Also added/fixed type annotations throughout the code base.
    • Improved, added, and parametrized literally hundreds of tests.

Contributors

Many thanks to @timgates42, @datanizing, @8W9aG, @0x2b3bfa0, and @gryBox for submitting PRs, either merged or used as inspiration for my own rework-in-progress.

cleaner code, better packaging, and some upgrades

29 Aug 20:51

New and Changed:

  • Expanded text statistics and refactored into a sub-package (PR #307)
    • Refactored text_stats module into a sub-package with the same name and top-level API, but restructured under the hood for better consistency
    • Improved performance, API, and documentation on the main TextStats class, and improved documentation on many of the individual stats functions
    • Added new readability tests for texts in Arabic (Automated Arabic Readability Index), Spanish (µ-legibility and perspicuity index), and Turkish (a lang-specific formulation of Flesch Reading Ease)
    • Breaking change: Removed TextStats.basic_counts and TextStats.readability_stats attributes, since typically only one or a couple of stats are needed for a given use case; also, some of the readability tests are language-specific, which meant bad results could get mixed in with good ones
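    • With that change, individual stats are accessed one at a time, along these lines (attribute names assumed from this version's API):

      >>> import textacy
      >>> from textacy import text_stats
      >>> doc = textacy.make_spacy_doc("Another short example text.", lang="en_core_web_sm")
      >>> ts = text_stats.TextStats(doc)
      >>> ts.n_words
      >>> ts.flesch_reading_ease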
  • Improved and standardized some code quality and performance (PR #305, #306)
    • Standardized error messages via top-level errors.py module
    • Replaced str.format() with f-strings (almost) everywhere, for performance and readability
    • Fixed a whole mess of linting errors, significantly improving code quality and consistency
  • Improved package configuration and maintenance (PRs #298, #305, #306)
    • Added automated GitHub workflows for building and testing the package, linting and formatting, publishing new releases to PyPI, and building documentation (and ripped out Travis CI)
    • Added a makefile with common commands for dev work, plus instructions
    • Adopted the new pyproject.toml package configuration standard; updated and streamlined setup.py and setup.cfg accordingly; and removed requirements.txt
    • Moved all source code into a /src directory, for technical reasons
    • Added mypy-specific config file to reduce output noisiness when type-checking
  • Improved and moved package documentation (PR #309)
    • Moved the docs site back to ReadTheDocs (https://textacy.readthedocs.io)! Pardon the years-long detour into GitHub Pages...
    • Enabled markdown-based documentation using recommonmark instead of m2r, and migrated all "narrative" docs from .rst to equivalent .md files
    • Added auto-generated summary tables to many sections of the API Reference, to help users get an overview of functionality and better find what they're looking for; also added auto-generated section heading references
    • Tidied up and further standardized docstrings throughout the code
  • Kept up with the Python ecosystem
    • Trained a v1.1 language identifier model using scikit-learn==0.23.0, and bumped the upper bound on that dependency's version accordingly
    • Updated and parametrized many tests using modern pytest functionality (PR #306)
    • Got textacy versions 0.9.1 and 0.10.0 up on conda-forge (Issue #294)
    • Added spectral seriation as a term-ordering technique when making a "Termite" visualization by taking advantage of pandas.DataFrame functionality, and otherwise tidied up the default for nice-looking plots (PR #295)

Fixed:

  • Corrected a misleading reference in the quickstart docs (Issue #300, PR #302)
  • Fixed a bug in the delete_words() augmentation transform (Issue #308)

Contributors:

Special thanks to @tbsexton, @marius-mather, and @rmax for their contributions! 💐

keeping up with the evolving python ecosystem

01 Mar 23:17

New:

  • Added a logo to textacy's documentation and social preview 📃
  • Added type hints throughout the code base, for more expressive type indicators in docstrings and for static type checkers used by developers to code more effectively (PR #289)
  • Added a preprocessing function to normalize sequences of repeating characters (Issue #275)

Changed:

  • Improved core Corpus functionality using recent additions to spacy (PR #285)
    • Re-implemented Corpus.save() and Corpus.load() using spacy's new DocBin class, which resolved a few bugs/issues (Issue #254)
    • Added n_process arg to Corpus.add() to set the number of parallel processes used when adding many items to a corpus, following spacy's updates to nlp.pipe() (Issue #277)
    • Bumped minimum spaCy version from 2.0.12 => 2.2.0, accordingly
  • Added handling of zero-width whitespace to the normalize_whitespace() function (Issue #278)
  • Improved a couple rough spots in package administration:
    • Moved package setup information into a declarative configuration file, in an attempt to keep up with evolving best practices for Python packaging
    • Simplified the configuration and interoperability of sphinx + github pages for generating package documentation

Fixed:

  • Fixed typo in ConceptNet docstring (Issue #280)
  • Trained and distributed a LangIdentifier model using scikit-learn==0.22, to prevent ambiguous errors when trying to load a file that didn't exist (Issues #291, #292)

v0.9.1 (oops)

03 Sep 21:31

Changed:

  • Tweaked TopicModel class to work with newer versions of scikit-learn, and updated version requirements accordingly from >=0.18.0,<0.21.0 to >=0.19

Fixed:

  • Fixed residual bugs in the script for training language identification pipelines, then trained and released one using scikit-learn==0.19 to prevent errors for users on that version

data augmentation, linguistic resources, and PY3

03 Sep 17:12

Note: textacy is now PY3-only! 🎉 Specifically, support for PY2.7 has been dropped, and the minimum PY3 version has been bumped to 3.6 (PR #261). See below for related changes.

New:

  • Added augmentation subpackage for basic text data augmentation (PR #268, #269)
    • implemented several transformer functions for substituting, inserting, swapping, and deleting elements of text at both the word- and character-level
    • implemented an Augmenter class for combining multiple transforms and applying them to spaCy Docs in a randomized but configurable manner
    • Note: This API is provisional, and subject to change in future releases.
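    • A provisional-API sketch (transform names per this description; exact parameters assumed):

      >>> import textacy
      >>> from textacy import augmentation
      >>> doc = textacy.make_spacy_doc("The quick brown fox jumps over the lazy dog.", lang="en_core_web_sm")
      >>> augmenter = augmentation.Augmenter(
      ...     [augmentation.transforms.swap_words, augmentation.transforms.delete_words],
      ...     num=[0.5, 0.5],
      ... )
      >>> augmenter.apply_transforms(doc)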
  • Added resources subpackage for standardized access to linguistic resources (PR #265)
    • DepecheMood++: high-coverage emotion lexicons for understanding the emotions evoked by a text. Updated from a previous version, and now features better English data and Italian data with expanded, consistent functionality.
      • removed lexicon_methods.py module with previous implementation
    • ConceptNet: multilingual knowledge base for representing relationships between words, similar to WordNet. Currently supports getting word antonyms, hyponyms, meronyms, and synonyms in dozens of languages.
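    • For example, a hedged sketch (method signatures assumed):

      >>> import textacy.resources
      >>> rs = textacy.resources.ConceptNet()
      >>> rs.download()  # one-time fetch of the underlying data
      >>> rs.get_synonyms("spring", lang="en", sense="n")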
  • Added UDHR dataset, a collection of translations of the Universal Declaration of Human Rights (PR #271)

Changed:

  • Updated and extended functionality previously blocked by PY2 compatibility while reducing code bloat / complexity
    • made many args keyword-only, to prevent user error
    • args accepting strings for directory / file paths now also accept pathlib.Path objects, with pathlib adopted widely under the hood
    • increased minimum versions and/or uncapped maximum versions of several dependencies, including jellyfish, networkx, and numpy
  • Added a Portuguese-specific formulation of Flesch Reading Ease score to text_stats (PR #263)
  • Reorganized and grouped together some like functionality
    • moved core functionality for loading spaCy langs and making spaCy docs into spacier.core, out of cache.py and doc.py
    • moved some general-purpose functionality from dataset.utils to io.utils and utils.py
    • moved function for loading "hyphenator" out of cache.py and into text_stats.py, where it's used
  • Re-trained and released language identification pipelines using a better mix of training data, for slightly improved performance; also added the script used to train the pipeline
  • Changed API Reference docs to show items in source-code order rather than alphabetical order, which should make the ordering more human-friendly
  • Updated repo README and PyPi metadata to be more consistent and representative of current functionality
  • Removed previously deprecated textacy.io.split_record_fields() function

Fixed:

  • Fixed a regex for cleaning up crufty terms to prevent catastrophic backtracking in certain edge cases (true story: this bug was encountered in production code, and ruined my day)
  • Fixed bad handling of edge cases in sCAKE keyterm extraction (Issue #270)
  • Changed order in which URL regexes are applied in preprocessing.replace_urls() to properly handle certain edge case URLs (Issue #267)

Contributors:

Thanks much to @hugoabonizio for the contribution. 🤝

better text preprocessing, keyterm extraction, and document similarity

14 Jul 22:25

New and Changed:

  • Refactored and expanded text preprocessing functionality (PR #253)
    • Moved code from a top-level preprocess module into a preprocessing sub-package, and reorganized it in the process
    • Added new functions:
      • replace_hashtags() to replace hashtags like #FollowFriday or #spacyIRL2019 with _TAG_
      • replace_user_handles() to replace user handles like @bjdewilde or @spacy_io with _USER_
      • replace_emojis() to replace emoji symbols like 😉 or 🚀 with _EMOJI_
      • normalize_hyphenated_words() to join hyphenated words back together, like antici- pation => anticipation
      • normalize_quotation_marks() to replace "fancy" quotation marks with simple ascii equivalents, like “the god particle” => "the god particle"
    • Changed a couple functions for clarity and consistency:
      • replace_currency_symbols() now replaces all dedicated ascii and unicode currency symbols with _CUR_, rather than just a subset thereof, and no longer provides for replacement with the corresponding currency code (like $ => USD)
      • remove_punct() now has a fast (bool) kwarg rather than method (str)
    • Removed normalize_contractions(), preprocess_text(), and fix_bad_unicode() functions, since they were bad/awkward and more trouble than they were worth
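    • A quick sketch of the reorganized functions, as named at this point in the package's history (outputs per the examples above):

      >>> from textacy import preprocessing
      >>> preprocessing.replace_emojis("textacy is a big 🚀")
      'textacy is a big _EMOJI_'
      >>> preprocessing.normalize_hyphenated_words("antici- pation")
      'anticipation'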
  • Refactored and expanded keyterm extraction functionality (PR #257)
    • Moved code from a top-level keyterms module into a ke sub-package, and cleaned it up / standardized arg names / better shared functionality in the process
    • Added new unsupervised keyterm extraction algorithms: YAKE (ke.yake()), sCAKE (ke.scake()), and PositionRank (ke.textrank(), with non-default parameter values)
    • Added new methods for selecting candidate keyterms: longest matching subsequence candidates (ke.utils.get_longest_subsequence_candidates()) and pattern-matching candidates (ke.utils.get_pattern_matching_candidates())
    • Improved speed of SGRank implementation, and generally optimized much of the code
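    • For instance (imports per the new sub-package; illustrative):

      >>> import textacy
      >>> from textacy import ke
      >>> doc = textacy.make_spacy_doc("Keyterm extraction pulls salient terms from a document's text.", lang="en_core_web_sm")
      >>> ke.textrank(doc, topn=5)
      >>> ke.yake(doc, ngrams=(1, 2), topn=5)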
  • Improved document similarity functionality (PR #256)
    • Added a character ngram-based similarity measure (similarity.character_ngrams()), for something that's useful in different contexts than the other measures
    • Removed Jaro-Winkler string similarity measure (similarity.jaro_winkler()), since it didn't add much beyond other measures
    • Improved speed of Token Sort Ratio implementation
    • Replaced python-levenshtein dependency with jellyfish, for its active development, better documentation, and actually-compliant license
  • Added customizability to certain functionality
    • Added options to Doc._.to_bag_of_words() and Corpus.word_counts() for filtering out stop words, punctuation, and/or numbers (PR #249)
    • Allowed for objects that look like sklearn-style topic modeling classes to be passed into tm.TopicModel() (PR #248)
    • Added options to customize rc params used by matplotlib when drawing a "termite" plot in viz.draw_termite_plot() (PR #248)
  • Removed deprecated functions with direct replacements: io.utils.get_filenames() and spacier.components.merge_entities()

Contributors:

Huge thanks to @kjoshi and @zf109 for the PRs! 🙌

v0.7.1

25 Jun 21:06

New:

  • Added a default, built-in language identification classifier that's moderately fast, moderately accurate, and covers a relatively large number of languages [PR #247]
    • Implemented a Google CLD3-inspired model in scikit-learn and trained it on ~1.5M texts in ~130 different languages spanning a wide variety of subject matter and stylistic formality; overall, speed and performance compare favorably to other open-source options (langid, langdetect, cld2-cffi, and cld3)
    • Dropped cld2-cffi dependency [Issue #246]
  • Added extract.matches() function to extract spans from a document matching one or more patterns of per-token (attribute, value) pairs, with optional quantity qualifiers; this is a convenient interface to spaCy's rule-based Matcher and a more powerful replacement for textacy's existing (now deprecated) extract.pos_regex_matches()
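    • For example, a hedged sketch (the pattern-string format is assumed from the description):

      >>> import textacy
      >>> from textacy import extract
      >>> doc = textacy.make_spacy_doc("The quick brown fox jumps over the lazy dog.", lang="en_core_web_sm")
      >>> list(extract.matches(doc, "POS:ADJ:+ POS:NOUN"))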
  • Added preprocess.normalize_unicode() function to transform unicode characters into their canonical forms; this is a less-intensive consolation prize for the previously-removed fix_unicode() function

Changed:

  • Enabled loading blank spaCy Language pipelines (tokenization only -- no model-based tagging, parsing, etc.) via load_spacy_lang(name, allow_blank=True) for use cases that don't rely on annotations; disabled by default to avoid unwelcome surprises
  • Changed inclusion/exclusion and de-duplication of entities and ngrams in to_terms_list() [Issues #169, #179]
    • entities = True => include entities, and drop exact duplicate ngrams
    • entities = False => don't include entities, and also drop exact duplicate ngrams
    • entities = None => use ngrams as-is without checking against entities
  • Moved to_collection() function from the datasets.utils module to the top-level utils module, for use throughout the code base
  • Added quoting option to io.read_csv() and io.write_csv(), for problematic cases
  • Deprecated the spacier.components.merge_entities() pipeline component, an implementation of which has since been added into spaCy itself
  • Updated documentation for developer convenience and reader clarity
    • Split API reference docs into related chunks, rather than having them all together in one long page, and tidied up headers
    • Fixed errors / inconsistencies in various docstrings (a never-ending struggle...)
    • Ported package readme and changelog from .rst to .md format

Fixed:

  • The NotImplementedError previously added to preprocess.fix_unicode() is now raised rather than returned [Issue #243]

standardizing, streamlining, and snuggling up to spaCy

13 May 14:31

New and Changed:

  • Removed textacy.Doc, and split its functionality into two parts
    • New: Added textacy.make_spacy_doc() as a convenient and flexible entry point for making spaCy Docs from text or (text, metadata) pairs, with optional spaCy language pipeline specification. It's similar to textacy.Doc.__init__, with the exception that text and metadata are passed in together as a 2-tuple.
    • New: Added a variety of custom doc property and method extensions to the global spacy.tokens.Doc class, accessible via its Doc._ "underscore" property. These are similar to the properties/methods on textacy.Doc; they just require an interstitial underscore. For example, textacy.Doc.to_bag_of_words() => spacy.tokens.Doc._.to_bag_of_words().
    • New: Added functions for setting, getting, and removing these extensions. Note that they are set automatically when textacy is imported.
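    • A small sketch of the new entry point (illustrative):

      >>> import textacy
      >>> doc = textacy.make_spacy_doc(("To be, or not to be.", {"author": "William Shakespeare"}), lang="en")
      >>> doc._.meta["author"]
      'William Shakespeare'
      >>> doc._.to_bag_of_words()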
  • Simplified and improved performance of textacy.Corpus
    • Documents are now added through a simpler API, either in Corpus.__init__ or Corpus.add(); they may be one or a stream of texts, (text, metadata) pairs, or existing spaCy Docs. When adding many documents, the spaCy language processing pipeline is used in a faster and more efficient way.
    • Saving / loading corpus data to disk is now more efficient and robust.
    • Note: Corpus is now a collection of spaCy Docs rather than textacy.Docs.
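    • A minimal sketch of the streamlined API (illustrative):

      >>> import textacy
      >>> corpus = textacy.Corpus("en", data=["First text.", "Second text."])
      >>> corpus.add(("Third text.", {"source": "example"}))
      >>> len(corpus)
      3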
  • Simplified, standardized, and added Dataset functionality
    • New: Added an IMDB dataset, built on the classic 2011 dataset commonly used to train sentiment analysis models.
    • New: Added a base Wikimedia dataset, from which a reworked Wikipedia dataset and a separate Wikinews dataset inherit. The underlying data source has changed, from XML db dumps of raw wiki markup to JSON db dumps of (relatively) clean text and metadata; now, the code is simpler, faster, and totally language-agnostic.
    • Dataset.records() now streams (text, metadata) pairs rather than a dict containing both text and metadata, so users don't need to know field names and split them into separate streams before creating Doc or Corpus objects from the data.
    • Filtering and limiting the number of texts/records produced is now clearer and more consistent between .texts() and .records() methods on a given Dataset, and more performant!
    • Downloading datasets now always shows progress bars and saves to the same file names. When appropriate, downloaded archive files' contents are automatically extracted for easy inspection.
    • Common functionality (such as validating filter values) is now standardized and consolidated in the datasets.utils module.
  • Quality of life improvements
    • Reduced load time for import textacy from ~2-3 seconds to ~1 second, by lazy-loading expensive variables, deferring a couple heavy imports, and dropping a couple dependencies. Specifically:
      • ftfy was dropped, and a NotImplementedError is now raised in textacy's wrapper function, textacy.preprocess.fix_bad_unicode(). Users with bad unicode should now directly call ftfy.fix_text().
      • ijson was dropped, and the behavior of textacy.read_json() is now simpler and consistent with other functions for line-delimited data.
      • mwparserfromhell was dropped, since the reworked Wikipedia dataset no longer requires complicated and slow parsing of wiki markup.
    • Renamed certain functions and variables for clarity, and for consistency with existing conventions:
      • textacy.load_spacy() => textacy.load_spacy_lang()
      • textacy.extract.named_entities() => textacy.extract.entities()
      • textacy.data_dir => textacy.DEFAULT_DATA_DIR
      • filename => filepath and dirname => dirpath when specifying full paths to files/dirs on disk, and textacy.io.utils.get_filenames() => textacy.io.utils.get_filepaths() accordingly
      • SpacyDoc => Doc, SpacySpan => Span, SpacyToken => Token, SpacyLang => Language as variables and in docs
      • compiled regular expressions now consistently start with RE_
    • Removed deprecated functionality
      • top-level spacy_utils.py and spacy_pipelines.py are gone; use equivalent functionality in the spacier subpackage instead
      • math_utils.py is gone; it was long neglected, and never actually used
    • Replaced textacy.compat.bytes_to_unicode() and textacy.compat.unicode_to_bytes() with textacy.compat.to_unicode() and textacy.compat.to_bytes(), which are safer and accept either binary or text strings as input.
    • Moved and renamed language detection functionality: textacy.text_utils.detect_language() => textacy.lang_utils.detect_lang(). The idea is to add more/better lang-related functionality here in the future.
    • Updated and cleaned up documentation throughout the code base.
    • Added and refactored many tests, for both new and old functionality, significantly increasing test coverage while significantly reducing run-time. Also, added a proper coverage report to CI builds. This should help prevent future errors and inspire better test-writing.
    • Bumped the minimum required spaCy version: v2.0.0 => v2.0.12, for access to their full set of custom extension functionality.

Fixed:

  • The progress bar during an HTTP download now always closes, preventing weird nesting issues if another bar is subsequently displayed.
  • Filtering datasets by multiple values performed either a logical AND or OR over the values, which was confusing; now, a logical OR is always performed.
  • The existence of files/directories on disk is now checked properly via os.path.isfile() or os.path.isdir(), rather than os.path.exists().
  • Fixed a variety of formatting errors raised by sphinx when generating HTML docs.