diff --git a/CHANGELOG.md b/CHANGELOG.md
index 9dfc5dc1..c7107e45 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,6 +1,11 @@
 ## Changelog

+### 1.3.0
+- entirely type-checked code base
+- new function `clear_caches()` (#57)
+- slightly more efficient code (about 5% faster)
+
 ### 1.2.3
 - fix for memory leak (#56)
 - docs updated

diff --git a/README.rst b/README.rst
index 86aafd59..8d4a0826 100644
--- a/README.rst
+++ b/README.rst
@@ -17,15 +17,17 @@ htmldate: find the publication date of web pages
     :target: https://codecov.io/gh/adbar/htmldate
     :alt: Code Coverage

-.. image:: https://static.pepy.tech/badge/htmldate/month
+.. image:: https://img.shields.io/pypi/dm/htmldate?color=informational
     :target: https://pepy.tech/project/htmldate
     :alt: Downloads

-|
+.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
+    :target: https://doi.org/10.21105/joss.02439
+    :alt: JOSS article reference DOI: 10.21105/joss.02439

-:Code: https://github.com/adbar/htmldate
-:Documentation: https://htmldate.readthedocs.io
-:Issue tracker: https://github.com/adbar/htmldate/issues
+.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
+    :target: https://github.com/psf/black
+    :alt: Code style: black

 |

@@ -51,8 +53,6 @@ With Python:
     >>> from htmldate import find_date
     >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
     '2016-12-23'
-    >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True)
-    '2016-06-23'

 On the command-line:

@@ -65,23 +65,24 @@ On the command-line:

 Features
 --------
-
-- Compatible with all recent versions of Python (see above)
 - Multilingual, robust and efficient (used in production on millions of documents)
 - URLs, HTML files, or HTML trees are given as input (includes batch processing)
 - Output as string in any date format (defaults to `ISO 8601 YMD `_)
 - Detection of both original and updated dates
+- Compatible with all recent versions of Python

-*htmldate* finds original and updated publication dates of web pages using heuristics on HTML code and linguistic patterns. It provides following ways to date an HTML document:
+``htmldate`` can examine markup and text. It provides the following ways to date an HTML document:

-1. **Markup in header**: Common patterns are used to identify relevant elements (e.g. ``link`` and ``meta`` elements) including `Open Graph protocol `_ attributes and a large number of CMS idiosyncrasies
-2. **HTML code**: The whole document is then searched for structural markers: ``abbr`` and ``time`` elements as well as a series of attributes (e.g. ``postmetadata``)
-3. **Bare HTML content**: A series of heuristics is run on text and markup:
+1. **Markup in header**: Common patterns are used to identify relevant elements (e.g. ``link`` and ``meta`` elements) including `Open Graph protocol `_ attributes
+2. **HTML code**: The whole document is searched for structural markers: ``abbr`` or ``time`` elements and a series of attributes (e.g. ``postmetadata``)
+3. **Bare HTML content**: Heuristics are run on text and markup:

    - in ``fast`` mode the HTML page is cleaned and precise patterns are targeted
    - in ``extensive`` mode all potential dates are collected and a disambiguation algorithm determines the best one

+Finally the output is validated and converted to the chosen format.
+

 Performance
 -----------
@@ -128,13 +129,14 @@ Author

 This effort is part of methods to derive information from web documents in order to build `text databases for research `_ (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. There are web pages for which neither the URL nor the server response provide a reliable way to find out when a document was published or modified. For more information:

-.. image:: https://joss.theoj.org/papers/10.21105/joss.02439/status.svg
+.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
     :target: https://doi.org/10.21105/joss.02439
-    :alt: JOSS article
+    :alt: JOSS article reference DOI: 10.21105/joss.02439

-.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3459599.svg
+.. image:: https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue
     :target: https://doi.org/10.5281/zenodo.3459599
-    :alt: Zenodo archive
+    :alt: Zenodo archive DOI: 10.5281/zenodo.3459599
+

 .. code-block:: shell

diff --git a/docs/conf.py b/docs/conf.py
index c7330406..ea5f47a3 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -48,6 +48,7 @@ extensions = ['sphinx.ext.autodoc', 'sphinx.ext.intersphinx',
               'sphinx.ext.napoleon', 'sphinx.ext.viewcode']
 #'sphinx.ext.autosummary',
 #autosummary_generate = True
+autodoc_typehints = 'description'

 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['_templates']
@@ -76,7 +77,10 @@
     "show_powered_by": False,
     "github_user": "adbar",
     "github_repo": "htmldate",
-    "github_banner": True,
+    "github_banner": False,
+    "github_button": True,
+    "github_count": True,
+    "github_type": "star",
     "show_related": False,
     "analytics_id": "G-5BS735G6BB",
     # "note_bg": "#FFF59C",
diff --git a/docs/index.rst b/docs/index.rst
index 34c5df92..6cd20138 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -13,37 +13,43 @@ htmldate: find the publication date of web pages
     :target: https://codecov.io/gh/adbar/htmldate
     :alt: Code Coverage

-.. image:: https://static.pepy.tech/badge/htmldate/month
+.. image:: https://img.shields.io/pypi/dm/htmldate?color=informational
     :target: https://pepy.tech/project/htmldate
     :alt: Downloads

+.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
+    :target: https://doi.org/10.21105/joss.02439
+    :alt: JOSS article reference DOI: 10.21105/joss.02439
+
+.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
+    :target: https://github.com/psf/black
+    :alt: Code style: black
+
 |

-:Code: https://github.com/adbar/htmldate
-:Documentation: https://htmldate.readthedocs.io
-:Issue tracker: https://github.com/adbar/htmldate/issues
+Find original and updated publication dates of any web page. From the command-line or within Python, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included.
+
+
+In a nutshell
+-------------

 |

 .. image:: htmldate-demo.gif
     :alt: Demo as GIF image
     :align: center
-    :width: 80%
+    :width: 95%
     :target: https://htmldate.readthedocs.org/

 |

-Find original and updated publication dates of any web page. From the command-line or within Python, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included.
-
-In a nutshell, with Python:
+With Python:

 .. code-block:: python

     >>> from htmldate import find_date
     >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
     '2016-12-23'
-    >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True)
-    '2016-06-23'

 On the command-line:

@@ -52,34 +58,27 @@ On the command-line:
     $ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
     '2016-12-23'

-|
-
-.. contents:: **Contents**
-    :backlinks: none
-
-|


 Features
 --------
-
-- Compatible with all recent versions of Python (see above)
 - Multilingual, robust and efficient (used in production on millions of documents)
 - URLs, HTML files, or HTML trees are given as input (includes batch processing)
 - Output as string in any date format (defaults to `ISO 8601 YMD `_)
 - Detection of both original and updated dates
+- Compatible with all recent versions of Python

-*htmldate* finds original and updated publication dates of web pages using heuristics on HTML code and linguistic patterns. It provides the following ways to date an HTML document:
+``htmldate`` can examine markup and text. It provides the following ways to date an HTML document:

 1. **Markup in header**: Common patterns are used to identify relevant elements (e.g. ``link`` and ``meta`` elements) including `Open Graph protocol `_ attributes and a large number of CMS idiosyncrasies
-2. **HTML code**: The whole document is then searched for structural markers: ``abbr`` and ``time`` elements as well as a series of attributes (e.g. ``postmetadata``)
-3. **Bare HTML content**: A series of heuristics is run on text and markup:
+2. **HTML code**: The whole document is then searched for structural markers: ``abbr`` or ``time`` elements and a series of attributes (e.g. ``postmetadata``)
+3. **Bare HTML content**: Heuristics are run on text and markup:

    - in ``fast`` mode the HTML page is cleaned and precise patterns are targeted
-   - in ``extensive`` mode all potential dates are collected and a disambiguation algorithm determines the best one
+   - in ``extensive`` mode all potential dates are collected and a disambiguation algorithm determines the most probable one

-The output is thouroughly verified in terms of plausibility and adequateness and the library outputs a date string, corresponding to either the last update or the original publishing statement (the default), in the desired format (defaults to `ISO 8601 YMD format `_).
+The output is thoroughly verified in terms of plausibility and adequateness. If a valid date has been found the library outputs a date string corresponding to either the last update or the original publishing statement (the default), in the desired format.

 Markup-based extraction is multilingual by nature, text-based refinements for better coverage currently support German, English and Turkish.

@@ -87,6 +86,9 @@ Markup-based extraction is multilingual by nature, text-based refinements for be
 Installation
 ------------

+Main package
+~~~~~~~~~~~~
+
 This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.6 upwards. It is available on the package repository `PyPI `_ and can notably be installed with ``pip`` or ``pipenv``:

 .. code-block:: bash
@@ -95,6 +97,10 @@ This Python package is tested on Linux, macOS and Windows systems; it is compati
     $ pip install --upgrade htmldate # to make sure you have the latest version
     $ pip install git+https://github.com/adbar/htmldate.git # latest available code (see build status above)

+
+Optional
+~~~~~~~~
+
 The additional library ``cchardet`` can be installed for better execution speed. They may not work on all platforms and have thus been singled out although installation is recommended:

 .. code-block:: bash
@@ -106,10 +112,21 @@ You can also install or update the packages separately, *htmldate* will detect w

 *For infos on dependency management of Python packages see* `this discussion thread `_.

+Experimental
+~~~~~~~~~~~~
+
+Experimental compilation with ``mypyc``, as using pre-compiled library may shorten processing speed:
+
+1. Install ``mypy``: ``pip3 install mypy``
+2. Compile the package: ``python setup.py --use-mypyc bdist_wheel``
+3. Use the newly created wheel: ``pip3 install dist/...``
+
+
 With Python
 -----------

-All the functions of the module are currently bundled in *htmldate*.
+``find_date``
+~~~~~~~~~~~~~

 In case the web page features easily readable metadata in the header, the extraction is straightforward. A more advanced analysis of the document structure is sometimes needed:

@@ -117,8 +134,8 @@ In case the web page features easily readable metadata in the header, the extrac

     >>> from htmldate import find_date
     >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
-    '# DEBUG analyzing: Friday, December 23, 2016'
-    '# DEBUG result: 2016-12-23'
+    # DEBUG analyzing: Friday, December 23, 2016
+    # DEBUG result: 2016-12-23
     '2016-12-23'

 ``htmldate`` can resort to a guess based on a complete screening of the document (``extensive_search`` parameter) which can be deactivated:
@@ -144,14 +161,20 @@ Already parsed HTML (that is a LXML tree object):
     >>> find_date(mytree)
     '2016-07-12'

+
+Output format
+~~~~~~~~~~~~~
+
 Change the output to a format known to Python's ``datetime`` module, the default being ``%Y-%m-%d``:

 .. code-block:: python

     >>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
     '18 November 2016' # may have changed since
-    >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html', outputformat='%Y-%m-%dT%H:%M:%S%z')
-    '2016-12-23T05:11:00-0500'
+
+
+Original vs. updated dates
+~~~~~~~~~~~~~~~~~~~~~~~~~~

 Although the time delta between original publication and "last modified" info is usually a matter of hours or days, it can be useful to prioritize the **original publication date**:

@@ -216,13 +239,14 @@ Author

 This effort is part of methods to derive information from web documents in order to build `text databases for research `_ (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. There are web pages for which neither the URL nor the server response provide a reliable way to find out when a document was published or modified. For more information:

-.. image:: https://joss.theoj.org/papers/10.21105/joss.02439/status.svg
+.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
     :target: https://doi.org/10.21105/joss.02439
-    :alt: JOSS article
+    :alt: JOSS article reference DOI: 10.21105/joss.02439

-.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3459599.svg
+.. image:: https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue
     :target: https://doi.org/10.5281/zenodo.3459599
-    :alt: Zenodo archive
+    :alt: Zenodo archive DOI: 10.5281/zenodo.3459599
+

 .. code-block:: shell

@@ -276,15 +300,6 @@ Besides, there are pages for which no date can be found, ever:

 If the date is nowhere to be found, it might be worth considering `carbon dating `_ the web page, however this is computationally expensive. In addition, `datefinder `_ features pattern-based date extraction for texts written in English.

-Tests
-~~~~~
-
-A series of webpages triggering different structural and content patterns is included for testing purposes:
-
-.. code-block:: bash
-
-    $ pytest tests/unit_tests.py
-
 .. toctree::
    :maxdepth: 2

diff --git a/docs/options.rst b/docs/options.rst
index e5c99aaf..36440344 100644
--- a/docs/options.rst
+++ b/docs/options.rst
@@ -1,10 +1,6 @@
 Options
 =======

-.. contents:: **Contents**
-    :backlinks: none
-
-
 Configuration
 -------------

@@ -45,12 +41,14 @@ An external module can be used for download, as described in versions anterior t
 Date format
 ~~~~~~~~~~~

-The output format of the dates found can be set in a format known to Python's ``datetime`` module, the default being ``%Y-%m-%d``:
+Change the output to a format known to Python's ``datetime`` module, the default being ``%Y-%m-%d``:

 .. code-block:: python

     >>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
     '18 November 2016' # may have changed since
+    >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html', outputformat='%Y-%m-%dT%H:%M:%S%z')
+    '2016-12-23T05:11:00-0500'

 .. autofunction:: htmldate.validators.output_format_validator

@@ -80,3 +78,26 @@ See ``settings.py`` file:
     :undoc-members:

 The module can then be re-compiled locally to apply changes to the settings.
+
+
+Clearing caches
+~~~~~~~~~~~~~~~
+
+.. code-block:: python
+
+    >>> from htmldate.meta import reset_caches
+    # at a given point in time
+    >>> reset_caches()
+
+*New in version 1.3.0.*
+
+
+Tests
+-----
+
+A series of HTML pages and patterns triggering different structural and content patterns is included for testing purposes:
+
+.. code-block:: bash
+
+    $ python3 -m pip install pytest
+    $ pytest
diff --git a/docs/requirements.txt b/docs/requirements.txt
index 10adc74c..2ecd1992 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -1,4 +1,4 @@
 # version required
-sphinx>=5.0.1
+sphinx>=5.0.2
 # without version specifier
 htmldate
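
The ``outputformat`` values in the "Date format" hunk are described as formats "known to Python's ``datetime`` module", which suggests ordinary ``strftime`` patterns. The following is a minimal standard-library sketch, independent of htmldate itself, of how two of the documented patterns render. The ``datetime`` value is a hypothetical stand-in built by hand and is not output of ``find_date``:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in value, constructed by hand for illustration only;
# it is not produced by htmldate.
utc_minus_five = timezone(timedelta(hours=-5))
published = datetime(2016, 12, 23, 5, 11, 0, tzinfo=utc_minus_five)

# Default pattern named in the docs: year-month-day.
default_format = published.strftime('%Y-%m-%d')

# Full timestamp with UTC offset, as in the second outputformat example.
full_format = published.strftime('%Y-%m-%dT%H:%M:%S%z')

print(default_format)
print(full_format)
```

Other ``strftime`` directives should behave the same way when passed as ``outputformat``, subject to whatever checks the library's ``output_format_validator`` applies.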