docs: update before release 1.3.0
adbar committed Jul 20, 2022
1 parent 38d57d4 commit fce6e86
Showing 6 changed files with 112 additions and 65 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,11 @@
## Changelog


### 1.3.0
- entirely type-checked code base
- new function `reset_caches()` (#57)
- slightly more efficient code (about 5% faster)

### 1.2.3
- fix for memory leak (#56)
- docs updated
Expand Down
36 changes: 19 additions & 17 deletions README.rst
Expand Up @@ -17,15 +17,17 @@ htmldate: find the publication date of web pages
:target: https://codecov.io/gh/adbar/htmldate
:alt: Code Coverage

.. image:: https://static.pepy.tech/badge/htmldate/month
.. image:: https://img.shields.io/pypi/dm/htmldate?color=informational
:target: https://pepy.tech/project/htmldate
:alt: Downloads

|
.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
:target: https://doi.org/10.21105/joss.02439
:alt: JOSS article reference DOI: 10.21105/joss.02439

:Code: https://github.com/adbar/htmldate
:Documentation: https://htmldate.readthedocs.io
:Issue tracker: https://github.com/adbar/htmldate/issues
.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
:target: https://github.com/psf/black
:alt: Code style: black

|
Expand All @@ -51,8 +53,6 @@ With Python:
>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
>>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True)
'2016-06-23'
On the command-line:

Expand All @@ -65,23 +65,24 @@ On the command-line:
Features
--------


- Compatible with all recent versions of Python (see above)
- Multilingual, robust and efficient (used in production on millions of documents)
- URLs, HTML files, or HTML trees are given as input (includes batch processing)
- Output as string in any date format (defaults to `ISO 8601 YMD <https://en.wikipedia.org/wiki/ISO_8601>`_)
- Detection of both original and updated dates
- Compatible with all recent versions of Python


*htmldate* finds original and updated publication dates of web pages using heuristics on HTML code and linguistic patterns. It provides following ways to date an HTML document:
``htmldate`` can examine markup and text. It provides the following ways to date an HTML document:

1. **Markup in header**: Common patterns are used to identify relevant elements (e.g. ``link`` and ``meta`` elements) including `Open Graph protocol <http://ogp.me/>`_ attributes and a large number of CMS idiosyncrasies
2. **HTML code**: The whole document is then searched for structural markers: ``abbr`` and ``time`` elements as well as a series of attributes (e.g. ``postmetadata``)
3. **Bare HTML content**: A series of heuristics is run on text and markup:
1. **Markup in header**: Common patterns are used to identify relevant elements (e.g. ``link`` and ``meta`` elements) including `Open Graph protocol <http://ogp.me/>`_ attributes
2. **HTML code**: The whole document is searched for structural markers: ``abbr`` or ``time`` elements and a series of attributes (e.g. ``postmetadata``)
3. **Bare HTML content**: Heuristics are run on text and markup:

- in ``fast`` mode the HTML page is cleaned and precise patterns are targeted
- in ``extensive`` mode all potential dates are collected and a disambiguation algorithm determines the best one

Finally the output is validated and converted to the chosen format.


Performance
-----------
Expand Down Expand Up @@ -128,13 +129,14 @@ Author

This effort is part of methods to derive information from web documents in order to build `text databases for research <https://www.dwds.de/d/k-web>`_ (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. There are web pages for which neither the URL nor the server response provide a reliable way to find out when a document was published or modified. For more information:

.. image:: https://joss.theoj.org/papers/10.21105/joss.02439/status.svg
.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
:target: https://doi.org/10.21105/joss.02439
:alt: JOSS article
:alt: JOSS article reference DOI: 10.21105/joss.02439

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3459599.svg
.. image:: https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue
:target: https://doi.org/10.5281/zenodo.3459599
:alt: Zenodo archive
:alt: Zenodo archive DOI: 10.5281/zenodo.3459599


.. code-block:: shell
Expand Down
6 changes: 5 additions & 1 deletion docs/conf.py
Expand Up @@ -48,6 +48,7 @@
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.intersphinx', 'sphinx.ext.napoleon', 'sphinx.ext.viewcode']
#'sphinx.ext.autosummary',
#autosummary_generate = True
autodoc_typehints = 'description'

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
Expand Down Expand Up @@ -76,7 +77,10 @@
"show_powered_by": False,
"github_user": "adbar",
"github_repo": "htmldate",
"github_banner": True,
"github_banner": False,
"github_button": True,
"github_count": True,
"github_type": "star",
"show_related": False,
"analytics_id": "G-5BS735G6BB",
# "note_bg": "#FFF59C",
Expand Down
97 changes: 56 additions & 41 deletions docs/index.rst
Expand Up @@ -13,37 +13,43 @@ htmldate: find the publication date of web pages
:target: https://codecov.io/gh/adbar/htmldate
:alt: Code Coverage

.. image:: https://static.pepy.tech/badge/htmldate/month
.. image:: https://img.shields.io/pypi/dm/htmldate?color=informational
:target: https://pepy.tech/project/htmldate
:alt: Downloads

.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
:target: https://doi.org/10.21105/joss.02439
:alt: JOSS article reference DOI: 10.21105/joss.02439

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
:target: https://github.com/psf/black
:alt: Code style: black

|
:Code: https://github.com/adbar/htmldate
:Documentation: https://htmldate.readthedocs.io
:Issue tracker: https://github.com/adbar/htmldate/issues
Find original and updated publication dates of any web page. From the command-line or within Python, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included.


In a nutshell
-------------

|
.. image:: htmldate-demo.gif
:alt: Demo as GIF image
:align: center
:width: 80%
:width: 95%
:target: https://htmldate.readthedocs.org/

|
Find original and updated publication dates of any web page. From the command-line or within Python, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included.

In a nutshell, with Python:
With Python:

.. code-block:: python
>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
>>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True)
'2016-06-23'
On the command-line:

Expand All @@ -52,41 +58,37 @@ On the command-line:
$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'
|
.. contents:: **Contents**
:backlinks: none

|
Features
--------


- Compatible with all recent versions of Python (see above)
- Multilingual, robust and efficient (used in production on millions of documents)
- URLs, HTML files, or HTML trees are given as input (includes batch processing)
- Output as string in any date format (defaults to `ISO 8601 YMD <https://en.wikipedia.org/wiki/ISO_8601>`_)
- Detection of both original and updated dates
- Compatible with all recent versions of Python


*htmldate* finds original and updated publication dates of web pages using heuristics on HTML code and linguistic patterns. It provides the following ways to date an HTML document:
``htmldate`` can examine markup and text. It provides the following ways to date an HTML document:

1. **Markup in header**: Common patterns are used to identify relevant elements (e.g. ``link`` and ``meta`` elements) including `Open Graph protocol <http://ogp.me/>`_ attributes and a large number of CMS idiosyncrasies
2. **HTML code**: The whole document is then searched for structural markers: ``abbr`` and ``time`` elements as well as a series of attributes (e.g. ``postmetadata``)
3. **Bare HTML content**: A series of heuristics is run on text and markup:
2. **HTML code**: The whole document is then searched for structural markers: ``abbr`` or ``time`` elements and a series of attributes (e.g. ``postmetadata``)
3. **Bare HTML content**: Heuristics are run on text and markup:

- in ``fast`` mode the HTML page is cleaned and precise patterns are targeted
- in ``extensive`` mode all potential dates are collected and a disambiguation algorithm determines the best one
- in ``extensive`` mode all potential dates are collected and a disambiguation algorithm determines the most probable one

The output is thouroughly verified in terms of plausibility and adequateness and the library outputs a date string, corresponding to either the last update or the original publishing statement (the default), in the desired format (defaults to `ISO 8601 YMD format <https://en.wikipedia.org/wiki/ISO_8601>`_).
The output is thoroughly verified in terms of plausibility and adequateness. If a valid date has been found the library outputs a date string corresponding to either the last update or the original publishing statement (the default), in the desired format.

Markup-based extraction is multilingual by nature; text-based refinements for better coverage currently support German, English and Turkish.
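
The first step listed above (markup in the header) can be illustrated with a minimal sketch that relies only on the standard library. It is not the code used by ``htmldate``: the names ``MetaDateParser``, ``header_date`` and ``DATE_ATTRS`` are hypothetical, and the attribute list is a small illustrative subset rather than the patterns the library actually checks.

```python
from datetime import datetime
from html.parser import HTMLParser

# Illustrative subset of meta attributes (an assumption, not htmldate's list)
DATE_ATTRS = {"article:published_time", "og:published_time", "date"}


class MetaDateParser(HTMLParser):
    """Collect the content of ``meta`` elements that may carry a date."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        key = attr_map.get("property") or attr_map.get("name")
        content = attr_map.get("content")
        if key in DATE_ATTRS and content:
            self.candidates.append(content)


def header_date(html, outputformat="%Y-%m-%d"):
    """Return the first parseable header date, or None if there is none."""
    parser = MetaDateParser()
    parser.feed(html)
    for value in parser.candidates:
        try:
            # Validate the candidate and convert it to the chosen format
            return datetime.fromisoformat(value).strftime(outputformat)
        except ValueError:
            continue
    return None


sample = (
    '<html><head><meta property="article:published_time" '
    'content="2016-12-23T05:11:00-05:00"></head><body></body></html>'
)
print(header_date(sample))
```

When no candidate is found or none can be parsed, the sketch returns ``None``, which mirrors the idea that a result is only output if a valid date has been found.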


Installation
------------

Main package
~~~~~~~~~~~~

This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.6 upwards. It is available on the package repository `PyPI <https://pypi.org/>`_ and can notably be installed with ``pip`` or ``pipenv``:

.. code-block:: bash
Expand All @@ -95,6 +97,10 @@ This Python package is tested on Linux, macOS and Windows systems; it is compati
$ pip install --upgrade htmldate # to make sure you have the latest version
$ pip install git+https://github.com/adbar/htmldate.git # latest available code (see build status above)
Optional
~~~~~~~~

The additional library ``cchardet`` can be installed for better execution speed. It may not work on all platforms and has thus been singled out, although installation is recommended:

.. code-block:: bash
Expand All @@ -106,19 +112,30 @@ You can also install or update the packages separately, *htmldate* will detect w
*For information on dependency management of Python packages see* `this discussion thread <https://stackoverflow.com/questions/41573587/what-is-the-difference-between-venv-pyvenv-pyenv-virtualenv-virtualenvwrappe>`_.


Experimental
~~~~~~~~~~~~

Experimental compilation with ``mypyc``, as using a pre-compiled library may shorten processing time:

1. Install ``mypy``: ``pip3 install mypy``
2. Compile the package: ``python setup.py --use-mypyc bdist_wheel``
3. Use the newly created wheel: ``pip3 install dist/...``


With Python
-----------

All the functions of the module are currently bundled in *htmldate*.
``find_date``
~~~~~~~~~~~~~

In case the web page features easily readable metadata in the header, the extraction is straightforward. A more advanced analysis of the document structure is sometimes needed:

.. code-block:: python
>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'# DEBUG analyzing: <h2 class="date-header"><span>Friday, December 23, 2016</span></h2>'
'# DEBUG result: 2016-12-23'
# DEBUG analyzing: <h2 class="date-header"><span>Friday, December 23, 2016</span></h2>
# DEBUG result: 2016-12-23
'2016-12-23'
``htmldate`` can resort to a guess based on a complete screening of the document (``extensive_search`` parameter) which can be deactivated:
Expand All @@ -144,14 +161,20 @@ Already parsed HTML (that is a LXML tree object):
>>> find_date(mytree)
'2016-07-12'
Output format
~~~~~~~~~~~~~

Change the output to a format known to Python's ``datetime`` module, the default being ``%Y-%m-%d``:

.. code-block:: python
>>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
'18 November 2016' # may have changed since
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html', outputformat='%Y-%m-%dT%H:%M:%S%z')
'2016-12-23T05:11:00-0500'
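
The format codes accepted by ``outputformat`` are those of ``datetime.strftime``, so they can be tried out without downloading anything. The sketch below uses only the standard library; ``is_valid_format`` is a hypothetical helper written for illustration and is not the library's ``output_format_validator``.

```python
from datetime import datetime, timedelta, timezone


def is_valid_format(outputformat):
    """Return True if the value looks usable as a strftime pattern."""
    if not isinstance(outputformat, str) or "%" not in outputformat:
        return False
    try:
        datetime(2017, 9, 1).strftime(outputformat)
    except (ValueError, TypeError):
        return False
    return True


# A timezone-aware timestamp, five hours behind UTC
published = datetime(2016, 12, 23, 5, 11, tzinfo=timezone(timedelta(hours=-5)))

iso_day = published.strftime("%Y-%m-%d")                # default-style output
full_stamp = published.strftime("%Y-%m-%dT%H:%M:%S%z")  # with time and UTC offset

print(iso_day)
print(full_stamp)
print(is_valid_format("%d %B %Y"), is_valid_format("ABC"))
```

Codes such as ``%B`` depend on the active locale, so the month name may differ between systems.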
Original vs. updated dates
~~~~~~~~~~~~~~~~~~~~~~~~~~

Although the time delta between original publication and "last modified" info is usually a matter of hours or days, it can be useful to prioritize the **original publication date**:

Expand Down Expand Up @@ -216,13 +239,14 @@ Author

This effort is part of methods to derive information from web documents in order to build `text databases for research <https://www.dwds.de/d/k-web>`_ (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. There are web pages for which neither the URL nor the server response provide a reliable way to find out when a document was published or modified. For more information:

.. image:: https://joss.theoj.org/papers/10.21105/joss.02439/status.svg
.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
:target: https://doi.org/10.21105/joss.02439
:alt: JOSS article
:alt: JOSS article reference DOI: 10.21105/joss.02439

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3459599.svg
.. image:: https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue
:target: https://doi.org/10.5281/zenodo.3459599
:alt: Zenodo archive
:alt: Zenodo archive DOI: 10.5281/zenodo.3459599


.. code-block:: shell
Expand Down Expand Up @@ -276,15 +300,6 @@ Besides, there are pages for which no date can be found, ever:
If the date is nowhere to be found, it might be worth considering `carbon dating <https://github.com/oduwsdl/CarbonDate>`_ the web page, however this is computationally expensive. In addition, `datefinder <https://github.com/akoumjian/datefinder>`_ features pattern-based date extraction for texts written in English.

Tests
~~~~~

A series of webpages triggering different structural and content patterns is included for testing purposes:

.. code-block:: bash
$ pytest tests/unit_tests.py

.. toctree::
:maxdepth: 2
Expand Down
31 changes: 26 additions & 5 deletions docs/options.rst
@@ -1,10 +1,6 @@
Options
=======

.. contents:: **Contents**
:backlinks: none


Configuration
-------------

Expand Down Expand Up @@ -45,12 +41,14 @@ An external module can be used for download, as described in versions anterior t
Date format
~~~~~~~~~~~

The output format of the dates found can be set in a format known to Python's ``datetime`` module, the default being ``%Y-%m-%d``:
Change the output to a format known to Python's ``datetime`` module, the default being ``%Y-%m-%d``:

.. code-block:: python
>>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
'18 November 2016' # may have changed since
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html', outputformat='%Y-%m-%dT%H:%M:%S%z')
'2016-12-23T05:11:00-0500'
.. autofunction:: htmldate.validators.output_format_validator
Expand Down Expand Up @@ -80,3 +78,26 @@ See ``settings.py`` file:
:undoc-members:

The module can then be re-compiled locally to apply changes to the settings.


Clearing caches
~~~~~~~~~~~~~~~

.. code-block:: python
>>> from htmldate.meta import reset_caches
# at a given point in time
>>> reset_caches()
*New in version 1.3.0.*
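
Why resetting caches can free memory in long-running processes is easier to see on a small example. The sketch below assumes memoization with ``functools.lru_cache`` and is not taken from the ``htmldate`` code base: ``normalize`` and ``reset_local_caches`` are hypothetical names used for illustration only.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def normalize(text):
    """Collapse whitespace and lowercase a string (results are memoized)."""
    return " ".join(text.split()).lower()


def reset_local_caches():
    """Drop all memoized results so that the memory can be reclaimed."""
    normalize.cache_clear()


normalize("Friday,  December 23, 2016")
normalize("Friday,  December 23, 2016")  # second call is served from the cache
before = normalize.cache_info()

reset_local_caches()
after = normalize.cache_info()

print(before.hits, before.currsize)
print(after.hits, after.currsize)
```

In a batch job, such a reset would typically be called between large groups of documents rather than after every page, since an empty cache has to be filled again.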


Tests
-----

A series of HTML pages triggering different structural and content patterns is included for testing purposes:

.. code-block:: bash
$ python3 -m pip install pytest
$ pytest
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,4 +1,4 @@
# version required
sphinx>=5.0.1
sphinx>=5.0.2
# without version specifier
htmldate
