Test htmldate on further web pages and report bugs #8

adbar · 2020-01-03T16:01:46Z

I have mostly tested htmldate on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction of a date doesn't work so far.

Please install the dateparser library beforehand as it significantly extends linguistic coverage: pip (or pip3) install -U dateparser, or pip install -U htmldate[all].

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).

Thanks!

adbar · 2021-09-16T14:07:18Z

rahulbot · 2022-07-21T02:13:11Z

wrong date found: https://web.archive.org/web/20220721013749/https://www.ksal.com/high-wheat-quality-expected-despite-yield-drop/ (it is missing the string "June 15, 2022" in an <h5> → <span> and instead picking "July 20, 2022" from a footer)
Russian language date missed: https://web.archive.org/web/20220721014119/https://www.inopressa.ru/article/09Mar2017/welt/deutschland.html (it is missing the Russian language date "9 марта 2017")
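
For reference, a minimal sketch of how a Russian date like the one above could be parsed with the standard library alone. The month table and the `parse_russian_date` helper are hypothetical illustrations, not part of htmldate.

```python
import re
from datetime import date

# Genitive month names as they appear in Russian dates ("9 марта 2017").
RU_MONTHS = {
    "января": 1, "февраля": 2, "марта": 3, "апреля": 4,
    "мая": 5, "июня": 6, "июля": 7, "августа": 8,
    "сентября": 9, "октября": 10, "ноября": 11, "декабря": 12,
}

# day (1-2 digits), month word in Cyrillic, four-digit year
RU_DATE = re.compile(r"\b(\d{1,2})\s+([а-яё]+)\s+(\d{4})\b", re.IGNORECASE)


def parse_russian_date(text):
    """Return the first 'day month year' date found in text, or None."""
    for day, month, year in RU_DATE.findall(text):
        number = RU_MONTHS.get(month.lower())
        if number is None:
            continue  # the word between the numbers is not a month name
        try:
            return date(int(year), number, int(day))
        except ValueError:
            continue  # e.g. day 31 in a 30-day month
    return None


print(parse_russian_date("9 марта 2017"))
```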

adbar · 2022-07-21T12:51:04Z

The first example is especially tricky: the date in the right column is tagged as a proper date in the HTML, whereas the date in the main content isn't.

kinoute · 2022-08-03T13:40:19Z

URL: https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083 (from 1991)
Code:

import requests
from htmldate import find_date

# Download the page
resp = requests.get('https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083')

# Decode the body (dropping undecodable bytes) and search for the date
find_date(
  resp.content.decode(errors='ignore'),
  extensive_search=True,
  outputformat='%Y-%m-%d %H:%M:%S',
)

Result: 2022-07-26 00:00:00

But in the HTML source code there is a meta entry with the correct date:

<meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00"/>

I thought htmldate would look at this first, or am I missing something?
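
To double-check the value in that meta tag independently of htmldate, here is a small standard-library sketch. The `MetaDateParser` class is a hypothetical helper written for this comment, not something htmldate provides.

```python
from datetime import datetime
from html.parser import HTMLParser


class MetaDateParser(HTMLParser):
    """Collect the article:published_time value from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are routed here as well.
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "article:published_time":
            content = attributes.get("content")
            if content:
                self.published = datetime.fromisoformat(content)


snippet = '<meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00"/>'
parser = MetaDateParser()
parser.feed(snippet)
print(parser.published.date().isoformat())
```

The timestamp parses cleanly as an ISO 8601 value, so the markup itself looks fine.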

adbar · 2022-08-03T15:54:18Z

Hi @kinoute, htmldate considers that the date 1991-01-02 isn't valid. You can try to set the parameter min_date in find_date() to change this, e.g. min_date="1990-01-01".

kinoute · 2022-08-03T16:41:19Z

@adbar It still doesn't work with your min_date.

Here is the debugging without the min_date:

DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'

With min_date at "1990-01-01":

DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.extractors:custom parse test: 1991-01-02T01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 00:00:00
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.validators:date not valid: 1991-01-02
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:custom parse test: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:send to external parser: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.extractors:custom parse test: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.extractors:send to external parser: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
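
The `DEBUG:htmldate.core:` lines follow the default format of Python's `logging` module. A minimal sketch of how such output can be switched on before calling `find_date()`, assuming the loggers are named after the modules as the prefixes suggest:

```python
import logging

# The default basicConfig format is "LEVEL:logger name:message",
# which matches the lines shown above.
logging.basicConfig(level=logging.WARNING)

# Keep other libraries quiet and enable debug output for htmldate only.
# Child loggers such as "htmldate.core" inherit this level.
logging.getLogger("htmldate").setLevel(logging.DEBUG)
```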

adbar · 2022-08-04T10:35:48Z

@kinoute Thanks for pointing that out, it's a bug.

dideler · 2023-08-22T12:05:06Z

htmldate==1.2.3 used in https://github.com/ofou/graham-essays is incorrectly extracting dates. See output. The essays have MONTH YEAR below the title but that's not being picked up. Example: http://www.paulgraham.com/greatwork.html

In a fork I tried updating to the latest version and it has the same issue.

adbar · 2023-08-30T15:37:53Z

@dideler Thanks, the year is detected correctly but not the month which is contained in a <font> tag. I'll see what I can do.
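
As an illustration of the missing piece, a regular-expression sketch that picks up an English "Month Year" string. The pattern and the `find_month_year` helper are assumptions made for this example, not htmldate's extraction logic, and the sample markup is invented.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Full month name followed by a four-digit year starting with 19 or 20
MONTH_YEAR = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+((?:19|20)\d{2})\b")


def find_month_year(text):
    """Return the first 'Month Year' match as 'YYYY-MM', or None."""
    match = MONTH_YEAR.search(text)
    if match is None:
        return None
    month = MONTHS.index(match.group(1)) + 1
    return f"{match.group(2)}-{month:02d}"


# Invented sample: a date wrapped in a <font> tag
print(find_month_year('<font size="2">July 2023<br>Some essay text</font>'))
```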

stevesong · 2023-11-21T16:13:53Z

Thank you for this wonderful tool! It would be great to see this news source added.

Capacity Media e.g. https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity

    <div class="ArticlePage-datePublished">
            February 13, 2023 11:42 AM
    </div>
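
For what it's worth, the string in that element is in a format the standard library can parse directly. A quick sketch, assuming an English (or C) locale for the month name and the AM/PM marker:

```python
from datetime import datetime

# Text content of the element, including the surrounding whitespace
raw = """
            February 13, 2023 11:42 AM
    """

# %B = full month name, %I = 12-hour clock, %p = AM/PM
parsed = datetime.strptime(raw.strip(), "%B %d, %Y %I:%M %p")
print(parsed.date().isoformat())
```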

adbar · 2024-01-16T17:04:09Z

@stevesong It already works:

$ htmldate -u "https://web.archive.org/web/20240111084001/https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity"
2023-02-13

stevesong · 2024-01-16T17:35:15Z

Wow, thanks! I must have been using an older version. Passing URLs through archive.org appears to have a normalising effect on some websites, in that htmldate works on the archive.org versions but not the originals?

adbar · 2024-01-17T14:50:40Z

It's not supposed to normalize anything, I'm just using archived versions to be able to replicate issues at some point in the future.

stevesong · 2024-01-17T15:01:32Z

Ok, understood, but there does appear to be something interesting happening there.

$ htmldate -u https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/
# ERROR no valid result for url: https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/

$ htmldate -u https://web.archive.org/web/https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/
2023-01-20

adbar · 2024-01-17T16:36:16Z

I guess it's because the download fails: some websites restrict access to the download utility, see
https://trafilatura.readthedocs.io/en/latest/troubleshooting.html
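
One way to check this is to download the page separately and pass the HTML string to `find_date()`, as in the earlier snippet in this thread. A sketch using only the standard library; the `build_request` and `fetch_html` helpers and the User-Agent value are made up for illustration, and a site may still refuse the request.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0 (compatible; date-check/0.1)"):
    """Prepare a GET request that announces an explicit User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=30):
    """Download a page and decode it with the charset the server declares."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (needs network access and htmldate installed):
# from htmldate import find_date
# html = fetch_html("https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/")
# print(find_date(html))
```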

adbar added the "good first issue" and "up for grabs" labels Jan 3, 2020

rahulbot mentioned this issue Jul 7, 2021: Add new test cases including more global stories #29 (Merged)

adbar pinned this issue Sep 16, 2021

adbar mentioned this issue Jan 4, 2022: List of smaller extraction bugs (text & metadata) adbar/trafilatura#4 (Open)

adbar mentioned this issue Aug 4, 2022: Parsing fails for older dates #62 (Closed)
