Parsing fails for older dates #62

Closed
adbar opened this issue Aug 4, 2022 · 0 comments
Labels: bug (Something isn't working)

adbar (Owner) commented Aug 4, 2022

By default, dates before 1995 are considered implausible. However, changing the minimum date does not fix the issue.

CLI:

htmldate -u "https://web.archive.org/web/20201205182452/https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083" -vv -min "1990-01-01"

Python:

Here is the debug output without min_date:

DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'

With min_date set to "1990-01-01":

DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.extractors:custom parse test: 1991-01-02T01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 00:00:00
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.validators:date not valid: 1991-01-02
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:custom parse test: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:send to external parser: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.extractors:custom parse test: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.extractors:send to external parser: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
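
In both runs the 1991 candidates are rejected and the result falls back to 2022-07-26, even when the lower bound is 1990-01-01. For comparison, here is a minimal standalone sketch of the check one would expect a minimum-date setting to perform. It is not htmldate's actual validation code: `is_plausible`, `DEFAULT_MIN_DATE`, and the `today` parameter are hypothetical names introduced for illustration, and only the Python standard library is used.

```python
from datetime import date, datetime
from typing import Optional

# Assumption: mirrors the default lower bound mentioned in the report (1995).
DEFAULT_MIN_DATE = date(1995, 1, 1)


def is_plausible(candidate: str,
                 min_date: Optional[str] = None,
                 today: Optional[date] = None) -> bool:
    """Return True if the ISO 8601 candidate lies within [lower bound, today].

    Hypothetical helper, not part of htmldate.
    """
    # Use the caller's bound when given, otherwise the assumed default.
    lower = date.fromisoformat(min_date) if min_date else DEFAULT_MIN_DATE
    upper = today or date.today()
    # Drop the time and UTC offset; only the calendar date is compared.
    parsed = datetime.fromisoformat(candidate).date()
    return lower <= parsed <= upper


# Value taken from the article:published_time meta tag in the logs above.
published = "1991-01-02T01:01:00+01:00"

print(is_plausible(published))                         # False
print(is_plausible(published, min_date="1990-01-01"))  # True
```

With the bound lowered to 1990-01-01, the date from the meta tag passes this check, which is the behaviour the report expects from the CLI and Python calls.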

Bug originally posted by @kinoute in #8 (comment)

adbar added the bug label on Aug 4, 2022
adbar closed this as completed on Aug 25, 2022