Test htmldate on further web pages and report bugs #8
Comments
|
The first example is especially tricky: the date in the right column is tagged as a proper date in the HTML, whereas the date in the main content isn't. |
from htmldate import find_date
import requests

resp = requests.get('https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083')
find_date(
    resp.content.decode(errors='ignore'),
    extensive_search=True,
    outputformat='%Y-%m-%d %H:%M:%S',
)
But in the HTML source code there is a <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00"/> I thought |
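For reference, the value carried by that meta tag can be read with the standard library alone. The sketch below is not how htmldate works internally; it only shows what a correct result would look like with the output format used above, assuming the tag appears exactly as quoted. The names `PublishedTimeParser` and `published_time` are made up for this illustration.

```python
from datetime import datetime
from html.parser import HTMLParser


class PublishedTimeParser(HTMLParser):
    """Collects the content of <meta property="article:published_time">."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "article:published_time":
            self.published = attributes.get("content")


def published_time(html, outputformat="%Y-%m-%d %H:%M:%S"):
    """Return the formatted published time, or None if the tag is absent."""
    parser = PublishedTimeParser()
    parser.feed(html)
    if parser.published is None:
        return None
    # The tag holds an ISO 8601 timestamp with a UTC offset.
    return datetime.fromisoformat(parser.published).strftime(outputformat)


snippet = (
    '<meta data-rh="true" property="article:published_time" '
    'content="1991-01-02T01:01:00+01:00"/>'
)
print(published_time(snippet))  # 1991-01-02 01:01:00
```

If `find_date` returns anything other than this value for the page, the meta tag is probably being outranked or skipped rather than missing.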
Hi @kinoute, |
@adbar It still doesn't work with your Here is the debugging without the
With
|
@kinoute Thanks for pointing that out, it's a bug. |
htmldate==1.2.3 used in https://github.com/ofou/graham-essays is incorrectly extracting dates. See output. The essays have MONTH YEAR below the title but that's not being picked up. Example: http://www.paulgraham.com/greatwork.html In a fork I tried updating to the latest version and it has the same issue. |
@dideler Thanks, the year is detected correctly but not the month which is contained in a |
Thank you for this wonderful tool! It would be great to see this news source added: Capacity Media, e.g. https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity
|
@stevesong It already works:
|
Wow, thanks! I must have been using an older version. Passing URLs through archive.org appears to have a normalising effect on some websites, in that htmldate works on the archive.org versions but not on the originals? |
It's not supposed to normalize anything, I'm just using archived versions to be able to replicate issues at some point in the future. |
Ok, understood, but there does appear to be something interesting happening there.
|
I guess it's because the download fails: some websites restrict access to the download utility, see |
I have mostly tested htmldate on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction of a date doesn't work so far.

Please install the dateparser library beforehand as it significantly extends linguistic coverage: pip or pip3 install -U dateparser or pip install -U htmldate[all].

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).

Thanks!
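For anyone who wants to try out a candidate XPath expression before reporting it, here is a minimal sketch using only the standard library. The expression, the sample markup and the helper `matching_texts` are all hypothetical and are not taken from core.py. ElementTree also supports only a subset of XPath and needs well-formed markup, so the expressions in DATE_EXPRESSIONS may rely on features this sketch cannot run.

```python
import xml.etree.ElementTree as ET

# Hypothetical candidate expression, limited to the XPath subset
# that ElementTree understands.
CANDIDATE_EXPRESSION = ".//span[@class='date']"

# Made-up, well-formed sample: the main content has an untagged date,
# the sidebar has a properly tagged one.
SAMPLE = """<html><body>
<div id="main"><h1>Title</h1><span class="date">2 January 1991</span></div>
<div id="sidebar"><time datetime="2020-05-04">4 May 2020</time></div>
</body></html>"""


def matching_texts(markup, expression):
    """Return the stripped text of every element matched by the expression."""
    root = ET.fromstring(markup)
    return [el.text.strip() for el in root.findall(expression) if el.text]


print(matching_texts(SAMPLE, CANDIDATE_EXPRESSION))  # ['2 January 1991']
```

A report that includes the page URL, the expression and the text it matches should be easy to reproduce.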