Open problems #325
Rewrite ArticleBody
ArticleBody seems overly complex and error-prone, while lacking some features.
US don't use this kind of layout (see [Discussion]: Most articles only have one section #310).
This would also simplify node extraction.
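A flatter body could look roughly like the sketch below. It assumes that the layout in question is the nesting into sections, and every name in it (BodyElement, FlatBody, the kind labels) is hypothetical rather than part of Fundus. Each element keeps the text as it was found and derives a cleaned version from it, so both are available while iterating.

```python
import re
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class BodyElement:
    """Hypothetical flat element: one extracted node of the article body."""

    kind: str  # assumed labels, e.g. "headline" or "paragraph"
    raw: str   # text exactly as it was found in the HTML node

    @property
    def text(self) -> str:
        # Cleaned text: collapse whitespace runs and strip both ends.
        return re.sub(r"\s+", " ", self.raw).strip()


class FlatBody:
    """Hypothetical flat container that is simply iterated in document order."""

    def __init__(self, elements: List[BodyElement]) -> None:
        self._elements = list(elements)

    def __iter__(self) -> Iterator[BodyElement]:
        return iter(self._elements)

    def __str__(self) -> str:
        return "\n\n".join(element.text for element in self)


body = FlatBody(
    [
        BodyElement("headline", "  A  headline\n"),
        BodyElement("paragraph", "First\tparagraph.  "),
    ]
)
cleaned = [element.text for element in body]
raw = [element.raw for element in body]
```

Keeping only the raw string (instead of the parsed node) is a deliberate simplification here, since plain strings serialize trivially.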
I.e., if we iterate over the elements of ArticleBody (alongside the nodes extracted from the HTML document), it should be possible to extract both the cleaned text and the 'raw' original text. Being able to access the lxml.html.HtmlElement as well would be beneficial, but complicated for upcoming serialization approaches.
Write a spider
Although initially thought of as a feature to crawl publishers 'politely', optional functionality to process an entire domain in the form of a spider, and therefore without relying on sitemaps, would be an excellent addition to Fundus.
HTMLSource yielding HTML for the scrapers to scrape.
This requires a rework of the naming and object hierarchy, as there should now be a proper interface replacing HTMLSource, which in turn was used as an object and an interface at the same time.
Optional[HTML] in return.
This would cover all requests in all parts of Fundus.
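The two ideas above, a proper interface for anything that yields HTML and a single request function returning Optional[HTML], might be sketched as follows. This is only an illustration under assumptions: Source, Fetcher, Spider, the simplified HTML class, and the max_pages parameter are hypothetical names, not the existing Fundus API, and the fetcher is faked with an in-memory site so the example runs without network access.

```python
from collections import deque
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Dict, Iterator, List, Optional, Protocol, Set
from urllib.parse import urljoin, urlparse


@dataclass(frozen=True)
class HTML:
    """Hypothetical, simplified stand-in for a downloaded page."""

    url: str
    content: str


class Source(Protocol):
    """Hypothetical interface: anything that yields HTML for the scrapers."""

    def fetch(self) -> Iterator[HTML]: ...


# Hypothetical single entry point for requests: URL in, Optional[HTML] out.
Fetcher = Callable[[str], Optional[HTML]]


class _LinkParser(HTMLParser):
    """Collects the href values of all <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)


class Spider:
    """Breadth-first crawl restricted to the domain of the start URL."""

    def __init__(self, start_url: str, fetcher: Fetcher, max_pages: int = 100) -> None:
        self.start_url = start_url
        self.fetcher = fetcher
        self.max_pages = max_pages

    def fetch(self) -> Iterator[HTML]:
        domain = urlparse(self.start_url).netloc
        queue = deque([self.start_url])
        seen: Set[str] = {self.start_url}
        yielded = 0
        while queue and yielded < self.max_pages:
            html = self.fetcher(queue.popleft())
            if html is None:  # request failed, move on to the next URL
                continue
            yielded += 1
            yield html
            parser = _LinkParser()
            parser.feed(html.content)
            for href in parser.links:
                link = urljoin(html.url, href).split("#")[0]  # drop fragments
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)


# In-memory site used instead of real HTTP requests.
SITE: Dict[str, str] = {
    "https://example.com/": '<a href="/a">a</a> <a href="https://other.org/">x</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/missing">m</a>',
}


def fake_fetcher(url: str) -> Optional[HTML]:
    content = SITE.get(url)
    return HTML(url, content) if content is not None else None


crawled = [page.url for page in Spider("https://example.com/", fake_fetcher).fetch()]
```

Because Spider only depends on the Fetcher callable, a sitemap-based source could share the same request function, which is one way to read "cover all requests in all parts of Fundus". Rate limiting for 'polite' crawling is left out of this sketch.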
Serialization
Should we simplify serialization at all, either through utility functions or proper serialization implementations?
Replace PublisherEnum's Enum implementation with a custom one
Python's implementation of the Enum class is a bit messy, especially typing-wise.
Since Fundus' PublisherEnum is solely built around the Enum class, this led to many typing issues as well as some unexpected behavior, and thus the PublisherEnum became notorious for causing problems.
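One possible direction for a custom implementation, shown purely as a sketch, is a small metaclass that collects plain member objects instead of inheriting from Enum. All names here (PublisherSpec, PublisherCollectionMeta, ExampleCollection, and the example publishers) are hypothetical and do not reflect how Fundus implements or plans to implement this.

```python
from dataclasses import dataclass
from typing import Dict, Iterator


@dataclass(frozen=True)
class PublisherSpec:
    """Hypothetical description of a single publisher."""

    name: str
    domain: str


class PublisherCollectionMeta(type):
    """Collects PublisherSpec class attributes and makes the class iterable."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        members: Dict[str, PublisherSpec] = {}
        for base in bases:  # inherit members from parent collections first
            members.update(getattr(base, "_members", {}))
        members.update(
            {key: value for key, value in namespace.items() if isinstance(value, PublisherSpec)}
        )
        cls._members = members
        return cls

    def __iter__(cls) -> Iterator[PublisherSpec]:
        return iter(cls._members.values())

    def __len__(cls) -> int:
        return len(cls._members)

    def __getitem__(cls, key: str) -> PublisherSpec:
        return cls._members[key]


class ExampleCollection(metaclass=PublisherCollectionMeta):
    # Members stay ordinary PublisherSpec instances, so attribute access
    # needs no Enum-specific ".value" indirection.
    ExampleTimes = PublisherSpec("Example Times", "https://example.com/")
    ExamplePost = PublisherSpec("Example Post", "https://example.org/")


names = [publisher.name for publisher in ExampleCollection]
post_domain = ExampleCollection["ExamplePost"].domain
```

Since each member is a regular dataclass instance, a type checker sees ExampleCollection.ExampleTimes as a PublisherSpec directly. The trade-off is that Enum conveniences such as identity-based members and protection against reassignment would have to be re-implemented if they are needed.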