Open problems #325

Open
MaxDall opened this issue Nov 2, 2023 · 0 comments

Rewrite ArticleBody

  • Problem: ArticleBody, the object that currently bundles all information about an article's body, seems overly complex and error-prone while still lacking some features.
    1. Do we need a specific layout format (summary -> sub-headlines -> paragraphs) at all? Some regions, such as the US, don't use this kind of layout (see [Discussion]: Most articles only have one section #310).
      Dropping it would also simplify node extraction.
    2. Support access to the 'raw' text, or even the lxml node, on a per-element basis.
      I.e., when iterating over the elements of ArticleBody (alongside the nodes extracted from the HTML document), it should be possible to extract both the cleaned text and the 'raw' original text. Access to the lxml.html.HtmlElement would also be beneficial, but would complicate upcoming serialization approaches.
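
To make point 2 more concrete, here is a minimal sketch of what a flat, per-element body could look like. The names BodyElement, FlatArticleBody, kind, and raw_text are hypothetical and not part of the current Fundus API, and the whitespace-collapsing cleanup is only an assumed stand-in for whatever cleaning Fundus actually applies.

```python
import re
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class BodyElement:
    """Hypothetical per-element record; not part of the current Fundus API."""

    kind: str      # e.g. "summary", "subheadline", "paragraph"
    raw_text: str  # text exactly as extracted from the HTML node

    @property
    def text(self) -> str:
        # Assumed cleanup: collapse runs of whitespace and strip both ends.
        return re.sub(r"\s+", " ", self.raw_text).strip()


@dataclass(frozen=True)
class FlatArticleBody:
    """Hypothetical flat body: an ordered sequence of elements, no fixed layout."""

    elements: List[BodyElement]

    def __iter__(self) -> Iterator[BodyElement]:
        return iter(self.elements)

    def __str__(self) -> str:
        # Join the cleaned text of all elements, one blank line between them.
        return "\n\n".join(element.text for element in self.elements)


body = FlatArticleBody(
    [
        BodyElement("paragraph", "  First   paragraph.\n"),
        BodyElement("subheadline", "A sub-headline"),
        BodyElement("paragraph", "Second\tparagraph."),
    ]
)

for element in body:
    # Both views are available for every element.
    print(repr(element.raw_text), "->", repr(element.text))
```

Keeping only plain strings on each element would sidestep the serialization concern. An lxml.html.HtmlElement reference could be attached as an optional, non-serialized field if it turns out to be needed.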

Write a spider

  • Problem: To crawl articles, Fundus depends heavily on information provided by the publishers, in this case sitemaps linking to the articles.
    While this reliance was initially intended as a way to crawl publishers 'politely', optional functionality to process an entire domain with a spider, without relying on sitemaps, would be an excellent addition to Fundus.
  • Requirements: The spider should ...
    • operate on the same level as HTMLSource, yielding HTML for the scrapers to scrape.
      This requires a rework of the naming and object hierarchy, as there should now be a proper interface replacing HTMLSource, which in turn was used as an object and an interface at the same time.
    • only operate in cases where no sitemap is provided, and not by default
    • support functionality to add delays between requests
  • Expensive: There could be a central request unit that organizes request delays and the access rules set by the publisher (see /robots.txt), so one would send a deferred request to this unit and expect Optional[HTML] in return.
    This would cover all requests in all parts of Fundus.
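
The central request unit could look roughly like the sketch below. RequestUnit and its parameters are hypothetical names, a plain str stands in for Fundus' HTML type, and a single robots.txt parser is assumed for simplicity, whereas a real unit would need one per publisher. Only the standard library's urllib.robotparser is used for the access rules, and the fetch function is injected so the example runs without network access.

```python
import time
from typing import Callable, Dict, Optional
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class RequestUnit:
    """Hypothetical central gatekeeper for outgoing requests (not Fundus API)."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[str]],
        robots: RobotFileParser,
        delay: float = 1.0,
        user_agent: str = "*",
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._robots = robots
        self._delay = delay
        self._user_agent = user_agent
        self._clock = clock
        self._sleep = sleep
        self._last_request: Dict[str, float] = {}  # host -> time of last request

    def request(self, url: str) -> Optional[str]:
        # Respect the publisher's access rules: disallowed URLs yield None.
        if not self._robots.can_fetch(self._user_agent, url):
            return None
        # Enforce a minimum delay between two requests to the same host.
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        if last is not None:
            wait = self._delay - (self._clock() - last)
            if wait > 0:
                self._sleep(wait)
        self._last_request[host] = self._clock()
        return self._fetch(url)


# Rules are parsed from literal lines here instead of downloading /robots.txt.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])


def fake_fetch(url: str) -> Optional[str]:
    # Stand-in for a real HTTP request.
    return "<html>" + url + "</html>"


unit = RequestUnit(fetch=fake_fetch, robots=robots, delay=0.0)
print(unit.request("https://example.com/news/a"))
print(unit.request("https://example.com/private/x"))
```

Because every component would go through request(), delays and access rules would be enforced in one place rather than separately in each source.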

Serialization

Should we simplify serialization at all, either through utility functions or proper serialization implementations?

Replace PublisherEnum's Enum implementation with a custom one

Python's implementation of the Enum class is a bit messy, especially typing-wise.
Since Fundus' PublisherEnum is built solely around the Enum class, this has led to many typing issues as well as some unexpected behavior, and the PublisherEnum has become notorious for causing problems.
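
One possible direction, shown only as a sketch, is a small registry class that offers attribute access and iteration without inheriting from Enum. PublisherSpec, PublisherCollection, and the ExampleNews entry are hypothetical names made up for illustration and do not reflect Fundus' actual design.

```python
from dataclasses import dataclass
from typing import Dict, Iterator


@dataclass(frozen=True)
class PublisherSpec:
    """Hypothetical stand-in for the data attached to one publisher."""

    name: str
    domain: str


class PublisherCollection:
    """Hypothetical Enum-free registry with attribute access and iteration."""

    def __init__(self, **publishers: PublisherSpec) -> None:
        self._publishers: Dict[str, PublisherSpec] = dict(publishers)

    def __getattr__(self, key: str) -> PublisherSpec:
        # Only called when normal attribute lookup fails.
        try:
            return self.__dict__["_publishers"][key]
        except KeyError:
            raise AttributeError(key) from None

    def __iter__(self) -> Iterator[PublisherSpec]:
        return iter(self._publishers.values())

    def __len__(self) -> int:
        return len(self._publishers)

    def __contains__(self, key: object) -> bool:
        return key in self._publishers


collection = PublisherCollection(
    ExampleNews=PublisherSpec("Example News", "https://example.com/"),
)

print(collection.ExampleNews.domain)
print([spec.name for spec in collection])
```

Dynamic attribute access like this is still opaque to static type checkers, so a real replacement would likely declare publishers as explicit class attributes. The sketch only illustrates that the Enum-like surface can be kept without the Enum base class.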
