Create a Scraper Manager #65

Open
marianelamin opened this issue Apr 18, 2021 · 0 comments

marianelamin commented Apr 18, 2021

Problem

This component is intended to handle all sources and execute the appropriate scraper, given the source. The idea is for this to be the frame under which all scrapers are invoked.

Proposed Solution

The scraper manager will be given a list of all links that need to be visited by the corresponding scraper, and will return a list in which each element is a dictionary of the form {source: url_of_source, result: content}. This manager will interact closely with Angostura.

In: a list of sources to be scraped.
Result: a list with the scraped content.
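
A minimal sketch of the input and output shape described above, assuming scrapers are looked up by the domain of each URL. The `ScraperManager` name, the domain-keyed registry, the `fake_scraper` stand-in, and the choice to skip sources with no registered scraper are all hypothetical and not taken from the project:

```python
from typing import Callable, Dict, List
from urllib.parse import urlparse

# Hypothetical type: a scraper takes a URL and returns the scraped content.
Scraper = Callable[[str], str]


class ScraperManager:
    """Illustrative manager: picks a scraper per source and collects results."""

    def __init__(self, scrapers: Dict[str, Scraper]) -> None:
        # Maps a domain (e.g. "example.com") to the scraper that handles it.
        self._scrapers = scrapers

    def run(self, sources: List[str]) -> List[Dict[str, str]]:
        results: List[Dict[str, str]] = []
        for url in sources:
            domain = urlparse(url).netloc
            scraper = self._scrapers.get(domain)
            if scraper is None:
                # Assumption: sources without a registered scraper are skipped.
                continue
            results.append({"source": url, "result": scraper(url)})
        return results


def fake_scraper(url: str) -> str:
    # Stand-in for a real scraper; performs no network access.
    return f"content of {url}"


manager = ScraperManager({"example.com": fake_scraper})
output = manager.run(["https://example.com/a", "https://unknown.org/b"])
```

Whether unknown sources should be skipped, logged, or reported as errors is an open design choice that the issue does not settle.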

This assumes each source might have its own scraper implementation. It is therefore crucial that each scraper implements a common interface (following a contract, see #66); this interface shall include a scrape method.
Abstract Classes in Python
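
One way the contract mentioned above could look, sketched with Python's standard `abc` module. The class names `BaseScraper` and `ExampleScraper`, and the `scrape(url) -> str` signature, are assumptions for illustration; the actual contract is defined in #66:

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Hypothetical contract: every concrete scraper must provide scrape()."""

    @abstractmethod
    def scrape(self, url: str) -> str:
        """Return the content scraped from the given URL."""


class ExampleScraper(BaseScraper):
    """Illustrative implementation that performs no network access."""

    def scrape(self, url: str) -> str:
        return f"scraped {url}"


scraper = ExampleScraper()
content = scraper.scrape("https://example.com/post")
```

Because `scrape` is marked abstract, attempting to instantiate `BaseScraper` directly, or a subclass that does not define `scrape`, raises `TypeError`, so a missing implementation surfaces early.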


Questions to be answered:

  • How do we know the page has articles or posts of interest? By the URL?
  • Is there a way we can guarantee the content is related?

Answering these questions should not require NLP; this is still part of the scraping component.
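
As one possible answer to the "by the URL?" question, a rough heuristic could inspect the URL path without any NLP. The hint words and the `looks_like_article` function are hypothetical, and a real list would depend on each source:

```python
from urllib.parse import urlparse

# Hypothetical path hints; each source would likely need its own list.
ARTICLE_HINTS = ("noticias", "news", "articulo", "post")


def looks_like_article(url: str) -> bool:
    """Return True if any path segment of the URL contains a hint word."""
    path = urlparse(url).path.lower()
    segments = [segment for segment in path.split("/") if segment]
    return any(hint in segment for segment in segments for hint in ARTICLE_HINTS)


is_article = looks_like_article("https://example.com/news/2021/04/some-title")
is_about_page = looks_like_article("https://example.com/about")
```

A check like this only suggests that a page is an article; it cannot guarantee that the content is related, which is the second open question.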

@marianelamin marianelamin linked a pull request Apr 20, 2021 that will close this issue
1 task
@marianelamin marianelamin removed a link to a pull request Apr 20, 2021
1 task
@marianelamin marianelamin self-assigned this Apr 23, 2021