Luis/toy webscrapper #68
Changes from 14 commits
@@ -0,0 +1,80 @@
""" | ||
Main module interface | ||
""" | ||
|
||
# Local imports | ||
from scraper.scrapers.base_scraper import BaseScraper | ||
from .settings import URL_TO_SCRAPER | ||
from scraper.utils import get_domain_from_url, valid_url | ||
|
||
# Python imports | ||
from typing import List, Type | ||
|
||
Comment: I know that this is the first step, so it is OK with me if the following observation is handled in a follow-up task/PR. Currently we are returning a dict, but I encourage you to start working with explicitly typed data structures as soon as possible.
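To illustrate the suggestion above, here is a minimal sketch (not part of the PR) of what an explicitly typed scrape result could look like; the class name and fields are hypothetical, mirroring the dict format documented in scrape() below.

```python
# Hypothetical sketch of a typed scrape result; names are illustrative only.
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ScrapedItem:
    url: str                                            # url the data came from
    data: Dict[str, Any] = field(default_factory=dict)  # fields extracted for this url
```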
def scrape(url: str) -> dict:
Comment: These methods will need some unit tests as well :)
""" | ||
Scrape data for the given url if such url is scrappable, | ||
Raise ValueError if not. | ||
|
||
Params: | ||
+ url - str : Url to scrape | ||
Return: | ||
A dict object, each describing the data that could be | ||
extracted for this url. Obtained data depends on the url itself, | ||
so available data may change depending on the scrapped url. | ||
Dict format: | ||
{ | ||
"url" : (str) url where the data came from, | ||
"data": (dict) Data scraped for this url | ||
} | ||
""" | ||
scraper = _get_scraper_from_url(url)() | ||
return scraper.scrape(url) | ||
|
||
|
||
def bulk_scrape(urls: List[str]) -> List[dict]: | ||
""" | ||
Performs a bulk scraping over a list of urls. | ||
Order in the item list it's not guaranteed to be | ||
the same as in the input list | ||
|
||
Parameters: | ||
+ urls : [str] = Urls to be scraped | ||
Return: | ||
A list of items scraped for each url in the original list | ||
""" | ||
|
||
items = [] | ||
scrapers = {} | ||
for url in urls: | ||
# Classify urls to its according scraper | ||
scraper = _get_scraper_from_url(url) | ||
|
||
if not (url_list := scrapers.get(scraper)): | ||
url_list = scrapers[scraper] = [] | ||
|
||
url_list.append(url) | ||
|
||
# Bulk scrape urls | ||
for (scraper, url_list) in scrapers.items(): | ||
s = scraper() # Create a new scraper instance | ||
items.extend(s.bulk_scrape(url_list)) | ||
|
||
return items | ||
|
||
|
||
def _get_scraper_from_url(url: str) -> Type[BaseScraper]: | ||
""" | ||
Validates if this url is scrapable and returns its | ||
corresponding spider when it is | ||
""" | ||
|
||
if not valid_url(url): | ||
raise ValueError(f"This is not a valid url: {url}") | ||
|
||
domain = get_domain_from_url(url) | ||
|
||
if not (scraper := URL_TO_SCRAPER.get(domain)): | ||
raise ValueError(f"Unable to scrap this url: {url}") | ||
|
||
return scraper |
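Following the reviewer's request for unit tests on these methods, a hedged pytest sketch could look like the following. The import path scraper.main is hypothetical (the file name is not visible in this capture), and the behaviour of valid_url / get_domain_from_url is assumed from how they are used above.

```python
# Hedged test sketch, not part of the PR. Assumes the module above is
# importable as scraper.main (hypothetical path), that valid_url() rejects
# malformed strings, and that get_domain_from_url() returns the bare domain.
import pytest

from scraper import main


def test_scrape_raises_on_invalid_url():
    with pytest.raises(ValueError):
        main.scrape("not-a-url")


def test_scrape_raises_on_unsupported_domain():
    # example.com is not registered in URL_TO_SCRAPER
    with pytest.raises(ValueError):
        main.scrape("https://example.com/some-article")


def test_get_scraper_resolves_known_domain():
    # elpitazo.net is mapped to ElPitazoScraper in settings.py
    scraper_cls = main._get_scraper_from_url("https://elpitazo.net/ccs/some-article/")
    assert scraper_cls.__name__ == "ElPitazoScraper"
```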
@@ -0,0 +1,60 @@
""" | ||
Base class for a scrapper. | ||
In order to create and wire a new scrapper: | ||
1) Create a new scraper in the "scrapers" directory | ||
2) Make your scraper a subclass of BaseScraper | ||
3) Implement missing methods (parse & scrape) | ||
4) add an entry in settings.py to the URL_TO_SCRAPER map, maping from | ||
a domain name to your new scraper. Import it if necessary | ||
""" | ||
|
||
# Python imports | ||
from typing import List | ||
|
||
|
||
class BaseScraper: | ||
Comment: 👍🏽
""" | ||
Base class for scrapers implementations | ||
""" | ||
|
||
def parse(self, response) -> dict: | ||
Comment: NIT: I noticed that you are following the typing, which is great.

Comment: About 1.: I'll be addressing this issue in some other PR.
""" | ||
return scraped data from a response object | ||
Parameters: | ||
+ response : any = some kind of structure holding an http response | ||
from which we can scrape data | ||
Return: | ||
A dict with scrapped fields from response | ||
""" | ||
pass | ||
|
||
def scrape(self, url: str) -> dict: | ||
""" | ||
return scraped data from url. | ||
Parameters: | ||
+ url : str = url to be scraped by this class | ||
Return: | ||
A dict with scrapped data from the given url | ||
if such url is a valid one | ||
""" | ||
pass | ||
|
||
def bulk_scrape(self, urls: List[str]) -> List[dict]: | ||
""" | ||
Return scraped data for a list of urls. Override it | ||
if your scraper implementation could handle an optimized | ||
bulk scraping. | ||
|
||
Parametes: | ||
+ urls : [str] = urls to be scraped | ||
Return: | ||
List of scraped items. Notice that the order it's not guaranteed to be | ||
the same as in the input list. | ||
""" | ||
|
||
items = [] | ||
for url in urls: | ||
if (item := self.scrape(url)) : | ||
items.append(item) | ||
|
||
return items |
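As a concrete illustration of steps 1–3 from the module docstring above, a hypothetical subclass might look like the sketch below (the site and the parsing logic are invented; step 4 would add the matching entry in settings.py):

```python
# Hypothetical BaseScraper subclass, for illustration only.
from scraper.scrapers.base_scraper import BaseScraper


class ExampleScraper(BaseScraper):
    """Minimal scraper for an imaginary site."""

    def parse(self, response) -> dict:
        # A real implementation would extract fields from the response here
        return {"title": getattr(response, "title", "")}

    def scrape(self, url: str) -> dict:
        # A real implementation would fetch the url and feed it to self.parse()
        return {"url": url, "data": {}}
```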
@@ -0,0 +1,50 @@
""" | ||
Base class for scrapy-based scrapers. | ||
|
||
In order to create a a new scrapy scraper: | ||
1) Create a new scraper un "scrapers" folder, and make it subclass | ||
of this BaseScrapyScraper | ||
2) override "spider" attribute of your new class with a valid | ||
scrapy spider | ||
3) wired it in settings as you would do with a regular scraper | ||
""" | ||
|
||
# External imports | ||
from scrapy import Spider | ||
|
||
# Internal imports | ||
from scraper.scrapers.base_scraper import BaseScraper | ||
from scraper.spider_manager import SpiderManager | ||
|
||
# Python imports | ||
from typing import Type, List | ||
|
||
|
||
class BaseScrapyScraper(BaseScraper): | ||
""" | ||
In order to create a new Scrappy Scrapper, just | ||
inherit this class and assign a new value to the | ||
"spider" field, a valid scrapy Spider sub class. | ||
""" | ||
|
||
spider: Type[Spider] = None | ||
|
||
def __init__(self): | ||
|
||
if self.spider is None: | ||
raise TypeError( | ||
"Spider not defined," | ||
+ "perhaps you forgot to override spider" | ||
+ "attribute in BaseScrapyScraper subclass?" | ||
) | ||
|
||
self._spider_manager = SpiderManager(self.spider) | ||
|
||
def parse(self, response) -> dict: | ||
return self._spider_manager.parse(response) | ||
|
||
def scrape(self, url: str) -> dict: | ||
return self._spider_manager.scrape(url) | ||
|
||
def bulk_scrape(self, urls: List[str]) -> List[dict]: | ||
return self._spider_manager.bulk_scrape(urls) |
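A short hedged test sketch (not part of the PR) for the constructor check above: a subclass that forgets to override "spider" should fail fast at construction time.

```python
# Hedged test sketch: a subclass without a spider must raise at construction time.
import pytest

from scraper.scrapers.base_scrapy_scraper import BaseScrapyScraper


def test_missing_spider_raises_type_error():
    class BrokenScraper(BaseScrapyScraper):
        pass  # "spider" deliberately left as None

    with pytest.raises(TypeError):
        BrokenScraper()
```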
@@ -0,0 +1,15 @@
""" | ||
Scraper to get data from El Pitazo | ||
""" | ||
# Internal imports | ||
from scraper.scrapers.base_scrapy_scraper import BaseScrapyScraper | ||
from scraper.spiders.el_pitazo import ElPitazoSpider | ||
|
||
|
||
class ElPitazoScraper(BaseScrapyScraper): | ||
""" | ||
Scrapes data from ElPitazo, relies in | ||
scrapy for this. | ||
""" | ||
|
||
spider = ElPitazoSpider |
@@ -0,0 +1,6 @@
""" | ||
Settings specific to scrapy | ||
""" | ||
|
||
# Settings passed to the crawler | ||
CRAWLER_SETTINGS = {"LOG_ENABLED": False} | ||
Comment: I'm glad to see that you are already considering configuration settings while keeping it simple! I like that. Please create a task for me when you get time to bring a configuration library into the project.
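As a rough sketch of the direction suggested in the comment above, and only until a configuration library is chosen, the crawler settings could be made overridable from the environment; the variable name below is invented.

```python
# Hypothetical sketch only: environment-driven override for the crawler settings.
import os

CRAWLER_SETTINGS = {
    # Set SCRAPER_LOG_ENABLED=1 to turn scrapy logging on (variable name is invented)
    "LOG_ENABLED": os.environ.get("SCRAPER_LOG_ENABLED", "0") == "1",
}
```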
@@ -0,0 +1,17 @@
""" | ||
This file manages multiple settings shared across the scraper, | ||
such as mappings from urls to scrapers | ||
""" | ||
from scraper.scrapers.el_pitazo_scraper import ElPitazoScraper | ||
import os | ||
|
||
|
||
# Dict with information to map from domain to | ||
# Spider | ||
URL_TO_SCRAPER = { | ||
"elpitazo.net": ElPitazoScraper, | ||
} | ||
|
||
|
||
# root dir, so we can get resources from module directories | ||
ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) |
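For reference, wiring an additional scraper (step 4 of the base_scraper.py docstring) would just mean adding another entry to this map; the module and domain below are hypothetical.

```python
# Hypothetical example of registering a second scraper; names are invented.
# from scraper.scrapers.example_scraper import ExampleScraper

URL_TO_SCRAPER = {
    "elpitazo.net": ElPitazoScraper,
    # "example-news-site.com": ExampleScraper,
}
```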
@@ -0,0 +1,80 @@
# External imports
import scrapy
from scrapy.crawler import CrawlerProcess
import scrapy.signals

# Project imports
import scraper.scrapy_settings as settings

# Python imports
from typing import List


class SpiderManager:
Comment: This class will need some unit tests too :)

Comment: I'll have a look at how to test this, given that it has the CrawlerProcess + spider things.

Comment: What's exactly the purpose of this class? I don't like the word Manager because it looks like this module might become a God Object. But I might be wrong. Is this a Mediator? P.S. I am used to writing tests before programming, so this might need refactoring before we can test it easily. I would give you my advice on how to unit test this class, but first I need to know its purpose.

Comment: Its purpose is to implement the scraping and parsing functions for spiders, so they are not directly managed by a BaseScraper subclass. This way, BaseScrapyScraper can just delegate that behavior to the manager object. Maybe it's better to merge that logic into BaseScrapyScraper, since the latter is just redirecting behavior into SpiderManager?
""" | ||
Utility class to perform common operations in | ||
Spider classes | ||
""" | ||
|
||
process = CrawlerProcess(settings.CRAWLER_SETTINGS) | ||
|
||
def __init__(self, spider) -> None: | ||
self.spider = spider | ||
|
||
def parse(self, response) -> dict: | ||
""" | ||
return scraped data from a valid response | ||
Parameters: | ||
+ response : scrapy.http.Response = response object holding the actual response | ||
Return: | ||
dict like object with scraped data | ||
""" | ||
spider = self.spider() | ||
return spider.parse(response) | ||
|
||
def scrape(self, url: str) -> dict: | ||
""" | ||
Return scraped data from a single Url | ||
Parameters: | ||
+ url : str = url whose data is to be scraped. Should be compatible with the given spider | ||
Return: | ||
dict like object with scraped data | ||
""" | ||
scraped = self.bulk_scrape([url]) | ||
|
||
return scraped[0] if scraped else None | ||
|
||
def bulk_scrape(self, urls: List[str]) -> List[dict]: | ||
""" | ||
return scraped data from a list of valid URLs | ||
Parameters: | ||
+ urls : [str] = urls whose data is to be scraped. | ||
Should be compatible with the provided spider | ||
Return: | ||
list of dict like object with scraped data | ||
""" | ||
|
||
# if nothing to do, just return an empty list | ||
if not urls: | ||
return [] | ||
|
||
# Items accumulator | ||
items = [] | ||
|
||
# callback function to collect items on the fly | ||
def items_scrapped(item, response, spider): | ||
items.append({"url": response._url, "data": item}) | ||
|
||
# set up urls to scrape | ||
self.spider.start_urls = urls | ||
|
||
# create crawler for this spider, connect signal so we can collect items | ||
crawler = self.process.create_crawler(self.spider) | ||
crawler.signals.connect(items_scrapped, signal=scrapy.signals.item_scraped) | ||
|
||
# start scrapping | ||
self.process.crawl(crawler) | ||
self.process.start() | ||
Comment: This is not a blocking call, is it? If so, I'm not sure about this API. The docstring says that you return the scraped data, but start() on the crawler process is intended as an async pattern and you need to use the join method. But maybe I'm wrong.

        # Return post-processed scraped objects
        return items
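For the unit-test discussion above, one hedged approach (not part of the PR) is to replace the class-level CrawlerProcess with a mock so no real crawl is started; the patching strategy and the article url below are assumptions.

```python
# Hedged test sketch for SpiderManager; the CrawlerProcess is mocked out.
from unittest.mock import MagicMock, patch

from scraper.spider_manager import SpiderManager
from scraper.spiders.el_pitazo import ElPitazoSpider


def test_bulk_scrape_with_no_urls_returns_empty_list():
    manager = SpiderManager(ElPitazoSpider)
    assert manager.bulk_scrape([]) == []


def test_bulk_scrape_drives_the_crawler_process():
    with patch.object(SpiderManager, "process", MagicMock()) as fake_process:
        manager = SpiderManager(ElPitazoSpider)
        manager.bulk_scrape(["https://elpitazo.net/some-article/"])  # url is invented

        # The manager should create a crawler, schedule it, and start the process
        fake_process.create_crawler.assert_called_once_with(ElPitazoSpider)
        fake_process.crawl.assert_called_once()
        fake_process.start.assert_called_once()
```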
@@ -0,0 +1,73 @@
# Internal imports
import scraper.utils as utils

# External imports
import scrapy

# Python imports
from typing import List


class ElPitazoSpider(scrapy.Spider):
    """
    Spider to scrape ElPitazo data
    """

    name = "el_pitazo"

    start_urls = []

    def parse(self, response):
        """
        Returns a dict-like structure with the following
        fields:
            + title
            + date
            + categories
            + body
            + author
            + tags
        """

        # These are simple properties, just get their text with a valid
        # selector
        title = utils.get_element_text(".tdb-title-text", response) or ""
        date = utils.get_element_text(".entry-date", response) or ""
        author = utils.get_element_text(".tdb-author-name", response) or ""

        body = self._get_body(response)

        tags = self._get_tags(response)

        # Categories
        categories = response.css(".tdb-entry-category").getall()
        categories = list(map(utils.strip_http_tags, categories))

        return {
Comment: This is related to my previous comment about using typed objects for outputs. Or, more precisely, a hierarchy of dataclasses.
"title": title, | ||
"date": date, | ||
"categories": categories, | ||
"body": body, | ||
"author": author, | ||
"tags": tags, | ||
} | ||
|
||
def _get_body(self, response) -> str: | ||
""" | ||
Get article body as a single string | ||
""" | ||
body = response.css("#bsf_rt_marker > p").getall() | ||
body = filter(lambda p: p.startswith("<p>") and p.endswith("</p>"), body) | ||
body = map(utils.strip_http_tags, body) | ||
|
||
body = "\n".join(body) | ||
Comment: Logic like this is a good fit for the data class as a computed property. That will allow you to reuse this small amount of code easily. I know it is a small amount of code, but once you are scraping 100 sources, one line of repeated code becomes 100 lines of scattered repeated code, and if you need to change it, it will be a pain.

        return body.strip()

    def _get_tags(self, response) -> List[str]:
        """
        Try to get tags from document if available
        """
        tags = response.css(".tdb-tags > li > a").getall()
        tags = list(map(utils.strip_http_tags, tags))
        return tags
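A hedged sketch of the reviewer's suggestion above: a dataclass for the scraped article with the paragraph-joining logic as a computed property, so every spider can reuse it. The names are illustrative and not part of the PR.

```python
# Hypothetical article dataclass; the body join becomes a shared computed property.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScrapedArticle:
    title: str = ""
    date: str = ""
    categories: List[str] = field(default_factory=list)
    paragraphs: List[str] = field(default_factory=list)
    author: str = ""
    tags: List[str] = field(default_factory=list)

    @property
    def body(self) -> str:
        # The join previously done inside the spider, now shared by all sources
        return "\n".join(self.paragraphs).strip()
```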
Comment: The whole package is misplaced.
By convention, we would like everything inside our c4v-py project to be imported like this:
and with this you would have:
The latter is the pythonic way to organize things; the former would be a library with more than one global package, which is extremely rare (at least for me).
I recommend sticking to the normal case.
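The import snippets the reviewer refers to are not visible in this capture. A hedged guess at the intended layout, based only on the project name c4v-py, is to nest the package under a single top-level package so imports read as follows; this is an assumption, not the reviewer's actual example.

```python
# Assumed intent only: one global package named after the project.
from c4v.scraper import scrape, bulk_scrape

# instead of exposing "scraper" as its own top-level package:
# from scraper import scrape, bulk_scrape
```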