
Luis/toy webscrapper #68

Merged · 20 commits · May 17, 2021
Changes from 14 commits
2,445 changes: 1,577 additions & 868 deletions poetry.lock

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions pyproject.toml
@@ -35,6 +35,8 @@ pandas = "1.0.5"
nltk = "^3.5"
google-cloud-bigquery = "1.25.0"
tensorflow-cpu = "2.3.1"
Scrapy = "^2.5.0"
beautifulsoup4 = "^4.9.3"

[tool.poetry.dev-dependencies]
pytest = "^5.4.3"
Empty file added src/scraper/__init__.py
Empty file.
80 changes: 80 additions & 0 deletions src/scraper/scraper.py
@@ -0,0 +1,80 @@
"""
Member:

The whole package is misplaced.
By convention, we would like everything inside our c4v-py project to be imported like this:

import c4v.scraper.X
...
from c4v.scraper import X

whereas with the current layout you would have:

import scraper.X
...
from scraper import X

The former is the Pythonic way to organize things; the latter gives a library with more than one global package, which is extremely rare (at least in my experience).

I recommend sticking to the normal case.
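A minimal sketch of the suggested layout; the exact file names and the re-export in c4v/scraper/__init__.py are assumptions, not part of this PR:

# Assumed target layout:
#   src/c4v/__init__.py
#   src/c4v/scraper/__init__.py    (re-exports scrape / bulk_scrape)
#   src/c4v/scraper/scraper.py
#   src/c4v/scraper/scrapers/...
#   src/c4v/scraper/spiders/...

# Call sites then import through the single top-level package:
from c4v.scraper import scrape, bulk_scrape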

Main module interface
"""

# Local imports
from scraper.scrapers.base_scraper import BaseScraper
from .settings import URL_TO_SCRAPER
from scraper.utils import get_domain_from_url, valid_url

# Python imports
from typing import List, Type

Member:

I know this is the first step, so it is fine with me if the following observation is handled in a follow-up task/PR.

Currently we are returning a Dict, but I encourage you to start working with explicitly typed data structures as soon as possible.
More precisely, I would like to see a hierarchy of DataClasses representing outputs, and this hierarchy should, to some extent, be aligned with the class hierarchy you already put in place in this PR.
And, if I can be more strict and picky, I would like to see the current hierarchy of classes defined as Generics parametrized by the output DataClasses.
Another important observation is that those DataClasses should be treated as immutable objects. They can have methods performing computations, but they should avoid mutating the object itself. That design should make everything friendlier to task pipelining and future parallelization.
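A minimal sketch of what that could look like — a frozen DataClass for the output plus a Generic scraper parametrized by it; all names here are assumptions, not code from this PR:

from dataclasses import dataclass
from typing import Generic, List, TypeVar

@dataclass(frozen=True)  # immutable output object
class ScrapedData:
    url: str
    title: str
    body: str

T = TypeVar("T")  # the output DataClass a scraper produces

class TypedScraper(Generic[T]):
    def scrape(self, url: str) -> T:
        raise NotImplementedError

    def bulk_scrape(self, urls: List[str]) -> List[T]:
        return [self.scrape(u) for u in urls]

# A concrete scraper would then subclass TypedScraper[ScrapedData].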


def scrape(url: str) -> dict:
Collaborator:

these methods will need some unit tests as well :) (see the test sketch at the end of this file)

"""
Scrape data for the given url if such url is scrapable;
raise ValueError if not.

Params:
+ url - str : Url to scrape
Return:
A dict object describing the data that could be
extracted from this url. Obtained data depends on the url itself,
so available data may change depending on the scraped url.
Dict format:
{
"url" : (str) url where the data came from,
"data": (dict) Data scraped for this url
}
"""
scraper = _get_scraper_from_url(url)()
return scraper.scrape(url)


def bulk_scrape(urls: List[str]) -> List[dict]:
"""
Performs a bulk scraping over a list of urls.
The order of the item list is not guaranteed to be
the same as in the input list

Parameters:
+ urls : [str] = Urls to be scraped
Return:
A list of items scraped for each url in the original list
"""

items = []
scrapers = {}
for url in urls:
# Group urls by their corresponding scraper
scraper = _get_scraper_from_url(url)

if not (url_list := scrapers.get(scraper)):
url_list = scrapers[scraper] = []

url_list.append(url)

# Bulk scrape urls
for (scraper, url_list) in scrapers.items():
s = scraper() # Create a new scraper instance
items.extend(s.bulk_scrape(url_list))

return items


def _get_scraper_from_url(url: str) -> Type[BaseScraper]:
"""
Validates that this url is scrapable and returns its
corresponding scraper when it is
"""

if not valid_url(url):
raise ValueError(f"This is not a valid url: {url}")

domain = get_domain_from_url(url)

if not (scraper := URL_TO_SCRAPER.get(domain)):
raise ValueError(f"Unable to scrape this url: {url}")

return scraper
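Following the unit-test comment above, a minimal pytest sketch for these helpers; it assumes the import paths shown here and that valid_url/get_domain_from_url (not in this diff) reject malformed urls and return the bare domain:

import pytest

from scraper.scraper import _get_scraper_from_url
from scraper.scrapers.el_pitazo_scraper import ElPitazoScraper


def test_known_domain_maps_to_el_pitazo_scraper():
    # hypothetical article url on a wired domain
    assert _get_scraper_from_url("https://elpitazo.net/some-article") is ElPitazoScraper


def test_invalid_url_raises_value_error():
    with pytest.raises(ValueError):
        _get_scraper_from_url("not a url at all")


def test_unknown_domain_raises_value_error():
    with pytest.raises(ValueError):
        _get_scraper_from_url("https://example.com/article")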
Empty file added src/scraper/scrapers/__init__.py
Empty file.
60 changes: 60 additions & 0 deletions src/scraper/scrapers/base_scraper.py
@@ -0,0 +1,60 @@
"""
Base class for a scraper.
In order to create and wire a new scraper:
1) Create a new scraper in the "scrapers" directory
2) Make your scraper a subclass of BaseScraper
3) Implement missing methods (parse & scrape)
4) Add an entry to the URL_TO_SCRAPER map in settings.py, mapping from
a domain name to your new scraper. Import it if necessary
"""

# Python imports
from typing import List


class BaseScraper:
Collaborator:

👍🏽

"""
Base class for scraper implementations
"""

def parse(self, response) -> dict:
Member:

NIT: I noticed that you are using type hints, which is great.
If we want to be strict, the typing here has two mistakes:

  1. response has no type: maybe that is because Scrapy does not have type hints. Is that right?
  2. dict is not a valid type hint. The typing module defines a type called Dict. That one is Generic, and a typical annotation looks like this:
def parse(self, response: MyType) -> Dict[str, Any]:

Collaborator Author:

About 1.
No, it does have type hints; we decided to keep it as "any" type so every scraper could have its own response type according to its needs. But I agree that we should have a base class providing a minimal interface for every response object, and (using scrapy as an example) a ScrapyResponse implementation for scrapy's Response object could just delegate the interface to scrapy, for example:

class BaseResponse:
    def url(self) -> str:  # just an example
        pass

class ScrapyResponse(BaseResponse):
    def __init__(self, scrapy_response: scrapy.http.Response):
        self.scrapy_response = scrapy_response

    def url(self) -> str:
        return self.scrapy_response.url

I'll be addressing this issue in another PR.
About 2.
You're right, I should change every dict signature. I wasn't sure which one to use when I first wrote it. I'll change it ASAP.

"""
Return scraped data from a response object
Parameters:
+ response : any = some kind of structure holding an http response
from which we can scrape data
Return:
A dict with scraped fields from the response
"""
pass

def scrape(self, url: str) -> dict:
"""
Return scraped data from a url.
Parameters:
+ url : str = url to be scraped by this class
Return:
A dict with scraped data from the given url,
if such url is a valid one
"""
pass

def bulk_scrape(self, urls: List[str]) -> List[dict]:
"""
Return scraped data for a list of urls. Override it
if your scraper implementation could handle an optimized
bulk scraping.

Parameters:
+ urls : [str] = urls to be scraped
Return:
List of scraped items. Note that the order is not guaranteed to be
the same as in the input list.
"""

items = []
for url in urls:
if item := self.scrape(url):
items.append(item)

return items
50 changes: 50 additions & 0 deletions src/scraper/scrapers/base_scrapy_scraper.py
@@ -0,0 +1,50 @@
"""
Base class for scrapy-based scrapers.

In order to create a new scrapy-based scraper:
1) Create a new scraper in the "scrapers" folder, and make it a subclass
of this BaseScrapyScraper
2) Override the "spider" attribute of your new class with a valid
scrapy spider
3) Wire it in settings as you would do with a regular scraper
"""

# External imports
from scrapy import Spider

# Internal imports
from scraper.scrapers.base_scraper import BaseScraper
from scraper.spider_manager import SpiderManager

# Python imports
from typing import Type, List


class BaseScrapyScraper(BaseScraper):
"""
In order to create a new Scrapy scraper, just
inherit from this class and assign a new value to the
"spider" field: a valid scrapy Spider subclass.
"""

spider: Type[Spider] = None

def __init__(self):

if self.spider is None:
raise TypeError(
"Spider not defined, "
+ "perhaps you forgot to override the spider "
+ "attribute in your BaseScrapyScraper subclass?"
)

self._spider_manager = SpiderManager(self.spider)

def parse(self, response) -> dict:
return self._spider_manager.parse(response)

def scrape(self, url: str) -> dict:
return self._spider_manager.scrape(url)

def bulk_scrape(self, urls: List[str]) -> List[dict]:
return self._spider_manager.bulk_scrape(urls)
15 changes: 15 additions & 0 deletions src/scraper/scrapers/el_pitazo_scraper.py
@@ -0,0 +1,15 @@
"""
Scraper to get data from El Pitazo
"""
# Internal imports
from scraper.scrapers.base_scrapy_scraper import BaseScrapyScraper
from scraper.spiders.el_pitazo import ElPitazoSpider


class ElPitazoScraper(BaseScrapyScraper):
"""
Scrapes data from ElPitazo, relying on
scrapy to do so.
"""

spider = ElPitazoSpider
6 changes: 6 additions & 0 deletions src/scraper/scrapy_settings.py
@@ -0,0 +1,6 @@
"""
Settings specific to scrapy
"""

# Settings passed to the crawler
CRAWLER_SETTINGS = {"LOG_ENABLED": False}
Member:

I'm glad to see that you guys are already considering configuration settings while keeping it simple!!!

I like that. Please create a task for me when you get time to bring a configuration library into the project.
Here is the one I'm thinking of using: https://www.dynaconf.com
cc: @dieko95
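For reference, a minimal sketch of what dynaconf-based settings might look like here; the file names, env-var prefix, and keys are assumptions:

from dynaconf import Dynaconf

settings = Dynaconf(
    settings_files=["settings.toml", ".secrets.toml"],
    envvar_prefix="C4V",  # e.g. C4V_LOG_ENABLED=false would override the file value
)

# scrapy_settings.py could then read its values from the shared config
CRAWLER_SETTINGS = {"LOG_ENABLED": settings.get("LOG_ENABLED", False)}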

17 changes: 17 additions & 0 deletions src/scraper/settings.py
@@ -0,0 +1,17 @@
"""
This file manages multiple settings shared across the scraper,
such as mappings from urls to scrapers
"""
from scraper.scrapers.el_pitazo_scraper import ElPitazoScraper
import os


# Dict mapping from domain name to its scraper
URL_TO_SCRAPER = {
"elpitazo.net": ElPitazoScraper,
}


# root dir, so we can get resources from module directories
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
80 changes: 80 additions & 0 deletions src/scraper/spider_manager.py
@@ -0,0 +1,80 @@
# external imports
import scrapy
from scrapy.crawler import CrawlerProcess
import scrapy.signals

# Project imports
import scraper.scrapy_settings as settings

# Python imports
from typing import List


class SpiderManager:
Collaborator:

this class will need some unit tests too :)

Collaborator:

I'll have a look at how to test this, given that it has the CrawlerProcess + spider things.
@m4r4c00ch0, maybe you have some input here?

Collaborator:

What exactly is the purpose of this class? I don't like the word Manager, because it looks like this module might become a God Object. But I might be wrong. Is this a Mediator?

p.s. I am used to writing tests before programming, so this might need refactoring before we can test it easily. I would give you my advice on how to unit test this class, but first I need to know its purpose.

Collaborator Author:

Its purpose is to implement scraping and parsing functions for spiders, so they are not directly managed by a BaseScraper subclass. This way, BaseScrapyScraper can just delegate that behavior to the manager object. Maybe it's better to merge that logic into BaseScrapyScraper, since the latter is just redirecting behavior to SpiderManager?
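One way to unit test the parse path without starting a CrawlerProcess is to hand the spider a pre-built response. A minimal pytest sketch, assuming the canned HTML below and that utils.get_element_text returns the selected element's text:

from scrapy.http import HtmlResponse

from scraper.spider_manager import SpiderManager
from scraper.spiders.el_pitazo import ElPitazoSpider


def test_parse_extracts_title_from_canned_html():
    html = b'<html><body><h1 class="tdb-title-text">Some headline</h1></body></html>'
    response = HtmlResponse(url="https://elpitazo.net/x", body=html, encoding="utf-8")

    item = SpiderManager(ElPitazoSpider).parse(response)

    assert item["title"] == "Some headline"

The bulk_scrape path that goes through CrawlerProcess is harder to cover this way and may be better suited to an integration test.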

"""
Utility class to perform common operations on
Spider classes
"""

process = CrawlerProcess(settings.CRAWLER_SETTINGS)

def __init__(self, spider) -> None:
self.spider = spider

def parse(self, response) -> dict:
"""
Return scraped data from a valid response
Parameters:
+ response : scrapy.http.Response = response object holding the actual response
Return:
dict like object with scraped data
"""
spider = self.spider()
return spider.parse(response)

def scrape(self, url: str) -> dict:
"""
Return scraped data from a single Url
Parameters:
+ url : str = url whose data is to be scraped. Should be compatible with the given spider
Return:
dict like object with scraped data
"""
scraped = self.bulk_scrape([url])

return scraped[0] if scraped else None

def bulk_scrape(self, urls: List[str]) -> List[dict]:
"""
Return scraped data from a list of valid URLs
Parameters:
+ urls : [str] = urls whose data is to be scraped.
Should be compatible with the provided spider
Return:
list of dict like object with scraped data
"""

# if nothing to do, just return an empty list
if not urls:
return []

# Items accumulator
items = []

# callback function to collect items on the fly
def items_scrapped(item, response, spider):
items.append({"url": response._url, "data": item})

# set up urls to scrape
self.spider.start_urls = urls

# create crawler for this spider, connect signal so we can collect items
crawler = self.process.create_crawler(self.spider)
crawler.signals.connect(items_scrapped, signal=scrapy.signals.item_scraped)

# start scraping
self.process.crawl(crawler)
self.process.start()
Member:

This is not a blocking call, is it?

If it isn't, I'm not sure about this API. The docstring says that you return the scraped data, but start on the crawler process is intended as an async pattern and you need to use the join method. But maybe I'm wrong.


# return post-processed scraped objects
return items
Empty file added src/scraper/spiders/__init__.py
Empty file.
73 changes: 73 additions & 0 deletions src/scraper/spiders/el_pitazo.py
@@ -0,0 +1,73 @@
# Internal imports
import scraper.utils as utils

# External imports
import scrapy

# Python imports
from typing import List


class ElPitazoSpider(scrapy.Spider):
"""
Spider to scrape ElPitazo data
"""

name = "el_pitazo"

start_urls = []

def parse(self, response):
"""
Returns a dict-like structure with the following
fields:
+ title
+ date
+ categories
+ body
+ author
+ tags
"""

# These are simple properties; just get their text with a valid
# selector
title = utils.get_element_text(".tdb-title-text", response) or ""
date = utils.get_element_text(".entry-date", response) or ""
author = utils.get_element_text(".tdb-author-name", response) or ""

body = self._get_body(response)

tags = self._get_tags(response)

# categories
categories = response.css(".tdb-entry-category").getall()
categories = list(map(utils.strip_http_tags, categories))

return {
Member:

This is related to my previous comment about using typed objects for outputs. Or more precisely, a hierarchy of DataClasses.

"title": title,
"date": date,
"categories": categories,
"body": body,
"author": author,
"tags": tags,
}

def _get_body(self, response) -> str:
"""
Get article body as a single string
"""
body = response.css("#bsf_rt_marker > p").getall()
body = filter(lambda p: p.startswith("<p>") and p.endswith("</p>"), body)
body = map(utils.strip_http_tags, body)

body = "\n".join(body)
Member:

Logic like this is a good candidate for a computed property on the data class. That will let you reuse this little bit of code easily. I know it is a small amount of code, but once you are scraping 100 sources, one line of repeated code becomes 100 lines of scattered repeated code. And if you need to change it, that will be a pain.
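A minimal sketch of the computed property being suggested, on a hypothetical frozen DataClass (names are assumptions):

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ArticleData:
    title: str
    paragraphs: Tuple[str, ...]  # raw paragraphs as scraped

    @property
    def body(self) -> str:
        # shared post-processing lives here instead of in every spider
        return "\n".join(self.paragraphs).strip()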


return body.strip()

def _get_tags(self, response) -> List[str]:
"""
Try to get tags from document if available
"""
tags = response.css(".tdb-tags > li > a").getall()
tags = list(map(utils.strip_http_tags, tags))
return tags