More consistency in how we write scrapers #572

Open
chrismytton opened this issue Feb 6, 2017 · 1 comment


@chrismytton
Contributor

Problem

There are still some inconsistencies between scrapers: sometimes we build up a big hash of data and do a single database insert; sometimes we call #to_h on a scraper subclass in a loop. Sometimes we merge default data in when we save to the database; sometimes we add a field to the class. It would be good to eliminate more of these differences.

Proposed Solution

The main place we currently see inconsistencies is scraper.rb, where we tie the scraper classes together and insert the data into the scraper database. Adding a new class that encapsulates the common scraping patterns we use (e.g. scrape a list of people, then scrape the individual member URLs) should reduce scraper.rb to a single class instantiation and method call:

EveryPolitician::Scraper::IndexToMembers.new(
  url:           'http://www.cdep.ro/pls/parlam/structura2015.de?leg=2012&idl=2',
  members_class: MembersPage,
  member_class:  MemberPage,
  default_data:  { term: 2012 }
).run
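
A minimal sketch of how such a runner might work, assuming (purely for illustration) that the members class exposes the list of member page URLs and each member class responds to #to_h. The constructor keywords follow the example above, but the real implementation may differ:

```ruby
# Hypothetical sketch of the proposed runner (names taken from the
# example above; the real class may look quite different).
module EveryPolitician
  module Scraper
    class IndexToMembers
      def initialize(url:, members_class:, member_class:, default_data: {})
        @url = url
        @members_class = members_class
        @member_class = member_class
        @default_data = default_data
      end

      # Scrape the index page for member URLs, then scrape each member
      # page, merging the default data into every row. Returns an array
      # of hashes ready for a single database insert.
      def run
        index = @members_class.new(url: @url)
        index.member_urls.map do |member_url|
          @default_data.merge(@member_class.new(url: member_url).to_h)
        end
      end
    end
  end
end
```

With this shape, scraper.rb stays a single instantiation-plus-run call, and all page-specific logic lives in the two page classes.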

Acceptance Criteria

When writing a standard official-site scraper, the scraper.rb file should contain no custom code and should require little thought to write, test or review.

Not Required

This doesn't need to handle every scraper interaction ever. The initial version can just handle scraping an index list of members and then the individual member pages, which is the pattern ~80% of our official-site scrapers follow.
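
For that common case, the two page classes involved can stay tiny. A hedged sketch (class names and the html: keyword are illustrative assumptions, and plain string scanning stands in for a real HTML parser such as Nokogiri):

```ruby
# Index page: knows how to find the URLs of the individual member pages.
class MembersIndexPage
  def initialize(html:)
    @html = html
  end

  def member_urls
    # Naive link extraction for illustration only; a real scraper would
    # use an HTML parser rather than a regex.
    @html.scan(%r{href="([^"]*/member/[^"]*)"}).flatten
  end
end

# Member page: knows how to turn one member's page into a hash.
class SingleMemberPage
  def initialize(html:)
    @html = html
  end

  def to_h
    { name: @html[%r{<h1>([^<]+)</h1>}, 1] }
  end
end
```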

Prerequisites

None.

Prior art

Related Issues

Due Date

As with #571, having this soon will let us speed up the rate at which we can get through scrapers and reviews.

Design?

None.

Text?

We'll need to document this somewhere, but that should be done with README Driven Development as we go.

Bloggable?

Possibly!

@chrismytton
Contributor Author

I've done a prototype of an Everypolitician::Scraper class in the Romanian scraper (everypolitician-scrapers/romanian-parliament#8), and the same code is running in the Japan scraper, which is a slightly unusual one (everypolitician-scrapers/japan#4).

Those prototypes only encapsulate the steps a scraper takes to combine the various classes it contains into an array of hashes for inserting into morph's database. They don't handle anything to do with tracking scraper runs (#574), or with abstracting away the storage mechanism.
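
That boundary can be sketched in code: the scraper side produces an array of hashes, and storage is one call at the end. Everything here is an illustrative stand-in (an in-memory store and placeholder rows), not morph's real API, which would be something like the scraperwiki gem's ScraperWiki.save_sqlite:

```ruby
# Illustrative stand-in for the storage layer the prototypes
# deliberately leave out of scope.
class InMemoryStore
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Single insert of a whole run's output.
  def save_all(data)
    @rows.concat(data)
  end
end

# The shape the prototypes produce: one hash per member, with any
# default data (e.g. the term) already merged in. Names are placeholders.
rows = [
  { id: 'member-1', name: 'Example Member One', term: 2012 },
  { id: 'member-2', name: 'Example Member Two', term: 2012 },
]

store = InMemoryStore.new
store.save_all(rows)
```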

Next steps

  • Write a README for the Everypolitician::Scraper class
  • Create a new RubyGem to drop the docs, tests and code into
  • Test out the new library with a few different scrapers (the Romanian and Japanese ones are probably good starting points)
