More consistency in how we write scrapers #572

Open
chrismytton opened this issue Feb 6, 2017 · 1 comment


@chrismytton
Contributor

Problem

There are still some inconsistencies between scrapers: sometimes we build up a big hash of data and do a single database insert; sometimes we call #to_h on a scraper subclass in a loop. Sometimes we merge default data in when we save to the database; sometimes we add a field to the class. It would be good to eliminate more of these differences.

Proposed Solution

The main place we currently see inconsistencies is scraper.rb, where we tie the scraper classes together and insert the data into the scraper database. Adding a new class that encapsulates the common scraping patterns we use (e.g. scrape a list of people, then scrape the individual member URLs) should reduce scraper.rb to a single class instantiation and method call:

EveryPolitician::Scraper::IndexToMembers.new(
  url:           'http://www.cdep.ro/pls/parlam/structura2015.de?leg=2012&idl=2',
  members_class: MembersPage,
  member_class:  MemberPage,
  default_data:  { term: 2012 }
).run
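
A minimal sketch of how such a runner might work, assuming (purely for illustration) that the members class exposes the list of member page URLs and each member class responds to #to_h. The constructor keywords follow the example above, but the real implementation may differ:

```ruby
# Hypothetical sketch of the proposed runner (names taken from the
# example above; the real class may look quite different).
module EveryPolitician
  module Scraper
    class IndexToMembers
      def initialize(url:, members_class:, member_class:, default_data: {})
        @url = url
        @members_class = members_class
        @member_class = member_class
        @default_data = default_data
      end

      # Scrape the index page for member URLs, then scrape each member
      # page, merging the default data into every row. Returns an array
      # of hashes ready for a single database insert.
      def run
        index = @members_class.new(url: @url)
        index.member_urls.map do |member_url|
          @default_data.merge(@member_class.new(url: member_url).to_h)
        end
      end
    end
  end
end
```

With this shape, scraper.rb stays a single instantiation-plus-run call, and all page-specific logic lives in the two page classes.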

Acceptance Criteria

When writing a standard official-site scraper, the scraper.rb file should contain no custom code and should require little thought to write, test or review.

Not Required

This doesn't need to handle every scraper interaction ever. The initial version can just handle scraping an index list of members and then the individual member pages, which is the pattern ~80% of our official-site scrapers follow.
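
For that common case, the two page classes involved can stay tiny. A hedged sketch (class names and the html: keyword are illustrative assumptions, and plain string scanning stands in for a real HTML parser such as Nokogiri):

```ruby
# Index page: knows how to find the URLs of the individual member pages.
class MembersIndexPage
  def initialize(html:)
    @html = html
  end

  def member_urls
    # Naive link extraction for illustration only; a real scraper would
    # use an HTML parser rather than a regex.
    @html.scan(%r{href="([^"]*/member/[^"]*)"}).flatten
  end
end

# Member page: knows how to turn one member's page into a hash.
class SingleMemberPage
  def initialize(html:)
    @html = html
  end

  def to_h
    { name: @html[%r{<h1>([^<]+)</h1>}, 1] }
  end
end
```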

Prerequisites

None.

Prior art

Related Issues

Due Date

As with #571, having this soon will let us speed up the rate at which we can get through scrapers and reviews.

Design?

None.

Text?

We'll need to document this somewhere, but that should be done with README Driven Development as we go.

Bloggable?

Possibly!

@chrismytton
Contributor Author

I've done a prototype of an Everypolitician::Scraper class in the Romanian scraper (everypolitician-scrapers/romanian-parliament#8), and the same code is running in the Japan scraper, which is a slightly unusual one (everypolitician-scrapers/japan#4).

Those prototypes only encapsulate the steps a scraper takes to combine the various classes it contains into an array of hashes for inserting into morph's database. They don't handle anything to do with tracking scraper runs (#574), or with abstracting away the storage mechanism.
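
That boundary can be sketched in code: the scraper side produces an array of hashes, and storage is one call at the end. Everything here is an illustrative stand-in (an in-memory store and placeholder rows), not morph's real API, which would be something like the scraperwiki gem's ScraperWiki.save_sqlite:

```ruby
# Illustrative stand-in for the storage layer the prototypes
# deliberately leave out of scope.
class InMemoryStore
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Single insert of a whole run's output.
  def save_all(data)
    @rows.concat(data)
  end
end

# The shape the prototypes produce: one hash per member, with any
# default data (e.g. the term) already merged in. Names are placeholders.
rows = [
  { id: 'member-1', name: 'Example Member One', term: 2012 },
  { id: 'member-2', name: 'Example Member Two', term: 2012 },
]

store = InMemoryStore.new
store.save_all(rows)
```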

Next steps

  • Write a README for the Everypolitician::Scraper class
  • Create a new RubyGem to drop the docs, tests and code into
  • Test out the new library with a few different scrapers (the Romanian and Japanese ones are probably good starting points)
