Those prototypes only encapsulate the steps a scraper takes to combine the various classes it contains into an array of hashes for inserting into morph's database. They don't deal with tracking scraper runs (#574) or with abstracting away the storage mechanism.
Next steps
Write a README for the Everypolitician::Scraper class
Create a new RubyGem to drop the docs, tests and code into
Test out the new library with a few different scrapers (the Romanian and Japanese ones are probably good starting points)
Problem
There are still some inconsistencies between scrapers. Sometimes we build up a big hash of data and do a single database insert, sometimes we call #to_h on a scraper subclass in a loop. Sometimes we merge default data in when we save to the database, sometimes we add a field to the class. It would be good if we could eliminate more of the differences.
Proposed Solution
The main place where we're currently seeing inconsistencies is in scraper.rb, where we tie together the scraper classes and insert the data into the scraper database. Adding a new class which encapsulated the common scraping patterns we use, e.g. scrape a list of people, then scrape individual urls, should help make scraper.rb just a single class instantiation and method call.
Acceptance Criteria
When writing a standard official-site scraper the scraper.rb file should not contain any custom code and should not require much thought to write, test or review.
Not Required
This doesn't need to handle every scraper interaction ever. The initial version can just handle scraping an index list of members and the individual member pages, which is what ~80% of our official site scrapers use.
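To make the "index list, then individual member pages" pattern concrete, here is a minimal sketch of what such a class might look like. Everything named in it is an assumption made for illustration: IndexThenMembersScraper, the member_urls method, the fetcher and default_data parameters, and the example.org URLs are all hypothetical, and the issue does not specify an interface. The only convention borrowed from the description is that page classes respond to #to_h. Fetching is injected as a callable so the sketch runs without network access.

```ruby
# Hypothetical sketch: none of these names come from an existing gem.
class IndexThenMembersScraper
  # index_url:    URL of the page that lists members
  # index_class:  class whose instances respond to #member_urls
  # member_class: class whose instances respond to #to_h
  # fetcher:      callable that takes a URL and returns the page body
  # default_data: fields merged into every row (scraped values win on conflict)
  def initialize(index_url:, index_class:, member_class:, fetcher:, default_data: {})
    @index_url = index_url
    @index_class = index_class
    @member_class = member_class
    @fetcher = fetcher
    @default_data = default_data
  end

  # Returns an array of hashes, one per member page.
  def to_a
    index = @index_class.new(url: @index_url, body: @fetcher.call(@index_url))
    index.member_urls.map do |url|
      member = @member_class.new(url: url, body: @fetcher.call(url))
      @default_data.merge(member.to_h)
    end
  end
end

# Stand-in page classes; a real scraper would parse HTML here.
MembersPage = Struct.new(:url, :body, keyword_init: true) do
  def member_urls
    body.split("\n")
  end
end

MemberPage = Struct.new(:url, :body, keyword_init: true) do
  def to_h
    { id: url.split('/').last, name: body, source: url }
  end
end

# Canned responses standing in for HTTP requests.
PAGES = {
  'http://example.org/members' => "http://example.org/members/1\nhttp://example.org/members/2",
  'http://example.org/members/1' => 'Alice',
  'http://example.org/members/2' => 'Bob'
}.freeze

# With the pattern encapsulated, this is roughly all scraper.rb would hold.
rows = IndexThenMembersScraper.new(
  index_url: 'http://example.org/members',
  index_class: MembersPage,
  member_class: MemberPage,
  fetcher: PAGES.method(:fetch),
  default_data: { term: '2016' }
).to_a
```

In this sketch the default data is merged in one place and every row comes from #to_h, which are two of the inconsistencies listed under Problem. Saving rows to the database is deliberately left out.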
Prerequisites
None.
Prior art
Related Issues
Due Date
As with #571, having this soon will let us get through scrapers and reviews faster.
Design?
None.
Text?
We'll need to document this somewhere, but that should be done with README Driven Development as we go.
Bloggable?
Possibly!