Downloading Web Data Without Scraping

Michelle Minkoff and Scott Klein, NICAR Denver 2016

Software to Install

Google Chrome: Internet Explorer and Firefox both have excellent developer tools, but the Web Inspector in Chrome will be the basis of our examples. Some examples use Firefox extensions, but if you don’t have Firefox, no worries.

JSONView Chrome Extension: http://goo.gl/njpwwh

CSVKit: http://csvkit.readthedocs.org/en/latest/index.html

Requires Python, which many Mac and Linux systems already include. On Windows, use Anthony DeBarros’s great guide to install it: http://goo.gl/Nm90gD. The workshop originally recommended Python 2.7.x, but current CSVKit releases run on Python 3.

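Google Spreadsheets: You can import an HTML table directly by typing =ImportHTML("url", "elementtype", index), where elementtype is "table" or "list" and index is the position of that element on the page, counting from 1. Type the quotation marks as straight quotes; curly quotes will break the formula.

Ex: =ImportHTML("http://www.atlanta.k12.ga.us/Page/832", "table", 2)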
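For anyone curious what ImportHTML is doing behind the scenes, here is a minimal Python sketch of the same idea: find every table in a page and return the one at a given position. It uses only the standard library and a made-up sample page; the names TableExtractor and import_html_table are our own for illustration, and the sketch does not handle nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return the rows of the index-th table (1-based, like ImportHTML)."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up sample page with two tables; a real page would be downloaded first.
SAMPLE = """
<table><tr><td>ignore me</td></tr></table>
<table>
  <tr><th>School</th><th>Students</th></tr>
  <tr><td>North High</td><td>812</td></tr>
</table>
"""

rows = import_html_table(SAMPLE, 2)
for row in rows:
    print(row)
```

As in the spreadsheet formula, the index counts tables from 1, so 2 picks the second table on the page.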

Tools we’re using

Open Refine: http://openrefine.org/

A tool for inspecting and manipulating spreadsheet files. It lets you run queries on the data to reshape it the way you would like.

Scraper: https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd?hl=en

Click on a type of information (names, emails, URLs, etc.), then right-click (control-click) and choose Scrape Similar. There’s an option to bring the results into a Google spreadsheet.

Tabula: http://tabula.nerdpower.org/

Tabula turns tabular PDF data into tables. Free software from the Knight-Mozilla OpenNews project.

DownThemAll: https://addons.mozilla.org/en-US/firefox/addon/downthemall/

A Firefox extension that detects what types of assets you might want to download on a page and allows you to download them, well, all. Doesn’t work for everything, but a good quick one to try.
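If you ever want to do something similar by hand, the first step DownThemAll takes can be approximated in a few lines of Python: scan a page for links and keep the ones that point at the file types you care about. This is only a rough sketch using the standard library; the sample page, the example.com address, and the names AssetLinkCollector and find_assets are made up for illustration, and this is not how the extension itself is implemented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AssetLinkCollector(HTMLParser):
    """Gather every href and src value on a page as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


def find_assets(html, base_url, extensions):
    """Return the linked URLs whose path ends in one of the extensions."""
    collector = AssetLinkCollector(base_url)
    collector.feed(html)
    wanted = tuple(ext.lower() for ext in extensions)
    return [url for url in collector.links
            if urlparse(url).path.lower().endswith(wanted)]


# Made-up sample page; a real page would be downloaded first.
SAMPLE = """
<a href="/reports/2015.pdf">2015 report</a>
<a href="about.html">About</a>
<img src="images/skyline.JPG">
"""

assets = find_assets(SAMPLE, "http://example.com/data/", [".pdf", ".jpg"])
for url in assets:
    print(url)
```

The sketch only lists the matching files. Passing each URL to urllib.request.urlretrieve would be one way to actually download them.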

OutwitHub: https://addons.mozilla.org/en-us/firefox/addon/outwit-hub/

Breaks a page down into the elements it is made of. Grab all the pictures on a page at once, or create a scraper using various HTML elements as start and end points. It is quite powerful if you learn how to use it well, and more robust than DownThemAll.

Import.io (like OutwitHub, but less code required): https://import.io/

Breaks a page down into the elements it is made of. Click on elements to "train" the program as to what you want in a certain column. It can crawl through multiple paginated results. Solid tutorial here: http://www.interhacktives.com/2014/03/06/scrape-data-without-coding-step-step-tutorial-import-io/

JQ: http://stedolan.github.io/jq/download/

Helpful command line utility for working with JSON files.
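To give a feel for the kind of question jq answers, here is a small Python sketch that does roughly what the jq filter `.projects[] | select(.state == "CO") | .name` would do on a JSON file. The data and its field names (projects, name, state, amount) are invented for illustration and do not come from any of the example sites.

```python
import json

# Made-up JSON; in practice this would be read from a downloaded file.
RAW = """
{"projects": [
  {"name": "Bridge repair", "state": "CO", "amount": 120000},
  {"name": "School upgrade", "state": "GA", "amount": 80000},
  {"name": "Transit study", "state": "CO", "amount": 45000}
]}
"""

data = json.loads(RAW)

# Keep only the Colorado projects and pull out their names.
colorado_names = [p["name"] for p in data["projects"] if p["state"] == "CO"]

# Add up one field across every record.
total = sum(p["amount"] for p in data["projects"])

print(colorado_names)
print(total)
```

jq does the same filtering in a single line at the command prompt, which is why it is handy for a quick look at a file before deciding whether you need a script at all.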

JSON Viewer: http://jsonviewer.stack.hu/

Example Sites

When grabbing data, please make sure you have permission to take the information you are grabbing. Just because you can take photos off a site doesn’t mean you can use them for your project. With great power comes great responsibility!

Denver elected officials (for the HTML table): https://www.denvergov.org/content/denvergov/en/denver-elections-divison/voter-election-information/current-elected-officials.html#8eef4c9d-881f-402e-bb77-59dc06991b05
Denver university list (for Scraper): http://www.hometodenver.com/stats_denver.htm
Denver image search: https://www.google.com/search?q=denver&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiL2p_6qbTLAhWBPCYKHRCfC0EQ_AUIBygB&biw=1440&bih=801
ProPublica’s Recovery Tracker: http://projects.propublica.org/recovery/
ProPublica’s Intern Lawsuits Tracker: http://projects.propublica.org/graphics/intern-suits
White House Recovery Act Projects: https://www.whitehouse.gov/anniversary-map
