5 min tutorial on webscraping at the Charlotte R Users Group

benporter/rvest-webscraping

rvest-webscraping

Code from the 5 min tutorial on webscraping at the Charlotte R Users Group on 3/21/2016

Resources

Getting Started

If the rvest package is not installed already, then run the following. This only needs to be run once on your machine.

install.packages("rvest")

Then load the rvest package, which must be done in every new R session.

library(rvest)
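
The two steps above can also be combined so that a script installs rvest only when it is missing. This is a minimal sketch using base R's requireNamespace(), not part of the original tutorial:

```r
# Install rvest only when it is not already available, then attach it.
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest")
}
library(rvest)
```

This keeps the script safe to re-run without reinstalling the package each time.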

View the available demos within the rvest package:

demo(package = "rvest")
#produces the following list
Demos in package ‘rvest’:
tripadvisor                          Scrape review data from tripadvisor
united                               Scrape mileage details from united.co
zillow                               Scrape housing info from tripadvisor

Then run one of the demos:

demo(package = "rvest", topic = "tripadvisor")

Example: Scraping an HTML table from Second Harvest Food Bank

1 - Scrape the "Complete Location Listings" table using the html_table() function.

locations_page <- read_html("http://www.secondharvestmetrolina.org/agencies/Get-Food-Assistance")

locations_page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table() %>%
  head()
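
Because the live page may change or be unavailable, the same pattern can be tried offline. The sketch below assumes a small made-up table passed to read_html() as a string; the agency names and cities are hypothetical and do not come from the Second Harvest site:

```r
library(rvest)

# Hypothetical stand-in for the real page: a tiny inline document with one table.
sample_page <- read_html('
  <table>
    <tr><th>Agency</th><th>City</th></tr>
    <tr><td>Pantry A</td><td>Charlotte</td></tr>
    <tr><td>Pantry B</td><td>Concord</td></tr>
  </table>')

# html_nodes() returns every matching node, so pick the first table explicitly.
tables <- sample_page %>% html_nodes("table")

# html_table() converts a single table node into a data frame.
sample_table <- tables[[1]] %>% html_table()
sample_table
```

Selecting the table by position with [[1]] works here because there is only one table; on a real page it is worth checking length(tables) first.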

2 - Pull the location data that stores the latitudes and longitudes, one field at a time, using the html_attr() function.

title <- locations_page %>%
  html_nodes(".xmp-location-listing") %>%
  html_attr("data-title")

latitude <- locations_page %>%
  html_nodes(".xmp-location-listing") %>%
  html_attr("data-latitude")

longitude <- locations_page %>%
  html_nodes(".xmp-location-listing") %>%
  html_attr("data-longitude")

description <- locations_page %>%
  html_nodes(".xmp-location-listing") %>%
  html_attr("data-description")

id <- locations_page %>%
  html_nodes(".xmp-location-listing") %>%
  html_attr("data-id")

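Assemble the columns into a data frame.

# stringsAsFactors = FALSE keeps the scraped text as character columns
df <- data.frame(title, latitude, longitude, description, id, stringsAsFactors = FALSE)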
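
Note that html_attr() always returns character strings, so the coordinates need converting before they can be mapped or used in calculations. The sketch below illustrates this on hypothetical inline markup that imitates the listing elements; the titles and coordinate values are made up:

```r
library(rvest)

# Hypothetical markup imitating the listing elements; the values are made up.
sample_page <- read_html('
  <div class="xmp-location-listing" data-title="Pantry A"
       data-latitude="35.2271" data-longitude="-80.8431"></div>
  <div class="xmp-location-listing" data-title="Pantry B"
       data-latitude="35.4088" data-longitude="-80.5795"></div>')

# Select the listing nodes once and reuse them for every attribute.
listings <- sample_page %>% html_nodes(".xmp-location-listing")

sample_df <- data.frame(
  title     = listings %>% html_attr("data-title"),
  latitude  = as.numeric(listings %>% html_attr("data-latitude")),
  longitude = as.numeric(listings %>% html_attr("data-longitude")),
  stringsAsFactors = FALSE
)
str(sample_df)
```

Storing the selected nodes in one variable also avoids repeating the same html_nodes() call for each field.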

3 - Select nodes with an XPath expression instead of a CSS selector:

locations_page %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "findLoc", " " ))] | //td') %>%
  html_text() %>%
  head()
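
The XPath expression above matches any element whose class list contains findLoc, plus every td cell. As a rough comparison, the CSS selector ".findLoc, td" should select the same nodes. The sketch below checks this on a small hypothetical document, not on the real page:

```r
library(rvest)

# Hypothetical document: one element with the findLoc class and one table row.
sample_page <- read_html('
  <p class="findLoc intro">Find a location</p>
  <table><tr><td>Pantry A</td><td>Charlotte</td></tr></table>')

# XPath form, as used in the tutorial.
via_xpath <- sample_page %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "findLoc", " "))] | //td') %>%
  html_text()

# CSS form that targets the same class and the same cells.
via_css <- sample_page %>%
  html_nodes(".findLoc, td") %>%
  html_text()

# TRUE when both selectors return the same text in the same order.
identical(via_xpath, via_css)
```

The CSS form is shorter to write; XPath is useful when a selection cannot be expressed in CSS, such as matching on text content.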
