Skip to content

Latest commit

 

History

History
489 lines (400 loc) · 24.8 KB

ScrapeCandidates.md

File metadata and controls

489 lines (400 loc) · 24.8 KB

Women_Candidates

Jenny Listman 2/5/2018

Interactive App To Visualize Women Runnning for State or Federal Offices

This project was created with data scraped from the website of the Center for American Women in Politics at Rutgers University. The Center conducts research on women's participation in politics in America and maintains a list of women who are or potentially will be running for US Congress and State offices.

The code scrapes and combines data from their lists of women candidates, cleans and creates new variables, and visualizes the data in an interactive shiny map and searchable data table using plotly and DT packages . The CAWP states that their products or data are available for non-for-profit distribution as long as they are given credit (we are allowed to scrape their data table - thank you CAWP).

See the app in action here. Read a blog about it here

The design of the app was inspired by the adavis app from Gokhan Ciflikli

Load packages needed:

library(rvest)
library(tidyverse)
library(janitor)
library(shiny)
library(plotly)
library(filesstrings)
library(zoo)
library(ggpubr)
library(DT)

To turn the website into data:

  1. Specify and read in the url for website with tables to be scraped.
  2. Get HTML nodes with rvest::html_nodes and
  3. parse table contents into dataframes with rvest::html_table.
url <- 'http://cawp.rutgers.edu/buzz-2018-potential-women-candidates-us-congress-and-statewide-elected-executive'

webpage <- read_html(url)

candidate_tables <- webpage %>%
        html_nodes("table") %>%
        html_table(fill = TRUE)

Viewing the tables we've got a list with 2 elements corresponding to 2 of the 3 tables on the website. Looking at the website, we can see that candidate_tables[[1]] corresponds to special elections and candidate_tables[[2]] corresponds to general elections. This website has changed its format recently, so the number of tables has to be varified before updating app.

glimpse(candidate_tables[[1]])
## Observations: 9
## Variables: 8
## $ State                    <chr> "AZ", "", "", "", "", "OH", "", "", ""
## $ Office                   <chr> "", "U.S. Rep.", "U.S. Rep.", "U.S. R...
## $ Dist.                    <int> NA, 8, 8, 8, NA, NA, 12, 12, 12
## $ `Candidate Name & Party` <chr> "", "Brianna Westbrook (D)", "Hiral V...
## $ `Seat Status`            <chr> "", "O", "O", "O", "", "", "O", "O", "O"
## $ `Filing\n\t\t\tDate`     <chr> "", "Filed", "Filed", "Filed", "", ""...
## $ `Primary\n\t\t\tDate`    <chr> "", "Lost", "WON", "WON", "", "", "5/...
## $ `Election\n\t\t\tDate`   <chr> "", "", "Lost", "WON", "", "", "8/7/2...
glimpse(candidate_tables[[2]])
## Observations: 859
## Variables: 8
## $ State                    <chr> "AK", "", "", "", "AL", "", "", "", "...
## $ Office                   <chr> "", "Lt. Gov.", "Lt. Gov.", "", "", "...
## $ Dist.                    <chr> "", "", "", "", "", "", "", "", "", "...
## $ `Candidate Name & Party` <chr> "", "Lynn Gattis (R)", "Edie Grunwald...
## $ `General\n\t\t\tSeat`    <chr> "", "C", "C", "", "", "I", "C", "O", ...
## $ `Filing\n\t\t\tDate`     <chr> "", "6/1/2018", "6/1/2018", "", "", "...
## $ `Primary\n\t\t\tDate`    <chr> "", "8/21/2018", "8/21/2018", "", "",...
## $ `Election\n\t\t\tDate`   <chr> "", "11/6/2018", "11/6/2018", "", "",...

The special elections and general elections dataframes are formatted differently, so must be made compatible before combining. Change Dist. variable from numeric in the special elections table to character to match the general elections table. Use janitor::clean_names to remove spaces and special characters from variable names.

special_elec <- candidate_tables[[1]] %>%
        mutate(Dist. = as.character(Dist.)) %>% 
        mutate(Dist. = replace(Dist., Dist. == "8", "08")) %>% 
        clean_names() 

Combine the special elections and general elections dataframes and then clean things up. First rename one variable that has different names in each of these dataframes.

On the CAWP website database, state abbreviation appears once, one row above all entries for a given state. Spaces below it in that column are blank, until data for the next state appears. Use zoo::na.locf to replace blanks below a state abbreviation with the abbreviation above. In order for this function to work, the blanks first need to be changed to NA.

Remove special characters such as * with filesstrings::trim_anything and remove all blank rows.

Remove candidates who lost their Party's Primary. This step was added in after Primaries began.

Remove candidates whose election date has passed, whether they won or lost. This step was added in after special elections began.

cols <- c(1,2)

elections <- candidate_tables[[2]] %>%
        clean_names() %>%
        rename(seat_status = general_seat) %>%
        bind_rows(special_elec) %>%
        mutate(state = replace(state, state == "", NA)) %>%
        transform(state = na.locf(state)) %>%
        subset(primary_date != "Lost") %>%
        subset(election_date != "WON") %>%
        subset(election_date != "Lost") %>%
        subset(office != "") %>% 
        mutate(office = trim_anything(office, "*", "right")) %>%
        mutate_each_(funs(factor(.)),cols)

Rename offices from abbreviations to full names. Some are not self-explanatory in their current form (P.S.C). Some offices have more than one name/spelling/punctuation formatting (Comp, Comp., Comtr.) or have been misspelled (Govenor).

These will have to be updated periodically as variants are introduced to the database. Every time the website is re-scraped, check the levels for the variable elections$office to identify additional misspellings or abbreviations that have been introduced and then add them to the next code block.

levels(elections$office)

elections$office <- fct_recode(elections$office,
                   `Corporation Commissioner` = "Corp. Comm.",     
                   `Commissioner of Agriculture` = "Agriculture", 
                   `Commissioner of Agriculture` = "Comm. Agri.",
                   `Attorney General` = "At. Gen.",
                   `Attorney General`= "Atty. Gen.",
                   `State Auditor` =  "Auditor", 
                   `State Auditor` =  "st. Auditor", 
                   `State Auditor` = "St. Aud.", 
                   `State Auditor` = "St. Auditor",
                   `Chief Financial Officer` = "CFO",  
                   Comptroller = "Comp", 
                   Comptroller = "Comp.", 
                   Comptroller = "Comptr.",
                   Governor = "Govenor", 
                   Governor = "Governor",
                   `Insurance Commissioner` = "Insurance Comm.", 
                   `Insurance Commissioner` = "Insur. Comm.",
                   `Insurance Commissioner` = "Ins. Comm.",
                   `Labor Commissioner` = "Labor Comm.",
                   `Land Commissioner` = "Land",
                   `Land Commissioner` = "Land Comm.",
                   `Lieutenant Governor` = "Lt. Gov.", 
                   `Lieutenant Governor` = "Lt. Gov",  
                   `Lieutenant Governor` = "Lt. Governor",
                   `Public Service Commissioner` = "P.S.Comm.",
                   `Public Service Commissioner` = "P.S.C.",
                   `Railroad Commissioner` = "Rail. Comm.",
                   `Superintendent of Public Instruction` = "S.P.I.",
                   `Superintendent of Public Instruction` = "Sup.Pub.Inst.", 
                   `Superintendent of Public Instruction` = "Sup.Pub.Instr.",
                   `Superintendent of Public Instruction` =  "Sup. Pub. Instr.",
                   `Superintendent of Public Instruction` = "Sup. Public Instr.",
                   `Superintendent of Public Instruction` = "Sup.Pub. Instr.",
                   `Secretary of State` = "Sec. St.",
                   `State Treasurer` = "Treasurer",
                   `State Treasurer` = "St. Treas.",
                   `Tax Commissioner` = "Tax Comm.", 
                   `US Delegate` = "U.S. Del", 
                   `US Delegate` = "U.S. Del.",
                   `US Congressional Representative` = "U.S. Rep", 
                   `US Congressional Representative` = "U.S. Rep.", 
                   `US Congressional Representative` = "U.S Rep", 
                   `US Congressional Representative` = "U S. Rep",
                   `US Congressional Representative` = "U.S Rep.",
                   `US Congressional Representative` = "U. S. Rep.", 
                   `US Congressional Senator` = "U.S. Sen.", 
                   `US Congressional Senator` = "U.S. Senator")  

Candidate name and party affiliation are a single character variable. Separate these into two variables using filestrings::str_elem and base::trimws.

Some candidates in Minnesota have party listed as DFL. According to Wikipedia: Minnesota Democratic–Farmer–Labor Party (DFL) is a social liberal political party in Minnesota affiliated with the Democratic Party. Recode these candidates as Democrats.

Territories that elect US Delegates and states with a small population that have one Federal Representative are listed as having district AL, to mean At-Large. Change district AL to At-Large. State-level offices have blanks for district variable. Change these to State.

elections <- elections %>%
        mutate(party = str_elem(candidate_name_party, -2)) %>%
        mutate(party = replace(party, party == "L", "D")) %>% 
        mutate(party = as.factor(party)) %>%
        mutate(candidate = trimws(str_before_first(candidate_name_party,"\\("), 
                                  which = "both")) %>%
        mutate(candidate = gsub("(^\\s+)|(\\s+$)", "", candidate)) %>%
        mutate(dist = replace(dist, dist == "", "State")) %>% 
        mutate(dist = replace(dist, dist == "AL", "At-Large")) %>%
        mutate(dist = as.factor(dist)) %>%
        droplevels()

Make new variables for mapping and other possible visualizations: # of women running in each state, # and % running as Republican, # and % running as Democrat, # running for a given office.

Some states or territories have no women running at this time. Combine dataframe with a list of all state names and abbreviations to add missing states as rows with values NA. Remove states with no state abbreviation, since those are territories that won't show up on the statebins - style map. Capitalize variable names that will be included in the DT tab of the shiny app.

statenames <- read.csv("StateNamesAbbrevPostalCode.csv")

elections <- elections %>%
        add_count(state) %>%
        rename("Women Running" = n) %>%
        add_count(state, office) %>%
        rename(officeperstate = n) %>% 
        add_count(state, party) %>%
        rename(partyperstate = n) %>%
        add_count(state, office, dist) %>%
        rename(officeperdist = n) %>%
        merge(statenames, by.x = "state", by.y = "Postal.Code", all = TRUE) %>%
        mutate(percentparty = round(100*(partyperstate/`Women Running`), digits = 0)) %>%
        mutate(percentdem = ifelse(party == "D", percentparty, 100-percentparty)) %>%
        mutate(percentrepub = 100-percentdem) %>%
        subset(!(Abbreviation %in% c("","Guam"))) %>%
        rename(Office = office,Candidate = candidate,Party = party, District = dist) %>%
        droplevels()

rvest::html_table takes the text from a website's table, but doesn't get hyperlinks to elements in the table. I scraped links to candidate websites to add them to the candidate name variable in the dataframe.

Not all candidates in the CAWP list have a link and not all candidates that have links are from states (if they are from territories). The corresponding names are (re)scraped and these will be used to match the links back to the correct candidate in the elections dataframe. Rows missing state abbreviations will be removed (again).

candidate_names <- webpage %>%
        html_nodes(xpath = "//td/a") %>% html_text()

candidate_links <- webpage %>%
        html_nodes(xpath = "//td/a") %>% html_attr("href")

names_w_links <- as.data.frame(cbind(candidate_names, candidate_links)) %>%
        subset(candidate_names != "") %>%
        subset(candidate_links != "#list")
        
elections <- elections %>%
        merge(names_w_links,
              by.x = "Candidate", by.y = "candidate_names", all = TRUE) %>%
        subset(State != "")

For candidates that do have a website to which CAWP links, mutate the elections$Candidate variable to combine urls with candidate names in HTML code, to render as clickable links in the shiny app. For candidates without a link, don't alter the elections$Candidate variable.

elections$candidate_links <- as.character(elections$candidate_links)

for (x in 1:length(elections$Candidate)){
ifelse(is.na(elections$candidate_links)[x], 
       (elections$Candidate[x] <- elections$Candidate[x]), 
       (elections$Candidate[x] <- paste0("<a href='", 
                                         elections$candidate_links[x], "'target='_blank'>",
                                         elections$Candidate[x],"</a>"))
        )
}

Save the file.

saveRDS(elections, "elections.rds")

Make a dataframe for the map: select a subset of needed variables and remove duplicates so there is only one row per state. Rename and recode variables to be readable for the map's hover text: change abbreviation state and full name State from factor to character variables and change NA values to 0.

statedata <- elections[,c(2,11,15,18,19)] %>%
        unique() %>%
        rename("Percent Democrat" = "percentdem") %>%
        rename("Percent Republican" = "percentrepub") %>%
        mutate(state = as.character(state)) %>%
        mutate(State = as.character(State)) %>%
        mutate_all(funs(replace(., is.na(.), 0)))

To change continuous dempercent variable to categorical for mapping, bin and label it. Currently, statebins is not fully compatible with plotly, resulting in a non-interactive map. To get around that, adapt code from Kenton Russell to combine dataframe with statebins coordinates data statebins:::state_coords and plot it with no lines, tickmarks, axis labels, or grid to make a plotly map with working interactive hover info.

Rearrange column order to create hover text variable that will be made with Map Reduce.

The default statebins coordinates place Alaska in the bottom left of the plot, just above Hawaii. Intuitively, it should be placed in the upper left corner of the plot, northwest of an imaginary Canada. Change Alaska's row variable to 1 in dataframe to move it.

breaks <- c(0,20,40,60,80,100)

labels <- c("1-20% Dem", "21-40% Dem", "41-60% Dem", "61-80% Dem", "81-100% Dem")

statedata <- statedata %>%
        mutate(bins = cut(statedata$`Percent Democrat`, breaks, 
                          include.lowest = T, right=FALSE, labels = labels)) %>%
        mutate(bins = as.character(bins)) %>%
        mutate(bins = as.factor(ifelse(`Women Running` == 0, "0 Women Running", bins))) %>%
        merge(statebins:::state_coords[,-2], by.x = "state", by.y = "abbrev") %>%
        .[,c(3,2,4,5,1,6:8)] 

vars <- Map(
        function(x, y) paste0(x, ": ", y),
        names(statedata)[1:(ncol(statedata)-4)],
        statedata[,1:(ncol(statedata)-4)]
)

statedata <- statedata %>%
        mutate(txt = Reduce(function(x, y) paste0(x, "<br />", y), vars))

statedata[1,8] <- 1

Save the file.

saveRDS(statedata, "statedata.rds")

Make color palette with gray to represent states with no women currently running and 5 colors ranging from the traditional American political party red (Republican) to blue (Democrat) to represent the bins for percent Republican/Democrat among women candidates per state.

mapcolors <- c("#808080",get_palette(c("#b2182b","#2166ac"), 5))

The full shiny app code, including code for the DT section of the app, can be found in the app.R and global.R files in this GitHub repo.

Make a plotly - based statebins - style map for use in a shiny app.

The working shiny app can be found here.

y_Axis <- list(title = "",
                zeroline = FALSE,
                showline = FALSE,
                showticklabels = FALSE,
                showgrid = FALSE
                )
                
x_Axis <- list(title = "Hover Over State for Details",
               zeroline = FALSE,
               showline = FALSE,
               showticklabels = FALSE,
               showgrid = FALSE
                )
                
hovermap<- (plot_ly(statedata, x = ~col, y = ~-row) %>%
        add_markers(color = ~bins,
                    colors = mapcolors,
                    text = ~txt,
                    symbol = I("square"),
                    size = I(35),
                    hoverinfo = "text"
                    ) %>%
        add_text(text = ~state, color = I("white"), hoverinfo = "none") %>%
        layout(showlegend = FALSE, xaxis = x_Axis, yaxis = y_Axis)
)
    
hovermap            

Shiny app

library(shiny)
library(plotly)
library(DT)
library(shinythemes)

elections <- readRDS("elections.rds")
statedata <- readRDS("statedata.rds")

mapcolors <- c("#808080", "#B2182B", "#8D2B4B", "#693F6B", "#45528B", "#2166AC")

columns <- c("State", "Candidate", "District", "Office", "Party")

shinyApp(
        
ui <- fluidPage(theme = shinytheme("yeti"),
        titlePanel("Women Running for 2018 Congressional or State Office"),
        tabsetPanel(type = "tabs",
                    tabPanel("Map", align="center", plotlyOutput("statebins", width = "900px", height = "600px")),
                    tabPanel("Searchable Table",
                             fluidPage(
                             column(4,
                                     selectizeInput("Party", "Choose Party",
                                           c("Democrat" = "D",
                                             "Republican" = "R",
                                             "All" = "All"
                                             ),
                                           multiple = TRUE,
                                           selected = "D",
                                           width = '200px'
                                           )),
                             column(4,
                             selectizeInput("State", "Choose State(s)",
                                            c("All"= "All", levels(elections$State)),
                                            multiple = TRUE,
                                            selected = "California",
                                            width = '200px'
                                            )),
                             column(12,
                             DTOutput("data" )))
                    ),
                    tabPanel("About",
                             fluidPage(
                                     column(10,
                                            h3("An App to Search for Women Running for 2018 US Congressional or State Office"),
                                            p("I created this project to visualize a database maintained by the",
                                              a("Center for American Women in Politics",
                                                href = "http://www.cawp.rutgers.edu", target = "_blank"), "at",
                                              a("Rutgers University.", 
                                                href = "https://www.rutgers.edu", target = "_blank"), 
                                              "The Center conducts research on women's participation in politics in America and maintains", 
                                              a("a list", 
                                                href = "http://cawp.rutgers.edu/buzz-2018-potential-women-candidates-us-congress-and-statewide-elected-executive",
                                                target = "_blank"),
                                              "of women potentially running for US Congress and State offices. The data are current as of 3/12/18."),
                                            p("You can read more about", a("how", href = "https://medium.com/@jblistman", target = "_blank"),
                                              "or", a("why", href = "https://medium.com/@jblistman/an-app-to-search-for-women-running-for-office-in-2018-346f5a013ec9", target = "_blank"), 
                                              "I created this app on my blog."),
                                            p(a("R", href = "https://www.r-project.org", target = "_blank"),
                                              "code and details of data processing and visualization are available on",
                                              a("GitHub.", icon("github"), href = "https://github.com/JListman/Scrape_WomenRunning_CAWP", target = "_blank"),
                                              "Find me, Jenny Listman, on", a("Twitter", icon("twitter"),
                                                      href = "https://twitter.com/jblistman", target = "_blank"), "or",
                                              a("LinkedIn", icon("linkedin"), 
                                                href = "https://www.linkedin.com/in/jenniferlistman/", target = "_blank")),
                                            HTML("<br><br><br>")
                                            )
                                     )
                             )
                    ),
        tags$div(class="footer", checked=NA, tags$p("An interactive app to view data curated by the",
                a("CAWP", href = "http://www.cawp.rutgers.edu", target = "_blank")),
                tags$p(a("View code",icon("github"), href = "https://github.com/JListman/Scrape_WomenRunning_CAWP", target = "_blank"))
                 )
        ),


server <- function(input, output) {
   
        output$statebins <- renderPlotly({
                y_Axis <- list(
                        title = "", 
                        zeroline = FALSE,
                        showline = FALSE, 
                        showticklabels = FALSE, 
                        showgrid = FALSE
                )
                
                x_Axis <- list(
                        title = "", 
                        zeroline = FALSE,
                        showline = FALSE, 
                        showticklabels = FALSE, 
                        showgrid = FALSE
                )
                
                plot_ly(statedata, x = ~col, y = ~-row) %>%
                        add_markers(
                                color = ~bins,
                                colors = mapcolors,
                                text = ~txt,
                                symbol = I("square"), 
                                size = I(50),
                                hoverinfo = "text",
                                showlegend = FALSE
                        ) %>%
                        add_text(text = ~state, color = I("white"), hoverinfo = "none") %>%
                        add_annotations(
                                x= .5,
                                y= .95,
                                xref = "paper",
                                yref = "paper",
                                text = "Hover Over State For Details",
                                showarrow = F
                        ) %>%
                        layout(title = "", xaxis = x_Axis, yaxis = y_Axis)
                
        })
        
        output$data <- renderDT({
                if ("All" %in% input$State & "All" %in% input$Party)
                        candidates <- elections[, columns]
                else if ("All" %in% input$State & input$Party != "All")
                        elections[elections$Party %in% input$Party, columns]
                else if (input$State != "All" & "All" %in% input$Party)
                        elections[elections$State %in% input$State, columns]
                else
                        elections[elections$Party %in% input$Party & elections$State %in% input$State, columns]
                
        },
        escape = FALSE)
        
},

options = list(height = 500)
)