The data this week comes from the BBC by way of Joshua Feldman.
The BBC has revealed its list of 100 inspiring and influential women from around the world for 2020.
This year 100 Women is highlighting those who are leading change and making a difference during these turbulent times.
The list includes Sanna Marin, who leads Finland's all-female coalition government, Michelle Yeoh, star of the new Avatar and Marvel films and Sarah Gilbert, who heads the Oxford University research into a coronavirus vaccine, as well as Jane Fonda, a climate activist and actress.
And in an extraordinary year - when countless women around the world have made sacrifices to help others - one name on the 100 Women list has been left blank as a tribute.
# Get the Data
# Read in with tidytuesdayR package
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest
# Either ISO-8601 date or year/week works!
tuesdata <- tidytuesdayR::tt_load('2020-12-08')
tuesdata <- tidytuesdayR::tt_load(2020, week = 50)
women <- tuesdata$women
# Or read in the data manually
women <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-12-08/women.csv')
variable | class | description |
---|---|---|
name | character | Name of woman |
img | character | Link to headshot |
category | character | Category for award |
country | character | Country of residence |
role | character | Role/Career |
description | character | Description of the woman and their achievements |
# Load packages
library(rvest)
library(tidyverse)
# Load web page
bbc_women <- html("https://www.bbc.co.uk/news/world-55042935")
# Save even and odd indices for data extraction later
odd_index <- seq(1,200,2)
even_index <- seq(2,200,2)
# Extract name
name <- bbc_women %>%
html_nodes("article h4") %>%
html_text()
# Extract image
img <- bbc_women %>%
html_nodes(".card__header") %>%
html_nodes("img") %>%
html_attr("src")
img <- img[odd_index]
# Extract category
category <- bbc_women %>%
html_nodes("article .card") %>%
str_extract("card category--[A-Z][a-z]+") %>%
str_remove_all("card category--")
# Extract country & role
country_role <- bbc_women %>%
html_nodes(".card__header__strapline__location") %>%
html_text()
country <- country_role[odd_index]
role <- country_role[even_index]
# Extract description
description <- bbc_women %>%
html_nodes(".first_paragraph") %>%
html_text()
# Finalise data frame
df <- data.frame(
name,
img,
category,
country,
role,
description
)
# Export
write.csv(df, "data.csv", row.names=FALSE)