-
Notifications
You must be signed in to change notification settings - Fork 0
/
tidy_tuesday1.Rmd
123 lines (93 loc) · 3.84 KB
/
tidy_tuesday1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
---
title: "Tidy Tuesday"
author: "Amanda Skarlupka"
date: "10/4/2019"
output: html_document
---
Welcome to my attempt at doing Tidy Tuesday. This week's dataset is looking at pizza places.
Load the necessary packages.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(dplyr)
library(janitor)
library(jsonlite)
library(here)
```
Load the data for analysis.
```{r}
pizza_jared <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_jared.csv")
pizza_barstool <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_barstool.csv")
pizza_datafiniti <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_datafiniti.csv")
```
I want to look at the data to see what I'm working with.
```{r}
head(pizza_barstool)
head(pizza_datafiniti)
head(pizza_jared)
```
I want to merge the tables so that the places column of the pizza_jared matches up with the name column of the pizza_barstool.
```{r}
pizza <- inner_join(pizza_barstool, pizza_jared, by = c("name" = "place"))
head(pizza)
```
As I look at it, I'm wondering if the question variable is necessary. If it's all the same "How was ..." then that's redudant and can be removed. The time and poll_ID I'm not too interested in.
```{r}
pizza %>%
select(question) %>%
distinct()
```
The questions are all the same so I'm just going to remove the variables I'm not interested in.
```{r}
pizza_clean <- pizza %>%
select(-c(pollq_id, polla_qid, question, time, latitude, longitude, country, zip, address1))
pizza_clean %>%
select(name) %>%
distinct()
str(pizza_clean)
```
There are 22 distinct pizza places that over lap between the two datasets.
I want to get an idea about how the reviews spread out.
```{r}
pizza_clean %>%
ggplot(aes(city)) +
geom_histogram(stat = "count")
pizza_clean %>%
ggplot(aes(price_level)) +
geom_histogram()
pizza_clean %>%
ggplot(aes(review_stats_all_average_score)) +
geom_histogram()
pizza_clean %>%
ggplot(aes(name, review_stats_all_count)) +
geom_point() +
theme(axis.text.x = element_text(angle=60, hjust=1))
```
It looks like most of the data is from New York. So I'm just going to focus on that city, and remove the rest of the cities. I'm not finding the price level to be very interesting either. So I'm going to remove that as well. The are a couple places that have high review counts. But that'll be good to just keep in mind.
```{r}
nyc_pizza <- subset(pizza_clean, city == "New York")
nyc_pizza <- nyc_pizza %>%
select(-c(price_level, review_stats_all_count))
```
Ok, now I want to actually kind look at the data. So I'm going to see of the restaurants left what their ratings are.
```{r}
nyc_pizza %>%
ggplot(aes(x = reorder(name, review_stats_all_average_score), y = review_stats_all_average_score)) +
geom_point() +
theme(axis.text.x = element_text(angle=60, hjust=1))
```
Now I want to see plot the reviews agaisnt each other to see if they match up well.
```{r}
nyc_pizza %>%
ggplot(aes(review_stats_critic_average_score, review_stats_community_average_score)) +
geom_point()
nyc_pizza %>%
ggplot(aes(review_stats_dave_average_score, review_stats_community_average_score, color = provider_rating, size = (review_stats_community_count + review_stats_dave_count))) +
guides(size = guide_legend(title = "Total Count"),
color = guide_legend(title = "Rating")) +
geom_point() +
xlim(0, 10) +
ylim(0, 10) +
labs(title = "Average Ratings", x = "Dave's Score", y = "Community Score")
```
Uh. Well I guess the critic reviews don't really cover the data set that I'm looking at. But there is a bit of overlab for the dave ratings and the community scores. I would fit a line to this to see the deviation between them in the future.