-
Notifications
You must be signed in to change notification settings - Fork 0
/
data-visualization.Rmd
293 lines (200 loc) · 11.4 KB
/
data-visualization.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
---
title: "Data visualization in R"
output: github_document
---
> ## Learning Objectives
>
> * Produce scatterplots, boxplots, and histograms summarizing the Spotify data.
> * Learn about universal and local plot settings.
> * Use faceting effectively in ggplot2
First, lets load the packages and data we'll need. If your R session is still open from the Data Manipulation session, you don't need to do this.
`ggplot2` is included in the tidyverse
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
```
```{r}
spotify <- read_csv("data/spotify.csv")
```
## Plotting with ggplot2
`ggplot2` is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatter plot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.
`ggplot2` functions like tidy data. Well-structured data will save you lots of time when making figures with ggplot2
`ggplot2` graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
To build a `ggplot`, we will use the following basic template that can be used for different types of plots:
```{r, eval=FALSE}
ggplot(data = < DATA > , aes( < MAPPINGS > )) + geom_FUNCTION()
```
**data**
Bind the plot to a specific data frame using the `data` argument
**aes()**
Use the `aes()` (aesthetics) function to select the variables to be plotted and specify how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.
**geom**
Add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). Some examples that you will use today:
* `geom_point()` for scatter plots, dot plots, etc.
* `geom_smooth()` for trendlines
* `geom_boxplot()` for boxplots
* `geom_histogram()` for histograms
To add a geom to the plot use the `+` operator. This is somewhat similar to `%>%`. I allows you to be modular with your plots.
Lets get our hands dirty and make a quick scatterplot
```{r}
ggplot(data = spotify, aes(x = energy, y = loudness)) +
geom_point()
```
It looks like song energy and loundness have a positive relationship!
The `+` operator can really come in handy. It allows you to modify existing `ggplot` objects. That way you can set up plot templates and explore different types of plots without having to reinvent the wheel.
Here's an example:
```{r, results='hide'}
# Assign a plot to an object. This is your template
energy_loudness <- ggplot(data = spotify, aes(x = energy, y = loudness))
# draw the plot. this will return the same plot as above
energy_loudness +
geom_point()
```
Now, if you want to just draw a trendline without points, you can!
```{r}
energy_loudness +
geom_smooth()
```
Anything you put in the `ggplot()` function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes(). You can also specify mappings for a given geom independently of the mappings defined globally in the `ggplot()` function.
**Exercise 1**
Create the same scatterplot as above (`geom_point()`), but define the x variable as energy and the y variable as tempo. What kind of trend do you see?
## Building plots iteratively
Usually it takes some experimentation to get a plot looking just right. `ggplot` makes it easy to iteratively build your plot.
Let's return to the scatterplot of energy and loudness:
```{r}
ggplot(data = spotify, aes(x = energy, y = loudness)) +
geom_point()
```
There are a ton of points on the plot. You can avoid overplotting by adjusting the transparency of the points with the `alpha` parameter, which you supply to the `geom_point()` function. Feel free to play around with the `alpha` parameter. It ranges from 0 to 1, with 1 being the most opaque.
```{r}
ggplot(data = spotify, aes(x = energy, y = loudness)) +
geom_point(alpha = 0.2)
```
Interesting. The point density gets denser as the two variables increase. Le'ts see if there may be some effects of `genre` on this relationship. We do this by supplying `genre` to the `color` aesthetic.
```{r}
ggplot(data = spotify, aes(x = energy, y = loudness, color = genre)) +
geom_point(alpha = 0.2)
```
Woah buddy, that's a lot of colors! Lets use our data filtering skills to reduce the data set to a more manageable five genres. Lets go with Folk, Rock, Soul, Rap, and Pop.
```{r}
spotify_filtered <- spotify %>%
filter(genre == "Folk" | genre == "Rock" | genre == "Soul" | genre == "Rap" | genre == "Pop")
```
Let's make the same scatterplot as before, but use this reduced data set. There are going to be fewer data points on the graph, so we can raise the value of alpha.
```{r}
ggplot(data = spotify_filtered, aes(x = energy, y = loudness, color = genre)) +
geom_point(alpha = 0.8)
```
It looks like there aren't differences in the relationship between energy and loudness across genera, at least for the genera we're working with.
**Exercise 2**
Let's take one last effort to see if there are differences in the relationship between energy and loudness across groups. Rather than plotting the full scatterplot, plot the trendlines with the `geom_smooth()` function.
-----
## Boxplots
What if you're interested in the differences in the distribution of a single variable across genera? Boxplots help you do that.
Here is a quick look at how the distribution of danceability differs across genera.
```{r}
bp <- ggplot(data = spotify_filtered, aes(x = genre, y = danceability))
bp +
geom_boxplot()
```
It looks like Pop, Rap, and Soul are the most danceable genera, but Folk and Rock aren't so much.
[Boxplots](https://www.r-graph-gallery.com/boxplot) are great for general trends, but they don't represent the distribution of the data super well. [Violin plots](https://www.r-graph-gallery.com/violin) do a better job of this. Since you saved the last `ggplot` object as `bp`, you don't have to rewrite the long ``ggplot`` call again!
```{r}
bp +
geom_violin()
```
The distributions look unimodal, with some skew. This looks fine to me!
## Customization
The data visualizations so far convey patterns in the data well enough, but they lack refinement. Let's spruce a plot up to make it more professional.
First, I'm going to introduce the `geom_histogram()` function, which will allow you to visualize the distribution of a single variable.
Let's take a look at danceability. Since you're only visualizing one variable, you only need to supply an argument to the `x` aesthetic.
```{r}
ggplot(data = spotify_filtered, aes(x = danceability)) +
geom_histogram()
```
First we need to fix that notification. Picking binwidths is a tricky endeavor which can bias the representation of your distribution. For simplicity, let's just pick the number of bins as 40.
```{r}
ggplot(data = spotify_filtered, aes(x = danceability)) +
geom_histogram(bins = 40)
```
Now, let's fix the axis labels. Capitalizing the first letter should do the trick! While we're add it, let's add an informative title. We can do all of this with the `labs()` function.
```{r}
ggplot(data = spotify_filtered, aes(x = danceability)) +
geom_histogram(bins = 40) +
labs(title = "Danceability of my favorite genera",
x = "Danceability",
y = "Count")
```
This is already looking better! The gray background isn't very pleasing, though. To remove this, the `theme()` function will work, but it can be a headache to tweak everything just the way you want it. See [here](https://ggplot2.tidyverse.org/reference/theme.html) for the insane list of options. Fortunately, `ggplot` comes prepackaged with [a set of themes](https://ggplot2.tidyverse.org/reference/ggtheme.html) where most of the decisions are made for you. There are a few to choose from, but I'm going to make an executive decision and pick one of my favorites, `theme_minimal()`.
```{r}
ggplot(data = spotify_filtered, aes(x = danceability)) +
geom_histogram(bins = 40) +
labs(title = "Danceability of my favorite genera",
x = "Danceability",
y = "Count") +
theme_minimal()
```
This looks almost publishable! The font sizes could use a little tweaking and I'm sure there are more interesting colors you can think of, but this is pretty good for a few lines of code!
**Exercise 3**
Try making a histogram like this yourself! Pick your favorite numeric variable from the `spotify_filtered` data set and either replicate the above plot or tweak it and have some fun.
-----
## Faceting
`ggplot2` has a special technique called faceting that allows the user to split one plot into multiple plots based on a factor included in the dataset.
There are two types of facet functions:
* `facet_wrap()` arranges a one-dimensional sequence of panels to allow them to cleanly fit on one page.
* `facet_grid()` allows you to form a matrix of rows and columns of panels.
The spotify data set doesn't lend itself well to `facet_grid`, so we're going to stick with `facet_wrap`.
To use `facet_wrap`, you need to supply it with the variable you want the plots to be separated into. The variable needs to be wrapped in the `vars()` function, which tells `facet_wrap` that this is a variable from the data set being used to plot. A complete function looks like this:
`facet_wrap(facets = vars(genre))`.
You're now going to take the pretty histogram you made earlier and split it up according to the genera.
```{r}
ggplot(data = spotify_filtered, aes(x = danceability)) +
geom_histogram(bins = 40) +
labs(title = "Danceability of my favorite genera",
x = "Danceability",
y = "Count") +
theme_minimal() +
facet_wrap(facets = vars(genre))
```
Nice! We can see that Rap, Pop, and Soul are a little skewed towards higher Danceability values.
**Exercise 4**
Can you try this with the filtered scatter plot?
Here's the original code:
```{r}
ggplot(data = spotify_filtered, aes(x = energy, y = loudness, color = genre)) +
geom_point(alpha = 0.8)
```
See if you can spruce the plot up a bit first, like we did with the histogram!
*Extension* If we have time, do this same faceted plot, but with `geom_smooth` rather than `geom_point`
## Answers
**Exercise 1**
Doesn't look like there is a strong trend! Maybe slightly positive.
```{r}
ggplot(data = spotify, aes(x = energy, y = tempo)) +
geom_point()
```
**Exercise 2**
Looks like there is still no difference in the relationship across genera!
```{r}
ggplot(data = spotify_filtered, aes(x = energy, y = loudness, color = genre)) +
geom_smooth()
```
**Exercise 3**
There's really no right answer to this. Have some fun!
**Exercise 4**
Here is one example of a faceted scatterplot:
```{r}
ggplot(data = spotify_filtered, aes(x = energy, y = loudness, color = genre)) +
geom_point(alpha = 0.8) +
facet_wrap(facets = vars(genre))
```
Here are faceted trendlines:
```{r}
ggplot(data = spotify_filtered, aes(x = energy, y = loudness, color = genre)) +
geom_smooth() +
facet_wrap(facets = vars(genre))
```
## Resources
There are many good data visualization resources to check out! Here are some of my favorites.
[Fundamentals of Data Visualization](https://serialmentor.com/dataviz/)
[R for Data Science](https://r4ds.had.co.nz/data-visualisation.html)
[From data to viz](https://www.data-to-viz.com/)