This repository has been archived by the owner on Sep 18, 2019. It is now read-only.
---
output:
  html_document:
    toc: true
    toc_depth: 4
---
![](https://i.ytimg.com/vi/8_5TCqHVEW8/maxresdefault.jpg)
We start this workthrough in this lecture and finish it in the next one.
The goal is to build a little collection of songs from our own preferred artist. Let's say it's _Straight Line Stitch_ (they are great!). A little kicker for the [morning](https://www.youtube.com/watch?v=4_5VAKdHMek).
The **highly suggested** browser (or, at least, the one that I'll be using) is [Firefox](https://www.mozilla.org/en-US/firefox/developer/), the Developer Edition.
## Packages
> Don't be afraid of the dark you're still held up by the stars
We are going to use a bunch of the usual packages:
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(magrittr)
library(purrr)
library(glue)
library(stringr)
```
and introduce two new ones:
```{r message=FALSE, warning=FALSE}
library(rvest)
library(xml2)
```
These are meant explicitly for scraping content from web pages. We are going to use a couple more in the bonus section, if we get there.
## The lyrics
We are going to extract the lyrics from here: https://www.musixmatch.com/ . I chose it because it's rather consistent, and it's from Bologna, Italy (yeah!).
The website offers the first 15 lyrics up front. That will do for the moment (and fixing that is not that easy). Let's take a look [here](https://www.musixmatch.com/artist/Straight-Line-Stitch#).
## Titles
First things first: we would like to get a list of those titles. Let's see how.
```{r}
url_titles <- "https://www.musixmatch.com/artist/Straight-Line-Stitch#"
page_title <- read_html(url_titles)
```
Now, what is this `page_title` object? Let's see:
```{r}
page_title
```
OK. It's a document. Thanks. And it's an XML document. That's sort of html. We'll handle it with `xml2` and `rvest`. Let's see a bit more of that page.
```{r}
page_title %>% html_structure()
```
Wait, whaaaaaat?
![](https://media.giphy.com/media/ZkEXisGbMawMg/giphy.gif)
To the browser! Look at those `class` attributes: we can target them with _CSS selectors_, and we will use them as handles to navigate the extremely complex list that we get from a web page.
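To see what a class selector does without touching the network, here is a minimal sketch on a hand-written HTML snippet. The markup and the class names (`track`, `other`) are invented for illustration and are not taken from the real page.

```r
library(rvest)

# Hand-written stand-in for a scraped page (invented markup, hypothetical classes)
toy_page <- read_html('<html><body>
  <ul>
    <li><a class="track" href="/songs/one">Song One</a></li>
    <li><a class="track" href="/songs/two">Song Two</a></li>
    <li><a class="other" href="/about">About</a></li>
  </ul>
</body></html>')

# ".track" selects every element whose class attribute includes "track"
track_nodes <- html_nodes(toy_page, ".track")
length(track_nodes)

# "a.track" is stricter: only <a> elements with that class
track_text <- toy_page %>%
  html_nodes("a.track") %>%
  html_text()
track_text
```

The third link is left out because its class does not match the selector.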
Sometimes we can be lucky. For example, the titles all carry the class `title`, so the CSS selector `.title` picks them out. Let's see.
```{r}
page_title %>%
html_nodes(".title")
```
That's still quite a mess: we have too much stuff, such as some links (called "href") and more text than we need. Let's clean it up with `html_text()`
```{r}
page_title %>%
html_nodes(".title") %>%
html_text()
```
Wunderbar! Now we have 15 song titles. But we want the lyrics! Let's do better.
```{r}
# tibble() replaces the deprecated data_frame()
SLS_df <- tibble(Band = "Straight Line Stitch",
                 Title = page_title %>%
                   html_nodes(".title") %>%
                   html_text())
```
Now we are going to use a bit of string magic:
```{r}
SLS_lyrics <- SLS_df %>%
  mutate(Link = glue('https://www.musixmatch.com/lyrics/{Band}/{Title}') %>%
           str_replace_all(" ", "-"))
```
It seems to work.
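The string trick can be checked in isolation. This is a small sketch with made-up song titles (`Song One` and `Another Song` are placeholders, not entries from the site), showing how `glue()` fills in the template and `str_replace_all()` swaps the spaces for dashes.

```r
library(glue)
library(stringr)

band <- "Straight Line Stitch"
titles <- c("Song One", "Another Song")  # placeholder titles, for illustration only

# glue() is vectorised: one URL per title
links <- glue("https://www.musixmatch.com/lyrics/{band}/{titles}") %>%
  str_replace_all(" ", "-")
links
```

Note that this only handles spaces. Whether titles with punctuation or featured artists map onto the real URLs this way is an assumption we cannot check from here.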
There is a better trick to do this job. If we look again at what we get when we select `.title`, we may see that the _actual_ link is there, coded as `href`. Can we extract that? Yes we can!
```{r}
page_title %>%
html_nodes(".title") %>%
html_attrs() %>%
glimpse()
```
In particular, we want the element called `href`. Hey, we can get that with `map`!
```{r}
page_title %>%
html_nodes(".title") %>%
html_attrs() %>%
map_chr("href")
```
Or, even better, by letting `rvest` do the job for us:
```{r}
page_title %>%
html_nodes(".title") %>%
html_attr("href")
```
```{r}
SLS_df %<>%
mutate(Link = page_title %>%
html_nodes(".title") %>%
html_attr("href"))
```
Cool, we don't gain much in terms of lines of code, but it will be useful later!
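To compare the two routes side by side, here is a sketch on an invented snippet (the markup is a stand-in, not the real page): `html_attrs()` followed by `map_chr()`, against `html_attr()` directly.

```r
library(rvest)
library(purrr)

# Invented markup standing in for the artist page
toy_page <- read_html('<html><body>
  <a class="title" href="/lyrics/Band/Song-One">Song One</a>
  <a class="title" href="/lyrics/Band/Song-Two">Song Two</a>
</body></html>')

nodes <- html_nodes(toy_page, ".title")

# html_attrs() returns one named character vector per node ...
via_map <- nodes %>%
  html_attrs() %>%
  map_chr("href")

# ... while html_attr() pulls a single attribute directly
via_attr <- nodes %>%
  html_attr("href")

all(via_map == via_attr)
```

One practical difference: `html_attr()` returns `NA` for a node that lacks the attribute, whereas `map_chr("href")` would stop with an error.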
## And `purrr`!
Cool, now we want to grab all the lyrics. Let's start with one at a time. What is the URL we want?
```{r}
url_song <- glue("https://www.musixmatch.com{SLS_df$Link[1]}")
url_song
```
And let's grab the lyrics for that song. The content can be picked out with the CSS selector `p.mxm-lyrics__content`: `p` matches paragraph elements, and `.mxm-lyrics__content` restricts them to the specific class used for the lyrics. The class alone is specific enough here, so we select on `.mxm-lyrics__content` only.
```{r}
url_song %>%
read_html() %>%
html_nodes(".mxm-lyrics__content") %>%
html_text()
```
Ach, notice that it comes in different blocks: one for each section of text, broken up by the advertisement. Well, we can just collapse them together with `glue_collapse()` from `glue` (older versions of the package called it `collapse()`). As we are doing this, let's turn that flow into a function:
```{r}
get_lyrics <- function(link){
  lyrics_chunks <- glue("https://www.musixmatch.com{link}#") %>%
    read_html() %>%
    html_nodes(".mxm-lyrics__content")
  # we do a sanity check to see that there's something inside the lyrics!
  stopifnot(length(lyrics_chunks) > 0)
  lyrics <- lyrics_chunks %>%
    html_text() %>%
    glue_collapse(sep = "\n") %>%
    as.character() # plain character, so that map_chr() accepts it
  return(lyrics)
}
```
Let's test it!
```{r}
SLS_df$Link[3] %>%
get_lyrics() %>%
glue() # we paste into glue to get the nice formatting
```
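The extract-and-collapse step can also be tried offline. This sketch uses an invented page in which an advert splits the lyrics into two blocks. The markup is an assumption that mimics the class name used above, and the collapsing helper is `glue_collapse()`, the name under which recent `glue` releases provide it.

```r
library(rvest)
library(glue)

# Invented markup: two lyrics blocks separated by an advert
toy_song <- read_html('<html><body>
  <p class="mxm-lyrics__content">First verse line</p>
  <div class="ad">Buy things!</div>
  <p class="mxm-lyrics__content">Second verse line</p>
</body></html>')

# One character string per lyrics block; the advert is not selected
chunks <- toy_song %>%
  html_nodes("p.mxm-lyrics__content") %>%
  html_text()

# Stitch the blocks back together, one block per line
lyrics <- as.character(glue_collapse(chunks, sep = "\n"))
cat(lyrics)
```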
Now we can use purrr to map that function over our dataframe!
```{r}
SLS_df %<>%
mutate(Lyrics = map_chr(Link, get_lyrics))
```
OK, here we were quite lucky, as all the links were right. In general we may want to play it safe and use a `possibly()` wrapper, so that we don't have to stop everything in case something bad happens.
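As a rough sketch of how such a wrapper behaves, here is a toy example. `fake_get_lyrics()` is a hypothetical stand-in for a scraper that fails on some links, so nothing is downloaded.

```r
library(purrr)

# Hypothetical stand-in for a scraper: fails on links flagged as broken
fake_get_lyrics <- function(link) {
  if (grepl("broken", link)) stop("no lyrics found at ", link)
  paste("lyrics for", link)
}

# possibly() returns the `otherwise` value instead of raising the error
safe_get_lyrics <- possibly(fake_get_lyrics, otherwise = NA_character_)

links <- c("/lyrics/a", "/lyrics/broken", "/lyrics/c")
results <- map_chr(links, safe_get_lyrics)
results
```

The mapping runs to the end, and the failing link simply shows up as `NA` in the result.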
## The flow
**Explore, try, test, automatize, test.**
Scraping data from the web will require a lot of trial and error. In general, I like this flow: I explore the pages that I want to scrape, trying to identify patterns that I can exploit. Then I try, on a smaller subset, and I test if it worked. Then I automatize it, using `purrr` or something similar. And finally some more testing.
## Another Artist
Let's do this for Angel Haze. Notice that here we **have** to use the attributes from the web page, as the name of the authors of the lyrics is not always the same (the `glue` approach would fail).
```{r}
AH_url <- "https://www.musixmatch.com/artist/Angel-Haze"
# read the artist page once and reuse the title nodes
AH_titles <- AH_url %>%
  read_html() %>%
  html_nodes(css = ".title")
# tibble() replaces the deprecated data_frame()
AH_lyrics <- tibble(Band = "Angel Haze",
                    Title = AH_titles %>%
                      html_text(),
                    Link = AH_titles %>%
                      html_attr("href"),
                    Lyrics = map_chr(Link, get_lyrics))
```
### Bonus: sentiment analysis
The idea is to attribute a score to each word, expressing whether it's more negative or positive, and then to sum up. To do this, we are going to use Julia Silge's and David Robinson's great [_Tidytext_](https://github.com/juliasilge/tidytext) library and a _vocabulary_ of words for which we have the scores (there are different options; we are using "afinn").
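```{r}
library(tidytext)
afinn <- get_sentiments("afinn")
# recent tidytext releases name the score column `value`;
# rename it so that the code expecting `score` keeps working
if ("value" %in% names(afinn)) afinn <- rename(afinn, score = value)
```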
Now, a bit of massaging: we break the lyrics into their words, remove the words that are considered not interesting (they are called "stop words"), stitch the dataframe to the scores from afinn, and do the math for each song.
```{r}
SLS_df %>%
unnest_tokens(word, Lyrics) %>% #split words
anti_join(stop_words, by = "word") %>% #remove dull words
inner_join(afinn, by = "word") %>% #stitch scores
group_by(Title) %>% #and for each song
summarise(Length = n(), #do the math
Score = sum(score)/Length) %>%
arrange(-Score)
```
So, what was the most positive song?
```{r}
SLS_df %>%
filter(Title == "Promise Me") %$%
Lyrics %>%
glue()
```
And we can easily do the same with Angel Haze:
```{r}
AH_lyrics %>%
unnest_tokens(word, Lyrics) %>% #split words
anti_join(stop_words, by = "word") %>% #remove dull words
inner_join(afinn, by = "word") %>% #stitch scores
group_by(Title) %>% #and for each song
summarise(Length = n(), #do the math
Score = sum(score)/Length) %>%
arrange(-Score)
```
More resources about Sentiment Analysis (with Tidytext) are available [here](http://varianceexplained.org/r/yelp-sentiment/) and [here](http://www.jakubglinka.com/2017-03-01-text_mining_part1/).
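To see the scoring arithmetic in isolation, here is a sketch with two invented songs and a tiny made-up lexicon. The words and scores in `toy_lexicon` are assumptions for illustration; the real AFINN scores differ.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Invented lyrics, just to have something to score
toy_lyrics <- tibble(
  Title  = c("Happy Song", "Sad Song"),
  Lyrics = c("love love joy and the rain", "pain and sorrow and love")
)

# Made-up mini lexicon standing in for AFINN
toy_lexicon <- tibble(
  word  = c("love", "joy", "pain", "sorrow"),
  score = c(3, 3, -2, -2)
)

toy_scores <- toy_lyrics %>%
  unnest_tokens(word, Lyrics) %>%           # one row per word
  inner_join(toy_lexicon, by = "word") %>%  # keep only the scored words
  group_by(Title) %>%
  summarise(Length = n(),
            Score = sum(score) / Length) %>%
  arrange(-Score)
toy_scores
```

Because of the `inner_join()`, `Length` counts only the words found in the lexicon, so `Score` is the mean score of the matched words rather than of all the words in the song.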
## What about the rest?
We want to do it also for other artists. The best thing is to turn some of those scripts into functions. Let's try with _Billie Holiday_ and _A Tribe Called Red_ (I picked them 'cause they are great, and also because they will show some limitations of the code that I'm interested in tackling).
When we are about to do something over and over, it's better to write functions. So, let's do it!
```{r}
get_words <- function(band_name){
  # remove white space from band name
  collapsed_name <- str_replace_all(band_name, " ", "-")
  # define url to get the title and links
  url <- glue("https://www.musixmatch.com/artist/{collapsed_name}")
  # read title page and extract the title chunks
  title_page <- url %>%
    read_html() %>%
    html_nodes(css = ".title")
  # and build the dataframe (tibble() replaces the deprecated data_frame())
  lyrics <- tibble(Band = band_name,
                   # extract text title
                   Title = title_page %>%
                     html_text(),
                   # extract title link
                   Link = title_page %>%
                     html_attr("href"),
                   # map to get lyrics
                   Lyrics = map_chr(Link, get_lyrics))
  return(lyrics)
}
```
And the sentiment analysis:
```{r}
get_soul <- function(Lyrics_df) {
  # the result of the pipeline is the last expression, so it is what the function returns
  Lyrics_df %>%
    unnest_tokens(word, Lyrics) %>% # split words
    # anti_join(stop_words, by = "word") %>% # remove dull words
    inner_join(afinn, by = "word") %>% # stitch scores
    group_by(Title) %>% # and for each song
    summarise(Length = n(), # do the math
              Score = sum(score)/Length) %>%
    arrange(-Score)
}
```
Let's see if it works:
```{r}
Billie_words <- "Billie Holiday" %>% get_words()
Billie_sentiment <- Billie_words %>% get_soul()
```
Most positive song:
<iframe width="560" height="315" src="https://www.youtube.com/embed/HxG6K59FUGQ" frameborder="0" allowfullscreen></iframe>
and most negative
<iframe width="560" height="315" src="https://www.youtube.com/embed/EIgVCU19pjg" frameborder="0" allowfullscreen></iframe>
It works for me! Let's finally try with _A Tribe Called Red_.
```{r, eval=FALSE}
ATCR_words <- "A Tribe Called Red" %>% get_words()
```
Uuuh, we get an error now! What's the problem? Well, let's try to see. Some of the songs do not have lyrics yet! So, when we try to scrape this [page](https://www.musixmatch.com/lyrics/A-Tribe-Called-Red-ft-Black-Bear/Stadium-Pow-Wow) the text is not there, and the `stopifnot()` sanity check in `get_lyrics()` raises an error.
This is a rather common situation when scraping, as often what we are looking for is not there. Thus, we need a safer approach. We can either write ad hoc `if ... else ...` statements to control for the presence or absence of things, or (and it is better to do it anyhow) wrap our function in a `purrr::possibly()` construct, which returns a default value (here `NA_character_`) instead of raising the error. We can do it by modifying our workflow just slightly. Notice that now we use `get_lyrics_safe()` inside the mapping instead of `get_lyrics()`.
```{r}
get_lyrics_safe <- purrr::possibly(get_lyrics, NA_character_)
get_words <- function(band_name){
  # remove white space from band name
  collapsed_name <- str_replace_all(band_name, " ", "-")
  # define url to get the title and links
  url <- glue("https://www.musixmatch.com/artist/{collapsed_name}")
  # read title page and extract the title chunks
  title_page <- url %>%
    read_html() %>%
    html_nodes(css = ".title")
  # and build the dataframe (tibble() replaces the deprecated data_frame())
  lyrics <- tibble(Band = band_name,
                   # extract text title
                   Title = title_page %>%
                     html_text(),
                   # extract title link
                   Link = title_page %>%
                     html_attr("href"),
                   # map to get lyrics, with NA where scraping fails
                   Lyrics = map_chr(Link, get_lyrics_safe))
  return(lyrics)
}
```
Let's try again:
```{r}
ATCR_words <- "A Tribe Called Red" %>% get_words()
```
Much better.
```{r}
ATCR_sentiment <- ATCR_words %>% get_soul()
```
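For comparison, the ad hoc `if ... else ...` route could look roughly like the sketch below. `extract_lyrics()` is a hypothetical offline variant that takes an already parsed page instead of a link, and both pages are invented markup.

```r
library(rvest)

# Hypothetical offline variant: works on a parsed page rather than a link
extract_lyrics <- function(page) {
  chunks <- html_nodes(page, ".mxm-lyrics__content")
  if (length(chunks) == 0) {
    # nothing to scrape: return a missing value instead of failing
    NA_character_
  } else {
    paste(html_text(chunks), collapse = "\n")
  }
}

# Invented pages: one with lyrics, one without
with_lyrics <- read_html('<html><body><p class="mxm-lyrics__content">Some words</p></body></html>')
without_lyrics <- read_html('<html><body><p>Lyrics not available yet</p></body></html>')

found <- extract_lyrics(with_lyrics)
missing <- extract_lyrics(without_lyrics)
```

This only covers the one failure we anticipated (no lyrics nodes), while `possibly()` also catches errors we did not think of, such as a page that fails to download.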
### Challenge
Another singer you should, should, should listen to is _Militia Vox_. Try to replicate our work with her lyrics. What's the problem? (If you think you have the answer, please discuss it with me :-) )
**note**: this workthrough is loosely inspired by Max Humber's [post](https://www.r-bloggers.com/fantasy-hockey-with-rvest-and-purrr/) and David Laing's post [here](https://laingdk.github.io/kendrick-lamar-data-science/). Great things are from them, errors are mine.