-
Notifications
You must be signed in to change notification settings - Fork 0
/
preprocess.Rmd
83 lines (62 loc) · 1.46 KB
/
preprocess.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
---
title: "Preprocess",
author: "Luuc van der Zee",
output: html_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "##"
)
```
```{r, echo = FALSE}
library(DBI)
library(RSQLite)
library(tidyverse)
```
First, load the data from the database:
```{r}
db_path <- "../js/genius/db/data.db"
con <- dbConnect(RSQLite::SQLite(), db_path)
table_name <- "songs"
songs <- dbReadTable(con, table_name)
dbDisconnect(con)
songs_tbl <- as_tibble(songs)
```
## Removing missing lyrics
Turns out there are also a few lyrics missing. We will remove those too:
```{r}
songs_tbl <- songs_tbl %>%
filter(!is.na(lyrics))
```
## Removing short lyrics
Some songs (mainly skits) have very short lyrics. We will remove those too:
```{r}
songs_tbl <- songs_tbl %>%
mutate(length = nchar(lyrics)) %>%
filter(length > 133) %>%
select(id, artist, url, lyrics, date) %>%
rename(year = date)
```
## Removing duplicate songs
```{r}
songs_tbl <- songs_tbl %>% distinct(url, .keep_all = TRUE)
```
## Remove French and German artists
```{r}
foreign_language_artists <- c(
"Casper",
"Manau"
)
songs_tbl <- songs_tbl %>%
filter(!(artist %in% foreign_language_artists))
```
## For some reason, one song seems to be form the year 1901. Fix this
```{r}
songs_tbl <- songs_tbl %>% mutate(year = replace(year, year == 1901, 2010))
```
## Output
Finally, we will save the resulting cleaned data to csv:
```{r}
write_csv(songs_tbl, "./songs.csv")
```