forked from ntaback/dhsi-ml
-
Notifications
You must be signed in to change notification settings - Fork 0
/
AuthorAttribution.Rmd
115 lines (76 loc) · 4.43 KB
/
AuthorAttribution.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
---
title: "Austhor Attribution"
output: html_notebook
---
Now we're going to try some supervised learning with authorship attribution. We'll load the plays of Ben Jonson, Christopher Marlowe, and a few of Shakespeare's plays into data frames. From there, we'll fix up the data a bit and 'train' a machine learning algorithm on those plays and use what it knows to try and guess who wrote another play (Henry VI)
First, let's load, from project Gutenberg, all of the plays that we'll need. Thankfully, there's a guntenberg library for R:
```{r}
library(gutenbergr)
```
The Project Gutenberg library is a quick and easy way to download books directly from Gutenberg in a way that is easy for R to work with and manipulate.
We can, for instance, very quickly find all of the works of a particular author:
```{r}
library(stringr)
gutenberg_works(str_detect(author, "Carroll"))
```
But there may be more than one author in the Gutenberg library with the last name Carroll. So let's be more specific:
```{r}
gutenberg_works(str_detect(author, "Carroll, Lewis"))
```
Now we've narrowed down these works to only Lewis Carroll.
Let's say we can't remember Jonson's first name. In that case, we can search through all of the Gutenberg works to find anyone with the name Jonson:
```{r}
gutenberg_works(str_detect(author, 'Marlowe'))
gutenberg_works(str_detect(author, 'Jonson'))
gutenberg_works(str_detect(author, 'Shakespeare'))
```
OK, but our first table has returned any book by anyone named Jonson. Let's be more let's be more specific and look up the authors we actually want:
```{r}
gutenberg_metadata %>% filter(author == "Marlowe, Christopher")
gutenberg_metadata %>% filter(author == "Jonson, Ben")
gutenberg_metadata %>% filter(author == "Shakespeare, William")
```
Now we'd like to download all of those books. We use the gutenberg_id to tell gutenberg(download) what books we'd like to download.
Note that we could have passed the gutenberg_ids from our gutenberg_metadata queries to just automatically start downloading all of the plays that came out of our search. This isn't actually what we want though because Gutenberg has multiple copies of some plays. Also, we don't actually want ALL of each writer's plays -- just a selection for our experiment.
For those reasons, I've hard coded the gutenberg_ID numbers into the code below:
If we wanted to include the title of each work in our plays data we would include 'meta-data = "table" as a separate argument after the gutenberg_IDs.
```{r}
Jonson_plays <- gutenberg_download(c(3694, 3695, 3771, 4011, 4039, 4081, 5134, 5166, 5232, 5333,
49461, 50150))
Marlowe_plays <- gutenberg_download (c(811, 901, 1094, 1589, 16169, 18781, 20288))
#For Shakespeare, we'll just download a selection of his plays, leaving a few key plays excluded
#Exclude: Henry the 6th,
Shakes_plays <- gutenberg_download(c(1041, 1103, 1104, 1106, 1107, 1108, 1109, 1111, 1112, 1113, 1114, 1115, 1116))
class(Jonson_plays)
Jonson_plays
```
Now lets turn this dataframe into texts we can work with. First we'll use the unnest_tokens function to 'tokenize' each collection of plays. We're also including a count for each token for how often it appears in the author's corpus:
***Paul: This data still needs to be cleaned up. There are a number of terms that have leading or trailing underscores. How do we remove these underscores?
```{r}
library(tidytext)
Jonson_words <- Jonson_plays %>% unnest_tokens(word, text) %>% count(word)
Shakes_words <- Shakes_plays %>% unnest_tokens(word, text) %>% count(word)
Marlowe_words <- Marlowe_plays %>% unnest_tokens(word, text) %>% count (word)
```
Now let's add the appropriate author column to our data frames and remove our stop words:
```{r}
#We add a column to our data frame to get a particular author
Jonson_words$author <- "Jonson"
Shakes_words$author <- "Shakespeare"
Marlowe_words$author <- "Marlowe"
data("stop_words")
Jonson_words <- Jonson_words %>% anti_join(stop_words)
Shakes_words <- Shakes_words %>% anti_join(stop_words)
Marlowe_words <- Marlowe_words %>% anti_join(stop_words)
```
Now we need to get our machine learning packages installed. Lets load Caret, lattice, and ggplot2 (remember you might need to install the package first):
This is where I'll start to need Nathan's help.
```{r}
library(caret)
library(ggplot2)
library(lattice)
control <- trainControl(method="cv", number = 10)
metric <- "Accuracy"
set.seed(7)
#fit.lda <- train(Shakes_words~., data = )
```