---
title: "Intro to Text Mining"
author: "Pantea Ferdosian, Kevin Hoffman, Luke Moles, Marissa Shand"
output:
  html_document:
    toc: TRUE
    theme: united
    toc_depth: 3
    number_sections: TRUE
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Text Mining Motivation
# Working with text data
"A **token** is a meaningful unit of text, such as wa word, that we are interested in using for analysis, and tokenization is the process of splitting test into tokens" [1]. Can be a word, n-gram, sentence or paragraph.
## Packages we will be using
```{r, warning = FALSE, message = FALSE}
library(tidyverse)
library(tidytext)
library(gutenbergr)
```
## Exploring Project Gutenberg
"Project Gutenberg is an online library of free eBooks. Project Gutenberg was the first provider of free electronic books, or eBooks." [2]
```{r}
## Find books by Charles Dickens
dickens <- gutenberg_works(author == 'Dickens, Charles')
dim(dickens)
head(dickens)
## We will be working with A Tale of Two Cities, which has id 98
two_cities <- gutenberg_download(98)
head(two_cities)
## There are three books that make up this book
## Get book number
two_cities <- two_cities %>% mutate(book = cumsum(str_detect(text, regex("^Book the"))),
                                    linenumber = row_number())
## For each book get the linenumber in the book, and the chapter
## Roman numerals: https://www.oreilly.com/library/view/regular-expressions-cookbook/9780596802837/ch06s09.html
two_cities <- two_cities %>% group_by(book) %>%
  mutate(book_linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})[.]")))) %>%
  ungroup()
## Convert to tidy text
tidy_two_cities <- two_cities %>% unnest_tokens(word, text)
head(tidy_two_cities)
```
## Stop words
```{r}
## Error arose earlier when joining: https://stackoverflow.com/questions/9221310/r-debugging-only-0s-may-be-mixed-with-negative-subscripts
data(stop_words)
stop_words
## Remove stop words from the tokenized text
tidy_two_cities <- tidy_two_cities %>% anti_join(stop_words, by = "word")
head(tidy_two_cities)
```
# Sentiment Analysis
## Sentiment Lexicons
tidytext provides access to three general-purpose lexicons [4]:
1. AFINN: assigns each word an integer score from -5 (most negative) to 5 (most positive)
2. Bing: classifies each word as positive or negative
3. NRC: classifies each word as positive or negative and into eight basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust)
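As a minimal sketch of how these lexicons are used (following the approach in [1], and assuming the `tidy_two_cities` object created above), we can join the tokens against the Bing lexicon and tally positive versus negative words in 80-line sections of the novel. Depending on your tidytext version, lexicons other than Bing may require a one-time download via the textdata package.
```{r, warning = FALSE, message = FALSE}
## Bing lexicon: each word labelled "positive" or "negative"
bing <- get_sentiments("bing")

two_cities_sentiment <- tidy_two_cities %>%
  inner_join(bing, by = "word") %>%                      # keep only words in the lexicon
  count(book, index = linenumber %/% 80, sentiment) %>%  # tally by 80-line section
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)                # net sentiment per section

head(two_cities_sentiment)
```
The net sentiment column could then be plotted against `index` to trace the emotional arc of each of the three books.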
# Relationships between words: n-grams and correlations
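As a minimal sketch of bigram tokenization (reusing the `two_cities` data frame from above), `unnest_tokens()` with `token = "ngrams"` and `n = 2` splits the text into overlapping two-word sequences, which can then be separated into their component words, filtered against the stop word list, and counted.
```{r, warning = FALSE, message = FALSE}
## Tokenize into bigrams (pairs of consecutive words)
two_cities_bigrams <- two_cities %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))  # blank lines produce NA bigrams

## Split each bigram into its two words, drop pairs containing stop words, and count
two_cities_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  head()
```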
# Latent Dirichlet Allocation
- Intro to Topic Modeling
  - similar to clustering numeric data, except the clusters are topics made up of words
  - the goal is to discover latent patterns in a collection of documents
- Examples
  - apply it to a large batch of emails to understand what topics were discussed
  - presidential speeches, to identify themes
  - a collection of tweets from a group, to identify what topics people tweet about
- Latent Dirichlet allocation is an unsupervised method for finding topics in a collection of documents
  - a probabilistic generative model
  - every document is a mixture of topics, e.g. document 1 is 60% topic 1 and 40% topic 2
  - every topic is a mixture of words, e.g. 10% "apples", 5% "oranges", etc. (perhaps the topic is fruit)
- Algorithm
  - From the paper [5]:
    - Choose $N$, the document length, from a Poisson distribution
    - Choose $\theta$ from a Dirichlet($\alpha$) distribution
    - For each of the $N$ words:
      - Choose a topic $z_n$ from the Multinomial($\theta$) distribution
      - Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, the probability of the word given its topic
  - In simpler terms:
    - first, select $k$, the number of topics
    - randomly assign each word in each document to one of the $k$ topics
    - for each word, calculate the proportion of words in its document assigned to each topic
    - calculate the proportion of each topic's word assignments, across all documents, that come from this word
    - reassign the word to a new topic by using Gibbs sampling to draw from the resulting posterior (see the update formula after this list)
    - repeat the sampling for a number of draws
- LDA Example: randomly select 20 books from 10 popular authors on Project Gutenberg. How many authors (topics) are present?
  - authors play the role of topics
  - books play the role of documents
  - Tune $k$ with the ldatuning package: https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
    - rather than hand-fitting 10 separate models ($k = 1, \dots, 10$), use this package
  - Discuss hyperparameter tuning
  - Identify and discuss one of the 4 criteria for evaluating each LDA model
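To make the "sample the posterior with Gibbs" step above concrete, one common form of the collapsed Gibbs sampling update (the notation here is ours, a sketch rather than the exact formulation in [5]) reassigns word $w_i$ in document $d$ to topic $t$ with probability

$$
P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n_{d,t}^{-i} + \alpha}{\sum_{t'} \left( n_{d,t'}^{-i} + \alpha \right)}
\cdot
\frac{n_{t,w_i}^{-i} + \beta}{\sum_{w'} \left( n_{t,w'}^{-i} + \beta \right)}
$$

where $n_{d,t}$ counts the words in document $d$ currently assigned to topic $t$, $n_{t,w}$ counts the assignments of word $w$ to topic $t$ across all documents, the $-i$ superscript excludes the current word's own assignment, and $\alpha$, $\beta$ are the Dirichlet hyperparameters. The first factor is the "proportion of words in the document assigned to the topic" and the second is the "proportion of the topic's assignments that come from this word" described in the list above.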
## LDA Example
```{r}
library(gutenbergr)
library(topicmodels) # provides LDA()
# Set of popular authors
authors.popular <-
c(
"Dickens, Charles",
"Austen, Jane",
"Shelley, Mary Wollstonecraft",
"Twain, Mark",
"Doyle, Arthur Conan",
"Wilde, Oscar",
"Leech, John",
"Hawthorne, Nathaniel",
"Stevenson, Robert Louis",
"Carroll, Lewis"
)
# Download all books by these authors
books.authors.popular <- gutenberg_metadata %>%
filter(
author %in% authors.popular,
language == "en", # Only english
!str_detect(title, "Works"), # Ignore collections of works
has_text,
!str_detect(rights, "Copyright")
) %>%
distinct(title, .keep_all = TRUE) %>%
select(gutenberg_id, title)
# Select a random sample of 20 books (seed set so the sample is reproducible)
set.seed(555)
books.selection <- books.authors.popular %>% sample_n(20)
# Download the 20 books
books.list <-
books.selection$gutenberg_id %>% gutenberg_download(meta_fields = "title")
# Clean the text by removing blank rows
books.text <- books.list %>%
  filter(text != '') %>%    # Remove blank lines
  select(-gutenberg_id) %>% # Drop the id
  unite(document, title)    # Treat each book title as one document
words.by.book <- books.text %>%
unnest_tokens(word, text)
# Generate word counts for each word in our documents
word.counts <- words.by.book %>%
anti_join(stop_words) %>%
count(document, word, sort = TRUE) %>%
ungroup()
# Create a Document Term Matrix
books_dtm <- word.counts %>%
cast_dtm(document, word, n)
# Try a model with 5 topics (authors)
books.lda <- LDA(books_dtm, k = 5, control = list(seed = 555))
# TODO:
# - Show some visualizations of words and topics and the associated probabilities
# - Tune k with ldatuning package
# - Show results and evaluate if we chose the correct k
```
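As a rough sketch of the first TODO item above (assuming the `books.lda` object fitted in the previous chunk), the model can be tidied into per-topic word probabilities (the "beta" matrix) and the most probable terms in each topic plotted.
```{r, warning = FALSE, message = FALSE}
## Per-topic word probabilities from the fitted model
books.topics <- tidy(books.lda, matrix = "beta")

## Top 10 terms for each topic
top.terms <- books.topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

## Plot the most probable words within each topic
top.terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
```
The same idea with `tidy(books.lda, matrix = "gamma")` gives per-document topic proportions, which is what we would inspect to see whether books by the same author land in the same topic. For the remaining TODOs, `ldatuning::FindTopicsNumber()` and `FindTopicsNumber_plot()` can score a range of k values against the four criteria mentioned in the notes above.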
# References
1. [Text Mining with R](https://www.tidytextmining.com/index.html)
2. [Project Gutenberg](https://www.gutenberg.org/)
3. [Roman Numerals with Regex](https://www.oreilly.com/library/view/regular-expressions-cookbook/9780596802837/ch06s09.html)
4. [Sentiment Datasets](https://www.datacamp.com/community/tutorials/sentiment-analysis-R)
5. [Latent Dirichlet Allocation](http://www.cse.cuhk.edu.hk/irwin.king/_media/presentations/latent_dirichlet_allocation.pdf)