---
title: "R Notebook"
output: html_notebook
---
Using N-Grams:
```{r}
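library(dplyr)
library(janeaustenr)
library(tidyr)
library(tidytext)
library(igraph)

# Tokenize into overlapping n-grams. Lines with fewer than n words can
# yield NA tokens in recent tidytext versions, so drop those rows.
austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))
austen_quadgrams <- austen_books() %>%
  unnest_tokens(quadgram, text, token = "ngrams", n = 4) %>%
  filter(!is.na(quadgram))

# Most frequent bigrams and four-word n-grams across all novels
austen_bigrams %>% count(bigram, sort = TRUE)
austen_quadgrams %>% count(quadgram, sort = TRUE)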
```
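These are the most frequent bigrams and four-word n-grams in Jane Austen's novels. Most of the top bigrams are pairs of very common words such as "of the", which is why stop words are filtered out before counting again.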
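To make the tokenization concrete, here is a minimal sketch on a single sentence. The `toy` data frame is an assumption introduced only for illustration; `unnest_tokens()` is called the same way as above. It shows that n-grams are overlapping windows, so each word (except the first and last) appears in two bigrams.

```{r}
library(dplyr)
library(tidytext)

# Toy input: one row of text (illustrative only, not part of the analysis)
toy <- tibble(text = "it is a truth universally acknowledged")

# Six words yield five overlapping bigrams, lowercased by default
toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
toy_bigrams
```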
```{r}
# Split each bigram into two columns, then drop pairs containing a stop word
bigrams_separated <- austen_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)
# Count the remaining word pairs, most frequent first
bigram_counts <- bigrams_filtered %>% count(word1, word2, sort = TRUE)
bigram_counts
```
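The separate-and-filter step above can be checked on a handful of hand-picked bigrams. This is a small sketch, with `toy_pairs` as a hypothetical input that is not part of the analysis. A bigram survives only if neither of its words appears in tidytext's `stop_words` table.

```{r}
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical input: two stop-word pairs and two name pairs
toy_pairs <- tibble(bigram = c("of the", "sir thomas", "to be", "miss crawford"))

toy_filtered <- toy_pairs %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)

# Only the two name pairs should remain
toy_filtered
```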
Next, reunite the filtered bigrams and compute tf-idf to find the bigrams most distinctive of each book:
```{r}
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigram_tf_idf <- bigrams_united %>%
  count(book, bigram) %>%
  bind_tf_idf(bigram, book, n) %>%
  arrange(desc(tf_idf))
bigram_tf_idf
```
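To see what `bind_tf_idf()` computes, the sketch below reproduces it by hand on a tiny table. The counts in `toy_counts` are made up for illustration. Term frequency is a bigram's count divided by the total count in its book, and inverse document frequency is the natural log of the number of books divided by the number of books containing the bigram.

```{r}
library(dplyr)
library(tidytext)

# Made-up counts for two hypothetical books (illustrative only)
toy_counts <- tibble(
  book   = c("A", "A", "B", "B"),
  bigram = c("sir thomas", "miss crawford", "sir thomas", "captain wentworth"),
  n      = c(3, 1, 1, 3)
)

toy_tf_idf <- toy_counts %>% bind_tf_idf(bigram, book, n)

# The same quantities computed step by step
n_books <- n_distinct(toy_counts$book)
by_hand <- toy_counts %>%
  group_by(book) %>%
  mutate(tf = n / sum(n)) %>%                      # share of the book's bigrams
  group_by(bigram) %>%
  mutate(idf = log(n_books / n_distinct(book))) %>% # rarity across books
  ungroup() %>%
  mutate(tf_idf = tf * idf)

all.equal(by_hand$tf_idf, toy_tf_idf$tf_idf)
```

A bigram that occurs in every book ("sir thomas" here) gets an idf of zero, so its tf-idf is zero no matter how often it appears.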
Now let's build a directed graph from the bigram counts, keeping only word pairs that appear more than 20 times:
```{r}
bigram_graph <- bigram_counts %>%
  filter(n > 20) %>%
  graph_from_data_frame()
bigram_graph
```
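Printing the graph object only lists its vertices and edges. One possible way to draw it is igraph's base `plot()` method, sketched here on a small edge list. The counts in `toy_counts` are made up, and the styling values are arbitrary choices rather than anything taken from the notebook.

```{r}
library(igraph)

# Hypothetical bigram counts, made up for illustration only
toy_counts <- data.frame(
  word1 = c("miss", "miss", "sir"),
  word2 = c("crawford", "bennet", "thomas"),
  n     = c(30, 25, 22)
)

# The first two columns become directed edges; n is kept as an edge attribute
toy_graph <- graph_from_data_frame(toy_counts)

set.seed(2017)  # the force-directed layout is random, so fix the seed
plot(
  toy_graph,
  layout = layout_with_fr(toy_graph),
  edge.width = E(toy_graph)$n / 10,  # thicker edges for more frequent pairs
  edge.arrow.size = 0.5,
  vertex.size = 20
)
```

The same call should work with `bigram_graph` in place of `toy_graph`, although a graph with many vertices may need smaller labels and vertex sizes to stay readable.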