---
title: "Intro to Text Mining"
author: "Pantea Ferdosian, Kevin Hoffman, Luke Moles, Marissa Shand"
output:
  html_document:
    toc: TRUE
    theme: united
    toc_depth: 3
    number_sections: TRUE
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Text Mining Motivation
# Working with text data
"A **token** is a meaningful unit of text, such as wa word, that we are interested in using for analysis, and tokenization is the process of splitting test into tokens" [1]. Can be a word, n-gram, sentence or paragraph.
## Packages we will be using
```{r, warning = FALSE, message = FALSE}
library(tidyverse)
library(tidytext)
library(gutenbergr)
```
## Exploring Project Gutenberg
"Project Gutenberg is an online library of free eBooks. Project Gutenberg was the first provider of free electronic books, or eBooks." [2]
```{r}
## Find books by Charles Dickens
dickens <- gutenberg_works(author == 'Dickens, Charles')
dim(dickens)
head(dickens)
## We will be working with A Tale of Two Cities, which has id 98
two_cities <- gutenberg_download(98)
head(two_cities)
## There are three books that make up this book
## Get book number
two_cities <- two_cities %>% mutate(book = cumsum(str_detect(text, regex("^Book the"))),
                                    linenumber = row_number())
## For each book get the linenumber in the book, and the chapter
## Roman numerals: https://www.oreilly.com/library/view/regular-expressions-cookbook/9780596802837/ch06s09.html
two_cities <- two_cities %>% group_by(book) %>%
  mutate(book_linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})[.]")))) %>%
  ungroup()
## Convert to tidy text
tidy_two_cities <- two_cities %>% unnest_tokens(word, text)
head(tidy_two_cities)
```
## Stop words
```{r}
## Error arose earlier when joining: https://stackoverflow.com/questions/9221310/r-debugging-only-0s-may-be-mixed-with-negative-subscripts
data(stop_words)
stop_words
## Remove stop words from the tokenized text
tidy_two_cities <- tidy_two_cities %>% anti_join(stop_words, by = "word")
head(tidy_two_cities)
```
# Sentiment Analysis
## Sentiment Lexicons
tidytext provides access to three general-purpose lexicons [4]:
1. AFINN: assigns each word an integer score from -5 (most negative) to 5 (most positive)
2. Bing: classifies each word as positive or negative
3. NRC: classifies each word as positive or negative and into eight basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust)
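As a minimal sketch of how these lexicons are used (following the approach in [1], and assuming the `tidy_two_cities` object created above), we can join the tokens against the Bing lexicon and tally positive versus negative words in 80-line sections of the novel. Depending on your tidytext version, lexicons other than Bing may require a one-time download via the textdata package.
```{r, warning = FALSE, message = FALSE}
## Bing lexicon: each word labelled "positive" or "negative"
bing <- get_sentiments("bing")

two_cities_sentiment <- tidy_two_cities %>%
  inner_join(bing, by = "word") %>%                      # keep only words in the lexicon
  count(book, index = linenumber %/% 80, sentiment) %>%  # tally by 80-line section
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)                # net sentiment per section

head(two_cities_sentiment)
```
The net sentiment column could then be plotted against `index` to trace the emotional arc of each of the three books.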
# Relationships between words: n-grams and correlations
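As a minimal sketch of bigram tokenization (reusing the `two_cities` data frame from above), `unnest_tokens()` with `token = "ngrams"` and `n = 2` splits the text into overlapping two-word sequences, which can then be separated into their component words, filtered against the stop word list, and counted.
```{r, warning = FALSE, message = FALSE}
## Tokenize into bigrams (pairs of consecutive words)
two_cities_bigrams <- two_cities %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))  # blank lines produce NA bigrams

## Split each bigram into its two words, drop pairs containing stop words, and count
two_cities_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  head()
```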
# Latent Dirichlet Allocation
- Intro to Topic Modeling
  - similar to clustering numeric data, except the clusters are topics made up of words
  - the goal is to discover latent patterns in a collection of documents
- Examples
  - apply it to a large batch of emails to understand what topics were discussed
  - presidential speeches, to identify themes
  - a collection of tweets from a group, to identify what topics people tweet about
- Latent Dirichlet allocation is an unsupervised method for finding topics in a collection of documents
  - a probabilistic generative model
  - every document is a mixture of topics, e.g. document 1 is 60% topic 1 and 40% topic 2
  - every topic is a mixture of words, e.g. 10% "apples", 5% "oranges", etc. (perhaps the topic is fruit)
- Algorithm
  - From the paper [5]:
    - Choose $N$, the document length, from a Poisson distribution
    - Choose $\theta$ from a Dirichlet($\alpha$) distribution
    - For each of the $N$ words:
      - Choose a topic $z_n$ from the Multinomial($\theta$) distribution
      - Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, the probability of the word given its topic
  - In simpler terms:
    - first, select $k$, the number of topics
    - randomly assign each word in each document to one of the $k$ topics
    - for each word, calculate the proportion of words in its document assigned to each topic
    - calculate the proportion of each topic's word assignments, across all documents, that come from this word
    - reassign the word to a new topic by using Gibbs sampling to draw from the resulting posterior (see the update formula after this list)
    - repeat the sampling for a number of draws
- LDA Example: randomly select 20 books from 10 popular authors on Project Gutenberg. How many authors (topics) are present?
  - authors play the role of topics
  - books play the role of documents
  - Tune $k$ with the ldatuning package: https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
    - rather than hand-fitting 10 separate models ($k = 1, \dots, 10$), use this package
  - Discuss hyperparameter tuning
  - Identify and discuss one of the 4 criteria for evaluating each LDA model
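To make the "sample the posterior with Gibbs" step above concrete, one common form of the collapsed Gibbs sampling update (the notation here is ours, a sketch rather than the exact formulation in [5]) reassigns word $w_i$ in document $d$ to topic $t$ with probability

$$
P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n_{d,t}^{-i} + \alpha}{\sum_{t'} \left( n_{d,t'}^{-i} + \alpha \right)}
\cdot
\frac{n_{t,w_i}^{-i} + \beta}{\sum_{w'} \left( n_{t,w'}^{-i} + \beta \right)}
$$

where $n_{d,t}$ counts the words in document $d$ currently assigned to topic $t$, $n_{t,w}$ counts the assignments of word $w$ to topic $t$ across all documents, the $-i$ superscript excludes the current word's own assignment, and $\alpha$, $\beta$ are the Dirichlet hyperparameters. The first factor is the "proportion of words in the document assigned to the topic" and the second is the "proportion of the topic's assignments that come from this word" described in the list above.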
## LDA Example
```{r}
library(gutenbergr)
library(topicmodels) # provides LDA()
# Set of popular authors
authors.popular <-
c(
"Dickens, Charles",
"Austen, Jane",
"Shelley, Mary Wollstonecraft",
"Twain, Mark",
"Doyle, Arthur Conan",
"Wilde, Oscar",
"Leech, John",
"Hawthorne, Nathaniel",
"Stevenson, Robert Louis",
"Carroll, Lewis"
)
# Download all books by these authors
books.authors.popular <- gutenberg_metadata %>%
filter(
author %in% authors.popular,
language == "en", # Only english
!str_detect(title, "Works"), # Ignore collections of works
has_text,
!str_detect(rights, "Copyright")
) %>%
distinct(title, .keep_all = TRUE) %>%
select(gutenberg_id, title)
# Select a random sample of 20 books (seed set so the sample is reproducible)
set.seed(555)
books.selection <- books.authors.popular %>% sample_n(20)
# Download the 20 books
books.list <-
books.selection$gutenberg_id %>% gutenberg_download(meta_fields = "title")
# Clean the text by removing blank rows
books.text <- books.list %>%
  filter(text != '') %>%    # Remove blank lines
  select(-gutenberg_id) %>% # Drop the id
  unite(document, title)    # Treat each book title as one document
words.by.book <- books.text %>%
unnest_tokens(word, text)
# Generate word counts for each word in our documents
word.counts <- words.by.book %>%
anti_join(stop_words) %>%
count(document, word, sort = TRUE) %>%
ungroup()
# Create a Document Term Matrix
books_dtm <- word.counts %>%
cast_dtm(document, word, n)
# Try a model with 5 topics (authors)
books.lda <- LDA(books_dtm, k = 5, control = list(seed = 555))
# TODO:
# - Show some visualizations of words and topics and the associated probabilities
# - Tune k with ldatuning package
# - Show results and evaluate if we chose the correct k
```
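As a rough sketch of the first TODO item above (assuming the `books.lda` object fitted in the previous chunk), the model can be tidied into per-topic word probabilities (the "beta" matrix) and the most probable terms in each topic plotted.
```{r, warning = FALSE, message = FALSE}
## Per-topic word probabilities from the fitted model
books.topics <- tidy(books.lda, matrix = "beta")

## Top 10 terms for each topic
top.terms <- books.topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

## Plot the most probable words within each topic
top.terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
```
The same idea with `tidy(books.lda, matrix = "gamma")` gives per-document topic proportions, which is what we would inspect to see whether books by the same author land in the same topic. For the remaining TODOs, `ldatuning::FindTopicsNumber()` and `FindTopicsNumber_plot()` can score a range of k values against the four criteria mentioned in the notes above.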
# References
1. [Text Mining with R](https://www.tidytextmining.com/index.html)
2. [Project Gutenberg](https://www.gutenberg.org/)
3. [Roman Numerals with Regex](https://www.oreilly.com/library/view/regular-expressions-cookbook/9780596802837/ch06s09.html)
4. [Sentiment Datasets](https://www.datacamp.com/community/tutorials/sentiment-analysis-R)
5. [Latent Dirichlet Allocation](http://www.cse.cuhk.edu.hk/irwin.king/_media/presentations/latent_dirichlet_allocation.pdf)