---
title: "Milestone Report"
author: "Edilmo Palencia"
date: "March 20, 2016"
output: html_document
---

## Theoretical Background ##

Natural language has long been a hard topic to approach. It is studied by several 
branches of science, such as psychology, philosophy, mathematics and biology.
To avoid confusion when addressing a language problem, it is important to 
distinguish the kinds of problems related to language and to identify which one 
the specific problem belongs to:

* Communication problems: how the symbols are transmitted or should be 
transmitted.
* Semantic problems: what the symbols mean.
* Effectiveness problems: how the meaning of the symbols produces the desired 
effect.

The final project of the capstone course is to create a natural language model that 
predicts the next word given the preceding text and nothing else. This limitation is 
important because it places the problem at the level of communication, which means 
that the theory of Markov processes is the right tool to approach the problem and 
that no context variables can be considered.
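
As an illustration of the Markov view described above, the following is a minimal sketch of a first-order (bigram) next-word predictor. It is not the model built in this report: the toy corpus and the names `toyLines`, `bigramTable` and `predictNext` are assumptions introduced only for this example.

```r
# Toy corpus (hypothetical, for illustration only)
toyLines <- c("the cat sat on the mat",
              "the cat ate the fish",
              "the dog sat on the rug")

# Count every adjacent word pair: rows are the previous word,
# columns are the word that follows it
bigramTable <- function(lines) {
    tokens <- strsplit(lines, " ", fixed = TRUE)
    prev <- unlist(lapply(tokens, function(w) head(w, -1)))
    nxt  <- unlist(lapply(tokens, function(w) tail(w, -1)))
    table(prev, nxt)
}

# Predict the most likely next word given only the previous word
predictNext <- function(tab, word) {
    if (!(word %in% rownames(tab))) return(NA_character_)
    probs <- tab[word, ] / sum(tab[word, ])
    names(which.max(probs))
}

toyTable <- bigramTable(toyLines)
predictNext(toyTable, "the")
```

The prediction depends only on the previous word, which is the Markov assumption; a higher-order model would condition on the previous two or more words in the same way.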

## Data Summary ##
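```{r, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(error = TRUE)
# R.utils provides filePath(), which the chunks below call before the
# later library(R.utils) line is reached
library(R.utils)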
```

```{r corpusAccess, echo=FALSE, cache=TRUE, message=FALSE}
if(file.exists("Corpus10.RData")){
    load("Corpus10.RData")
}else{
    # Create an empty data frame where the corpus description is going to be stored
    # One row per document
    corpusDescription <- data.frame(
        row.names = c("src", "lan", "src-type", "f-name", "f-path"), 
        stringsAsFactors = FALSE)
    # Check if the corpus has been downloaded
    if(!file.exists("corpus")){
        download.file(
                "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                "corpus.zip", "curl")
        unzip("corpus.zip")
        dir.create("corpus")
        file.rename("final","corpus/HC Corpora")
        file.remove("corpus.zip")
    }
    # There is a directory per Source, so build a list of sources. 
    corpusSources <- list.dirs(path = "corpus", full.names = FALSE, 
                   recursive = FALSE)
    # Loop over the sources
    for(s in corpusSources){
        # There is a directory per Language, so build a list of 
        # languages for this source. 
        corpusLanguages <- list.dirs(path = filePath("corpus",s), 
                         full.names = FALSE, recursive = FALSE)
        for(l in corpusLanguages){
            # There is a corpus document per source type, so build a list 
            # of documents with its source type
            filelist <- list.files(filePath("corpus",s,l), 
                           full.names = FALSE, recursive = FALSE)
            for(f in filelist){
                # Let's extract the source type from the file name
                st <- strsplit(f,".",fixed = TRUE)
                # Let's build the full path of the file
                fname <- filePath("corpus",s,l,f)
                # Create the row to add in the data frame
                r <- c(s,l,st[[1]][2],f,fname)
                # Add the new row as a column
                corpusDescription <- cbind(corpusDescription,r)
            }
        }
    }
    # Let's transpose the data frame because all the rows were added as columns
    corpusDescription <- t(corpusDescription)
}
```

```{r corpusLoading, echo=FALSE, cache=TRUE, message=FALSE}
library(tm)
if(!exists("readCorpus")){
    # Let's load the corpus using the tm package
    readCorpus <- function(src,lan){
        Corpus(DirSource(directory = filePath("corpus",src,lan),
                        encoding = "",
                        pattern = NULL,
                        recursive = FALSE,
                        ignore.case = FALSE,
                        mode = "text"),
                    readerControl = list(reader = readPlain,
                             language = lan,
                             load = FALSE))
    }
    corpusHCcorporaEnUS <- readCorpus("HC Corpora","en_US")
}
```

```{r corpusQuestions, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("charCountPerLineEnUSblogs")){
    # Count the number of characters per line in each document of the English
    # corpus (nchar is vectorized over the lines)
    charCountPerLineEnUSblogs <- nchar(
        corpusHCcorporaEnUS[["en_US.blogs.txt"]]$content)
    charCountPerLineEnUSnews <- nchar(
        corpusHCcorporaEnUS[["en_US.news.txt"]]$content)
    charCountPerLineEnUStwitter <- nchar(
        corpusHCcorporaEnUS[["en_US.twitter.txt"]]$content)
    
    # Compute the length of the longest line in each document
    longestLineLenghtEnUSblogs <- max(charCountPerLineEnUSblogs)
    longestLineLenghtEnUSnews <- max(charCountPerLineEnUSnews)
    longestLineLenghtEnUStwitter <- max(charCountPerLineEnUStwitter)
    
}
if(!exists("lineCountEnUStwitter")){
    # Let's compute the amount of characters in the blogs document
    charCountEnUSblogs <- sum(charCountPerLineEnUSblogs)
    # Let's compute the amount of lines in the blogs document
    lineCountEnUSblogs <- length(corpusHCcorporaEnUS[["en_US.blogs.txt"]]$content)
    # Let's compute the amount of characters in the news document
    charCountEnUSnews <- sum(charCountPerLineEnUSnews)
    # Let's compute the amount of lines in the news document
    lineCountEnUSnews <- length(corpusHCcorporaEnUS[["en_US.news.txt"]]$content)
    # Let's compute the amount of characters in the twitter document
    charCountEnUStwitter <- sum(charCountPerLineEnUStwitter)
    # Let's compute the amount of lines in the twitter document
    lineCountEnUStwitter <- length(corpusHCcorporaEnUS[["en_US.twitter.txt"]]$content)
}   
```

```{r simpleTokenization, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("corpusHCcorporaEnUS.OnlyLetters")){
    # Removing the numbers from the corpus
    corpusHCcorporaEnUS.OnlyLetters <- tm_map(corpusHCcorporaEnUS, removeNumbers, 
              lazy = FALSE)
    # Removing the punctuation from the corpus
    corpusHCcorporaEnUS.OnlyLetters <- tm_map(corpusHCcorporaEnUS.OnlyLetters, 
                          removePunctuation, lazy = FALSE)
    # Removing the extra white spaces from the corpus
    corpusHCcorporaEnUS.OnlyLetters <- tm_map(corpusHCcorporaEnUS.OnlyLetters, 
                          stripWhitespace, lazy = FALSE)
    # Removing the white spaces at the beginning and the end
    corpusHCcorporaEnUS.OnlyLetters <- tm_map(corpusHCcorporaEnUS.OnlyLetters, 
                          content_transformer(trim))
    # Dictionary with the vocabulary and the index of each word
    # Use a hashed environment that contains one object per word
    # The object name is the word itself and the value is the index
    # of the word. The index is used for the tokenization
    vocabularyHCcorporaEnUSbyWord <- new.env(hash = TRUE)
    # Dictionary vector where the index points to the corresponding word
    vocabularyHCcorporaEnUSbyIndex <- NULL
    # Dictionary vector where the index points to the frequency of the 
    # corresponding word in the blogs document of the english corpus
    wordFreqHCcorporaEnUSblogs <- NULL
    # Dictionary vector where the index points to the tokenized version of the 
    # corresponding line in the blogs document of the english corpus. The tokenized
    # version of a line is a vector of integers with the index of each word.
    tokenizedHCcorporaEnUSblogs <- NULL
    # Dictionary vector where the index points to the frequency of the 
    # corresponding word in the news document of the english corpus
    wordFreqHCcorporaEnUSnews <- NULL
    # Dictionary vector where the index points to the tokenized version of the 
    # corresponding line in the news document of the english corpus. The tokenized
    # version of a line is a vector of integers with the index of each word.
    tokenizedHCcorporaEnUSnews <- NULL
    # Dictionary vector where the index points to the frequency of the 
    # corresponding word in the twitter document of the english corpus
    wordFreqHCcorporaEnUStwitter <- NULL
    # Dictionary vector where the index points to the tokenized version of the 
    # corresponding line in the twitter document of the english corpus. The tokenized
    # version of a line is a vector of integers with the index of each word.
    tokenizedHCcorporaEnUStwitter <- NULL
    # temporary variable used to store word-frequency dictionaries
    temp2 <- NULL
    # Function used to tokenize a line of a document
    procDocLine <- function(docLine){
        # result variable to store the tokenized version of the line
        tokenizeLine <- numeric(length = length(docLine))
        # index variable of the word to tokenize
        i <- 1
        for(w in docLine){
            # get the index of the word in our vocabulary dictionary
            wIndex <- vocabularyHCcorporaEnUSbyWord[[w]]  
            # check if the word exist in our vocabulary
            if(is.null(wIndex)){
                # the index for a new word is the current length of the 
                # vocabulary plus one
                wIndex <- length(vocabularyHCcorporaEnUSbyIndex) + 1
                # add the word to the vocabulary dictionary by word
                vocabularyHCcorporaEnUSbyWord[[w]] <<- wIndex
                # add the word to the vocabulary dictionary by index
                vocabularyHCcorporaEnUSbyIndex[wIndex] <<- w
                # initialize the frequency for the word to 0
                temp2[wIndex] <<- 0
            }
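            # increase the frequency of the word; a word first seen in an
            # earlier document has no count yet in the current one
            wCount <- if(wIndex <= length(temp2)) temp2[wIndex] else NA
            if(is.na(wCount)) wCount <- 0
            temp2[wIndex] <<- wCount + 1
            # append the index of the word to the tokenized line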
            tokenizeLine[i] <- wIndex
            # increase the index of the word to process
            i <- i + 1
        }
        # return the tokenized line
        tokenizeLine
    }
    # Start the timer used to report the time elapsed per document
    currentTime <- Sys.time()
    # split each line by white spaces in the blogs document
    temp <- strsplit(
        corpusHCcorporaEnUS.OnlyLetters[["en_US.blogs.txt"]]$content," ",
        fixed = TRUE)
    # clear the temp2 variable
    temp2 <- NULL
    # process each split line
    tokenizedHCcorporaEnUSblogs <- lapply(temp, procDocLine)
    # save the temporary frequency dictionary
    wordFreqHCcorporaEnUSblogs <- temp2
    # save an image of the environment
    save.image("Corpus.RData")
    # Report the time elapsed and restart the timer
    message("Elapsed: ", format(Sys.time() - currentTime))
    currentTime <- Sys.time()
    # split each line by white spaces in the news document
    temp <- strsplit(
        corpusHCcorporaEnUS.OnlyLetters[["en_US.news.txt"]]$content," ",
        fixed = TRUE)
    # clear the temp2 variable
    temp2 <- NULL
    # process each split line
    tokenizedHCcorporaEnUSnews <- lapply(temp, procDocLine)
    # save the temporary frequency dictionary
    wordFreqHCcorporaEnUSnews <- temp2
    # save an image of the environment
    save.image("Corpus.RData")
    # Report the time elapsed and restart the timer
    message("Elapsed: ", format(Sys.time() - currentTime))
    currentTime <- Sys.time()
    # split each line by white spaces in the twitter document
    temp <- strsplit(
        corpusHCcorporaEnUS.OnlyLetters[["en_US.twitter.txt"]]$content," ",
        fixed = TRUE)
    # clear the temp2 variable
    temp2 <- NULL
    # process each split line
    tokenizedHCcorporaEnUStwitter <- lapply(temp, procDocLine)
    # save the temporary frequency dictionary
    wordFreqHCcorporaEnUStwitter <- temp2
    # save an image of the environment
    save.image("Corpus.RData")
}
```

```{r profanitySourcing, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("badWords")){
    # Load a bad word list
    temp <- url("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
    badWords <- readLines(temp)
    # Normalize the list: lower case, trimmed, without empty entries
    badWords <- tolower(badWords)
    badWords <- trim(badWords)
    badWords <- badWords[which(badWords!="")]
    close(temp)
    # Load a swear word list
    temp <- url("http://www.bannedwordlist.com/lists/swearWords.txt")
    swearWords <- readLines(temp)
    # Normalize the list: lower case, trimmed, without empty entries
    swearWords <- tolower(swearWords)
    swearWords <- trim(swearWords)
    swearWords <- swearWords[which(swearWords!="")]
    close(temp)
    # Merge bad and swear words
    temp <- swearWords %in% badWords
    bad_swearWords <- c(badWords, swearWords[which(!temp)])
}
```

A dataset was provided to build the model. This report focuses on the English 
part of the corpus, which can be summarized as follows:

```{r dataSummary, echo=FALSE, cache=TRUE, message=FALSE}
library(knitr)
library(R.utils)
# One row per document of the English corpus
corpusEnglishSummary <- data.frame(
    "File Source" = c("Twitter", "News", "Blogs"),
    "Line Count" = c(lineCountEnUStwitter, lineCountEnUSnews, lineCountEnUSblogs),
    "Char Count" = c(charCountEnUStwitter, charCountEnUSnews, charCountEnUSblogs),
    "Longest Line (in chars)" = c(longestLineLenghtEnUStwitter, 
                                  longestLineLenghtEnUSnews, 
                                  longestLineLenghtEnUSblogs),
    check.names = FALSE, stringsAsFactors = FALSE)
# Print the data frame
kable(corpusEnglishSummary)
```
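
The three figures in the summary table (line count, character count and longest line) can be obtained from a character vector holding one line per element. The sketch below shows the idea on a hypothetical in-memory document; `toyDoc` and `summarizeLines` are names assumed for this example and are not part of the report's code.

```r
# Hypothetical in-memory document: one element per line
toyDoc <- c("first line", "a somewhat longer second line", "end")

# Summarize a document given as a character vector of lines
summarizeLines <- function(lines) {
    charsPerLine <- nchar(lines)   # nchar is vectorized over the lines
    data.frame(lineCount = length(lines),
               charCount = sum(charsPerLine),
               longestLine = max(charsPerLine))
}

toySummary <- summarizeLines(toyDoc)
toySummary
```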

## Vocabulary & profanities  ##
Two word lists were found on the internet: one of bad words and another of 
swear words.  

```{r profanityTally, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("profanyDFEnUStwitter")){
    # Logical vector with the index of the vocabulary words present in the
    # swear and bad word list
    bad_swearWordsInVocabulary <- vocabularyHCcorporaEnUSbyIndex %in% bad_swearWords
    # Frequency of the bad-swear words
    profanyFreqEnUSblogs <- 
        wordFreqHCcorporaEnUSblogs[which(bad_swearWordsInVocabulary)]
    profanyFreqEnUSnews <- 
        wordFreqHCcorporaEnUSnews[which(bad_swearWordsInVocabulary)]
    profanyFreqEnUStwitter <- 
        wordFreqHCcorporaEnUStwitter[which(bad_swearWordsInVocabulary)]
    # bad-swear words
    profanyEnUS <- 
        vocabularyHCcorporaEnUSbyIndex[which(bad_swearWordsInVocabulary)]
    # Data frame of Frequency of the bad-swear words
    profanyDFEnUSblogs <- data.frame(profanyEnUS,profanyFreqEnUSblogs)
    profanyDFEnUSblogs <- profanyDFEnUSblogs[order(profanyFreqEnUSblogs,profanyEnUS, 
                     decreasing = TRUE),]
    profanyDFEnUSnews <- data.frame(profanyEnUS,profanyFreqEnUSnews)
    profanyDFEnUSnews <- profanyDFEnUSnews[order(profanyFreqEnUSnews,profanyEnUS, 
                     decreasing = TRUE),]
    profanyDFEnUStwitter <- data.frame(profanyEnUS,profanyFreqEnUStwitter)
    profanyDFEnUStwitter <- profanyDFEnUStwitter[order(profanyFreqEnUStwitter,profanyEnUS, 
                     decreasing = TRUE),]
    ammountOfBad_SwearWordsInVocabularyEnUS <- length(profanyEnUS)
}
amountRows <- 10
consolidatedProfanyEnUS <- cbind(profanyDFEnUStwitter[1:amountRows,], 
                                 profanyDFEnUSnews[1:amountRows,], 
                                 profanyDFEnUSblogs[1:amountRows,])
colnames(consolidatedProfanyEnUS) <- c("Twitter: word", "freq", 
                                       "News: word", "freq", 
                                       "Blogs: word", "freq")
rownames(consolidatedProfanyEnUS) <- c()
kable(consolidatedProfanyEnUS, 
      caption = "The 10 most frequent bad and swear words in EN-US per source")
```

The table above shows the 10 most frequent bad and swear words of the vocabulary 
in each source, with their frequencies. The total number of bad and swear words 
in the vocabulary is `r ammountOfBad_SwearWordsInVocabularyEnUS`.
As can be seen, most of these words are not offensive by definition; whether they 
are depends on the context. So we prefer not to remove the bad and swear words, 
and instead to look for approaches that allow the user of the model to decide 
what to do with possible bad or swear words.
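
One way to leave that decision to the user is to filter the candidate predictions only when asked to. The following is a sketch under stated assumptions, not the approach adopted in this report: `filterPredictions`, its `allowProfanity` switch and the toy vectors are hypothetical.

```r
# Hypothetical candidate predictions, ordered by decreasing probability
candidates <- c("damn", "day", "dog")
# Hypothetical profanity list standing in for the merged word lists
toyProfanity <- c("damn", "hell")

# Drop flagged words unless the user explicitly allows them
filterPredictions <- function(candidates, profanity, allowProfanity = FALSE) {
    if (allowProfanity) return(candidates)
    candidates[!(tolower(candidates) %in% profanity)]
}

filterPredictions(candidates, toyProfanity)
filterPredictions(candidates, toyProfanity, allowProfanity = TRUE)
```

Keeping the flagged words in the vocabulary and filtering at prediction time means the model itself does not need to be rebuilt when the user changes the setting.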

## Distribution of the frequencies of the vocabulary in En US blogs  

```{r vocabullaryStatsEnUSblogs, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("wordTotalHCcorporaEnUSblogs")){
    # Total of words in the document
    wordTotalHCcorporaEnUSblogs <- sum(
        wordFreqHCcorporaEnUSblogs[which(wordFreqHCcorporaEnUSblogs != 0)])
    # Number of words with a frequency less than or equal to 5
    wordFreq0_5lHCcorporaEnUSblogs <- 
        sum(wordFreqHCcorporaEnUSblogs %in% 1:5)
    # Number of words with a frequency less than or equal to 9
    wordFreq0_9lHCcorporaEnUSblogs <- 
        sum(wordFreqHCcorporaEnUSblogs %in% 1:9)
    # Fraction of the words used in the document that have a 
    # frequency less than or equal to 5
    wordFreq0_5lHCcorporaEnUSblogsPercentege <- 
        wordFreq0_5lHCcorporaEnUSblogs/
        length(which(wordFreqHCcorporaEnUSblogs != 0))
    # Fraction of the words used in the document that have a 
    # frequency less than or equal to 9
    wordFreq0_9lHCcorporaEnUSblogsPercentege <- 
        wordFreq0_9lHCcorporaEnUSblogs/
        length(which(wordFreqHCcorporaEnUSblogs != 0))
    
    # Word indexes ordered by decreasing frequency
    wordOrderHCcorporaEnUSblogs <- order(wordFreqHCcorporaEnUSblogs, decreasing = TRUE)
    # Fraction of all word occurrences covered by the 10000 most frequent words
    percentege10kWordMoreFreqHCcorporaEnUSblogs <- sum(
        wordFreqHCcorporaEnUSblogs[wordOrderHCcorporaEnUSblogs[1:10000]]) /
        wordTotalHCcorporaEnUSblogs
    # Fraction of all word occurrences covered by the 2700 most frequent words
    percentege2700WordMoreFreqHCcorporaEnUSblogs <- sum(
        wordFreqHCcorporaEnUSblogs[wordOrderHCcorporaEnUSblogs[1:2700]]) /
        wordTotalHCcorporaEnUSblogs
    # Fraction of all word occurrences covered by the 140 most frequent words
    percentege140WordMoreFreqHCcorporaEnUSblogs <- sum(
        wordFreqHCcorporaEnUSblogs[wordOrderHCcorporaEnUSblogs[1:140]]) /
        wordTotalHCcorporaEnUSblogs
    # Fraction of all word occurrences covered by the 15 most frequent words
    percentege15WordMoreFreqHCcorporaEnUSblogs <- sum(
        wordFreqHCcorporaEnUSblogs[wordOrderHCcorporaEnUSblogs[1:15]]) /
        wordTotalHCcorporaEnUSblogs
}
hist(log10(wordFreqHCcorporaEnUSblogs[wordOrderHCcorporaEnUSblogs[1:10000]]), 
     main = "Log10 distribution of the 10000 most frequent words of EnUSblogs", 
     xlab = "log10 of word frequency")
```

The vocabulary contains `r length(vocabularyHCcorporaEnUSbyIndex)` words.
It was built from the set of all distinct words present in all the documents.
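
To make the idea of a shared vocabulary concrete, the sketch below builds a word index and per-word frequencies for a toy input. It uses vectorized base R calls (`unique`, `match`, `tabulate`) rather than the loop used in this report, so it is an alternative illustration only; `toyText`, `words`, `vocabulary`, `tokens` and `wordFreq` are assumed names.

```r
# Hypothetical cleaned lines (lower case, no punctuation)
toyText <- c("to be or not to be", "to do")

# All words in reading order
words <- unlist(strsplit(toyText, " ", fixed = TRUE))
# index -> word, in order of first appearance
vocabulary <- unique(words)
# word -> index: the tokenized form of the text
tokens <- match(words, vocabulary)
# frequency of each vocabulary word
wordFreq <- tabulate(tokens, nbins = length(vocabulary))

data.frame(word = vocabulary, freq = wordFreq)
```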

Facts for the EnUS blogs document:  

* The percentage of the vocabulary used in the document is 
`r length(which(wordFreqHCcorporaEnUSblogs != 0))/length(vocabularyHCcorporaEnUSbyIndex)*100`%  
* The percentage of the words used that have a 
frequency less than or equal to 5 is 
`r wordFreq0_5lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of the words used that have a 
frequency less than or equal to 9 is 
`r wordFreq0_9lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of word occurrences in the document covered by the 10K (`r (10000/length(which(wordFreqHCcorporaEnUSblogs != 0)))*100`% of the words used) most 
frequent words is `r percentege10kWordMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of word occurrences in the document covered by the 2.7K (`r (2700/length(which(wordFreqHCcorporaEnUSblogs != 0)))*100`% of the words used) most 
frequent words is `r percentege2700WordMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of word occurrences in the document covered by the 140 (`r (140/length(which(wordFreqHCcorporaEnUSblogs != 0)))*100`% of the words used) most 
frequent words is `r percentege140WordMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of word occurrences in the document covered by the 15 (`r (15/length(which(wordFreqHCcorporaEnUSblogs != 0)))*100`% of the words used) most 
frequent words is `r percentege15WordMoreFreqHCcorporaEnUSblogs*100`%  
* The 15 most frequent words are: 
`r vocabularyHCcorporaEnUSbyIndex[wordOrderHCcorporaEnUSblogs[1:15]]`  
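
The coverage figures in the list above follow from sorting the frequencies and accumulating them. A minimal sketch on made-up numbers, where `toyFreq`, `coverageOfTop` and `wordsNeeded` are hypothetical names introduced for this example:

```r
# Hypothetical word frequencies, one entry per vocabulary word
toyFreq <- c(50, 20, 10, 10, 5, 3, 1, 1)

# Share of all word occurrences covered by the n most frequent words
coverageOfTop <- function(freq, n) {
    sorted <- sort(freq[!is.na(freq) & freq > 0], decreasing = TRUE)
    sum(head(sorted, n)) / sum(sorted)
}

# Smallest number of most frequent words whose share reaches the target
wordsNeeded <- function(freq, target) {
    sorted <- sort(freq[!is.na(freq) & freq > 0], decreasing = TRUE)
    which(cumsum(sorted) / sum(sorted) >= target)[1]
}

coverageOfTop(toyFreq, 2)
wordsNeeded(toyFreq, 0.85)
```

The second function answers the inverse question, which can help when choosing how many words to keep for the n-gram tables.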

## Distribution of the frequencies of the vocabulary in En US news  

```{r vocabullaryStatsEnUSnews, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("wordTotalHCcorporaEnUSnews")){
    # Total of words in the document
    wordTotalHCcorporaEnUSnews <- sum(
        wordFreqHCcorporaEnUSnews[which(wordFreqHCcorporaEnUSnews != 0)])
    # Number of words with a frequency less than or equal to 2
    wordFreq0_2lHCcorporaEnUSnews <- 
        sum(wordFreqHCcorporaEnUSnews %in% 1:2)
    # Number of words with a frequency less than or equal to 5
    wordFreq0_5lHCcorporaEnUSnews <- 
        sum(wordFreqHCcorporaEnUSnews %in% 1:5)
    # Number of words with a frequency less than or equal to 9
    wordFreq0_9lHCcorporaEnUSnews <- 
        sum(wordFreqHCcorporaEnUSnews %in% 1:9)
    # Fraction of the words used in the document that have a 
    # frequency less than or equal to 2
    wordFreq0_2lHCcorporaEnUSnewsPercentege <- 
        wordFreq0_2lHCcorporaEnUSnews/
        length(which(wordFreqHCcorporaEnUSnews != 0))
    # Fraction of the words used in the document that have a 
    # frequency less than or equal to 5
    wordFreq0_5lHCcorporaEnUSnewsPercentege <- 
        wordFreq0_5lHCcorporaEnUSnews/
        length(which(wordFreqHCcorporaEnUSnews != 0))
    # Fraction of the words used in the document that have a 
    # frequency less than or equal to 9
    wordFreq0_9lHCcorporaEnUSnewsPercentege <- 
        wordFreq0_9lHCcorporaEnUSnews/
        length(which(wordFreqHCcorporaEnUSnews != 0))
    
    # Word indexes ordered by decreasing frequency
    wordOrderHCcorporaEnUSnews <- order(wordFreqHCcorporaEnUSnews, decreasing = TRUE)
    # Fraction of all word occurrences covered by the 165000 most frequent words
    percentege165kWordMoreFreqHCcorporaEnUSnews <- sum(
        wordFreqHCcorporaEnUSnews[wordOrderHCcorporaEnUSnews[1:165000]]) /
        wordTotalHCcorporaEnUSnews
    # Fraction of all word occurrences covered by the 115000 most frequent words
    percentege115KWordMoreFreqHCcorporaEnUSnews <- sum(
        wordFreqHCcorporaEnUSnews[wordOrderHCcorporaEnUSnews[1:115000]]) /
        wordTotalHCcorporaEnUSnews
    # Fraction of all word occurrences covered by the 23000 most frequent words
    percentege23KWordMoreFreqHCcorporaEnUSnews <- sum(
        wordFreqHCcorporaEnUSnews[wordOrderHCcorporaEnUSnews[1:23000]]) /
        wordTotalHCcorporaEnUSnews
    # Fraction of all word occurrences covered by the 4000 most frequent words
    percentege4kWordMoreFreqHCcorporaEnUSnews <- sum(
        wordFreqHCcorporaEnUSnews[wordOrderHCcorporaEnUSnews[1:4000]]) /
        wordTotalHCcorporaEnUSnews
}
hist(log10(wordFreqHCcorporaEnUSnews[wordOrderHCcorporaEnUSnews[1:115000]]), 
     main = "Log10 distribution of the 115000 most frequent words of EnUSnews", 
     xlab = "log10 of word frequency")
```

Facts for the EnUS news document:  

* The percentage of the vocabulary used in the document is 
`r length(which(wordFreqHCcorporaEnUSnews != 0))/length(vocabularyHCcorporaEnUSbyIndex)*100`%  
* The percentage of the words used that have a 
frequency less than or equal to 2 is 
`r wordFreq0_2lHCcorporaEnUSnewsPercentege*100`%  
* The percentage of the words used that have a 
frequency less than or equal to 5 is 
`r wordFreq0_5lHCcorporaEnUSnewsPercentege*100`%  
* The percentage of the words used that have a 
frequency less than or equal to 9 is 
`r wordFreq0_9lHCcorporaEnUSnewsPercentege*100`%  
* The percentage of word occurrences in the document covered by the 165K (`r (165000/length(which(wordFreqHCcorporaEnUSnews != 0)))*100`% of the words used) most 
frequent words is `r percentege165kWordMoreFreqHCcorporaEnUSnews*100`%  
* The percentage of word occurrences in the document covered by the 115K (`r (115000/length(which(wordFreqHCcorporaEnUSnews != 0)))*100`% of the words used) most 
frequent words is `r percentege115KWordMoreFreqHCcorporaEnUSnews*100`%  
* The percentage of word occurrences in the document covered by the 23K (`r (23000/length(which(wordFreqHCcorporaEnUSnews != 0)))*100`% of the words used) most 
frequent words is `r percentege23KWordMoreFreqHCcorporaEnUSnews*100`%  
* The percentage of word occurrences in the document covered by the 4K (`r (4000/length(which(wordFreqHCcorporaEnUSnews != 0)))*100`% of the words used) most 
frequent words is `r percentege4kWordMoreFreqHCcorporaEnUSnews*100`%  
* The 15 most frequent words are: 
`r vocabularyHCcorporaEnUSbyIndex[wordOrderHCcorporaEnUSnews[1:15]]`  

## Distribution of the frequencies of the vocabulary in En US tweets  

```{r vocabullaryStatsEnUStwitter, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("wordTotalHCcorporaEnUStwitter")){
    # Total of words in the document
    wordTotalHCcorporaEnUStwitter <- sum(
        wordFreqHCcorporaEnUStwitter[which(wordFreqHCcorporaEnUStwitter != 0)])
    # Number of words with a frequency less than or equal to 2
    wordFreq0_2lHCcorporaEnUStwitter <- 
        sum(wordFreqHCcorporaEnUStwitter %in% 1:2)
    # Number of words with a frequency less than or equal to 5
    wordFreq0_5lHCcorporaEnUStwitter <- 
        sum(wordFreqHCcorporaEnUStwitter %in% 1:5)
    # Number of words with a frequency less than or equal to 9
    wordFreq0_9lHCcorporaEnUStwitter <- 
        sum(wordFreqHCcorporaEnUStwitter %in% 1:9)
    # Fraction of the words used in the document that have a 
    # frequency less than or equal to 2
    wordFreq0_2lHCcorporaEnUStwitterPercentege <- 
        wordFreq0_2lHCcorporaEnUStwitter/
        length(which(wordFreqHCcorporaEnUStwitter != 0))
    # Fraction of the words used in the document that have a 
    # frequency less than or equal to 5
    wordFreq0_5lHCcorporaEnUStwitterPercentege <- 
        wordFreq0_5lHCcorporaEnUStwitter/
        length(which(wordFreqHCcorporaEnUStwitter != 0))
    # Fraction of the words used in the document that have a 
    # frequency less than or equal to 9
    wordFreq0_9lHCcorporaEnUStwitterPercentege <- 
        wordFreq0_9lHCcorporaEnUStwitter/
        length(which(wordFreqHCcorporaEnUStwitter != 0))
    
    # Word indexes ordered by decreasing frequency
    wordOrderHCcorporaEnUStwitter <- order(wordFreqHCcorporaEnUStwitter, decreasing = TRUE)
    # Fraction of all word occurrences covered by the 350000 most frequent words
    percentege350kWordMoreFreqHCcorporaEnUStwitter <- sum(
        wordFreqHCcorporaEnUStwitter[wordOrderHCcorporaEnUStwitter[1:350000]]) /
        wordTotalHCcorporaEnUStwitter
    # Fraction of all word occurrences covered by the 285000 most frequent words
    percentege285KWordMoreFreqHCcorporaEnUStwitter <- sum(
        wordFreqHCcorporaEnUStwitter[wordOrderHCcorporaEnUStwitter[1:285000]]) /
        wordTotalHCcorporaEnUStwitter
    # Fraction of all word occurrences covered by the 80000 most frequent words
    percentege80KWordMoreFreqHCcorporaEnUStwitter <- sum(
        wordFreqHCcorporaEnUStwitter[wordOrderHCcorporaEnUStwitter[1:80000]]) /
        wordTotalHCcorporaEnUStwitter
    # Fraction of all word occurrences covered by the 11000 most frequent words
    percentege11KWordMoreFreqHCcorporaEnUStwitter <- sum(
        wordFreqHCcorporaEnUStwitter[wordOrderHCcorporaEnUStwitter[1:11000]]) /
        wordTotalHCcorporaEnUStwitter
}
hist(log10(wordFreqHCcorporaEnUStwitter[wordOrderHCcorporaEnUStwitter[1:285000]]), 
     main = "Log10 distribution of the 285000 most frequent words of EnUStwitter", 
     xlab = "log10 of word frequency")
```

Facts for the EnUS tweets document:  

* The percentage of the vocabulary used in the document is 
`r length(which(wordFreqHCcorporaEnUStwitter != 0))/length(vocabularyHCcorporaEnUSbyIndex)*100`%  
* The percentage of the words used that have a 
frequency less than or equal to 5 is 
`r wordFreq0_5lHCcorporaEnUStwitterPercentege*100`%  
* The percentage of the words used that have a 
frequency less than or equal to 9 is 
`r wordFreq0_9lHCcorporaEnUStwitterPercentege*100`%  
* The percentage of word occurrences in the document covered by the 350K (`r (350000/length(which(wordFreqHCcorporaEnUStwitter != 0)))*100`% of the words used) most 
frequent words is `r percentege350kWordMoreFreqHCcorporaEnUStwitter*100`%  
* The percentage of word occurrences in the document covered by the 285K (`r (285000/length(which(wordFreqHCcorporaEnUStwitter != 0)))*100`% of the words used) most 
frequent words is `r percentege285KWordMoreFreqHCcorporaEnUStwitter*100`%  
* The percentage of word occurrences in the document covered by the 80K (`r (80000/length(which(wordFreqHCcorporaEnUStwitter != 0)))*100`% of the words used) most 
frequent words is `r percentege80KWordMoreFreqHCcorporaEnUStwitter*100`%  
* The percentage of word occurrences in the document covered by the 11K (`r (11000/length(which(wordFreqHCcorporaEnUStwitter != 0)))*100`% of the words used) most 
frequent words is `r percentege11KWordMoreFreqHCcorporaEnUStwitter*100`%  
* The 15 most frequent words are: 
`r vocabularyHCcorporaEnUSbyIndex[wordOrderHCcorporaEnUStwitter[1:15]]`  

## Frequencies of word pairs and triplets for EnUSblogs  

```{r nGramFreq, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("wordPairFreqHCcorporaEnUSblogs")){
    # Two environments that hold the 2-gram and 3-gram counts
    wordPairFreqHCcorporaEnUSblogs <- new.env(hash = TRUE)
    wordTripletFreqHCcorporaEnUSblogs <- new.env(hash = TRUE)
    nGramIndexEnUSblogs <- rep(-1,length(wordOrderHCcorporaEnUSblogs))
    for(i in 1:10000){
          nGramIndexEnUSblogs[wordOrderHCcorporaEnUSblogs[i]]<- i  
    }
    wordPairCountEnUSblogs <- 0
    wordTripletCountEnUSblogs <- 0
    
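    # The dimensions have been set to keep only the n-grams that include the 
    # most frequent words. Words ranked beyond each limit share one extra 
    # index (10001, 2701 or 141) that stands for an unknown word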
    procTokLine <- function(tL){
        temp2 <- length(tL)
        for(i in 1:temp2) {
            if((i+1)<=temp2){
                i2 <- nGramIndexEnUSblogs[tL[i]]
                if(i2==-1 || i2>2700)
                    i2 <- 2701
                i1 <- nGramIndexEnUSblogs[tL[i+1]]
                if(i1==-1)
                    i1 <- 10001
                pI <- paste0(i1,"_",i2)
                pC <- wordPairFreqHCcorporaEnUSblogs[[pI]]  
                if(is.null(pC))
                    pC <- 0
                wordPairFreqHCcorporaEnUSblogs[[pI]] <<- pC + 1
                if((i+2)<=temp2){
                    i3 <- nGramIndexEnUSblogs[tL[i]]
                    if(i3==-1 || i3>140)
                        i3 <- 141
                    i2 <- nGramIndexEnUSblogs[tL[i+1]]
                    if(i2==-1 || i2>2700)
                        i2 <- 2701
                    # The third word of the triplet is at position i+2
                    i1 <- nGramIndexEnUSblogs[tL[i+2]]
                    if(i1==-1 || i1>2700)
                        i1 <- 2701
                    tI <- paste0(i1,"_",i2, "_", i3)
                    tC <- wordTripletFreqHCcorporaEnUSblogs[[tI]]  
                    if(is.null(tC))
                        tC <- 0
                    wordTripletFreqHCcorporaEnUSblogs[[tI]] <<- tC + 1
                    wordTripletCountEnUSblogs <<- 
                        wordTripletCountEnUSblogs + 1
                }
                wordPairCountEnUSblogs <<- wordPairCountEnUSblogs + 1
            }
            else
            {
                break
            }
        }
    }
}
```
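
The counting scheme used for the n-grams can be illustrated with a minimal sketch. It assumes that each line is already a vector of word frequency ranks (1 is the most frequent word). The names `countPairs`, `toyLines` and `maxPrev` are illustrative and are not part of the report code, and the cap of 3 is chosen only for the toy data.

```r
# Minimal sketch: count word pairs with an environment used as a hash map.
# Keys have the form "next_previous", and the rank of the previous word is
# capped so that rare words share a single "unknown" index (maxPrev + 1).
countPairs <- function(tokLines, maxPrev) {
    counts <- new.env(hash = TRUE)
    for (tL in tokLines) {
        if (length(tL) < 2) next
        for (i in seq_len(length(tL) - 1)) {
            prev <- if (tL[i] > maxPrev) maxPrev + 1 else tL[i]
            key <- paste0(tL[i + 1], "_", prev)
            old <- counts[[key]]
            counts[[key]] <- if (is.null(old)) 1 else old + 1
        }
    }
    counts
}

# Hypothetical toy data: three lines of word ranks
toyLines <- list(c(1, 2, 3), c(1, 2), c(5, 1, 2))
pairCounts <- countPairs(toyLines, maxPrev = 3)
pairCounts[["2_1"]]  # number of times rank 1 is followed by rank 2
```

Capping the ranks bounds the number of possible keys, which is the reason the report code limits the previous word to the most frequent ranks.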

```{r nGramFreq1, echo=FALSE, cache=TRUE, message=FALSE}
if(wordPairCountEnUSblogs != 0){
    rm(wordPairFreqHCcorporaEnUSblogs,wordTripletFreqHCcorporaEnUSblogs,
           wordPairCountEnUSblogs, wordTripletCountEnUSblogs)
    wordPairFreqHCcorporaEnUSblogs <- new.env(hash = TRUE)
    wordTripletFreqHCcorporaEnUSblogs <- new.env(hash = TRUE)
    wordPairCountEnUSblogs <- 0
    wordTripletCountEnUSblogs <- 0
}
loadCorpus <- TRUE
if(!file.exists("Corpus1.RData")){
    loadCorpus <- FALSE
    tokenizedHCcorporaEnUSblogsPiece <- 
        split(1:length(tokenizedHCcorporaEnUSblogs), 1:10)

    currentTime <- Sys.time()
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[1]]], 
        procTokLine)
    # Print the time spent counting the n-grams of this piece
    print(Sys.time() - currentTime)
    currentTime <- Sys.time()
    # Save the n-gram environments and counters as a checkpoint
    save(wordPairFreqHCcorporaEnUSblogs,wordTripletFreqHCcorporaEnUSblogs,
        wordPairCountEnUSblogs, wordTripletCountEnUSblogs, 
        file = "Corpus1.RData")
    # Print the time spent saving the checkpoint
    print(Sys.time() - currentTime)
}
```

```{r nGramFreq2, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus2.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus1.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[2]]], 
        procTokLine)
    # Print the time spent counting the n-grams of this piece
    print(Sys.time() - currentTime)
    currentTime <- Sys.time()
    # Save an image of the workspace as a checkpoint
    save.image("Corpus2.RData")
    # Print the time spent saving the checkpoint
    print(Sys.time() - currentTime)
}
```

```{r nGramFreq3, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus3.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus2.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[3]]], 
        procTokLine)
    # Print the time spent counting the n-grams of this piece
    print(Sys.time() - currentTime)
    currentTime <- Sys.time()
    # Save an image of the workspace as a checkpoint
    save.image("Corpus3.RData")
    # Print the time spent saving the checkpoint
    print(Sys.time() - currentTime)
}
```

```{r nGramFreq4, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus4.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus3.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[4]]], 
        procTokLine)
    # Print the time spent counting the n-grams of this piece
    print(Sys.time() - currentTime)
    currentTime <- Sys.time()
    # Save an image of the workspace as a checkpoint
    save.image("Corpus4.RData")
    # Print the time spent saving the checkpoint
    print(Sys.time() - currentTime)
}
```

```{r nGramFreq5, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus5.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus4.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[5]]], 
        procTokLine)
    # Print the time spent counting the n-grams of this piece
    print(Sys.time() - currentTime)
    currentTime <- Sys.time()
    # Save an image of the workspace as a checkpoint
    save.image("Corpus5.RData")
    # Print the time spent saving the checkpoint
    print(Sys.time() - currentTime)
}
```

```{r nGramFreq6, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus6.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus5.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[6]]], 
        procTokLine)
    # Print the time spent counting the n-grams of this piece
    print(Sys.time() - currentTime)
    currentTime <- Sys.time()
    # Save an image of the workspace as a checkpoint
    save.image("Corpus6.RData")
    # Print the time spent saving the checkpoint
    print(Sys.time() - currentTime)
}
```

```{r nGramFreq7, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus7.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus6.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[7]]], 
        procTokLine)
    # Print the time spent counting the n-grams of this piece
    print(Sys.time() - currentTime)
    currentTime <- Sys.time()
    # Save an image of the workspace as a checkpoint
    save.image("Corpus7.RData")
    # Print the time spent saving the checkpoint
    print(Sys.time() - currentTime)
}
```

```{r nGramFreq8, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus8.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus7.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[8]]], 
        procTokLine)
    # Print the time spent counting the n-grams of this piece
    print(Sys.time() - currentTime)
    currentTime <- Sys.time()
    # Save an image of the workspace as a checkpoint
    save.image("Corpus8.RData")
    # Print the time spent saving the checkpoint
    print(Sys.time() - currentTime)
}
```

```{r nGramFreq9, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus9.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus8.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[9]]], 
        procTokLine)
    # Print the time spent counting the n-grams of this piece
    print(Sys.time() - currentTime)
    currentTime <- Sys.time()
    # Save an image of the workspace as a checkpoint
    save.image("Corpus9.RData")
    # Print the time spent saving the checkpoint
    print(Sys.time() - currentTime)
}
```

```{r nGramFreq10, echo=FALSE, message=FALSE, cache=TRUE}
if(!file.exists("Corpus10.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus9.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[10]]], 
        procTokLine)
    # Print the time spent counting the n-grams of this piece
    print(Sys.time() - currentTime)
    currentTime <- Sys.time()
    # Save an image of the workspace as a checkpoint
    save.image("Corpus10.RData")
    # Print the time spent saving the checkpoint
    print(Sys.time() - currentTime)
}
```
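
The ten chunks above follow one pattern: process a piece of the corpus, then save a checkpoint so that an interrupted run can resume. A compact way to express the same idea is sketched below. It is not the code used in this report: `processInPieces`, `processOne` and `checkpointDir` are hypothetical names, the state is reduced to a single running total, and `saveRDS()` is used in place of `save.image()` to keep the example small.

```r
# Minimal sketch: process a list in pieces and checkpoint after each piece.
# Each checkpoint stores the cumulative state reached after that piece, so a
# later run reads the checkpoints that exist and only computes the missing ones.
processInPieces <- function(items, nPieces, processOne, checkpointDir) {
    pieces <- split(seq_along(items),
                    rep_len(seq_len(nPieces), length(items)))
    total <- 0
    for (p in seq_along(pieces)) {
        checkpoint <- file.path(checkpointDir, paste0("piece", p, ".rds"))
        if (file.exists(checkpoint)) {
            # This piece was processed in an earlier run: restore its state
            total <- readRDS(checkpoint)
            next
        }
        for (idx in pieces[[p]]) {
            total <- total + processOne(items[[idx]])
        }
        saveRDS(total, checkpoint)
    }
    total
}

# Hypothetical toy data: count the tokens of four lines in two pieces
checkpointDir <- tempfile("checkpoints")
dir.create(checkpointDir)
toyLines <- list(c(1, 2, 3), c(4, 5), c(6), c(7, 8, 9, 10))
firstRun <- processInPieces(toyLines, 2, length, checkpointDir)
# The second run finds both checkpoints and does not recompute anything
secondRun <- processInPieces(toyLines, 2, length, checkpointDir)
```

Saving only the state that is needed, rather than the whole workspace, keeps the checkpoint files small and avoids restoring unrelated objects on reload.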

### Word pairs stats  

```{r biGramStatsEnUSblogs, echo=FALSE, cache=TRUE, message=FALSE}
if(file.exists("CorpusBiGramStatsEnUSblogs.RData")){
    load("CorpusBiGramStatsEnUSblogs.RData")
}else{
    # Extract the word pair frequencies from the environment used as a hash 
    # map. mget() returns a named list, so the "i1_i2" keys are kept as names
    gram2FreqHCcorporaEnUSblogs <- unlist(mget(
        ls(wordPairFreqHCcorporaEnUSblogs), 
        envir = wordPairFreqHCcorporaEnUSblogs))

    # Total number of pairs in the document
    pairTotalHCcorporaEnUSblogs <- sum(gram2FreqHCcorporaEnUSblogs)
    # Number of distinct pairs with a frequency less than or equal to 5
    pairFreq0_5lHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs >= 1 & gram2FreqHCcorporaEnUSblogs <= 5)
    # Number of distinct pairs with a frequency less than or equal to 9
    pairFreq0_9lHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs >= 1 & gram2FreqHCcorporaEnUSblogs <= 9)
    # Fraction of the distinct bigrams that have a frequency less than or 
    # equal to 5
    pairFreq0_5lHCcorporaEnUSblogsPercentege <- 
        pairFreq0_5lHCcorporaEnUSblogs/
        length(which(gram2FreqHCcorporaEnUSblogs != 0))
    # Fraction of the distinct bigrams that have a frequency less than or 
    # equal to 9
    pairFreq0_9lHCcorporaEnUSblogsPercentege <- 
        pairFreq0_9lHCcorporaEnUSblogs/
        length(which(gram2FreqHCcorporaEnUSblogs != 0))
    
    # Pair indexes ordered by decreasing frequency
    pairOrderHCcorporaEnUSblogs <- order(gram2FreqHCcorporaEnUSblogs, decreasing = TRUE)
    # Fraction of the bigram occurrences covered by the 10000 most frequent 
    # bigrams
    percentege10kPairMoreFreqHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs[pairOrderHCcorporaEnUSblogs[1:10000]]) /
        pairTotalHCcorporaEnUSblogs
    # Fraction covered by the 2700 most frequent bigrams
    percentege2700PairMoreFreqHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs[pairOrderHCcorporaEnUSblogs[1:2700]]) /
        pairTotalHCcorporaEnUSblogs
    # Fraction covered by the 140 most frequent bigrams
    percentege140PairMoreFreqHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs[pairOrderHCcorporaEnUSblogs[1:140]]) /
        pairTotalHCcorporaEnUSblogs
    # Fraction covered by the 15 most frequent bigrams
    percentege15PairMoreFreqHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs[pairOrderHCcorporaEnUSblogs[1:15]]) /
        pairTotalHCcorporaEnUSblogs
    save(percentege15PairMoreFreqHCcorporaEnUSblogs, 
         percentege140PairMoreFreqHCcorporaEnUSblogs, 
         percentege2700PairMoreFreqHCcorporaEnUSblogs,
         percentege10kPairMoreFreqHCcorporaEnUSblogs, 
         pairFreq0_9lHCcorporaEnUSblogsPercentege, 
         pairFreq0_5lHCcorporaEnUSblogsPercentege, 
         pairFreq0_5lHCcorporaEnUSblogs,
         pairFreq0_9lHCcorporaEnUSblogs,
         gram2FreqHCcorporaEnUSblogs,
         pairOrderHCcorporaEnUSblogs,
        file = "CorpusBiGramStatsEnUSblogs.RData")
}
hist(log10(unlist(gram2FreqHCcorporaEnUSblogs[pairOrderHCcorporaEnUSblogs])), 
     main = "Log10 Distribution of bi-grams of EnUSblogs", 
     xlab = "log10 of bigram frequency")
```

* The percentage of bigrams used is 
`r length(which(gram2FreqHCcorporaEnUSblogs != 0))/length(gram2FreqHCcorporaEnUSblogs)*100`%  
* The percentage of distinct bigrams that have a 
frequency less than or equal to 5 is 
`r pairFreq0_5lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of distinct bigrams that have a 
frequency less than or equal to 9 is 
`r pairFreq0_9lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of bigrams in the document covered by the 10K (`r (10000/length(which(gram2FreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent bigrams is `r percentege10kPairMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of bigrams in the document covered by the 2.7K (`r (2700/length(which(gram2FreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent bigrams is `r percentege2700PairMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of bigrams in the document covered by the 140 (`r (140/length(which(gram2FreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent bigrams is `r percentege140PairMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of bigrams in the document covered by the 15 (`r (15/length(which(gram2FreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent bigrams is `r percentege15PairMoreFreqHCcorporaEnUSblogs*100`%  
* The 15 most frequent bigrams are:  

```{r biGramMoreFreq, echo=FALSE, cache=TRUE, message=FALSE}
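# Each key has the form "i1_i2": i1 is the frequency rank of the second word 
# of the pair and i2 is the rank of the first word (see procTokLine). The 
# ranks are mapped back to words through wordOrderHCcorporaEnUSblogs and the 
# pair is shown in reading order
topPairsEnUSblogs <- t(sapply(
    names(gram2FreqHCcorporaEnUSblogs)[pairOrderHCcorporaEnUSblogs[1:15]], 
    function(pair){
        ranks <- as.numeric(strsplit(pair, "_", fixed = TRUE)[[1]])
        rankToWord <- function(rank, unkIndex){
            if(rank == unkIndex)
                return("UNK")
            vocabularyHCcorporaEnUSbyIndex[wordOrderHCcorporaEnUSblogs[rank]]
        }
        c(rankToWord(ranks[2], 2701), rankToWord(ranks[1], 10001))
}))
knitr::kable(topPairsEnUSblogs, row.names = FALSE, 
             col.names = c("First word", "Second word"))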
```

### Word triplets stats  

```{r triGramStatsEnUSblogs, echo=FALSE, cache=TRUE, message=FALSE}
if(file.exists("CorpusTriGramStatsEnUSblogs.RData")){
    load("CorpusTriGramStatsEnUSblogs.RData")
}else{
    # Extract the word triplet frequencies from the environment used as a hash 
    # map. mget() returns a named list, so the "i1_i2_i3" keys are kept as names
    gram3FreqHCcorporaEnUSblogs <- unlist(mget(
        ls(wordTripletFreqHCcorporaEnUSblogs), 
        envir = wordTripletFreqHCcorporaEnUSblogs))

    # Total number of triplets in the document
    tripletTotalHCcorporaEnUSblogs <- sum(gram3FreqHCcorporaEnUSblogs)
    # Number of distinct triplets with a frequency less than or equal to 5
    tripletFreq0_5lHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs >= 1 & gram3FreqHCcorporaEnUSblogs <= 5)
    # Number of distinct triplets with a frequency less than or equal to 9
    tripletFreq0_9lHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs >= 1 & gram3FreqHCcorporaEnUSblogs <= 9)
    # Fraction of the distinct trigrams that have a frequency less than or 
    # equal to 5
    tripletFreq0_5lHCcorporaEnUSblogsPercentege <- 
        tripletFreq0_5lHCcorporaEnUSblogs/
        length(which(gram3FreqHCcorporaEnUSblogs != 0))
    # Fraction of the distinct trigrams that have a frequency less than or 
    # equal to 9
    tripletFreq0_9lHCcorporaEnUSblogsPercentege <- 
        tripletFreq0_9lHCcorporaEnUSblogs/
        length(which(gram3FreqHCcorporaEnUSblogs != 0))
    
    # Triplet indexes ordered by decreasing frequency
    tripletOrderHCcorporaEnUSblogs <- order(gram3FreqHCcorporaEnUSblogs, decreasing = TRUE)
    # Fraction of the trigram occurrences covered by the 10000 most frequent 
    # trigrams
    percentege10kTripletMoreFreqHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs[tripletOrderHCcorporaEnUSblogs[1:10000]]) /
        tripletTotalHCcorporaEnUSblogs
    # Fraction covered by the 2700 most frequent trigrams
    percentege2700TripletMoreFreqHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs[tripletOrderHCcorporaEnUSblogs[1:2700]]) /
        tripletTotalHCcorporaEnUSblogs
    # Fraction covered by the 140 most frequent trigrams
    percentege140TripletMoreFreqHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs[tripletOrderHCcorporaEnUSblogs[1:140]]) /
        tripletTotalHCcorporaEnUSblogs
    # Fraction covered by the 15 most frequent trigrams
    percentege15TripletMoreFreqHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs[tripletOrderHCcorporaEnUSblogs[1:15]]) /
        tripletTotalHCcorporaEnUSblogs

    save(percentege15TripletMoreFreqHCcorporaEnUSblogs, 
         percentege140TripletMoreFreqHCcorporaEnUSblogs, 
         percentege2700TripletMoreFreqHCcorporaEnUSblogs,
         percentege10kTripletMoreFreqHCcorporaEnUSblogs, 
         tripletFreq0_9lHCcorporaEnUSblogsPercentege, 
         tripletFreq0_5lHCcorporaEnUSblogsPercentege, 
         tripletFreq0_5lHCcorporaEnUSblogs,
         tripletFreq0_9lHCcorporaEnUSblogs,
         gram3FreqHCcorporaEnUSblogs,
         tripletOrderHCcorporaEnUSblogs,
        file = "CorpusTriGramStatsEnUSblogs.RData")
}

hist(log10(unlist(gram3FreqHCcorporaEnUSblogs[tripletOrderHCcorporaEnUSblogs])), 
     main = "Log10 Distribution of tri-grams of EnUSblogs", 
     xlab = "log10 of trigram frequency")
```

* The percentage of trigrams used is 
`r length(which(gram3FreqHCcorporaEnUSblogs != 0))/length(gram3FreqHCcorporaEnUSblogs)*100`%  
* The percentage of distinct trigrams that have a 
frequency less than or equal to 5 is 
`r tripletFreq0_5lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of distinct trigrams that have a 
frequency less than or equal to 9 is 
`r tripletFreq0_9lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of trigrams in the document covered by the 10K (`r (10000/length(which(gram3FreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent trigrams is `r percentege10kTripletMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of trigrams in the document covered by the 2.7K (`r (2700/length(which(gram3FreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent trigrams is `r percentege2700TripletMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of trigrams in the document covered by the 140 (`r (140/length(which(gram3FreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent trigrams is `r percentege140TripletMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of trigrams in the document covered by the 15 (`r (15/length(which(gram3FreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent trigrams is `r percentege15TripletMoreFreqHCcorporaEnUSblogs*100`%  
* The 15 most frequent trigrams are: 

```{r triGramMoreFreq, echo=FALSE, cache=TRUE, message=FALSE}
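# Each key has the form "i1_i2_i3", laid out as third_second_first: i3 is the 
# frequency rank of the first word of the triplet, i2 of the second and i1 of 
# the third (see procTokLine). The ranks are mapped back to words through 
# wordOrderHCcorporaEnUSblogs and the triplet is shown in reading order
topTripletsEnUSblogs <- t(sapply(
    names(gram3FreqHCcorporaEnUSblogs)[tripletOrderHCcorporaEnUSblogs[1:15]], 
    function(triplet){
        ranks <- as.numeric(strsplit(triplet, "_", fixed = TRUE)[[1]])
        rankToWord <- function(rank, unkIndex){
            if(rank == unkIndex)
                return("UNK")
            vocabularyHCcorporaEnUSbyIndex[wordOrderHCcorporaEnUSblogs[rank]]
        }
        c(rankToWord(ranks[3], 141), rankToWord(ranks[2], 2701), 
          rankToWord(ranks[1], 2701))
}))
knitr::kable(topTripletsEnUSblogs, row.names = FALSE, 
             col.names = c("First word", "Second word", "Third word"))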
```

## Questions about frequency of words  

1. Some words are more frequent than others - what are the distributions of word 
frequencies?  
    * People tend to use as few words as possible. The log10 of the 
    frequency has an exponential decay distribution.  
2. What are the frequencies of 2-grams and 3-grams in the dataset? 
    * The behaviour is the same: people tend to use as few n-grams as
    possible. The log10 of the frequency has an exponential decay distribution.   
3. How many unique words do you need in a frequency sorted dictionary to cover 
50% of all word instances in the language? 90%?  
    * As shown above, it depends on the source (blogs, news, twitter, 
    etc.). But it is clear that around 20% of the vocabulary covers more 
    than 80% of the words.  
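
The coverage question can be answered directly from a vector of word frequencies. The sketch below is a minimal illustration with made-up counts: `wordsNeededForCoverage` and `toyFreq` are hypothetical names and the numbers are not taken from the corpus.

```r
# Minimal sketch: number of distinct words needed, in a dictionary sorted by
# decreasing frequency, to cover a given share of all word instances.
wordsNeededForCoverage <- function(freq, coverage) {
    sortedFreq <- sort(freq[freq > 0], decreasing = TRUE)
    cumulativeShare <- cumsum(sortedFreq) / sum(sortedFreq)
    # First position where the cumulative share reaches the target
    unname(which(cumulativeShare >= coverage)[1])
}

# Hypothetical toy frequencies (100 word instances in total)
toyFreq <- c(the = 50, of = 20, and = 15, cat = 10, zebra = 5)
wordsFor50 <- wordsNeededForCoverage(toyFreq, 0.5)
wordsFor90 <- wordsNeededForCoverage(toyFreq, 0.9)
```

With these toy counts one word is enough to reach 50% and four words are needed to reach 90%. The same function could be applied to the word frequency vector of each source.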

## Conclusions

Based on the evidence shown above, the plan is to create a model based on a 
Markov process that exploits the exponential behaviour of the log of the 
distribution of words.
In addition, in order to provide tools to manage profanity, some methods to 
consider context are going to be evaluated.
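
As a rough illustration of the kind of model planned, the sketch below builds a first-order Markov model from bigram counts and predicts the most likely next word. It is only a sketch under simple assumptions: the sentences are toy data, the tokenizer is a plain split on non-letters, there is no smoothing or back-off, and `buildTransitions` and `predictNext` are hypothetical names, not the final implementation.

```r
# Minimal sketch: first-order Markov model estimated from bigram counts.
buildTransitions <- function(sentences) {
    tokens <- strsplit(tolower(sentences), "[^a-z']+")
    # Pair every word with the word that follows it in the same sentence
    prev <- unlist(lapply(tokens, function(w) head(w, -1)))
    nxt <- unlist(lapply(tokens, function(w) tail(w, -1)))
    counts <- table(prev, nxt)
    # Each row becomes the distribution of the next word given the previous one
    counts / rowSums(counts)
}

predictNext <- function(transitions, word) {
    # Words never seen as a previous word cannot be predicted by this sketch
    if (!(word %in% rownames(transitions))) return(NA_character_)
    colnames(transitions)[which.max(transitions[word, ])]
}

# Hypothetical toy corpus
toySentences <- c("i love data", "i love r", "i like data")
transitions <- buildTransitions(toySentences)
nextAfterI <- predictNext(transitions, "i")
nextAfterUnseen <- predictNext(transitions, "zebra")
```

In the toy corpus "i" is followed by "love" in two of three cases, so "love" is predicted. A word that was never seen returns `NA`, which is the case that the unknown-word index and a back-off strategy would have to handle in the real model.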