Notebook.Rmd

---
title: "NoteBook"
author: "Edilmo Palencia"
output: html_document
---

# Task 0 - Obtaining the data & Familiarizing with NLP

The corpus:  

```{r corpusDescription, echo=FALSE, cache=TRUE, message=FALSE}
currentTime <- Sys.time()
library(knitr)
library(R.utils)
# Check if there is a previous run already saved
if(file.exists("Corpus.RData")){
    load("Corpus.RData")
}else{
    # Create an empty data frame where the corpus description is going to be stored
    # One row per document
    corpusDescription <- data.frame(
        row.names = c("src", "lan", "src-type", "f-name", "f-path"), 
        stringsAsFactors = FALSE)
    # Check if the corpus has been downloaded
    if(!file.exists("corpus")){
        download.file(
                "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                "corpus.zip", "curl")
        unzip("corpus.zip")
        dir.create("corpus")
        file.rename("final","corpus/HC Corpora")
        file.remove("corpus.zip")
    }
    # There is a directory per Source, so build a list of sources. 
    corpusSources <- list.dirs(path = "corpus", full.names = FALSE, 
                   recursive = FALSE)
    # Loop over the sources
    for(s in corpusSources){
        # There is a directory per Language, so build a list of 
        # languages for this source. 
        corpusLanguages <- list.dirs(path = filePath("corpus",s), 
                         full.names = FALSE, recursive = FALSE)
        for(l in corpusLanguages){
            # There is a corpus document per source type, so build a list 
            # of documents with its source type
            filelist <- list.files(filePath("corpus",s,l), 
                           full.names = FALSE, recursive = FALSE)
            for(f in filelist){
                # Let's extract the source type from the file name
                st <- strsplit(f,".",fixed = TRUE)
                # Let's build the full path of the file
                fname <- filePath("corpus",s,l,f)
                # Create the row to add in the data frame
                r <- c(s,l,st[[1]][2],f,fname)
                # Add the new row as a column
                corpusDescription <- cbind(corpusDescription,r)
            }
        }
    }
    # Let's transpose the data frame because all the rows were added as columns
    corpusDescription <- t(corpusDescription)
    # Print time elapsed
    Sys.time() - currentTime
}
# Print the data frame
kable(corpusDescription)
```
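
The directory walk above can also be written without growing a data frame column by column and transposing it afterwards. The following is only a minimal sketch, assuming the same `corpus/<source>/<language>/<file>` layout; the helper name `describeCorpus`, the column names and the temporary demo directory are illustrative and not part of the notebook.

```r
# Hypothetical helper: one row per corpus file, derived from its relative path
describeCorpus <- function(root) {
    # Relative paths such as "HC Corpora/en_US/en_US.blogs.txt"
    relPaths <- list.files(root, recursive = TRUE)
    parts <- strsplit(relPaths, "/", fixed = TRUE)
    # Keep only files that sit exactly two directories below the root
    keep <- lengths(parts) == 3
    parts <- parts[keep]
    fileNames <- vapply(parts, `[`, character(1), 3)
    data.frame(
        src = vapply(parts, `[`, character(1), 1),
        lan = vapply(parts, `[`, character(1), 2),
        # The source type is the second dot-separated field of the file name
        srcType = vapply(strsplit(fileNames, ".", fixed = TRUE),
                         `[`, character(1), 2),
        fName = fileNames,
        fPath = file.path(root, relPaths[keep]),
        stringsAsFactors = FALSE)
}

# Toy layout in a temporary directory, used only for illustration
root <- file.path(tempdir(), "corpusDemo")
dir.create(file.path(root, "HC Corpora", "en_US"),
           recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "HC Corpora", "en_US",
                      c("en_US.blogs.txt", "en_US.news.txt")))
demoDescription <- describeCorpus(root)
```

Because the result is already a data frame with one row per file, it can be passed to `kable()` directly.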

## Questions about the HC Corpora  

```{r corpusLoading, echo=FALSE, cache=TRUE, message=FALSE}
library(tm)
if(!exists("readCorpus")){
    currentTime <- Sys.time()
    # Let's load the corpus using the tm package
    readCorpus <- function(src,lan){
        Corpus(DirSource(directory = filePath("corpus",src,lan),
                        encoding = "",
                        pattern = NULL,
                        recursive = FALSE,
                        ignore.case = FALSE,
                        mode = "text"),
                    readerControl = list(reader = readPlain,
                             language = lan,
                             load = FALSE))
    }
    corpusHCcorporaEnUS <- readCorpus("HC Corpora","en_US")
    #corpusHCcorporaDeDE <- readCorpus("HC Corpora","de_DE")
    #corpusHCcorporaFiFI <- readCorpus("HC Corpora","fi_FI")
    #corpusHCcorporaRuRU <- readCorpus("HC Corpora","ru_RU")
    # Print time elapsed
    Sys.time() - currentTime
}
```

```{r corpusQuestions, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("charCountPerLineEnUSblogs")){
    currentTime <- Sys.time()
    # Let's count the number of characters per line in each document of the English
    # corpus
    charCountPerLineEnUSblogs <- sapply(
        corpusHCcorporaEnUS[["en_US.blogs.txt"]]$content,nchar)
    charCountPerLineEnUSnews <- sapply(
        corpusHCcorporaEnUS[["en_US.news.txt"]]$content,nchar)
    charCountPerLineEnUStwitter <- sapply(
        corpusHCcorporaEnUS[["en_US.twitter.txt"]]$content,nchar)
    
    # Let's compute the length of the longest line in each document
    longestLineLenghtEnUSblogs <- max(charCountPerLineEnUSblogs)
    longestLineLenghtEnUSnews <- max(charCountPerLineEnUSnews)
    longestLineLenghtEnUStwitter <- max(charCountPerLineEnUStwitter)
    
    # Let's count the number of lines in which "love" and "hate" appear in the 
    # twitter document of the English corpus
    loveTimesEnUStwitter <- length(which(grepl("love",
        corpusHCcorporaEnUS[["en_US.twitter.txt"]]$content)))
    hateTimesEnUStwitter <- length(which(grepl("hate",
        corpusHCcorporaEnUS[["en_US.twitter.txt"]]$content)))
    # Let's look for the tweets that contain the word "biostats"
    twittsWithBiostats <- grep("biostats",
        corpusHCcorporaEnUS[["en_US.twitter.txt"]]$content, value = TRUE)
    # Let's look for the exact tweet "A computer once beat me at chess, but it was 
    # no match for me at kickboxing"
    twittsWithSpecificText <- grep(
        "^A computer once beat me at chess, but it was no match for me at kickboxing$",
        corpusHCcorporaEnUS[["en_US.twitter.txt"]]$content, value = FALSE)
    # Print time elapsed
    Sys.time() - currentTime
}

# Let's compute the number of characters in the blogs document
charCountTemp <- sum(charCountPerLineEnUSblogs)
# Let's compute the number of lines in the twitter document
lineCountTemp <- length(corpusHCcorporaEnUS[["en_US.twitter.txt"]]$content)
    
```
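
The counts computed in the chunk above can be tried on a small scale. The following sketch uses a made-up vector of lines (`tweets` is illustrative data, not the corpus) and relies on `nchar()` and `grepl()` being vectorised, so no `sapply()` is needed.

```r
# Toy stand-in for the lines of en_US.twitter.txt (illustrative data only)
tweets <- c("i love this", "i hate mondays", "love and hate",
            "lovely day", "nothing here")

# nchar() is vectorised: one length per line
lineLengths <- nchar(tweets)
longestLine <- max(lineLengths)

# sum() over a logical vector counts the matching lines
loveLines <- sum(grepl("love", tweets, fixed = TRUE))
hateLines <- sum(grepl("hate", tweets, fixed = TRUE))
loveHateRatio <- loveLines / hateLines

# Exact whole-line match without regular-expression anchors
exactMatches <- sum(tweets == "i love this")
```

As in the chunk above, "love" is matched as a substring, so a line containing only "lovely" is counted as well.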

1. The en_US.blogs.txt file is how many megabytes?  
    * Around `r charCountTemp/1024/1024` megabytes, estimated from the character count.  
2. The en_US.twitter.txt has how many lines of text?  
    * Around `r lineCountTemp/1000/1000` million lines.  
3. What is the length of the longest line seen in any of the three en_US data 
sets?  
    * EnUSblogs `r longestLineLenghtEnUSblogs`  
    * EnUSnews `r longestLineLenghtEnUSnews`  
    * EnUStwitter `r longestLineLenghtEnUStwitter`  
4. In the en_US twitter data set, if you divide the number of lines where the 
word "love" (all lowercase) occurs by the number of lines the word "hate" 
(all lowercase) occurs, about what do you get?  
    * `r loveTimesEnUStwitter/hateTimesEnUStwitter`  
5. The one tweet in the en_US twitter data set that matches the word "biostats" 
says what?  
    * `r twittsWithBiostats[[1]]`  
6. How many tweets have the exact characters "A computer once beat me at chess, 
but it was no match for me at kickboxing". (I.e. the line matches those 
characters exactly.)  
    * `r length(twittsWithSpecificText)`  
7. What does the data look like?  
    * The corpus consists of completely unstructured natural-language text 
    documents in 4 languages. No pre-processing has been applied.  
8. Where does the data come from?  
    * The data comes from 3 kinds of sources: news articles, blogs and tweets.  
9. Can you think of any other data sources that might help you in this project?  
    * Dictionaries, thesauri and ontologies.  
10. What are the common steps in natural language processing?  
    * Tokenize: corpus segmentation and transformation.  
        + Word separation  
        + Multi-word recognition  
        + Character n-gram  
        + Lexicon matching  
        + Punctuation handling  
        + Stop-word removal  
        + Hyphenation handling  
    * Normalize: many different strings convey identical meanings.  
        + Case folding: problematic because case and accents sometimes 
        change meaning.  
        + Stemming: problematic because agglutinative languages combine many 
        concepts in a single word.  
        + Lemmatization.  
        + Profanity handling.  
    * Annotation: identical strings may have different meanings.  
        + Part-of-speech tagging.  
        + Word-sense tagging.  
        + Parsing: marking words according to their grammatical role.  
11. What are some common issues in the analysis of text data?  
    * Polysemy: it is the capacity of a sign to have multiple meanings.  
    * Paralanguage recognition: component of meta-communication that may 
    modify or nuance meaning, or convey emotion.  
    * Curse of dimensionality and data sparsity: the curse of dimensionality 
    covers the problems that arise when the data is high-dimensional. Data 
    sparsity is the specific problem that occurs when many combinations of 
    dimensions are not populated at all (either because they don’t make sense 
    or because the data set is incomplete).  
    * Noisy morphological segmentation: morphological analysis is the task 
    of segmenting a word into morphemes, the smallest meaning-bearing 
    elements of natural languages. Orthographic errors and similar noise 
    often spoil this segmentation.  
    * Order-invariance of factor composition: happens when a model is not 
    able to differentiate "hangover" from "overhang", two words composed of 
    the same two morphemes, "over" and "hang".  
    * Inherent ambiguity of factor composition: happens when a model doesn’t 
    know how to compose morphemes in cases like "[un[[lock]able]]" vs 
    "[[un[lock]]able]".  
12. What is the relationship between NLP and the concepts you have learned in 
the Specialization?  
    * Nowadays, the problems tackled with NLP techniques are treated as 
    machine learning problems, and the best techniques in this field are 
    the same as, or extensions of, the tools taught in the 
    specialization.  
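
A few of the tokenization and normalization steps listed in question 10 can be sketched in base R. This is only an illustration under simple assumptions: the input sentence and the tiny stop-word list are made up, and punctuation is handled by simply deleting every character that is not a lower-case letter, a digit or a space.

```r
# Illustrative input and a tiny, made-up stop-word list
text <- "The cat sat on the mat; the cat's hat was RED!"
stopWords <- c("the", "on", "was")

# Normalize: case folding
folded <- tolower(text)
# Punctuation handling: drop everything that is not a letter, digit or space
cleaned <- gsub("[^a-z0-9 ]", "", folded)
# Tokenize: split on runs of spaces
tokens <- strsplit(cleaned, " +")[[1]]
# Stop-word removal
contentTokens <- tokens[!tokens %in% stopWords]
```

Deleting the apostrophe turns "cat's" into "cats", a small example of how a normalization step can change meaning.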

# Task 1 - Tokenization & Profanity filtering

```{r matrixTermDoc, echo=FALSE, cache=TRUE, message=FALSE}
#currentTime <- Sys.time()
# Let's create a term-document matrix with the frequency of each word
#termDocMatrixEnUS <- TermDocumentMatrix(corpusHCcorporaEnUS)
# This method is very slow.
# Print time elapsed
#Sys.time() - currentTime
```

```{r simpleTokenization, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("corpusHCcorporaEnUS.OnlyLetters")){
    currentTime <- Sys.time()
    # Removing the numbers from the corpus
    corpusHCcorporaEnUS.OnlyLetters <- tm_map(corpusHCcorporaEnUS, removeNumbers, 
              lazy = FALSE)
    # Removing the punctuation from the corpus
    corpusHCcorporaEnUS.OnlyLetters <- tm_map(corpusHCcorporaEnUS.OnlyLetters, 
                          removePunctuation, lazy = FALSE)
    # Removing the extra white spaces from the corpus
    corpusHCcorporaEnUS.OnlyLetters <- tm_map(corpusHCcorporaEnUS.OnlyLetters, 
                          stripWhitespace, lazy = FALSE)
    # Removing the white spaces at the beginning and the end
    corpusHCcorporaEnUS.OnlyLetters <- tm_map(corpusHCcorporaEnUS.OnlyLetters, 
                          content_transformer(trim))
    # Dictionary with the vocabulary and the index of each word
    # Use a hashed environment that contains one object per word
    # The object name is the word itself and the value is the index
    # of the word. The index is used for the tokenization
    vocabularyHCcorporaEnUSbyWord <- new.env(hash = TRUE)
    # Dictionary vector where the index points to the corresponding word
    vocabularyHCcorporaEnUSbyIndex <- NULL
    # Dictionary vector where the index points to the frequency of the 
    # corresponding word in the blogs document of the english corpus
    wordFreqHCcorporaEnUSblogs <- NULL
    # Dictionary vector where the index points to the tokenized version of the 
    # corresponding line in the blogs document of the english corpus. The tokenized
    # version of a line is a vector of integers with the index of each word.
    tokenizedHCcorporaEnUSblogs <- NULL
    # Dictionary vector where the index points to the frequency of the 
    # corresponding word in the news document of the english corpus
    wordFreqHCcorporaEnUSnews <- NULL
    # Dictionary vector where the index points to the tokenized version of the 
    # corresponding line in the news document of the english corpus. The tokenized
    # version of a line is a vector of integers with the index of each word.
    tokenizedHCcorporaEnUSnews <- NULL
    # Dictionary vector where the index points to the frequency of the 
    # corresponding word in the twitter document of the english corpus
    wordFreqHCcorporaEnUStwitter <- NULL
    # Dictionary vector where the index points to the tokenized version of the 
    # corresponding line in the twitter document of the english corpus. The tokenized
    # version of a line is a vector of integers with the index of each word.
    tokenizedHCcorporaEnUStwitter <- NULL
    # temporary variable used to store word-frequency dictionaries
    temp2 <- NULL
    # Function used to tokenize a line of a document
    procDocLine <- function(docLine){
        # result variable to store the tokenized version of the line
        tokenizeLine <- numeric(length = length(docLine))
        # index variable of the word to tokenize
        i <- 1
        for(w in docLine){
            # get the index of the word in our vocabulary dictionary
            wIndex <- vocabularyHCcorporaEnUSbyWord[[w]]  
            # check if the word exist in our vocabulary
            if(is.null(wIndex)){
                # the index for a new word is the current length of the 
                # vocabulary plus one
                wIndex <- length(vocabularyHCcorporaEnUSbyIndex) + 1
                # add the word to the vocabulary dictionary by word
                vocabularyHCcorporaEnUSbyWord[[w]] <<- wIndex
                # add the word to the vocabulary dictionary by index
                vocabularyHCcorporaEnUSbyIndex[wIndex] <<- w
                # initialize the frequency for the word to 0
                temp2[wIndex] <<- 0
            }
            # increase the frequency of the word
            temp2[wIndex] <<- temp2[wIndex] + 1
            # store the index of the word in the tokenized line
            tokenizeLine[i] <- wIndex
            # increase the index of the word to process
            i <- i + 1
        }
        # return the tokenized line
        tokenizeLine
    }
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # split each line by white spaces in the blogs document
    temp <- strsplit(
        corpusHCcorporaEnUS.OnlyLetters[["en_US.blogs.txt"]]$content," ",
        fixed = TRUE)
    # clear the temp2 variable
    temp2 <- NULL
    # process each split line
    tokenizedHCcorporaEnUSblogs <- lapply(temp, procDocLine)
    # save the temporary frequency dictionary
    wordFreqHCcorporaEnUSblogs <- temp2
    # save an image of the environment
    save.image("Corpus.RData")
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # split each line by white spaces in the news document
    temp <- strsplit(
        corpusHCcorporaEnUS.OnlyLetters[["en_US.news.txt"]]$content," ",
        fixed = TRUE)
    # reset temp2 to one zero per known vocabulary word; with temp2 <- NULL the
    # count of a word first seen in an earlier document could not be increased
    temp2 <- numeric(length(vocabularyHCcorporaEnUSbyIndex))
    # process each split line
    tokenizedHCcorporaEnUSnews <- lapply(temp, procDocLine)
    # save the temporary frequency dictionary
    wordFreqHCcorporaEnUSnews <- temp2
    # save an image of the environment
    save.image("Corpus.RData")
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # split each line by white spaces in the twitter document
    temp <- strsplit(
        corpusHCcorporaEnUS.OnlyLetters[["en_US.twitter.txt"]]$content," ",
        fixed = TRUE)
    # reset temp2 to one zero per known vocabulary word; with temp2 <- NULL the
    # count of a word first seen in an earlier document could not be increased
    temp2 <- numeric(length(vocabularyHCcorporaEnUSbyIndex))
    # process each split line
    tokenizedHCcorporaEnUStwitter <- lapply(temp, procDocLine)
    # save the temporary frequency dictionary
    wordFreqHCcorporaEnUStwitter <- temp2
    # save an image of the environment
    save.image("Corpus.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```
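
The word-to-index mapping built by `procDocLine` can also be obtained with vectorised base R functions. The sketch below is an alternative formulation rather than the notebook's method; `splitLines` is made-up data standing in for the output of `strsplit()`.

```r
# Illustrative lines, already cleaned and split into words
splitLines <- list(c("the", "cat", "sat"), c("the", "dog"),
                   character(0), c("cat", "cat"))

# Vocabulary by index: unique words in order of first appearance
allWords <- unlist(splitLines)
vocabularyByIndex <- unique(allWords)

# Tokenize: replace each word by its position in the vocabulary
tokenizedLines <- lapply(splitLines, match, table = vocabularyByIndex)

# Word frequencies aligned with the vocabulary index
wordFreq <- tabulate(match(allWords, vocabularyByIndex),
                     nbins = length(vocabularyByIndex))
```

When several documents share one vocabulary, passing the full vocabulary length as `nbins` keeps every frequency vector the same length, with zeros for the words a document does not use.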

```{r profanitySourcing, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("badWords")){
    currentTime <- Sys.time()
    # Load a bad word list
    temp <- url("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
    badWords <- readLines(temp)
    # Normalize the list: lower-case, trim and drop empty entries
    badWords <- tolower(badWords)
    badWords <- trim(badWords)
    badWords <- badWords[which(badWords!="")]
    close(temp)
    # Load a swear word list
    temp <- url("http://www.bannedwordlist.com/lists/swearWords.txt")
    swearWords <- readLines(temp)
    # Normalize the list: lower-case, trim and drop empty entries
    swearWords <- tolower(swearWords)
    swearWords <- trim(swearWords)
    swearWords <- swearWords[which(swearWords!="")]
    close(temp)
    # Merge bad and swear words
    temp <- swearWords %in% badWords
    bad_swearWords <- c(badWords, swearWords[which(!temp)])
    # Print time elapsed
    Sys.time() - currentTime
}
```

## Vocabulary & profanities  
Two word lists from the internet are used: one of bad words and another of swear words.  
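
Normalizing and merging the two lists can be sketched as follows. The word vectors here are harmless made-up stand-ins for the downloaded lists, and `normalizeWords` is a hypothetical helper that uses base R's `trimws()`.

```r
# Illustrative raw lists with mixed case, padding and blank entries
rawBad <- c(" Darn", "heck ", "", "drat")
rawSwear <- c("HECK", "blast", " ")

# Hypothetical helper: lower-case, trim and drop empty entries
normalizeWords <- function(x) {
    x <- trimws(tolower(x))
    x[x != ""]
}

# union() keeps every word once, in order of first appearance
mergedWords <- union(normalizeWords(rawBad), normalizeWords(rawSwear))

# Flag the vocabulary entries that appear in the merged list
vocabulary <- c("the", "heck", "cat", "blast")
isProfanity <- vocabulary %in% mergedWords
```

The logical vector `isProfanity` plays the same role as `bad_swearWordsInVocabulary` in the chunk that follows.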

```{r profanityTally, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("profanyDFEnUStwitter")){
    currentTime <- Sys.time()
    # Logical vector with the index of the vocabulary words present in the
    # swear and bad word list
    bad_swearWordsInVocabulary <- vocabularyHCcorporaEnUSbyIndex %in% bad_swearWords
    # Frequency of the bad-swear words
    profanyFreqEnUSblogs <- 
        wordFreqHCcorporaEnUSblogs[which(bad_swearWordsInVocabulary)]
    profanyFreqEnUSnews <- 
        wordFreqHCcorporaEnUSnews[which(bad_swearWordsInVocabulary)]
    profanyFreqEnUStwitter <- 
        wordFreqHCcorporaEnUStwitter[which(bad_swearWordsInVocabulary)]
    # bad-swear words
    profanyEnUS <- 
        vocabularyHCcorporaEnUSbyIndex[which(bad_swearWordsInVocabulary)]
    # Data frame of Frequency of the bad-swear words
    profanyDFEnUSblogs <- data.frame(profanyEnUS,profanyFreqEnUSblogs)
    profanyDFEnUSblogs <- profanyDFEnUSblogs[order(profanyFreqEnUSblogs,profanyEnUS, 
                     decreasing = TRUE),]
    profanyDFEnUSnews <- data.frame(profanyEnUS,profanyFreqEnUSnews)
    profanyDFEnUSnews <- profanyDFEnUSnews[order(profanyFreqEnUSnews,profanyEnUS, 
                     decreasing = TRUE),]
    profanyDFEnUStwitter <- data.frame(profanyEnUS,profanyFreqEnUStwitter)
    profanyDFEnUStwitter <- profanyDFEnUStwitter[order(profanyFreqEnUStwitter,profanyEnUS, 
                     decreasing = TRUE),]
    ammountOfBad_SwearWordsInVocabularyEnUS <- length(profanyEnUS)
    # Print time elapsed
    Sys.time() - currentTime
}
# Print the tables at the top level of the chunk: inside the if block only the 
# value of the last expression is rendered, and nothing is rendered when the 
# block is skipped
kable(profanyDFEnUSblogs[1:30,], 
      caption = "The 30 most frequent bad-swear words in EN-US blogs")
kable(profanyDFEnUSnews[1:30,], 
      caption = "The 30 most frequent bad-swear words in EN-US news")
kable(profanyDFEnUStwitter[1:30,], 
      caption = "The 30 most frequent bad-swear words in EN-US twitter")
```

The output above lists the 30 most frequent bad and swear words of the vocabulary 
and their frequencies. The total number of bad and swear words is 
`r ammountOfBad_SwearWordsInVocabularyEnUS`.
As can be seen, the majority of these words are not bad words by definition; it 
depends on the context. So we prefer not to remove the bad and swear words, and 
to look for approaches that consider just the usage and not the context.

# Task 2 - Exploratory analysis & Understand frequencies of words and word pairs

## Distribution of the frequencies of the vocabulary in En US blogs  

```{r vocabullaryStatsEnUSblogs, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("wordTotalHCcorporaEnUSblogs")){
    # Total of words in the document
    wordTotalHCcorporaEnUSblogs <- sum(
        wordFreqHCcorporaEnUSblogs[which(wordFreqHCcorporaEnUSblogs != 0)])
    # Number of words with a frequency less than or equal to 5
    wordFreq0_5lHCcorporaEnUSblogs <- 
        length(which(wordFreqHCcorporaEnUSblogs %in% 1:5))
    # Number of words with a frequency less than or equal to 9
    wordFreq0_9lHCcorporaEnUSblogs <- 
        length(which(wordFreqHCcorporaEnUSblogs %in% 1:9))
    # Fraction of the vocabulary made up of words that have a frequency 
    # less than or equal to 5
    wordFreq0_5lHCcorporaEnUSblogsPercentege <- 
        wordFreq0_5lHCcorporaEnUSblogs/
        length(which(wordFreqHCcorporaEnUSblogs != 0))
    # Fraction of the vocabulary made up of words that have a frequency 
    # less than or equal to 9
    wordFreq0_9lHCcorporaEnUSblogsPercentege <- 
        wordFreq0_9lHCcorporaEnUSblogs/
        length(which(wordFreqHCcorporaEnUSblogs != 0))
    
    # Word indexes ordered by decreasing frequency
    wordOrderHCcorporaEnUSblogs <- order(wordFreqHCcorporaEnUSblogs, decreasing = TRUE)
    # Share of the words in the document covered by the 10000 most frequent words
    percentege10kWordMoreFreqHCcorporaEnUSblogs <- sum(
        wordFreqHCcorporaEnUSblogs[wordOrderHCcorporaEnUSblogs[1:10000]]) /
        wordTotalHCcorporaEnUSblogs
    # Share of the words in the document covered by the 2700 most frequent words
    percentege2700WordMoreFreqHCcorporaEnUSblogs <- sum(
        wordFreqHCcorporaEnUSblogs[wordOrderHCcorporaEnUSblogs[1:2700]]) /
        wordTotalHCcorporaEnUSblogs
    # Share of the words in the document covered by the 140 most frequent words
    percentege140WordMoreFreqHCcorporaEnUSblogs <- sum(
        wordFreqHCcorporaEnUSblogs[wordOrderHCcorporaEnUSblogs[1:140]]) /
        wordTotalHCcorporaEnUSblogs
    # Share of the words in the document covered by the 15 most frequent words
    percentege15WordMoreFreqHCcorporaEnUSblogs <- sum(
        wordFreqHCcorporaEnUSblogs[wordOrderHCcorporaEnUSblogs[1:15]]) /
        wordTotalHCcorporaEnUSblogs
}
hist(log10(wordFreqHCcorporaEnUSblogs[wordOrderHCcorporaEnUSblogs[1:10000]]), 
     main = "Log10 Distribution of 10000 most frequent words of EnUSblogs", 
     xlab = "log10 of word frequency")
```

The total number of words in the vocabulary is `r length(wordFreqHCcorporaEnUSblogs)`

Facts for the EnUS blogs document:  

* The percentage of words used is 
`r length(which(wordFreqHCcorporaEnUSblogs != 0))/length(wordFreqHCcorporaEnUSblogs)*100`%  
* The percentage of the vocabulary made up of words that have a 
frequency less than or equal to 5 is 
`r wordFreq0_5lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of the vocabulary made up of words that have a 
frequency less than or equal to 9 is 
`r wordFreq0_9lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of words in the document covered by the 10K (`r (10000/length(which(wordFreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent words is `r percentege10kWordMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of words in the document covered by the 2.7K (`r (2700/length(which(wordFreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent words is `r percentege2700WordMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of words in the document covered by the 140 (`r (140/length(which(wordFreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent words is `r percentege140WordMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of words in the document covered by the 15 (`r (15/length(which(wordFreqHCcorporaEnUSblogs != 0)))*100`%) most 
frequent words is `r percentege15WordMoreFreqHCcorporaEnUSblogs*100`%  
* The 15 most frequent words are: 
`r vocabularyHCcorporaEnUSbyIndex[wordOrderHCcorporaEnUSblogs[1:15]]`  
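
The coverage figures above all follow from the cumulative sum of the sorted frequencies. A minimal sketch with made-up frequencies (the vector `wordFreq` here is illustrative and is not the corpus data) shows the two kinds of measure: the share of the vocabulary made up of rare words, and the share of all word occurrences covered by the most frequent words.

```r
# Illustrative word frequencies (one entry per vocabulary word)
wordFreq <- c(50, 1, 20, 1, 10, 5, 1, 2, 10)

# Share of the vocabulary made up of rare words (frequency between 1 and 5)
rareShare <- sum(wordFreq >= 1 & wordFreq <= 5) / sum(wordFreq != 0)

# Cumulative share of all word occurrences covered by the k most frequent words
sortedFreq <- sort(wordFreq, decreasing = TRUE)
coverage <- cumsum(sortedFreq) / sum(sortedFreq)

# Smallest number of words needed to cover at least 90% of the occurrences
wordsFor90 <- which(coverage >= 0.9)[1]
```

Reading `coverage[k]` gives the same quantity as the "covered by the k most frequent words" bullets, for any `k`, without fixing the cut-offs in advance.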

## Distribution of the frequencies of the vocabulary in En US news  

```{r vocabullaryStatsEnUSnews, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("wordTotalHCcorporaEnUSnews")){
    # Total of words in the document
    wordTotalHCcorporaEnUSnews <- sum(
        wordFreqHCcorporaEnUSnews[which(wordFreqHCcorporaEnUSnews != 0)])
    # Number of words with a frequency less than or equal to 2
    wordFreq0_2lHCcorporaEnUSnews <- 
        length(which(wordFreqHCcorporaEnUSnews %in% 1:2))
    # Number of words with a frequency less than or equal to 5
    wordFreq0_5lHCcorporaEnUSnews <- 
        length(which(wordFreqHCcorporaEnUSnews %in% 1:5))
    # Number of words with a frequency less than or equal to 9
    wordFreq0_9lHCcorporaEnUSnews <- 
        length(which(wordFreqHCcorporaEnUSnews %in% 1:9))
    # Fraction of the vocabulary made up of words that have a frequency 
    # less than or equal to 2
    wordFreq0_2lHCcorporaEnUSnewsPercentege <- 
        wordFreq0_2lHCcorporaEnUSnews/
        length(which(wordFreqHCcorporaEnUSnews != 0))
    # Fraction of the vocabulary made up of words that have a frequency 
    # less than or equal to 5
    wordFreq0_5lHCcorporaEnUSnewsPercentege <- 
        wordFreq0_5lHCcorporaEnUSnews/
        length(which(wordFreqHCcorporaEnUSnews != 0))
    # Fraction of the vocabulary made up of words that have a frequency 
    # less than or equal to 9
    wordFreq0_9lHCcorporaEnUSnewsPercentege <- 
        wordFreq0_9lHCcorporaEnUSnews/
        length(which(wordFreqHCcorporaEnUSnews != 0))
    
    # Word indexes ordered by decreasing frequency
    wordOrderHCcorporaEnUSnews <- order(wordFreqHCcorporaEnUSnews, decreasing = TRUE)
    # Share of the words in the document covered by the 165000 most frequent words
    percentege165kWordMoreFreqHCcorporaEnUSnews <- sum(
        wordFreqHCcorporaEnUSnews[wordOrderHCcorporaEnUSnews[1:165000]]) /
        wordTotalHCcorporaEnUSnews
    # Share of the words in the document covered by the 115000 most frequent words
    percentege115KWordMoreFreqHCcorporaEnUSnews <- sum(
        wordFreqHCcorporaEnUSnews[wordOrderHCcorporaEnUSnews[1:115000]]) /
        wordTotalHCcorporaEnUSnews
    # Share of the words in the document covered by the 23000 most frequent words
    percentege23KWordMoreFreqHCcorporaEnUSnews <- sum(
        wordFreqHCcorporaEnUSnews[wordOrderHCcorporaEnUSnews[1:23000]]) /
        wordTotalHCcorporaEnUSnews
    # Share of the words in the document covered by the 4000 most frequent words
    percentege4kWordMoreFreqHCcorporaEnUSnews <- sum(
        wordFreqHCcorporaEnUSnews[wordOrderHCcorporaEnUSnews[1:4000]]) /
        wordTotalHCcorporaEnUSnews
}
hist(log10(wordFreqHCcorporaEnUSnews[wordOrderHCcorporaEnUSnews[1:115000]]), 
     main = "Log10 Distribution of 115000 most frequent words of EnUSnews", 
     xlab = "log10 of word frequency")
```

Facts for the EnUS news document:  

* The percentage of words used is 
`r length(which(wordFreqHCcorporaEnUSnews != 0))/length(wordFreqHCcorporaEnUSnews)*100`%  
* The percentage of the vocabulary made up of words that have a 
frequency less than or equal to 2 is 
`r wordFreq0_2lHCcorporaEnUSnewsPercentege*100`%  
* The percentage of the vocabulary made up of words that have a 
frequency less than or equal to 5 is 
`r wordFreq0_5lHCcorporaEnUSnewsPercentege*100`%  
* The percentage of the vocabulary made up of words that have a 
frequency less than or equal to 9 is 
`r wordFreq0_9lHCcorporaEnUSnewsPercentege*100`%  
* The percentage of words in the document covered by the 165K (`r (165000/length(which(wordFreqHCcorporaEnUSnews != 0)))*100`%) most 
frequent words is `r percentege165kWordMoreFreqHCcorporaEnUSnews*100`%  
* The percentage of words in the document covered by the 115K (`r (115000/length(which(wordFreqHCcorporaEnUSnews != 0)))*100`%) most 
frequent words is `r percentege115KWordMoreFreqHCcorporaEnUSnews*100`%  
* The percentage of words in the document covered by the 23K (`r (23000/length(which(wordFreqHCcorporaEnUSnews != 0)))*100`%) most 
frequent words is `r percentege23KWordMoreFreqHCcorporaEnUSnews*100`%  
* The percentage of words in the document covered by the 4K (`r (4000/length(which(wordFreqHCcorporaEnUSnews != 0)))*100`%) most 
frequent words is `r percentege4kWordMoreFreqHCcorporaEnUSnews*100`%  
* The 15 most frequent words are: 
`r vocabularyHCcorporaEnUSbyIndex[wordOrderHCcorporaEnUSnews[1:15]]`  

## Distribution of the frequencies of the vocabulary in En US tweets  

```{r vocabullaryStatsEnUStwitter, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("wordTotalHCcorporaEnUStwitter")){
    # Total of words in the document
    wordTotalHCcorporaEnUStwitter <- sum(
        wordFreqHCcorporaEnUStwitter[which(wordFreqHCcorporaEnUStwitter != 0)])
    # Number of words with a frequency less than or equal to 2
    wordFreq0_2lHCcorporaEnUStwitter <- 
        length(which(wordFreqHCcorporaEnUStwitter %in% 1:2))
    # Number of words with a frequency less than or equal to 5
    wordFreq0_5lHCcorporaEnUStwitter <- 
        length(which(wordFreqHCcorporaEnUStwitter %in% 1:5))
    # Number of words with a frequency less than or equal to 9
    wordFreq0_9lHCcorporaEnUStwitter <- 
        length(which(wordFreqHCcorporaEnUStwitter %in% 1:9))
    # Fraction of the vocabulary made up of words that have a frequency 
    # less than or equal to 2
    wordFreq0_2lHCcorporaEnUStwitterPercentege <- 
        wordFreq0_2lHCcorporaEnUStwitter/
        length(which(wordFreqHCcorporaEnUStwitter != 0))
    # Fraction of the vocabulary made up of words that have a frequency 
    # less than or equal to 5
    wordFreq0_5lHCcorporaEnUStwitterPercentege <- 
        wordFreq0_5lHCcorporaEnUStwitter/
        length(which(wordFreqHCcorporaEnUStwitter != 0))
    # Fraction of the vocabulary made up of words that have a frequency 
    # less than or equal to 9
    wordFreq0_9lHCcorporaEnUStwitterPercentege <- 
        wordFreq0_9lHCcorporaEnUStwitter/
        length(which(wordFreqHCcorporaEnUStwitter != 0))
    
    # Word indexes ordered by decreasing frequency
    wordOrderHCcorporaEnUStwitter <- order(wordFreqHCcorporaEnUStwitter, decreasing = TRUE)
    # Share of the words in the document covered by the 350000 most frequent words
    percentege350kWordMoreFreqHCcorporaEnUStwitter <- sum(
        wordFreqHCcorporaEnUStwitter[wordOrderHCcorporaEnUStwitter[1:350000]]) /
        wordTotalHCcorporaEnUStwitter
    # Share of the words in the document covered by the 285000 most frequent words
    percentege285KWordMoreFreqHCcorporaEnUStwitter <- sum(
        wordFreqHCcorporaEnUStwitter[wordOrderHCcorporaEnUStwitter[1:285000]]) /
        wordTotalHCcorporaEnUStwitter
    # Share of the words in the document covered by the 80000 most frequent words
    percentege80KWordMoreFreqHCcorporaEnUStwitter <- sum(
        wordFreqHCcorporaEnUStwitter[wordOrderHCcorporaEnUStwitter[1:80000]]) /
        wordTotalHCcorporaEnUStwitter
    # Share of the words in the document covered by the 11000 most frequent words
    percentege11KWordMoreFreqHCcorporaEnUStwitter <- sum(
        wordFreqHCcorporaEnUStwitter[wordOrderHCcorporaEnUStwitter[1:11000]]) /
        wordTotalHCcorporaEnUStwitter
}
hist(log10(wordFreqHCcorporaEnUStwitter[wordOrderHCcorporaEnUStwitter[1:285000]]), 
     main = "Log10 distribution of the 285000 most frequent words of EnUStwitter", 
     xlab = "log10 of word frequency")
```
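
The two kinds of statistic computed in the chunk above (the share of rare words in the used vocabulary, and the share of word occurrences covered by the N most frequent words) can be sketched on a toy vector. This is a minimal illustration, not the notebook's own code: `wordFreq`, `rareShare` and `topCoverage` are hypothetical names, and the data are made up. Unlike the chunk above, the sketch clamps `n` to the vocabulary size so that asking for more words than exist does not produce `NA`.

```r
# Toy frequency vector standing in for wordFreqHCcorporaEnUStwitter
# (hypothetical data; position = word index, value = number of occurrences).
wordFreq <- c(50, 0, 3, 1, 20, 1, 7, 2, 0, 16)

# Keep only the words that actually occur in the document.
used <- wordFreq[wordFreq != 0]

# Share of the used vocabulary whose frequency is at most k.
rareShare <- function(freq, k) sum(freq <= k) / length(freq)

# Share of all word occurrences covered by the n most frequent words.
topCoverage <- function(freq, n) {
    sorted <- sort(freq, decreasing = TRUE)
    n <- min(n, length(sorted))   # assumption: clamp instead of returning NA
    sum(sorted[seq_len(n)]) / sum(freq)
}

rareShare(used, 2)
topCoverage(used, 3)
```

Sorting once and summing a prefix gives the same number as indexing the frequency vector through an `order()` result, which is what the chunk does.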

Facts for the EnUS tweets document:  

* The percentage of the vocabulary used in this document is 
`r length(which(wordFreqHCcorporaEnUStwitter != 0))/length(wordFreqHCcorporaEnUStwitter)*100`%  
* The percentage of the used vocabulary made up of words with a 
frequency less than or equal to 5 is 
`r wordFreq0_5lHCcorporaEnUStwitterPercentege*100`%  
* The percentage of the used vocabulary made up of words with a 
frequency less than or equal to 9 is 
`r wordFreq0_9lHCcorporaEnUStwitterPercentege*100`%  
* The percentage of word occurrences in the document covered by the 350K (`r (350000/length(which(wordFreqHCcorporaEnUStwitter != 0)))*100`%) most 
frequent words is `r percentege350kWordMoreFreqHCcorporaEnUStwitter*100`%  
* The percentage of word occurrences in the document covered by the 285K (`r (285000/length(which(wordFreqHCcorporaEnUStwitter != 0)))*100`%) most 
frequent words is `r percentege285KWordMoreFreqHCcorporaEnUStwitter*100`%  
* The percentage of word occurrences in the document covered by the 80K (`r (80000/length(which(wordFreqHCcorporaEnUStwitter != 0)))*100`%) most 
frequent words is `r percentege80KWordMoreFreqHCcorporaEnUStwitter*100`%  
* The percentage of word occurrences in the document covered by the 11K (`r (11000/length(which(wordFreqHCcorporaEnUStwitter != 0)))*100`%) most 
frequent words is `r percentege11KWordMoreFreqHCcorporaEnUStwitter*100`%  
* The 15 most frequent words are: 
`r vocabularyHCcorporaEnUSbyIndex[wordOrderHCcorporaEnUStwitter[1:15]]`  

## Analysis of frequencies of LAST WORDS

```{r lastGramFreqTwitter, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("lastGramFreqEnUsTwitter")){
    # Simplified vocabulary: an environment mapping word -> index, plus two
    # vectors mapping index -> word and index -> frequency
    vocabullarySimplifiedByWord <- new.env(hash = TRUE)
    vocabullarySimplifiedByIndex <- NULL
    vocabullarySimplifiedFreqByIndex <- NULL
    tokenizeLine <- function(line){
            line <- removeNumbers(line)
            line <- removePunctuation(line)
            line <- stripWhitespace(line)
            line <- str_trim(line)
            line <- str_to_lower(line)
            words <- strsplit(line," ",fixed = TRUE)
            words <- unlist(words)
            # result variable to store the tokenized version of the line
            tokenizedLine <- numeric(length = length(words))
            # position in the line of the word to tokenize
            i <- 1
            for(w in words){
                # get the index of the word in our vocabulary dictionary
                wIndex <- vocabullarySimplifiedByWord[[w]]  
                # check if the word exists in our vocabulary
                if(is.null(wIndex)){
                    # the index for a new word is the current length of the 
                    # vocabulary plus one
                    wIndex <- length(vocabullarySimplifiedByIndex) + 1
                    # add the word to the vocabulary dictionary by word
                    vocabullarySimplifiedByWord[[w]] <<- wIndex
                    # add the word to the vocabulary dictionary by index
                    vocabullarySimplifiedByIndex[wIndex] <<- w
                    # initialize the frequency for the word to 0
                    vocabullarySimplifiedFreqByIndex[wIndex] <<- 0
                }
                # increase the frequency of the word
                vocabullarySimplifiedFreqByIndex[wIndex] <<- vocabullarySimplifiedFreqByIndex[wIndex] + 1
                # store the index of the word in the tokenized line
                tokenizedLine[i] <- wIndex
                # move to the next word position
                i <- i + 1
            }
            # return the tokenized line
            tokenizedLine
    }
    # Environments used as dictionaries keyed by the first three and last three
    # words of each tweet: one keeps the counts, the other the observed last words
    lastGramFreqEnUsTwitter <- new.env(hash = TRUE)
    lastGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    addLastGramFreqEnUsTiwtter <- function(gramVec){
        # Counter dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3],"_",gramVec[4],"_",gramVec[5],"_",gramVec[6])
        pC <- lastGramFreqEnUsTwitter[[pI]]  
        if(is.null(pC))
            pC <- 0
        lastGramFreqEnUsTwitter[[pI]] <<- pC + 1
        # Last word list dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3],"_",gramVec[4],"_",gramVec[5])
        lC <- lastGramFreqEnUsTwitterList[[pI]]  
        if(is.null(lC))
            lC <- gramVec[6]
        else
            lC <- c(lC, gramVec[6])
        lastGramFreqEnUsTwitterList[[pI]] <<- lC
    }
    biGramFreqEnUsTwitter <- new.env(hash = TRUE)
    biGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    addBiGramFreqEnUsTiwtter <- function(gramVec){
        # Counter dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2])
        pC <- biGramFreqEnUsTwitter[[pI]]  
        if(is.null(pC))
            pC <- 0
        biGramFreqEnUsTwitter[[pI]] <<- pC + 1
        # Last word list dictionary
        pI <- paste0(gramVec[1])
        lC <- biGramFreqEnUsTwitterList[[pI]]  
        if(is.null(lC))
            lC <- gramVec[2]
        else
            lC <- c(lC, gramVec[2])
        biGramFreqEnUsTwitterList[[pI]] <<- lC
    }
    triGramFreqEnUsTwitter <- new.env(hash = TRUE)
    triGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    addTriGramFreqEnUsTiwtter <- function(gramVec){
        # Counter dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3])
        pC <- triGramFreqEnUsTwitter[[pI]]  
        if(is.null(pC))
            pC <- 0
        triGramFreqEnUsTwitter[[pI]] <<- pC + 1
        # Last word list dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2])
        lC <- triGramFreqEnUsTwitterList[[pI]]  
        if(is.null(lC))
            lC <- gramVec[3]
        else
            lC <- c(lC, gramVec[3])
        triGramFreqEnUsTwitterList[[pI]] <<- lC
    }
    fourGramFreqEnUsTwitter <- new.env(hash = TRUE)
    fourGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    addFourGramFreqEnUsTiwtter <- function(gramVec){
        # Counter dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3],"_",gramVec[4])
        pC <- fourGramFreqEnUsTwitter[[pI]]  
        if(is.null(pC))
            pC <- 0
        fourGramFreqEnUsTwitter[[pI]] <<- pC + 1
        # Last word list dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3])
        lC <- fourGramFreqEnUsTwitterList[[pI]]  
        if(is.null(lC))
            lC <- gramVec[4]
        else
            lC <- c(lC, gramVec[4])
        fourGramFreqEnUsTwitterList[[pI]] <<- lC
    }
    fiveGramFreqEnUsTwitter <- new.env(hash = TRUE)
    fiveGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    addFiveGramFreqEnUsTiwtter <- function(gramVec){
        # Counter dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3],"_",gramVec[4],"_",gramVec[5])
        pC <- fiveGramFreqEnUsTwitter[[pI]]  
        if(is.null(pC))
            pC <- 0
        fiveGramFreqEnUsTwitter[[pI]] <<- pC + 1
        # Last word list dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3],"_",gramVec[4])
        lC <- fiveGramFreqEnUsTwitterList[[pI]]  
        if(is.null(lC))
            lC <- gramVec[5]
        else
            lC <- c(lC, gramVec[5])
        fiveGramFreqEnUsTwitterList[[pI]] <<- lC
    }

    procEndOfTweet <- function(tLt){
        temp2 <- length(tLt)
        if(temp2 > 5){
            iN <- tLt[temp2]
            iN_1 <- tLt[temp2-1]
            iN_2 <- tLt[temp2-2]
            i3 <- tLt[3]
            i2 <- tLt[2]
            i1 <- tLt[1]
            
            ind <- c(i1,i2,i3,iN_2,iN_1,iN)
            addLastGramFreqEnUsTiwtter(ind)
            addBiGramFreqEnUsTiwtter(ind)
            addTriGramFreqEnUsTiwtter(ind)
            addFourGramFreqEnUsTiwtter(ind)
            addFiveGramFreqEnUsTiwtter(ind)
#            for(sW in 1:5){
#                wildcard <- rep("*",sW)
#                for(j in 1:(6-sW)){
#                    indTemp <- replace(ind, j:(j+sW-1), wildcard)
#                    addLastGramFreqEnUsTiwtter(indTemp)
#                }
#            }
        }
    }

    text <- corpusHCcorporaEnUS.OnlyLetters[["en_US.twitter.txt"]]$content[1]
    text <- "How are you Btw thanks for the RT You gonna be in DC anytime soon Love to see you Been way way too"
    text <- "When you meet someone special youll know Your heart will beat more rapidly and youll smile for no"
    vocabullarySimplifiedByWord <- new.env(hash = TRUE)
    vocabullarySimplifiedByIndex <- NULL
    vocabullarySimplifiedFreqByIndex <- NULL
    lastGramFreqEnUsTwitter <- new.env(hash = TRUE)
    lastGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    
    
    system.time(sapply(tokenizedHCcorporaEnUStwitter[1:10], procEndOfTweet))
    system.time(sapply(tokenizedHCcorporaEnUStwitter[11:110], procEndOfTweet))
    system.time(sapply(tokenizedHCcorporaEnUStwitter[111:1110], procEndOfTweet))
    system.time(sapply(tokenizedHCcorporaEnUStwitter[1111:10000], procEndOfTweet))
    system.time(sapply(tokenizedHCcorporaEnUStwitter[10001:20000], procEndOfTweet))
    system.time(sapply(tokenizedHCcorporaEnUStwitter[20001:50000], procEndOfTweet))
    iLe <- length(tokenizedHCcorporaEnUStwitter)
    iLi <- iLe - 20000
    system.time(sapply(tokenizedHCcorporaEnUStwitter[iLi:iLe], procEndOfTweet))
    iLe <- iLi-1
    iLi <- iLe - 20000
    system.time(sapply(tokenizedHCcorporaEnUStwitter[iLi:iLe], procEndOfTweet))
    save(vocabularyHCcorporaEnUSbyWord, vocabularyHCcorporaEnUSbyIndex, lastGramFreqEnUsTwitterList, lastGramFreqEnUsTwitter, file = "ReleaseTwitter.RData")
    
    vocabullarySimplifiedByWord <- new.env(hash = TRUE)
    vocabullarySimplifiedByIndex <- NULL
    vocabullarySimplifiedFreqByIndex <- NULL
    lastGramFreqEnUsTwitter <- new.env(hash = TRUE)
    lastGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    biGramFreqEnUsTwitter <- new.env(hash = TRUE)
    biGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    triGramFreqEnUsTwitter <- new.env(hash = TRUE)
    triGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    fourGramFreqEnUsTwitter <- new.env(hash = TRUE)
    fourGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    fiveGramFreqEnUsTwitter <- new.env(hash = TRUE)
    fiveGramFreqEnUsTwitterList <- new.env(hash = TRUE)
    iLe <- length(corpusHCcorporaEnUS.OnlyLetters[["en_US.twitter.txt"]]$content)
    iLi <- iLe - 20000
    temp <- c(corpusHCcorporaEnUS.OnlyLetters[["en_US.twitter.txt"]]$content[1:20000],
              corpusHCcorporaEnUS.OnlyLetters[["en_US.twitter.txt"]]$content[iLi:iLe])
    system.time(tokenizedSimplifiedEnUStwitter <- lapply(temp, tokenizeLine))
    system.time(sapply(tokenizedSimplifiedEnUStwitter, procEndOfTweet))
}
```
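
The counting pattern used by the `add*GramFreq*` functions above (build a string key from word indices, then increment a counter stored in an environment) can be sketched in a few lines. This is only an illustration under assumptions: `makeCounter`, `addGram` and `endKey` are hypothetical names that do not exist in the notebook, and the token vectors are made up. The sketch passes the environment as an argument instead of using `<<-`, which works because environments are modified in place.

```r
# Hypothetical helper: a fresh hashed environment used as a counter dictionary.
makeCounter <- function() new.env(hash = TRUE)

# Increment the count stored under the key built from the word indices `ids`.
addGram <- function(counter, ids) {
    key <- paste(ids, collapse = "_")
    old <- counter[[key]]
    counter[[key]] <- if (is.null(old)) 1 else old + 1
    invisible(counter)
}

# Describe a tokenized line by its first three and last three word indices,
# skipping lines with five words or fewer, as procEndOfTweet does.
endKey <- function(tok) {
    n <- length(tok)
    if (n <= 5) return(NULL)
    c(tok[1:3], tok[(n - 2):n])
}

counter <- makeCounter()
tokLines <- list(c(4, 8, 15, 16, 23, 42, 7),
                 c(4, 8, 15, 99, 23, 42, 7),
                 c(1, 2, 3))
for (tok in tokLines) {
    ids <- endKey(tok)
    if (!is.null(ids)) addGram(counter, ids)
}
counter[["4_8_15_23_42_7"]]
```

The first two toy lines differ only in a middle word, so they share one key. The third line is too short and is ignored.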

```{r lastGramFreqNews, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("lastGramFreqEnUsNews")){
    # Environments used as dictionaries keyed by the first three and last three
    # words of each news line: one keeps the counts, the other the observed last words
    lastGramFreqEnUsNews <- new.env(hash = TRUE)
    lastGramFreqEnUsNewsList <- new.env(hash = TRUE)
    addLastGramFreqEnUsNews <- function(gramVec){
        # Counter dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3],"_",gramVec[4],"_",gramVec[5],"_",gramVec[6])
        pC <- lastGramFreqEnUsNews[[pI]]  
        if(is.null(pC))
            pC <- 0
        lastGramFreqEnUsNews[[pI]] <<- pC + 1
        # Last word list dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3],"_",gramVec[4],"_",gramVec[5])
        lC <- lastGramFreqEnUsNewsList[[pI]]  
        if(is.null(lC))
            lC <- gramVec[6]
        else
            lC <- c(lC, gramVec[6])
        lastGramFreqEnUsNewsList[[pI]] <<- lC
    }
    biGramFreqEnUsNews <- new.env(hash = TRUE)
    biGramFreqEnUsNewsList <- new.env(hash = TRUE)
    addBiGramFreqEnUsNews <- function(gramVec){
        # Counter dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2])
        pC <- biGramFreqEnUsNews[[pI]]  
        if(is.null(pC))
            pC <- 0
        biGramFreqEnUsNews[[pI]] <<- pC + 1
        # Last word list dictionary
        pI <- paste0(gramVec[1])
        lC <- biGramFreqEnUsNewsList[[pI]]  
        if(is.null(lC))
            lC <- gramVec[2]
        else
            lC <- c(lC, gramVec[2])
        biGramFreqEnUsNewsList[[pI]] <<- lC
    }
    triGramFreqEnUsNews <- new.env(hash = TRUE)
    triGramFreqEnUsNewsList <- new.env(hash = TRUE)
    addTriGramFreqEnUsNews <- function(gramVec){
        # Counter dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3])
        pC <- triGramFreqEnUsNews[[pI]]  
        if(is.null(pC))
            pC <- 0
        triGramFreqEnUsNews[[pI]] <<- pC + 1
        # Last word list dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2])
        lC <- triGramFreqEnUsNewsList[[pI]]  
        if(is.null(lC))
            lC <- gramVec[3]
        else
            lC <- c(lC, gramVec[3])
        triGramFreqEnUsNewsList[[pI]] <<- lC
    }
    fourGramFreqEnUsNews <- new.env(hash = TRUE)
    fourGramFreqEnUsNewsList <- new.env(hash = TRUE)
    addFourGramFreqEnUsNews <- function(gramVec){
        # Counter dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3],"_",gramVec[4])
        pC <- fourGramFreqEnUsNews[[pI]]  
        if(is.null(pC))
            pC <- 0
        fourGramFreqEnUsNews[[pI]] <<- pC + 1
        # Last word list dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3])
        lC <- fourGramFreqEnUsNewsList[[pI]]  
        if(is.null(lC))
            lC <- gramVec[4]
        else
            lC <- c(lC, gramVec[4])
        fourGramFreqEnUsNewsList[[pI]] <<- lC
    }
    fiveGramFreqEnUsNews <- new.env(hash = TRUE)
    fiveGramFreqEnUsNewsList <- new.env(hash = TRUE)
    addFiveGramFreqEnUsNews <- function(gramVec){
        # Counter dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3],"_",gramVec[4],"_",gramVec[5])
        pC <- fiveGramFreqEnUsNews[[pI]]  
        if(is.null(pC))
            pC <- 0
        fiveGramFreqEnUsNews[[pI]] <<- pC + 1
        # Last word list dictionary
        pI <- paste0(gramVec[1],"_",gramVec[2],"_",gramVec[3],"_",gramVec[4])
        lC <- fiveGramFreqEnUsNewsList[[pI]]  
        if(is.null(lC))
            lC <- gramVec[5]
        else
            lC <- c(lC, gramVec[5])
        fiveGramFreqEnUsNewsList[[pI]] <<- lC
    }

    procEndOfNew <- function(tLt){
        temp2 <- length(tLt)
        if(temp2 > 5){
            iN <- tLt[temp2]
            iN_1 <- tLt[temp2-1]
            iN_2 <- tLt[temp2-2]
            i3 <- tLt[3]
            i2 <- tLt[2]
            i1 <- tLt[1]
            
            ind <- c(i1,i2,i3,iN_2,iN_1,iN)
            addLastGramFreqEnUsNews(ind)
            addBiGramFreqEnUsNews(ind)
            addTriGramFreqEnUsNews(ind)
            addFourGramFreqEnUsNews(ind)
            addFiveGramFreqEnUsNews(ind)
#            for(sW in 1:5){
#                wildcard <- rep("*",sW)
#                for(j in 1:(6-sW)){
#                    indTemp <- replace(ind, j:(j+sW-1), wildcard)
#                    addLastGramFreqEnUsNews(indTemp)
#                }
#            }
        }
    }

    lastGramFreqEnUsNews <- new.env(hash = TRUE)
    lastGramFreqEnUsNewsList <- new.env(hash = TRUE)
    system.time(sapply(tokenizedHCcorporaEnUSnews[1:10], procEndOfNew))
    system.time(sapply(tokenizedHCcorporaEnUSnews[11:110], procEndOfNew))
    system.time(sapply(tokenizedHCcorporaEnUSnews[111:1110], procEndOfNew))
    system.time(sapply(tokenizedHCcorporaEnUSnews[1111:10000], procEndOfNew))
    system.time(sapply(tokenizedHCcorporaEnUSnews[10001:20000], procEndOfNew))
    system.time(sapply(tokenizedHCcorporaEnUSnews[20001:50000], procEndOfNew))
    iLe <- length(tokenizedHCcorporaEnUSnews)
    iLi <- iLe - 20000
    system.time(sapply(tokenizedHCcorporaEnUSnews[iLi:iLe], procEndOfNew))
    iLe <- iLi-1
    iLi <- iLe - 20000
    system.time(sapply(tokenizedHCcorporaEnUSnews[iLi:iLe], procEndOfNew))
    save(lastGramFreqEnUsNewsList, lastGramFreqEnUsNews, file = "ReleaseNews.RData")

    
    lastGramFreqEnUsNews <- new.env(hash = TRUE)
    lastGramFreqEnUsNewsList <- new.env(hash = TRUE)
    biGramFreqEnUsNews <- new.env(hash = TRUE)
    biGramFreqEnUsNewsList <- new.env(hash = TRUE)
    triGramFreqEnUsNews <- new.env(hash = TRUE)
    triGramFreqEnUsNewsList <- new.env(hash = TRUE)
    fourGramFreqEnUsNews <- new.env(hash = TRUE)
    fourGramFreqEnUsNewsList <- new.env(hash = TRUE)
    fiveGramFreqEnUsNews <- new.env(hash = TRUE)
    fiveGramFreqEnUsNewsList <- new.env(hash = TRUE)
    iLe <- length(corpusHCcorporaEnUS.OnlyLetters[["en_US.news.txt"]]$content)
    iLi <- iLe - 20000
    temp <- c(corpusHCcorporaEnUS.OnlyLetters[["en_US.news.txt"]]$content[1:20000],
              corpusHCcorporaEnUS.OnlyLetters[["en_US.news.txt"]]$content[iLi:iLe])
    system.time(tokenizedSimplifiedEnUSNews <- lapply(temp, tokenizeLine))
    system.time(sapply(tokenizedSimplifiedEnUSNews, procEndOfNew))
    vocabullarySimplifiedFreqByIndexAndPos <- matrix(data = rep(1, length(vocabullarySimplifiedFreqByIndex)*5), 
                                                     nrow = length(vocabullarySimplifiedFreqByIndex), 
                                                     ncol = 5)
    for(l in tokenizedSimplifiedEnUStwitter){
        vocabullarySimplifiedFreqByIndexAndPos[l[1],1] <- vocabullarySimplifiedFreqByIndexAndPos[l[1],1] + 1
        vocabullarySimplifiedFreqByIndexAndPos[l[2],2] <- vocabullarySimplifiedFreqByIndexAndPos[l[2],2] + 1
        vocabullarySimplifiedFreqByIndexAndPos[l[3],3] <- vocabullarySimplifiedFreqByIndexAndPos[l[3],3] + 1
        vocabullarySimplifiedFreqByIndexAndPos[l[length(l)-2],4] <- vocabullarySimplifiedFreqByIndexAndPos[l[length(l)-2],4] + 1
        vocabullarySimplifiedFreqByIndexAndPos[l[length(l)-1],5] <- vocabullarySimplifiedFreqByIndexAndPos[l[length(l)-1],5] + 1
    }
    for(l in tokenizedSimplifiedEnUSNews){
        vocabullarySimplifiedFreqByIndexAndPos[l[1],1] <- vocabullarySimplifiedFreqByIndexAndPos[l[1],1] + 1
        vocabullarySimplifiedFreqByIndexAndPos[l[2],2] <- vocabullarySimplifiedFreqByIndexAndPos[l[2],2] + 1
        vocabullarySimplifiedFreqByIndexAndPos[l[3],3] <- vocabullarySimplifiedFreqByIndexAndPos[l[3],3] + 1
        vocabullarySimplifiedFreqByIndexAndPos[l[length(l)-2],4] <- vocabullarySimplifiedFreqByIndexAndPos[l[length(l)-2],4] + 1
        vocabullarySimplifiedFreqByIndexAndPos[l[length(l)-1],5] <- vocabullarySimplifiedFreqByIndexAndPos[l[length(l)-1],5] + 1
    }
    vocabullarySimplifiedFreqByIndexAndPosOrder <- matrix(
        data = c(order(vocabullarySimplifiedFreqByIndexAndPos[,1], decreasing = TRUE),
                order(vocabullarySimplifiedFreqByIndexAndPos[,2], decreasing = TRUE),
                order(vocabullarySimplifiedFreqByIndexAndPos[,3], decreasing = TRUE),
                order(vocabullarySimplifiedFreqByIndexAndPos[,4], decreasing = TRUE),
                order(vocabullarySimplifiedFreqByIndexAndPos[,5], decreasing = TRUE)), 
                                                     nrow = length(vocabullarySimplifiedFreqByIndex), 
                                                     ncol = 5)
    system.time(save(vocabullarySimplifiedByWord, vocabullarySimplifiedByIndex, 
                     vocabullarySimplifiedFreqByIndex, 
                     vocabullarySimplifiedFreqByIndexAndPos, vocabullarySimplifiedFreqByIndexAndPosOrder,
                     lastGramFreqEnUsTwitterList, lastGramFreqEnUsTwitter, 
                     biGramFreqEnUsTwitterList, biGramFreqEnUsTwitter, 
                     triGramFreqEnUsTwitterList, triGramFreqEnUsTwitter, 
                     fourGramFreqEnUsTwitterList, fourGramFreqEnUsTwitter, 
                     fiveGramFreqEnUsTwitterList, fiveGramFreqEnUsTwitter, 
                     lastGramFreqEnUsNewsList, lastGramFreqEnUsNews, 
                     biGramFreqEnUsNewsList, biGramFreqEnUsNews, 
                     triGramFreqEnUsNewsList, triGramFreqEnUsNews, 
                     fourGramFreqEnUsNewsList, fourGramFreqEnUsNews, 
                     fiveGramFreqEnUsNewsList, fiveGramFreqEnUsNews, 
                     file = "ReleaseAll.RData"))
}
```
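
The per-position frequency matrix built at the end of the chunk above (one row per word, one column for each of the five context positions) can be illustrated on toy data. This is a hedged sketch: `tokLines`, `posFreq` and `posOrder` are hypothetical names, and skipping lines shorter than three words is an assumption of the sketch, not something the chunk does.

```r
# Toy tokenized lines (hypothetical data); values are word indices 1..4.
tokLines <- list(c(1, 2, 3, 4, 2, 1), c(1, 3, 3, 2, 2, 4), c(2, 1))
vocabSize <- 4

# One row per word, one column per position: 1st, 2nd, 3rd,
# third-from-last and second-to-last. Every cell starts at 1, as in the chunk.
posFreq <- matrix(1, nrow = vocabSize, ncol = 5)

for (tok in tokLines) {
    n <- length(tok)
    if (n < 3) next                  # assumption: skip lines too short to index
    pos <- c(1, 2, 3, n - 2, n - 1)
    idx <- cbind(tok[pos], 1:5)      # (row, column) pairs to increment
    posFreq[idx] <- posFreq[idx] + 1
}

# Column-wise ranking: posOrder[r, p] is the r-th most frequent word at position p.
posOrder <- apply(posFreq, 2, order, decreasing = TRUE)
posOrder[1, ]
```

Indexing the matrix with a two-column matrix of (row, column) pairs updates all five cells in one assignment, and `apply(..., 2, order)` replaces the five explicit `order()` calls.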

## Frequencies of word pairs and triplets for EnUSblogs  

```{r nGramFreq, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("wordPairFreqHCcorporaEnUSblogs")){
    # Two environments intended to keep the 2-gram and 3-gram counts
    wordPairFreqHCcorporaEnUSblogs <- new.env(hash = TRUE)
    wordTripletFreqHCcorporaEnUSblogs <- new.env(hash = TRUE)
    nGramIndexEnUSblogs <- rep(-1,length(wordOrderHCcorporaEnUSblogs))
    for(i in 1:10000){
          nGramIndexEnUSblogs[wordOrderHCcorporaEnUSblogs[i]]<- i  
    }
    wordPairCountEnUSblogs <- 0
    wordTripletCountEnUSblogs <- 0
    
    # The dimensions have been set to preserve only the n-grams that include
    # the most frequent words; rarer words share one overflow index
    procTokLine <- function(tL){
        temp2 <- length(tL)
        for(i in 1:temp2) {
            if((i+1)<=temp2){
                i2 <- nGramIndexEnUSblogs[tL[i]]
                if(i2==-1 || i2>2700)
                    i2 <- 2701
                i1 <- nGramIndexEnUSblogs[tL[i+1]]
                if(i1==-1)
                    i1 <- 10001
                pI <- paste0(i1,"_",i2)
                pC <- wordPairFreqHCcorporaEnUSblogs[[pI]]  
                if(is.null(pC))
                    pC <- 0
                wordPairFreqHCcorporaEnUSblogs[[pI]] <<- pC + 1
                if((i+2)<=temp2){
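                    # first word of the triplet (tightest cap)
                    i3 <- nGramIndexEnUSblogs[tL[i]]
                    if(i3==-1 || i3>140)
                        i3 <- 141
                    # second word of the triplet
                    i2 <- nGramIndexEnUSblogs[tL[i+1]]
                    if(i2==-1 || i2>2700)
                        i2 <- 2701
                    # third word of the triplet
                    i1 <- nGramIndexEnUSblogs[tL[i+2]]
                    if(i1==-1 || i1>2700)
                        i1 <- 2701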
                    tI <- paste0(i1,"_",i2, "_", i3)
                    tC <- wordTripletFreqHCcorporaEnUSblogs[[tI]]  
                    if(is.null(tC))
                        tC <- 0
                    wordTripletFreqHCcorporaEnUSblogs[[tI]] <<- tC + 1
                    wordTripletCountEnUSblogs <<- 
                        wordTripletCountEnUSblogs + 1
                }
                wordPairCountEnUSblogs <<- wordPairCountEnUSblogs + 1
            }
            else
            {
                break
            }
        }
    }
}
```
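
The rank-capping idea in `procTokLine` (replace each word index by its frequency rank, and send every word outside the kept range to a single overflow bucket) can be sketched as follows. The names `rankOf`, `capRank` and `pairKey`, the toy frequencies and the small caps are all assumptions made for illustration. The sketch keeps the key order used in the chunk, where the later word is written first.

```r
# Hypothetical toy data: frequency of each word index in the vocabulary.
wordFreq <- c(5, 40, 1, 12, 30, 2)
wordOrder <- order(wordFreq, decreasing = TRUE)   # word indices, most frequent first

# rankOf[w] is the frequency rank of word w, or -1 if w is outside the top `keep`.
keep <- 4
rankOf <- rep(-1, length(wordFreq))
rankOf[wordOrder[seq_len(keep)]] <- seq_len(keep)

# Ranks that are unknown (-1) or above `cap` all map to the bucket cap + 1.
capRank <- function(rank, cap) ifelse(rank == -1 | rank > cap, cap + 1, rank)

# Key for the pair (previous word, next word), later word first.
pairKey <- function(prev, nxt, capPrev = 2, capNext = keep) {
    paste0(capRank(rankOf[nxt], capNext), "_", capRank(rankOf[prev], capPrev))
}

pairKey(2, 5)   # both words are frequent
pairKey(3, 4)   # word 3 is rare, so it falls into the overflow bucket
```

Capping bounds the number of distinct keys: with caps of 10000 and 2700, as in the chunk, there can be at most 10001 × 2701 different pair keys regardless of vocabulary size.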

```{r nGramFreq1, echo=FALSE, cache=TRUE, message=FALSE}
if(wordPairCountEnUSblogs != 0){
    rm(wordPairFreqHCcorporaEnUSblogs,wordTripletFreqHCcorporaEnUSblogs,
           wordPairCountEnUSblogs, wordTripletCountEnUSblogs)
    wordPairFreqHCcorporaEnUSblogs <- new.env(hash = TRUE)
    wordTripletFreqHCcorporaEnUSblogs <- new.env(hash = TRUE)
    wordPairCountEnUSblogs <- 0
    wordTripletCountEnUSblogs <- 0
}
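# Split the line indexes into 10 interleaved pieces; this is done outside the
# check below because the later chunks need the pieces even when
# Corpus1.RData already exists
tokenizedHCcorporaEnUSblogsPiece <- 
    split(1:length(tokenizedHCcorporaEnUSblogs), 1:10)
loadCorpus <- TRUE
if(!file.exists("Corpus1.RData")){
    loadCorpus <- FALSE
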
    currentTime <- Sys.time()
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[1]]], 
        procTokLine)
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # save the n-gram counters computed so far
    save(wordPairFreqHCcorporaEnUSblogs,wordTripletFreqHCcorporaEnUSblogs,
        wordPairCountEnUSblogs, wordTripletCountEnUSblogs, 
        file = "Corpus1.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```

```{r nGramFreq2, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus2.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus1.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[2]]], 
        procTokLine)
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # save an image of the environment
    save.image("Corpus2.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```

```{r nGramFreq3, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus3.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus2.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[3]]], 
        procTokLine)
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # save an image of the environment
    save.image("Corpus3.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```

```{r nGramFreq4, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus4.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus3.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[4]]], 
        procTokLine)
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # save an image of the environment
    save.image("Corpus4.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```

```{r nGramFreq5, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus5.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus4.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[5]]], 
        procTokLine)
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # save an image of the environment
    save.image("Corpus5.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```

```{r nGramFreq6, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus6.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus5.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[6]]], 
        procTokLine)
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # save an image of the environment
    save.image("Corpus6.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```

```{r nGramFreq7, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus7.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus6.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[7]]], 
        procTokLine)
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # save an image of the environment
    save.image("Corpus7.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```

```{r nGramFreq8, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus8.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus7.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[8]]], 
        procTokLine)
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # save an image of the environment
    save.image("Corpus8.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```

```{r nGramFreq9, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("Corpus9.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus8.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[9]]], 
        procTokLine)
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # save an image of the environment
    save.image("Corpus9.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```

```{r nGramFreq10, echo=FALSE, message=FALSE, cache=TRUE}
if(!file.exists("Corpus10.RData")){
    currentTime <- Sys.time()
    if(loadCorpus)
        load("Corpus9.RData")
    loadCorpus <- FALSE
    temp <- sapply(
        tokenizedHCcorporaEnUSblogs[tokenizedHCcorporaEnUSblogsPiece[[10]]], 
        procTokLine)
    # Print time elapsed
    Sys.time() - currentTime
    currentTime <- Sys.time()
    # save an image of the environment
    save.image("Corpus10.RData")
    # Print time elapsed
    Sys.time() - currentTime
}
```
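
The ten chunks above process the corpus in pieces and write a checkpoint after each one, so that an interrupted run can resume. A compact way to express the same idea is sketched below. It is not the notebook's implementation: it swaps `save.image()` for `saveRDS()` on a small state list, uses a single checkpoint file in `tempdir()`, and every name in it (`tokLines`, `pieces`, `processPieces`, `checkpoint`) is hypothetical.

```r
# Hypothetical stand-ins: a list of tokenized lines and a trivial counting step.
tokLines <- lapply(1:25, function(i) seq_len(i %% 4 + 1))
checkpoint <- file.path(tempdir(), "sketch-checkpoint.rds")
if (file.exists(checkpoint)) file.remove(checkpoint)

# Split the line indices into 10 interleaved pieces, like split(1:n, 1:10),
# but without a recycling warning when n is not a multiple of 10.
pieces <- split(seq_along(tokLines), rep_len(1:10, length(tokLines)))

processPieces <- function(upTo) {
    # Resume from the last checkpoint if one exists, else start from scratch.
    state <- if (file.exists(checkpoint)) readRDS(checkpoint)
             else list(done = 0, total = 0)
    while (state$done < upTo) {
        p <- state$done + 1
        state$total <- state$total + sum(lengths(tokLines[pieces[[p]]]))
        state$done <- p
        saveRDS(state, checkpoint)   # overwritten after every piece
    }
    state
}

partial <- processPieces(4)    # simulate a run interrupted after piece 4
final <- processPieces(10)     # resumes at piece 5
```

Saving only the accumulated state keeps each checkpoint small, whereas `save.image()` writes the whole workspace, including the tokenized corpus, every time.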

### Word pairs stats  

```{r biGramStatsEnUSblogs, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("gram2FreqHCcorporaEnUSblogs")){
    # Extract the word pairs frequencies from the environment used
    gram2FreqHCcorporaEnUSblogs <- ls(wordPairFreqHCcorporaEnUSblogs)
    names(gram2FreqHCcorporaEnUSblogs) <- gram2FreqHCcorporaEnUSblogs
    gram2FreqHCcorporaEnUSblogs <- sapply(gram2FreqHCcorporaEnUSblogs, 
                          function(v){
        wordPairFreqHCcorporaEnUSblogs[[v]]})

    # Total number of pair occurrences in the document
    pairTotalHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs[
            which(gram2FreqHCcorporaEnUSblogs != 0)])
    # Number of distinct pairs with a frequency less than or equal to 5
    pairFreq0_5lHCcorporaEnUSblogs <- 
        sum(gram2FreqHCcorporaEnUSblogs %in% 1:5)
    # Number of distinct pairs with a frequency less than or equal to 9
    pairFreq0_9lHCcorporaEnUSblogs <- 
        sum(gram2FreqHCcorporaEnUSblogs %in% 1:9)
    # Share of the distinct bigrams that have a frequency less than or 
    # equal to 5
    pairFreq0_5lHCcorporaEnUSblogsPercentege <- 
        pairFreq0_5lHCcorporaEnUSblogs/
        length(which(gram2FreqHCcorporaEnUSblogs != 0))
    # Share of the distinct bigrams that have a frequency less than or 
    # equal to 9
    pairFreq0_9lHCcorporaEnUSblogsPercentege <- 
        pairFreq0_9lHCcorporaEnUSblogs/
        length(which(gram2FreqHCcorporaEnUSblogs != 0))
    
    # Pair indexes ordered by decreasing frequency
    pairOrderHCcorporaEnUSblogs <- order(gram2FreqHCcorporaEnUSblogs, decreasing = TRUE)
    # Share of pair occurrences covered by the 10000 most frequent bigrams
    percentege10kPairMoreFreqHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs[pairOrderHCcorporaEnUSblogs[1:10000]]) /
        pairTotalHCcorporaEnUSblogs
    # Share of pair occurrences covered by the 2700 most frequent bigrams
    percentege2700PairMoreFreqHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs[pairOrderHCcorporaEnUSblogs[1:2700]]) /
        pairTotalHCcorporaEnUSblogs
    # Share of pair occurrences covered by the 140 most frequent bigrams
    percentege140PairMoreFreqHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs[pairOrderHCcorporaEnUSblogs[1:140]]) /
        pairTotalHCcorporaEnUSblogs
    # Share of pair occurrences covered by the 15 most frequent bigrams
    percentege15PairMoreFreqHCcorporaEnUSblogs <- sum(
        gram2FreqHCcorporaEnUSblogs[pairOrderHCcorporaEnUSblogs[1:15]]) /
        pairTotalHCcorporaEnUSblogs
}

hist(log10(gram2FreqHCcorporaEnUSblogs), 
     main = "Log10 Distribution of bi-grams of EnUSblogs", 
     xlab = "log10 of bigram frequency")
```

* The percentage of bigrams used is 
`r length(which(gram2FreqHCcorporaEnUSblogs != 0))/length(gram2FreqHCcorporaEnUSblogs)*100`%  
* The percentage of distinct bigrams that have a 
frequency less than or equal to 5 is 
`r pairFreq0_5lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of distinct bigrams that have a 
frequency less than or equal to 9 is 
`r pairFreq0_9lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of bigram occurrences in the document covered by the 10K (`r (10000/length(which(gram2FreqHCcorporaEnUSblogs != 0)))*100`% of distinct bigrams) most 
frequent bigrams is `r percentege10kPairMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of bigram occurrences in the document covered by the 2.7K (`r (2700/length(which(gram2FreqHCcorporaEnUSblogs != 0)))*100`% of distinct bigrams) most 
frequent bigrams is `r percentege2700PairMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of bigram occurrences in the document covered by the 140 (`r (140/length(which(gram2FreqHCcorporaEnUSblogs != 0)))*100`% of distinct bigrams) most 
frequent bigrams is `r percentege140PairMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of bigram occurrences in the document covered by the 15 (`r (15/length(which(gram2FreqHCcorporaEnUSblogs != 0)))*100`% of distinct bigrams) most 
frequent bigrams is `r percentege15PairMoreFreqHCcorporaEnUSblogs*100`%  
* The 15 most frequent bigrams are:  
```{r biGramMoreFreq, echo=FALSE, cache=TRUE, message=FALSE}
kable(sapply((names(gram2FreqHCcorporaEnUSblogs))[pairOrderHCcorporaEnUSblogs[1:15]], function(pair){
    temp <- sapply(strsplit(pair,"_", fixed = TRUE),as.numeric)
    temp2 <- character(length = 2)
    temp2[1] <- ifelse(temp[[1]]==10001,"UNK",
        vocabularyHCcorporaEnUSbyIndex[temp[[1]]])
    temp2[2] <- ifelse(temp[[2]]==2701,"UNK",
        vocabularyHCcorporaEnUSbyIndex[temp[[2]]])
    t(temp2)
}))
```
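The low-frequency shares and top-k coverage figures reported above can also be computed with two small vectorized helpers. The block below is only a sketch on made-up counts: `toyFreq`, `lowFreqShare` and `topKCoverage` are hypothetical names, and it assumes the frequencies are held in a named numeric vector.

```r
# Toy bigram counts keyed by "i_j" index pairs (hypothetical data).
toyFreq <- c("1_2" = 50, "2_3" = 20, "1_3" = 9, "3_1" = 5, "2_1" = 3,
             "3_2" = 1, "3_3" = 0)

# Keep only the n-grams that were actually observed.
seen <- toyFreq[toyFreq > 0]

# Share of distinct observed n-grams whose frequency is at most maxFreq.
lowFreqShare <- function(freq, maxFreq) {
    sum(freq <= maxFreq) / length(freq)
}

# Share of all n-gram occurrences covered by the k most frequent n-grams.
topKCoverage <- function(freq, k) {
    k <- min(k, length(freq))   # guard against k larger than the table
    sum(sort(freq, decreasing = TRUE)[seq_len(k)]) / sum(freq)
}

lowFreqShare(seen, 5)   # 3 of the 6 observed bigrams
lowFreqShare(seen, 9)   # 4 of the 6 observed bigrams
topKCoverage(seen, 2)   # (50 + 20) / 88
```

Capping `k` at the table length avoids the `NA` values that indexing past the end of the ordered vector would otherwise introduce.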

### Word triplets stats  

```{r triGramStatsEnUSblogs, echo=FALSE, cache=TRUE, message=FALSE}
if(!exists("gram3FreqHCcorporaEnUSblogs")){
    # Extract the word triplets frequencies from the environment used
    gram3FreqHCcorporaEnUSblogs <- ls(wordTripletFreqHCcorporaEnUSblogs)
    gram3FreqHCcorporaEnUSblogs <- sapply(gram3FreqHCcorporaEnUSblogs, 
                          function(v){
        wordTripletFreqHCcorporaEnUSblogs[[v]]})

    # Total of triplet in the document
    tripletTotalHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs[
            which(gram3FreqHCcorporaEnUSblogs != 0)])
    # Count the triplets with a frequency less than or equal to 5
    tripletFreq0_5lHCcorporaEnUSblogs <- 
        sum(gram3FreqHCcorporaEnUSblogs %in% 1:5)
    # Count the triplets with a frequency less than or equal to 9
    tripletFreq0_9lHCcorporaEnUSblogs <- 
        sum(gram3FreqHCcorporaEnUSblogs %in% 1:9)
    # Fraction of the distinct observed trigrams that have a frequency 
    # less than or equal to 5
    tripletFreq0_5lHCcorporaEnUSblogsPercentege <- 
        tripletFreq0_5lHCcorporaEnUSblogs/
        length(which(gram3FreqHCcorporaEnUSblogs != 0))
    # Fraction of the distinct observed trigrams that have a frequency 
    # less than or equal to 9
    tripletFreq0_9lHCcorporaEnUSblogsPercentege <- 
        tripletFreq0_9lHCcorporaEnUSblogs/
        length(which(gram3FreqHCcorporaEnUSblogs != 0))
    
    # Triplet indexes ordered by decreasing frequency
    tripletOrderHCcorporaEnUSblogs <- order(gram3FreqHCcorporaEnUSblogs, decreasing = TRUE)
    # Share of all trigram occurrences covered by the 10000 most frequent trigrams
    percentege10kTripletMoreFreqHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs[tripletOrderHCcorporaEnUSblogs[1:10000]]) /
        tripletTotalHCcorporaEnUSblogs
    # Share of all trigram occurrences covered by the 2700 most frequent trigrams
    percentege2700TripletMoreFreqHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs[tripletOrderHCcorporaEnUSblogs[1:2700]]) /
        tripletTotalHCcorporaEnUSblogs
    # Share of all trigram occurrences covered by the 140 most frequent trigrams
    percentege140TripletMoreFreqHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs[tripletOrderHCcorporaEnUSblogs[1:140]]) /
        tripletTotalHCcorporaEnUSblogs
    # Share of all trigram occurrences covered by the 15 most frequent trigrams
    percentege15TripletMoreFreqHCcorporaEnUSblogs <- sum(
        gram3FreqHCcorporaEnUSblogs[tripletOrderHCcorporaEnUSblogs[1:15]]) /
        tripletTotalHCcorporaEnUSblogs
}

hist(log10(gram3FreqHCcorporaEnUSblogs), 
     main = "Log10 Distribution of tri-grams of EnUSblogs", 
     xlab = "log10 of trigram frequency")
```

* The percentage of trigrams used is 
`r length(which(gram3FreqHCcorporaEnUSblogs != 0))/length(gram3FreqHCcorporaEnUSblogs)*100`%  
* The percentage of distinct trigrams that have a 
frequency less than or equal to 5 is 
`r tripletFreq0_5lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of distinct trigrams that have a 
frequency less than or equal to 9 is 
`r tripletFreq0_9lHCcorporaEnUSblogsPercentege*100`%  
* The percentage of trigram occurrences in the document covered by the 10K (`r (10000/length(which(gram3FreqHCcorporaEnUSblogs != 0)))*100`% of distinct trigrams) most 
frequent trigrams is `r percentege10kTripletMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of trigram occurrences in the document covered by the 2.7K (`r (2700/length(which(gram3FreqHCcorporaEnUSblogs != 0)))*100`% of distinct trigrams) most 
frequent trigrams is `r percentege2700TripletMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of trigram occurrences in the document covered by the 140 (`r (140/length(which(gram3FreqHCcorporaEnUSblogs != 0)))*100`% of distinct trigrams) most 
frequent trigrams is `r percentege140TripletMoreFreqHCcorporaEnUSblogs*100`%  
* The percentage of trigram occurrences in the document covered by the 15 (`r (15/length(which(gram3FreqHCcorporaEnUSblogs != 0)))*100`% of distinct trigrams) most 
frequent trigrams is `r percentege15TripletMoreFreqHCcorporaEnUSblogs*100`%  
* The 15 most frequent trigrams are: 
```{r triGramMoreFreq, echo=FALSE, cache=TRUE, message=FALSE}
kable(sapply((names(gram3FreqHCcorporaEnUSblogs))[tripletOrderHCcorporaEnUSblogs[1:15]], 
        function(triplet){
    temp <- sapply(strsplit(triplet,"_", fixed = TRUE),as.numeric)
    # One slot per word of the triplet
    temp2 <- character(length = 3)
    temp2[1] <- ifelse(temp[[1]]==2701,"UNK",
        vocabularyHCcorporaEnUSbyIndex[temp[[1]]])
    temp2[2] <- ifelse(temp[[2]]==2701,"UNK",
        vocabularyHCcorporaEnUSbyIndex[temp[[2]]])
    # The third word must be tested on its own index, not on the second one
    temp2[3] <- ifelse(temp[[3]]==141,"UNK",
        vocabularyHCcorporaEnUSbyIndex[temp[[3]]])
    t(temp2)
}))
```
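The lookup that turns index keys back into words can be sketched as one reusable helper. This is a minimal sketch under assumptions, not the notebook's own method: `vocab`, `unkIndex` and `decodeKey` are hypothetical names, and a single UNK index one past the end of the vocabulary is assumed, whereas the chunks above compare against fixed indices.

```r
# Hypothetical vocabulary; index length(vocab) + 1 is assumed to mean "UNK".
vocab <- c("the", "of", "and", "to")
unkIndex <- length(vocab) + 1

# Turn a key such as "4_5_3" into its words, mapping the UNK index to "UNK".
decodeKey <- function(key, vocab, unkIndex) {
    idx <- as.integer(strsplit(key, "_", fixed = TRUE)[[1]])
    ifelse(idx == unkIndex, "UNK", vocab[idx])
}

keys <- c("1_2_1", "4_5_3")
# One row per key, one column per word position.
decoded <- t(sapply(keys, decodeKey, vocab = vocab, unkIndex = unkIndex))
decoded
```

Because the helper splits the key first and then works on every position at once, the same function serves bigrams and trigrams.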

## Questions about frequency of words  

1. Some words are more frequent than others - what are the distributions of word 
frequencies?  
    * People tend to use as few words as possible. The log10 of the 
    frequency has an exponentially decaying distribution.  
2. What are the frequencies of 2-grams and 3-grams in the dataset? 
    * The behaviour is the same: people tend to use as few n-grams as
    possible. The log10 of the frequency has an exponentially decaying distribution.   
3. How many unique words do you need in a frequency sorted dictionary to cover 
50% of all word instances in the language? 90%?  
    * As shown above, it depends on the source (blogs, news, twitter, 
    etc.), but it is clear that around 20% of the vocabulary covers more 
    than 80% of the word instances.  
4. How do you evaluate how many of the words come from foreign languages?  
    * By checking against a lexicon of the supported words. There is no significant 
    difference between an unknown word and a foreign word.  
5. Can you think of a way to increase the coverage -- identifying words that may 
not be in the corpora or using a smaller number of words in the dictionary to 
cover the same number of phrases?  
    * We can predict the next word based on the previous words. This 
    is known as an n-gram model.  
    * Because of the curse of dimensionality of n-gram models, specifically the 
    sparsity produced by the high frequency of a few words combined with the low 
    frequency of many words, we can build an n-gram model that does not consider 
    all of the n-1 previous words and considers just n-1-k of them. This 
    is known as a skip-gram model.  
    * We can create a model that considers the morphological composition of 
    the words, and use the embeddings of this as inputs of our skipgram model.  
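
The answer to question 3 can be made concrete by accumulating the sorted frequencies until a target coverage is reached. The block below is a sketch on invented counts; `toyWordFreq` and `wordsNeeded` are hypothetical names, and the real figures depend on the word frequency vector of each source.

```r
# Hypothetical word counts with a roughly Zipf-like shape.
toyWordFreq <- c(the = 600, of = 300, and = 200, to = 150, a = 120,
                 cat = 100, dog = 86, sat = 75, mat = 67, hat = 60)

# Smallest number of top-ranked words whose counts reach the target coverage.
wordsNeeded <- function(freq, target) {
    cumCoverage <- cumsum(sort(freq, decreasing = TRUE)) / sum(freq)
    unname(which(cumCoverage >= target)[1])
}

wordsNeeded(toyWordFreq, 0.5)   # 2 of the 10 words cover half the instances
wordsNeeded(toyWordFreq, 0.9)   # 8 of the 10 words are needed for 90%
```

Even in this tiny example the second half of the coverage costs far more words than the first, which is the pattern the answers above describe.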

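One common reading of the skip-gram idea mentioned in the answer to question 5 is to pair each word with the words that follow it, allowing up to `k` words to be skipped in between. The sketch below follows that reading; `skipBigrams` is a hypothetical helper, not a function used elsewhere in this notebook.

```r
# All ordered word pairs (w_i, w_j) with at most k words skipped between them.
skipBigrams <- function(tokens, k) {
    n <- length(tokens)
    if (n < 2) return(character(0))
    pairs <- character(0)
    for (i in seq_len(n - 1)) {
        # Partner positions: from the next word up to k words further on.
        for (j in (i + 1):min(n, i + 1 + k)) {
            pairs <- c(pairs, paste(tokens[i], tokens[j], sep = "_"))
        }
    }
    pairs
}

tokens <- c("the", "cat", "sat", "down")
skipBigrams(tokens, 0)   # plain bigrams: the_cat, cat_sat, sat_down
skipBigrams(tokens, 1)   # also yields the_sat and cat_down
```
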
## Distribution of the frequencies of the characters over the vocabulary  

```{r characterFreqEnUS, echo=FALSE, cache=TRUE, message=FALSE}
library(plyr)
if(!file.exists("CorpusCharacterFreq1.RData"))
    system.time({
    print("Generating Corpus Character Freq.")
    alphabetFreqHCcorporaEnUSblogs <- lapply(vocabularyHCcorporaEnUSbyIndex, 
        function(w){
            chr <- strsplit(w, "")[[1]]
            chr <- table(chr)
            chr <- chr * wordFreqHCcorporaEnUSblogs[vocabularyHCcorporaEnUSbyWord[[w]]]
            chr <- as.data.frame(chr, stringsAsFactors = FALSE)
            colnames(chr) <- c("chr", "Freq")
            chr
        })
    alphabetFreqHCcorporaEnUSblogs <- rbind.fill(alphabetFreqHCcorporaEnUSblogs)
    alphabetFreqHCcorporaEnUSblogs$chr <- as.character(alphabetFreqHCcorporaEnUSblogs$chr)
    alphabetFreqHCcorporaEnUS <- table(alphabetFreqHCcorporaEnUSblogs$chr)
    alphabetOrderFreqHCcorporaEnUS <- order(alphabetFreqHCcorporaEnUS, 
                                            decreasing = TRUE)
    save(alphabetFreqHCcorporaEnUSblogs, alphabetFreqHCcorporaEnUS,
         alphabetOrderFreqHCcorporaEnUS,
        file = "CorpusCharacterFreq1.RData")
})else{
    print("Loading previous Corpus Character Freq.")
    load("CorpusCharacterFreq1.RData")
}
```

```{r characterFreqEnUSblogs, echo=FALSE, cache=TRUE, message=FALSE}
temp <- ddply(alphabetFreqHCcorporaEnUSblogs, 
    .(chr), function(x) sum(x$Freq))
#alphabetFreqHCcorporaEnUSblogs <- alphabetFreqHCcorporaEnUSblogs[
#    order(alphabetFreqHCcorporaEnUSblogs$V1, decreasing = TRUE),]
```
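The two character counts used in this section, one weighted by how often each word occurs and one counting the words that contain each character, can be illustrated on a toy vocabulary. This is a sketch only: `toyVocab`, `toyCount` and the derived names are hypothetical, and it uses base R where the chunks above use `plyr`.

```r
# Hypothetical vocabulary with per-word corpus counts.
toyVocab <- c("banana", "band", "ad")
toyCount <- c(3, 2, 5)

# Split every word into characters and repeat the word count per character.
chars <- strsplit(toyVocab, "")
charTable <- data.frame(
    chr = unlist(chars),
    Freq = rep(toyCount, lengths(chars)),
    stringsAsFactors = FALSE)

# Document-level frequency: word counts summed over each character occurrence.
corpusCharFreq <- sort(
    vapply(split(charTable$Freq, charTable$chr), sum, numeric(1)),
    decreasing = TRUE)

# Vocabulary-level frequency: number of distinct words containing the character.
vocabCharFreq <- table(unlist(lapply(chars, unique)))

corpusCharFreq   # a = 16, n = 8, d = 7, b = 5
vocabCharFreq    # a appears in 3 words; b, d and n in 2 each
```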

Some facts:  

- The total number of distinct characters is `r length(alphabetFreqHCcorporaEnUS)`.  
- The 64 most frequent characters in the vocabulary are: `r names(alphabetFreqHCcorporaEnUS)[alphabetOrderFreqHCcorporaEnUS[1:64]]`  
- The 64 most frequent characters in the EnUS blogs document are: `r head(temp$chr[order(temp$V1, decreasing = TRUE)], 64)`

```{r characterFreqEnUSGraph, echo=FALSE, cache=TRUE, message=FALSE}
enChars <- strsplit("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", "")[[1]]

hist(log10(alphabetFreqHCcorporaEnUS), 
     main = "Log10 Distribution of characters of EnUS vocabulary", 
     xlab = "log10 of character frequency")
hist(log10(alphabetFreqHCcorporaEnUS[enChars]), 
     main = "Log10 Distribution of the 62 alphanumeric characters of English in the vocabulary", 
     xlab = "log10 of character frequency")

#hist(log10(alphabetFreqHCcorporaEnUSblogs$V1), 
#     main = "Log10 Distribution of characters of EnUS blogs document", 
#     xlab = "log10 of character frequency")
#temp <- which(alphabetFreqHCcorporaEnUSblogs$chr %in% enChars)
#temp <- alphabetFreqHCcorporaEnUSblogs$V1[as.numeric(temp)]
#temp <- subset(alphabetFreqHCcorporaEnUSblogs, chr == 'i')
#temp <- log10(alphabetFreqHCcorporaEnUSblogs$V1[temp])
#head(temp)
#dim(alphabetFreqHCcorporaEnUSblogs)
#class(alphabetFreqHCcorporaEnUSblogs$chr)
head(alphabetFreqHCcorporaEnUSblogs)
length(which(is.na(alphabetFreqHCcorporaEnUSblogs$Freq)))
#length(temp)
#class(temp)
head(temp)
length(which(is.na(temp$V1)))
#max(temp)
#min(temp)
#hist(temp, 
#     main = "Log10 Distribution of 64 characters of English in the EnUS blogs document", 
#     xlab = "log10 of character frequency")
```

## Distribution of the frequencies of the syllables over the vocabulary  

```{r cmuDictEnUS, echo=FALSE, cache=TRUE, message=FALSE}
if(!file.exists("CorpusCMUdic.RData"))
    system.time({
    print("Generating CMU Dictionary.")
    cmuPronDicEnUS <- readLines(file("CMUPronuciationDict/cmudict-07b-20160113.txt"))
    cmuPronDicEnUS <- cmuPronDicEnUS[128:133906]
    cmuPronDicEnUS <- unlist(strsplit(cmuPronDicEnUS, "  ", fixed = TRUE))
    cmuPronDicEnUS <- matrix(cmuPronDicEnUS, ncol = (length(cmuPronDicEnUS)/2))
    cmuPronDicEnUS <- t(cmuPronDicEnUS)
    cmuPronDicEnUS <- as.data.frame(cmuPronDicEnUS, stringsAsFactors = FALSE)
    colnames(cmuPronDicEnUS) <- c("w","s")
    cmuPronDicEnUS$w <- as.character(cmuPronDicEnUS$w)
    cmuPronDicEnUS$s <- as.character(cmuPronDicEnUS$s)
    totalWordCMUProDic <- length(cmuPronDicEnUS$w)
    temp <- grepl("[/(]{1}[1-3]{1}[/)]{1}",cmuPronDicEnUS$w)
    totalHomophoneCMUProDic <- length(which(temp))
    cmuPronDicEnUS <- cmuPronDicEnUS[which(!temp), ]
    totalDifWordCMUProDic <- length(cmuPronDicEnUS$w)
    cmuPronDicEnUSFreq <- data.frame()
    procCMUPro <- function(s){
        temp <- unlist(strsplit(s," "))
        res <- length(temp)
        temp <- table(temp)
        cmuPronDicEnUSFreq <<- rbind.fill(cmuPronDicEnUSFreq, as.data.frame(temp))
        res
    }
    cmuPronDicEnUS$sC <- sapply(cmuPronDicEnUS$s, procCMUPro)
    cmuPronDicEnUSFreq <- ddply(cmuPronDicEnUSFreq, 
        .(temp), function(x) sum(x$Freq))
    cmuPronDicEnUSdic <- new.env(hash = TRUE)
    for(i in 1:length(cmuPronDicEnUS$w))
        cmuPronDicEnUSdic[[cmuPronDicEnUS$w[i]]] <- cmuPronDicEnUS$s[i]
    save(cmuPronDicEnUS, totalWordCMUProDic, totalHomophoneCMUProDic,
         totalDifWordCMUProDic, cmuPronDicEnUSFreq, cmuPronDicEnUSdic,
        file = "CorpusCMUdic.RData")
})else{
    print("Loading previous CMU dictionary.")
    load("CorpusCMUdic.RData")
}
```
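The dictionary layout handled above, a word followed by two spaces and its ARPAbet phonemes, also lets syllables be counted directly: vowel phonemes carry a stress digit (0, 1 or 2), so the number of such phonemes is the number of syllables. The block below is a sketch on a few sample lines; `toyLines` and `toyDict` are hypothetical names.

```r
# A few lines in the CMU dictionary layout: WORD, two spaces, phonemes.
# Alternate pronunciations carry a "(n)" suffix on the word.
toyLines <- c("HELLO  HH AH0 L OW1",
              "HELLO(1)  HH EH0 L OW1",
              "WORLD  W ER1 L D",
              "BANANA  B AH0 N AE1 N AH0")

parts <- strsplit(toyLines, "  ", fixed = TRUE)
toyDict <- data.frame(
    w = vapply(parts, `[`, character(1), 1),
    s = vapply(parts, `[`, character(1), 2),
    stringsAsFactors = FALSE)

# Flag alternate pronunciations by their "(digit)" suffix.
toyDict$alt <- grepl("\\([0-9]+\\)$", toyDict$w)

# Count the phonemes that end in a stress digit: one per syllable.
phonemes <- strsplit(toyDict$s, " ", fixed = TRUE)
toyDict$syllables <- vapply(phonemes,
    function(p) sum(grepl("[0-2]$", p)), integer(1))

toyDict
```

Counting stress-marked phonemes differs from counting all phonemes of an entry, which also includes the consonants.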

```{r syllableFreqEnUS, echo=FALSE, cache=TRUE, message=FALSE}
#library(koRpus)
#if(!file.exists("Corpus11.RData")){
#    print("Running hyphenation algoritm.")
    currentTime <- Sys.time()
    #syllableHCcorporaEnUS<-hyphen(vocabularyHCcorporaEnUSbyIndex, 
    #hyph.pattern = "en")@hyphen
    # save a image of the environment
#    save(syllableHCcorporaEnUS, 
#        file = "Corpus11.RData")
    Sys.time() - currentTime
#}else{
    print("Loading previous hyphenation algorithm result.")
#    load("Corpus11.RData")
#}
```

```{r syllableFreqEnUS2, echo=FALSE, cache=TRUE, message=FALSE}
sylProc <- function(w){
        syl <- strsplit(w, "-")[[1]]
        syl <- table(syl)
        w <- gsub("-", "", w, fixed = TRUE)
        syl <- syl * wordFreqHCcorporaEnUSblogs[vocabularyHCcorporaEnUSbyWord[[w]]]
        as.data.frame(syl)
    }
#syllableFreqHCcorporaEnUSblogs <- lapply(syllableHCcorporaEnUS$word, 
#    sylProc)
#syllableFreqHCcorporaEnUSblogs <- rbind.fill(syllableFreqHCcorporaEnUSblogs)
#syllableFreqHCcorporaEnUSblogs$syl <- 
#    as.character(syllableFreqHCcorporaEnUSblogs$syl)
#syllableFreqHCcorporaEnUS <- table(syllableFreqHCcorporaEnUSblogs$syl)
#syllableOrderFreqHCcorporaEnUS <- order(syllableFreqHCcorporaEnUS, 
#                                        decreasing = TRUE)
```

```{r syllableFreqEnUSblogs, echo=FALSE, cache=TRUE, message=FALSE}
```

Syllables using the [CMU Pronouncing Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=Arpabet).  

Some facts:  

- Total number of entries in the CMU Pronouncing Dictionary: `r totalWordCMUProDic`.  
- Number of entries that are alternate pronunciations of a word already 
listed: `r totalHomophoneCMUProDic`.  
- Number of different words in the CMU Pronouncing Dictionary: `r totalDifWordCMUProDic`.  

# Task 3 - Build basic n-gram model & handle unseen n-grams  

```{r task3Placeholder, echo=FALSE, cache=TRUE, message=FALSE}
```

1. How can you efficiently store an n-gram model (think Markov Chains)?  
2. How can you use the knowledge about word frequencies to make your model 
smaller and more efficient?  
3. How many parameters do you need (i.e. how big is n in your n-gram model)?  
4. Can you think of simple ways to "smooth" the probabilities (think about 
giving all n-grams a non-zero probability even if they aren't observed in the 
data) ?  
5. How do you evaluate whether your model is any good?  
6. How can you use 
[backoff models](http://en.wikipedia.org/wiki/Katz%27s_back-off_model) to 
estimate the probability of unobserved n-grams?
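
As one possible starting point for question 4, add-one (Laplace) smoothing gives every bigram a non-zero probability by adding 1 to each count and the vocabulary size to the denominator. The block below is a sketch on toy counts, not a model built in this notebook: `contextCount`, `bigramCount` and `addOneProb` are hypothetical names, and Katz back-off or Good-Turing estimation would be more refined choices.

```r
# Toy counts (hypothetical): bigram counts keyed "w1_w2" and, for each word,
# the number of bigrams it starts.
bigramCount  <- c(i_like = 2, like_tea = 1, tea_i = 1)
contextCount <- c(i = 2, like = 1, tea = 1)
V <- length(contextCount)   # vocabulary size

# Add-one estimate of P(w2 | w1): (count(w1, w2) + 1) / (count(w1) + V).
addOneProb <- function(w1, w2) {
    key <- paste(w1, w2, sep = "_")
    cnt <- if (key %in% names(bigramCount)) bigramCount[[key]] else 0
    (cnt + 1) / (contextCount[[w1]] + V)
}

addOneProb("i", "like")   # seen bigram:   (2 + 1) / (2 + 3)
addOneProb("i", "tea")    # unseen bigram: (0 + 1) / (2 + 3), still non-zero
```

For a fixed first word the smoothed probabilities still sum to 1 over the vocabulary, so the estimate remains a proper distribution.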

# Task 4 - Build a predictive model & evaluate the model for efficiency and accuracy  

```{r task4Placeholder, echo=FALSE, cache=TRUE, message=FALSE}
```

1. How does the model perform for different choices of the parameters and size 
of the model?  
2. How much does the model slow down for the performance you gain?  
3. Does perplexity correlate with the other measures of accuracy?  
4. Can you reduce the size of the model (number of parameters) without reducing 
performance?  
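
For question 3, perplexity can be computed from the probabilities a model assigns to each word of a held-out text: it is the exponential of the average negative log-probability. The block below is a minimal sketch; `wordProbs` holds made-up values and `perplexity` is a hypothetical helper.

```r
# Probabilities a model assigned to each word of a held-out text (hypothetical).
wordProbs <- c(0.25, 0.5, 0.125, 0.25)

# Perplexity: exponential of the average negative log-probability.
perplexity <- function(p) exp(-mean(log(p)))

perplexity(wordProbs)       # geometric mean of the probabilities is 1/4
perplexity(rep(0.25, 10))   # a uniform choice among 4 words
```

Both calls evaluate to 4 (up to floating point), which reads as the model being as uncertain as a uniform choice among four words. Lower values mean better predictions, and a zero probability for any observed word makes the perplexity infinite, which is one reason smoothing matters.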


```{r closingPlaceholder, echo=FALSE, cache=TRUE, message=FALSE}
```

[CRAN Task View: Natural Language Processing](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html)  
[Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law)  
[n-gram model](http://en.wikipedia.org/wiki/N-gram)  
[Google Books Ngram Viewer](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)  
[backoff models](http://en.wikipedia.org/wiki/Katz%27s_back-off_model)  
[Good–Turing frequency estimation](https://en.wikipedia.org/wiki/Good–Turing_frequency_estimation)