Merge pull request #827 from tilburgsciencehub/buildingblock/text-pre-processing

Updates to text pre-processing BB

hannesdatta authored Oct 1, 2023
2 parents 299ed88 + 362202c commit 3d235af
Showing 1 changed file with 58 additions and 25 deletions.
In the GitHub repository linked below you can find the full R script.

## Steps for pre-processing text data

### Install the `tm` package


{{% codeblock %}}

```R
install.packages("tm")
library(tm)
```
{{% /codeblock %}}



### Step 1: Create a Corpus

The `tm` package uses a so-called corpus as the main structure for managing text documents. A corpus is a collection of documents and is classified into two types based on how the corpus is stored:

- Volatile corpus (VCorpus) - a temporary R object. This is the default implementation when creating a corpus.
- Permanent corpus (PCorpus) - a permanent object that can be stored outside of R (e.g., in a database)

Next, to create the corpus you need to identify the *source* type of the object. *Sources* abstract input locations, like a directory, a connection, or simply an **R** vector.

{{% tip %}}
A data frame source interprets each row of the data frame `x` as a document.
{{% /tip %}}
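To make this concrete, here is a minimal sketch of creating a volatile corpus. The data frame `reviews_df` and its contents are made up for illustration; note that a data frame source expects the columns `doc_id` and `text`.

{{% codeblock %}}
```R
library(tm)

# hypothetical example data: one review per row
reviews_df <- data.frame(
  doc_id = 1:3,
  text = c("Great product, works as advertised!",
           "Terrible battery life, would not buy again.",
           "Average quality for the price.")
)

# create a volatile corpus from a data frame source...
review_corpus <- VCorpus(DataframeSource(reviews_df))

# ...or directly from a character vector
review_corpus <- VCorpus(VectorSource(reviews_df$text))
```
{{% /codeblock %}}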

### Step 2: Cleaning Raw Data
The `tm` package has several built-in transformation functions that enable pre-processing without *too much* code!
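You can list the transformation functions that ship with `tm` as follows:

{{% codeblock %}}
```R
# returns the names of tm's built-in transformations
getTransformations()
```
{{% /codeblock %}}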

This procedure might include the following steps (depending on the data):

#### Lowercase

Converting all text to lowercase ensures that the same word is not treated as two different terms just because of capitalization.

{{% codeblock %}}
```R
review_corpus <- tm_map(review_corpus, content_transformer(tolower)) # converts all text to lowercase
```
{{% /codeblock %}}

#### Whitespaces, Punctuation and Numbers

Whitespaces, while ensuring readability, can introduce inconsistencies in text data, necessitating standardized handling. Punctuation, essential for sentence structure, can cause variability in text mining, making its removal or normalization crucial. While numbers offer context, they might divert attention from the main textual content, so their normalization or removal can be beneficial.

{{% codeblock %}}
```R
review_corpus <- tm_map(review_corpus, stripWhitespace)    # collapses repeated whitespaces into one
review_corpus <- tm_map(review_corpus, removePunctuation)  # removes punctuation
review_corpus <- tm_map(review_corpus, removeNumbers)      # removes numbers
```
{{% /codeblock %}}


#### Special characters, URLs or HTML tags
For this purpose, you may create a custom function based on your needs and use it neatly under the `tm` framework.

{{% codeblock %}}
```R
# create a custom function to remove other miscellaneous characters
text_preprocessing <- function(x) {
  x <- gsub('http\\S+\\s*', '', x)    # remove URLs
  x <- gsub('#\\S+', '', x)           # remove hashtags
  x <- gsub('<[^>]+>', ' ', x)        # remove HTML tags
  x <- gsub('[[:cntrl:]]', '', x)     # remove control and special characters
  x <- gsub('^[[:space:]]*', '', x)   # remove leading whitespaces
  x <- gsub('[[:space:]]*$', '', x)   # remove trailing whitespaces
  x <- gsub(' +', ' ', x)             # collapse repeated whitespaces
  x
}

# Now apply this function to every document in the corpus
review_corpus <- tm_map(review_corpus, content_transformer(text_preprocessing))
```
{{% /codeblock %}}
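As a quick sanity check, you could first run the function on a single made-up string before mapping it over the whole corpus:

{{% codeblock %}}
```R
text_preprocessing("Check this out  http://example.com   #amazing")
# should return roughly: "Check this out"
```
{{% /codeblock %}}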

#### Stopwords
Stopwords such as “the”, “an”, etc. do not provide much valuable information and can be removed from the text. Depending on the context, you could also create a custom stopword list and remove those words as well.

{{% codeblock %}}
```R
review_corpus<- tm_map(review_corpus, removeWords, stopwords("english"))
# optionally, remove your own custom stopwords (the words below are just placeholders)
review_corpus <- tm_map(review_corpus, removeWords, c("product", "brand"))
```
{{% /codeblock %}}

### Step 3: Tokenization, Stemming and Lemmatization

The process of splitting text into smaller pieces, called tokens, is known as **tokenization**. Two common ways to then normalize those tokens are stemming and lemmatization:
- Stemming: the process of reducing a word to its root form (stem) by removing or replacing suffixes (see the short stemming sketch after this list). However, watch out for *overstemming* or *understemming.*
- *Overstemming occurs when words are over-truncated, which might distort or strip the meaning of the word.*

E.g., the words “university” and “universe” may both be reduced to “univers”, which wrongly implies that the two words mean the same thing.

- *Understemming occurs when words that should be reduced to the same stem are instead reduced to different stems.*

E.g., the words “data” and “datum” share the stem “dat”. Reducing them to “dat” and “datu”, respectively, results in understemming.

- Lemmatization: the process of identifying the correct base form of a word using lexical knowledge bases. This overcomes the drawback of stemming, where words might lose their meaning, and keeps words interpretable.
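If you prefer stemming, `tm` ships with a built-in `stemDocument()` transformation (a minimal sketch; it relies on the SnowballC package being installed):

{{% codeblock %}}
```R
# Porter stemming via tm's built-in transformation
review_corpus_stemmed <- tm_map(review_corpus, stemDocument)
```
{{% /codeblock %}}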

In this example, however, we will stick to lemmatization, which can be conducted in R as follows:
{{% codeblock %}}
```R
# Lemmatization (lemmatize_strings() comes from the textstem package)
library(textstem)

review_corpus <- tm_map(review_corpus, content_transformer(lemmatize_strings))

# Note: `lemmatize_words()` is meant for a vector of individual words. A corpus instead
# stores strings, with each string holding one document's content, so we use
# `lemmatize_strings()` here.
```
{{% /codeblock %}}
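To get a feel for what `lemmatize_strings()` does, you can try it on a single made-up sentence first:

{{% codeblock %}}
```R
lemmatize_strings("The cats were running faster than the dogs")
```
{{% /codeblock %}}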

### Step 4: Creating the Term-Document Matrix
The corpus can now be represented in the form of a Term-Document Matrix, which represents document vectors in matrix format. The rows of this matrix correspond to the terms, the columns represent the documents in the corpus, and the cells contain the weights of the terms.

{{% codeblock %}}
```R
# wordLengths = c(1, Inf) keeps terms of any length (the default drops terms shorter than 3 characters)
tdm <- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1, Inf)))
```
{{% /codeblock %}}
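To verify the structure described above (terms as rows, documents as columns), you can inspect a small slice of the matrix; a minimal sketch, assuming your corpus has at least five terms and three documents:

{{% codeblock %}}
```R
# peek at the first few terms and documents (adjust the indices to your data)
inspect(tdm[1:5, 1:3])
```
{{% /codeblock %}}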

## Inspect and Analyze your Textual Data

### Inspecting Word Frequencies

Now that we've created the term-document matrix, we can start analyzing our data. For example, imagine you want to quickly view the terms with a certain minimum frequency, say at least 50. You can use `findFreqTerms()` for this. `findAssocs()` is another useful function if you want to find the terms that are associated with a given term at or above a certain correlation.

{{% codeblock %}}
```R
# terms that occur at least 50 times across the corpus
findFreqTerms(tdm, lowfreq = 50)

# terms correlated with a given term at or above corlimit
# (replace "price" and the 0.3 threshold with values relevant to your data)
findAssocs(tdm, "price", corlimit = 0.3)
```
{{% /codeblock %}}
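You can also plot the most frequent terms as a bar chart. Below is a minimal sketch that assumes the `ggplot2` package; it builds a data frame `df_plot` with `term` and `freq` columns from the term-document matrix.

{{% codeblock %}}
```R
library(ggplot2)

# build a term-frequency data frame from the term-document matrix
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
df_plot <- head(data.frame(term = names(freq), freq = freq), 20)

# horizontal bar chart of the 20 most frequent terms
ggplot(df_plot, aes(x = reorder(term, +freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") +
  coord_flip()
```
{{% /codeblock %}}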
{{% tip %}}
Term-document matrices tend to get very big with increasing sparsity. You can remove sparse terms, i.e., terms occurring only in very few documents. This reduces the matrix dimensionality without losing too much important information. Use the `removeSparseTerms()` function for this.
{{% /tip %}}
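For example, a minimal sketch (the 0.99 threshold is only an illustration; it keeps terms that appear in at least roughly 1% of the documents):

{{% codeblock %}}
```R
tdm_small <- removeSparseTerms(tdm, sparse = 0.99)
```
{{% /codeblock %}}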

### Visualizing the Data

Let’s build a word cloud, which shows the most frequently occurring words across documents at a glance. We will use the `wordcloud2` package for this purpose.

{{% codeblock %}}
```R
library(wordcloud2)

# build a term-frequency data frame from the term-document matrix
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
df <- data.frame(word = names(freq), freq = freq)

wordcloud2(df, color = "random-dark", backgroundColor = "white")
```
{{% /codeblock %}}
*Figure: Wordcloud of the most frequent words*

That's it! You're now equipped with the basics of text analysis using the `tm` package in R!

{{% summary %}}
- Text mining extracts insights from unstructured text data; the main goal is to transform a corpus of texts into interpretable patterns and valuable information.
- Steps for pre-processing raw text data:
- Lowercase: Normalize text by converting it to lowercase.
- Remove special characters, punctuation, and numbers.
- Eliminate URLs, HTML tags, and other unwanted elements.
- Strip extra whitespaces and standardize formatting.
- Remove common stopwords that add little meaning.
- **Stemming**: reduces words to their root form by removing suffixes.
- **Lemmatization**: identifies base forms based on lexical knowledge.

- Term-Document Matrix (TDM): represents the corpus as a matrix of terms and documents. Each row is a term, each column is a document, and the cells hold term weights.

- Visualize the processed text data using word clouds, which provide a quick overview of frequent words.

{{% /summary %}}

{{% tip %}}
__Curious for more?__

Here are some alternative packages to check out!

- in R: [Quanteda](http://quanteda.io/), [Text2vec](https://text2vec.org/), [Tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html) and [Spacyr](https://cran.r-project.org/web/packages/spacyr/vignettes/using_spacyr.html)
- in Python: [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/), [TextBlob](https://textblob.readthedocs.io/en/dev/#), [spaCy](https://spacy.io/), and [CoreNLP](https://stanfordnlp.github.io/CoreNLP/)

{{% /tip %}}