From f3fdf132324c95abf3886c0a829c9cd9418682ba Mon Sep 17 00:00:00 2001 From: srosh2000 <> Date: Fri, 25 Aug 2023 15:35:13 +0200 Subject: [PATCH 1/3] Updates --- .../data-preparation/text-preprocessing.md | 21 ++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md b/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md index d6adb5d43..17b2a8b72 100644 --- a/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md +++ b/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md @@ -158,7 +158,9 @@ The process of splitting text into smaller bites called tokens is called **token # Stemming review_corpus<- stemDocument(review_corpus, language = "english") # Lemmatization -review_corpus<- tm_map(review_corpus, content_transformer(lemmatize_words)) +review_corpus<- tm_map(review_corpus, content_transformer(lemmatize_strings)) + +# Note: `lemmatize_words` function is used when you have a vector of words but in a corpus we do not have a vector of words. Instead, we have strings with each string being a document's content. Hence, we use `lemmatize_strings` function instead. ``` {{% /codeblock %}} @@ -230,3 +232,20 @@ Here are some alternative packages to check out - in Python: [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/), [TextBlob](https://textblob.readthedocs.io/en/dev/#), [spaCy](https://spacy.io/), and [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) {{% /tip %}} + +{{% summary %}} +- Text mining extracts insights from unstructured text data and the main goal is to transform a corpus of texts into interpretable patterns and valuable insights. +- Steps to Pre-processing raw text data: + - Lowercase: Normalize text by converting it to lowercase. + - Remove special characters, punctuation, and numbers. 
+ - Eliminate URLs, HTML tags, and other unwanted elements. + - Strip extra whitespaces and standardize formatting. + - Remove common stopwords that add little meaning. + - **Stemming**: reduces words to their root form by removing suffixes. + - **Lemmatization**: identifies base forms based on lexical knowledge. + +- Term Document Matrix (TDM): Represent the corpus as a matrix of terms and documents. Each row is a term, each column is a document, cells hold term weights. + +- Visualise the processed text data using Word Clouds which provide a quick overview of frequent words. + +{{% /summary %}} From 2b567263fc44c721340ea6c68735b878bcb6e06a Mon Sep 17 00:00:00 2001 From: Hannes Datta Date: Fri, 29 Sep 2023 19:48:48 +0200 Subject: [PATCH 2/3] some textual changes --- .../data-preparation/text-preprocessing.md | 75 +++++++++++-------- 1 file changed, 45 insertions(+), 30 deletions(-) diff --git a/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md b/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md index 17b2a8b72..7471011d6 100644 --- a/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md +++ b/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md @@ -10,7 +10,8 @@ authorlink: "https://nl.linkedin.com/in/roshinisudhaharan" aliases: - /building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing --- -# Overview + +## Overview Text mining is all about deriving insights from unstructured text data such as social media posts, consumer reviews and newspaper articles. The ultimate goal is to turn a large collection of texts, a corpus, into insights that reveal important and interesting patterns in the data. 
@@ -27,25 +28,26 @@ Before we can move into the analysis of text, the unstructured nature of the dat ## Steps for pre-processing text data -### Install tm package +### Install `tm` package {{% codeblock %}} + ```R install.packages("tm") library(tm) ``` -{{% /codeblock %}} +{{% /codeblock %}} ### Step 1: Create Corpus -The `tm-package` uses a so-called Corpus as the main structure for managing text documents. A corpus is a collection of documents and is classified into two types based on how the corpus is stored: +The `tm` package uses a so-called corpus as the main structure for managing text documents. A corpus is a collection of documents and is classified into two types based on how the corpus is stored: -- Volatile Corpus (VCorpus) - a temporary R object. This is the default implementation when creating a corpus. -- Permanent Corpus (PCorpus) - a permanent object that can be stored outside of R (e.g. in a database) +- Volatile corpus (VCorpus) - a temporary R object. This is the default implementation when creating a corpus. +- Permanent corpus (PCorpus) - a permanent object that can be stored outside of R (e.g., in a database) Next, to create the corpus you need to identify the *source* type of the object. *Sources* abstract input locations, like a directory, a connection, or simply an **R** @@ -76,7 +78,7 @@ A data frame source interprets each row of the data frame x as a document. The {{% /tip %}} ### Step 2: Cleaning Raw Data -The `tm-package` has several built-in transformation functions that enable pre-processing without *too much* code! +The `tm` package has several built-in transformation functions that enable pre-processing without *too much* code! This procedure might include (depending on the data) : @@ -100,6 +102,8 @@ review_corpus<- tm_map(review_corpus, content_transformer(tolower)) #### Whitespaces, Punctuation and Numbers +Whitespaces, while ensuring readability, can introduce inconsistencies in text data, necessitating standardized handling. 
Punctuation, essential for sentence structure, can cause variability in text mining, making its removal or normalization crucial. While numbers offer context, they might divert attention from the main textual content, so their normalization or removal can be beneficial. + {{% codeblock %}} ```R review_corpus<- tm_map(review_corpus, stripWhitespace) # removes whitespaces @@ -110,25 +114,28 @@ review_corpus<- tm_map(review_corpus, removeNumbers) # removes numbers #### Special characters, URLs or HTML tags -For this purpose, you may create a custom function based on your needs and use it neatly under the tm framework. +For this purpose, you may create a custom function based on your needs and use it neatly under the `tm` framework. + {{% codeblock %}} ```R # create custom function to remove other misc characters -text_preprocessing<- function(x) -{gsub('http\\S+\\s*','',x) # remove URLs - gsub('#\\S+','',x) # remove hashtags - gsub('[[:cntrl:]]','',x) # remove controls and special characters - gsub("^[[:space:]]*","",x) # remove leading whitespaces - gsub("[[:space:]]*$","",x) # remove trailing whitespaces - gsub(' +', ' ', x) # remove extra whitespaces -} +text_preprocessing<- function(x) { + x <- gsub('http\\S+\\s*','',x) # remove URLs + x <- gsub('#\\S+','',x) # remove hashtags + x <- gsub('[[:cntrl:]]','',x) # remove controls and special characters + x <- gsub("^[[:space:]]*","",x) # remove leading whitespaces + x <- gsub("[[:space:]]*$","",x) # remove trailing whitespaces + gsub(' +', ' ', x) # remove extra whitespaces; the last value is returned + } + # Now apply this function review_corpus<-tm_map(review_corpus,text_preprocessing) ``` {{% /codeblock %}} #### Stopwords -Stopwords such as “the”, “an” etc do not provide much of valuable information and can be removed from the text. Based on the context, you could also create custom stopwords list and remove them. +Stopwords such as “the”, “an” etc. do not provide much of valuable information and can be removed from the text. 
Based on the context, you could also create custom stopwords list and remove them. + {{% codeblock %}} ```R review_corpus<- tm_map(review_corpus, removeWords, stopwords("english")) @@ -145,11 +152,11 @@ The process of splitting text into smaller bites called tokens is called **token - Stemming: it is the process of getting the root form (stem) of the word by removing and replacing suffixes. However, watch out for *overstemming* or *understemming.* - *Overstemming occurs when words are over-truncated which might distort or strip the meaning of the word.* - E.g. the words “university” and “universe” may be reduced to “univers” but this implies both words mean the same which is incorrect. + E.g., the words “university” and “universe” may be reduced to “univers” but this implies both words mean the same which is incorrect. - *Understemming occurs when two words are stemmed from the same root that is not of different stems.* - E.g. consider the words “data” and “datum” which have “dat” as the stem. Reducing the words to “dat” and “datu” respectively results in understemming. + E.g., consider the words “data” and “datum” which have “dat” as the stem. Reducing the words to “dat” and “datu”, respectively, results in understemming. - Lemmatization: is the process of identifying the correct base forms of words using lexical knowledge bases. This overcomes the challenge of stemming where words might lose meaning and makes words more interpretable. @@ -164,8 +171,8 @@ review_corpus<- tm_map(review_corpus, content_transformer(lemmatize_strings)) ``` {{% /codeblock %}} -### Step 4: Term Document Matrix -The corpus can now be represented in the form of a Term Document Matrix which represents document vectors in matrix format. The rows of this matrix correspond to the terms in the document, columns represent the documents in the corpus and cells correspond to the weights of the terms. 
+### Step 4: Creating the Term-Document Matrix +The corpus can now be represented in the form of a Term-Document Matrix, which represents document vectors in matrix format. The rows of this matrix correspond to the terms in the document, columns represent the documents in the corpus and cells correspond to the weights of the terms. {{% codeblock %}} ```R @@ -173,9 +180,11 @@ tdm<- TermDocumentMatrix(review_corpus, control = list(wordlengths = c(1,Inf))) ``` {{% /codeblock %}} -#### Operations on Term-Document Matrices +## Inspect and Analyze your Textual Data + +### Inspecting Word Frequencies -Imagine you want to quickly view the terms with certain frequency, say at least 50. You can use `findFreqTerms()` for this. `findAssocs()` is another useful function if you want to find associations with at least certain percentage of correlation for certain term. +Now that we've created the term-document matrix, we can start analyzing our data. For example, imagine you want to quickly view the terms with a certain frequency, say at least 50. You can use `findFreqTerms()` for this. `findAssocs()` is another useful function if you want to find the terms whose correlation with a given term is at least a certain value. {{% codeblock %}} ```R @@ -207,7 +216,8 @@ ggplot(df_plot, aes(x = reorder(term, +freq), y = freq, fill = freq)) + geom_bar Term-document matrices tend to get very big with increasing sparsity. You can remove sparse terms, i.e., terms occurring only in very few documents. This reduces the matrix dimensionality without losing too much important information. Use the `removeSparseTerms()` function for this. {{% /tip %}} -### Step 5: Visualise +### Visualizing the Data + Let’s build a word cloud that gives quick insights into the most frequently occurring words across documents at a glance. We will use the `wordcloud2` package for this purpose. {{% codeblock %}} @@ -226,12 +236,7 @@ wordcloud2(df, color = "random-dark", backgroundColor = "white")
Wordcloud of the most frequent words

-{{% tip %}} -Here are some alternative packages to check out -- in R: [Quanteda](http://quanteda.io/), [Text2vec](https://text2vec.org/), [Tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html) and [Spacyr](https://cran.r-project.org/web/packages/spacyr/vignettes/using_spacyr.html) -- in Python: [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/), [TextBlob](https://textblob.readthedocs.io/en/dev/#), [spaCy](https://spacy.io/), and [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) - -{{% /tip %}} +That's it! You're now equipped with the basics of text analysis using the `tm` package in R! {{% summary %}} - Text mining extracts insights from unstructured text data and the main goal is to transform a corpus of texts into interpretable patterns and valuable insights. @@ -249,3 +254,13 @@ Here are some alternative packages to check out - Visualise the processed text data using Word Clouds which provide a quick overview of frequent words. {{% /summary %}} + +{{% tip %}} +__Curious for more?__ + +Here are some alternative packages to check out! 
+ +- in R: [Quanteda](http://quanteda.io/), [Text2vec](https://text2vec.org/), [Tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html) and [Spacyr](https://cran.r-project.org/web/packages/spacyr/vignettes/using_spacyr.html) +- in Python: [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/), [TextBlob](https://textblob.readthedocs.io/en/dev/#), [spaCy](https://spacy.io/), and [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) + +{{% /tip %}} From c02d3a2ccfcaae5a2ac33af97b0f935dbe31a492 Mon Sep 17 00:00:00 2001 From: Hannes Datta Date: Fri, 29 Sep 2023 19:51:25 +0200 Subject: [PATCH 3/3] fix merge conflict --- .../data-preparation/text-preprocessing.md | 30 ++++++++++--------- 1 file changed, 16 insertions(+), 14 deletions(-) diff --git a/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md b/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md index 7471011d6..96dbf68f2 100644 --- a/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md +++ b/content/building-blocks/prepare-your-data-for-analysis/data-preparation/text-preprocessing.md @@ -1,7 +1,7 @@ --- title: "Text Pre-processing in R" description: "Learn how to pre-process text data in R using the tm-package" -keywords: "R, preparation, raw data, text, cleaning, wrangling, NLP, preprocessing, tm, text analysis" +keywords: "R, preparation, raw data, text, cleaning, wrangling, NLP, preprocessing, tm, text analysis, corpus, stemming, lemmatization, document matrix, tokenization" #weight: 4 #date: 2020-11-11T22:02:51+05:30 draft: false @@ -13,19 +13,21 @@ aliases: ## Overview -Text mining is all about deriving insights from unstructured text data such as social media posts, consumer reviews and newspaper articles. +In this building block, you will learn the essential steps of text pre-processing in R. 
You will use the `tm` package to create a corpus, the main structure for managing text documents, and then carry out a range of text pre-processing tasks. These steps will allow you to transform raw text into a more structured and suitable form for analysis. Finally, you will explore the process of visualizing and analyzing text data using term-document matrices and word clouds. + +## Introduction + +Text mining is all about deriving insights from unstructured text data such as social media posts, consumer reviews, and newspaper articles. The ultimate goal is to turn a large collection of texts, a corpus, into insights that reveal important and interesting patterns in the data. -This could include either computing sentiment of text or inferring the topic of a text among other common tasks. +This could include either computing the sentiment of a text or inferring the topic of a text among other common tasks. Before we can move into the analysis of text, the unstructured nature of the data means there is a need to pre-process the raw text to transform it to provide some additional structure and clean the text to make it more amenable for further analysis. To illustrate some common pre-processing steps we will take some data on [Amazon reviews](https://www.kaggle.com/datasets/bharadwaj6/kindle-reviews) and use the `tm package` in R to clean up the review texts. - +In the GitHub repository linked below you can find the full R script `text_cleaning.R`, which is used as a reference throughout this building block. We will cover the most relevant code snippets from it. However, we strongly recommend that you review the full script and keep it on hand while following the building block so you can replicate the presented results and get a comprehensive picture of the content. 
{{% cta-primary-center "Go to the GitHub Repository now" "https://github.com/srosh2000/book-review-analysis-example" %}} - - ## Steps for pre-processing text data ### Install `tm` package @@ -82,11 +84,11 @@ The `tm` package has several built-in transformation functions that enable pre-p This procedure might include (depending on the data) : -- removal of extra spaces -- lowering case -- removal of special characters -- removal of URLs and HTML tags -- removal of stopwords +- The removal of extra spaces +- Lowering case +- The removal of special characters +- The removal of URLs and HTML tags +- The removal of stopwords #### Lowering case @@ -139,7 +141,7 @@ Stopwords such as “the”, “an” etc. do not provide much of valuable infor {{% codeblock %}} ```R review_corpus<- tm_map(review_corpus, removeWords, stopwords("english")) -# OR: creating and using custom stopwords in adddition +# OR: creating and using custom stopwords in addition mystopwords<- c(stopwords("english"),"book","people") review_corpus<- tm_map(review_corpus, removeWords, mystopwords) ``` @@ -160,10 +162,10 @@ The process of splitting text into smaller bites called tokens is called **token - Lemmatization: is the process of identifying the correct base forms of words using lexical knowledge bases. This overcomes the challenge of stemming where words might lose meaning and makes words more interpretable. +In this example we will stick to Lemmatization, which can be conducted in R as shown in the code block below: + {{% codeblock %}} ```R -# Stemming -review_corpus<- stemDocument(review_corpus, language = "english") # Lemmatization review_corpus<- tm_map(review_corpus, content_transformer(lemmatize_strings))