Deploying to gh-pages from @ f6136ad 🚀
hdolinh committed Aug 29, 2023 · 1 parent 800a70b · commit abae20e
Showing 4 changed files with 42 additions and 21 deletions.
2023-08-delta/search.json (4 changes: 2 additions & 2 deletions)
@@ -354,7 +354,7 @@
"href": "session_12.html#what-is-text-data",
"title": "12  Working with Text Data in R",
"section": "12.1 What is text data?",
"text": "12.1 What is text data?\nText data is information stored as character or string data types. It comes in various different forms including books, emails, social media posts, interview transcripts, newspapers, government reports, and much more.\n\n12.1.1 How do we talk about text data?\nHere is a list of text data or text analysis terms we’ll be referring to throughout this lesson. Note this is not a comprehensive list of text analysis terms that are used beyond this lesson.\n\n\n\n\n\n\n\nTerm\nDefinition\n\n\n\n\nCorpus (corpora, plural)\nCollection or database of text or multiple texts. These types of objects typically contain raw strings annotated with additional metadata and details.\n\n\nDocument-term matrix\nRepresents the relationship between terms and documents, where each row stands for a term and each column for a document, and an entry is the number of occurrences of the term in the document.\n\n\nNatural Language Processing (NLP)\nNLP is an interdisciplinary field used in computer science, data science, linguistics, and others to analyze, categorize, and work with computerized text.\n\n\nString\nSpecific type of data whose values are enclosed within a set of quotes. Typically values or elements are characters (e.g. “Hello World!”).\n\n\nText analysis\nThe process of deriving high-quality information or patterns from text through evaluation and interpretation of the output. Also referred to as “text mining” or “text analytics”.\n\n\nToken\nA meaningful unit of text, such as a word, to use for analysis.\n\n\nTokenization\nThe process of splitting text into tokens.\n\n\n\n\n\n12.1.2 How is text data used in the environmental field?\nAs our knowledge about the environmental world grows, researchers will need new computational approaches for working with text data because reading and identifying all the relevant literature for literature syntheses is becoming an increasingly difficult task.\n<<<<<<< HEAD Beyond literature syntheses, quantitative text analysis tools are extremely valuable for efficiently extracting information from texts and other text mining or text analysis tasks. ======= Beyond literature syntheses, quantitative text analysis tools are extremely valuable for efficiently extracting information from texts and other text mining or text analysis tasks. >>>>>>> bb20f6f8311df6dfd89cc170ff212e980fc9a258"
"text": "12.1 What is text data?\nText data is information stored as character or string data types. It comes in various different forms including books, emails, social media posts, interview transcripts, newspapers, government reports, and much more.\n\n12.1.1 How do we talk about text data?\nHere is a list of text data or text analysis terms we’ll be referring to throughout this lesson. Note this is not a comprehensive list of text analysis terms that are used beyond this lesson.\n\n\n\n\n\n\n\nTerm\nDefinition\n\n\n\n\nCorpus (corpora, plural)\nCollection or database of text or multiple texts. These types of objects typically contain raw strings annotated with additional metadata and details.\n\n\nDocument-term matrix\nRepresents the relationship between terms and documents, where each row stands for a term and each column for a document, and an entry is the number of occurrences of the term in the document.\n\n\nNatural Language Processing (NLP)\nNLP is an interdisciplinary field used in computer science, data science, linguistics, and others to analyze, categorize, and work with computerized text.\n\n\nString\nSpecific type of data whose values are enclosed within a set of quotes. Typically values or elements are characters (e.g. “Hello World!”).\n\n\nText analysis\nThe process of deriving high-quality information or patterns from text through evaluation and interpretation of the output. Also referred to as “text mining” or “text analytics”.\n\n\nToken\nA meaningful unit of text, such as a word, to use for analysis.\n\n\nTokenization\nThe process of splitting text into tokens.\n\n\n\n\n\n12.1.2 How is text data used in the environmental field?\nAs our knowledge about the environmental world grows, researchers will need new computational approaches for working with text data because reading and identifying all the relevant literature for literature syntheses is becoming an increasingly difficult task.\nBeyond literature syntheses, quantitative text analysis tools are extremely valuable for efficiently extracting information from texts and other text mining or text analysis tasks."
},
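
The terms defined in the entry above (token, tokenization, document-term matrix) map directly onto functions in the tidytext workflow this lesson teaches. Below is a minimal sketch of that mapping, assuming the tidytext, dplyr, and tibble packages are installed (plus tm, which cast_dtm() needs); the two-document toy corpus and the toy_* object names are illustrative, not part of the lesson.

```r
# A minimal sketch of the terms defined above. The two-document "corpus"
# and the toy_* names are illustrative assumptions, not part of the lesson.
library(tidytext)
library(dplyr)
library(tibble)

# corpus: a small collection of texts stored as strings
toy_corpus <- tibble(
  document = c("doc1", "doc2"),
  text = c("Text data is information stored as strings.",
           "A token is a meaningful unit of text.")
)

# tokenization: split each document into one token (word) per row
toy_tokens <- toy_corpus %>%
  unnest_tokens(word, text)

# document-term matrix: rows are documents, columns are terms, and each
# entry counts occurrences of the term in the document
# (cast_dtm() comes from tidytext and requires the tm package)
toy_dtm <- toy_tokens %>%
  count(document, word) %>%
  cast_dtm(document, word, n)
```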
{
"objectID": "session_12.html#what-is-tidy-text-data",
@@ -368,7 +368,7 @@
"href": "session_12.html#exercise-tidy-text-workflow",
"title": "12  Working with Text Data in R",
"section": "12.3 Exercise: Tidy Text Workflow",
"text": "12.3 Exercise: Tidy Text Workflow\n\nWe are going to use the gutenbergr package to access public domain texts from Project Gutenberg (a library of free eBooks). We’ll then use the tidytext, dyplr and ggplot2 packages to practice the tidy text workflow.\nBreak out into groups and then follow the exercise setup and instructions.\n\n\n\n\n\n\nSetup and Instructions\n\n\n\n\nCreate a new qmd file and title it “Intro to Text Data”, name yourself as the author, and then save the file as intro-text-data.qmd.\nCreate a new code chunk and attach the following libraries:\n\n\nlibrary(gutenbergr) # access public domain texts from Project Gutenberg\nlibrary(tidytext) # text mining using tidy tools\nlibrary(dplyr) # wrangle data\nlibrary(ggplot2) # plot data\n\n\nDepending on which group you’re in, use one of the following public domain texts:\n\n\n# Group A\ngutenberg_works(title == \"Dracula\") # dracula text\n\n# Group B\ngutenberg_works(title == \"Frankenstein; Or, The Modern Prometheus\") # frankenstein text\n\n# Group C\ngutenberg_works(title == \"Carmilla\") # carmilla text\n\n\nGet the id number from the gutenberg_works() function so that you can download the text as a corpus using the function gutenberg_download(). Save the corpus to an object called {book-title}_corp. View the object - is the data in a tidy format?\nTokenize the corpus data using unnest_tokens(). Take a look at the data - do we need every single token for our analysis?\nRemove “stop words” or words that can be safely removed or ignored without sacrificing the meaning of the sentence (e.g. “to”, “in”, “and”) using anti_join(). Take a look at the data - are you satisfied with your data? We won’t conduct any additional cleaning steps here, but consider how you would further clean the data.\nCalculate the top 10 most frequent words using the functions count() and slice_max().\nPlot the top 10 most frequent words using ggplot(). 
We reccommend creating either a bar plot using geom_col() or a lollipop plot using both geom_point() and geom_segment().\nBonus: Consider elements in theme() and improve your plot.\n\n\n\n\n12.3.1 Example using Ray Bradbury’s Asleep in Armageddon\nThe code chunks below follows the instructions from above using Ray Bradbury’s Asleep in Armageddon.\n\n# get id number\ngutenberg_works(title == \"Asleep in Armageddon\")\n\n# A tibble: 1 × 8\n gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf\n <int> <chr> <chr> <int> <chr> <chr> \n1 63827 Asleep i… Bradb… 41269 en <NA> \n# ℹ 2 more variables: rights <chr>, has_text <lgl>\n\n\n\n\nSteps 4-7 Code\n# access text data using id number from `gutenberg_works()`\nbradbury_corp <- gutenberg_download(63827)\n\n# tidy text data - unnest and remove stop words\ntidy_bradbury <- bradbury_corp %>% \n unnest_tokens(word, text) %>% \n anti_join(stop_words, by = \"word\")\n\n# calculate top 10 most frequent words\ncount_bradbury <- tidy_bradbury %>%\n count(word) %>% \n slice_max(n = 10, order_by = n)\n\n\n\n\nStep 8 Plot Code\n# visualize text data #\n# bar plot\nggplot(data = count_bradbury, aes(n, reorder(word, n))) +\n geom_col() +\n labs(x = \"Count\",\n y = \"Token\")\n\n\n\n\n\n\n\nStep 9 Plot Code\n# visualize text data #\n# initial lollipop plot\n# ggplot(data = count_bradbury, aes(x=word, y=n)) +\n# geom_point() +\n# geom_segment(aes(x=word, xend=word, y=0, yend=n)) +\n# coord_flip() +\n# labs(x = \"Token\",\n# y = \"Count\")\n\n# ascending order pretty lollipop plot\nggplot(data = count_bradbury, aes(x=reorder(word, n), y=n)) +\n geom_point(color=\"cyan4\") +\n geom_segment(aes(x=word, xend=word, y=0, yend=n), color=\"cyan4\") +\n coord_flip() +\n labs(title = \"Top Ten Words in Ray Bradbury's Asleep in Armageddon\",\n x = NULL,\n y = \"Count\") +\n theme_minimal() +\n theme(\n panel.grid.major.y = element_blank()\n )"
"text": "12.3 Exercise: Tidy Text Workflow\n\nWe are going to use the gutenbergr package to access public domain texts from Project Gutenberg (a library of free eBooks). We’ll then use the tidytext, dyplr and ggplot2 packages to practice the tidy text workflow.\nBreak out into groups and then follow the exercise setup and instructions.\n\n\n\n\n\n\nSetup and Instructions\n\n\n\n\nCreate a new qmd file and title it “Intro to Text Data”, name yourself as the author, and then save the file as intro-text-data.qmd.\nCreate a new code chunk and attach the following libraries:\n\n\nlibrary(gutenbergr) # access public domain texts from Project Gutenberg\nlibrary(tidytext) # text mining using tidy tools\nlibrary(dplyr) # wrangle data\nlibrary(ggplot2) # plot data\n\n\nDepending on which group you’re in, use one of the following public domain texts:\n\n\n# Group A\ngutenberg_works(title == \"Dracula\") # dracula text\n\n# Group B\ngutenberg_works(title == \"Frankenstein; Or, The Modern Prometheus\") # frankenstein text\n\n# Group C\ngutenberg_works(title == \"The Strange Case of Dr. Jekyll and Mr. Hyde\") # jekyll hyde text\n\n\nGet the id number from the gutenberg_works() function so that you can download the text as a corpus using the function gutenberg_download(). Save the corpus to an object called {book-title}_corp. View the object - is the data in a tidy format?\nTokenize the corpus data using unnest_tokens(). Take a look at the data - do we need every single token for our analysis?\nRemove “stop words” or words that can be safely removed or ignored without sacrificing the meaning of the sentence (e.g. “to”, “in”, “and”) using anti_join(). Take a look at the data - are you satisfied with your data? We won’t conduct any additional cleaning steps here, but consider how you would further clean the data.\nCalculate the top 10 most frequent words using the functions count() and slice_max().\nPlot the top 10 most frequent words using ggplot(). 
We reccommend creating either a bar plot using geom_col() or a lollipop plot using both geom_point() and geom_segment().\nBonus: Consider elements in theme() and improve your plot.\n\n\n\n\n12.3.1 Example using Ray Bradbury’s Asleep in Armageddon\nThe code chunks below follows the instructions from above using Ray Bradbury’s Asleep in Armageddon.\n\n# get id number\ngutenberg_works(title == \"The Phantom of the Opera\")\n\n# A tibble: 1 × 8\n gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf\n <int> <chr> <chr> <int> <chr> <chr> \n1 175 The Phan… Lerou… 112 en Opera/Gothic Ficti…\n# ℹ 2 more variables: rights <chr>, has_text <lgl>\n\n\n\n\nSteps 4-7 Code\n# access text data using id number from `gutenberg_works()`\nbradbury_corp <- gutenberg_download(63827)\n\n# tidy text data - unnest and remove stop words\ntidy_bradbury <- bradbury_corp %>% \n unnest_tokens(word, text) %>% \n anti_join(stop_words, by = \"word\")\n\n# calculate top 10 most frequent words\ncount_bradbury <- tidy_bradbury %>%\n count(word) %>% \n slice_max(n = 10, order_by = n)\n\n\n\n\nStep 8 Plot Code\n# visualize text data #\n# bar plot\nggplot(data = count_bradbury, aes(n, reorder(word, n))) +\n geom_col() +\n labs(x = \"Count\",\n y = \"Token\")\n\n\n\n\n\n\n\nStep 9 Plot Code\n# visualize text data #\n# initial lollipop plot\n# ggplot(data = count_bradbury, aes(x=word, y=n)) +\n# geom_point() +\n# geom_segment(aes(x=word, xend=word, y=0, yend=n)) +\n# coord_flip() +\n# labs(x = \"Token\",\n# y = \"Count\")\n\n# ascending order pretty lollipop plot\nggplot(data = count_bradbury, aes(x=reorder(word, n), y=n)) +\n geom_point(color=\"cyan4\") +\n geom_segment(aes(x=word, xend=word, y=0, yend=n), color=\"cyan4\") +\n coord_flip() +\n labs(title = \"Top Ten Words in Ray Bradbury's Asleep in Armageddon\",\n x = NULL,\n y = \"Count\") +\n theme_minimal() +\n theme(\n panel.grid.major.y = element_blank()\n )"
},
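
Steps 4 through 8 of the exercise above chain into one short pipeline. Here is a hedged end-to-end sketch applied to the Group C text introduced in this commit; it assumes gutenbergr can reach a Project Gutenberg mirror at run time, and the jekyll_* object names are illustrative, not from the lesson.

```r
# An end-to-end sketch of exercise steps 4-8 for the Group C text.
# Assumes network access to Project Gutenberg; jekyll_* names are illustrative.
library(gutenbergr) # access public domain texts from Project Gutenberg
library(tidytext)   # text mining using tidy tools
library(dplyr)      # wrangle data
library(ggplot2)    # plot data

# step 4: look up the Gutenberg id instead of hardcoding it
jekyll_meta <- gutenberg_works(title == "The Strange Case of Dr. Jekyll and Mr. Hyde")

# steps 4-6: download the corpus, tokenize, and remove stop words
tidy_jekyll <- gutenberg_download(jekyll_meta$gutenberg_id) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# step 7: the ten most frequent remaining words
count_jekyll <- tidy_jekyll %>%
  count(word) %>%
  slice_max(n = 10, order_by = n)

# step 8: bar plot with the most frequent token on top
ggplot(count_jekyll, aes(n, reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = "Token")
```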
{
"objectID": "session_13.html#learning-objectives",