Commit

Create automated build
RRC_GHA committed Aug 31, 2023
1 parent 2782ace commit 2e58b75
Showing 6 changed files with 1,582 additions and 76 deletions.
2 changes: 1 addition & 1 deletion public/2023-08-delta/search.json
@@ -431,7 +431,7 @@
"href": "session_12.html#exercise-explore-unstructured-text-data-from-a-pdf",
"title": "12  Working with Text Data in R",
"section": "12.5 Exercise: Explore Unstructured Text Data from a PDF",
-"text": "12.5 Exercise: Explore Unstructured Text Data from a PDF\n\n\n\n\n\n\nSetup\n\n\n\n\nIn the intro-text-data.qmd file, create a new header for this exercise (e.g. “Explore Unstructured Text Data from a PDF”).\nCreate a new code chunk and attach the following libraries:\n\n\nlibrary(tidytext) # tidy text tools\nlibrary(quanteda) # create a corpus\n\nWarning in stringi::stri_info(): Your current locale is not in the list of\navailable locales. Some functions may not work properly. Refer to\nstri_locale_list() for more details on known locale specifiers.\n\nWarning in stringi::stri_info(): Your current locale is not in the list of\navailable locales. Some functions may not work properly. Refer to\nstri_locale_list() for more details on known locale specifiers.\n\nlibrary(pdftools) # read in data\nlibrary(dplyr) # wrangle data\nlibrary(stringr) # string manipulation\nlibrary(ggplot2) # plots\nlibrary(wordcloud)\n\n\nDepending on which group you’re in, read in one of the following chapters of the Delta Plan. Access and download the chapters of the Delta Plan from Delta Stewardship Council website.\n\nNotes for quick exploration of data:\n\nCheck the class() of the pdf you just read in - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console. What does your data look like? What can you infer from how it’s structured?\n\n\n# ch 3\npath_df <- \"data/dsc-plan-ch3.pdf\"\ndp_ch3 <- pdftools::pdf_text(path_df)\n\n# ch 4\npath_df <- \"data/dsc-plan-ch4.pdf\"\ndp_ch4 <- pdftools::pdf_text(path_df)\n\n# ch 5\npath_df <- \"data/dsc-plan-ch5.pdf\"\ndp_ch5 <- pdftools::pdf_text(path_df)\n\n# ch 6\npath_df <- \"data/dsc-plan-ch6.pdf\"\ndp_ch6 <- pdftools::pdf_text(path_df)\n\n\nUsing the quanteda package, turn the unstructured pdf text data into a corpus.\n\nNotes for quick exploration of data:\n\nCheck the class() of the corpus you created - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console. What does your data look like? How does this structure compare to the pdf object?\nRun summary() of the corpus in the Console. What insights can you glean?\n\n\ncorpus_dp_ch <- quanteda::corpus(dp_ch)\n\n\nUsing tidy() from tidytext, make the corpus a tidy object.\n\nNotes for quick exploration of data:\n\nCheck the class() of the corpus you created - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console or use View(). What does your data look like? Is it what you expected?\n\n\ntidy_dp_ch <- tidytext::tidy(corpus_dp_ch)\n\n\n\n\n12.5.1 Questions\nWork independently or in groups for Question 1-5. The code solutions are based on the text data from Chapter 8 of the Delta Plan.\n\n\n\n\n\n\nQuestion 1\n\n\n\nTokenize the tidy text data using unnest_tokens()\n\n\n\n\nAnswer\nunnest_dp_ch8 <- tidy_dp_ch8 %>% \n unnest_tokens(output = word,\n input = text) \n\n\n\n\n\n\n\n\nQuestion 2\n\n\n\nRemove stop words using anti_join() and the stop_words data frame from tidytext.\n\n\n\n\nAnswer\nwords_dp_ch8 <- unnest_dp_ch8 %>% \n dplyr::anti_join(stop_words)\n\n\n\n\n\n\n\n\nQuestion 3\n\n\n\nCalculate the top 10 most frequently occurring words. Consider using count() and slice_max().\n\n\n\n\nAnswer\ncount_dp_ch8 <- words_dp_ch8 %>%\n count(word) %>%\n slice_max(n = 10, order_by = n)\n\n\n\n\n\n\n\n\nQuestion 4\n\n\n\nVisualize the results using a plot of your choice (e.g. bar plot, lollipop plot, or wordcloud).\n\n\n\n\nPlot Code\n# bar plot\nggplot(count_dp_ch8, aes(x = reorder(word, n), y = n)) +\n geom_col() +\n coord_flip() +\n labs(title = \"Top 10 Most Frequently Occurring Words in Chapter 8 of the Delta Plan\",\n x = NULL,\n y = \"count\") +\n theme_minimal()\n\n\n\n\n\n\n\nPlot Code\n# lollipop plot\nggplot(data = count_dp_ch8, aes(x=reorder(word, n), y=n)) +\n geom_point() +\n geom_segment(aes(x=word, xend=word, y=0, yend=n)) +\n coord_flip() +\n labs(title = \"Top 10 Most Frequently Occurring Words in Chapter 8 of the Delta Plan\",\n x = NULL,\n y = \"Count\") +\n theme_minimal()\n\n\n\n\n\n\n\nPlot Code\n# wordcloud\nwordcloud(words = count_dp_ch8$word,\n freq = count_dp_ch8$n)\n\n\n\n\n\n\n\n12.5.2 Bonus Question\n\n\n\n\n\n\nQuestion 5\n\n\n\nWhat do you think of your plots? Are they helpful? Consider other techniques like adding custom stop words or stemming to improve your results."
+"text": "12.5 Exercise: Explore Unstructured Text Data from a PDF\n\n\n\n\n\n\nSetup\n\n\n\n\nIn the intro-text-data.qmd file, create a new header for this exercise (e.g. “Explore Unstructured Text Data from a PDF”).\nCreate a new code chunk and attach the following libraries:\n\n\nlibrary(tidytext) # tidy text tools\nlibrary(quanteda) # create a corpus\nlibrary(pdftools) # read in data\nlibrary(dplyr) # wrangle data\nlibrary(stringr) # string manipulation\nlibrary(ggplot2) # plots\nlibrary(wordcloud)\n\n\nDepending on which group you’re in, read in one of the following chapters of the Delta Plan. Access and download the chapters of the Delta Plan from Delta Stewardship Council website.\n\nNotes for quick exploration of data:\n\nCheck the class() of the pdf you just read in - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console. What does your data look like? What can you infer from how it’s structured?\n\n\n# ch 3\npath_df <- \"data/dsc-plan-ch3.pdf\"\ndp_ch3 <- pdftools::pdf_text(path_df)\n\n# ch 4\npath_df <- \"data/dsc-plan-ch4.pdf\"\ndp_ch4 <- pdftools::pdf_text(path_df)\n\n# ch 5\npath_df <- \"data/dsc-plan-ch5.pdf\"\ndp_ch5 <- pdftools::pdf_text(path_df)\n\n# ch 6\npath_df <- \"data/dsc-plan-ch6.pdf\"\ndp_ch6 <- pdftools::pdf_text(path_df)\n\n\nUsing the quanteda package, turn the unstructured pdf text data into a corpus.\n\nNotes for quick exploration of data:\n\nCheck the class() of the corpus you created - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console. What does your data look like? How does this structure compare to the pdf object?\nRun summary() of the corpus in the Console. What insights can you glean?\n\n\ncorpus_dp_ch <- quanteda::corpus(dp_ch)\n\n\nUsing tidy() from tidytext, make the corpus a tidy object.\n\nNotes for quick exploration of data:\n\nCheck the class() of the corpus you created - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console or use View(). What does your data look like? Is it what you expected?\n\n\ntidy_dp_ch <- tidytext::tidy(corpus_dp_ch)\n\n\n\n\n12.5.1 Questions\nWork independently or in groups for Question 1-5. The code solutions are based on the text data from Chapter 8 of the Delta Plan.\n\n\n\n\n\n\nQuestion 1\n\n\n\nTokenize the tidy text data using unnest_tokens()\n\n\n\n\nAnswer\nunnest_dp_ch8 <- tidy_dp_ch8 %>% \n unnest_tokens(output = word,\n input = text) \n\n\n\n\n\n\n\n\nQuestion 2\n\n\n\nRemove stop words using anti_join() and the stop_words data frame from tidytext.\n\n\n\n\nAnswer\nwords_dp_ch8 <- unnest_dp_ch8 %>% \n dplyr::anti_join(stop_words)\n\n\n\n\n\n\n\n\nQuestion 3\n\n\n\nCalculate the top 10 most frequently occurring words. Consider using count() and slice_max().\n\n\n\n\nAnswer\ncount_dp_ch8 <- words_dp_ch8 %>%\n count(word) %>%\n slice_max(n = 10, order_by = n)\n\n\n\n\n\n\n\n\nQuestion 4\n\n\n\nVisualize the results using a plot of your choice (e.g. bar plot, lollipop plot, or wordcloud).\n\n\n\n\nPlot Code\n# bar plot\nggplot(count_dp_ch8, aes(x = reorder(word, n), y = n)) +\n geom_col() +\n coord_flip() +\n labs(title = \"Top 10 Most Frequently Occurring Words in Chapter 8 of the Delta Plan\",\n x = NULL,\n y = \"count\") +\n theme_minimal()\n\n\n\n\n\n\n\nPlot Code\n# lollipop plot\nggplot(data = count_dp_ch8, aes(x=reorder(word, n), y=n)) +\n geom_point() +\n geom_segment(aes(x=word, xend=word, y=0, yend=n)) +\n coord_flip() +\n labs(title = \"Top 10 Most Frequently Occurring Words in Chapter 8 of the Delta Plan\",\n x = NULL,\n y = \"Count\") +\n theme_minimal()\n\n\n\n\n\n\n\nPlot Code\n# wordcloud\nwordcloud(words = count_dp_ch8$word,\n freq = count_dp_ch8$n)\n\n\n\n\n\n\n\n12.5.2 Bonus Question\n\n\n\n\n\n\nQuestion 5\n\n\n\nWhat do you think of your plots? Are they helpful? Consider other techniques like adding custom stop words or stemming to improve your results."
},
{
"objectID": "session_12.html#common-text-analysis-and-text-mining-methods",
