Commit

Create automated build
RRC_GHA committed Aug 31, 2023
1 parent 2782ace commit 2e58b75
Showing 6 changed files with 1,582 additions and 76 deletions.
2 changes: 1 addition & 1 deletion public/2023-08-delta/search.json
@@ -431,7 +431,7 @@
"href": "session_12.html#exercise-explore-unstructured-text-data-from-a-pdf",
"title": "12  Working with Text Data in R",
"section": "12.5 Exercise: Explore Unstructured Text Data from a PDF",
-"text": "12.5 Exercise: Explore Unstructured Text Data from a PDF\n\n\n\n\n\n\nSetup\n\n\n\n\nIn the intro-text-data.qmd file, create a new header for this exercise (e.g. “Explore Unstructured Text Data from a PDF”).\nCreate a new code chunk and attach the following libraries:\n\n\nlibrary(tidytext) # tidy text tools\nlibrary(quanteda) # create a corpus\n\nWarning in stringi::stri_info(): Your current locale is not in the list of\navailable locales. Some functions may not work properly. Refer to\nstri_locale_list() for more details on known locale specifiers.\n\nWarning in stringi::stri_info(): Your current locale is not in the list of\navailable locales. Some functions may not work properly. Refer to\nstri_locale_list() for more details on known locale specifiers.\n\nlibrary(pdftools) # read in data\nlibrary(dplyr) # wrangle data\nlibrary(stringr) # string manipulation\nlibrary(ggplot2) # plots\nlibrary(wordcloud)\n\n\nDepending on which group you’re in, read in one of the following chapters of the Delta Plan. Access and download the chapters of the Delta Plan from Delta Stewardship Council website.\n\nNotes for quick exploration of data:\n\nCheck the class() of the pdf you just read in - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console. What does your data look like? What can you infer from how it’s structured?\n\n\n# ch 3\npath_df <- \"data/dsc-plan-ch3.pdf\"\ndp_ch3 <- pdftools::pdf_text(path_df)\n\n# ch 4\npath_df <- \"data/dsc-plan-ch4.pdf\"\ndp_ch4 <- pdftools::pdf_text(path_df)\n\n# ch 5\npath_df <- \"data/dsc-plan-ch5.pdf\"\ndp_ch5 <- pdftools::pdf_text(path_df)\n\n# ch 6\npath_df <- \"data/dsc-plan-ch6.pdf\"\ndp_ch6 <- pdftools::pdf_text(path_df)\n\n\nUsing the quanteda package, turn the unstructured pdf text data into a corpus.\n\nNotes for quick exploration of data:\n\nCheck the class() of the corpus you created - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console. What does your data look like? How does this structure compare to the pdf object?\nRun summary() of the corpus in the Console. What insights can you glean?\n\n\ncorpus_dp_ch <- quanteda::corpus(dp_ch)\n\n\nUsing tidy() from tidytext, make the corpus a tidy object.\n\nNotes for quick exploration of data:\n\nCheck the class() of the corpus you created - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console or use View(). What does your data look like? Is it what you expected?\n\n\ntidy_dp_ch <- tidytext::tidy(corpus_dp_ch)\n\n\n\n\n12.5.1 Questions\nWork independently or in groups for Question 1-5. The code solutions are based on the text data from Chapter 8 of the Delta Plan.\n\n\n\n\n\n\nQuestion 1\n\n\n\nTokenize the tidy text data using unnest_tokens()\n\n\n\n\nAnswer\nunnest_dp_ch8 <- tidy_dp_ch8 %>% \n unnest_tokens(output = word,\n input = text) \n\n\n\n\n\n\n\n\nQuestion 2\n\n\n\nRemove stop words using anti_join() and the stop_words data frame from tidytext.\n\n\n\n\nAnswer\nwords_dp_ch8 <- unnest_dp_ch8 %>% \n dplyr::anti_join(stop_words)\n\n\n\n\n\n\n\n\nQuestion 3\n\n\n\nCalculate the top 10 most frequently occurring words. Consider using count() and slice_max().\n\n\n\n\nAnswer\ncount_dp_ch8 <- words_dp_ch8 %>%\n count(word) %>%\n slice_max(n = 10, order_by = n)\n\n\n\n\n\n\n\n\nQuestion 4\n\n\n\nVisualize the results using a plot of your choice (e.g. bar plot, lollipop plot, or wordcloud).\n\n\n\n\nPlot Code\n# bar plot\nggplot(count_dp_ch8, aes(x = reorder(word, n), y = n)) +\n geom_col() +\n coord_flip() +\n labs(title = \"Top 10 Most Frequently Occurring Words in Chapter 8 of the Delta Plan\",\n x = NULL,\n y = \"count\") +\n theme_minimal()\n\n\n\n\n\n\n\nPlot Code\n# lollipop plot\nggplot(data = count_dp_ch8, aes(x=reorder(word, n), y=n)) +\n geom_point() +\n geom_segment(aes(x=word, xend=word, y=0, yend=n)) +\n coord_flip() +\n labs(title = \"Top 10 Most Frequently Occurring Words in Chapter 8 of the Delta Plan\",\n x = NULL,\n y = \"Count\") +\n theme_minimal()\n\n\n\n\n\n\n\nPlot Code\n# wordcloud\nwordcloud(words = count_dp_ch8$word,\n freq = count_dp_ch8$n)\n\n\n\n\n\n\n\n12.5.2 Bonus Question\n\n\n\n\n\n\nQuestion 5\n\n\n\nWhat do you think of your plots? Are they helpful? Consider other techniques like adding custom stop words or stemming to improve your results."
+"text": "12.5 Exercise: Explore Unstructured Text Data from a PDF\n\n\n\n\n\n\nSetup\n\n\n\n\nIn the intro-text-data.qmd file, create a new header for this exercise (e.g. “Explore Unstructured Text Data from a PDF”).\nCreate a new code chunk and attach the following libraries:\n\n\nlibrary(tidytext) # tidy text tools\nlibrary(quanteda) # create a corpus\nlibrary(pdftools) # read in data\nlibrary(dplyr) # wrangle data\nlibrary(stringr) # string manipulation\nlibrary(ggplot2) # plots\nlibrary(wordcloud)\n\n\nDepending on which group you’re in, read in one of the following chapters of the Delta Plan. Access and download the chapters of the Delta Plan from Delta Stewardship Council website.\n\nNotes for quick exploration of data:\n\nCheck the class() of the pdf you just read in - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console. What does your data look like? What can you infer from how it’s structured?\n\n\n# ch 3\npath_df <- \"data/dsc-plan-ch3.pdf\"\ndp_ch3 <- pdftools::pdf_text(path_df)\n\n# ch 4\npath_df <- \"data/dsc-plan-ch4.pdf\"\ndp_ch4 <- pdftools::pdf_text(path_df)\n\n# ch 5\npath_df <- \"data/dsc-plan-ch5.pdf\"\ndp_ch5 <- pdftools::pdf_text(path_df)\n\n# ch 6\npath_df <- \"data/dsc-plan-ch6.pdf\"\ndp_ch6 <- pdftools::pdf_text(path_df)\n\n\nUsing the quanteda package, turn the unstructured pdf text data into a corpus.\n\nNotes for quick exploration of data:\n\nCheck the class() of the corpus you created - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console. What does your data look like? How does this structure compare to the pdf object?\nRun summary() of the corpus in the Console. What insights can you glean?\n\n\ncorpus_dp_ch <- quanteda::corpus(dp_ch)\n\n\nUsing tidy() from tidytext, make the corpus a tidy object.\n\nNotes for quick exploration of data:\n\nCheck the class() of the corpus you created - is it what you expected? How does the object appear in the Global Environment?\nCall the object in the Console or use View(). What does your data look like? Is it what you expected?\n\n\ntidy_dp_ch <- tidytext::tidy(corpus_dp_ch)\n\n\n\n\n12.5.1 Questions\nWork independently or in groups for Question 1-5. The code solutions are based on the text data from Chapter 8 of the Delta Plan.\n\n\n\n\n\n\nQuestion 1\n\n\n\nTokenize the tidy text data using unnest_tokens()\n\n\n\n\nAnswer\nunnest_dp_ch8 <- tidy_dp_ch8 %>% \n unnest_tokens(output = word,\n input = text) \n\n\n\n\n\n\n\n\nQuestion 2\n\n\n\nRemove stop words using anti_join() and the stop_words data frame from tidytext.\n\n\n\n\nAnswer\nwords_dp_ch8 <- unnest_dp_ch8 %>% \n dplyr::anti_join(stop_words)\n\n\n\n\n\n\n\n\nQuestion 3\n\n\n\nCalculate the top 10 most frequently occurring words. Consider using count() and slice_max().\n\n\n\n\nAnswer\ncount_dp_ch8 <- words_dp_ch8 %>%\n count(word) %>%\n slice_max(n = 10, order_by = n)\n\n\n\n\n\n\n\n\nQuestion 4\n\n\n\nVisualize the results using a plot of your choice (e.g. bar plot, lollipop plot, or wordcloud).\n\n\n\n\nPlot Code\n# bar plot\nggplot(count_dp_ch8, aes(x = reorder(word, n), y = n)) +\n geom_col() +\n coord_flip() +\n labs(title = \"Top 10 Most Frequently Occurring Words in Chapter 8 of the Delta Plan\",\n x = NULL,\n y = \"count\") +\n theme_minimal()\n\n\n\n\n\n\n\nPlot Code\n# lollipop plot\nggplot(data = count_dp_ch8, aes(x=reorder(word, n), y=n)) +\n geom_point() +\n geom_segment(aes(x=word, xend=word, y=0, yend=n)) +\n coord_flip() +\n labs(title = \"Top 10 Most Frequently Occurring Words in Chapter 8 of the Delta Plan\",\n x = NULL,\n y = \"Count\") +\n theme_minimal()\n\n\n\n\n\n\n\nPlot Code\n# wordcloud\nwordcloud(words = count_dp_ch8$word,\n freq = count_dp_ch8$n)\n\n\n\n\n\n\n\n12.5.2 Bonus Question\n\n\n\n\n\n\nQuestion 5\n\n\n\nWhat do you think of your plots? Are they helpful? Consider other techniques like adding custom stop words or stemming to improve your results."
},
{
"objectID": "session_12.html#common-text-analysis-and-text-mining-methods",
