library(readr)
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(leaflet)
library(DT)
library(scales)
EVENTS

Dates: July 31 - Aug 1, 2017
Location: Santa Barbara, CA
Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
+The Arctic Data Center provides training in data science and data management, as these are critical skills for the stewardship of the data, software, and other research products that are preserved in the Arctic Data Center. A goal of the Arctic Data Center is to advance data archiving and promote reproducible science and data reuse. This 2-day workshop will provide researchers with an overview of best data management practices, data science tools and concrete steps and methods for more easily documenting and uploading their data to the Arctic Data Center.
The Arctic Data Center conducts training in data science and management, both of which are critical skills for stewardship of data, software, and other products of research that are preserved at the Arctic Data Center.
Course topics will include:

Time | Day 1 | Day 2 |
---|---|---|
8:30-9:00 | Welcome and introductions | Writing Good Data Management Plans |
9:00-9:45 | Arctic Data Center and NSF Standards and Policies | Writing Good Data Management Plans |
9:45-10:00 | Break | Break |
10:00-12:00 | Effective data modeling and management | Data packaging and file hierarchies |
Noon-1:15 | Lunch | Lunch |
1:15-2:15 | Authoring Quality metadata | Authoring large data sets |
2:15-2:30 | Break | Break |
2:30-4:30 | Authoring Quality metadata | Large data and Tracking data provenance |
4:30-5:00 | Question and Answer | Discussion |
Work on this package was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
+ ++ EVENTS +
+Dates: July 10 - July 28, 2017
+Location: Santa Barbara, CA
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
The primary goal of the Open Science for Synthesis: Gulf Research Program Workshop is to provide hands-on experience with contemporary open science tools, from the command line to data to communication. The workshop emphasizes team science, and participants work in groups with practice and real data to apply the skills we explore.
Week 1. Fundamental collaboration skills
Introduction to command line, communicating science, R, meta-analysis and data management.

Week 2. Advanced topics
Tabular data, programming, Python, workflows, reproducible science and metagenomics.

Week 3. Advanced topics & group projects
Communication, geospatial analysis, data viz, and group project sharing.
+ ++ EVENTS +
+Dates: June 22, 2018 12:30 to 14:00
+Location: Davos, Switzerland
+Venue: POLAR 2018 Open Science Conference
This hands-on session will cover: (1) open data archives, especially the Arctic Data Center; (2) what science metadata is and how it can be used; (3) how data and code can be documented and published in open data archives; (4) web-based data submission; and (5) data submission using R (time permitting).
The Arctic Data Center conducts training in data science and management, both of which are critical skills for stewardship of data, software, and other products of research that are preserved at the Arctic Data Center.
+ +In this hands-on session, we will cover:
+ +Work on this package was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
+ ++ EVENTS +
+Dates: August 13 - August 17, 2018
+Location: Santa Barbara, CA
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
+The Arctic Data Center provides training in data science and data management, as these are critical skills for the stewardship of the data, software, and other research products that are preserved in the Arctic Data Center. A goal of the Arctic Data Center is to advance data archiving and promote reproducible science and data reuse. This 5-day workshop will provide researchers with an overview of best data management practices, data science tools and concrete steps and methods for more easily documenting and uploading their data to the Arctic Data Center.
Workshop topics will include:
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Matthew Jones | jones@nceas.ucsb.edu |
Amber Budden | budden@nceas.ucsb.edu |
Kathryn Meyer | meyer@nceas.ucsb.edu |
Work on this package was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
We will primarily be using a web browser, R, RStudio, and git. Please be sure these are all installed on your laptop, as follows:

R: We will use R version 3.4.2, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page. If you don’t know how up to date your version of RStudio is, please download an updated copy and install it.

R packages: Please be sure you have installed or updated the following packages:

You can install these packages quickly by running the following code snippet:

packages <- c("dataone", "datapack", "devtools", "dplyr", "EML", "ggplot2", "readxl", "tidyr")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
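As an optional sanity check (not part of the official workshop instructions), you can ask R which of the packages above still fail to load; the short sketch below assumes the `packages` vector defined in the snippet above:

# Optional check: report any packages that still fail to load
missing <- packages[!sapply(packages, requireNamespace, quietly = TRUE)]
if (length(missing) > 0) {
  message("Still missing: ", paste(missing, collapse = ", "))
} else {
  message("All packages are installed.")
}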
git: Download git and install it on your system.
GitHub: We will be using GitHub, so you will need to create a GitHub login (or remember your existing one).
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
Name | Email |
---|---|
Amanda Moors | +amanda.moors@noaa.gov | +
Anastasia Tarmann | +altarmann@alaska.edu | +
Anya Suslova | +asuslova@whrc.org | +
Apryl Perry | +apryl.perry@gmail.com | +
Erin Whitney | +erin.whitney@alaska.edu | +
Hong Guo | +hongguo@uark.edu | +
Ibrahim Ilhan | +iilhan@alaska.edu | +
Ivan Sudakov | +isudakov1@udayton.edu | +
Jennifer Schmidt | +jischmidt@alaska.edu | +
Kang Wang | +Kang.Wang@colorado.edu | +
Kelsey Aho | +ahokelsey@gmail.com | +
Kirill Dolgikh | +kdolgikh@alaska.edu | +
Priyanka Surio | +psurio@gwmail.gwu.edu | +
Randy Fulweber | +rafulweber@alaska.edu | +
Robert Orttung | +rorttung@gmail.com | +
Sam Dunn | +sdunn3@luc.edu | +
Taejin Park | +parktj@bu.edu | +
Tate Meehan | +tatemeehan@u.boisestate.edu | +
Tobias Schwoerer | +tschwoerer@alaska.edu | +
Vera Kuklina | +vvkuklina@gmail.com | +
Vera Solovyeva | +vsolovye@masonlive.gmu.edu | +
Vibhor Agarwal | +vibhor.ism@gmail.com | +
+ EVENTS +
+Dates: January 14 - January 18, 2019
+Location: Santa Barbara, CA
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
+The Arctic Data Center provides training in data science and data management, as these are critical skills for the stewardship of the data, software, and other research products that are preserved in the Arctic Data Center. A goal of the Arctic Data Center is to advance data archiving and promote reproducible science and data reuse. This 5-day workshop will provide researchers with an overview of best data management practices, data science tools and concrete steps and methods for more easily documenting and uploading their data to the Arctic Data Center.
Workshop topics will include:
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Matthew Jones | jones@nceas.ucsb.edu |
Amber Budden | budden@nceas.ucsb.edu |
Kathryn Meyer | meyer@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Work on this package was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
We will primarily be using a web browser, R, RStudio, and git. Please be sure these are all installed on your laptop, as follows:

R: We will use R version 3.5.2, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page. If you don’t know how up to date your version of RStudio is, please download an updated copy and install it.

R packages: Please be sure you have installed or updated the following packages:

You can install these packages quickly by running the following code snippet:

packages <- c("devtools", "dplyr", "DT", "ggplot2", "leaflet", "tidyr")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
git: Download git and install it on your system.
GitHub: We will be using GitHub, so you will need to create a GitHub login (or remember your existing one).
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
Name | Email | Affiliation |
---|---|---|
Aleksandra Durova | +adurova@mit.edu | +MIT | +
Angela Bliss | +acbliss2@gmail.com | +NASA | +
Barbara Johnson | +bajohnson20@alaska.edu | +University of Alaska - Fairbanks | +
Benjamin Keisling | +benjaminkeisling@gmail.com | +University of Massachusetts - Amherst | +
Bryan Brasher | +Bryan.brasher@noaa.gov | +NOAA | +
Carolyn M. Kurle | +ckurle@ucsd.edu | +UC San Diego | +
Gary Holton | +holton@hawaii.edu | +University of Hawaii at Manoa | +
Hunter Snyder | +hunter.gr@dartmouth.edu | +Dartmouth College | +
Julia Stuart | +jms2435@nau.edu | +Northern Arizona University | +
Kelly Deuerling | +kmdeuerling@gmail.com | +University of Florida | +
Matthew Druckenmiller | +druckenmiller@nsidc.org | +National Snow and Ice Data Center | +
Jessica Ernakovich | +jessica.ernakovich@unh.edu | +University of New Hampshire | +
Michael Loranty | +mloranty@colgate.edu | +Colgate University | +
Monika Sikand | +monikavsikand@gmail.com | +Bronx Community College | +
Nicole Trenholm | +nicolet3@umbc.edu | +University of Maryland Baltimore County | +
Patrick Alexander | +pma2107@ldeo.columbia.edu | +Lamont Doherty Earth Observatory | +
Patricia DeRepentigny | +patricia.derepentigny@colorado.edu | +University of Colorado - Boulder | +
Seeta Sistla | +ssistla@hampshire.edu | +Hampshire College | +
Shayne O’Brien | +sro41@txstate.edu | +Texas State | +
Shujie Wang | +swang@ldeo.columbia.edu | +Lamont Doherty Earth Observatory | +
Vladimir Alexeev | +valexeev@alaska.edu | +University of Alaska - Fairbanks | +
Ying-Jung C Deweese | +ydeweese@uw.edu | +University of Washington | +
+ EVENTS +
+Dates: January 10 - May 24, 2019
+Location: Remote
In the Long-term Remote Program, cohorts of research groups participate over a four-month period, with two Cohort Calls each month. Calls are designed to be engaging, requiring discussion and active participation through live note-taking in Google Docs and video calls in Zoom (group and breakout sessions).
+ +Openscapes is operated by the National Center for Ecological Analysis & Synthesis (NCEAS) and is being incubated by a Mozilla Fellowship awarded to Julia Stewart Lowndes.
+ +Course agenda and lesson materials are available here.
+ ++ EVENTS +
+Dates: February 11 - February 15, 2019
+Location: Santa Barbara, CA
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
+The Arctic Data Center provides training in data science and data management, as these are critical skills for the stewardship of the data, software, and other research products that are preserved in the Arctic Data Center. A goal of the Arctic Data Center is to advance data archiving and promote reproducible science and data reuse. This 5-day workshop will provide researchers with an overview of best data management practices, data science tools and concrete steps and methods for more easily documenting and uploading their data to the Arctic Data Center.
Workshop topics will include:
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Matthew Jones | jones@nceas.ucsb.edu |
Amber Budden | budden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Work on this package was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
We will primarily be using a web browser, R, RStudio, and git. Please be sure these are all installed on your laptop, as follows:

R: We will use R version 3.5.2, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page. If you don’t know how up to date your version of RStudio is, please download an updated copy and install it.

R packages: Please be sure you have installed or updated the following packages:

You can install these packages quickly by running the following code snippet:

packages <- c("dataone", "datapack", "devtools", "dplyr", "EML", "ggplot2", "readxl", "tidyr")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
git: Download git and install it on your system.
GitHub: We will be using GitHub, so you will need to create a GitHub login (or remember your existing one).
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
Name | Email | Affiliation |
---|---|---|
Adam Schneider | +amschne@umich.edu | +University of Michigan | +
Aleksey Sheshukov | +ashesh@ksu.edu | +Kansas State University | +
Alexis C Garretson | +alexis@garretson.net | +Brigham Young University | +
Ali Paulson | +alison.paulson@msstate.edu | +Mississippi State | +
Anastasija Mensikova | +mensikova.anastasija@gmail.com | +George Washington University | +
Anna Nesterovich | +annanest@iastate.edu | +Iowa State | +
Caixia Wang | +cwang12@alaska.edu | +University of Alaska - Anchorage | +
Christina Minions | +cminions@whrc.org | +Woods Hole Research Center | +
Desheng Liu | +liu.738@osu.edu | +Ohio State University | +
Helene Angot | +helene.angot@colorado.edu | +University of Colorado - Boulder | +
Ian Baxter | +itbaxter@ucsb.edu | +UC Santa Barbara | +
Kelly Kapsar | +kelly.kapsar@gmail.com | +Michigan State | +
Komi Messan | +Komi.S.Messan@erdc.dren.mil | +US Army Corps of Engineers | +
Olaf Kuhlke | +okuhlke@d.umn.edu | +University of Minnesota - Duluth | +
Rebecca Finger-Higgens | +rebecca.finger@gmail.com | +Dartmouth College | +
Sanghoon Kang | +sanghoon_kang@baylor.edu | +Baylor University | +
Sara Pedro | +sara.pedro@uconn.edu | +University of Connecticut | +
Susan L. Howard | +showard@esr.org | +Earth and Space Research | +
Yiyi Huang | +yiyi063@email.arizona.edu | +University of Arizona | +
+ EVENTS +
+Dates: October 7 - October 11, 2019
+Location: Santa Barbara, CA
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
+The Arctic Data Center provides training in data science and data management, as these are critical skills for the stewardship of the data, software, and other research products that are preserved in the Arctic Data Center. A goal of the Arctic Data Center is to advance data archiving and promote reproducible science and data reuse. This 5-day workshop will provide researchers with an overview of best data management practices, data science tools and concrete steps and methods for more easily documenting and uploading their data to the Arctic Data Center.
Acceptance to the training includes domestic travel, accommodation, and meals.
+ +Workshop topics will include:
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Matthew Jones | jones@nceas.ucsb.edu |
Amber Budden | budden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Work on this package was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
We will primarily be using a web browser, R, RStudio, and git. Please be sure these are all installed on your laptop, as follows:

R: We will use R version 3.6.1, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page. If you don’t know how up to date your version of RStudio is, please download an updated copy and install it.

R packages: Please be sure you have installed or updated the following packages:

You can install these packages quickly by running the following code snippet:

packages <- c("dataone", "datapack", "devtools", "dplyr", "EML", "ggplot2", "readxl", "tidyr", "sf")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
git: Download git and install it on your system.
GitHub: We will be using GitHub, so you will need to create a GitHub login (or remember your existing one).
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
Name | Email | Affiliation |
---|---|---|
Sharon Kenny | +kenny.sharon@epa.gov | +Environmental Protection Agency | +
Anna Talucci | +atalucci@colgate.edu | +Colgate University | +
Michael Sousa | +sousa014@umn.edu | +University of Minnesota | +
Leslie M. Hartten | +Leslie.M.Hartten@noaa.gov | +University of Colorado | +
Andreas Muenchow | +muenchow@udel.edu | +University of Delaware | +
Haley Dunleavy | +hd255@nau.edu | +Northern Arizona University | +
Christina Bonsell | +cbonsell@utexas.edu | +University of Texas | +
Jennie DeMarco | +cbonsell@utexas.edu | +University of Texas | +
Eugenie Euskirchen | +cbonsell@utexas.edu | +University of Texas | +
Allen Bondurant | +acbondurant@alaska.edu | +University of Alaska | +
Timothy Pasch | +timothy.pasch@und.edu | +University of North Dakota | +
Amanda B. Young | +ayoung55@alaska.edu | +University of Texas | +
Nikolai Tausnev | +nikolai.l.tausnev@nasa.gov | +National Aeronautics and Space Administration | +
Toni Aandroski | +aandroski@unm.edu | +University of New Mexico | +
+ EVENTS +
+Dates: November 4 - November 8, 2019
+Location: Santa Barbara, CA
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
+This 5-day workshop will provide researchers with an overview of best data management practices, data science tools, and concrete steps and methods for more easily producing transparent, reproducible workflows. This opportunity is for researchers from across career stages and sectors who want to gain fundamental data science skills that will improve their reproducible research techniques, particularly for the purposes of synthesis science.
For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Matthew Jones | jones@nceas.ucsb.edu |
Amber Budden | aebudden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
We will primarily be using a web browser, R, RStudio, and git. Please be sure these are all installed on your laptop, as follows:

R: We will use R version 3.6.1, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page. If you don’t know how up to date your version of RStudio is, please download an updated copy and install it.

R packages: Please be sure you have installed or updated the following packages:

devtools, dplyr, DT, ggplot2, leaflet, tidyr, EML, dataone, datapack, sf, rmarkdown, roxygen2, usethis, broom, captioner

You can install these packages quickly by running the following code snippet:

packages <- c("DT", "dataone", "datapack", "devtools", "dplyr", "EML", "ggmap", "ggplot2", "leaflet", "readxl", "tidyr", "scales", "sf", "rmarkdown", "roxygen2", "usethis", "broom", "captioner")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
+ ++ EVENTS +
+Dates: February 3 - February 7, 2020
+Location: Santa Barbara, CA
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
+This 5-day workshop will provide researchers with an overview of best data management practices, data science tools, and concrete steps and methods for more easily producing transparent, reproducible workflows. This opportunity is for researchers from across career stages and sectors who want to gain fundamental data science skills that will improve their reproducible research techniques, particularly for the purposes of synthesis science.
For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Matthew Jones | jones@nceas.ucsb.edu |
Amber Budden | aebudden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
We will primarily be using a web browser, R, RStudio, and git. Please be sure these are all installed on your laptop, as follows:

R: We will use R version 3.6.2, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page. If you don’t know how up to date your version of RStudio is, please download an updated copy and install it.

R packages: Please be sure you have installed or updated the following packages:

dplyr, tidyr, devtools, usethis, roxygen2, leaflet, ggplot2, DT, scales, shiny, sf, ggmap, broom, captioner

You can install these packages quickly by running the following code snippet:

packages <- c("dplyr", "tidyr", "devtools", "usethis", "roxygen2", "leaflet", "ggplot2", "DT", "scales", "shiny", "sf", "ggmap", "broom", "captioner")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
+ ++ EVENTS +
+Dates: February 26 - February 27, 2020
+Location: Woods Hole, Massachusetts. NOAA Northeast Fisheries Science Center.
Through in-person workshops, cohorts of research groups participate over two full days. Workshops are designed to be engaging, requiring active participation through discussion, live note-taking in Google Docs, and breakout group activities.
+ +Openscapes is operated by the National Center for Ecological Analysis & Synthesis (NCEAS) and is being incubated by a Mozilla Fellowship awarded to Julia Stewart Lowndes.
+ +Course agenda and lesson materials are available here.
+ ++ EVENTS +
+Dates: February 18 - February 21, 2020
+Location: Santa Barbara, CA
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
This intensive 4-day workshop on Data Science and Collaboration Skills for Integrative Conservation Science will be held at NCEAS, Santa Barbara, CA from Feb 18 to Feb 21, 2020.
+ +This training, sponsored by SNAPP, aims to bring together the SNAPP and NCEAS postdoctoral associates to foster communities and collaboration, as well as promote scientific computing and open science best practices.
+ +The goals of this workshop are to:
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Julien Brun | brun@nceas.ucsb.edu |
Carrie Kappel | kappel@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
We will primarily be using a web browser, R, RStudio, and git. Please be sure these are all installed on your laptop, as follows:

R: We will use R version 3.6.2, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page. If you don’t know how up to date your version of RStudio is, please download an updated copy and install it.

R packages: Please be sure you have installed or updated the following packages:

devtools, dplyr, DT, ggplot2, ggmap, leaflet, tidyr, sf, rmarkdown, roxygen2, usethis

You can install these packages quickly by running the following code snippet:

packages <- c("DT", "devtools", "dplyr", "ggmap", "ggplot2", "leaflet", "readxl", "tidyr", "scales", "sf", "rmarkdown", "roxygen2", "usethis")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
+ ++ EVENTS +
+Dates: April 24, 2020
+Location: Remote
This 3-hour module provides mentorship and facilitation training for Working Groups to develop skill sets, habits, and mindsets to make remote work and collaborative synthesis science more efficient and resilient.
+ ++ EVENTS +
+Dates: May 8, 2020
+Location: Remote
This module is an introduction to the data science support NCEAS provides to LTER and SNAPP working groups, followed by a discussion of best practices for data management in a distributed team setup. Participants will have the opportunity to brainstorm about their data and computing needs. The second part of the workshop introduces the NCEAS analytical server and the concept of collaborative coding as a distributed team, empowering participants to develop their analytical workflows in a remote setup.
Goal: Introduce the data science support available and discuss data and analytical needs
Format: Presentation and discussion
Audience: Whole working group

Goal: Empower working groups to collaboratively develop analytical workflows
Format: Demo + questions
Audience: Analysts + interested participants
+ ++ EVENTS +
+Dates: October 19 - October 23, 2020
+Location: Online
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
+The Arctic Data Center provides training in data science and data management, as these are critical skills for the stewardship of the data, software, and other research products that are preserved in the Arctic Data Center. A goal of the Arctic Data Center is to advance data archiving and promote reproducible science and data reuse. This 5-day workshop will provide researchers with an overview of best data management practices, data science tools and concrete steps and methods for more easily documenting and uploading their data to the Arctic Data Center.
Workshop topics will include:
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Matthew Jones | jones@nceas.ucsb.edu |
Amber Budden | budden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Work on this package was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
We will primarily be using a web browser along with an instance of RStudio server set up especially for this course. However, we also recommend setting up R, RStudio, and git on your local system so that you can more easily continue using the skills you learn once the course ends.

R: We will use R version 4.0.2, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page.

R packages: We will be using the following packages:

You can install these packages quickly by running the following code snippet:

packages <- c("devtools", "dplyr", "DT", "ggplot2", "leaflet", "tidyr", "scales", "sf")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
git: Download git and install it on your system.
GitHub: We will be using GitHub, so you will need to create a GitHub login (or remember your existing one).
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
+ ++ EVENTS +
+Dates: November 12 - November 18, 2020
+Location: Remote
This 5-day workshop will provide researchers with an overview of best data management practices, data science tools, and concrete steps and methods for more easily producing transparent, reproducible workflows. This opportunity is for researchers from across career stages and sectors who want to gain fundamental data science skills that will improve their reproducible research techniques, particularly for the purposes of synthesis science.
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Amber Budden | aebudden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Bryce Mecum | mecum@nceas.ucsb.edu |
We will primarily be using a web browser along with an instance of RStudio server set up especially for this course. However, we also recommend setting up R, RStudio, and git on your local system so that you can more easily continue using the skills you learn once the course ends.

R: We will use R version 4.0.2, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page.

R packages: We will be using the following packages:

You can install these packages quickly by running the following code snippet:

packages <- c("devtools", "dplyr", "DT", "ggplot2", "leaflet", "tidyr", "scales", "sf")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
git: Download git and install it on your system.
GitHub: We will be using GitHub, so you will need to create a GitHub login (or remember your existing one).
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
+ ++ EVENTS +
+Dates: December, 2020
+Location: Virtual
A self-guided learning curriculum to support new NEON postdocs as part of their onboarding experience. The curriculum builds from the experience of ecological researchers, trainers, developers, and information managers to provide resources and training in support of collaborative, reproducible research practices.
+ ++ EVENTS +
+Dates: February 25-26, March 1-3, 2021
+Location: Remote
This 5-day workshop will provide researchers with an overview of best data management practices, data science tools, and concrete steps and methods for more easily producing transparent, reproducible workflows. This opportunity is for researchers from across career stages and sectors who want to gain fundamental data science skills that will improve their reproducible research techniques, particularly for the purposes of synthesis science.
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Amber Budden | aebudden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Bryce Mecum | mecum@nceas.ucsb.edu |
We will primarily be using a web browser along with an instance of RStudio server set up especially for this course. However, we also recommend setting up R, RStudio, and git on your local system so that you can more easily continue using the skills you learn once the course ends.

R: We will use R version 4.0.2, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page.

R packages: We will be using the following packages:

You can install these packages quickly by running the following code snippet:

packages <- c("devtools", "dplyr", "DT", "ggplot2", "leaflet", "tidyr", "scales", "sf")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
git: Download git and install it on your system.
GitHub: We will be using GitHub, so you will need to create a GitHub login (or remember your existing one).
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
+ ++ EVENTS +
+Dates: July 8-9, 12-14, 2021
+Location: Remote
This 5-day workshop will provide researchers with an overview of best data management practices, data science tools, and concrete steps and methods for more easily producing transparent, reproducible workflows. This opportunity is for researchers from across career stages and sectors who want to gain fundamental data science skills that will improve their reproducible research techniques, particularly for the purposes of synthesis science.
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Amber Budden | aebudden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Bryce Mecum | mecum@nceas.ucsb.edu |
We will primarily be using a web browser along with an instance of RStudio server set up especially for this course. However, we also recommend setting up R, RStudio, and git on your local system so that you can more easily continue using the skills you learn once the course ends.

R: We will use R version 4.0.2, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page.

R packages: We will be using the following packages:

You can install these packages quickly by running the following code snippet:

packages <- c("devtools", "dplyr", "DT", "ggplot2", "leaflet", "tidyr", "scales", "sf")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
git: Download git and install it on your system.
GitHub: We will be using GitHub, so you will need to create a GitHub login (or remember your existing one).
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
+ ++ EVENTS +
+Friday, August 6th, 2021 +10:30 AM - 1:30 PM
+ +While graduate students in ecology learn about methods for collecting and analyzing ecological data, there is less emphasis on managing and using the resulting data effectively. This is an increasingly important skill set as the research landscape changes. Researchers are increasingly engaging in collaboration across networks, many funding agencies require data management plans, journals are requiring that data and code be accessible, and society is increasingly expecting that research be reproducible. Ecologists can maximize the productivity of their research program with good data skills, in that they can effectively and efficiently share their data and other research products with the scientific community, and potentially benefit from the re-use of their data by others.
+ +The purpose of this short course is to give attendees an introduction to a set of practical tools for organizing and sharing their data through all parts of the research cycle. The target audience is early-career scientists but is open to any researcher who would benefit from developing better data management skills. Topics will include data organization, data documentation, and the importance of good data management practices for data sharing, collaboration, and data re-use. The short course will be an interactive combination of presentation, discussion and activities. Participants must bring their own laptop to work on exercises.
+ +Learn to organize and share your data through all parts of the research cycle. Topics include data organization, data documentation, and the importance of good data management practices for data sharing, collaboration, and data re-use.
+ +Jeanette Clark – University of California, Santa Barbara, NCEAS, Arctic Data Center
Amber Budden – University of California, Santa Barbara, NCEAS
Matthew B. Jones – University of California, Santa Barbara, NCEAS
+ ++ EVENTS +
In collaboration with the Delta Science Program, we are running a 12-month facilitated research synthesis activity, supported by three one-week intensive training events. Curriculum material will focus on introducing Delta researchers to best practices in, and application of, scientific computing and scientific software for reproducible science. In addition to developing and delivering the learning curriculum, this collaboration will include the provision of data consulting, synthesis facilitation, and a remote workshop to conclude the group synthesis activities.
+ +The learning curricula will focus on techniques for data management, scientific programming, synthetic analysis, and collaboration techniques through the use of open-source, community-supported tools. Participants will learn skills for rapid and robust use of open source scientific software. These approaches will be explored and applied to scientific synthesis projects related to the Delta ecosystem.
+ +The course will weave together several core themes which are reinforced – and injected into the real-time synthetic scientific research process – through work on group synthesis projects.
Week 1: September 13-17, 2021. Remote.
Week 2: October 25-29, 2021. UC Davis.
Week 3: November 1-5, 2021. UC Davis.
Jeanette is a Projects Data Coordinator at NCEAS with extensive experience helping synthesis scientists find, synthesize, document, and publish datasets. She also helps maintain the Knowledge Network for Biocomplexity (KNB) data archive. She has expertise in R, GitHub, structured metadata, and data archival. Jeanette was introduced to data processing and data analysis through her academic background in physical oceanography, and enjoys applying this foundation to more interdisciplinary ecology research.
+ +Amber is the Director of Learning and Outreach at NCEAS and lead of community engagement and outreach at DataONE and the Arctic Data Center. She holds a PhD in Ecology in addition to research experience in bibliometrics. She has coordinated and taught numerous workshops focused on data management for Earth and environmental science. Her skills include data management, science communication and outreach, and training evaluation.
+ +Matt is the Director of Informatics Research and Development at NCEAS and has expertise in environmental informatics, particularly software for management, integration, analysis, and modeling of data. Jones has taught at over 20 training workshops over a decade on data science topics including analysis in R, GitHub, programming (e.g., Python), data management, quality assessment and reporting, metadata and data infrastructure, scientific workflow systems, and other topics.
+ +Chris Lortie is Professor of Ecology at York University. He received his B.Sc. in Biology, Bachelor’s of Education, and Master’s degrees from Queen’s University and was awarded a PhD in Botany from the University of British Columbia. Chris’ areas of research include community ecology, seedbanks, invasive species, social ecology and theory development with a focus on synthesis and meta-analytical techniques.
Dr. David LeBauer is the Director of Data Science for the Division of Agriculture, Life and Veterinary Sciences and Cooperative Extension at the University of Arizona. David develops open data and software that enable synthesis of observations with systems-level understanding to improve understanding and prediction of agricultural yield potential as well as ecosystem carbon, nutrient, water, and energy budgets.
+ +Jessica Guo is a scientific programmer and plant ecophysiologist in the Digital Agriculture Group at the University of Arizona. Her background is in hierarchical Bayesian modeling of plant and ecosystem processes. She completed her PhD in Biology at Northern Arizona University with Dr. Kiona Ogle.
+ ++ EVENTS +
+Dates: November 15-19, 2021
+Location: Remote
This 5-day workshop will provide researchers with an overview of best data management practices, data science tools, and concrete steps and methods for more easily producing transparent, reproducible workflows. This opportunity is for researchers from across career stages and sectors who want to gain fundamental data science skills that will improve their reproducible research techniques, particularly for the purposes of synthesis science.
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Amber Budden | aebudden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Matt Jones | jones@nceas.ucsb.edu |
We will primarily be using a web browser along with an instance of RStudio server set up especially for this course. However, we also recommend setting up R, RStudio, and git on your local system so that you can more easily continue using the skills you learn once the course ends.

R: We will use R version 4.0.2, which you can download and install from CRAN

RStudio: To download RStudio, visit RStudio’s download page.

R packages: We will be using the following packages:

You can install these packages quickly by running the following code snippet:

packages <- c("devtools", "dplyr", "DT", "ggplot2", "leaflet", "tidyr", "scales", "sf")

for (package in packages) {
  if (!(package %in% installed.packages())) {
    install.packages(package)
  }
}
git: Download git and install it on your system.
GitHub: We will be using GitHub, so you will need to create a GitHub login (or remember your existing one).
This workshop assumes a base level of experience using R for scientific and statistical analyses. However, we realize that not everyone will be at the same place in terms of familiarity with the tools we’ll be using. If you’d like to brush up on your R skills prior to the workshop, check out this list of resources we like:

If you’re a fan of cheat sheets, RStudio provides some fantastic ones on their Cheat Sheets page. Please make sure to print ahead of time if you prefer hard copies. In particular, check out:
+ ++ EVENTS +
+Dates: April 18 - 22, 2022
+Location: NCEAS
+Venue: Santa Barbara, CA
This 5-day in-person workshop will provide researchers with an overview of reproducible and ethical research practices, steps and methods for more easily documenting and preserving their data at the Arctic Data Center, and an introduction to programming in R. Special attention will be paid to qualitative data management, including practices working with sensitive data. Example datasets will draw from natural and social sciences, and methods for conducting reproducible research will be discussed in the context of both qualitative and quantitative data. Responsible and reproducible data management practices will be discussed as they apply to all aspects of the data life cycle. This includes ethical data collection and data sharing, data sovereignty, and the CARE principles. The CARE principles are guidelines that help ensure open data practices (like the FAIR principles) appropriately engage with Indigenous Peoples’ rights and interests.
Name | Email |
---|---|
Amber Budden | budden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Natasha Haycock-Chavez | haycock-chavez@nceas.ucsb.edu |
Matt Jones | jones@nceas.ucsb.edu |
Noor Johnson | noor.johnson@colorado.edu |
Work was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
+ ++ EVENTS +
+Dates: September 19 - 23, 2022
+Location: NCEAS
+Venue: Santa Barbara, CA
This 5-day in-person workshop will provide researchers with an introduction to advanced topics in computationally reproducible research in Python, including software and techniques for working with very large datasets. This includes working in cloud computing environments, Docker containers, and parallel processing using tools like parsl and dask. The workshop will also cover concrete methods for documenting and uploading data to the Arctic Data Center, advanced approaches to tracking data provenance, responsible research and data management practices including data sovereignty and the CARE principles, and ethical concerns with data-intensive modeling and analysis.
Name | Email |
---|---|
Matt Jones | jones@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Sam Csik | scisk@ucsb.edu |
Carmen Galaz-Garcia | galaz-garcia@nceas.ucsb.edu |
Daphne Virlar-Knight | virlar-knight@nceas.ucsb.edu |
Natasha Haycock-Chavez | haycock-chavez@nceas.ucsb.edu |
Ingmar Nitze | Ingmar.Nitze@awi.de |
Chandi Witharana | chandi.witharana@uconn.edu |
Work was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
+ ++ EVENTS +
+Dates: February 14 - February 19, 2022
+Location: Online
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
+The Arctic Data Center provides training in data science and data management, as these are critical skills for the stewardship of the data, software, and other research products that are preserved in the Arctic Data Center. A goal of the Arctic Data Center is to advance data archiving and promote reproducible science and data reuse. This 5-day workshop will provide researchers with an overview of best data management practices, data science tools and concrete steps and methods for more easily documenting and uploading their data to the Arctic Data Center.
Workshop topics will include:
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Matthew Jones | jones@nceas.ucsb.edu |
Amber Budden | budden@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Work on this package was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
+ ++ EVENTS +
+Dates: March 29, 2022
+Location: Arctic Science Summit Week
+Venue: Tromso, Norway
This workshop is a collaboration between the Arctic Data Center, ELOKA and the NNA Community Office, and will focus on the presentation of open science principles and best practices. Open science will be explored through the lens of reproducibility of research, Indigenous data sovereignty, and community data management. A combination of presentation and discussion will introduce participants to key topics, detail current recommended practices and highlight areas for future research. During the second part of the workshop, participants will have the opportunity to explore data organization and the principles of ‘tidy’ data structures through practical hands-on and small group activities.
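For readers unfamiliar with the ‘tidy’ data idea mentioned above, the short sketch below (illustrative only, using a made-up table rather than any workshop data) shows how tidyr can reshape a wide table so that each row holds a single observation:

# Illustrative only: a hypothetical wide table with one column per year
library(tidyr)

site_counts <- data.frame(
  site = c("A", "B"),
  `2020` = c(10, 4),
  `2021` = c(12, 7),
  check.names = FALSE
)

# Tidy (long) form: one row per site-year observation
tidy_counts <- pivot_longer(
  site_counts,
  cols = c("2020", "2021"),
  names_to = "year",
  values_to = "count"
)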
Name | Email |
---|---|
Amber Budden | budden@nceas.ucsb.edu |
Noor Johnson | noor.johnson@colorado.edu |
Peter Pulsifer | pulsifer@nsidc.org |
Work was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
+ ++ EVENTS +
+Dates: January 30 - February 3, 2023
+Location: NCEAS
+Venue: Santa Barbara, CA
This 5-day in-person workshop will provide researchers with an overview of reproducible and ethical research practices, steps and methods for more easily documenting and preserving their data at the Arctic Data Center, and an introduction to programming in R. Special attention will be paid to qualitative data management, including practices working with sensitive data. Example datasets will draw from natural and social sciences, and methods for conducting reproducible research will be discussed in the context of both qualitative and quantitative data. Responsible and reproducible data management practices will be discussed as they apply to all aspects of the data life cycle. This includes ethical data collection and data sharing, data sovereignty, and the CARE principles. The CARE principles are guidelines that help ensure open data practices (like the FAIR principles) appropriately engage with Indigenous Peoples’ rights and interests.
Name | Email |
---|---|
Jeanette Clark | jclark@nceas.ucsb.edu |
Halina Do-linh | dolinh@nceas.ucsb.edu |
Natasha Haycock-Chavez | haycock-chavez@nceas.ucsb.edu |
Matt Jones | jones@nceas.ucsb.edu |
Camila Vargas Poulsen | vargas-poulsen@nceas.ucsb.edu |
Daphne Virlar-Knight | virlar-knight@nceas.ucsb.edu |
Work was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
+ ++ EVENTS +
+Dates: February 27 - March 3, 2023
+Location: Online
+Venue: NCEAS, 735 State St., Suite 300, UC Santa Barbara
This 5-day remote workshop provided researchers with an overview of best data management practices, data science tools for cleaning and analyzing data, and concrete steps and methods for more easily documenting and preserving their data at the Arctic Data Center. Example tools included R, RMarkdown, and git/GitHub. The course provided background in both the theory and practice of reproducible research, spanning all portions of the research lifecycle, from ethical data collection following the CARE principles to engage with local stakeholders, to data publishing.
+ +Workshop topics will include:
+ +For more detailed information on how to prepare for the workshop, see preparing for the workshop (below).
Name | Email |
---|---|
Jeanette Clark | jclark@nceas.ucsb.edu |
Halina Do-Linh | do-linh@nceas.ucsb.edu |
Camila Vargas-Poulsen | vargas-poulsen@nceas.ucsb.edu |
Daphne Virlar-Knight | virlar-knight@nceas.ucsb.edu |
Work on this package was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
+ ++ EVENTS +
+Dates: March 27-31, 2023
+Location: NCEAS
+Venue: Santa Barbara, CA
This 5-day in-person workshop will provide researchers with an introduction to advanced topics in computationally reproducible research in Python, including software and techniques for working with very large datasets. This includes working in cloud computing environments, Docker containers, and parallel processing using tools like parsl and dask. The workshop will also cover concrete methods for documenting and uploading data to the Arctic Data Center, advanced approaches to tracking data provenance, responsible research and data management practices including data sovereignty and the CARE principles, and ethical concerns with data-intensive modeling and analysis.
Name | Email |
---|---|
Matt Jones | jones@nceas.ucsb.edu |
Jeanette Clark | jclark@nceas.ucsb.edu |
Daphne Virlar-Knight | virlar-knight@nceas.ucsb.edu |
Juliet Cohen | jcohen@nceas.ucsb.edu |
Anna Liljedahl | aliljedahl@woodwellclimate.org |
Work was supported by:
+ +Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.
+ ++ EVENTS +
+Previously called the Reproducible Research Techniques for Synthesis course
+ +Dates: April 3-7, 2023
+Location: NCEAS
+Venue: Santa Barbara, CA
A five-day immersion in R programming for environmental data science. Researchers will gain experience with essential data science tools and best practices to increase their capacity as collaborators, reproducible coders, and open scientists. This course is taught both in-person and virtually.
Name | Email |
---|---|
Halina Do-Linh | dolinh@nceas.ucsb.edu |
Camila Vargas Poulsen | vargas-poulsen@nceas.ucsb.edu |
Daphne Virlar-Knight | virlar-knight@nceas.ucsb.edu |
Samantha Csik | scsik@ucsb.edu |
This work was supported by many additional contributors, including:
+ +Ben Bolker, Amber E. Budden, Julien Brun, Natasha Haycock-Chavez, S. Jeanette Clark, Julie Lowndes, Stephanie Hampton, Matthew B. Jones, Samanta Katz, Erin McLean, Bryce Mecum, Deanna Pennington, Karthik Ram, Jim Regetz, Tracy Teal, Leah Wasser.
+ ++ EVENTS +
+In collaboration with the Delta Science Program we are running a 12 month facilitated research synthesis activity, supported by 3 one-week intensive training events. Curriculum material will focus on introducing Delta Researchers to best practices in, and application of, scientific computing and +scientific software for reproducible science. In addition to developing and delivering learning curriculum, this collaboration will include the provision of data consulting, synthesis facilitation, and a remote workshop to conclude the group synthesis activities.
+ +The learning curricula will focus on techniques for data management, scientific programming, and collaboration techniques through the use of open-source, community-supported tools. Participants will learn skills for rapid and robust use of open source scientific software. These approaches will be explored and applied to scientific synthesis projects related to the Delta ecosystem.
+ +The course will weave together several core themes which are reinforced – and injected into the real-time synthetic scientific research process – through work on group synthesis projects.
+ +Week 1: June 26-30, 2023. UC Davis.
+Week 2: August 28 - September 1, 2023. UC Davis.
+Week 3: October 23-27, 2023. UC Davis.
Name | +|
---|---|
Camila Vargas Poulsen | +vargas-poulsen@nceas.ucsb.edu | +
Halina Do-Linh | +dolinh@nceas.ucsb.edu | +
Matt Jones | +jones@nceas.ucsb.edu | +
Carmen Galaz García | +galaz-garcia@nceas.ucsb.edu | +
NCEAS Open Science Synthesis training consists of three 1-week long workshops, geared towards early career researchers. Participants engage in a mix of lectures, exercises, and synthesis research groups to undertake synthesis while learning and implementing best practices for open data science.
+ +The National Center for Ecological Analysis and Synthesis (NCEAS), a research affiliate of UCSB, is a leading expert on interdisciplinary data science and works collaboratively to answer the world’s largest and most complex questions. The NCEAS approach leverages existing data and employs a team science philosophy to squeeze out all potential insights and solutions efficiently - this is called synthesis science.
+NCEAS has over 25 years of success with this model among working groups and environmental professionals. Together with the Delta Science Program and the Delta Stewardship Council, we are excited to pass along the skills, workflows, and mindsets learned throughout the years.
+Aug 28 - Sep 1, 2023
+October 23 – 27, 2023
+By participating in this activity you agree to abide by the NCEAS Code of Conduct.
+These written materials are the result of a continuous and collaborative effort at NCEAS with the support of DataONE, to help researchers make their work more transparent and reproducible. This work began in the early 2000’s, and reflects the expertise and diligence of many, many individuals. The primary authors for this version are listed in the citation below, with additional contributors recognized for their role in developing previous iterations of these or similar materials.
+This work is licensed under a Creative Commons Attribution 4.0 International License.
+Citation: Halina Do-Linh, Carmen Galaz García, Matthew B. Jones, Camila Vargas Poulsen. 2023. Open Science Synthesis training Week 2. NCEAS Learning Hub & Delta Stewardship Council.
+Additional contributors: Ben Bolker, Julien Brun, Amber E. Budden, Jeanette Clark, Samantha Csik, Stephanie Hampton, Natasha Haycock-Chavez, Samanta Katz, Julie Lowndes, Erin McLean, Bryce Mecum, Deanna Pennington, Karthik Ram, Jim Regetz, Tracy Teal, Daphne Virlar-Knight, Leah Wasser.
+This is a Quarto book. To learn more about Quarto books visit https://quarto.org/docs/books.
+ + +Git
and GitHub to effectively collaborate with colleagues on codeGit
conflict resolution techniquesGit
and GitHub Tools for CollaborationGit
is not only a powerful tool for individual work but also an excellent choice for collaborating with friends and colleagues. Git
ensures that after you’ve completed your contributions to a repository, you can confidently synchronize your changes with changes made by others.
One of the easiest and most effective ways to collaborate using Git
is by utilizing a shared repository on a hosting service like GitHub. This shared repository acts as a central hub, enabling collaborators to effortlessly exchange and merge their changes. With Git
and a shared repository, you can collaborate seamlessly and work confidently, knowing that your changes will be integrated smoothly with those of your collaborators.
There are many advanced techniques for synchronizing Git
repositories, but let’s start with a simple example.
In this example, the Collaborator will clone
a copy of the Owner’s repository from GitHub, and the Owner will grant them Collaborator status, enabling the Collaborator to directly pull and push from the Owner’s GitHub repository.
We start our collaboration by giving a trusted colleague access to our repository on GitHub. In this example, we define the Owner as the individual who owns the repository, and the Collaborator as the person whom the Owner chooses to give permission to make changes to their repository.
+The Collaborator will make changes to the repository and then push
those changes to the shared repository on GitHub. The Owner will then use pull
 to retrieve the changes without encountering any conflicts. This is the ideal workflow.
The instructors will demonstrate this process in the next section.
+ +To be able to contribute to a repository, the Collaborator must clone the repository from the Owner’s GitHub account. To do this, the Collaborator should visit the GitHub page for the Owner’s repository, and then copy the clone URL. In R Studio, the Collaborator will create a new project from version control by pasting this clone URL into the appropriate dialog (see the earlier chapter introducing GitHub).
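If you prefer to work from the R console rather than the RStudio dialog, the same clone-and-open step can be scripted. The sketch below is an optional alternative, not a required workshop step; it assumes the usethis package is installed, and "owner/repo" is a placeholder for the Owner's GitHub username and repository name.
# Optional alternative to RStudio's "New Project > Version Control" dialog
# install.packages("usethis")   # if not already installed
library(usethis)
# "owner/repo" is a placeholder, e.g. the Owner's {FIRSTNAME}_test repository
create_from_github("owner/repo", fork = FALSE)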
+ +With a clone copied locally, the Collaborator can now make changes to the README.md
file in the repository, adding a line or statement somewhere noticeable near the top. Save your changes.
commit
and push
To sync changes, the Collaborator will need to add
, commit
, and push
their changes to the Owner’s repository. But before doing so, it’s good practice to pull
immediately before committing to ensure you have the most recent changes from the Owner. So, in RStudio’s Git
tab, first click the “Diff” button to open the Git
window, and then press the green “Pull” down arrow button. This will fetch any recent changes from the origin repository and merge them. Next, add
 the changed README.md
file to be committed by clicking the check box next to it, type in a commit message, and click “Commit”. Once that finishes, then the Collaborator can immediately click “Push” to send the commits to the Owner’s GitHub repository.
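For reference, the same pull, stage, commit, and push sequence can also be run from the R console with the gert package. This is a hedged sketch of an equivalent, not the button-based workflow demonstrated in class; the file name and commit message are examples only.
library(gert)
git_pull()                               # get the latest changes from GitHub first
git_add("README.md")                     # stage the edited file
git_commit("Add my line to the README")  # commit with a short, descriptive message
git_push()                               # send the commit to the shared repository on GitHub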
pull
Now, the Owner can open their local working copy of the code in RStudio, and pull
those changes down to their local copy.
Congrats, the Owner now has your changes!
+commit
, and push
Next, the Owner should do the same. Make changes to a file in the repository, save it, pull
to make sure no new changes have been made while editing, and then add
, commit
, and push
the Owner changes to GitHub.
pull
The Collaborator can now pull
down those Owner changes, and all copies are once again fully synced. And you’re off to collaborating.
These next steps are for the Owner:
+{FIRSTNAME}_test
repositoryNow, the Collaborator will follow this step:
+You will do the exercise twice, where each person will get to practice being both the Owner and the Collaborator roles.
+Round One:
+{FIRSTNAME}_test
repository (see Setup block above for detailed steps){FIRSTNAME}_test
repositoryREADME
file:
+README
titled “Git
Workflow”README
file with the new changes to GitHubREADME
file:
+Git
Workflow”, Owner adds the steps of the Git
workflow we’ve been practicingREADME
file with the new changes to GitHubOwners
changes from GitHubRound Two:
+{FIRSTNAME}_test
repository{FIRSTNAME}_test
repositoryREADME
file:
+README
titled “How to Create a Git
Repository from an existing project” and adds the high level steps for this workflowREADME
file with the new changes to GitHubREADME
file:
+Git
Repository”, Owner adds the high level steps for this workflowREADME
file with the new changes to GitHubOwners
changes from GitHubHint: If you don’t remember how to create a Git
repository, refer to the chapter Intro to Git
and GitHub where we created two Git
repositories
There are many Git
and GitHub collaboration techniques, some more advanced than others. We won’t be covering advanced strategies in this course. But here is a table for your reference on a few popular Git
collaboration workflow strategies and tools.
Collaboration Technique | +Benefits | +When to Use | +When Not to Use | +
---|---|---|---|
Branch Management Strategies | +1. Enables parallel development and experimentation 2. Facilitates isolation of features or bug fixes 3. Provides flexibility and control over project workflows |
+When working on larger projects with multiple features or bug fixes simultaneously. When you want to maintain a stable main branch while developing new features or resolving issues on separate branches. When collaborating with teammates on different aspects of a project and later integrating their changes. |
+When working on small projects with a single developer or limited codebase. When the project scope is simple and doesn’t require extensive branch management. When there is no need to isolate features or bug fixes. |
+
Code Review Practices | +1. Enhances code quality and correctness through feedback 2. Promotes knowledge sharing and learning within the team 3. Helps identify bugs, improve performance, and ensure adherence to coding standards |
+When collaborating on a codebase with team members to ensure code quality and maintain best practices. When you want to receive feedback and suggestions on your code to improve its readability, efficiency, or functionality. When working on critical or complex code that requires an extra layer of scrutiny before merging it into the main branch. |
+When working on personal projects or small codebases with no collaboration involved. When time constraints or project size make it impractical to conduct code reviews. When the codebase is less critical or has low complexity. |
+
Forking | +1. Enables independent experimentation and development 2. Provides a way to contribute to a project without direct access 3. Allows for creating separate, standalone copies of a repository |
+When you want to contribute to a project without having direct write access to the original repository. When you want to work on an independent variation or extension of an existing project. When experimenting with changes or modifications to a project while keeping the original repository intact. |
+When collaborating on a project with direct write access to the original repository. When the project does not allow external contributions or forking. When the project size or complexity doesn’t justify the need for independent variations. |
+
Pull Requests | +1. Facilitates code review and discussion 2. Allows for collaboration and feedback from team members 3. Enables better organization and tracking of proposed changes |
+When working on a shared repository with a team and wanting to contribute changes in a controlled and collaborative manner. When you want to propose changes to a project managed by others and seek review and approval before merging them into the main codebase. |
+When working on personal projects or individual coding tasks without the need for collaboration. When immediate changes or fixes are required without review processes. When working on projects with a small team or single developer with direct write access to the repository. |
+
The “When Not to Use” column provides insights into situations where it may be less appropriate to use each collaboration technique, helping you make informed decisions based on the specific context and requirements of your project.
+These techniques provide different benefits and are used in various collaboration scenarios, depending on the project’s needs and team dynamics.
+Merge conflicts occur when both collaborators make conflicting changes to the same file. Resolving merge conflicts involves identifying the root of the problem and restoring the project to a normal state. Good communication, discussing file sections to work on, and avoiding overlaps can help prevent merge conflicts. However, if conflicts do arise, Git
warns about potential issues and ensures that changes from different collaborators based on the same file version are not overwritten. To resolve conflicts, you need to explicitly specify whose changes should be used for each conflicting line in the file.
In this image, we see collaborators mbjones
and metamattj
have both made changes to the same line in the same README.md
file. This is causing a merge conflict because Git
doesn’t know whose changes came first. To resolve it, we need to tell Git
whose changes to keep for that line, and whose changes to discard.
1. Abort, abort, abort…
+Sometimes you just made a mistake. When you get a merge conflict, the repository is placed in a “Merging” state until you resolve it. There’s a Terminal command to abort doing the merge altogether:
+git merge --abort
Of course, after doing that you still haven’t synced with your Collaborator’s changes, so things are still unresolved. But at least your repository is now usable on your local machine.
+2. Checkout
+The simplest way to resolve a conflict, given that you know whose version of the file you want to keep, is to use the command line Git
program to tell Git
to use either your changes (the person doing the merge
), or their changes (the Collaborator).
git checkout --theirs conflicted_file.Rmd
git checkout --ours conflicted_file.Rmd
Once you have run that command, then run add
(staging), commit
, pull
, and push
the changes as normal.
3. Pull and edit the file
+But that requires the command line. If you want to resolve from RStudio, or if you want to pick and choose some of your changes and some of your Collaborator’s, then instead you can manually edit and fix the file. When you pull
the file with a conflict, Git
notices that there is a conflict and modifies the file to show both your own changes and your Collaborator’s changes in the file. It also shows the file in the Git
tab with an orange U
icon, which indicates that the file is Unmerged
, and therefore awaiting your help to resolve the conflict. It delimits these blocks with a series of less than and greater than signs, so they are easy to find:
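For illustration, a conflicted section of README.md might look something like the made-up example below: your version sits between <<<<<<< HEAD and =======, and your Collaborator's version sits between ======= and >>>>>>> (the title lines and the commit id after >>>>>>> are placeholders, not real content from this exercise).
<<<<<<< HEAD
# Analysis plan - edited by the Owner
=======
# Analysis plan - edited by the Collaborator
>>>>>>> a1b2c3d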
To resolve the conflicts, simply find all of these blocks, and edit them so that the file looks how you want (either pick your lines, your Collaborator’s lines, some combination, or something altogether new), and save. Be sure you remove the delimiter lines that start with
+<<<<<<<
,=======
,>>>>>>>
.Once you have made those changes, you simply add
(staging), commit
, and push
the files to resolve the conflict.
To illustrate this process, the instructors are going to carefully create a merge conflict step by step, show how to resolve it, and show how to see the results of the successful merge after it is complete. First, the instructors will walk through the exercise to demonstrate the issues. Then, participants will pair up and try the exercise.
+First, start the exercise by ensuring that both the Owner and Collaborator have all of the changes synced to their local copies of the Owner’s repository in RStudio. This includes doing a git pull
to ensure that you have all changes local, and make sure that the Git
tab in RStudio doesn’t show any changes needing to be committed.
From that clean slate, the Owner first modifies and commits a small change including their name on a specific line of the README.md
file (we will change the first line, the title). Work to only change that one line, and add your username to the line in some form and commit the changes (but DO NOT push). We are now in a situation where the Owner has unpushed changes that the Collaborator can not yet see.
Now the Collaborator also makes changes to the same line (the first line, the title) on the README.md
file in their RStudio copy of the project, adding their name to the line. They then commit. At this point, both the Owner and Collaborator have committed changes based on their shared version of the README.md
file, but neither has tried to share their changes via GitHub.
Sharing starts when the Collaborator pushes their changes to the GitHub repo, which updates GitHub to their version of the file. The Owner is now one revision behind, but doesn’t know it yet.
+At this point, the Owner tries to push their change to the repository, which triggers an error from GitHub. While the error message is long, it basically tells you everything needed (that the Owner’s repository doesn’t reflect the changes on GitHub, and that they need to pull
before they can push).
Doing what the message says, the Owner pulls the changes from GitHub, and gets another, different error message. In this case, it indicates that there is a merge conflict because of the conflicting lines.
+ +In the Git
pane of RStudio, the file is also flagged with an orange U
, which stands for an unresolved merge conflict.
To resolve the conflict, the Owner now needs to edit the file. Again, as indicated above, Git
has flagged the locations in the file where a conflict occurred with <<<<<<<
, =======
, and >>>>>>>
. The Owner should edit the file, merging whatever changes are appropriate until the conflicting lines read how they should, and eliminate all of the marker lines with <<<<<<<
, =======
, and >>>>>>>
.
Of course, for scripts and programs, resolving the changes means more than just merging the text – whoever is doing the merging should make sure that the code runs properly and none of the logic of the program has been broken.
+ +From this point forward, things proceed as normal. The Owner first add
the file changes to be made, which changes the orange U
to a blue M
for modified, and then commits the changes locally. The Owner now has a resolved version of the file on their system.
Have the Owner push the changes, and it should replicate the changes to GitHub without error.
+ +Finally, the Collaborator can pull from GitHub to get the changes the Owner made.
+When either the Collaborator or the Owner view the history, the conflict, associated branch, and the merged changes are clearly visible in the history.
+ +Note you will only need to complete the Setup and Git
configuration steps again if you are working in a new repository. Return to Exercise 1 for Setup and Git
configuration steps.
Now it’s your turn. In pairs, intentionally create a merge conflict, and then go through the steps needed to resolve the issues and continue developing with the merged files. See the sections above for help with each of the steps below. You will do the exercise twice, where each person will get to practice being both the Owner and the Collaborator roles.
+Round One:
+pull
to ensure both have the most up-to-date changesREADME
file and makes a change to the title and commits do not pushREADME
file and makes a change to the title and commitsREADME
file to resolve the conflictRound Two:
+pull
to ensure both have the most up-to-date changesREADME
file and makes a change to line 2 and commits do not pushREADME
file and makes a change to line 2 and commitsREADME
file to resolve the conflictSome basic rules of thumb can avoid the vast majority of merge conflicts, saving a lot of time and frustration. These are words our teams live by:
+ +pull
Pull
immediately before you commit
or push
Commit
often in small chunks (this helps you organize your work!)Git
workflow you’re using aka make sure you’re on the same page before you start!A good workflow is encapsulated as follows:
+Pull -> Edit -> Save -> Add (stage) -> Commit -> Pull -> Push
Always start your working sessions with a pull
to get any outstanding changes, then start your work. Stage
your changes, but before you commit
, pull
again to see if any new changes have arrived. If so, they should merge in easily if you are working in different parts of the program. You can then commit
and immediately push
your changes safely.
Good luck, and try to not get frustrated. Once you figure out how to handle merge conflicts, they can be avoided or dispatched when they occur, but it does take a bit of practice.
Whether you are joining a lab group or establishing a new collaboration, articulating a set of shared agreements about how people in the group will treat each other will help create the conditions for successful collaboration. If agreements or a code of conduct do not yet exist, invite a conversation among all members to create them. Co-creation of a code of conduct will foster collaboration and engagement as a process in and of itself, and is important to ensure all voices are heard so that your code of conduct represents the perspectives of your community. If a code of conduct already exists, and your community will be a long-lasting collaboration, you might consider revising the code of conduct. Having your group ‘sign off’ on the code of conduct, whether revised or not, supports adoption of the principles.
+When creating a code of conduct, consider both the behaviors you want to encourage and those that will not be tolerated. For example, the Openscapes code of conduct includes Be respectful, honest, inclusive, accommodating, appreciative, and open to learning from everyone else. Do not attack, demean, disrupt, harass, or threaten others or encourage such behavior.
+Below are other example codes of conduct:
+ +As with authorship agreements, it is valuable to establish a shared agreement around handling of data when embarking on collaborative projects. Data collected as part of a funded research activity will typically have been managed as part of the Data Management Plan (DMP) associated with that project. However, collaborative research brings together data from across research projects with different data management plans and can include publicly accessible data from repositories where no management plan is available. For these reasons, a discussion and agreement around the handling of data brought into and resulting from the collaboration is warranted and management of this new data may benefit from going through a data management planning process. Below we discuss example data agreements.
+The example data policy template provided by the Arctic Data Center addresses three categories of data.
For the first category, the agreement considers conditions under which those data may be used and permissions associated with use. It also addresses access and sharing. In the case of individual, publicly accessible data, the agreement stipulates that the team will abide by the attribution and usage policies that the data were published under, noting how those requirements were met. In the case of derived data, the agreement reads similar to a DMP with consideration of making the data public; management, documentation and archiving; pre-publication sharing; and public sharing and attribution. As research data objects receive a persistent identifier (PID), often a DOI, they are citable objects and consideration should be given to authorship of data, as with articles.
+The following example lab policy from the Wolkovich Lab combines data management practices with authorship guidelines and data sharing agreements. It provides a lot of detail about how this lab approaches data use, attribution and authorship. For example:
+ +This policy is communicated with all incoming lab members, from undergraduate to postdocs and visiting scholars, and is shared here with permission from Dr Elizabeth Wolkovich.
+The CARE and FAIR Principles were introduced previously in the context of introducing the Arctic Data Center and our data submission and documentation process. In this section we will dive a little deeper.
To recap, the Arctic Data Center is an openly-accessible data repository and the data published through the repository is open for anyone to reuse, subject to conditions of the license (at the Arctic Data Center, data is released under one of two licenses: CC-0 Public Domain and CC-By Attribution 4.0). In facilitating use of data resources, the data stewardship community has converged on principles surrounding best practices for open data management. One set of these principles is the FAIR principles: Findable, Accessible, Interoperable, and Reusable.
The “Fostering FAIR Data Practices in Europe” project found that not following the FAIR principles is costly in both money and time: it estimated that 10.2 billion dollars per year are lost through costs ranging from “storage and license costs to more qualitative costs related to the time spent by researchers on creation, collection and management of data, and the risks of research duplication.” FAIR principles and open science are overlapping, but distinct, concepts. Open science supports a culture of sharing research outputs and data, and FAIR focuses on how to prepare the data.
+ +Another set of community developed principles surrounding open data are the CARE Principles. The CARE principles for Indigenous Data Governance complement the more data-centric approach of the FAIR principles, introducing social responsibility to open data management practices. The CARE Principles stand for:
The CARE principles align with the FAIR principles by outlining guidelines for publishing data that is findable, accessible, interoperable, and reusable while, at the same time, accounting for Indigenous Peoples’ rights and interests. Initially designed to support Indigenous data sovereignty, CARE principles are now being adopted across domains, and many researchers argue they are relevant for both Indigenous Knowledge and data, as well as data from all disciplines (Carroll et al., 2021). These principles introduce a “game changing perspective” that encourages transparency in data ethics and encourages data reuse that is purposeful and intentional and that aligns with human well-being (Carroll et al., 2021).
+For over 20 years, the Committee on Publication Ethics (COPE) has provided trusted guidance on ethical practices for scholarly publishing. The COPE guidelines have been broadly adopted by academic publishers across disciplines, and represent a common approach to identify, classify, and adjudicate potential breaches of ethics in publication such as authorship conflicts, peer review manipulation, and falsified findings, among many other areas. Despite these guidelines, there has been a lack of ethics standards, guidelines, or recommendations for data publications, even while some groups have begun to evaluate and act upon reported issues in data publication ethics.
+To address this gap, the Force 11 Working Group on Research Data Publishing Ethics was formed as a collaboration among research data professionals and the Committee on Publication Ethics (COPE) “to develop industry-leading guidance and recommended best practices to support repositories, journal publishers, and institutions in handling the ethical responsibilities associated with publishing research data.” The group released the “Joint FORCE11 & COPE Research Data Publishing Ethics Working Group Recommendations” (Puebla, Lowenberg, and WG 2021), which outlines recommendations for four categories of potential data ethics issues:
+Guidelines cover what actions need to be taken, depending on whether the data are already published or not, as well as who should be involved in decisions, who should be notified of actions, and when the public should be notified. The group has also published templates for use by publishers and repositories to announce the extent to which they plan to conform to the data ethics guidelines.
+At the Arctic Data Center, we need to develop policies and procedures governing how we react to potential breaches of data publication ethics. In this exercise, break into groups to provide advice on how the Arctic Data Center should respond to reports of data ethics issues, and whether we should adopt the Joint FORCE11 & COPE Research Data Publishing Ethics Working Group Policy Templates for repositories. In your discussion, consider:
+You might consider a hypothetical scenario such as the following in considering your response.
+++The data coordinator at the Arctic Data Center receives an email in 2022 from a prior postdoctoral fellow who was employed as part of an NSF-funded project on microbial diversity in Alaskan tundra ecosystems. The email states that a dataset from 2014 in the Arctic Data Center was published with the project PI as author, but omits two people, the postdoc and an undergraduate student, as co-authors on the dataset. The PI retired in 2019, and the postdoc asks that they be added to the author list of the dataset to correct the historical record and provide credit.
+
+ + + +Git
, GitHub (+Pages), and R Markdown to publish an analysis to the webSharing your work with others in engaging ways is an important part of the scientific process.
+So far in this course, we’ve introduced a small set of powerful tools for doing open science:
+Git
R Markdown, in particular, is amazingly powerful for creating scientific reports but, so far, we haven’t tapped its full potential for sharing our work with others.
+In this lesson, we’re going to take our training_{USERNAME}
GitHub repository and turn it into a beautiful and easy to read web page using the tools listed above.
Make sure you are in training_{USERNAME}
project
Add a new R Markdown file at the top level called index.Rmd
Save the R Markdown file you just created. Use index.Rmd
as the file name
Press “Knit” and observe the rendered output
+index.html
HTML
(a web page)Commit your changes (for both index.Rmd
and index.html
) with a commit message, and push
to GitHub
Open your web browser to the GitHub.com and navigate to the page for your training_{USERNAME}
repository
Activate GitHub Pages for the main
branch
main
. Keep the folder as the root
and then click “Save”main
branch”Note: index.Rmd
represents the default file for a web site, and is returned whenever you visit the web site but doesn’t specify an explicit file to be returned.
Now, the rendered website version of your repo will show up at a special URL.
+GitHub Pages follows a convention like this:
+ +Note that it changes from github.com to github.io
+https://{username}.github.io/{repo_name}/
(Note the trailing /
)Now that we’ve successfully published a web page from an R Markdown document, let’s make a change to our R Markdown document and follow the steps to publish the change on the web:
+index.Rmd
YAML
frontmatterGit
workflow: Stage > Commit > Pull > Push
https://{username}.github.io/{repo_name}/
Next, we will show how you can link different Rmd
s rendered into html
so you can easily share different parts of your work.
In this exercise, you’ll create a table of contents with the lessons of this course on the main page, and link some of the files we have work on so far.
+index.Rmd
file## coreR workshop
+
+- Introduction to RMarkdown
+- Cleaning and Wrangling data
+- Data Visualization
+- Spatial Analysis
+Make sure you have the html
versions of your intro-to-rmd.Rmd
and data-cleaning.Rmd
files. If you only see the Rmd
version, you need to “Knit” your files first
In your index.Rmd
let’s add the links to the html files we want to show on our webpage. Do you remember the Markdown syntax to create a link?
[Text you want to hyperlink](link)
+[Data wrangling and cleaning](data-wrangling-cleaning.html)
Git
workflow: Stage > Commit > Pull > Push
Now when you visit your web site, you’ll see the table of contents, and can navigate to the others file you linked.
+R Markdown web pages are a great way to share work in progress with your colleagues. Here we showed an example with the materials we have created in this course. However, you can use these same steps to share the different files and progress of a project you’ve been working on. To do so simply requires thinking through your presentation so that it highlights the workflow to be reviewed. You can include multiple pages and build a simple web site and make your work accessible to people who aren’t set up to open your project in R. Your site could look something like this:
+ + + +ggplot2
package to create static plotsggplot2
’s theme
abilities to create publication-grade graphicsleaflet
package to create interactive mapsggplot2
is a popular package for visualizing data in R. From the home page:
+++
ggplot2
is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tellggplot2
how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
It’s been around for years and has pretty good documentation and tons of example code around the web (like on StackOverflow). The goal of this lesson is to introduce you to the basic components of working with ggplot2
and inspire you to go and explore this awesome resource for visualizing your data.
ggplot2
vs base graphics in R vs others
+There are many different ways to plot your data in R. All of them work! However, ggplot2
excels at making complicated plots easy and easy plots simple enough
Base R graphics (plot()
, hist()
, etc) can be helpful for simple, quick and dirty plots. ggplot2
can be used for almost everything else.
Let’s dive into creating and customizing plots with ggplot2
.
Make sure you’re in the right project (training_{USERNAME}
) and use the Git
workflow by Pull
ing to check for any changes. Then, create a new R Markdown document and remove the default text.
Load the packages we’ll need:
library(readr)
+library(dplyr)
+library(tidyr)
+library(forcats)
+library(ggplot2)
+library(leaflet)
+library(DT)
+library(scales)
ADFG_fisrtAttempt_reformatted.csv
, right click, and select “Copy Link”.
escape <- read_csv("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3Af119a05b-bbe7-4aea-93c6-85434dcb1c5e")
Learn about the data. For this session we are going to be working with data on daily salmon escapement counts in Alaska. Check out the documentation.
Finally, let’s explore the data we just read into our working environment.
## Check out column names
+colnames(escape)
+
+## Peek at each column and class
+glimpse(escape)
+
+## From when to when
+range(escape$sampleDate)
+
+## How frequent?
+head(escape$sampleDate)
+tail(escape$sampleDate)
+
+## Which species?
+unique(escape$Species)
More often than not, we need to do some wrangling before we can plot our data the way we want to. Now that we have read our data and done some exploration, we’ll put our data wrangling skills into practice to get our data into the desired format.
+Species
and SASAP.Region
,<- escape %>%
+ annual_esc separate(sampleDate, c("Year", "Month", "Day"), sep = "-") %>%
+ mutate(Year = as.numeric(Year)) %>%
+ group_by(Species, SASAP.Region, Year) %>%
+ summarize(escapement = sum(DailyCount)) %>%
+ filter(Species %in% c("Chinook", "Sockeye", "Chum", "Coho", "Pink"))
+
+head(annual_esc)
# A tibble: 6 × 4
+# Groups: Species, SASAP.Region [1]
+ Species SASAP.Region Year escapement
+ <chr> <chr> <dbl> <dbl>
+1 Chinook Alaska Peninsula and Aleutian Islands 1974 1092
+2 Chinook Alaska Peninsula and Aleutian Islands 1975 1917
+3 Chinook Alaska Peninsula and Aleutian Islands 1976 3045
+4 Chinook Alaska Peninsula and Aleutian Islands 1977 4844
+5 Chinook Alaska Peninsula and Aleutian Islands 1978 3901
+6 Chinook Alaska Peninsula and Aleutian Islands 1979 10463
+The chunk above used a lot of the dplyr commands that we’ve used, and some that are new. The separate()
function is used to divide the sampleDate column up into Year, Month, and Day columns, and then we use group_by()
to indicate that we want to calculate our results for the unique combinations of species, region, and year. We next use summarize()
to calculate an escapement value for each of these groups. Finally, we use a filter and the %in%
operator to select only the salmon species.
ggplot2
First, we’ll cover some ggplot2
basics to create the foundation of our plot. Then, we’ll add on to make our great customized data visualization.
For example, let’s plot total escapement by species. We will show this by creating the same plot in 3 slightly different ways. Each of the options below have the essential pieces of a ggplot
.
## Option 1 - data and mapping called in the ggplot() function
+ggplot(data = annual_esc,
+aes(x = Species, y = escapement)) +
+ geom_col()
+
+## Option 2 - data called in ggplot function; mapping called in geom
+ggplot(data = annual_esc) +
+geom_col(aes(x = Species, y = escapement))
+
+
+## Option 3 - data and mapping called in geom
+ggplot() +
+geom_col(data = annual_esc,
+ aes(x = Species, y = escapement))
They all will create the same plot:
+Having the basic structure with the essential components in mind, we can easily change the type of graph by updating the geom_*()
.
ggplot2
and the pipe operator
+Just like in dplyr
and tidyr
, we can also pipe a data.frame
directly into the first argument of the ggplot
function using the %>%
operator.
This can certainly be convenient, but use it carefully! Combining too many data-tidying or subsetting operations with your ggplot
call can make your code more difficult to debug and understand.
Next, we will use the pipe operator to pass into ggplot()
a filtered version of annual_esc
, and make a plot with different geometries.
Boxplot
+%>%
+ annual_esc filter(Year == 1974,
+ Species %in% c("Chum", "Pink")) %>%
+ ggplot(aes(x = Species, y = escapement)) +
+ Species ggplot(aes(x = Species, y = escapement)) +
+ geom_boxplot()
Violin plot
+%>%
+ annual_esc filter(Year == 1974,
+ %in% c("Chum", "Pink")) %>%
+ Species ggplot(aes(x = Species, y = escapement)) +
+ geom_violin()
Line and point
+%>%
+ annual_esc filter(Species == "Sockeye",
+ == "Bristol Bay") %>%
+ SASAP.Region ggplot(aes(x = Year, y = escapement)) +
+ geom_line() +
+ geom_point()
Let’s go back to our base bar graph. What if we want our bars to be blue instead of gray? You might think we could run this:
+ggplot(annual_esc,
+aes(x = Species, y = escapement,
+ fill = "blue")) +
+ geom_col()
Why did that happen?
+Notice that we tried to set the fill color of the plot inside the mapping aesthetic call. What we have done, behind the scenes, is create a column filled with the word “blue” in our data frame, and then mapped it to the fill
aesthetic, which then chose the default fill color of red.
What we really wanted to do was just change the color of the bars. If we want do do that, we can call the color option in the geom_col()
function, outside of the mapping aesthetics function call.
ggplot(annual_esc,
+aes(x = Species, y = escapement)) +
+ geom_col(fill = "blue")
What if we did want to map the color of the bars to a variable, such as region. ggplot()
is really powerful because we can easily get this plot to visualize more aspects of our data.
ggplot(annual_esc,
+aes(x = Species, y = escapement,
+ fill = SASAP.Region)) +
+ geom_col()
We know that in the graph we just plotted, each bar includes escapements for multiple years. Let’s leverage the power of ggplot
to plot more aspects of our data in one plot.
We are going to plot escapement by species over time, from 2000 to 2016, for each region.
+An easy way to plot another aspect of your data is using the function facet_wrap()
. This function takes a mapping to a variable using the syntax ~{variable_name}
. The ~
(tilde) is a model operator which tells facet_wrap()
to model each unique value within variable_name
to a facet in the plot.
The default behavior of facet wrap is to put all facets on the same x and y scale. You can use the scales
argument to specify whether to allow different scales between facet plots (e.g scales = "free_y"
to free the y axis scale). You can also specify the number of columns using the ncol =
argument or number of rows using nrow =
.
## Subset with data from years 2000 to 2016
+
+<- annual_esc %>%
+ annual_esc_2000s filter(Year %in% c(2000:2016))
+
+## Quick check
+unique(annual_esc_2000s$Year)
[1] 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
+[16] 2015 2016
+## Plot with facets
+ggplot(annual_esc_2000s,
+aes(x = Year,
+ y = escapement,
+ color = Species)) +
+ geom_line() +
+ geom_point() +
+ facet_wrap( ~ SASAP.Region,
+ scales = "free_y")
Now let’s work on making this plot look a bit nicer. We are going to:
+ggtitle()
ylab()
theme_bw()
There are a wide variety of built in themes in ggplot
that help quickly set the look of the plot. Use the RStudio autocomplete theme_
<TAB>
to view a list of theme functions.
ggplot(annual_esc_2000s,
+aes(x = Year,
+ y = escapement,
+ color = Species)) +
+ geom_line() +
+ geom_point() +
+ facet_wrap( ~ SASAP.Region,
+ scales = "free_y") +
+ ylab("Escapement") +
+ ggtitle("Annual Salmon Escapement by Region") +
+ theme_bw()
You can see that the theme_bw()
function changed a lot of the aspects of our plot! The background is white, the grid is a different color, etc. There are lots of other built in themes like this that come with the ggplot2
package.
Use the RStudio auto complete, the ggplot2
documentation, a cheat sheet, or good old Google to find other built in themes. Pick out your favorite one and add it to your plot.
## Useful baseline themes are
+theme_minimal()
+theme_light()
+theme_classic()
The built in theme functions (theme_*()
) change the default settings for many elements that can also be changed individually using thetheme()
function. The theme()
function is a way to further fine-tune the look of your plot. This function takes MANY arguments (just have a look at ?theme
). Luckily there are many great ggplot
resources online so we don’t have to remember all of these, just Google “ggplot cheat sheet” and find one you like.
Let’s look at an example of a theme()
call, where we change the position of the legend from the right side to the bottom, and remove its title.
ggplot(annual_esc_2000s,
+aes(x = Year,
+ y = escapement,
+ color = Species)) +
+ geom_line() +
+ geom_point() +
+ facet_wrap( ~ SASAP.Region,
+ scales = "free_y") +
+ ylab("Escapement") +
+ ggtitle("Annual Salmon Escapement by Region") +
+ theme_light() +
+ theme(legend.position = "bottom",
+ legend.title = element_blank())
Note that the theme()
call needs to come after any built-in themes like theme_bw()
are used. Otherwise, theme_bw()
will likely override any theme elements that you changed using theme()
.
You can also save the result of a series of theme()
function calls to an object to use on multiple plots. This prevents needing to copy paste the same lines over and over again!
<- theme_light() +
+ my_theme theme(legend.position = "bottom",
+ legend.title = element_blank())
So now our code will look like this:
+ggplot(annual_esc_2000s,
+aes(x = Year,
+ y = escapement,
+ color = Species)) +
+ geom_line() +
+ geom_point() +
+ facet_wrap( ~ SASAP.Region,
+ scales = "free_y") +
+ ylab("Escapement") +
+ ggtitle("Annual Salmon Escapement by Region") +
+ my_theme
Hint: You can start by looking at the documentation of the function by typing ?theme()
in the console. And googling is a great way to figure out how to do the modifications you want to your plot.
scale_x_continuous(breaks = seq(2000,2016, 2))
## Useful baseline themes are
+ggplot(annual_esc_2000s,
+aes(x = Year,
+ y = escapement,
+ color = Species)) +
+ geom_line() +
+ geom_point() +
+ scale_x_continuous(breaks = seq(2000, 2016, 2)) +
+ facet_wrap( ~ SASAP.Region,
+ scales = "free_y") +
+ ylab("Escapement") +
+ ggtitle("Annual Salmon Escapement by Region") +
+ +
+ my_theme theme(axis.text.x = element_text(angle = 45,
+ vjust = 0.5))
scales
Fixing tick labels in ggplot
can be super annoying. The y-axis labels in the plot above don’t look great. We could manually fix them, but it would likely be tedious and error prone.
The scales
package provides some nice helper functions to easily rescale and relabel your plots. Here, we use scale_y_continuous()
from ggplot2
, with the argument labels
, which is assigned to the function name comma
, from the scales
package. This will format all of the labels on the y-axis of our plot with comma-formatted numbers.
ggplot(annual_esc_2000s,
+aes(x = Year,
+ y = escapement,
+ color = Species)) +
+ geom_line() +
+ geom_point() +
+ scale_x_continuous(breaks = seq(2000, 2016, 2)) +
+ scale_y_continuous(labels = comma) +
+ facet_wrap( ~ SASAP.Region,
+ scales = "free_y") +
+ ylab("Escapement") +
+ ggtitle("Annual Salmon Escapement by Region") +
+ +
+ my_theme theme(axis.text.x = element_text(angle = 45,
+ vjust = 0.5))
You can also save all your code into an object in your working environment by assigning a name to the ggplot()
code.
<- ggplot(annual_esc_2000s,
+ annual_region_plot aes(x = Year,
+ y = escapement,
+ color = Species)) +
+ geom_line() +
+ geom_point() +
+ scale_x_continuous(breaks = seq(2000, 2016, 2)) +
+ scale_y_continuous(labels = comma) +
+ facet_wrap( ~ SASAP.Region,
+ scales = "free_y") +
+ ylab("Escapement") +
+ xlab("\nYear") +
+ ggtitle("Annual Salmon Escapement by Region") +
+ +
+ my_theme theme(axis.text.x = element_text(angle = 45,
+ vjust = 0.5))
And then call your object to see your plot.
+ annual_region_plot
ggplot()
loves putting things in alphabetical order. But more frequent than not, that’s not the order you actually want things to be plotted if you have categorical groups. Let’s find some total years of data by species for Kuskokwim.
## Number Years of data for each salmon species at Kuskokwim
+<- annual_esc %>%
+ n_years group_by(SASAP.Region, Species) %>%
+ summarize(n = n()) %>%
+ filter(SASAP.Region == "Kuskokwim")
Now let’s plot this using geom_bar()
.
## base plot
+ggplot(n_years,
+aes(x = Species,
+ y = n)) +
+ geom_bar(aes(fill = Species),
+ stat = "identity")
Now, let’s apply some of the customizations we have seen so far and learn some new ones.
+## Reordering, flipping coords and other customization
+ggplot(n_years,
+aes(
+ x = fct_reorder(Species, n),
+ y = n,
+ fill = Species
+ +
+ )) geom_bar(stat = "identity") +
+ coord_flip() +
+ theme_minimal() +
+ ## another way to customize labels
+ labs(x = "Species",
+ y = "Number of years of data",
+ title = "Number of years of escapement data for salmon species in Kuskokwim") +
+ theme(legend.position = "none")
Saving plots using ggplot
is easy! The ggsave()
function will save either the last plot you created, or any plot that you have saved to a variable. You can specify what output format you want, size, resolution, etc. See ?ggsave()
for documentation.
ggsave("figures/nyears_data_kus.jpg", width = 8, height = 6, units = "in")
We can also save our facet plot showing annual escapements by region calling the plot’s object.
ggsave("figures/annual_esc_region.png", annual_region_plot, width = 12, height = 8, units = "in")
DT
Now that we know how to make great static visualizations, let’s introduce two other packages that allow us to display our data in interactive ways. These packages really shine when used with GitHub Pages, so at the end of this lesson we will publish our figures to the website we created earlier.
+First let’s show an interactive table of unique sampling locations using DT
. Write a data.frame
containing unique sampling locations with no missing values using two new functions from dplyr
and tidyr
: distinct()
and drop_na()
.
<- escape %>%
+ locations distinct(Location, Latitude, Longitude) %>%
+ drop_na()
And display it as an interactive table using datatable()
from the DT
package.
datatable(locations)
leaflet
Similar to ggplot2
, you can make a basic leaflet
map using just a couple lines of code. Note that unlike ggplot2
, the leaflet
package uses pipe operators (%>%
) and not the additive operator (+
).
The addTiles()
function without arguments will add base tiles to your map from OpenStreetMap. addMarkers()
will add a marker at each location specified by the latitude and longitude arguments. Note that the ~
symbol is used here to model the coordinates to the map (similar to facet_wrap()
in ggplot
).
leaflet(locations) %>%
+addTiles() %>%
+ addMarkers(
+ lng = ~ Longitude,
+ lat = ~ Latitude,
+ popup = ~ Location
+ )
You can also use leaflet
to import Web Map Service (WMS) tiles. Here is an example that utilizes the General Bathymetric Map of the Oceans (GEBCO) WMS tiles. In this example, we also demonstrate how to create a more simple circle marker, the look of which is explicitly set using a series of style-related arguments.
leaflet(locations) %>%
+addWMSTiles(
+ "https://www.gebco.net/data_and_products/gebco_web_services/web_map_service/mapserv?request=getmap&service=wms&BBOX=-90,-180,90,360&crs=EPSG:4326&format=image/jpeg&layers=gebco_latest&width=1200&height=600&version=1.3.0",
+ layers = 'GEBCO_LATEST',
+ attribution = "Imagery reproduced from the GEBCO_2022 Grid, WMS 1.3.0 GetMap, www.gebco.net"
+ %>%
+ ) addCircleMarkers(
+ lng = ~ Longitude,
+ lat = ~ Latitude,
+ popup = ~ Location,
+ radius = 5,
+ # set fill properties
+ fillColor = "salmon",
+ fillOpacity = 1,
+ # set stroke properties
+ stroke = T,
+ weight = 0.5,
+ color = "white",
+ opacity = 1
+ )
Leaflet has a ton of functionality that can enable you to create some beautiful, functional maps with relative ease. Here is an example of some we created as part of the State of Alaskan Salmon and People (SASAP) project, created using the same tools we showed you here. This map hopefully gives you an idea of how powerful the combination of R Markdown and GitHub Pages can be.
+Rmd
you have been working on for this lesson.Rmd
. This is a good way to test if everything in your code is working.index.Rmd
and the link to the html
file with this lesson’s content.index.Rmd
to an html
.Git
workflow: Stage > Commit > Pull > Push
ggplot2
Resourcesggplot2
by Allison Horst.ggplot2
tutorial for beautiful plotting in R by Cedric Scherer.ADD LINK TO MATERIAL
+ + + +sf
package to wrangle spatial dataleaflet
sf
From the sf
vignette:
++Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. It also describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them.
+
The sf
package is an R implementation of Simple Features. This package incorporates:
Most of the functions in this package starts with prefix st_
which stands for spatial and temporal.
In this lesson, our goal is to use a shapefile of Alaska regions and rivers, and data on population in Alaska by community to create a map that looks like this:
+ +All of the data used in this tutorial are simplified versions of real datasets available on the KNB Data Repository. We are using simplified datasets to ease the processing burden on all our computers since the original geospatial datasets are high-resolution. These simplified versions of the datasets may contain topological errors.
+The spatial data we will be using to create the map are:
+Data | +Original datasets | +
---|---|
Alaska regional boundaries | +Jared Kibele and Jeanette Clark. 2018. State of Alaska’s Salmon and People Regional Boundaries. Knowledge Network for Biocomplexity. doi:10.5063/F1125QWP. | +
Community locations and population | +Jeanette Clark, Sharis Ochs, Derek Strong, and National Historic Geographic Information System. 2018. Languages used in Alaskan households, 1990-2015. Knowledge Network for Biocomplexity. doi:10.5063/F11G0JHX. | +
Alaska rivers | +The rivers shapefile is a simplified version of Jared Kibele and Jeanette Clark. Rivers of Alaska grouped by SASAP region, 2018. Knowledge Network for Biocomplexity. doi:10.5063/F1SJ1HVW. | +
data
folder in the training_{USERNAME}
project. You don’t need to unzip the folder ahead of time, uploading will automatically unzip the folder.library(readr)
+library(sf)
+library(ggplot2)
+library(leaflet)
+library(scales)
+library(ggmap)
+library(dplyr)
plot()
and st_crs()
First let’s read in the shapefile of regional boundaries in Alaska using read_sf()
and then create a basic plot of the data plot()
.
# read in shapefile using read_sf()
+<- read_sf("data/ak_regions_simp.shp") ak_regions
# quick plot
+plot(ak_regions)
We can also examine it’s class using class()
.
class(ak_regions)
[1] "sf" "tbl_df" "tbl" "data.frame"
+sf
objects usually have two types of classes: sf
and data.frame
.
Unlike a typical data.frame
, an sf
object has spatial metadata (geometry type
, dimension
, bbox
, epsg (SRID)
, proj4string
) and an additional column typically named geometry
that contains the spatial data.
Since our shapefile object has the data.frame
class, viewing the contents of the object using the head()
function or other exploratory functions shows similar results as if we read in data using read.csv()
or read_csv()
.
head(ak_regions)
Simple feature collection with 6 features and 3 fields
+Geometry type: MULTIPOLYGON
+Dimension: XY
+Bounding box: xmin: -179.2296 ymin: 51.15702 xmax: 179.8567 ymax: 71.43957
+Geodetic CRS: WGS 84
+# A tibble: 6 × 4
+ region_id region mgmt_area geometry
+ <int> <chr> <dbl> <MULTIPOLYGON [°]>
+1 1 Aleutian Islands 3 (((-171.1345 52.44974, -171.1686 52.4174…
+2 2 Arctic 4 (((-139.9552 68.70597, -139.9893 68.7051…
+3 3 Bristol Bay 3 (((-159.8745 58.62778, -159.8654 58.6137…
+4 4 Chignik 3 (((-155.8282 55.84638, -155.8049 55.8655…
+5 5 Copper River 2 (((-143.8874 59.93931, -143.9165 59.9403…
+6 6 Kodiak 3 (((-151.9997 58.83077, -152.0358 58.8271…
+glimpse(ak_regions)
Rows: 13
+Columns: 4
+$ region_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
+$ region <chr> "Aleutian Islands", "Arctic", "Bristol Bay", "Chignik", "Cop…
+$ mgmt_area <dbl> 3, 4, 3, 3, 2, 3, 4, 4, 2, 4, 2, 1, 4
+$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-171.1345 5..., MULTIPOLYGON (((-139.9552 6.…
+Every sf
object needs a coordinate reference system (or crs
) defined in order to work with it correctly. A coordinate reference system contains both a datum and a projection. The datum is how you georeference your points (in 3 dimensions!) onto a spheroid. The projection is how these points are mathematically transformed to represent the georeferenced point on a flat piece of paper. All coordinate reference systems require a datum. However, some coordinate reference systems are “unprojected” (also called geographic coordinate systems). Coordinates in latitude/longitude use a geographic (unprojected) coordinate system. One of the most commonly used geographic coordinate systems is WGS 1984.
ESRI has a blog post that explains these concepts in more detail with very helpful diagrams and examples.
+You can view what crs
is set by using the function st_crs()
.
st_crs(ak_regions)
Coordinate Reference System:
+ User input: WGS 84
+ wkt:
+GEOGCRS["WGS 84",
+ DATUM["World Geodetic System 1984",
+ ELLIPSOID["WGS 84",6378137,298.257223563,
+ LENGTHUNIT["metre",1]]],
+ PRIMEM["Greenwich",0,
+ ANGLEUNIT["degree",0.0174532925199433]],
+ CS[ellipsoidal,2],
+ AXIS["latitude",north,
+ ORDER[1],
+ ANGLEUNIT["degree",0.0174532925199433]],
+ AXIS["longitude",east,
+ ORDER[2],
+ ANGLEUNIT["degree",0.0174532925199433]],
+ ID["EPSG",4326]]
+This is pretty confusing looking. Without getting into the details, that long string says that this data has a geographic coordinate system (WGS84) with no projection. A convenient way to reference crs
quickly is by using the EPSG code, a number that represents a standard projection and datum. You can check out a list of (lots!) of EPSG codes here.
We will use multiple EPSG codes in this lesson. Here they are, along with their more readable names:
+You will often need to transform your geospatial data from one coordinate system to another. The st_transform()
function does this quickly for us. You may have noticed the maps above looked wonky because of the dateline. We might want to set a different projection for this data so it plots nicer. A good one for Alaska is called the Alaska Albers projection, with an EPSG code of 3338.
<- ak_regions %>%
+ ak_regions_3338 st_transform(crs = 3338)
+
+st_crs(ak_regions_3338)
Coordinate Reference System:
+ User input: EPSG:3338
+ wkt:
+PROJCRS["NAD83 / Alaska Albers",
+ BASEGEOGCRS["NAD83",
+ DATUM["North American Datum 1983",
+ ELLIPSOID["GRS 1980",6378137,298.257222101,
+ LENGTHUNIT["metre",1]]],
+ PRIMEM["Greenwich",0,
+ ANGLEUNIT["degree",0.0174532925199433]],
+ ID["EPSG",4269]],
+ CONVERSION["Alaska Albers (meters)",
+ METHOD["Albers Equal Area",
+ ID["EPSG",9822]],
+ PARAMETER["Latitude of false origin",50,
+ ANGLEUNIT["degree",0.0174532925199433],
+ ID["EPSG",8821]],
+ PARAMETER["Longitude of false origin",-154,
+ ANGLEUNIT["degree",0.0174532925199433],
+ ID["EPSG",8822]],
+ PARAMETER["Latitude of 1st standard parallel",55,
+ ANGLEUNIT["degree",0.0174532925199433],
+ ID["EPSG",8823]],
+ PARAMETER["Latitude of 2nd standard parallel",65,
+ ANGLEUNIT["degree",0.0174532925199433],
+ ID["EPSG",8824]],
+ PARAMETER["Easting at false origin",0,
+ LENGTHUNIT["metre",1],
+ ID["EPSG",8826]],
+ PARAMETER["Northing at false origin",0,
+ LENGTHUNIT["metre",1],
+ ID["EPSG",8827]]],
+ CS[Cartesian,2],
+ AXIS["easting (X)",east,
+ ORDER[1],
+ LENGTHUNIT["metre",1]],
+ AXIS["northing (Y)",north,
+ ORDER[2],
+ LENGTHUNIT["metre",1]],
+ USAGE[
+ SCOPE["Topographic mapping (small scale)."],
+ AREA["United States (USA) - Alaska."],
+ BBOX[51.3,172.42,71.4,-129.99]],
+ ID["EPSG",3338]]
+plot(ak_regions_3338)
Much better!
+sf
& the Tidyversesf objects can be used as a regular data.frame
object in many operations. We already saw the results of plot()
and head()
.
Since sf
objects are data.frames, they play nicely with packages in the tidyverse
. Here are a couple of simple examples:
select()
# returns the names of all the columns in dataset
+colnames(ak_regions_3338)
[1] "region_id" "region" "mgmt_area" "geometry"
ak_regions_3338 %>%
  select(region)
Simple feature collection with 13 features and 1 field
+Geometry type: MULTIPOLYGON
+Dimension: XY
+Bounding box: xmin: -2175328 ymin: 405653 xmax: 1579226 ymax: 2383770
+Projected CRS: NAD83 / Alaska Albers
+# A tibble: 13 × 2
+ region geometry
+ <chr> <MULTIPOLYGON [m]>
+ 1 Aleutian Islands (((-1156666 420855.1, -1159837 417990.3, -1161898 41694…
+ 2 Arctic (((571289.9 2143072, 569941.5 2142691, 569158.2 2142146…
+ 3 Bristol Bay (((-339688.6 973904.9, -339302 972297.3, -339229.2 9710…
+ 4 Chignik (((-114381.9 649966.8, -112866.8 652065.8, -108836.8 65…
+ 5 Copper River (((561012 1148301, 559393.7 1148169, 557797.7 1148492, …
+ 6 Kodiak (((115112.5 983293, 113051.3 982825.9, 110801.3 983211.…
+ 7 Kotzebue (((-678815.3 1819519, -677555.2 1820698, -675557.8 1821…
+ 8 Kuskokwim (((-1030125 1281198, -1029858 1282333, -1028980 1284032…
+ 9 Cook Inlet (((35214.98 1002457, 36660.3 1002038, 36953.11 1001186,…
+10 Norton Sound (((-848357 1636692, -846510 1635203, -840513.7 1632225,…
+11 Prince William Sound (((426007.1 1087250, 426562.5 1088591, 427711.6 1089991…
+12 Southeast (((1287777 744574.1, 1290183 745970.8, 1292940 746262.7…
+13 Yukon (((-375318 1473998, -373723.9 1473487, -373064.8 147393…
+Note the sticky geometry column! The geometry column will stay with your sf
object even if it is not called explicitly.
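If you really do want a plain, non-spatial data.frame (for example, to pass to a function that doesn't understand sf objects), you can drop the sticky geometry column explicitly. A minimal sketch:
# st_drop_geometry() returns the attribute columns without the geometry,
# and the result is no longer an sf object
ak_regions_df <- st_drop_geometry(ak_regions_3338)

class(ak_regions_df)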
filter()
unique(ak_regions_3338$region)
[1] "Aleutian Islands" "Arctic" "Bristol Bay"
+ [4] "Chignik" "Copper River" "Kodiak"
+ [7] "Kotzebue" "Kuskokwim" "Cook Inlet"
+[10] "Norton Sound" "Prince William Sound" "Southeast"
+[13] "Yukon"
ak_regions_3338 %>%
  filter(region == "Southeast")
Simple feature collection with 1 feature and 3 fields
+Geometry type: MULTIPOLYGON
+Dimension: XY
+Bounding box: xmin: 559475.7 ymin: 722450 xmax: 1579226 ymax: 1410576
+Projected CRS: NAD83 / Alaska Albers
+# A tibble: 1 × 4
+ region_id region mgmt_area geometry
+* <int> <chr> <dbl> <MULTIPOLYGON [m]>
+1 12 Southeast 1 (((1287777 744574.1, 1290183 745970.8, 1292940 …
+You can also use the sf
package to create spatial joins, useful for when you want to utilize two datasets together.
We have some population data, but it gives the population by city, not by region. To determine the population per region we will need to:
1. Read in the population data from a csv and turn it into an sf object
2. Use a spatial join (st_join()) to assign each city to a region
3. Calculate the total population by region using group_by() and summarize()
4. Save the spatial object to a new file using write_sf()

1. Read in alaska_population.csv using read_csv()
# read in population data
pop <- read_csv("data/alaska_population.csv")
Turn pop
into a spatial object
The st_join()
function is a spatial left join. The arguments for both the left and right tables are objects of class sf
which means we will first need to turn our population data.frame
with latitude and longitude coordinates into an sf
object.
We can do this easily using the st_as_sf()
function, which takes as arguments the coordinates and the crs
. The remove = F
specification here ensures that when we create our geometry
column, we retain our original lat
lng
columns, which we will need later for plotting. Although it isn’t said anywhere explicitly in the file, let’s assume that the coordinate system used to reference the latitude longitude coordinates is WGS84, which has a crs
number of 4326.
pop_4326 <- st_as_sf(pop,
                     coords = c('lng', 'lat'),
                     crs = 4326,
                     remove = F)

head(pop_4326)
Simple feature collection with 6 features and 5 fields
+Geometry type: POINT
+Dimension: XY
+Bounding box: xmin: -176.6581 ymin: 51.88 xmax: -154.1703 ymax: 62.68889
+Geodetic CRS: WGS 84
+# A tibble: 6 × 6
+ year city lat lng population geometry
+ <dbl> <chr> <dbl> <dbl> <dbl> <POINT [°]>
+1 2015 Adak 51.9 -177. 122 (-176.6581 51.88)
+2 2015 Akhiok 56.9 -154. 84 (-154.1703 56.94556)
+3 2015 Akiachak 60.9 -161. 562 (-161.4314 60.90944)
+4 2015 Akiak 60.9 -161. 399 (-161.2139 60.91222)
+5 2015 Akutan 54.1 -166. 899 (-165.7731 54.13556)
+6 2015 Alakanuk 62.7 -165. 777 (-164.6153 62.68889)
+2. Join population data with Alaska regions data using st_join()
Now we can do our spatial join! You can specify what geometry function the join uses (st_intersects
, st_within
, st_crosses
, st_is_within_distance
…) in the join
argument. The geometry function you use will depend on what kind of operation you want to do, and the geometries of your shapefiles.
In this case, we want to find what region each city falls within, so we will use st_within
.
pop_joined <- st_join(pop_4326, ak_regions_3338, join = st_within)
This gives an error!
+ Error: st_crs(x) == st_crs(y) is not TRUE
Turns out, this won’t work right now because our coordinate reference systems are not the same. Luckily, this is easily resolved using st_transform()
, and projecting our population object into Alaska Albers.
pop_3338 <- st_transform(pop_4326, crs = 3338)
pop_joined <- st_join(pop_3338, ak_regions_3338, join = st_within)

head(pop_joined)
Simple feature collection with 6 features and 8 fields
+Geometry type: POINT
+Dimension: XY
+Bounding box: xmin: -1537925 ymin: 472626.9 xmax: -10340.71 ymax: 1456223
+Projected CRS: NAD83 / Alaska Albers
+# A tibble: 6 × 9
+ year city lat lng population geometry region_id region
+ <dbl> <chr> <dbl> <dbl> <dbl> <POINT [m]> <int> <chr>
+1 2015 Adak 51.9 -177. 122 (-1537925 472626.9) 1 Aleutian I…
+2 2015 Akhiok 56.9 -154. 84 (-10340.71 770998.4) 6 Kodiak
+3 2015 Akiac… 60.9 -161. 562 (-400885.5 1236460) 8 Kuskokwim
+4 2015 Akiak 60.9 -161. 399 (-389165.7 1235475) 8 Kuskokwim
+5 2015 Akutan 54.1 -166. 899 (-766425.7 526057.8) 1 Aleutian I…
+6 2015 Alaka… 62.7 -165. 777 (-539724.9 1456223) 13 Yukon
+# ℹ 1 more variable: mgmt_area <dbl>
+3. Calculate the total population by region using group_by()
and summarize()
Next we compute the total population for each region. In this case, we want to do a group_by()
and summarise()
as if this were a regular data.frame
. Otherwise all of our point geometries would be included in the aggregation, which is not what we want. Our goal is just to get the total population by region. We remove the sticky geometry using as.data.frame()
, on the advice of the sf::tidyverse
help page.
pop_region <- pop_joined %>%
  as.data.frame() %>%
  group_by(region) %>%
  summarise(total_pop = sum(population))

head(pop_region)
# A tibble: 6 × 2
+ region total_pop
+ <chr> <dbl>
+1 Aleutian Islands 8840
+2 Arctic 8419
+3 Bristol Bay 6947
+4 Chignik 311
+5 Cook Inlet 408254
+6 Copper River 2294
+And use a regular left_join()
to get the information back to the Alaska region shapefile. Note that we need this step in order to regain our region geometries so that we can make some maps.
pop_region_3338 <- left_join(ak_regions_3338, pop_region, by = "region")

# plot to check
plot(pop_region_3338["total_pop"])
So far, we have learned how to use sf
and dplyr
to use a spatial join on two datasets and calculate a summary metric from the result of that join.
Say we want to calculate the population by Alaska management area, as opposed to region.
pop_mgmt_338 <- pop_region_3338 %>%
  group_by(mgmt_area) %>%
  summarize(total_pop = sum(total_pop))

plot(pop_mgmt_338["total_pop"])
Notice that the region geometries were combined into a single polygon for each management area.
+If we don’t want to combine geometries, we can specify do_union = F
as an argument.
pop_mgmt_3338 <- pop_region_3338 %>%
  group_by(mgmt_area) %>%
  summarize(total_pop = sum(total_pop), do_union = F)

plot(pop_mgmt_3338["total_pop"])
4. Save the spatial object to a new file using write_sf()
Save the spatial object to disk using write_sf()
and specifying the filename. Writing your file with the extension .shp
will assume an ESRI Shapefile driver, but there are many other format options available.
write_sf(pop_region_3338, "data/ak_regions_population.shp")
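Because the driver is guessed from the file extension, writing to another format is just a matter of changing the extension. Here is a sketch that writes a single-file GeoPackage instead (the .gpkg filename is ours, not part of the lesson data):
# GeoPackage is an open, single-file format that also preserves the CRS
write_sf(pop_region_3338, "data/ak_regions_population.gpkg")

# it can be read back in the same way as the shapefile
# pop_region_check <- read_sf("data/ak_regions_population.gpkg")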
ggplot
ggplot2
now has integrated functionality to plot sf objects using geom_sf()
.
We can plot sf
objects just like regular data.frames using geom_sf
.
ggplot(pop_region_3338) +
+geom_sf(aes(fill = total_pop)) +
+ labs(fill = "Total Population") +
+ scale_fill_continuous(low = "khaki",
+ high = "firebrick",
+ labels = comma) +
+ theme_bw()
We can also plot multiple shapefiles in the same plot. Say we want to visualize rivers in Alaska in addition to the location of communities, since many Alaska communities are on rivers. We can read in a rivers shapefile, double-check the crs
to make sure it is what we need, and then plot all three shapefiles - the regional population (polygons), the locations of cities (points), and the rivers (linestrings).
Coordinate Reference System:
+ User input: Albers
+ wkt:
+PROJCRS["Albers",
+ BASEGEOGCRS["GCS_GRS 1980(IUGG, 1980)",
+ DATUM["D_unknown",
+ ELLIPSOID["GRS80",6378137,298.257222101,
+ LENGTHUNIT["metre",1,
+ ID["EPSG",9001]]]],
+ PRIMEM["Greenwich",0,
+ ANGLEUNIT["Degree",0.0174532925199433]]],
+ CONVERSION["unnamed",
+ METHOD["Albers Equal Area",
+ ID["EPSG",9822]],
+ PARAMETER["Latitude of false origin",50,
+ ANGLEUNIT["Degree",0.0174532925199433],
+ ID["EPSG",8821]],
+ PARAMETER["Longitude of false origin",-154,
+ ANGLEUNIT["Degree",0.0174532925199433],
+ ID["EPSG",8822]],
+ PARAMETER["Latitude of 1st standard parallel",55,
+ ANGLEUNIT["Degree",0.0174532925199433],
+ ID["EPSG",8823]],
+ PARAMETER["Latitude of 2nd standard parallel",65,
+ ANGLEUNIT["Degree",0.0174532925199433],
+ ID["EPSG",8824]],
+ PARAMETER["Easting at false origin",0,
+ LENGTHUNIT["metre",1],
+ ID["EPSG",8826]],
+ PARAMETER["Northing at false origin",0,
+ LENGTHUNIT["metre",1],
+ ID["EPSG",8827]]],
+ CS[Cartesian,2],
+ AXIS["(E)",east,
+ ORDER[1],
+ LENGTHUNIT["metre",1,
+ ID["EPSG",9001]]],
+ AXIS["(N)",north,
+ ORDER[2],
+ LENGTHUNIT["metre",1,
+ ID["EPSG",9001]]]]
rivers_3338 <- read_sf("data/ak_rivers_simp.shp")
st_crs(rivers_3338)
Note that although no EPSG code is set explicitly, with some sleuthing we can determine that this is EPSG:3338
. This site is helpful for looking up EPSG codes.
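If you wanted to record that EPSG code on the object itself, one option is st_set_crs(). Note this is a sketch of a metadata-only fix: st_set_crs() relabels the CRS without reprojecting the coordinates, which is appropriate here only because we determined the coordinates are already in Alaska Albers.
# relabel the CRS metadata as EPSG:3338 (no coordinates are changed)
rivers_3338 <- st_set_crs(rivers_3338, 3338)

st_crs(rivers_3338)$epsg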
ggplot() +
+geom_sf(data = pop_region_3338, aes(fill = total_pop)) +
+ geom_sf(data = pop_3338, size = 0.5) +
+ geom_sf(data = rivers_3338,
+ aes(linewidth = StrOrder)) +
+ scale_linewidth(range = c(0.05, 0.5), guide = "none") +
+ labs(title = "Total Population by Alaska Region",
+ fill = "Total Population") +
+ scale_fill_continuous(low = "khaki",
+ high = "firebrick",
+ labels = comma) +
+ theme_bw()
ggmap
The ggmap
package has some functions that can render base maps (as raster objects) from open tile servers like Google Maps, Stamen, OpenStreetMap, and others.
We’ll need to transform our shapefile with population data by community to EPSG:3857
which is the crs
used for rendering maps in Google Maps, Stamen, and OpenStreetMap, among others.
pop_3857 <- pop_3338 %>%
  st_transform(crs = 3857)
Next, let’s grab a base map from the Stamen map tile server covering the region of interest. First we include a function that transforms the bounding box (which starts in EPSG:4326
) to also be in the EPSG:3857
CRS, which is the projection that the map raster is returned in from Stamen. This is an issue with ggmap
described in more detail here
# Define a function to fix the bbox to be in EPSG:3857
+# See https://github.com/dkahle/ggmap/issues/160#issuecomment-397055208
ggmap_bbox_to_3857 <- function(map) {
  if (!inherits(map, "ggmap"))
    stop("map must be a ggmap object")
  # Extract the bounding box (in lat/lon) from the ggmap to a numeric vector,
  # and set the names to what sf::st_bbox expects:
  map_bbox <- setNames(unlist(attr(map, "bb")),
                       c("ymin", "xmin", "ymax", "xmax"))

  # Convert the bbox to an sf polygon, transform it to 3857,
  # and convert back to a bbox (convoluted, but it works)
  bbox_3857 <-
    st_bbox(st_transform(st_as_sfc(st_bbox(map_bbox, crs = 4326)), 3857))

  # Overwrite the bbox of the ggmap object with the transformed coordinates
  attr(map, "bb")$ll.lat <- bbox_3857["ymin"]
  attr(map, "bb")$ll.lon <- bbox_3857["xmin"]
  attr(map, "bb")$ur.lat <- bbox_3857["ymax"]
  attr(map, "bb")$ur.lon <- bbox_3857["xmax"]

  map
}
Next, we define the bounding box of interest, and use get_stamenmap()
to get the basemap. Then we run our function defined above on the result of the get_stamenmap()
call.
bbox <- c(-170, 52, -130, 64) # this is roughly southern Alaska
ak_map <- get_stamenmap(bbox, zoom = 4) # get base map
ak_map_3857 <- ggmap_bbox_to_3857(ak_map) # fix the bbox to be in EPSG:3857
Finally, plot the base raster map with the population data overlaid, which is easy now that everything is in the same projection (3857):
+ggmap(ak_map_3857) +
+geom_sf(data = pop_3857,
+ aes(color = population),
+ inherit.aes = F) +
+ scale_color_continuous(low = "khaki",
+ high = "firebrick",
+ labels = comma)
sf
objects with leaflet
We can also make an interactive map from our data above using leaflet
.
leaflet
(unlike ggplot
) will project data for you. The catch is that you have to give it a projection definition (like Alaska Albers), and your shapefile must use a geographic (unprojected) coordinate system. This means that we need to use our shapefile with the 4326 EPSG code. Remember you can always check what crs
you have set using st_crs
.
Here we define a leaflet projection for Alaska Albers, and save it as a variable to use later.
epsg3338 <- leaflet::leafletCRS(
  crsClass = "L.Proj.CRS",
  code = "EPSG:3338",
  proj4def = "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs",
  resolutions = 2 ^ (16:7)
)
You might notice that this looks familiar! The syntax is a bit different, but most of this information is also contained within the crs
of our shapefile:
st_crs(pop_region_3338)
Coordinate Reference System:
+ User input: EPSG:3338
+ wkt:
+PROJCRS["NAD83 / Alaska Albers",
+ BASEGEOGCRS["NAD83",
+ DATUM["North American Datum 1983",
+ ELLIPSOID["GRS 1980",6378137,298.257222101,
+ LENGTHUNIT["metre",1]]],
+ PRIMEM["Greenwich",0,
+ ANGLEUNIT["degree",0.0174532925199433]],
+ ID["EPSG",4269]],
+ CONVERSION["Alaska Albers (meters)",
+ METHOD["Albers Equal Area",
+ ID["EPSG",9822]],
+ PARAMETER["Latitude of false origin",50,
+ ANGLEUNIT["degree",0.0174532925199433],
+ ID["EPSG",8821]],
+ PARAMETER["Longitude of false origin",-154,
+ ANGLEUNIT["degree",0.0174532925199433],
+ ID["EPSG",8822]],
+ PARAMETER["Latitude of 1st standard parallel",55,
+ ANGLEUNIT["degree",0.0174532925199433],
+ ID["EPSG",8823]],
+ PARAMETER["Latitude of 2nd standard parallel",65,
+ ANGLEUNIT["degree",0.0174532925199433],
+ ID["EPSG",8824]],
+ PARAMETER["Easting at false origin",0,
+ LENGTHUNIT["metre",1],
+ ID["EPSG",8826]],
+ PARAMETER["Northing at false origin",0,
+ LENGTHUNIT["metre",1],
+ ID["EPSG",8827]]],
+ CS[Cartesian,2],
+ AXIS["easting (X)",east,
+ ORDER[1],
+ LENGTHUNIT["metre",1]],
+ AXIS["northing (Y)",north,
+ ORDER[2],
+ LENGTHUNIT["metre",1]],
+ USAGE[
+ SCOPE["Topographic mapping (small scale)."],
+ AREA["United States (USA) - Alaska."],
+ BBOX[51.3,172.42,71.4,-129.99]],
+ ID["EPSG",3338]]
+Since leaflet
requires that we use an unprojected coordinate system, let’s use st_transform()
yet again to get back to WGS84.
pop_region_4326 <- pop_region_3338 %>% st_transform(crs = 4326)
m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
  addPolygons(data = pop_region_4326,
              fillColor = "gray",
              weight = 1)

m
We can add labels, legends, and a color scale.
pal <- colorNumeric(palette = "Reds", domain = pop_region_4326$total_pop)

m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
  addPolygons(
    data = pop_region_4326,
    fillColor = ~ pal(total_pop),
    weight = 1,
    color = "black",
    fillOpacity = 1,
    label = ~ region
  ) %>%
  addLegend(
    position = "bottomleft",
    pal = pal,
    values = range(pop_region_4326$total_pop),
    title = "Total Population"
  )

m
We can also add the individual communities, with popup labels showing their population, on top of that!
pal <- colorNumeric(palette = "Reds", domain = pop_region_4326$total_pop)

m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
  addPolygons(
    data = pop_region_4326,
    fillColor = ~ pal(total_pop),
    weight = 1,
    color = "black",
    fillOpacity = 1
  ) %>%
  addCircleMarkers(
    data = pop_4326,
    lat = ~ lat,
    lng = ~ lng,
    radius = ~ log(population / 500), # arbitrary scaling
    fillColor = "gray",
    fillOpacity = 1,
    weight = 0.25,
    color = "black",
    label = ~ paste0(pop_4326$city, ", population ", comma(pop_4326$population))
  ) %>%
  addLegend(
    position = "bottomleft",
    pal = pal,
    values = range(pop_region_4326$total_pop),
    title = "Total Population"
  )

m
There is a lot more functionality to sf
including the ability to intersect
polygons, calculate distance
, create a buffer
, and more. Here are some more great resources and tutorials for a deeper dive into this great package:
ADD LINK TO MATERIALS
+ + + +Much of the information covered in this chapter is based on Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. This is a great book if you want to go deeper into text analysis.
+Text mining is the process by which unstructured text is transformed into a structured format to prepare it for analysis. This can range from the simple example we show in this lesson, to much more complicated processes such as using OCR (optical character recognition) to scan and extract text from pdfs, or web scraping.
Once text is in a structured format, analysis can be performed on it. The inherent benefit of quantitative text analysis is that it is highly scalable. With the right computational techniques, massive quantities of text can be mined and analyzed many, many orders of magnitude faster than it would take a human to do the same task. The downside is that human language is inherently nuanced, and computers (as you may have noticed) think very differently than we do. In order for an analysis to capture this nuance, the tools and techniques for text analysis need to be set up with care, especially when the analysis becomes more complex.
+There are a number of different types of text analysis. In this lesson we will show some simple examples of two: word frequency, and sentiment analysis.
+First we’ll load the libraries we need for this lesson:
+library(dplyr)
+library(tibble)
+library(readr)
+library(tidytext)
+library(wordcloud)
+library(reshape2)
Load the survey data back in using the code chunks below:
survey_raw <- read_csv("https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/urn%3Auuid%3A71cb8d0d-70d5-4752-abcd-e3bcf7f14783", show_col_types = FALSE)

events <- read_csv("https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/urn%3Auuid%3A0a1dd2d8-e8db-4089-a176-1b557d6e2786", show_col_types = FALSE)

survey_clean <- survey_raw %>%
  select(-notes) %>%
  mutate(Q1 = if_else(Q1 == "1", "below expectations", Q1)) %>%
  mutate(Q2 = tolower(Q2))

survey_joined <- left_join(survey_clean, events, by = "StartDate")
We are going to be working in the “tidy text format.” This format stipulates that the text column of our data frame contains rows with only one token per row. A token, in this case, is a meaningful unit of text. Depending on the analysis, that could be a word, two words, or phrase.
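As an aside, tokens do not have to be single words. Here is a minimal sketch (using the survey_joined data frame from above; the q3_bigrams name is just for this example) of tokenizing question 3 into bigrams, pairs of adjacent words, instead:
q3_bigrams <- survey_joined %>%
  select(StartDate, location, Q3) %>%
  unnest_tokens(output = bigram, input = Q3, token = "ngrams", n = 2)

head(q3_bigrams)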
+First, let’s create a data frame with responses to question 3, with the one token per row. We use the unnest_tokens
function from tidytext
, after selecting columns of interest.
q3 <- survey_joined %>%
  select(StartDate, location, Q3) %>%
  unnest_tokens(output = word, input = Q3)
You’ll see that we now have a very long data frame with only one word in each row of the text column. Some of the words aren’t so interesting though. The words that are likely not useful for analysis are called “stop words”. There is a list of stop words contained within the tidytext
package and we can access it using the data
function. We can then use the anti_join
function to return only the words that are not in the stop word list.
data(stop_words)

q3 <- anti_join(q3, stop_words)
Joining with `by = join_by(word)`
+Now, we can do normal dplyr
analysis to examine the most commonly used words in question 3. The count
function is helpful here. We could also do a group_by
and summarize
and get the same result. We can also arrange
the results, and get the top 10 using slice_head
.
q3_top <- q3 %>%
  count(word) %>%
  arrange(-n) %>%
  slice_head(n = 10)
Right now, our counts of the most commonly used non-stop words are only moderately informative because they don’t take into context how many other words, responses, and courses there are. A widely used metric to analyze and draw conclusions from word frequency, including frequency within documents (or courses, in our case) is called tf-idf. This is the term frequency (number of appearances of a term divided by total number of terms), multiplied by the inverse document frequency (the natural log of the number of documents divided by the number of documents containing the term). The tidytext book has great examples on how to calculate this metric easily using some built in functions to the package.
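We won't do a full tf-idf analysis here, but as a rough sketch of what it looks like with tidytext, you can count words within a grouping variable and then use bind_tf_idf(). Below we treat each course location as a "document"; that choice is ours for illustration only.
q3_tfidf <- q3 %>%
  count(location, word) %>%                          # term counts per "document"
  bind_tf_idf(term = word, document = location, n = n) %>%
  arrange(-tf_idf)

head(q3_tfidf)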
+Let’s do the same workflow for question 4:
q4 <- survey_joined %>%
  select(StartDate, location, Q4) %>%
  unnest_tokens(output = word, input = Q4) %>%
  anti_join(stop_words)
Joining with `by = join_by(word)`
q4_top <- q4 %>%
  count(word) %>%
  arrange(-n) %>%
  slice_head(n = 10)
Perhaps not surprisingly, the word "data" is mentioned a lot! In this case, it might be useful to add it to our stop words list. You can create a data.frame in place with your word and an indication of the lexicon (in this case your own, which we can call custom). Then we use rbind
to bind that data frame with our previous stop words data frame.
custom_words <- data.frame(word = "data", lexicon = "custom")

stop_words_full <- rbind(stop_words, custom_words)
Now we can run our question 4 analysis again, with the anti_join
on our custom list.
q4 <- survey_joined %>%
  select(StartDate, location, Q4) %>%
  unnest_tokens(output = word, input = Q4) %>%
  anti_join(stop_words_full)
Joining with `by = join_by(word)`
q4_top <- q4 %>%
  count(word) %>%
  arrange(-n) %>%
  slice_head(n = 10)
The above example showed how to analyze text that was contained within a tabular format (a csv file). There are many other text formats that you might want to analyze, however. This might include pdf documents, websites, word documents, etc. Here, we’ll look at how to read in the text from a PDF document into an analysis pipeline like above.
Before we begin, it is important to understand that not all PDF documents can be processed this way. PDF files can store information in many ways, including both images and text. Some PDF documents, particularly older or scanned ones, are simply images of text: the bytes making up the document do not contain a "readable" version of the text in the image, much like a photograph you would take with a camera. Other PDF documents contain the text as character strings, along with the information on how to render it on the page (such as position and font). The analysis that follows will only work on PDF files that fit the second description. If the PDF document you are trying to analyze is more like the first, you would need to first use a technique called Optical Character Recognition (OCR) to interpret the text in the image and store it in a parsable way. Since this document can be parsed, we'll proceed without doing OCR.
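There is no foolproof way to tell the two kinds of PDF apart without opening the file, but a rough heuristic (a sketch only, with a hypothetical file path) is to check whether pdf_text() returns mostly empty strings, which suggests a scanned, image-only document that would need OCR:
library(pdftools)

pages <- pdf_text("data/some_document.pdf")  # hypothetical path, not the lesson file
mean(nchar(trimws(pages)) == 0)              # close to 1 means little or no extractable text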
+First we’ll load another library, pdftools
, which will read in our PDF, and the stringr
library, which helps manipulate character strings.
library(pdftools)
+library(stringr)
Next, navigate to the dataset Elizabeth Rink and Gitte Adler Reimer. 2022. Population Dynamics in Greenland: A Multi-component Mixed-methods Study of Pregnancy Dynamics in Greenland (2014-2022). Arctic Data Center. doi:10.18739/A21Z41V1R.. Right click the download button next to the top PDF data file called ‘Translation_of_FG_8_Ilulissat_170410_0077.pdf’.
+First we create a variable with a path to a location where we want to save the file.
+<- "data/Translation_of_FG_8_Ilulissat_170410_0077.pdf" path
Then use download.file
to download it and save it to that path.
download.file("https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A34999083-2fa1-4222-ab27-53204327e8fc", path)
The pdf_text
function extracts the text from the PDF file, returning a vector of character strings with a length equal to the number of pages in the file. So, our return value is loaded into R, but maybe not that useful yet because it is just a bunch of really long strings.
txt <- pdf_text(path)

class(txt)
[1] "character"
+Luckily, there is a function that will turn the pdf text data we just read in to a form that is compatible with the rest of the tidytext
tools. The tibble::enframe
function converts the vector into a data.frame
. We then change one column name to describe what the column actually is (page number).
txt_clean <- txt %>%
  enframe() %>%
  rename(page = name)
We can do the same analysis as above, unnesting the tokens and removing stop words to get the most frequent words:
pdf_summary <- txt_clean %>%
  unnest_tokens(output = word, input = value) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  arrange(-n) %>%
  slice_head(n = 10)
Joining with `by = join_by(word)`
If we look at the result, and then back at the original document, it is clear that more work is needed to get the data into an analyzable state. The header and footer of each page of the PDF were included in the text we analyzed, and since they are repeated on every page (and aren't really the subject of our inquiry anyway), they should be removed from the text after we read it into R but before we try to calculate the most used words. It might also be beneficial to try to separate the questions from the responses, in case we want to analyze just the responses or just the questions.
+To help us clean things up, first let’s split our value column currently containing full pages of text by where there are double newlines (\n\n
). You can see in the original PDF how this demarcates the responses, which contain single newlines within each paragraph, and two new lines (an empty line) between paragraphs. You can see an example of this within the text we have read in by examining just the first 500 characters of the first page of data.
substr(txt_clean$value[1], 1,500)
[1] " Population Dynamics in Greenland\n Focus Group Meetings\n\nDate of the focus group meeting: Ilulissat April 10th, 2017\nName and title of the Researchers: Q3: Dr. Elizabeth Rink & Q1: Dr. Gitte Adler\nReimer\nName of the facilitator: Q2: Majbritt Didriksen Raal\nRecord No.: 170410_0077\n\nFG # 8\n\nQ2: Ilulissat the April 10th, 2017, we are going to talk to the Professional Group.\n\nQ1: Welcome, we are working on the final part of t"
+To split our character vectors, we will use the str_split
function. It splits a character vector according to a separator, and stores the values in a list. To show more clearly, let’s look at a dummy example. We can split a string of comma separated numbers into a list with each individual number.
<- "1,2,3,4,5"
+ x str_split(x, ",")
[[1]]
+[1] "1" "2" "3" "4" "5"
+In the real dataset, we’ll use str_split
and mutate
, which will create a list of values within each row of the value
column. So each cell in the value
column contains a list of values like the result of the example above. We can “flatten” this data so that each cell only has one value by using the unnest
function, which takes as arguments the columns to flatten. Let’s take the example above, and make it a little more like our real data.
First turn the original dummy vector into a data frame, and do our split as before, this time using mutate
.
x_df <- data.frame(x = x) %>%
  mutate(x = str_split(x, ","))

x_df
x
+1 1, 2, 3, 4, 5
+Then you can run the unnest
on the column of split values we just created.
x_df_flat <- x_df %>%
  tidyr::unnest(cols = x)

x_df_flat
# A tibble: 5 × 1
+ x
+ <chr>
+1 1
+2 2
+3 3
+4 4
+5 5
+Now that we know how this works, let’s do it on our dataset with the double newline character as the separator.
txt_clean <- txt_clean %>%
  mutate(value = str_split(value, "\n\n")) %>%
  tidyr::unnest(cols = value)

DT::datatable(txt_clean, rownames = F)
You can see that our questions and answers are now easily visible because they all start with either Q or A. The other lines are blank lines or header/footer lines from the document. So, let's extract the first few characters of each line into a new column using substr
, with the goal that we’ll run a filter
for rows that start with Q or A, thus discarding all the other rows.
First, we extract the first 4 characters of each row and using mutate
create a new column with those values called id
.
txt_clean <- txt_clean %>%
  mutate(id = substr(value, 1, 4))
Let’s have a look at the unique values there:
+unique(txt_clean$id)
[1] " " "Date" "FG #" "Q2: " "Q1: " "A1: " "PDG " "A2: " "\nQ2:"
+[10] "\nQ2." "\nPDG" "" "A1 " "Q3: "
Unfortunately, some of the text is a tiny bit garbled: there are newlines before at least some of the Q and A ids. We can use mutate again with str_replace
to replace those \n
with a blank value, which will remove them.
txt_clean <- txt_clean %>%
  mutate(id = substr(value, 1, 4)) %>%
  mutate(id = str_replace(id, "\n", ""))
unique(txt_clean$id)
[1] " " "Date" "FG #" "Q2: " "Q1: " "A1: " "PDG " "A2: " "Q2:" "Q2."
+[11] "PDG" "" "A1 " "Q3: "
+Now we will use substr
again to get the first two characters of each id.
txt_clean <- txt_clean %>%
  mutate(id = substr(value, 1, 4)) %>%
  mutate(id = str_replace(id, "\n", "")) %>%
  mutate(id = substr(id, 1, 2))
unique(txt_clean$id)
[1] " " "Da" "FG" "Q2" "Q1" "A1" "PD" "A2" "" "Q3"
+Finally, we can run the filter. Here, we filter for id
values that start with either a Q or an A using the grepl
function and a regular expression. We won’t go much into regular expression details, but there is a chapter in the appendix for more about how they work.
Here is an example of grepl in action. It returns a true or false for whether the value of x starts with (signified by ^
) a Q or A (signified by QA in square brackets).
<- c("Q3", "F1", "AAA", "FA")
+ x grepl("^[QA]", x)
[1] TRUE FALSE TRUE FALSE
+So let’s run that within a filter
which will return only rows where the grepl
would return TRUE.
txt_clean <- txt_clean %>%
  mutate(id = substr(value, 1, 4)) %>%
  mutate(id = str_replace(id, "\n", "")) %>%
  mutate(id = substr(id, 1, 2)) %>%
  filter(grepl("^[QA]", id))
Finally, as our last cleaning step, we replace every instance of a Q or an A followed by a digit and a colon at the start of a string with an empty string (removing them from the beginning of the line).
txt_clean <- txt_clean %>%
  mutate(id = substr(value, 1, 4)) %>%
  mutate(id = str_replace(id, "\n", "")) %>%
  mutate(id = substr(id, 1, 2)) %>%
  filter(grepl("^[QA]", id)) %>%
  mutate(value = str_replace_all(value, "[QA][0-9]\\:", ""))
Finally, we can try the same analysis again as above to look for the most commonly used words.
pdf_summary <- txt_clean %>%
  unnest_tokens(output = word, input = value) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  arrange(-n) %>%
  slice_head(n = 10)
Joining with `by = join_by(word)`
In sentiment analysis, tokens (in this case our single words) are evaluated against a dictionary of words where a sentiment is assigned to each word. There are many different sentiment lexicons: some with single words, some with more than one word, and some that are aimed at particular disciplines. When embarking on a sentiment analysis project, choosing your lexicon is a decision that should be made with care. Sentiment analysis can also be done using machine learning algorithms.
+With that in mind, we will next do a very simple sentiment analysis on our Q3 and Q4 answers using the bing lexicon from Bing Liu and collaborators, which ships with the tidytext
package.
First we will use the get_sentiments
function to load the lexicon.
<- get_sentiments("bing") bing
Next we do an inner join to return the words from question 3 that are contained within the lexicon.
q3_sent <- inner_join(q3, bing, by = "word")
There are a variety of directions you could go from here, analysis-wise, such as calculating an overall sentiment index for that question, plotting sentiment against some other variable, or making a fun word cloud like the one below! Here we bring in reshape2::acast
to create a sentiment matrix for each word, and pass that into wordcloud::comparison.cloud
to look at a wordcloud that indicates the frequency and sentiment of the words in our responses.
q3_sent %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100, title.size = 2)
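As a sketch of the "overall sentiment index" idea mentioned above, one very simple version is to count the positive and negative words for the question and take the difference (this assumes both sentiment categories appear in the responses):
q3_sent %>%
  count(sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)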
Let’s look at the question 4 word cloud:
q4 %>%
  inner_join(bing, by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100, title.size = 2)
In this lesson, we learned:
+(Chapter 1 - 5 min) - General view of US Census Data
+tidycensus
(Chapter 2 - 25-30 min)
+(Chapter 3 - 20-25 min)
+ggplot2
(Chapter 4 - 20 min)
+( Chapter 5 and 6 - 10 min?)
Surveys and questionnaires are commonly used research methods within social science and other fields. For example, understanding regional and national population demographics, income, and education as part of the National Census activity, assessing audience perspectives on specific topics of research interest (e.g. the work by Tenopir and colleagues on Data Sharing by Scientists), evaluation of learning deliverables and outcomes, and consumer feedback on new and upcoming products. These are distinct from the use of the term survey within natural sciences, which might include geographical surveys ("the making of measurement in the field from which maps are drawn"), ecological surveys ("the process whereby a proposed development site is assessed to establish any environmental impact the development may have") or biodiversity surveys ("provide detailed information about biodiversity and community structure") among others.
+Although surveys can be conducted on paper or verbally, here we focus on surveys done via software tools. Needs will vary according to the nature of the research being undertaken. However, there is fundamental functionality that survey software should provide including:
+More advanced features can include:
Commonly used survey software within academic (vs market) research includes Qualtrics, Survey Monkey, and Google Forms. Both Qualtrics and Survey Monkey are licensed (with limited functionality available at no cost), while Google Forms is free.
+ +In this lesson we will use the qualtRics
package to reproducibly access some survey results set up for this course.
The survey is very short, only four questions. The first question is on its own page and is a consent question, shown after a couple of short paragraphs describing what the survey is, its purpose, how long it will take to complete, and who is conducting it. This type of information is required if the survey is governed by an IRB, and the content will depend on the type of research being conducted. In this case, this survey is not for research purposes, and thus is not governed by IRB, but we still include this information as it conforms to the Belmont Principles. The Belmont Principles identify the basic ethical principles that should underlie research involving human subjects.
+ +The three main questions of the survey have three types of responses: a multiple choice answer, a multiple choice answer which also includes an “other” write in option, and a free text answer. We’ll use the results of this survey, which was sent out to NCEAS staff to fill out, to learn about how to create a reproducible survey report.
+ +First, open a new RMarkdown document and add a chunk to load the libraries we’ll need for this lesson:
+library(qualtRics)
+library(tidyr)
+library(knitr)
+library(ggplot2)
+library(kableExtra)
+library(dplyr)
Next, we need to set the API credentials. This function modifies the .Renviron
file to set your API key and base URL so that you can access Qualtrics programmatically.
The API key is as good as a password, so care should be taken to not share it publicly. For example, you would never want to save it in a script. The function below is the rare exception of code that should be run in the console and not saved. It works in a way that you only need to run it once, unless you are working on a new computer or your credentials changed. Note that in this book, we have not shared the actual API key, for the reasons outlined above. You should have an e-mail with the API key in it. Copy and paste it as a string to the api_key
argument in the function below:
qualtrics_api_credentials(api_key = "", base_url = "ucsb.co1.qualtrics.com", install = TRUE)
The .Renviron file is a special user controlled file that can create environment variables. Every time you open Rstudio, the variables in your environment file are loaded as…environment variables! Environment variables are named values that are accessible by your R process. They will not show up in your environment pane, but you can get a list of all of them using Sys.getenv()
. Many are system defaults.
To view or edit your .Renviron
file, you can use usethis::edit_r_environ()
.
To get a list of all the surveys in your Qualtrics instance, use the all_surveys
function.
surveys <- all_surveys()
kable(surveys) %>%
  kable_styling()
This function returns a list of surveys, in this case only one, and information about each, including an identifier and its name. We'll need that identifier later, so let's go ahead and extract it from the data frame using base R.
i <- which(surveys$name == "Survey for Data Science Training")
id <- surveys$id[i]
You can retrieve a list of the questions the survey asked using the survey_questions
function and the survey id
.
questions <- survey_questions(id)
kable(questions) %>%
  kable_styling()
This returns a data.frame
with one row per question with columns for question id, question name, question text, and whether the question was required. This is helpful to have as a reference for when you are looking at the full survey results.
To get the full survey results, run fetch_survey
with the survey id.
survey_results <- fetch_survey(id)
The survey results table has tons of information in it, not all of which will be relevant depending on your survey. The table has identifying information for the respondents (eg: ResponseID
, IPaddress
, RecipientEmail
, RecipientFirstName
, etc), much of which will be empty for this survey since it is anonymous. It also has information about the process of taking the survey, such as the StartDate
, EndDate
, Progress
, and Duration
. Finally, there are the answers to the questions asked, with columns labeled according to the qname
column in the questions table (eg: Q1, Q2, Q3). Depending on the type of question, some questions might have multiple columns associated with them. We’ll have a look at this more closely in a later example.
Let’s look at the responses to the second question in the survey, “How long have you been programming?” Remember, the first question was the consent question.
+We’ll use the dplyr
and tidyr
tools we learned earlier to extract the information. Here are the steps:
select the column we want (Q2)
group_by and summarize the values

q2 <- survey_results %>%
  select(Q2) %>%
  group_by(Q2) %>%
  summarise(n = n())
We can show these results in a table using the kable
function from the knitr
package:
kable(q2, col.names = c("How long have you been programming?",
                        "Number of responses")) %>%
  kable_styling()
For question 3, we’ll use a similar workflow. For this question, however there are two columns containing survey answers. One contains the answers from the controlled vocabulary, the other contains any free text answers users entered.
+To present this information, we’ll first show the results of the controlled answers as a plot. Below the plot, we’ll include a table showing all of the free text answers for the “other” option.
q3 <- survey_results %>%
  select(Q3) %>%
  group_by(Q3) %>%
  summarise(n = n())
ggplot(data = q3, mapping = aes(x = Q3, y = n)) +
  geom_col() +
  labs(x = "What language do you currently use most frequently?", y = "Number of responses") +
  theme_minimal()
Now we’ll extract the free text responses:
q3_text <- survey_results %>%
  select(Q3_7_TEXT) %>%
  drop_na()

kable(q3_text, col.names = c("Other responses to 'What language do you currently use most frequently?'")) %>%
  kable_styling()
The last question is just a free text question, so we can just display the results as is.
q4 <- survey_results %>%
  select(Q4) %>%
  rename(`What data science tool or language are you most excited to learn next?` = Q4) %>%
  drop_na()

kable(q4, col.names = "What data science tool or language are you most excited to learn next?") %>%
  kable_styling()
Google Forms can be a great way to set up surveys, and it is very easy to interact with the results using R. The benefits of using Google Forms are a simple interface and easy sharing between collaborators, especially when writing the survey instrument.
The downside is that Google Forms has far fewer features than Qualtrics in terms of survey flow and appearance.
+To show how we can link R into our survey workflows, I’ve set up a simple example survey here.
+I’ve set up the results so that they are in a new spreadsheet here:. To access them, we will use the googlesheets4
package.
First, open up a new R script and load the googlesheets4
library:
library(googlesheets4)
Next, we can read the sheet in using the same URL that you would use to share the sheet with someone else. Right now, this sheet is public
responses <- read_sheet("https://docs.google.com/spreadsheets/d/1CSG__ejXQNZdwXc1QK8dKouxphP520bjUOnZ5SzOVP8/edit?usp=sharing")
✔ Reading from "Example Survey Form (Responses)".
+✔ Range 'Form Responses 1'.
+The first time you run this, you should get a popup window in your web browser asking you to confirm that you want to provide access to your google sheets via the tidyverse (googlesheets) package.
+My dialog box looked like this:
+ +Make sure you click the third check box enabling the Tidyverse API to see, edit, create, and delete your sheets. Note that you will have to tell it to do any of these actions via the R code you write.
+When you come back to your R environment, you should have a data frame containing the data in your sheet! Let’s take a quick look at the structure of that sheet.
dplyr::glimpse(responses)
Rows: 10
+Columns: 5
+$ Timestamp <dttm> 2022-04-15 13:…
+$ `To what degree did the event meet your expectations?` <chr> "Met expectatio…
+$ `To what degree did your knowledge improve?` <chr> "Increase", "Si…
+$ `What did you like most about the event?` <chr> "the cool instr…
+$ `What might you change about the event?` <chr> "more snacks", …
+So, now that we have the data in a standard R data.frame
, we can easily summarize it and plot results. By default, the column names in the sheet are the long fully descriptive questions that were asked, which can be hard to type. We can save those questions into a vector for later reference, like when we want to use the question text for plot titles.
questions <- colnames(responses)[2:5]
dplyr::glimpse(questions)
chr [1:4] "To what degree did the event meet your expectations?" ...
+We can make the responses data frame more compact by renaming the columns of the vector with short numbered names of the form Q1
. Note that, by using a sequence, this should work for sheets from just a few columns to many hundreds of columns, and provides a consistent question naming convention.
names(questions) <- paste0("Q", seq(1:4))
+colnames(responses) <- c("Timestamp", names(questions))
dplyr::glimpse(responses)
Rows: 10
+Columns: 5
+$ Timestamp <dttm> 2022-04-15 13:48:58, 2022-04-15 13:49:43, 2022-04-15 13:50:…
+$ Q1 <chr> "Met expectations", "Above expectations", "Above expectation…
+$ Q2 <chr> "Increase", "Significant increase", "Significant increase", …
+$ Q3 <chr> "the cool instructors", "R is rad!", "everything", "the pizz…
+$ Q4 <chr> "more snacks", "no pineapple pizza!", "nothing", "needs more…
+Now that we’ve renamed our columns, let’s summarize the responses for the first question. We can use the same pattern that we usually do to split the data from Q1 into groups, then summarize it by counting the number of records in each group, and then merge the count of each group back together into a summarized data frame. We can then plot the Q1 results using ggplot
:
q1 <- responses %>%
  dplyr::select(Q1) %>%
  dplyr::group_by(Q1) %>%
  dplyr::summarise(n = dplyr::n())

ggplot2::ggplot(data = q1, mapping = aes(x = Q1, y = n)) +
  geom_col() +
  labs(x = questions[1],
       y = "Number of responses",
       title = "To what degree did the course meet expectations?") +
  theme_minimal()
If you don’t want to go through a little interactive dialog every time you read in a sheet, and your sheet is public, you can run the function gs4_deauth()
to access the sheet as a public user. This is helpful for cases when you want to run your code non-interactively. This is actually how I set it up for this book to build!
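A minimal sketch of that non-interactive setup: call gs4_deauth() before reading, and googlesheets4 will skip the authentication dialog entirely (this only works because the sheet is shared publicly).
library(googlesheets4)

gs4_deauth()  # access sheets as a public, unauthenticated user

responses <- read_sheet("https://docs.google.com/spreadsheets/d/1CSG__ejXQNZdwXc1QK8dKouxphP520bjUOnZ5SzOVP8/edit?usp=sharing")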
Now that you have some background in accessing survey data from common tools, let’s do a quick exercise with Google Sheets. First, create a google sheet with the following columns that reflect a hypothetical survey result:
+Next populate the spreadhsheet with 5 to 10 rows of sample data that you make up. Now that you have the Google sheet in place, copy its URL and use it to do the following in R:
+googlesheets
packageSimilar to Qualtrics and qualtRics, there is an open source R package for working with data in Survey Monkey: Rmonkey. However, the last updates were made 5 years ago, an eternity in the software world, so it may or may not still function as intended.
+There are also commercial options available. For example, cdata have a driver and R package that enable access to an analysis of Survey Monkey data through R.
+ + +Add material
+ + + +In this lesson we will:
+Shiny is an R package for creating interactive data visualizations embedded in a web application that you and your colleagues can view with just a web browser. Shiny apps are relatively easy to construct, and provide interactive features for letting others share and explore data and analyses.
+There are some really great examples of what Shiny can do on the RStudio webite like this one exploring movie metadata. A more scientific example is a tool from the SASAP project exploring proposal data from the Alaska Board of Fisheries. There is also an app for Delta monitoring efforts.
+ +Most any kind of analysis and visualization that you can do in R can be turned into a useful interactive visualization for the web that lets people explore your data more intuitively But, a Shiny application is not the best way to preserve or archive your data. Instead, for preservation use a repository that is archival in its mission like the KNB Data Repository, Zenodo, or Dryad. This will assign a citable identifier to the specific version of your data, which you can then read in an interactive visualiztion with Shiny.
+For example, the data for the Alaska Board of Fisheries application is published on the KNB and is citable as:
+Meagan Krupa, Molly Cunfer, and Jeanette Clark. 2017. Alaska Board of Fisheries Proposals 1959-2016. Knowledge Network for Biocomplexity. doi:10.5063/F1QN652R.
+While that is the best citation and archival location of the dataset, using Shiny, one can also provide an easy-to-use exploratory web application that you use to make your point that directly loads the data from the archival site. For example, the Board of Fisheries application above lets people who are not inherently familiar with the data to generate graphs showing the relationships between the variables in the dataset.
+We’re going to create a simple shiny app with two sliders so we can interactively control inputs to an R function. These sliders will allow us to interactively control a plot.
+RStudio will create a new file called app.R
that contains the Shiny application.
+Run it by choosing Run App
from the RStudio editor header bar. This will bring up the default demo Shiny application, which plots a histogram and lets you control the number of bins in the plot.
Note that you can drag the slider to change the number of bins in the histogram.
+A Shiny application consists of two functions, the ui
and the server
. The ui
function is responsible for drawing the web page, while the server
is responsible for any calculations and for creating any dynamic components to be rendered.
Each time that a user makes a change to one of the interactive widgets, the ui
grabs the new value (say, the new slider min and max) and sends a request to the server
to re-render the output, passing it the new input
values that the user had set. These interactions can sometimes happen on one computer (e.g., if the application is running in your local RStudio instance). Other times, the ui
runs on the web browser on one computer, while the server
runs on a remote computer somewhere else on the Internet (e.g., if the application is deployed to a web server).
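To make that division of labor concrete, here is a minimal sketch of the structure (not the demo app, and not the app we build below): a ui with one slider and one plot placeholder, and a server that re-renders the plot whenever the slider value changes.
library(shiny)

# ui: lays out the page and its interactive widgets
ui <- fluidPage(
    sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
    plotOutput("scatter")
)

# server: re-runs this code and redraws the plot each time input$n changes
server <- function(input, output) {
    output$scatter <- renderPlot({
        plot(rnorm(input$n), rnorm(input$n))
    })
}

# shinyApp() ties the two together into a runnable application
shinyApp(ui = ui, server = server)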
Let’s modify this application to plot Yolo bypass secchi disk data in a time-series, and allow aspects of the plot to be interactively changed.
+Use this code to load the data at the top of your app.R
script. Note we are using contentId
again, and we have filtered for some species of interest.
library(shiny)
+library(contentid)
+library(dplyr)
+library(ggplot2)
+library(lubridate)
+
sha1 <- 'hash://sha1/317d7f840e598f5f3be732ab0e04f00a8051c6d0'
delta.file <- contentid::resolve(sha1, registries=c("dataone"), store = TRUE)

# fix the sample date format, and filter for species of interest
delta_data <- read.csv(delta.file) %>%
    mutate(SampleDate = mdy(SampleDate)) %>%
    filter(grepl("Salmon|Striped Bass|Smelt|Sturgeon", CommonName))

names(delta_data)
We know there has been a lot of variation through time in the delta, so let’s plot a time-series of Secchi depth. We do so by switching out the histogram code for a simple ggplot, like so:
server <- function(input, output) {

    output$distPlot <- renderPlot({

        ggplot(delta_data, mapping = aes(SampleDate, Secchi)) +
            geom_point(colour = "red", size = 4) +
            theme_light()
    })
}
If you now reload the app, it will display the simple time-series instead of the histogram. At this point, we haven’t added any interactivity.
+In a Shiny application, the server
function provides the part of the application that creates our interactive components, and returns them to the user interface (ui
) to be displayed on the page.
To make the plot interactive, first we need to modify our user interface to include widgits that we’ll use to control the plot. Specifically, we will add a new slider for setting the minDate
parameter, and modify the existing slider to be used for the maxDate
parameter. To do so, modify the sidebarPanel()
call to include two sliderInput()
function calls:
sidebarPanel(
+sliderInput("minDate",
+ "Min Date:",
+ min = as.Date("1998-01-01"),
+ max = as.Date("2020-01-01"),
+ value = as.Date("1998-01-01")),
+ sliderInput("maxDate",
+ "Max Date:",
+ min = as.Date("1998-01-01"),
+ max = as.Date("2020-01-01"),
+ value = as.Date("2005-01-01"))
+ )
If you reload the app, you’ll see two new sliders, but if you change them, they don’t make any changes to the plot. Let’s fix that.
+Finally, to make the plot interactive, we can use the input
and output
variables that are passed into the server
function to access the current values of the sliders. In Shiny, each UI component is given an input identifier when it is created, which is used as the name of the value in the input list. So, we can access the minimum depth as input$minDate
and the max as input$maxDate
. Let’s use these values now by adding limits to our X axis in the ggplot:
ggplot(delta_data, mapping = aes(SampleDate, Secchi)) +
+ geom_point(colour="red", size=4) +
+ xlim(c(input$minDate,input$maxDate)) +
+ theme_light()
At this point, we have a fully interactive plot, and the sliders can be used to change the min and max of the Depth axis.
+ +Looks so shiny!
+What happens if a clever user sets the minimum for the X axis at a greater value than the maximum? You’ll see that the direction of the X axis becomes reversed, and the plotted points display right to left. This is really an error condition. Rather than use two independent sliders, we can modify the first slider to output a range of values, which will prevent the min from being greater than the max. You do so by setting the value of the slider to a vector of length 2, representing the default min and max date for the slider, such as c(as.Date("1998-01-01"), as.Date("2020-01-01"))
. So, delete the second slider, rename the first, and provide a vector for the value, like this:
sliderInput("date",
+"Date:",
+ min = as.Date("1998-01-01"),
+ max = as.Date("2020-01-01"),
+ value = c(as.Date("1998-01-01"), as.Date("2020-01-01")))
+ )
Now, modify the ggplot to use this new date
slider value, which now will be returned as a vector of length 2. The first element of the depth vector is the min, and the second is the max value on the slider.
ggplot(delta_data, mapping = aes(SampleDate, Secchi)) +
+ geom_point(colour="red", size=4) +
+ xlim(c(input$date[1],input$date[2])) +
+ theme_light()
If you want to display more than one plot in your application, and provide a different set of controls for each plot, the current layout would be too simple. Next we will extend the application to break the page up into vertical sections, and add a new plot in which the user can choose which variables are plotted. The current layout is set up such that the FluidPage
contains the title element, and then a sidebarLayout
, which is divided horizontally into a sidebarPanel
and a mainPanel
.
To extend the layout, we will first nest the existing sidebarLayout
in a new verticalLayout
, which simply flows components down the page vertically. Then we will add a new sidebarLayout
to contain the bottom controls and graph.
This mechanism of alternately nesting vertical and horizontal panels can be used to segment the screen into boxes with rules about how each of the panels is resized, and how the content flows when the browser window is resized. The sidebarLayout
works to keep the sidebar about 1/3 of the box, and the main panel about 2/3, which is a good proportion for our controls and plots. Add the verticalLayout, and the second sidebarLayout for the second plot as follows:
verticalLayout(
    # Sidebar with a slider input for the date axis
    sidebarLayout(
        sidebarPanel(
            sliderInput("date",
                        "Date:",
                        min = as.Date("1998-01-01"),
                        max = as.Date("2020-01-01"),
                        value = c(as.Date("1998-01-01"), as.Date("2020-01-01")))
        ),
        # Show a plot of the generated distribution
        mainPanel(
            plotOutput("distPlot")
        )
    ),

    tags$hr(),

    sidebarLayout(
        sidebarPanel(
            selectInput("x_variable", "X Variable", cols, selected = "SampleDate"),
            selectInput("y_variable", "Y Variable", cols, selected = "Count"),
            selectInput("color_variable", "Color", cols, selected = "CommonName")
        ),
        # Show a plot with configurable axes
        mainPanel(
            plotOutput("varPlot")
        )
    ),

    tags$hr()
Note that the second sidebarPanel uses three selectInput elements to provide dropdown menus populated with the variable columns (cols) from our data frame. To manage that, we first need to set up the cols variable, which we do by saving the variable names from the delta_data data frame:
sha1 <- 'hash://sha1/317d7f840e598f5f3be732ab0e04f00a8051c6d0'
delta.file <- contentid::resolve(sha1, registries = c("dataone"), store = TRUE)

# fix the sample date format, and filter for species of interest
delta_data <- read.csv(delta.file) %>%
    mutate(SampleDate = mdy(SampleDate)) %>%
    filter(grepl("Salmon|Striped Bass|Smelt|Sturgeon", CommonName))

cols <- names(delta_data)
Because we named the second plot varPlot in our UI section, we now need to modify the server to produce this plot. It's very similar to the first plot, but this time we want to use the variables selected in the user controls to choose which columns are plotted. These variable names from input are character strings, so they would not be recognized as symbols in the aes mapping in ggplot. As recommended by the tidyverse authors, we use the non-standard evaluation syntax .data[["colname"]] to access the variables.
output$varPlot <- renderPlot({
    ggplot(delta_data, aes(x = .data[[input$x_variable]],
                           y = .data[[input$y_variable]],
                           color = .data[[input$color_variable]])) +
        geom_point(size = 4) +
        theme_light()
})
Citing the data that we used for this application is the right thing to do, and it's easy. You can add arbitrary HTML to the layout using the utility functions in the tags list.
# Application title
titlePanel("Yolo Bypass Fish and Water Quality Data"),
p("Data for this application are from: "),
tags$ul(
    tags$li("Interagency Ecological Program: Fish catch and water quality data from the Sacramento River floodplain and tidal slough, collected by the Yolo Bypass Fish Monitoring Program, 1998-2018.",
            tags$a("doi:10.6073/pasta/b0b15aef7f3b52d2c5adc10004c05a6f", href = "http://doi.org/10.6073/pasta/b0b15aef7f3b52d2c5adc10004c05a6f")
    )
),
tags$br(),
tags$hr(),
The final application shows the data citation, the depth plot, and the configurable scatterplot in three distinct panels.
Once you've finished your app, you'll want to share it with others. To do so, you need to publish it to a server that is set up to host Shiny apps.
Your main choices are shinyapps.io (a hosted service from RStudio), Shiny Server (an open source server you can run on your own hardware), and RStudio Connect (a commercial publishing platform).
A comparison of publishing features is available from RStudio.
The easiest path is to create an account on shinyapps.io, and then configure RStudio to use that account for publishing. Instructions for enabling your local RStudio to publish to your account are displayed when you first log into shinyapps.io.
Once your account is configured locally, you can simply use the Publish button from the application window in RStudio, and your app will be live before you know it!
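If you prefer to script the deployment instead of clicking Publish, the rsconnect package that RStudio uses under the hood can be called directly. A minimal sketch, assuming you have copied your token and secret from the shinyapps.io dashboard (the account name, token, secret, and app name below are placeholders):

library(rsconnect)

# register your shinyapps.io account locally (placeholder values; copy the
# real token and secret from the shinyapps.io Tokens page)
setAccountInfo(name   = "my-account",
               token  = "MY_TOKEN",
               secret = "MY_SECRET")

# deploy the app in the current directory (expects an app.R file)
deployApp(appDir = ".", appName = "delta-fish-explorer")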
Shiny is a fantastic way to quickly and efficiently build interactive exploration of your data and code. We highly recommend it for its interactivity, but an archival-quality repository is still the best long-term home for your data and products. In this example, our Shiny app drew its data directly from the EDI repository, which gives us both the preservation guarantees of an archive and the interactive exploration of Shiny, while letting us use the full power of R and the tidyverse in the application.
library(shiny)
library(contentid)
library(dplyr)
library(ggplot2)
library(lubridate)

# read in the data from EDI
sha1 <- 'hash://sha1/317d7f840e598f5f3be732ab0e04f00a8051c6d0'
delta.file <- contentid::resolve(sha1, registries = c("dataone"), store = TRUE)

# fix the sample date format, and filter for species of interest
delta_data <- read.csv(delta.file) %>%
    mutate(SampleDate = mdy(SampleDate)) %>%
    filter(grepl("Salmon|Striped Bass|Smelt|Sturgeon", CommonName))

cols <- names(delta_data)

# Define UI for application that draws two plots
ui <- fluidPage(

    # Application title and data source
    titlePanel("Sacramento River floodplain fish and water quality data"),
    p("Data for this application are from: "),
    tags$ul(
        tags$li("Interagency Ecological Program: Fish catch and water quality data from the Sacramento River floodplain and tidal slough, collected by the Yolo Bypass Fish Monitoring Program, 1998-2018.",
                tags$a("doi:10.6073/pasta/b0b15aef7f3b52d2c5adc10004c05a6f", href = "http://doi.org/10.6073/pasta/b0b15aef7f3b52d2c5adc10004c05a6f")
        )
    ),
    tags$br(),
    tags$hr(),

    verticalLayout(
        # Sidebar with a slider input for the time axis
        sidebarLayout(
            sidebarPanel(
                sliderInput("date",
                            "Date:",
                            min = as.Date("1998-01-01"),
                            max = as.Date("2020-01-01"),
                            value = c(as.Date("1998-01-01"), as.Date("2020-01-01")))
            ),
            # Show a plot of the generated timeseries
            mainPanel(
                plotOutput("distPlot")
            )
        ),

        tags$hr(),

        sidebarLayout(
            sidebarPanel(
                selectInput("x_variable", "X Variable", cols, selected = "SampleDate"),
                selectInput("y_variable", "Y Variable", cols, selected = "Count"),
                selectInput("color_variable", "Color", cols, selected = "CommonName")
            ),
            # Show a plot with configurable axes
            mainPanel(
                plotOutput("varPlot")
            )
        ),

        tags$hr()
    )
)

# Define server logic required to draw the two plots
server <- function(input, output) {

    # turbidity plot
    output$distPlot <- renderPlot({
        ggplot(delta_data, mapping = aes(SampleDate, Secchi)) +
            geom_point(colour = "red", size = 4) +
            xlim(c(input$date[1], input$date[2])) +
            theme_light()
    })

    # mix and match plot
    output$varPlot <- renderPlot({
        ggplot(delta_data, aes(x = .data[[input$x_variable]],
                               y = .data[[input$y_variable]],
                               color = .data[[input$color_variable]])) +
            geom_point(size = 4) +
            theme_light()
    })
}

# Run the application
shinyApp(ui = ui, server = server)
library(shiny)
library(contentid)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(shinythemes)
library(sf)
library(leaflet)
library(snakecase)

# read in the data from EDI
sha1 <- 'hash://sha1/317d7f840e598f5f3be732ab0e04f00a8051c6d0'
delta.file <- contentid::resolve(sha1, registries = c("dataone"), store = TRUE)

# fix the sample date format, and filter for species of interest
delta_data <- read.csv(delta.file) %>%
    mutate(SampleDate = mdy(SampleDate)) %>%
    filter(grepl("Salmon|Striped Bass|Smelt|Sturgeon", CommonName)) %>%
    rename(DissolvedOxygen = DO,
           Ph = pH,
           SpecificConductivity = SpCnd)

cols <- names(delta_data)

sites <- delta_data %>%
    distinct(StationCode, Latitude, Longitude) %>%
    drop_na() %>%
    st_as_sf(coords = c('Longitude', 'Latitude'), crs = 4269, remove = FALSE)

# Define UI for application
ui <- fluidPage(
    navbarPage(theme = shinytheme("flatly"), collapsible = TRUE,
               HTML('<a style="text-decoration:none;cursor:default;color:#FFFFFF;" class="active" href="#">Sacramento River Floodplain Data</a>'), id = "nav",
               windowTitle = "Sacramento River floodplain fish and water quality data",

               tabPanel("Data Sources",
                        verticalLayout(
                            # Application title and data source
                            titlePanel("Sacramento River floodplain fish and water quality data"),
                            p("Data for this application are from: "),
                            tags$ul(
                                tags$li("Interagency Ecological Program: Fish catch and water quality data from the Sacramento River floodplain and tidal slough, collected by the Yolo Bypass Fish Monitoring Program, 1998-2018.",
                                        tags$a("doi:10.6073/pasta/b0b15aef7f3b52d2c5adc10004c05a6f", href = "http://doi.org/10.6073/pasta/b0b15aef7f3b52d2c5adc10004c05a6f")
                                )
                            ),
                            tags$br(),
                            tags$hr(),
                            p("Map of sampling locations"),
                            mainPanel(leafletOutput("map"))
                        )
               ),

               tabPanel("Explore",
                        verticalLayout(
                            mainPanel(
                                plotOutput("distPlot"),
                                width = 12,
                                absolutePanel(id = "controls",
                                              class = "panel panel-default",
                                              top = 175, left = 75, width = 300, fixed = TRUE,
                                              draggable = TRUE, height = "auto",
                                              sliderInput("date",
                                                          "Date:",
                                                          min = as.Date("1998-01-01"),
                                                          max = as.Date("2020-01-01"),
                                                          value = c(as.Date("1998-01-01"), as.Date("2020-01-01")))
                                )
                            ),

                            tags$hr(),

                            sidebarLayout(
                                sidebarPanel(
                                    selectInput("x_variable", "X Variable", cols, selected = "SampleDate"),
                                    selectInput("y_variable", "Y Variable", cols, selected = "Count"),
                                    selectInput("color_variable", "Color", cols, selected = "CommonName")
                                ),
                                # Show a plot with configurable axes
                                mainPanel(
                                    plotOutput("varPlot")
                                )
                            ),

                            tags$hr()
                        )
               )
    )
)

# Define server logic required to draw the two plots
server <- function(input, output) {

    # map of sampling stations
    output$map <- renderLeaflet({
        leaflet(sites) %>%
            addTiles() %>%
            addCircleMarkers(data = sites,
                             lat = ~Latitude,
                             lng = ~Longitude,
                             radius = 10, # arbitrary scaling
                             fillColor = "gray",
                             fillOpacity = 1,
                             weight = 0.25,
                             color = "black",
                             label = ~StationCode)
    })

    # turbidity plot
    output$distPlot <- renderPlot({
        ggplot(delta_data, mapping = aes(SampleDate, Secchi)) +
            geom_point(colour = "red", size = 4) +
            xlim(c(input$date[1], input$date[2])) +
            labs(x = "Sample Date", y = "Secchi Depth (m)") +
            theme_light()
    })

    # mix and match plot
    output$varPlot <- renderPlot({
        ggplot(delta_data, mapping = aes(x = .data[[input$x_variable]],
                                         y = .data[[input$y_variable]],
                                         color = .data[[input$color_variable]])) +
            labs(x = to_any_case(input$x_variable, case = "title"),
                 y = to_any_case(input$y_variable, case = "title"),
                 color = to_any_case(input$color_variable, case = "title")) +
            geom_point(size = 4) +
            theme_light()
    })
}

# Run the application
shinyApp(ui = ui, server = server)
In March 2023, GitHub announced that it will require 2FA for “all developers who contribute code on GitHub.com” (GitHub Blog). This rollout will be completed by the end of 2023.
All users have the flexibility to use their preferred 2FA method, including TOTP apps, SMS, security keys, or the GitHub Mobile app. GitHub strongly recommends using security keys or TOTP apps. While SMS-based 2FA is available, it does not provide the same level of protection and is no longer recommended under NIST (National Institute of Standards and Technology) 800-63B.
GitHub outlines the setup steps in its article Configuring two-factor authentication.
| Term | Definition |
|---|---|
| Quick Response (QR) Code | A type of two-dimensional matrix barcode that contains specific information |
| Recovery Code | A unique code(s) used to reset passwords or regain access to accounts |
| Short Message Service (SMS) | A text messaging service that allows mobile devices to exchange short text messages |
| Time-based One-time Password (TOTP) | A string of unique codes that changes based on time; often these appear as six-digit numbers that regenerate every 30 seconds |
| Two-factor Authentication (2FA) | An identity and access management security method that requires two forms of identification to access accounts, resources, or data |