diff --git a/.github/ISSUE_TEMPLATE.md b/.github/ISSUE_TEMPLATE.md new file mode 100644 index 000000000..6cc9e527e --- /dev/null +++ b/.github/ISSUE_TEMPLATE.md @@ -0,0 +1,9 @@ +Please delete the text below before submitting your contribution. + +--- + +Thanks for contributing! If this contribution is for instructor training, please send an email to checkout@carpentries.org with a link to this contribution so we can record your progress. You’ve completed your contribution step for instructor checkout just by submitting this contribution. + +Please keep in mind that lesson maintainers are volunteers and it may be some time before they can respond to your contribution. Although not all contributions can be incorporated into the lesson materials, we appreciate your time and effort to improve the curriculum. If you have any questions about the lesson maintenance process or would like to volunteer your time as a contribution reviewer, please contact Kate Hertweck (k8hertweck@gmail.com). + +--- diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md new file mode 100644 index 000000000..6cc9e527e --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -0,0 +1,9 @@ +Please delete the text below before submitting your contribution. + +--- + +Thanks for contributing! If this contribution is for instructor training, please send an email to checkout@carpentries.org with a link to this contribution so we can record your progress. You’ve completed your contribution step for instructor checkout just by submitting this contribution. + +Please keep in mind that lesson maintainers are volunteers and it may be some time before they can respond to your contribution. Although not all contributions can be incorporated into the lesson materials, we appreciate your time and effort to improve the curriculum. 
If you have any questions about the lesson maintenance process or would like to volunteer your time as a contribution reviewer, please contact Kate Hertweck (k8hertweck@gmail.com). + +--- diff --git a/.mailmap b/.mailmap index 72cfe3c75..90d6dd459 100644 --- a/.mailmap +++ b/.mailmap @@ -1,83 +1,44 @@ -Aron Ahmadia -Matthew Aiello-Lammens -Joshua Ainsley -James Allen +Abigail Cabunoc Mayes +Abigail Cabunoc Mayes +Andy Boughton +Andy Teucher Areej Alsheikh-Hussain -Paula Andrea -Jeffrey Arnold -Alex Bajcz -Piotr Banaszkiewicz +Beth Signal +Daniel Chen +Daniel Turek Diego Barneche -Greg Bass -Trevor Bekolay -Mik Black -John Blischak -Andy Boughton -Karl Broman +Donna Henderson Eric Bruger -Abigail Cabunoc Mayes -Scott Chamberlain -Daniel Chen -Harriet Dashnow -Matt Dickenson -Alastair Droop -Jonah Duckles -Rémi Emonet -Marianna Foos -Auriel Fournier -David Fredman -Javier García-Algarra -Noushin Ghaffari -Heather Gibling -Jeremy Gray -Jessica Guo -Melissa Guzman -Denis Haine -Michael Hansen +Eric Milliman +Evan P. Williamson Fabian Held -Donna Henderson -James Hiebert -Jeff Hollister -Mike Jackson -Elsie Jacobson -W. 
Trevor King -Michael Levy -Mark Mandel -Carlos Martinez -Ben Marwick +Félix-Antoine Fortin François Michonneau -James Mickley -Eric Milliman -Bill Mills +François Michonneau +Ge Baolai +Greg Bass +Greg Wilson +James Allen +Javier García-Algarra +Jeffrey Arnold Joaquin Moris -Hani Nakhoul -Matthias Nilsson -Aaron O'Leary -Frank Pennekamp -Raissa Philibert -Jon Pipitone -Timothée Poisot -Louis Ranjard -Joey Reid -Scott Ritchie -Natalie Robinson -Michael Sachs -Pat Schloss -Peter Schmiedeskamp -Beth Signal -Raniere Silva -Gavin Simpson -Karthik Srinivasan Joseph Stachelek -Valentina Staneva -Sarah Stevens +Kara Woo +Karthik Srinivasan +Kate Hertweck +Louis Ranjard +Melissa Guzman Michael Sumner +Mik Black +Mike Jackson +Natalie Robinson +Noushin Ghaffari +Noushin Ghaffari +Raniere Silva +Raniere Silva +Rémi Emonet +Rémi Emonet Sarah Supp -Andy Teucher -Daniel Turek -Stephen Turner -Lukas Weber -Greg Wilson -Kara Woo -Tom Wright -Naupaka Zimmerman +Scott Chamberlain +Timothée Poisot +Valentina Staneva diff --git a/AUTHORS b/AUTHORS index 4e9e1a276..1b4a766f9 100644 --- a/AUTHORS +++ b/AUTHORS @@ -7,6 +7,7 @@ Paula Andrea Jeffrey Arnold Alex Bajcz Piotr Banaszkiewicz +Ge Baolai Diego Barneche Greg Bass Trevor Bekolay @@ -19,11 +20,13 @@ Abigail Cabunoc Mayes Scott Chamberlain Daniel Chen Harriet Dashnow +Gabriel A. Devenyi Matt Dickenson Alastair Droop Jonah Duckles Rémi Emonet Marianna Foos +Félix-Antoine Fortin Auriel Fournier David Fredman Javier García-Algarra @@ -36,10 +39,13 @@ Denis Haine Michael Hansen Fabian Held Donna Henderson +Kate Hertweck James Hiebert Jeff Hollister Mike Jackson Elsie Jacobson +Zbigniew Jędrzejewski-Szmek +Jonathan Keane W. Trevor King Michael Levy Mark Mandel @@ -77,6 +83,7 @@ Andy Teucher Daniel Turek Stephen Turner Lukas Weber +Evan P. Williamson Greg Wilson Kara Woo Tom Wright diff --git a/CONDUCT.md b/CONDUCT.md index e83b08fa9..5e4943b4c 100644 --- a/CONDUCT.md +++ b/CONDUCT.md @@ -32,13 +32,14 @@ or other unprofessional conduct. 
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions -that are not aligned to this Code of Conduct. +that are not aligned to our [Code of Conduct][coc]. Project maintainers who do not follow the Code of Conduct may be removed from the project team. Instances of abusive, harassing, or otherwise unacceptable behavior -may be reported by opening an issue or contacting one or more of the project maintainers. +may be reported by following our [reporting guidelines][coc-reporting]. -This Code of Conduct is adapted from -the [Contributor Covenant][contrib-covenant] Version 1.0.0. -[contrib-covenant]: http://contributor-covenant.org/ +- [Software and Data Carpentry Code of Conduct][coc] +- [Code of Conduct Reporting Guide][coc-reporting] + +{% include links.md %} diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 4ec80f76c..9de5a12a3 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -46,8 +46,8 @@ and to meet some of our community members. ## Where to Contribute 1. If you wish to change this lesson, - please work in , - which can be viewed at . + please work in , + which can be viewed at . 2. If you wish to change the example lesson, please work in , @@ -140,13 +140,12 @@ You can also [reach us by email][contact]. 
[dc-lessons]: http://datacarpentry.org/lessons/ [dc-site]: http://datacarpentry.org/ [discuss-list]: http://lists.software-carpentry.org/listinfo/discuss -[example-site]: https://swcarpentry.github.io/lesson-example/ [github]: http://github.com [github-flow]: https://guides.github.com/introduction/flow/ [github-join]: https://github.com/join [how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github -[issues]: https://github.com/swcarpentry/r-novice-inflammation/issues/ -[repo]: https://github.com/swcarpentry/r-novice-inflammation/ +[issues]: https://github.com/swcarpentry/FIXME/issues/ +[repo]: https://github.com/swcarpentry/FIXME/ [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry [swc-lessons]: http://software-carpentry.org/lessons/ [swc-site]: http://software-carpentry.org/ diff --git a/LICENSE.md b/LICENSE.md index 566ce5533..179758a7e 100644 --- a/LICENSE.md +++ b/LICENSE.md @@ -73,7 +73,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ## Trademark -"Software Carpentry" an "Data Carpentry" and their respective logos +"Software Carpentry" and "Data Carpentry" and their respective logos are registered trademarks of [NumFOCUS][numfocus]. [cc-by-human]: https://creativecommons.org/licenses/by/4.0/ diff --git a/Makefile b/Makefile index 0f395a310..b5dfe2fa4 100644 --- a/Makefile +++ b/Makefile @@ -9,6 +9,7 @@ DST=_site # Controls .PHONY : commands clean files +.NOTPARALLEL: all : commands ## commands : show all commands. @@ -16,11 +17,11 @@ commands : @grep -h -E '^##' ${MAKEFILES} | sed -e 's/## //g' ## serve : run a local server. -serve : lesson-rmd +serve : lesson-md ${JEKYLL} serve ## site : build files but do not run a server. -site : lesson-rmd +site : lesson-md ${JEKYLL} build # repo-check : check repository settings. @@ -53,7 +54,7 @@ workshop-check : ## ---------------------------------------- ## Commands specific to lesson websites. 
-.PHONY : lesson-check lesson-rmd lesson-files lesson-fixme +.PHONY : lesson-check lesson-md lesson-files lesson-fixme # RMarkdown files RMD_SRC = $(wildcard _episodes_rmd/??-*.Rmd) @@ -79,13 +80,16 @@ HTML_DST = \ $(patsubst _extras/%.md,${DST}/%/index.html,$(wildcard _extras/*.md)) \ ${DST}/license/index.html -## lesson-rmd : convert Rmarkdown files to markdown -lesson-rmd: $(RMD_SRC) - @bin/knit_lessons.sh $(RMD_SRC) +## lesson-md : convert Rmarkdown files to markdown +lesson-md : ${RMD_DST} + +# Use of .NOTPARALLEL makes rule execute only once +${RMD_DST} : ${RMD_SRC} + @bin/knit_lessons.sh ${RMD_SRC} ## lesson-check : validate lesson Markdown. lesson-check : - @bin/lesson_check.py -s . -p ${PARSER} + @bin/lesson_check.py -s . -p ${PARSER} -r _includes/links.md ## lesson-check-all : validate lesson Markdown, checking line lengths and trailing whitespace. lesson-check-all : diff --git a/README.md b/README.md index baa72e61d..54b32ba9c 100644 --- a/README.md +++ b/README.md @@ -11,13 +11,13 @@ Maintainers: The goal of this lesson is to teach novice programmers to write modular code to perform a data analysis. R is used to teach these skills because it is a commonly used programming language in many scientific disciplines. However, the -emphasis is not on teaching every aspect of R, but instead the focus is on +emphasis is not on teaching every aspect of R, but instead on language agnostic principles like automation with loops and encapsulation with functions (see [Best Practices for Scientific Computing][best-practices] to -learn more). In fact, this lesson is a translation of the [Python version][py], -and the lesson is also available in [MATLAB][]. +learn more). This lesson is a translation of the [Python version][py], +and is also available in [MATLAB][MATLAB]. 
-The example used in this lesson is analyzing a set of 12 data files with +The example used in this lesson analyzes a set of 12 data files with inflammation data collected from a trial for a new treatment for arthritis (the data was simulated). Learners are shown how it is better to create a function and apply it to each of the 12 files using a loop instead of using copy-paste @@ -42,7 +42,7 @@ To view how the changes will look, when viewed in a web browser, you can render Once you've made your edits and rendered the corresponding html files, you need to add, commit, and push just the source R Markdown file(s) -and any supporting files (e.g. data files). Changes generated by the `make preview` command should not be committed or included in a pull request. These changes will be taken care off by the lesson maintainer when the PR is merged. +and any supporting files (e.g. data files). Changes generated by the `make preview` command should not be committed or included in a pull request. These changes will be taken care of by the lesson maintainer when the PR is merged. ## Getting Help diff --git a/_episodes/01-starting-with-data.md b/_episodes/01-starting-with-data.md index fc9cfd005..bfb20f7a4 100644 --- a/_episodes/01-starting-with-data.md +++ b/_episodes/01-starting-with-data.md @@ -33,7 +33,7 @@ keypoints: We are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets. -The data sets are stored in [comma-separated values]({{ page.root }}/reference/#comma-separated-values-(csv)) (CSV) format. Each row holds the observations for just one patient. Each column holds the inflammation measured in a day, so we have a set of values in successive days. +The data sets are stored in [comma-separated values]({{ page.root }}/reference/#comma-separated-values-csv) (CSV) format. Each row holds the observations for just one patient. 
Each column holds the inflammation measured in a day, so we have a set of values in successive days. The first few rows of our first file look like this: @@ -56,7 +56,7 @@ To do all that, we'll have to learn a little bit about programming. ### Loading Data -To load our inflammation data, first we need to tell our computer where is the file that contains the values. We have been told its name is `inflammation-01.csv`. This is very important in R, if we forget this step we’ll get an error message when trying to read the file. We can change the current working directory using the function `setwd`. For this example, we change the path to the directory we just created: +Let's import the file called `inflammation-01.csv` into our R environment. To import the file, first we need to tell our computer where the file is. We do that by choosing a working directory, that is, a local directory on our computer containing the files we need. This is very important in R. If we forget this step, we’ll get an error message saying that the file does not exist. We can set the working directory using the function `setwd`. For this example, we change the path to our new directory on the desktop: ~~~ @@ -67,7 +67,7 @@ setwd("~/Desktop/r-novice-inflammation/") Just like in the Unix Shell, we type the command and then press `Enter` (or `return`). Alternatively you can change the working directory using the RStudio GUI using the menu option `Session` -> `Set Working Directory` -> `Choose Directory...` -The data files are located in the directory `data` inside the working directory. Now we can load the data into R using `read.csv`: +The data file is located in the directory `data` inside the working directory. 
Now we can load the data into R using `read.csv`: ~~~ @@ -104,16 +104,16 @@ The filename needs to be a character string (or [string]({{ page.root }}/referen > Take a look at `?read.csv` and write the code to load a file called `commadec.txt` that has numeric values with commas as decimal mark, separated by semicolons. {: .challenge} -The utility of a function is that it will perform its given action on whatever value is passed to the named argument(s). -For example, in this case if we provided the name of a different file to the argument `file`, `read.csv` would read it instead. -We'll learn more of the details about functions and their arguments in the next lesson. +A function will perform its given action on whatever value is passed to the argument(s). +For example, in this case if we provided the name of a different file to the argument `file`, `read.csv` would read that instead. +We'll learn more about the details of functions and their arguments in the next lesson. Since we didn't tell it to do anything else with the function's output, the console will display the full contents of the file `inflammation-01.csv`. Try it out. -`read.csv` read the file, but we can't use data unless we assign it to a variable. -A variable is just a name for a value, such as `x`, `current_temperature`, or `subject_id`. -We can create a new variable simply by assigning a value to it using `<-` +`read.csv` reads the file, but we can't use data unless we assign it to a variable. +We can think of a variable as a container with a name, such as `x`, `current_temperature`, or `subject_id`, that contains one or more values. +We can create a new variable and assign a value to it using `<-` ~~~ @@ -137,7 +137,7 @@ weight_kg ~~~ {: .output} -We can do arithmetic with the variable: +We can do arithmetic on the variable: ~~~ @@ -160,7 +160,7 @@ We can do arithmetic with the variable: > read it) have an easier time following what the code is doing. 
{: .callout} -We can also change an object's value by assigning it a new value: +We can also change a variable's value by assigning it a new value: ~~~ @@ -176,6 +176,15 @@ weight_kg [1] 57.5 ~~~ {: .output} +> ## Variable Naming Conventions +> +> Historically, R programmers have used a variety of conventions for naming variables. The `.` character +> in R can be a valid part of a variable name; thus the above assignment could have easily been `weight.kg <- 57.5`. +> This is often confusing to R newcomers who have programmed in languages where `.` has a more significant meaning. +> Today, most R programmers 1) start variable names with lowercase letters, 2) separate words in variable names with +> underscores, and 3) use only lowercase letters, underscores, and numbers in variable names. The book *R Packages* includes +> a [chapter](http://r-pkgs.had.co.nz/style.html) on this and other style considerations. +{: .callout} If we imagine the variable as a sticky note with a name written on it, assignment is like putting the sticky note on a particular value: @@ -262,7 +271,7 @@ This is different from the way spreadsheets work. > and finally prints the assigned value of the variable `total_weight`. {: .callout} -Now that we know how to assign things to variables, let's re-run `read.csv` and save its result: +Now that we know how to assign things to variables, let's re-run `read.csv` and save its result into a variable called `dat`: ~~~ @@ -270,9 +279,9 @@ dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE) ~~~ {: .r} -This statement doesn't produce any output because assignment doesn't display anything. -If we want to check that our data has been loaded, we can print the variable's value. -However, for large data sets it is convenient to use the function `head` to display only the first few rows of data. +This statement doesn't produce any output because the assignment doesn't display anything. 
+If we want to check if our data has been loaded, we can print the variable's value by typing the name of the variable `dat`. However, for large data sets it is convenient to use the function `head` to display only the first few rows of data. + ~~~ @@ -322,7 +331,7 @@ head(dat) ### Manipulating Data -Now that our data is loaded in memory, we can start doing things with it. +Now that our data are loaded into R, we can start doing things with them. First, let's ask what type of thing `dat` is: @@ -338,10 +347,10 @@ class(dat) ~~~ {: .output} -The output tells us that is a data frame. Think of this structure as a spreadsheet in MS Excel that many of us are familiar with. -Data frames are very useful for storing data and you will find them elsewhere when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns. +The output tells us that it is a data frame. We can think of this as a spreadsheet in MS Excel, which many of us are familiar with. +Data frames are very useful for organizing data and you will find them elsewhere when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns. -We can see the shape, or [dimensions]({{ page.root }}/reference/#dimensions), of the data frame with the function `dim`: +We can see the shape, or [dimensions]({{ page.root }}/reference/#dimensions-of-an-array), of the data frame with the function `dim`: ~~~ @@ -358,7 +367,7 @@ dim(dat) This tells us that our data frame, `dat`, has 60 rows and 40 columns. -If we want to get a single value from the data frame, we can provide an [index]({{ page.root }}/reference/#index) in square brackets, just as we do in math: +If we want to get a single value from the data frame, we can provide an [index]({{ page.root }}/reference/#index) in square brackets. 
The first number specifies the row and the second the column: ~~~ @@ -496,7 +505,7 @@ dat[, 16] > You can learn more about subsetting by column name in this supplementary [lesson]({{ page.root }}/10-supp-addressing-data/). {: .callout} -Now let's perform some common mathematical operations to learn about our inflammation data. +Now let's perform some common mathematical operations to learn more about our inflammation data. When analyzing data we often want to look at partial statistics, such as the maximum value per patient or the average value per day. One way to do this is to select the data we want to create a new temporary data frame, and then perform the calculation on this subset: @@ -612,6 +621,31 @@ sd(dat[, 7]) ~~~ {: .output} +R also has a function that summarizes the common calculations shown above: + + +~~~ +# Summarize function +summary(dat[,1:4]) +~~~ +{: .r} + + + +~~~ + V1 V2 V3 V4 + Min. :0 Min. :0.00 Min. :0.000 Min. :0.00 + 1st Qu.:0 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:1.00 + Median :0 Median :0.00 Median :1.000 Median :2.00 + Mean :0 Mean :0.45 Mean :1.117 Mean :1.75 + 3rd Qu.:0 3rd Qu.:1.00 3rd Qu.:2.000 3rd Qu.:3.00 + Max. :0 Max. :1.00 Max. :2.000 Max. :3.00 +~~~ +{: .output} + +For every column in the data frame, the function `summary` calculates the minimum value, the first quartile, the median, the mean, the third quartile, and the maximum value, giving helpful details about the sample distribution. + + What if we need the maximum inflammation for all patients, or the average for each day? As the diagram below shows, we want to perform the operation across a margin of the data frame: @@ -658,31 +692,31 @@ We'll learn why this is so in the next lesson. > A subsection of a data frame is called a [slice]({{ page.root }}/reference/#slice). 
> We can take slices of character vectors as well: > -> +> > ~~~ > animal <- c("m", "o", "n", "k", "e", "y") > # first three characters > animal[1:3] > ~~~ > {: .r} -> -> -> +> +> +> > ~~~ > [1] "m" "o" "n" > ~~~ > {: .output} -> -> -> +> +> +> > ~~~ > # last three characters > animal[4:6] > ~~~ > {: .r} -> -> -> +> +> +> > ~~~ > [1] "k" "e" "y" > ~~~ @@ -708,6 +742,14 @@ We'll learn why this is so in the next lesson. > 2. `max(dat[3:7, 5])` > 3. `max(dat[5, 3:7])` > 4. `max(dat[5, 3, 7])` +> +> > ## Solution +> > +> > Answer: 3 +> > +> > Explanation: You want to extract the part of the data frame representing data for patient 5 from days three to seven. In this data frame, each patient is represented by a row and each day by a column. Subscripting in R follows the `[i, j]` principle, where `i` selects rows and `j` selects columns. Thus, answer 3 is correct since the patient is represented by the value for `i` (5) and the days are represented by the values in `j`, which is a slice spanning days 3 to 7. +> > +> {: .solution} {: .challenge} > ## Slicing and Re-Assignment @@ -724,7 +766,7 @@ We'll learn why this is so in the next lesson. > > whichPatients <- seq(2,40,2) > > whichDays <- c(1:5) > > dat2 <- dat -> > dat2[whichPatients,whichDays] <- dat2[whichPatients,whichDays]/2 +> > dat2[whichDays, whichPatients] <- dat2[whichDays, whichPatients]/2 > > (dat2) > > ~~~ > > {: .r} diff --git a/_episodes/02-func-R.md b/_episodes/02-func-R.md index b36fb75a8..5e8d8f29e 100644 --- a/_episodes/02-func-R.md +++ b/_episodes/02-func-R.md @@ -171,7 +171,7 @@ Real-life functions will usually be larger than the ones shown here--typically h > e.g. `x <- c("A", "B", "C")` creates a vector `x` with three elements. > Furthermore, we can extend that vector again using `c`, e.g. `y <- c(x, "D")` creates a vector `y` with four elements. 
> Write a function called `fence` that takes two vectors as arguments, called -> original` and `wrapper`, and returns a new vector that has the wrapper vector +> `original` and `wrapper`, and returns a new vector that has the wrapper vector > at the beginning and end of the original: > > @@ -248,7 +248,7 @@ Real-life functions will usually be larger than the ones shown here--typically h > 2. 11 > 3. 23 > 4. 30 -> 2. If mySum(3) == 13, why does mySum(b=3) return an error? +> 2. If `mySum(3)` returns 13, why does `mySum(input_2 = 3)` return an error? {: .challenge} ### Testing and Documenting diff --git a/_episodes/03-loops-R.md b/_episodes/03-loops-R.md index ff82f3dc2..bbd821d73 100644 --- a/_episodes/03-loops-R.md +++ b/_episodes/03-loops-R.md @@ -163,7 +163,7 @@ print_words(best_practice) ~~~ {: .output} -This is shorter---certainly shorter than something that prints every character in a hundred-letter string---and more robust as well: +This is shorter - certainly shorter than something that prints every character in a hundred-letter string - and more robust as well: ~~~ @@ -195,7 +195,8 @@ for (variable in collection) { We can name the [loop variable]({{ page.root }}/reference/#loop-variable) anything we like (with a few [restrictions][], e.g. the name of the variable cannot start with a digit). `in` is part of the `for` syntax. -Note that the body of the loop is enclosed in curly braces `{ }`. +Note that the condition (`variable in collection`) is enclosed in parentheses, +and the body of the loop is enclosed in curly braces `{ }`. For a single-line loop body, as here, the braces aren't needed, but it is good practice to include them as we did. 
[restrictions]: http://cran.r-project.org/doc/manuals/R-intro.html#R-commands_003b-case-sensitivity-etc diff --git a/_episodes/04-cond.md b/_episodes/04-cond.md index 4a9daca18..a026faaac 100644 --- a/_episodes/04-cond.md +++ b/_episodes/04-cond.md @@ -220,7 +220,7 @@ sign(2/3) ~~~ {: .output} -Note that when combining `else` and `if` in an `else if` statement (similar to `elif` in Python), the `if` portion still requires a direct input condition. This is never the case for the `else` statement alone, which is only executed if all other conditions go unsatisfied. +Note that when combining `else` and `if` in an `else if` statement, the `if` portion still requires a direct input condition. This is never the case for the `else` statement alone, which is only executed if all other conditions go unsatisfied. Note that the test for equality uses two equal signs, `==`. > ## Other Comparisons @@ -564,7 +564,7 @@ Now we can save all of the results with just one line of code: ~~~ -analyze_all("inflammation*.csv") +analyze_all("inflammation.*csv") ~~~ {: .r} diff --git a/_episodes/05-cmdline.md b/_episodes/05-cmdline.md index 57d4d9ec9..7913c0ec2 100644 --- a/_episodes/05-cmdline.md +++ b/_episodes/05-cmdline.md @@ -50,7 +50,7 @@ $ Rscript readings.R --max data/inflammation-*.csv Our overall requirements are: -1. If no filename is given on the command line, read data from [standard input]({{ page.root }}/reference/#standard-input). +1. If no filename is given on the command line, read data from [standard input]({{ page.root }}/reference/#standard-input-stdin). 2. If one or more filenames are given, read data from them and report statistics for each file separately. 3. Use the `--min`, `--mean`, or `--max` flag to determine what statistic to print. 
@@ -82,11 +82,17 @@ Rscript session-info.R ~~~ -R version 3.1.2 (2014-10-31) -Platform: x86_64-apple-darwin13.4.0 (64-bit) +R version 3.3.3 (2017-03-06) +Platform: x86_64-pc-linux-gnu (64-bit) +Running under: Antergos Linux locale: -[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8 + [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C + [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 + [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 + [7] LC_PAPER=en_US.UTF-8 LC_NAME=C + [9] LC_ADDRESS=C LC_TELEPHONE=C +[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets base @@ -112,7 +118,7 @@ cat(args, sep = "\n") The function `commandArgs` extracts all the command line arguments and returns them as a vector. The function `cat`, similar to the `cat` of the Unix Shell, outputs the contents of the variable. -Since we did not specify a filename for writing, `cat` sends the output to [standard output]({{ page.root }}/reference/#standard-output-(stdout)), +Since we did not specify a filename for writing, `cat` sends the output to [standard output]({{ page.root }}/reference/#standard-output-stdout), which we can then pipe to other Unix functions. Because we set the argument `sep` to `"\n"`, which is the symbol to start a new line, each element of the vector is printed on its own line. 
Let's see what happens when we run this program in the Unix Shell: @@ -127,7 +133,7 @@ Rscript print-args.R ~~~ -/Library/Frameworks/R.framework/Resources/bin/exec/R +/usr/lib64/R/bin/exec/R --slave --no-restore --file=print-args.R @@ -156,7 +162,7 @@ R --slave --no-restore --file=print-args.R --args ~~~ -/Library/Frameworks/R.framework/Resources/bin/exec/R +/usr/lib64/R/bin/exec/R --slave --no-restore --file=print-args.R @@ -176,7 +182,7 @@ Rscript print-args.R first second third ~~~ -/Library/Frameworks/R.framework/Resources/bin/exec/R +/usr/lib64/R/bin/exec/R --slave --no-restore --file=print-args.R diff --git a/_episodes/06-best-practices-R.md b/_episodes/06-best-practices-R.md index 1d4827369..1cdd17c80 100644 --- a/_episodes/06-best-practices-R.md +++ b/_episodes/06-best-practices-R.md @@ -7,7 +7,7 @@ questions: objectives: - "Define best formatting practices when writing code in R scripts." - "Synthesize a consistent personal coding style to increase code readability, consistency, and repeatability." -- "Apply this style to one's own code." +- "Apply this style to your own code." keypoints: - "Start each program with a description of what it does." - "Then load all required packages." @@ -19,7 +19,7 @@ keypoints: - "Factor out common operations rather than repeating them." - "Keep all of the source files for a project in one directory and use relative paths to access them." - "Keep track of the memory used by your program." -- "Always start with a clean environment instead of saving session history." +- "Always start with a clean environment instead of saving the workspace." - "Keep track of session information in your project folder." - "Have someone else review your code." - "Use version control." @@ -27,7 +27,7 @@ keypoints: -1. Start your code with an annotated description of what the code does when it is run: +1. Start your code with an annotated description of what the code does: ~~~ @@ -36,7 +36,7 @@ keypoints: ~~~ {: .r} -2. 
Next, load all of the packages that will be necessary to run your code (using `library`): +2. Next, load all of the packages needed to run your code (using `library()`): ~~~ @@ -46,14 +46,17 @@ library(vegan) ~~~ {: .r} +If you use only one or two functions from a package, it is sometimes useful to note that fact in a comment, e.g. `library(reshape2) ## for melt()` + 3. Set your working directory before `source()`ing a script, or start `R` inside your project folder: -One should exercise caution when using `setwd()`. Changing directories in a script file can limit reproducibility: +Exercise caution when using `setwd()`. Changing directories in a script file can limit reproducibility: + -* `setwd()` will return an error if the directory to which you're trying to change doesn't exit or if the user doesn't have the correct permissions to access that directory. This becomes a problem when sharing scripts between users who have organized their directories differently. -* If/when your script terminates with an error, you might leave the user in a different directory than the one they started in, and if they then call the script again, this will cause further problems. If you must use `setwd()`, it is best to put it at the top of the script to avoid these problems. +* `setwd()` will return an error if the directory you're trying to change to doesn't exist or if the user doesn't have the correct permissions to access that directory. This becomes a problem when sharing scripts between users who have organized their directories differently. +* If/when your script terminates with an error, you might leave the user in a different directory than the one they started in, which will cause further problems if they then call the script again. If you must use `setwd()`, it is best to put it at the top of the script to avoid these problems. 
Putting a commented-out `setwd()` call at the top of your code can be a reasonable compromise: it reminds you where on your machine your material is living, is easy to copy-and-paste if necessary, but doesn't commit other users. -The following error message indicates that R has failed to set the working directory you specified: +This error message indicates that R has failed to set the working directory you specified: ``` Error in setwd("~/path/to/working/directory") : cannot change working directory @@ -63,7 +66,7 @@ It is best practice to have the user running the script begin in a consistent di 4. Annotate and mark your code using `#` or `#-` to set off sections of your code and to make finding specific parts of your code easier. -5. If you create only one or a few custom functions in your script, put them toward the top of your code so they are among the first objects created. If you have written many functions, put them all in their own .R file and then `source` those files. `source` will define all of these functions so that your code can make use of them as needed. For the reasons listed above, try to avoid using `setwd()` (or other functions that have side-effects in the user's workspace) in scripts you `source`. +5. If you create only a few custom functions in your script, put them toward the top of your code so they are among the first objects created. If you have written many functions, put them all in their own .R file and then `source()` those files. `source()` will define all of these functions so that your code can use them as needed. For the reasons listed above, avoid using `setwd()` (or other functions that have side-effects in the user's workspace) in scripts you `source()`. ~~~ @@ -71,7 +74,7 @@ source("my_genius_fxns.R") ~~~ {: .r} -6. Use a consistent style within your code. For example, name all matrices something ending in `.mat`. Consistency makes code easier to read and problems easier to spot. +6. Use a consistent style within your code. 
For example, name all matrices something ending in `.mat`. Indent consistently and decide on a scheme for multi-word variable names (e.g. `resource.use`, `resource_use`, or `resourceUse`). Consistency makes code easier to read and problems easier to spot. 7. Keep your code in bite-sized chunks. If a single function or loop gets too long, consider looking for ways to break it into smaller pieces. @@ -93,22 +96,22 @@ dat <- read.csv(file = "/Users/Karthik/Documents/sannic-project/files/dataset-20 ~~~ {: .r} -10. R can run into memory issues. It is a common problem to run out of memory after running R scripts for a long time. To inspect the objects in your current R environment, you can list the objects, search current packages, and remove objects that are currently not in use. A good practice when running long lines of computationally intensive code is to remove temporary objects after they have served their purpose. However, sometimes, R will not clean up unused memory for a while after you delete objects. You can force R to tidy up its memory by using `gc()`. +10. R can run into memory issues. R scripts that run for a long time often run out of memory. To inspect the objects in your current R environment, you can list the objects, search current packages, and remove objects that are currently not in use. A good practice when running long chunks of computationally intensive code is to remove temporary objects after they have served their purpose. However, R will not always clean up unused memory immediately after you delete objects. You can force R to tidy up its memory by using `gc()`. 
~~~ interim_object <- data.frame(rep(1:100,10),rep(101:200,10),rep(201:300,10)) # Sample dataset of 1000 rows object.size(interim_object) # Reports the memory size allocated to the object -rm(interim_object) # Removes only the object itself and not necessarily the memory allotted to it +rm("interim_object") # Removes only the object itself and not necessarily the memory allotted to it gc() # Force R to release memory it is no longer using ls() # Lists all the objects in your current workspace rm(list = ls()) # If you want to delete all the objects in the workspace and start with a clean slate ~~~ {: .r} -11. Don't save a session history (the default option in R, when it asks if you want an `RData` file). Instead, start in a clean environment so that older objects don't remain in your environment any longer than they need to. If that happens, it can lead to unexpected results. +11. Don't save your workspace (the default option in R, when it asks if you want to "Save workspace image [y/n/c]?"). Instead, start in a clean workspace without old objects cluttering it. Leftover objects from previous sessions can lead to unexpected, hard-to-debug results. Do *not* put `rm(list=ls())` (which removes all objects in your current workspace, as shown in the previous code example) at the top of your code, as this is a trap for other users who might `source()` or copy-and-paste your code in the course of their R session. Instead, restart R when you want to start fresh. -12. Wherever possible, keep track of `sessionInfo()` somewhere in your project folder. Session information is invaluable because it captures all of the packages used in the current project. If a newer version of a package changes the way a function behaves, you can always go back and reinstall the version that worked (Note: At least on CRAN, all older versions of packages are permanently archived). +12. Wherever possible, keep track of `sessionInfo()` somewhere in your project folder. 
Session information is invaluable because it captures all of the packages used in the current project. If a newer version of a package changes the way a function behaves, you can always go back and reinstall the version that worked (Note: At least on CRAN, all older versions of packages are permanently archived). For more complex projects, you may want to use the [packrat](https://CRAN.R-project.org/package=packrat) package. 13. Collaborate. Grab a buddy and practice "code review". Review is used for preparing experiments and manuscripts; why not use it for code as well? Our code is also a major scientific achievement and the product of lots of hard work! diff --git a/_episodes/08-making-packages-R.md b/_episodes/08-making-packages-R.md index 4ecdbfb0d..3f399c273 100644 --- a/_episodes/08-making-packages-R.md +++ b/_episodes/08-making-packages-R.md @@ -6,7 +6,10 @@ questions: - "How do I collect my code together so I can reuse it and share it?" - "How do I make my own packages?" objectives: -- "Quick summary on how (and why) making R packages." +- "Describe the required structure of R packages." +- "Create the required structure of a simple R package." +- "Write documentation comments that can be automatically compiled to R's native help and documentation format." + keypoints: - "A package is the basic unit of reusability in R." - "Every package must have a DESCRIPTION file and an R directory containing code." @@ -117,16 +120,16 @@ Place each function into a separate R script and add documentation like this: ~~~ -#' Convert Fahrenheit to Kelvin +#' Converts Fahrenheit to Kelvin #' #' This function converts input temperatures in Fahrenheit to Kelvin. -#' @param temp The input temperature. +#' @param temp The temperature in Fahrenheit. +#' @return The temperature in Kelvin. 
#' @export #' @examples #' fahr_to_kelvin(32) fahr_to_kelvin <- function(temp) { - #Converts Fahrenheit to Kelvin kelvin <- ((temp - 32) * (5/9)) + 273.15 kelvin } diff --git a/_episodes/11-supp-read-write-csv.md b/_episodes/11-supp-read-write-csv.md index 3f94d90fa..66685fbad 100644 --- a/_episodes/11-supp-read-write-csv.md +++ b/_episodes/11-supp-read-write-csv.md @@ -19,7 +19,7 @@ keypoints: The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), -users often find it easier to save their spreadsheets in [comma-separated values]({{ page.root }}/reference/#comma-separated-values-(csv)) files (CSV) +users often find it easier to save their spreadsheets in [comma-separated values]({{ page.root }}/reference/#comma-separated-values-csv) files (CSV) and then use R's built in functionality to read and manipulate the data. In this short lesson, we'll learn how to read data from a .csv and write to a new .csv, and explore the [arguments]({{ page.root }}/reference/#argument) that allow you read and write the data correctly for your needs. diff --git a/_episodes/13-supp-data-structures.md b/_episodes/13-supp-data-structures.md index 2fd97960f..15e0978fa 100644 --- a/_episodes/13-supp-data-structures.md +++ b/_episodes/13-supp-data-structures.md @@ -1,21 +1,21 @@ --- title: "Data Types and Structures" +keypoints: +- R's basic data types are character, numeric, integer, complex, and logical. +- R's basic data structures include the vector, list, matrix, data frame, and factors. +- Objects may have attributes, such as name, dimension, and class. +objectives: +- Expose learners to the different data types in R. +- Learn how to create vectors of different types. +- Be able to check the type of vector. +- Learn about missing data and other special values. +- Getting familiar with the different data structures (lists, matrices, data frames). 
+questions: +- What are the different data types in R? +- What are the different data structures in R? +- How do I access data within the various data structures? teaching: 45 exercises: 0 -questions: -- "What are the different data types in R?" -- "What are the different data structures in R?" -- "How do I access data within the various data structures?" -objectives: -- "Expose learners to the different data types in R." -- "Learn how to create vectors of different types." -- "Be able to check the type of vector." -- "Learn about missing data and other special values." -- "Getting familiar with the different data structures (lists, matrices, data frames)." -keypoints: -- "R's basic data types are character, numeric, integer, complex, and logical." -- "R's basic data structures include the vector, list, matrix, data frame, and factors." -- "Objects may have attributes, such as name, dimension, and class." --- @@ -727,6 +727,21 @@ mdat ~~~ {: .output} +Elements of a matrix can be referenced by specifying the index along each dimension (e.g. "row" and "column") in single square brackets. + + +~~~ +mdat[2,3] +~~~ +{: .r} + + + +~~~ +[1] 13 +~~~ +{: .output} + ### List In R lists act as containers. Unlike atomic vectors, the contents of a list are @@ -779,6 +794,7 @@ length(x) ~~~ {: .output} +The content of elements of a list can be retrieved by using double square brackets. ~~~ @@ -793,6 +809,7 @@ NULL ~~~ {: .output} +Vectors can be coerced to lists as follows: ~~~ @@ -813,6 +830,9 @@ length(x) 2. What about `x[[1]]`? +Elements of a list can be named (i.e. lists can have the `names` attribute): + + ~~~ xlist <- list(a = "Karthik Ram", b = 1:10, data = head(iris)) xlist @@ -839,16 +859,30 @@ $data ~~~ {: .output} + + +~~~ +names(xlist) +~~~ +{: .r} + + + +~~~ +[1] "a" "b" "data" +~~~ +{: .output} + 1. What is the length of this object? What about its structure? -Lists can be extremely useful inside functions.
You can “staple” together lots +Lists can be extremely useful inside functions. Because the functions in R are able to return only a single object, you can "staple" together lots of different kinds of results into a single object that a function can return. A list does not print to the console like a vector. Instead, each element of the list starts on a new line. Elements are indexed by double brackets. Single brackets will still return -a(nother) list. +a(nother) list. If the elements of a list are named, they can be referenced by the `$` notation (i.e. `xlist$data`). ### Data Frame @@ -856,7 +890,7 @@ a(nother) list. A data frame is a very important data type in R. It's pretty much the *de facto* data structure for most tabular data and what we use for statistics. -A data frame is a special type of list where every element of the list has same length. +A data frame is a *special type of list* where every element of the list has same length (i.e. data frame is a "rectangular" list). Data frames can have additional attributes such as `rownames()`, which can be useful for annotating data, like `subject_id` or `sample_id`. But most of the @@ -864,12 +898,11 @@ time they are not used. Some additional information on data frames: -* Usually created by `read.csv()` and `read.table()`. -* Can convert to matrix with `data.matrix()` (preferred) or `as.matrix()` -* Coercion will be forced and not always what you expect. -* Can also create with `data.frame()` function. +* Usually created by `read.csv()` and `read.table()`, i.e. when importing the data into R. +* Assuming all columns in a data frame are of same type, data frame can be converted to a matrix with data.matrix() (preferred) or as.matrix(). Otherwise type coercion will be enforced and the results may not always be what you expect. +* Can also create a new data frame with `data.frame()` function. * Find the number of rows and columns with `nrow(dat)` and `ncol(dat)`, respectively. 
-* Rownames are usually 1, 2, ..., n. +* Rownames are often automatically generated and look like 1, 2, ..., n. Consistency in numbering of rownames may not be honored when rows are reshuffled or subset. ### Creating Data Frames by Hand @@ -901,20 +934,21 @@ dat > ## Useful Data Frame Functions > -> * `head()` - shown first 6 rows -> * `tail()` - show last 6 rows -> * `dim()` - returns the dimensions +> * `head()` - shows first 6 rows +> * `tail()` - shows last 6 rows +> * `dim()` - returns the dimensions of data frame (i.e. number of rows and number of columns) > * `nrow()` - number of rows > * `ncol()` - number of columns -> * `str()` - structure of each column +> * `str()` - structure of data frame - name, type and preview of data in each column > * `names()` - shows the `names` attribute for a data frame, which gives the column names. +> * `sapply(dataframe, class)` - shows the class of each column in the data frame {: .callout} See that it is actually a special list: ~~~ -is.list(iris) +is.list(dat) ~~~ {: .r} @@ -928,7 +962,7 @@ is.list(iris) ~~~ -class(iris) +class(dat) ~~~ {: .r} @@ -939,17 +973,66 @@ class(iris) ~~~ {: .output} +Because data frames are rectangular, elements of data frame can be referenced by specifying the row and the column index in single square brackets (similar to matrix). + + +~~~ +dat[1,3] +~~~ +{: .r} + + + +~~~ +[1] 11 +~~~ +{: .output} + +As data frames are also lists, it is possible to refer to columns (which are elements of such list) using the list notation, i.e. either double square brackets or a `$`. + + +~~~ +dat[["y"]] +~~~ +{: .r} + + + +~~~ + [1] 11 12 13 14 15 16 17 18 19 20 +~~~ +{: .output} + + + +~~~ +dat$y +~~~ +{: .r} + + + +~~~ + [1] 11 12 13 14 15 16 17 18 19 20 +~~~ +{: .output} + +The following table summarizes the one-dimensional and two-dimensional data structures in R in relation to diversity of data types they can contain. 
+ | Dimensions | Homogenous | Heterogeneous | | ------- | ---- | ---- | | 1-D | atomic vector | list | | 2-D | matrix | data frame | +> Lists can contain elements that are themselves multi-dimensional (e.g. a list can contain data frames or other types of objects). Lists can also contain elements of any length, so lists do not necessarily have to be "rectangular". However, for a list to qualify as a data frame, the length of each element has to be the same. +{: .callout} + > ## Column Types in Data Frames > -> Knowing that data frames are lists of lists, can columns be of different type? +> Knowing that data frames are lists, can columns be of different type? > -> What type of structure do you expect on the iris data frame? Hint: Use `str()`. +> What type of structure do you expect to see when you explore the structure of the `iris` data frame? Hint: Use `str()`. > > ~~~ > # The Sepal.Length, Sepal.Width, Petal.Length and Petal.Width columns are all diff --git a/_episodes/15-supp-loops-in-depth.md b/_episodes/15-supp-loops-in-depth.md index 74f501e98..51a3a2985 100644 --- a/_episodes/15-supp-loops-in-depth.md +++ b/_episodes/15-supp-loops-in-depth.md @@ -219,7 +219,7 @@ system.time(avg2 <- analyze2(filenames)) ~~~ user system elapsed - 0.076 0.004 0.149 + 0.038 0.000 0.039 ~~~ {: .output} @@ -249,7 +249,7 @@ system.time(avg3 <- analyze3(filenames)) ~~~ user system elapsed - 0.050 0.002 0.054 + 0.039 0.000 0.039 ~~~ {: .output} diff --git a/_episodes_rmd/01-starting-with-data.Rmd b/_episodes_rmd/01-starting-with-data.Rmd index 15cf73c6c..dce58167f 100644 --- a/_episodes_rmd/01-starting-with-data.Rmd +++ b/_episodes_rmd/01-starting-with-data.Rmd @@ -36,7 +36,7 @@ knitr_fig_path("01-starting-with-data-") We are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets.
-The data sets are stored in [comma-separated values]({{ page.root }}/reference/#comma-separated-values-(csv)) (CSV) format. Each row holds the observations for just one patient. Each column holds the inflammation measured in a day, so we have a set of values in successive days. +The data sets are stored in [comma-separated values]({{ page.root }}/reference/#comma-separated-values-csv) (CSV) format. Each row holds the observations for just one patient. Each column holds the inflammation measured in a day, so we have a set of values in successive days. The first few rows of our first file look like this: ```{r echo = FALSE} @@ -55,7 +55,7 @@ To do all that, we'll have to learn a little bit about programming. ### Loading Data -To load our inflammation data, first we need to tell our computer where is the file that contains the values. We have been told its name is `inflammation-01.csv`. This is very important in R, if we forget this step we’ll get an error message when trying to read the file. We can change the current working directory using the function `setwd`. For this example, we change the path to the directory we just created: +Let's import the file called `inflammation-01.csv` into our R environment. To import the file, first we need to tell our computer where the file is. We do that by choosing a working directory, that is, a local directory on our computer containing the files we need. This is very important in R. If we forget this step we’ll get an error message saying that the file does not exist. We can set the working directory using the function `setwd`. For this example, we change the path to our new directory at the desktop: ```{r,eval=FALSE} setwd("~/Desktop/r-novice-inflammation/") @@ -64,7 +64,7 @@ setwd("~/Desktop/r-novice-inflammation/") Just like in the Unix Shell, we type the command and then press `Enter` (or `return`).
Alternatively you can change the working directory using the RStudio GUI using the menu option `Session` -> `Set Working Directory` -> `Choose Directory...` -The data files are located in the directory `data` inside the working directory. Now we can load the data into R using `read.csv`: +The data file is located in the directory `data` inside the working directory. Now we can load the data into R using `read.csv`: ```{r, results="hide"} read.csv(file = "data/inflammation-01.csv", header = FALSE) @@ -99,16 +99,16 @@ The filename needs to be a character string (or [string]({{ page.root }}/referen > Take a look at `?read.csv` and write the code to load a file called `commadec.txt` that has numeric values with commas as decimal mark, separated by semicolons. {: .challenge} -The utility of a function is that it will perform its given action on whatever value is passed to the named argument(s). -For example, in this case if we provided the name of a different file to the argument `file`, `read.csv` would read it instead. -We'll learn more of the details about functions and their arguments in the next lesson. +A function will perform its given action on whatever value is passed to the argument(s). +For example, in this case if we provided the name of a different file to the argument `file`, `read.csv` would read that instead. +We'll learn more about the details of functions and their arguments in the next lesson. Since we didn't tell it to do anything else with the function's output, the console will display the full contents of the file `inflammation-01.csv`. Try it out. -`read.csv` read the file, but we can't use data unless we assign it to a variable. -A variable is just a name for a value, such as `x`, `current_temperature`, or `subject_id`. -We can create a new variable simply by assigning a value to it using `<-` +`read.csv` reads the file, but we can't use data unless we assign it to a variable. 
+We can think of a variable as a container with a name, such as `x`, `current_temperature`, or `subject_id` that contains one or more values. +We can create a new variable and assign a value to it using `<-` ```{r} weight_kg <- 55 @@ -121,7 +121,7 @@ In general, R will print to the console any object returned by a function or ope weight_kg ``` -We can do arithmetic with the variable: +We can do arithmetics with the variable: ```{r} # weight in pounds: @@ -135,13 +135,22 @@ We can do arithmetic with the variable: > read it) have an easier time following what the code is doing. {: .callout} -We can also change an object's value by assigning it a new value: +We can also change a variable's value by assigning it a new value: ```{r} weight_kg <- 57.5 # weight in kilograms is now weight_kg ``` +> ## Variable Naming Conventions +> +> Historically, R programmers have used a variety of conventions for naming variables. The `.` character +> in R can be a valid part of a variable name; thus the above assignment could have easily been `weight.kg <- 57.5`. +> This is often confusing to R newcomers who have programmed in languages where `.` has a more significant meaning. +> Today, most R programmers 1) start variable names with lower case letters, 2) separate words in variable names with +> underscores, and 3) use only lowercase letters, underscores, and numbers in variable names. The book *R Packages* includes +> a [chapter](http://r-pkgs.had.co.nz/style.html) on this and other style considerations. +{: .callout} If we imagine the variable as a sticky note with a name written on it, assignment is like putting the sticky note on a particular value: @@ -184,15 +193,15 @@ This is different from the way spreadsheets work. > and finally prints the assigned value of the variable `total_weight`.
{: .callout} -Now that we know how to assign things to variables, let's re-run `read.csv` and save its result: +Now that we know how to assign things to variables, let's re-run `read.csv` and save its result into a variable called 'dat': ```{r} dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE) ``` -This statement doesn't produce any output because assignment doesn't display anything. -If we want to check that our data has been loaded, we can print the variable's value. -However, for large data sets it is convenient to use the function `head` to display only the first few rows of data. +This statement doesn't produce any output because the assignment doesn't display anything. +If we want to check if our data has been loaded, we can print the variable's value by typing the name of the variable `dat`. However, for large data sets it is convenient to use the function `head` to display only the first few rows of data. + ```{r} head(dat) @@ -213,17 +222,17 @@ head(dat) ### Manipulating Data -Now that our data is loaded in memory, we can start doing things with it. +Now that our data are loaded into R, we can start doing things with them. First, let's ask what type of thing `dat` is: ```{r} class(dat) ``` -The output tells us that is a data frame. Think of this structure as a spreadsheet in MS Excel that many of us are familiar with. -Data frames are very useful for storing data and you will find them elsewhere when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns. +The output tells us that it is a data frame. We can think of this as a spreadsheet in MS Excel, which many of us are familiar with. +Data frames are very useful for organizing data and you will find them elsewhere when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns. 
-We can see the shape, or [dimensions]({{ page.root }}/reference/#dimensions), of the data frame with the function `dim`: +We can see the shape, or [dimensions]({{ page.root }}/reference/#dimensions-of-an-array), of the data frame with the function `dim`: ```{r} dim(dat) @@ -231,7 +240,7 @@ dim(dat) This tells us that our data frame, `dat`, has `r nrow(dat)` rows and `r ncol(dat)` columns. -If we want to get a single value from the data frame, we can provide an [index]({{ page.root }}/reference/#index) in square brackets, just as we do in math: +If we want to get a single value from the data frame, we can provide an [index]({{ page.root }}/reference/#index) in square brackets. The first number specifies the row and the second the column: ```{r} # first value in dat @@ -277,7 +286,7 @@ dat[, 16] > You can learn more about subsetting by column name in this supplementary [lesson]({{ page.root }}/10-supp-addressing-data/). {: .callout} -Now let's perform some common mathematical operations to learn about our inflammation data. +Now let's perform some common mathematical operations to learn more about our inflammation data. When analyzing data we often want to look at partial statistics, such as the maximum value per patient or the average value per day. One way to do this is to select the data we want to create a new temporary data frame, and then perform the calculation on this subset: @@ -320,6 +329,15 @@ sd(dat[, 7]) > are already defined as vectors. {: .callout} +R also has a function that summarizes the previous common calculations: + +```{r} +# Summarize function +summary(dat[,1:4]) +``` + +For every column in the data frame, the function `summary` calculates the minimum value, the first quartile, the median, the mean, the third quartile and the maximum value, giving helpful details about the sample distribution. + What if we need the maximum inflammation for all patients, or the average for each day?
As the diagram below shows, we want to perform the operation across a margin of the data frame: @@ -390,6 +408,14 @@ We'll learn why this is so in the next lesson. > 2. `max(dat[3:7, 5])` > 3. `max(dat[5, 3:7])` > 4. `max(dat[5, 3, 7])` +> +> > ## Solution +> > +> > Answer: 3 +> > +> > Explanation: You want to extract the part of the data frame representing data for patient 5 from days three to seven. In this data frame, each patient's data is organised in a row and the days are represented by the columns. Subscripting in R follows the `[i, j]` principle, where `i` selects rows and `j` selects columns. Thus, answer 3 is correct since the patient is represented by the value for `i` (5) and the days are represented by the values in `j`, which is a slice spanning days 3 to 7. +> > +> {: .solution} {: .challenge} > ## Slicing and Re-Assignment @@ -406,7 +432,7 @@ We'll learn why this is so in the next lesson. > > whichPatients <- seq(2,40,2) > > whichDays <- c(1:5) > > dat2 <- dat -> > dat2[whichPatients,whichDays] <- dat2[whichPatients,whichDays]/2 +> > dat2[whichDays, whichPatients] <- dat2[whichDays, whichPatients]/2 > > (dat2) > > ~~~ > > {: .r} diff --git a/_episodes_rmd/02-func-R.Rmd b/_episodes_rmd/02-func-R.Rmd index 495796ad3..cd54443a0 100644 --- a/_episodes_rmd/02-func-R.Rmd +++ b/_episodes_rmd/02-func-R.Rmd @@ -123,7 +123,7 @@ Real-life functions will usually be larger than the ones shown here--typically h > e.g. `x <- c("A", "B", "C")` creates a vector `x` with three elements. > Furthermore, we can extend that vector again using `c`, e.g. `y <- c(x, "D")` creates a vector `y` with four elements.
> Write a function called `fence` that takes two vectors as arguments, called -> original` and `wrapper`, and returns a new vector that has the wrapper vector +> `original` and `wrapper`, and returns a new vector that has the wrapper vector > at the beginning and end of the original: > > ```{r, echo=-1} diff --git a/_episodes_rmd/03-loops-R.Rmd b/_episodes_rmd/03-loops-R.Rmd index 08c8a7d39..320f1693f 100644 --- a/_episodes_rmd/03-loops-R.Rmd +++ b/_episodes_rmd/03-loops-R.Rmd @@ -103,7 +103,7 @@ print_words <- function(sentence) { print_words(best_practice) ``` -This is shorter---certainly shorter than something that prints every character in a hundred-letter string---and more robust as well: +This is shorter - certainly shorter than something that prints every character in a hundred-letter string - and more robust as well: ```{r} print_words(best_practice[-6]) @@ -120,7 +120,8 @@ for (variable in collection) { We can name the [loop variable]({{ page.root }}/reference/#loop-variable) anything we like (with a few [restrictions][], e.g. the name of the variable cannot start with a digit). `in` is part of the `for` syntax. -Note that the body of the loop is enclosed in curly braces `{ }`. +Note that the condition (`variable in collection`) is enclosed in parentheses, +and the body of the loop is enclosed in curly braces `{ }`. For a single-line loop body, as here, the braces aren't needed, but it is good practice to include them as we did. [restrictions]: http://cran.r-project.org/doc/manuals/R-intro.html#R-commands_003b-case-sensitivity-etc diff --git a/_episodes_rmd/04-cond.Rmd b/_episodes_rmd/04-cond.Rmd index 9d596ef4e..30d90e594 100644 --- a/_episodes_rmd/04-cond.Rmd +++ b/_episodes_rmd/04-cond.Rmd @@ -152,7 +152,7 @@ sign(0) sign(2/3) ``` -Note that when combining `else` and `if` in an `else if` statement (similar to `elif` in Python), the `if` portion still requires a direct input condition. 
This is never the case for the `else` statement alone, which is only executed if all other conditions go unsatisfied. +Note that when combining `else` and `if` in an `else if` statement, the `if` portion still requires a direct input condition. This is never the case for the `else` statement alone, which is only executed if all other conditions go unsatisfied. Note that the test for equality uses two equal signs, `==`. > ## Other Comparisons @@ -411,7 +411,7 @@ analyze_all <- function(pattern) { Now we can save all of the results with just one line of code: ```{r} -analyze_all("inflammation*.csv") +analyze_all("inflammation.*csv") ``` Now if we need to make any changes to our analysis, we can edit the `analyze` function and quickly regenerate all the figures with `analyze_all`. diff --git a/_episodes_rmd/05-cmdline.Rmd b/_episodes_rmd/05-cmdline.Rmd index 02ea5c0c4..c24974bd0 100644 --- a/_episodes_rmd/05-cmdline.Rmd +++ b/_episodes_rmd/05-cmdline.Rmd @@ -52,7 +52,7 @@ $ Rscript readings.R --max data/inflammation-*.csv Our overall requirements are: -1. If no filename is given on the command line, read data from [standard input]({{ page.root }}/reference/#standard-input). +1. If no filename is given on the command line, read data from [standard input]({{ page.root }}/reference/#standard-input-stdin). 2. If one or more filenames are given, read data from them and report statistics for each file separately. 3. Use the `--min`, `--mean`, or `--max` flag to determine what statistic to print. @@ -92,7 +92,7 @@ cat print-args.R The function `commandArgs` extracts all the command line arguments and returns them as a vector. The function `cat`, similar to the `cat` of the Unix Shell, outputs the contents of the variable. 
-Since we did not specify a filename for writing, `cat` sends the output to [standard output]({{ page.root }}/reference/#standard-output-(stdout)), +Since we did not specify a filename for writing, `cat` sends the output to [standard output]({{ page.root }}/reference/#standard-output-stdout), which we can then pipe to other Unix functions. Because we set the argument `sep` to `"\n"`, which is the symbol to start a new line, each element of the vector is printed on its own line. Let's see what happens when we run this program in the Unix Shell: diff --git a/_episodes_rmd/06-best-practices-R.Rmd b/_episodes_rmd/06-best-practices-R.Rmd index dc7acbfd6..5808443ee 100644 --- a/_episodes_rmd/06-best-practices-R.Rmd +++ b/_episodes_rmd/06-best-practices-R.Rmd @@ -7,7 +7,7 @@ questions: objectives: - "Define best formatting practices when writing code in R scripts." - "Synthesize a consistent personal coding style to increase code readability, consistency, and repeatability." -- "Apply this style to one's own code." +- "Apply this style to your own code." keypoints: - "Start each program with a description of what it does." - "Then load all required packages." @@ -19,39 +19,42 @@ keypoints: - "Factor out common operations rather than repeating them." - "Keep all of the source files for a project in one directory and use relative paths to access them." - "Keep track of the memory used by your program." -- "Always start with a clean environment instead of saving session history." +- "Always start with a clean environment instead of saving the workspace." - "Keep track of session information in your project folder." - "Have someone else review your code." - "Use version control." --- -```{r, include = FALSE} +```{r source, include = FALSE} source("../bin/chunk-options.R") ``` -1. Start your code with an annotated description of what the code does when it is run: +1. 
Start your code with an annotated description of what the code does: -```{r} +```{r comment1} #This is code to replicate the analyses and figures from my 2014 Science paper. #Code developed by Sarah Supp, Tracy Teal, and Jon Borelli ``` -2. Next, load all of the packages that will be necessary to run your code (using `library`): +2. Next, load all of the packages needed to run your code (using `library()`): -```{r, eval=FALSE} +```{r loadpkgs, eval=FALSE} library(ggplot2) library(reshape) library(vegan) ``` +If you use only one or two functions from a package, it is sometimes useful to note that fact in a comment, e.g. `library(reshape2) ## for melt()` + 3. Set your working directory before `source()`ing a script, or start `R` inside your project folder: -One should exercise caution when using `setwd()`. Changing directories in a script file can limit reproducibility: +Exercise caution when using `setwd()`. Changing directories in a script file can limit reproducibility: + -* `setwd()` will return an error if the directory to which you're trying to change doesn't exit or if the user doesn't have the correct permissions to access that directory. This becomes a problem when sharing scripts between users who have organized their directories differently. -* If/when your script terminates with an error, you might leave the user in a different directory than the one they started in, and if they then call the script again, this will cause further problems. If you must use `setwd()`, it is best to put it at the top of the script to avoid these problems. +* `setwd()` will return an error if the directory you're trying to change to doesn't exist or if the user doesn't have the correct permissions to access that directory. This becomes a problem when sharing scripts between users who have organized their directories differently. 
+* If/when your script terminates with an error, you might leave the user in a different directory than the one they started in, which will cause further problems if they then call the script again. If you must use `setwd()`, it is best to put it at the top of the script to avoid these problems. Putting a commented-out `setwd()` call at the top of your code can be a reasonable compromise: it reminds you where on your machine your material is living, is easy to copy-and-paste if necessary, but doesn't commit other users. -The following error message indicates that R has failed to set the working directory you specified: +This error message indicates that R has failed to set the working directory you specified: ``` Error in setwd("~/path/to/working/directory") : cannot change working directory @@ -61,13 +64,13 @@ It is best practice to have the user running the script begin in a consistent di 4. Annotate and mark your code using `#` or `#-` to set off sections of your code and to make finding specific parts of your code easier. -5. If you create only one or a few custom functions in your script, put them toward the top of your code so they are among the first objects created. If you have written many functions, put them all in their own .R file and then `source` those files. `source` will define all of these functions so that your code can make use of them as needed. For the reasons listed above, try to avoid using `setwd()` (or other functions that have side-effects in the user's workspace) in scripts you `source`. +5. If you create only a few custom functions in your script, put them toward the top of your code so they are among the first objects created. If you have written many functions, put them all in their own .R file and then `source()` those files. `source()` will define all of these functions so that your code can use them as needed. 
For the reasons listed above, avoid using `setwd()` (or other functions that have side-effects in the user's workspace) in scripts you `source()`. -```{r, eval=FALSE} +```{r source_ex, eval=FALSE} source("my_genius_fxns.R") ``` -6. Use a consistent style within your code. For example, name all matrices something ending in `.mat`. Consistency makes code easier to read and problems easier to spot. +6. Use a consistent style within your code. For example, name all matrices something ending in `.mat`. Indent consistently and decide on a scheme for multi-word variable names (e.g. `resource.use`, `resource_use`, or `resourceUse`). Consistency makes code easier to read and problems easier to spot. 7. Keep your code in bite-sized chunks. If a single function or loop gets too long, consider looking for ways to break it into smaller pieces. @@ -75,30 +78,30 @@ source("my_genius_fxns.R") 9. Keep all of your source files for a project in the same directory, then use relative paths as necessary to access them. For example, use -```{r, eval=FALSE} +```{r relpath, eval=FALSE} dat <- read.csv(file = "files/dataset-2013-01.csv", header = TRUE) ``` rather than: -```{r, eval=FALSE} +```{r abspath, eval=FALSE} dat <- read.csv(file = "/Users/Karthik/Documents/sannic-project/files/dataset-2013-01.csv", header = TRUE) ``` -10. R can run into memory issues. It is a common problem to run out of memory after running R scripts for a long time. To inspect the objects in your current R environment, you can list the objects, search current packages, and remove objects that are currently not in use. A good practice when running long lines of computationally intensive code is to remove temporary objects after they have served their purpose. However, sometimes, R will not clean up unused memory for a while after you delete objects. You can force R to tidy up its memory by using `gc()`. +10. R can run into memory issues. R scripts that run for a long time often run out of memory. 
To inspect the objects in your current R environment, you can list the objects, search current packages, and remove objects that are currently not in use. A good practice when running long chunks of computationally intensive code is to remove temporary objects after they have served their purpose. However, R will not always clean up unused memory immediately after you delete objects. You can force R to tidy up its memory by using `gc()`. -```{r, eval=FALSE} +```{r gc_ex, eval=FALSE} interim_object <- data.frame(rep(1:100,10),rep(101:200,10),rep(201:300,10)) # Sample dataset of 1000 rows object.size(interim_object) # Reports the memory size allocated to the object -rm(interim_object) # Removes only the object itself and not necessarily the memory allotted to it +rm("interim_object") # Removes only the object itself and not necessarily the memory allotted to it gc() # Force R to release memory it is no longer using ls() # Lists all the objects in your current workspace rm(list = ls()) # If you want to delete all the objects in the workspace and start with a clean slate ``` -11. Don't save a session history (the default option in R, when it asks if you want an `RData` file). Instead, start in a clean environment so that older objects don't remain in your environment any longer than they need to. If that happens, it can lead to unexpected results. +11. Don't save your workspace (the default option in R, when it asks if you want to "Save workspace image [y/n/c]?"). Instead, start in a clean workspace without old objects cluttering it. Leftover objects from previous sessions can lead to unexpected, hard-to-debug results. Do *not* put `rm(list=ls())` (which removes all objects in your current workspace, as shown in the previous code example) at the top of your code, as this is a trap for other users who might `source()` or copy-and-paste your code in the course of their R session. Instead, restart R when you want to start fresh. -12. 
Wherever possible, keep track of `sessionInfo()` somewhere in your project folder. Session information is invaluable because it captures all of the packages used in the current project. If a newer version of a package changes the way a function behaves, you can always go back and reinstall the version that worked (Note: At least on CRAN, all older versions of packages are permanently archived). +12. Wherever possible, keep track of `sessionInfo()` somewhere in your project folder. Session information is invaluable because it captures all of the packages used in the current project. If a newer version of a package changes the way a function behaves, you can always go back and reinstall the version that worked (Note: At least on CRAN, all older versions of packages are permanently archived). For more complex projects, you may want to use the [packrat](https://CRAN.R-project.org/package=packrat) package. 13. Collaborate. Grab a buddy and practice "code review". Review is used for preparing experiments and manuscripts; why not use it for code as well? Our code is also a major scientific achievement and the product of lots of hard work! diff --git a/_episodes_rmd/08-making-packages-R.Rmd b/_episodes_rmd/08-making-packages-R.Rmd index 0c3c1d3c4..9f0c4592a 100644 --- a/_episodes_rmd/08-making-packages-R.Rmd +++ b/_episodes_rmd/08-making-packages-R.Rmd @@ -111,16 +111,16 @@ Add our functions to the R directory. Place each function into a separate R script and add documentation like this: ```{r} -#' Convert Fahrenheit to Kelvin +#' Converts Fahrenheit to Kelvin #' #' This function converts input temperatures in Fahrenheit to Kelvin. -#' @param temp The input temperature. +#' @param temp The temperature in Fahrenheit. +#' @return The temperature in Kelvin. 
#' @export #' @examples #' fahr_to_kelvin(32) fahr_to_kelvin <- function(temp) { - #Converts Fahrenheit to Kelvin kelvin <- ((temp - 32) * (5/9)) + 273.15 kelvin } diff --git a/_episodes_rmd/11-supp-read-write-csv.Rmd b/_episodes_rmd/11-supp-read-write-csv.Rmd index 4c06e1f2c..94e1c9ee7 100644 --- a/_episodes_rmd/11-supp-read-write-csv.Rmd +++ b/_episodes_rmd/11-supp-read-write-csv.Rmd @@ -21,7 +21,7 @@ source('../bin/chunk-options.R') The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), -users often find it easier to save their spreadsheets in [comma-separated values]({{ page.root }}/reference/#comma-separated-values-(csv)) files (CSV) +users often find it easier to save their spreadsheets in [comma-separated values]({{ page.root }}/reference/#comma-separated-values-csv) files (CSV) and then use R's built in functionality to read and manipulate the data. In this short lesson, we'll learn how to read data from a .csv and write to a new .csv, and explore the [arguments]({{ page.root }}/reference/#argument) that allow you read and write the data correctly for your needs. diff --git a/_episodes_rmd/13-supp-data-structures.Rmd b/_episodes_rmd/13-supp-data-structures.Rmd index ab3112cc6..9a481a489 100644 --- a/_episodes_rmd/13-supp-data-structures.Rmd +++ b/_episodes_rmd/13-supp-data-structures.Rmd @@ -1,21 +1,21 @@ --- title: "Data Types and Structures" +keypoints: +- R's basic data types are character, numeric, integer, complex, and logical. +- R's basic data structures include the vector, list, matrix, data frame, and factors. +- Objects may have attributes, such as name, dimension, and class. +objectives: +- Expose learners to the different data types in R. +- Learn how to create vectors of different types. +- Be able to check the type of vector. +- Learn about missing data and other special values. 
+- Getting familiar with the different data structures (lists, matrices, data frames). +questions: +- What are the different data types in R? +- What are the different data structures in R? +- How do I access data within the various data structures? teaching: 45 exercises: 0 -questions: -- "What are the different data types in R?" -- "What are the different data structures in R?" -- "How do I access data within the various data structures?" -objectives: -- "Expose learners to the different data types in R." -- "Learn how to create vectors of different types." -- "Be able to check the type of vector." -- "Learn about missing data and other special values." -- "Getting familiar with the different data structures (lists, matrices, data frames)." -keypoints: -- "R's basic data types are character, numeric, integer, complex, and logical." -- "R's basic data structures include the vector, list, matrix, data frame, and factors." -- "Objects may have attributes, such as name, dimension, and class." --- ```{r, include = FALSE} @@ -297,6 +297,12 @@ mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE) mdat ``` +Elements of a matrix can be referenced by specifying the index along each dimension (e.g. "row" and "column") in single square brackets. + +```{r} +mdat[2,3] +``` + ### List In R lists act as containers. Unlike atomic vectors, the contents of a list are @@ -316,8 +322,17 @@ x x <- vector("list", length = 5) ## empty list length(x) +``` + +The content of elements of a list can be retrieved by using double square brackets. + +```{r} x[[1]] +``` +Vectors can be coerced to lists as follows: + +```{r} x <- 1:10 x <- as.list(x) length(x) @@ -326,21 +341,25 @@ length(x) 1. What is the class of `x[1]`? 2. What about `x[[1]]`? + +Elements of a list can be named (i.e. lists can have the `names` attribute) + ```{r} xlist <- list(a = "Karthik Ram", b = 1:10, data = head(iris)) xlist +names(xlist) ``` 1. What is the length of this object? What about its structure?
-Lists can be extremely useful inside functions. You can “staple” together lots +Lists can be extremely useful inside functions. Because the functions in R are able to return only a single object, you can "staple" together lots of different kinds of results into a single object that a function can return. A list does not print to the console like a vector. Instead, each element of the list starts on a new line. Elements are indexed by double brackets. Single brackets will still return -a(nother) list. +a(nother) list. If the elements of a list are named, they can be referenced by the `$` notation (e.g. `xlist$data`). ### Data Frame @@ -348,7 +367,7 @@ a(nother) list. A data frame is a very important data type in R. It's pretty much the *de facto* data structure for most tabular data and what we use for statistics. -A data frame is a special type of list where every element of the list has same length. +A data frame is a *special type of list* where every element of the list has the same length (i.e. a data frame is a "rectangular" list). Data frames can have additional attributes such as `rownames()`, which can be useful for annotating data, like `subject_id` or `sample_id`. But most of the @@ -356,12 +375,11 @@ time they are not used. Some additional information on data frames: -* Usually created by `read.csv()` and `read.table()`. -* Can convert to matrix with `data.matrix()` (preferred) or `as.matrix()` -* Coercion will be forced and not always what you expect. -* Can also create with `data.frame()` function. +* Usually created by `read.csv()` and `read.table()`, i.e. when importing the data into R. +* Assuming all columns in a data frame are of the same type, a data frame can be converted to a matrix with `data.matrix()` (preferred) or `as.matrix()`. Otherwise, type coercion will be enforced and the results may not always be what you expect. +* Can also create a new data frame with the `data.frame()` function.
* Find the number of rows and columns with `nrow(dat)` and `ncol(dat)`, respectively. -* Rownames are usually 1, 2, ..., n. +* Rownames are often automatically generated and look like 1, 2, ..., n. Consistency in numbering of rownames may not be honored when rows are reshuffled or subset. ### Creating Data Frames by Hand @@ -374,33 +392,52 @@ dat > ## Useful Data Frame Functions > -> * `head()` - shown first 6 rows -> * `tail()` - show last 6 rows -> * `dim()` - returns the dimensions +> * `head()` - shows first 6 rows +> * `tail()` - shows last 6 rows +> * `dim()` - returns the dimensions of a data frame (i.e. number of rows and number of columns) > * `nrow()` - number of rows > * `ncol()` - number of columns -> * `str()` - structure of each column +> * `str()` - structure of a data frame - name, type and preview of data in each column > * `names()` - shows the `names` attribute for a data frame, which gives the column names. +> * `sapply(dataframe, class)` - shows the class of each column in the data frame {: .callout} See that it is actually a special list: ```{r} -is.list(iris) -class(iris) +is.list(dat) +class(dat) +``` + +Because data frames are rectangular, elements of a data frame can be referenced by specifying the row and the column index in single square brackets (similar to a matrix). + +```{r} +dat[1,3] ``` +As data frames are also lists, it is possible to refer to columns (which are elements of such a list) using the list notation, i.e. either double square brackets or a `$`. + +```{r} +dat[["y"]] +dat$y +``` + +The following table summarizes the one-dimensional and two-dimensional data structures in R in relation to the diversity of data types they can contain. + | Dimensions | Homogenous | Heterogeneous | | ------- | ---- | ---- | | 1-D | atomic vector | list | | 2-D | matrix | data frame | +> Lists can contain elements that are themselves multi-dimensional (e.g. a list can contain data frames or other types of objects).
Lists can also contain elements of any length, therefore lists do not necessarily have to be "rectangular". However, in order for the list to qualify as a data frame, the length of each element has to be the same. +{: .callout} + > ## Column Types in Data Frames > -> Knowing that data frames are lists of lists, can columns be of different type? +> Knowing that data frames are lists, can columns be of different type? > -> What type of structure do you expect on the iris data frame? Hint: Use `str()`. +> What type of structure do you expect to see when you explore the structure of the `iris` data frame? Hint: Use `str()`. > > ~~~ > # The Sepal.Length, Sepal.Width, Petal.Length and Petal.Width columns are all diff --git a/_extras/guide.md b/_extras/guide.md index 646c0e6f8..2709fdcb3 100644 --- a/_extras/guide.md +++ b/_extras/guide.md @@ -79,13 +79,11 @@ line and this should resolve the issue. ## Teaching Notes * Watching the instructor grow programs step by step - is as helpful to learners as anything to do with Python. - Resist the urge to update a single cell repeatedly + is as helpful to learners as anything to do with R. + Resist the urge to clean up your R script as you go (which is what you'd probably do in real life). - Instead, - clone the previous cell and write the update in the new copy - so that learners have a complete record of how the program grew. - Once you've done this, + Instead, keep intermediate steps in your script. + Once you've reached the final version you can say, "Now why don't we just breaks things into small functions right from the start?" diff --git a/_includes/carpentries.html b/_includes/carpentries.html index 69e2e1cc5..a0e0181fc 100644 --- a/_includes/carpentries.html +++ b/_includes/carpentries.html @@ -1,3 +1,6 @@ +{% comment %} + General description of Software and Data Carpentry. +{% endcomment %}
Software Carpentry logo @@ -23,5 +26,19 @@ building on learners' existing knowledge to enable them to quickly apply skills learned to their own research.
- - +
+
+
+ Library Carpentry logo +
+
+ Library Carpentry is made by librarians to help librarians + automate repetitive, boring, error-prone tasks; + create, maintain and analyse sustainable and reusable data; + work effectively with IT and systems colleagues; + better understand the use of software in research; + and much more. + Library Carpentry was the winner of the 2016 + British Library Labs Teaching and Learning Award. +
+
diff --git a/_includes/dc/intro.html b/_includes/dc/intro.html new file mode 100644 index 000000000..741aeebb5 --- /dev/null +++ b/_includes/dc/intro.html @@ -0,0 +1,18 @@ +

+ Data Carpentry + aims to help researchers get their work done + in less time and with less pain + by teaching them basic research computing skills. + This hands-on workshop will cover basic concepts and tools, + including program design, version control, data management, + and task automation. + Participants will be encouraged to help one another + and to apply what they have learned to their own research problems. +

+

+ + For more information on what we teach and why, + please see our paper + "Best Practices for Scientific Computing". + +

diff --git a/_includes/dc/schedule.html b/_includes/dc/schedule.html new file mode 100644 index 000000000..6894a19e3 --- /dev/null +++ b/_includes/dc/schedule.html @@ -0,0 +1,24 @@ +
+
+

Day 1

+ + + + + + + +
09:00 Automating tasks with the Unix shell
10:30 Coffee
12:00 Lunch break
13:00 Building programs with Python
14:30 Coffee
16:00 Wrap-up
+
+
+

Day 2

+ + + + + + + +
09:00 Version control with Git
10:30 Coffee
12:00 Lunch break
13:00 Managing data with SQL
14:30 Coffee
16:00 Wrap-up
+
+
diff --git a/_includes/dc/syllabus.html b/_includes/dc/syllabus.html new file mode 100644 index 000000000..a325ceec2 --- /dev/null +++ b/_includes/dc/syllabus.html @@ -0,0 +1,96 @@ +
+
+

The Unix Shell

+
    +
  • Files and directories
  • +
  • History and tab completion
  • +
  • Pipes and redirection
  • +
  • Looping over files
  • +
  • Creating and running shell scripts
  • +
  • Finding things
  • +
  • Reference...
  • +
+
+
+

Programming in Python

+
    +
  • Using libraries
  • +
  • Working with arrays
  • +
  • Reading and plotting data
  • +
  • Creating and using functions
  • +
  • Loops and conditionals
  • +
  • Defensive programming
  • +
  • Using Python from the command line
  • +
  • Reference...
  • +
+
+ + +
+ +
+
+

Version Control with Git

+
    +
  • Creating a repository
  • +
  • Recording changes to files: add, commit, ...
  • +
  • Viewing changes: status, diff, ...
  • +
  • Ignoring files
  • +
  • Working on the web: clone, pull, push, ...
  • +
  • Resolving conflicts
  • +
  • Open licenses
  • +
  • Where to host work, and why
  • +
  • Reference...
  • +
+
+ +
+

Open Refine

+
    +
  • Introduction to OpenRefine
  • +
  • Importing data
  • +
  • Basic functions
  • +
  • Advanced Functions
  • +
  • Reference...
  • +
+
+
diff --git a/_includes/dc/who.html b/_includes/dc/who.html new file mode 100644 index 000000000..2d8e94ae3 --- /dev/null +++ b/_includes/dc/who.html @@ -0,0 +1,8 @@ +

+ Who: + The course is aimed at graduate students and other researchers. + + You don't need to have any previous knowledge of the tools + that will be presented at the workshop. + +

diff --git a/_includes/episode_keypoints.html b/_includes/episode_keypoints.html index 85378a568..2baa53ef0 100644 --- a/_includes/episode_keypoints.html +++ b/_includes/episode_keypoints.html @@ -1,3 +1,6 @@ +{% comment %} + Display key points for an episode. +{% endcomment %}

Key Points

    diff --git a/_includes/episode_navbar.html b/_includes/episode_navbar.html index a789d3d99..b9f85f6bd 100644 --- a/_includes/episode_navbar.html +++ b/_includes/episode_navbar.html @@ -1,26 +1,11 @@ {% comment %} - Find previous and next episodes (if any). -{% endcomment %} -{% for episode in site.episodes %} - {% if episode.url == page.url %} - {% unless forloop.first %} - {% assign prev_episode = prev %} - {% endunless %} - {% unless forloop.last %} - {% assign next_episode = site.episodes[forloop.index] %} - {% endunless %} - {% endif %} - {% assign prev = episode %} -{% endfor %} - -{% comment %} - Display title and prev/next links. + Navigation bar for an episode. {% endcomment %}

    - {% if prev_episode %} - previous episode + {% if page.previous.url %} + previous episode {% else %} lesson home {% endif %} @@ -29,13 +14,12 @@

    {% if include.episode_navbar_title %}

    {{ site.title }}

    -

    {{ page.title }}

    {% endif %}

    - {% if next_episode %} - next episode + {% if page.next.url %} + next episode {% else %} lesson home {% endif %} diff --git a/_includes/episode_title.html b/_includes/episode_title.html index 5b9c821ca..d0abc6545 100644 --- a/_includes/episode_title.html +++ b/_includes/episode_title.html @@ -1,42 +1,9 @@ -{% comment %} - Find previous and next episodes (if any). -{% endcomment %} -{% for episode in site.episodes %} - {% if episode.url == page.url %} - {% unless forloop.first %} - {% assign prev_episode = prev %} - {% endunless %} - {% unless forloop.last %} - {% assign next_episode = site.episodes[forloop.index] %} - {% endunless %} - {% endif %} - {% assign prev = episode %} -{% endfor %} - -{% comment %} - Display title and prev/next links. -{% endcomment %}
    -

    - {% if prev_episode %} - - {% else %} - - {% endif %} -

    -

    {{ site.title }}

    {{ page.title }}

    -

    - {% if next_episode %} - - {% else %} - - {% endif %} -

    diff --git a/_includes/javascript.html b/_includes/javascript.html index 010ae4af1..a2066c202 100644 --- a/_includes/javascript.html +++ b/_includes/javascript.html @@ -1,3 +1,6 @@ +{% comment %} + Javascript used in lesson and workshop pages. +{% endcomment %} diff --git a/_includes/lc/intro.html b/_includes/lc/intro.html new file mode 100644 index 000000000..6794b542c --- /dev/null +++ b/_includes/lc/intro.html @@ -0,0 +1,19 @@ +

    + Library Carpentry + is made by librarians, for librarians to help you: +

    +
      +
    • automate repetitive, boring, error-prone tasks
    • +
    • create, maintain and analyse sustainable and reusable data
    • +
    • work effectively with IT and systems colleagues
    • +
    • better understand the use of software in research
    • +
    • and much more...
    • +
    +

    + + Library Carpentry introduces you to the fundamentals of computing + and provides you with a platform for further self-directed learning. + For more information on what we teach and why, please see our paper + "Library Carpentry: software skills training for library professionals". + +

    diff --git a/_includes/lc/schedule.html b/_includes/lc/schedule.html new file mode 100644 index 000000000..cc2b59202 --- /dev/null +++ b/_includes/lc/schedule.html @@ -0,0 +1,24 @@ +
    +
    +

    Day 1

    + + + + + + + +
    09:00 Data Intro for Librarians
    10:30 Coffee
    12:00 Lunch break
    13:00 Shell Lessons for Libraries
    14:30 Coffee
    16:00 Wrap-up
    +
    +
    +

    Day 2

    + + + + + + + +
    09:00 Git Intro for Librarians
    10:30 Coffee
    12:00 Lunch break
    13:00 OpenRefine for Librarians
    14:30 Coffee
    16:00 Wrap-up
    +
    +
    diff --git a/_includes/lc/syllabys.html b/_includes/lc/syllabys.html new file mode 100644 index 000000000..4dc20776d --- /dev/null +++ b/_includes/lc/syllabys.html @@ -0,0 +1,69 @@ +
    +
    +

    Data Intro

    +
      +
    • Intro to data
    • +
    • Jargon busting
    • +
    • Keyboard shortcuts
    • +
    • Plain text formats
    • +
    • Naming files
    • +
    • Regular expressions
    • +
    • Reference...
    • +
    +
    +
    +

    The Unix Shell

    +
      +
    • Files and directories
    • +
    • History and tab completion
    • +
    • Counting and sorting contents in files
    • +
    • Pipes and redirection
    • +
    • Mining or searching in files
    • +
    • Reference...
    • +
    +
    + +
    + +
    +
    +

    Version Control with Git

    +
      +
    • Creating a repository
    • +
    • Configuring git
    • +
    • Recording changes to files: add, commit, ...
    • +
    • Viewing state changes with status
    • +
    • Working on the web: clone, pull, push, ...
    • +
    • Where to host work, and why
    • +
    • Reference...
    • +
    +
    +
    +
    +

    Open Refine

    +
      +
    • Introduction to OpenRefine
    • +
    • Importing data
    • +
    • Basic functions
    • +
    • Advanced Functions
    • +
    • Reference...
    • +
    +
    +
    +
    + diff --git a/_includes/lc/who.html b/_includes/lc/who.html new file mode 100644 index 000000000..fd9b38c47 --- /dev/null +++ b/_includes/lc/who.html @@ -0,0 +1,8 @@ +

    + Who: + The course is for librarians, archivists, and other information workers. + + You don't need to have any previous knowledge of the tools that + will be presented at the workshop. + +

    diff --git a/_includes/lesson_footer.html b/_includes/lesson_footer.html index beef2cb38..fa5d88870 100644 --- a/_includes/lesson_footer.html +++ b/_includes/lesson_footer.html @@ -1,19 +1,37 @@ +{% comment %} + Footer for lesson pages. +{% endcomment %}