diff --git a/_episodes_rmd/01-starting-with-data.Rmd b/_episodes_rmd/01-starting-with-data.Rmd
index a92254ea4..d51c2cc96 100644
--- a/_episodes_rmd/01-starting-with-data.Rmd
+++ b/_episodes_rmd/01-starting-with-data.Rmd
@@ -6,7 +6,7 @@ questions:
 - "How do I read data into R?"
 - "How do I assign variables?"
 - "What is a data frame?"
-- "How do I access subsets a data frame?"
+- "How do I access subsets of a data frame?"
 - "How do I calculate simple statistics like mean and median?"
 - "Where can I get help?"
 - "How can I plot my data?"
@@ -22,7 +22,7 @@ keypoints:
 - "The function `dim` gives the dimensions of a data frame."
 - "Use `object[x, y]` to select a single element from a data frame."
 - "Use `from:to` to specify a sequence that includes the indices from `from` to `to`."
-- "All the indexing and slicing that works on data frames also works on vectors."
+- "All the indexing and subsetting that works on data frames also works on vectors."
 - "Use `#` to add comments to programs."
 - "Use `mean`, `max`, `min` and `sd` to calculate simple statistics."
 - "Use `apply` to calculate statistics across the rows or columns of a data frame."
@@ -114,20 +114,25 @@ We can create a new variable and assign a value to it using `<-`
 weight_kg <- 55
 ```
-Once a variable has a value, we can print it by typing the name of the variable and hitting `Enter` (or `return`).
+Once a variable is created, we can use the variable name to refer to the value it was assigned. The variable name now acts as a tag. Whenever R reads that tag (`weight_kg`), it substitutes the value (`55`).
+
+Variables as Tags
+
+To see the value of a variable, we can print it by typing the name of the variable and hitting `Enter` (or `return`). In general, R will print to the console any object returned by a function or operation *unless* we assign it to a variable.
 ```{r}
 weight_kg
 ```
-
-We can do arithmetics with the variable:
+We can treat our variable like a regular number, and do arithmetic with it:
 ```{r}
 # weight in pounds:
 2.2 * weight_kg
 ```
+Variables as Tags
+
 > ## Commenting
 >
 > We can add comments to our code using the `#` character. It is useful to
@@ -152,13 +157,13 @@ weight_kg
 > a [chapter](http://r-pkgs.had.co.nz/style.html) on this and other style considerations.
 {: .callout}
-If we imagine the variable as a sticky note with a name written on it,
-assignment is like putting the sticky note on a particular value:
+Reassigning Variables
+
+Assigning a new value to a variable breaks the connection with the old value; R forgets that number and applies the variable name to the new value.
-Variables as Sticky Notes
+When you assign a value to a variable, R only stores the value, not the calculation you used to create it. This is an important point if you're used to the way a spreadsheet program automatically updates linked cells. Let's look at an example.
-This means that assigning a value to one object does not change the values of other variables.
-For example, let's store the subject's weight in pounds in a variable:
+First, we'll convert `weight_kg` into pounds, and store the new value in the variable `weight_lb`:
 ```{r}
 weight_lb <- 2.2 * weight_kg
@@ -168,9 +173,12 @@ weight_kg
 weight_lb
 ```
-Creating Another Variable
+In words, we're asking R to look up the value we tagged `weight_kg`,
+multiply it by 2.2, and tag the result with the name `weight_lb`:
+
+Creating Another Variable
-and then change `weight_kg`:
+If we now change the value of `weight_kg`:
 ```{r}
 weight_kg <- 100.0
@@ -180,7 +188,7 @@ weight_kg
 weight_lb
 ```
-Updating a Variable
+Updating a Variable
 Since `weight_lb` doesn't "remember" where its value came from, it isn't automatically updated when `weight_kg` changes. This is different from the way spreadsheets work.
@@ -229,8 +237,9 @@ First, let's ask what type of thing `dat` is:
 class(dat)
 ```
-The output tells us that it is a data frame. We can think of this as a spreadsheet in MS Excel, which many of us are familiar with.
-Data frames are very useful for organizing data and you will find them elsewhere when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns.
+The output tells us that it is a data frame. Think of this structure as a spreadsheet in MS Excel that many of us are familiar with.
+Data frames are very useful for storing data and you will use them frequently when programming in R.
+A typical data frame of experimental data contains individual observations in rows and variables in columns.
 We can see the shape, or [dimensions]({{ page.root }}/reference/#dimensions-of-an-array), of the data frame with the function `dim`:
@@ -242,42 +251,51 @@ This tells us that our data frame, `dat`, has `r nrow(dat)` rows and `r ncol(dat
 If we want to get a single value from the data frame, we can provide an [index]({{ page.root }}/reference/#index) in square brackets. The first number specifies the row and the second the column:
-```{r}
-# The first value in dat is indexed at row 1 column 1
+```{r selecting data frame elements}
+# first value in dat, row 1, column 1
 dat[1, 1]
-# The middle value in dat is indexed at row 30 column 20
+# middle value in dat, row 30, column 20
 dat[30, 20]
 ```
-An index like `[30, 20]` selects a single element of a data frame, but we can select whole sections as well.
-For example, we can select values for the first four patients (rows) during the first ten days of treatment (columns) like this:
+The first value in a data frame index is the row, the second value is the column.
+If we want to select more than one row or column, we can use the function `c`, which stands for **c**ombine.
+For example, to pick columns 10 and 20 from rows 1, 3, and 5, we can do this:
-```{r}
-dat[1:4, 1:10]
+```{r selecting with c}
+dat[c(1, 3, 5), c(10, 20)]
+```
+
+We frequently want to select contiguous rows or columns, such as the first ten rows, or columns 3 through 7. You can use `c` for this, but it's more convenient to use the `:` operator. This special function generates sequences of numbers:
+
+```{r sequences}
+1:5
+3:12
 ```
-The slice does not need to start at 1, e.g. the line below selects rows 5 through 10, and columns 3 through 10 :
+For example, we can select the first ten columns of values for the first four rows like this:
 ```{r}
-dat[5:10, 3:10]
+dat[1:4, 1:10]
 ```
-We can use the function `c`, which stands for **c**ombine, to select non-contiguous values:
+
+or the first ten columns of rows 5 to 10 like this:
 ```{r}
-dat[c(3, 8, 37, 56), c(10, 14, 29)]
+dat[5:10, 1:10]
 ```
-We can also provide a slice for the rows but not for the columns, or for the columns but not for the rows.
-If we don't include a slice for the rows, R returns all the rows; if we don't include a slice for the columns, R returns all the columns.
-If we don't provide a slice for either rows or columns, e.g. `dat[, ]`, R returns the full data frame.
+If you want to select all rows or all columns, leave that index value empty.
 ```{r}
 # All columns from row 5
 dat[5, ]
-# All rows from column 16
-dat[, 16]
+# All rows from columns 16-18
+dat[, 16:18]
 ```
+If you leave both index values empty (i.e., `dat[, ]`), you get the entire data frame.
+
 > ## Addressing Columns by Name
 >
 > Columns can also be addressed by name, with either the `$` operator (ie. `dat$Age`) or square brackets (ie. `dat[,'Age']`).
@@ -294,6 +312,42 @@ patient_1 <- dat[1, ]
 # max inflammation for patient 1
 max(patient_1)
 ```
+
+
+
 We don't actually need to store the row in a variable of its own.
 Instead, we can combine the selection and the function call:
@@ -373,10 +427,10 @@ We'll learn why this is so in the next lesson.
 > `colMeans`, respectively.
 {: .callout}
-> ## Slicing (Subsetting) Data
+
+> ## Subsetting Data
 >
-> A subsection of a data frame is called a [slice]({{ page.root }}/reference/#slice).
-> We can take slices of character vectors as well:
+> We can take subsets of character vectors as well:
 >
 > ```{r}
 > animal <- c("m", "o", "n", "k", "e", "y")
@@ -399,7 +453,7 @@ We'll learn why this is so in the next lesson.
 > ## Subsetting More Data
 >
 > Suppose you want to determine the maximum inflammation for patient 5 across days three to seven.
-> To do this you would extract the relevant slice from the data frame and calculate the maximum value.
+> To do this you would extract the relevant subset from the data frame and calculate the maximum value.
 > Which of the following lines of R code gives the correct answer?
 >
 > 1. `max(dat[5, ])`
@@ -416,7 +470,7 @@ We'll learn why this is so in the next lesson.
 > {: .solution}
 {: .challenge}
-> ## Slicing and Re-Assignment
+> ## Subsetting and Re-Assignment
 >
 > Using the inflammation data frame `dat` from above:
 > Let's pretend there was something wrong with the instrument on the first five days for every second patient (#2, 4, 6, etc.), which resulted in the measurements being twice as large as they should be.
diff --git a/fig/arithmetic-variables.svg b/fig/arithmetic-variables.svg
new file mode 100644
index 000000000..cfe9fb8e1
--- /dev/null
+++ b/fig/arithmetic-variables.svg
@@ -0,0 +1,253 @@
[SVG markup not recoverable; text labels in figure: weight_kg, 2.2, ×, 55, =, 121]
diff --git a/fig/memory-variables.svg b/fig/memory-variables.svg
new file mode 100644
index 000000000..076a4794a
--- /dev/null
+++ b/fig/memory-variables.svg
@@ -0,0 +1,298 @@
[SVG markup not recoverable; text labels in figure: 126.5, weight_lb, weight_kg, 100]
diff --git a/fig/new-variables.svg b/fig/new-variables.svg
new file mode 100644
index 000000000..53cfa9f7b
--- /dev/null
+++ b/fig/new-variables.svg
@@ -0,0 +1,334 @@
[SVG markup not recoverable; text labels in figure: weight_kg, 2.2, ×, 57.5, =, 126.5, weight_lb]
diff --git a/fig/reassign-variables.svg b/fig/reassign-variables.svg
new file mode 100644
index 000000000..f44f30ca8
--- /dev/null
+++ b/fig/reassign-variables.svg
@@ -0,0 +1,244 @@
[SVG markup not recoverable; text labels in figure: weight_kg, <-, 57.5, 55, ×, 57.5]
diff --git a/fig/tag-variables.svg b/fig/tag-variables.svg
new file mode 100644
index 000000000..f2af6f0a3
--- /dev/null
+++ b/fig/tag-variables.svg
@@ -0,0 +1,141 @@
[SVG markup not recoverable; text labels in figure: weight_kg, 55]