Merge pull request swcarpentry#255 from plantarum/gh-pages
New svg diagrams for variables, refactoring subsetting presentation
chendaniely authored Jul 16, 2017
2 parents 50fb901 + 7d7da82 commit 6ed2843
Showing 6 changed files with 1,360 additions and 36 deletions.
126 changes: 90 additions & 36 deletions _episodes_rmd/01-starting-with-data.Rmd
@@ -6,7 +6,7 @@ questions:
- "How do I read data into R?"
- "How do I assign variables?"
- "What is a data frame?"
- "How do I access subsets a data frame?"
- "How do I access subsets of a data frame?"
- "How do I calculate simple statistics like mean and median?"
- "Where can I get help?"
- "How can I plot my data?"
@@ -22,7 +22,7 @@ keypoints:
- "The function `dim` gives the dimensions of a data frame."
- "Use `object[x, y]` to select a single element from a data frame."
- "Use `from:to` to specify a sequence that includes the indices from `from` to `to`."
- "All the indexing and slicing that works on data frames also works on vectors."
- "All the indexing and subsetting that works on data frames also works on vectors."
- "Use `#` to add comments to programs."
- "Use `mean`, `max`, `min` and `sd` to calculate simple statistics."
- "Use `apply` to calculate statistics across the rows or columns of a data frame."
@@ -114,20 +114,25 @@ We can create a new variable and assign a value to it using `<-`
weight_kg <- 55
```
Once a variable has a value, we can print it by typing the name of the variable and hitting `Enter` (or `return`).
Once a variable is created, we can use the variable name to refer to the value it was assigned. The variable name now acts as a tag. Whenever R reads that tag (`weight_kg`), it substitutes the value (`55`).
<img src="../fig/tag-variables.svg" alt="Variables as Tags" />
To see the value of a variable, we can print it by typing the name of the variable and hitting `Enter` (or `return`).
In general, R will print to the console any object returned by a function or operation *unless* we assign it to a variable.
```{r}
weight_kg
```
We can do arithmetics with the variable:
We can treat our variable like a regular number, and do arithmetic with it:
```{r}
# weight in pounds:
2.2 * weight_kg
```
<img src="../fig/arithmetic-variables.svg" alt="Variables as Tags" />
> ## Commenting
>
> We can add comments to our code using the `#` character. It is useful to
@@ -152,13 +157,13 @@ weight_kg
> a [chapter](http://r-pkgs.had.co.nz/style.html) on this and other style considerations.
{: .callout}
If we imagine the variable as a sticky note with a name written on it,
assignment is like putting the sticky note on a particular value:
<img src="../fig/reassign-variables.svg" alt="Reassigning Variables" />
Assigning a new value to a variable breaks the connection with the old value; R forgets that number and applies the variable name to the new value.
<img src="../fig/python-sticky-note-variables-01.svg" alt="Variables as Sticky Notes" />
When you assign a value to a variable, R only stores the value, not the calculation you used to create it. This is an important point if you're used to the way a spreadsheet program automatically updates linked cells. Let's look at an example.
This means that assigning a value to one object does not change the values of other variables.
For example, let's store the subject's weight in pounds in a variable:
First, we'll convert `weight_kg` into pounds, and store the new value in the variable `weight_lb`:
```{r}
weight_lb <- 2.2 * weight_kg
@@ -168,9 +173,12 @@ weight_kg
weight_lb
```
<img src="../fig/python-sticky-note-variables-02.svg" alt="Creating Another Variable" />
In words, we're asking R to look up the value we tagged `weight_kg`,
multiply it by 2.2, and tag the result with the name `weight_lb`:
<img src="../fig/new-variables.svg" alt="Creating Another Variable" />
and then change `weight_kg`:
If we now change the value of `weight_kg`:
```{r}
weight_kg <- 100.0
@@ -180,7 +188,7 @@ weight_kg
weight_lb
```
<img src="../fig/python-sticky-note-variables-03.svg" alt="Updating a Variable" />
<img src="../fig/memory-variables.svg" alt="Updating a Variable" />
Since `weight_lb` doesn't "remember" where its value came from, it isn't automatically updated when `weight_kg` changes.
This is different from the way spreadsheets work.
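To pick up a new value of `weight_kg`, the calculation has to be run again. The following is a minimal sketch that reuses only the two variables introduced above:

```r
weight_kg <- 55
weight_lb <- 2.2 * weight_kg   # stores the result of the calculation, not the formula

weight_kg <- 100               # reassigning weight_kg leaves weight_lb untouched
weight_lb                      # still the value computed from 55

weight_lb <- 2.2 * weight_kg   # rerun the calculation to use the new weight
weight_lb
```

Rerunning the assignment is the R equivalent of a spreadsheet recalculating a linked cell; R only does it when asked.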
@@ -229,8 +237,9 @@ First, let's ask what type of thing `dat` is:
class(dat)
```
The output tells us that it is a data frame. We can think of this as a spreadsheet in MS Excel, which many of us are familiar with.
Data frames are very useful for organizing data and you will find them elsewhere when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns.
The output tells us that it is a data frame. Think of this structure as a spreadsheet in MS Excel, which many of us are familiar with.
Data frames are very useful for storing data and you will use them frequently when programming in R.
A typical data frame of experimental data contains individual observations in rows and variables in columns.
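As a rough illustration of that layout, here is a small data frame built by hand. The name `toy` and its columns (`patient`, `day1`, `day2`) are made up for this sketch and are not part of the lesson's inflammation data:

```r
# Hypothetical toy data: one row per observation, one column per variable
toy <- data.frame(
  patient = c("a", "b", "c"),
  day1 = c(0, 1, 2),
  day2 = c(3, 5, 4)
)

toy          # print the whole table
class(toy)   # "data.frame"
```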
We can see the shape, or [dimensions]({{ page.root }}/reference/#dimensions-of-an-array), of the data frame with the function `dim`:
@@ -242,42 +251,51 @@ This tells us that our data frame, `dat`, has `r nrow(dat)` rows and `r ncol(dat
If we want to get a single value from the data frame, we can provide an [index]({{ page.root }}/reference/#index) in square brackets. The first number specifies the row and the second the column:
```{r}
# The first value in dat is indexed at row 1 column 1
```{r selecting data frame elements}
# first value in dat, row 1, column 1
dat[1, 1]
# The middle value in dat is indexed at row 30 column 20
# middle value in dat, row 30, column 20
dat[30, 20]
```
An index like `[30, 20]` selects a single element of a data frame, but we can select whole sections as well.
For example, we can select values for the first four patients (rows) during the first ten days of treatment (columns) like this:
The first value in a data frame index is the row, the second value is the column.
If we want to select more than one row or column, we can use the function `c`, which stands for **c**ombine.
For example, to pick columns 10 and 20 from rows 1, 3, and 5, we can do this:
```{r}
dat[1:4, 1:10]
```{r selecting with c}
dat[c(1, 3, 5), c(10, 20)]
```
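To check which cells such a selection returns, it can help to try it on a data frame whose values are easy to trace. The sketch below assumes a hypothetical 5 x 20 data frame named `toy`, filled with the numbers 1 to 100 column by column; it is not the lesson's `dat`:

```r
# Hypothetical 5 x 20 data frame; as.data.frame names the columns V1 ... V20
toy <- as.data.frame(matrix(1:100, nrow = 5, ncol = 20))

# Rows 1, 3 and 5 of columns 10 and 20
picked <- toy[c(1, 3, 5), c(10, 20)]
picked
dim(picked)   # 3 rows, 2 columns
```

The row index comes first and the column index second, so the result has one row per requested row and one column per requested column.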
We frequently want to select contiguous rows or columns, such as the first ten rows, or columns 3 through 7. You can use `c` for this, but it's more convenient to use the `:` operator. This special function generates sequences of numbers:
```{r sequences}
1:5
3:12
```
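A quick way to convince yourself that `:` is shorthand for listing the numbers with `c` is to compare the two directly. This is only an illustrative sketch; the variable name `one_to_five` is made up here:

```r
one_to_five <- 1:5

# Every element matches the hand-written version
all(one_to_five == c(1, 2, 3, 4, 5))   # TRUE

# Both ends of the sequence are included
length(3:12)                           # 10
```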
The slice does not need to start at 1, e.g. the line below selects rows 5 through 10, and columns 3 through 10 :
For example, we can select the first ten columns of values for the first four rows like this:
```{r}
dat[5:10, 3:10]
dat[1:4, 1:10]
```
We can use the function `c`, which stands for **c**ombine, to select non-contiguous values:
or the first ten columns of rows 5 to 10 like this:
```{r}
dat[c(3, 8, 37, 56), c(10, 14, 29)]
dat[5:10, 1:10]
```
We can also provide a slice for the rows but not for the columns, or for the columns but not for the rows.
If we don't include a slice for the rows, R returns all the rows; if we don't include a slice for the columns, R returns all the columns.
If we don't provide a slice for either rows or columns, e.g. `dat[, ]`, R returns the full data frame.
If you want to select all rows or all columns, leave that index value empty.
```{r}
# All columns from row 5
dat[5, ]
# All rows from column 16
dat[, 16]
# All rows from columns 16-18
dat[, 16:18]
```
If you leave both index values empty (i.e., `dat[, ]`), you get the entire data frame.
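The same pattern can be tried on a small stand-in data frame to see what kind of object each selection returns. The sketch below assumes a hypothetical 5 x 20 data frame named `toy`, not the lesson's `dat`:

```r
# Hypothetical 5 x 20 data frame of integers
toy <- as.data.frame(matrix(1:100, nrow = 5, ncol = 20))

one_row <- toy[5, ]    # all columns from row 5
one_col <- toy[, 16]   # all rows from column 16

class(one_row)   # "data.frame": a row keeps the data frame structure
class(one_col)   # "integer": a single column is simplified to a vector by default
dim(toy[, ])     # 5 20, the full data frame
```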
> ## Addressing Columns by Name
>
> Columns can also be addressed by name, with either the `$` operator (i.e., `dat$Age`) or square brackets (i.e., `dat[,'Age']`).
@@ -294,6 +312,42 @@ patient_1 <- dat[1, ]
# max inflammation for patient 1
max(patient_1)
```
<!--
OUCH!! The following may be true, but it will vary by R version, not by
installation! There shouldn't be an issue with a data frame where all
columns are numeric without missing values. Under those circumstances,
coercion should do what you expect. You'll get problems with mixed
types (factors, character etc), or with missing values. If this is
actually a problem, we need to change the example - we should be able
to come up with an example that doesn't require this ugliness in the
very first lesson.
Also, columns always work as expected because by definition a column
contains a vector of values of the same type. Rows include values from
different columns, which of course can be different types. This doesn't
need to be confusing, and if we're careful in our presentation here we
can avoid this until the students know enough to cope rationally.
-->
<!--
> ## Forcing Conversion
>
> The code above may give you an error in some R installations,
> since R does not automatically convert a row from a `data.frame` to a vector.
> (Confusingly, subsetted columns are automatically converted.)
> If this happens, you can use the `as.numeric` command to convert the row of data to a numeric vector:
>
> `patient_1 <- as.numeric(dat[1, ])`
>
> `max(patient_1)`
>
> You can also check the `class` of each object:
>
> `class(dat[1, ])`
>
> `class(as.numeric(dat[1, ]))`
{: .callout}
-->
We don't actually need to store the row in a variable of its own.
Instead, we can combine the selection and the function call:
@@ -373,10 +427,10 @@ We'll learn why this is so in the next lesson.
> `colMeans`, respectively.
{: .callout}
> ## Slicing (Subsetting) Data
<!-- Slice is a Python thing. I've never seen this term used in R -->
> ## Subsetting Data
>
> A subsection of a data frame is called a [slice]({{ page.root }}/reference/#slice).
> We can take slices of character vectors as well:
> We can take subsets of character vectors as well:
>
> ```{r}
> animal <- c("m", "o", "n", "k", "e", "y")
@@ -399,7 +453,7 @@ We'll learn why this is so in the next lesson.
> ## Subsetting More Data
>
> Suppose you want to determine the maximum inflammation for patient 5 across days three to seven.
> To do this you would extract the relevant slice from the data frame and calculate the maximum value.
> To do this you would extract the relevant subset from the data frame and calculate the maximum value.
> Which of the following lines of R code gives the correct answer?
>
> 1. `max(dat[5, ])`
@@ -416,7 +470,7 @@ We'll learn why this is so in the next lesson.
> {: .solution}
{: .challenge}
> ## Slicing and Re-Assignment
> ## Subsetting and Re-Assignment
>
> Using the inflammation data frame `dat` from above:
> Let's pretend there was something wrong with the instrument on the first five days for every second patient (#2, 4, 6, etc.), which resulted in the measurements being twice as large as they should be.