---
layout: topic
title: More on data frames
author: Data Carpentry contributors
minutes: 30
---
```{r, echo=FALSE, purl=FALSE, message = FALSE}
# source("setup.R")
metadata <- read.csv("data/Ecoli_metadata.csv")
```
```{r, echo=FALSE, purl=TRUE}
## The data.frame class
```
------------
> ## Learning Objectives
>
> * Understand the concept of a `data.frame`
> * Use sequences to index data
> * Know how to access any element of a `data.frame`
------------
<!--
# What are data frames?
`data.frame` is the _de facto_ data structure for most tabular data and what we
use for statistics and plotting.
A `data.frame` is a collection of vectors of identical lengths. Each vector
represents a column, and each vector can be of a different data type (e.g.,
characters, integers, factors). The `str()` function is useful to inspect the
data types of the columns.
A `data.frame` can be created by the functions `read.csv()` or `read.table()`, in
other words, when importing spreadsheets from your hard drive (or the web).
In R versions before 4.0.0, `data.frame()` and `read.csv()` by default convert
(= coerce) columns that contain characters (i.e., text) into the `factor` data
type; since R 4.0.0 such columns are kept as `character` by default. Depending
on what you want to do with the data, you may want to control this explicitly.
To do so, `read.csv()` and `read.table()` have an argument called
`stringsAsFactors` which can be set to `FALSE`:
```{r, eval=FALSE, purl=FALSE}
some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)
```
<!--- talk about colClasses argument?, row names? --->
<!--
You can also create a `data.frame` manually with the function `data.frame()`. This
function can also take the argument `stringsAsFactors`. Compare the output of
these examples, and note the difference between when the data are being read
as `character` and when they are being read as `factor`.
```{r, results='show', purl=TRUE}
# Text columns converted to factors (the default before R 4.0.0)
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8), stringsAsFactors=TRUE)
str(example_data)
# Text columns kept as character
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
str(example_data)
```
-->
# Indexing and sequences (within a vector)
```{r, echo=FALSE, purl=TRUE}
## Indexing and sequences
```
If we want to extract one or several values from a vector, we must provide one
or several indices in square brackets, just as we do in math.
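As a minimal sketch of vector indexing, assume a small numeric vector called
`expression`; its values are invented for illustration and do not come from the
lesson data:

```{r, eval=FALSE, purl=FALSE}
# Hypothetical expression levels, invented for illustration
expression <- c(4.2, 7.9, 1.3, 5.6, 3.0)

expression[2]        # a single index returns one value: 7.9
expression[c(3, 2)]  # a vector of indices returns values in that order: 1.3 7.9
expression[2:4]      # a range of consecutive indices: 7.9 1.3 5.6
```

The indices can be given in any order and can be repeated, so the result does
not have to follow the order of the original vector.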
<!--
For instance:
```{r, results='show', purl=FALSE, eval=FALSE}
expression[2] # what level of expression is in the second element of the vector?
expression[c(3, 2)]
expression[2:4]
expression[c(3,2, 2:4)] # combining both what do you get?
```
-->
In R, indexing starts at 1. Programming languages like Fortran, MATLAB, and R
start counting at 1 because that's what human beings typically do. Languages in
the C family (including C++, Java, Perl, and Python) count from 0 because that's
simpler for computers to do.
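The short sketch below illustrates what counting from 1 means in practice. The
vector `x` is a made-up example, not part of the lesson data:

```{r, eval=FALSE, purl=FALSE}
# Hypothetical vector, invented for illustration
x <- c("a", "b", "c")

x[1]           # the first element: "a"
x[length(x)]   # the last element: "c"
x[0]           # index 0 selects nothing and returns a zero-length vector
length(x[0])   # 0
```

Note that `x[0]` does not raise an error, so an off-by-one mistake carried over
from a 0-based language can pass silently.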
<!--
`:` is a special function that creates numeric vectors of integers in increasing
or decreasing order; try `1:10` and `10:1` for instance. The function `seq()`
(for __seq__uence) can be used to create more complex patterns:
```{r, results='show', purl=FALSE, eval=FALSE}
seq(1, 10, by=2)
seq(5, 10, length.out=3) # equal breaks of sequence into vector length = length.out
seq(50, by=5, length.out=10) # sequence 50 by 5 until you hit vector length = length.out
seq(1, 8, by=3) # sequence by 3 until you hit 8
```
-->
Our metadata data frame has rows and columns (it has 2 dimensions). If we want to
extract some specific data from it, we need to specify the "coordinates" we want.
Row numbers come first, followed by column numbers (i.e., `[row, column]`).
```{r, purl=FALSE, eval=FALSE}
metadata[1, 2]   # first element in the 2nd column of the data frame
metadata[1, 6]   # first element in the 6th column
metadata[1:3, 7] # first three elements in the 7th column
metadata[3, ]    # the 3rd row, with all columns
metadata[, 7]    # the entire 7th column
head_meta <- metadata[1:6, ] # the first six rows, the same rows head(metadata) returns
```
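To see what each form of `[row, column]` indexing returns without loading the
lesson file, here is a small self-contained sketch. The data frame `toy` and all
of its values are made up for illustration; they are not the contents of
`Ecoli_metadata.csv`:

```{r, eval=FALSE, purl=FALSE}
# Toy data frame with made-up values (not the real lesson data)
toy <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  strain = c("A", "A", "B", "B"),
  genome_size = c(4.6, 4.7, 4.8, 4.9),
  stringsAsFactors = FALSE
)

dim(toy)      # number of rows, then number of columns: 4 3
toy[1, 2]     # row 1, column 2: "A"
toy[1:3, 3]   # rows 1 to 3 of column 3, returned as a vector: 4.6 4.7 4.8
toy[3, ]      # all of row 3, returned as a one-row data frame
toy[, 3]      # all of column 3, returned as a vector
```

Leaving the row or column position empty means "everything along that
dimension", which is why `toy[3, ]` keeps all columns and `toy[, 3]` keeps all
rows.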
<!--
### Challenge
1. The function `nrow()` on a `data.frame` returns the number of rows. Use it,
in conjunction with `seq()`, to create a new `data.frame` called
`meta_by_2` that includes every other row of the metadata data frame
starting at row 2 (2, 4, 6, ...)
```{r, purl=FALSE}
meta_by_2 <- metadata[seq(2, nrow(metadata), by=2), ]
```
--->
# Indexing and sequences (within a `data.frame`)
For larger datasets, it can be tricky to remember the column number that
corresponds to a particular variable. (Are species names in column 5 or 7? Oh,
right... they are in column 6.) In some cases, the column position of a variable
can change if the script you are using adds or removes columns. It's
therefore often better to use column names to refer to a particular variable:
it makes your code easier to read and your intentions clearer.
You can do operations on a particular column by selecting it with the `$`
operator. The selected column is returned as a vector. You can use
`names(metadata)` or `colnames(metadata)` to remind yourself of the column names.
For instance, to extract all the strain information from our dataset:
```{r, eval=FALSE}
metadata$strain
```
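Because `$` returns a plain vector, anything that works on a vector works on the
selected column. The sketch below uses a toy data frame with invented values;
only the column names `strain` and `clade` are taken from the lesson, and
`genome_size` is assumed here for illustration:

```{r, eval=FALSE, purl=FALSE}
# Toy data frame with made-up values (not the real lesson data)
toy <- data.frame(
  strain = c("A", "A", "B", "B"),
  clade = c("c1", "c1", "c2", "c2"),
  genome_size = c(4.6, 4.7, 4.8, 4.9),
  stringsAsFactors = FALSE
)

names(toy)             # "strain" "clade" "genome_size"
toy$strain             # the whole column as a character vector
toy$genome_size[2]     # index into the column like any vector: 4.7
mean(toy$genome_size)  # summarise the column: 4.75
```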
In some cases, you may want to select more than one column. You can do this using
square brackets. Suppose we wanted strain and clade information:
```{r, eval=FALSE}
metadata[, c("strain", "clade")]
```
You can even access columns by column name _and_ select specific rows of
interest. For example, if we wanted the strain and clade of just rows 4 through
7, we could do:
```{r, eval=FALSE}
metadata[4:7, c("strain", "clade")]
```
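One way to see why names are safer than positions is to reorder the columns and
check that a name-based selection still returns the same data. This is a sketch
on a toy data frame with invented values; `genome_size` and the object name
`reordered` are assumptions made for the example:

```{r, eval=FALSE, purl=FALSE}
# Toy data frame with made-up values (not the real lesson data)
toy <- data.frame(
  strain = c("A", "A", "B", "B"),
  clade = c("c1", "c1", "c2", "c2"),
  genome_size = c(4.6, 4.7, 4.8, 4.9),
  stringsAsFactors = FALSE
)

# Rows 2 to 3 of two named columns, returned as a data frame
by_name <- toy[2:3, c("strain", "clade")]

# Same data with the columns in a different order
reordered <- toy[, c("genome_size", "clade", "strain")]

# Selecting by name gives the same result despite the new column order
all.equal(by_name, reordered[2:3, c("strain", "clade")])

# Selecting by position does not: column 1 is now genome_size
reordered[2:3, 1]
```

The name-based selection is unaffected by the reordering, while the
position-based one silently returns a different variable.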