-
Notifications
You must be signed in to change notification settings - Fork 10
/
r-basics.Rmd
156 lines (108 loc) · 7.73 KB
/
r-basics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
---
title: 'R: The Basics'
---
```{r, echo=FALSE, message=FALSE, eval=TRUE}
library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, results="hide", cache=TRUE)
options(digits=3)
options(max.print=200)
.ex <- 1 # Track ex numbers w/ hidden var. Increment each ex: `r .ex``r .ex=.ex+1`
```
This section introduces the R environment and some of the most basic funcionality aspects of R that are used through the remainder of the class. This section assumes little to no experience with statistical computing with R. We will introduce the R statistical computing environment, RStudio, and the dataset that we will work with for the remainder of the lesson. We will cover very basic functionality in R, including variables, functions, and importing/inspecting data frames.
**Make sure you [complete the setup here](setup.html) prior to the class.**
## RStudio
Let's start by learning about RStudio. **R** is the underlying statistical computing environment, but using R alone is no fun. **RStudio** is a graphical integrated development environment that makes using R much easier.
- **Options:** First, let's change a few options. We'll only have to do this once. Under _Tools... Global Options..._:
- Under _General_: Uncheck "Restore most recently opened project at startup"
- Under _General_: Uncheck "Restore .RData into workspace at startup"
- Under _General_: Set "Save workspace to .RData on exit:" to Never.
- Under _General_: Set "Save workspace to .RData on exit:" to Never.
- Under _R Markdown_: Uncheck "Show output inline for all R Markdown documents"
- Projects: first, start a new project in a new folder somewhere easy to remember. When we start reading in data it'll be important that the _code and the data are in the same place._ Creating a project creates an Rproj file that opens R running _in that folder_. This way, when you want to read in dataset _whatever.txt_, you just tell it the filename rather than a full path. This is critical for reproducibility, and we'll talk about that more later.
- Code that you type into the console is code that R executes. From here forward we will use the editor window to write a script that we can save to a file and run it again whenever we want to. We usually give it a `.R` extension, but it's just a plain text file. If you want to send commands from your editor to the console, use `CMD`+`Enter` (`Ctrl`+`Enter` on Windows).
- Anything after a `#` sign is a comment. Use them liberally to *comment your code*.
## Basic operations
R can be used as a glorified calculator. Try typing this in directly into the console. Make sure you're typing into into the editor, not the console, and save your script. Use the run button, or press `CMD`+`Enter` (`Ctrl`+`Enter` on Windows).
```{r calculator}
2+2
5*4
2^3
```
R Knows order of operations and scientific notation.
```{r calculator2}
2+3*4/(5+3)*15/2^2+3*4^2
5e4
```
However, to do useful and interesting things, we need to assign _values_ to _objects_. To create objects, we need to give it a name followed by the assignment operator `<-` and the value we want to give it:
```{r assignment}
weight_kg <- 55
```
`<-` is the assignment operator. Assigns values on the right to objects on the left, it is like an arrow that points from the value to the object. Mostly similar to `=` but not always. Learn to use `<-` as it is good programming practice. Using `=` in place of `<-` can lead to issues down the line. The keyboard shortcut for inserting the `<-` operator is `Alt-dash`.
Objects can be given any name such as `x`, `current_temperature`, or `subject_id`. You want your object names to be explicit and not too long. They cannot start with a number (`2x` is not valid but `x2` is). R is case sensitive (e.g., `weight_kg` is different from `Weight_kg`). There are some names that cannot be used because they represent the names of fundamental functions in R (e.g., `if`, `else`, `for`, see [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) for a complete list). In general, even if it's allowed, it's best to not use other function names, which we'll get into shortly (e.g., `c`, `T`, `mean`, `data`, `df`, `weights`). In doubt check the help to see if the name is already in use. It's also best to avoid dots (`.`) within a variable name as in `my.dataset`. It is also recommended to use nouns for variable names, and verbs for function names.
When assigning a value to an object, R does not print anything. You can force to print the value by typing the name:
```{r printAssignment}
weight_kg
```
Now that R has `weight_kg` in memory, we can do arithmetic with it. For instance, we may want to convert this weight in pounds (weight in pounds is 2.2 times the weight in kg).
```{r modAssignment}
2.2 * weight_kg
```
We can also change a variable's value by assigning it a new one:
```{r newAssignment}
weight_kg <- 57.5
2.2 * weight_kg
```
This means that assigning a value to one variable does not change the values of other variables. For example, let's store the animal's weight in pounds in a variable.
```{r calculationWithVar}
weight_lb <- 2.2 * weight_kg
```
and then change `weight_kg` to 100.
```{r modAssignment2}
weight_kg <- 100
```
What do you think is the current content of the object `weight_lb`? 126.5 or 220?
You can see what objects (variables) are stored by viewing the Environment tab in Rstudio. You can also use the `ls()` function. You can remove objects (variables) with the `rm()` function. You can do this one at a time or remove several objects at once. You can also use the little broom button in your environment pane to remove everything from your environment.
```{r rm, eval=FALSE}
ls()
rm(weight_lb, weight_kg)
ls()
weight_lb # oops! you should get an error because weight_lb no longer exists!
```
----
**EXERCISE `r .ex``r .ex=.ex+1`**
What are the values after each statement in the following?
```{r ex1}
mass <- 50 # mass?
age <- 30 # age?
mass <- mass * 2 # mass?
age <- age - 10 # age?
mass_index <- mass/age # massIndex?
```
----
## Functions
R has built-in functions.
```{r fns}
# Notice that this is a comment.
# Anything behind a # is "commented out" and is not run.
sqrt(144)
log(1000)
```
Get help by typing a question mark in front of the function's name, or `help(functionname)`:
```
help(log)
?log
```
Note syntax highlighting when typing this into the editor. Also note how we pass *arguments* to functions. The `base=` part inside the parentheses is called an argument, and most functions use arguments. Arguments modify the behavior of the function. Functions some input (e.g., some data, an object) and other options to change what the function will return, or how to treat the data provided. Finally, see how you can *next* one function inside of another (here taking the square root of the log-base-10 of 1000).
```{r log}
log(1000)
log(1000, base=10)
log(1000, 10)
sqrt(log(1000, base=10))
```
----
**EXERCISE `r .ex``r .ex=.ex+1`**
See `?abs` and calculate the square root of the log-base-10 of the absolute value of `-4*(2550-50)`. Answer should be `2`.
----
## Data Frames
There are _lots_ of different basic data structures in R. If you take any kind of longer introduction to R you'll probably learn about arrays, lists, matrices, etc. We are going to skip straight to the data structure you'll probably use most -- the **data frame**. We use data frames to store heterogeneous tabular data in R: tabular, meaning that individuals or observations are typically represented in rows, while variables or features are represented as columns; heterogeneous, meaning that columns/features/variables can be different classes (on variable, e.g. age, can be numeric, while another, e.g., cause of death, can be text).
[Let's move on to learning about data frames](r-dataframes.html).