# Introduction to R
This lecture introduces the basic operations you need when you first start using R, such as navigation, the object-oriented framework, loading a package, and creating some data vectors.
## Navigation
You need to know a few operations to help you maneuver the R work environment, such as listing objects (datasets and functions) that are active, changing your working directory, listing available files, and finding help.
### Setting Your Working Directory
When you are ready to load data, R needs to know where to look for your files. You can check what is available in the current directory (i.e., folder) by listing its files with **dir()**.
```{r, eval=F}
dir()
```
If the file that you need is located in a different folder, you can change directories easily in RStudio with Session -> Set Working Directory -> Choose Directory (or Ctrl + Shift + H).
If you are writing a script, you want to keep track of this step so that it can be reproduced. Use the function **getwd()** to check your current working directory, and **setwd()** to change it. You need to specify your path as an argument to **setwd()**, for example:
```{r, eval=F}
setwd( "C:/user/projects/financial model" )
```
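As a minimal sketch of the check-and-change cycle described above, the chunk below records the current directory, moves to another folder, and then moves back. It uses `tempdir()` as a stand-in folder (an assumption made only so the example runs on any machine); in your own script you would use your project path instead.

```{r, eval=F}
old_wd <- getwd()      # remember where we started
setwd( tempdir() )     # move to a scratch folder that always exists
new_wd <- getwd()      # confirm the change
new_wd
setwd( old_wd )        # return to the original directory
```

Restoring the original directory at the end keeps the rest of a script from depending on the temporary change.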
NOTE! R uses Unix-style paths with forward slashes. If you copy and paste a path from Windows it will contain backslashes, like this:
```{r, eval=F}
setwd( "C:\user\projects\financial model" )
```
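This will not work as written, because R treats the backslash as an escape character inside a string. Replace each backslash with a forward slash (or double each backslash) for the path to work.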
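If you would rather not edit a pasted path by hand, the conversion can be sketched in code. This example assumes R 4.0 or later, because it relies on the raw string syntax `r"(...)"` to keep the backslashes literal; the path itself is the illustrative one used above.

```{r}
# Raw strings (R >= 4.0) let us paste a Windows path without escaping it.
win_path <- r"(C:\user\projects\financial model)"
# Replace every backslash with a forward slash; fixed=TRUE avoids regex escaping.
r_path <- gsub( "\\", "/", win_path, fixed=TRUE )
r_path
```

The resulting string can then be passed to **setwd()** or to a data-loading function.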
It is best to save all of your steps in your scripts so that the analysis can be reproduced by yourself or others. In some cases you are doing exploratory or summary work, and you may want to find a file quickly. You can use the **file.choose()** function to open a GUI to select your file directly. This function is used as an argument inside of a data-loading function.
```{r, eval=F}
my.dat <- read.csv( file.choose() )
```
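For a script that others need to rerun, the same load step can be written with an explicit path instead of a dialog. The sketch below is only illustrative: it writes a tiny CSV to a temporary folder (the file name `example.csv` and its contents are made up for this example) and reads it back with **read.csv()**.

```{r}
# Hypothetical file, created here only so the example is self-contained.
csv_path <- file.path( tempdir(), "example.csv" )
write.csv( data.frame( id=1:3, score=c(90, 85, 77) ), csv_path, row.names=FALSE )
# Reproducible version of the load step: the path is recorded in the script.
my.dat <- read.csv( csv_path )
my.dat
```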
## Commenting Code
Most computer languages have a special character that is used to "comment out" lines so that they are not run by the program. Comments serve two important purposes. First, we can add text to document our functions without interfering with the program. Second, we can run a program while ignoring some of the code, often for debugging purposes.
The **#** hash character is used for comments in R.
```{r}
##==============================================
##
## Here is some documentation for this script
##
##==============================================
x <- 1:10
sum( x )
# y <- 1:25 # not run
# sum( y ) # not run
```
## Help
You will use the help functions frequently to figure out what arguments and values are needed for specific functions. Because R is very customizable, you will find that many functions have several or dozens of arguments, and it is difficult to remember the correct syntax and values. But don't worry, to look them up all you need is the function name and a call for help:
> help( dotchart ) # opens an external helpfile
If you just need to remind yourself which arguments are defined in a function, you can use the *args()* command:
```{r}
args( dotchart )
```
If you can't recall a function name, you can list all of the functions from a specific package as follows:
> help( package="stats" ) # lists all functions in stats package
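A few related lookups can also be run directly in code, which is handy when you want the answer as an R object rather than a help page. This is a small sketch using base functions only; `stats_funs` is just a name chosen for this example.

```{r}
names( formals( dotchart ) )          # argument names as a character vector
apropos( "chart" )                    # objects on the search path with "chart" in the name
stats_funs <- ls( "package:stats" )   # objects exported by the attached stats package
head( stats_funs )
```

Note that `ls( "package:stats" )` only works for packages that are currently attached.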
## Install Programs (packages)
When you open R, by default it will launch a core set of programs, called "packages" in R speak, that are used for most data operations. To see which packages are currently active use the **search()** function.
```{r}
search()
```
These programs manage the basic data operations, run the core graphics engine, and give you basic statistical methods.
The real magic for R comes from the many thousands of contributed packages available on CRAN: https://cran.r-project.org/web/views/
A package consists of custom functions and datasets that are generated by users. They are *packaged* together so that they can be shared with others. A package also includes documentation that describes each function, defines all of the arguments, and documents any datasets that are included.
If you know a package name, it is easy to install. In RStudio you can select Tools -> Install Packages and a list of available packages will be generated. But it is easier to use the **install.packages()** command. We will use the Lahman package in this course, so let's install that now.
**Description** _This package provides the tables from Sean Lahman's
Baseball Database as a set of R data.frames. It uses the data
on pitching, hitting and fielding performance and other tables
from 1871 through 2013, as recorded in the 2014 version of the
database._
See the documentation here: https://cran.r-project.org/web/packages/Lahman/Lahman.pdf
```{r, eval=F}
install.packages( "Lahman" )
```
You will be asked to select a "mirror". In R speak this just means the server from which you will download the package (choose anything nearby). R is a community of developers and universities that create code and maintain the infrastructure. A couple of dozen universities around the world host servers that contain copies of the R packages so that they can be easily accessed everywhere.
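In a script you may prefer to choose the mirror in code so that no prompt appears. One way to sketch this is to set the `repos` option before installing; the URL below is the CRAN cloud mirror, used here as an assumption about which mirror you want.

```{r}
# Choose a mirror up front so install.packages() does not need to ask.
options( repos = c( CRAN = "https://cloud.r-project.org" ) )
getOption( "repos" )
# install.packages( "Lahman" )   # not run here: would download from the mirror above
```

The same URL can also be passed for a single call through the `repos` argument of **install.packages()**.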
If the package is successfully installed you will get a message similar to this:
> package 'Lahman' successfully unpacked and MD5 sums checked
Once a new program is installed you can open ("load" in R speak) the package using the **library()** command:
```{r, eval=F}
library( "Lahman" )
```
If you now type **search()** you can see that Lahman has been added to the list of active programs. We can now access all of the functions and data that are available in the Lahman package.
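The effect of **library()** on the search path can be checked in code. The sketch below uses the `tools` package as a stand-in, an assumption made because it ships with R and needs no download; the same pattern applies to Lahman once it is installed.

```{r}
"package:tools" %in% search()    # typically FALSE in a fresh session
library( tools )                 # attach the package
attached <- "package:tools" %in% search()
attached                         # the package now appears on the search path
```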
## Accessing Built-In Datasets in R
One nice feature of R is that it comes with a bunch of built-in datasets that have been contributed by users and are loaded automatically. You can see the list of available datasets by typing:
```{r, eval=F}
data()
```
This will list all of the default datasets in core R packages. If you want to see all of the datasets available in installed packages as well, use:
```{r, eval=F}
data( package = .packages(all.available = TRUE) )
```
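The listing returned by **data()** can also be captured and searched like any other object. As a rough sketch, the `results` element is a character matrix with one row per dataset; `ds` is simply a name chosen for this example.

```{r}
ds <- data( package="datasets" )$results   # character matrix, one row per dataset
colnames( ds )                             # which fields are available
head( ds[ , c("Item","Title") ] )          # dataset names and short descriptions
"USArrests" %in% ds[ , "Item" ]            # is a given dataset in the list?
```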
### Basic Data Operations
Let's ignore the underlying data structure right now and look at some ways that we might interact with data.
We will use the **USArrests** dataset available in the core files.
To access the data we need to load it into working memory. Anything that is active in R will be listed in the environment, which you can check using the **ls()** command. We will load the dataset using the **data()** command.
```{r, echo=F}
remove( list=ls() )
```
```{r}
ls() # nothing currently available
data( "USArrests" )
ls() # data is now available for use
```
Now that we have loaded a dataset, we can start to access the variables and analyze relationships. Let's get to know our dataset.
```{r}
names( USArrests ) # which variables are in the dataset?
nrow( USArrests ) # how many observations are there?
dim( USArrests ) # a quick way to see rows and columns
# observation labels (row names) in our data:
row.names( USArrests ) |> head()
# summary statistics for each variable
summary( USArrests ) |> pander::pander()
```
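Two more base functions are often useful at this stage: **str()** shows the type of each variable, and **head()** previews the first rows. This is a small supplementary sketch, not part of the original walkthrough.

```{r}
str( USArrests )              # type of each variable plus the first few values
head( USArrests, n=3 )        # first three rows, including the row names
sapply( USArrests, class )    # class of each column as a named vector
```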
We can see that the dataset consists of four variables: Murder, Assault, UrbanPop, and Rape. We also see that our unit of analysis is the state. But where does the data come from, and how are these variables measured?
To see the documentation for a specific dataset you will need to use the **help()** function:
```{r, eval=F}
help( "USArrests" )
```
We get valuable information about the source and metrics:
**Description** *This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.*
**Format** *A data frame with 50 observations on 4 variables.*
* **Murder**: numeric Murder arrests (per 100,000)
* **Assault**: numeric Assault arrests (per 100,000)
* **UrbanPop**: numeric Percent urban population
* **Rape**: numeric Rape arrests (per 100,000)
To access a specific variable inside of a dataset, you will use the *$* operator between the dataset name and the variable name:
```{r, eval=F}
summary( USArrests$Murder )
summary( USArrests$Assault )
```
```{r, echo=F}
summary( USArrests$Murder ) |> pander::pander()
summary( USArrests$Assault ) |> pander::pander()
```
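The `$` operator is one of several equivalent ways to pull a variable out of a data frame. The sketch below compares it with double-bracket and row-column indexing, and shows **with()** as a way to avoid repeating the dataset name; the `murder_*` names are chosen only for this example.

```{r}
murder_a <- USArrests$Murder          # dollar-sign operator
murder_b <- USArrests[[ "Murder" ]]   # double brackets with the name as a string
murder_c <- USArrests[ , "Murder" ]   # row-column indexing, all rows
identical( murder_a, murder_b )
identical( murder_a, murder_c )
with( USArrests, mean( Murder ) )     # evaluate an expression inside the dataset
```

The string-based forms are convenient when the variable name is stored in another object.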
Is there a relationship between urbanization and murder arrest rates?
```{r}
plot( USArrests$UrbanPop, USArrests$Murder,
      frame.plot=F, pch=19, cex=2,
      col=gray( level=0.5, alpha=0.5 ) )
abline( lm( USArrests$Murder ~ USArrests$UrbanPop ), col="firebrick" )
```
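To put a number on what the plot shows, one option is to compute the correlation and fit the same regression with the `data` argument. This is a sketch of one reasonable approach, not the only way to quantify the relationship; `r` and `fit` are names chosen for this example.

```{r}
r <- cor( USArrests$UrbanPop, USArrests$Murder )   # Pearson correlation
fit <- lm( Murder ~ UrbanPop, data=USArrests )     # same line as in the plot
r
coef( fit )                   # intercept and slope of the fitted line
summary( fit )$r.squared      # equals r^2 for a one-predictor regression
```

Passing `data=USArrests` lets the formula refer to the variables by name, which also gives cleaner coefficient labels.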
### Using the Lahman Data
Let's take a look at some of the data available in the Lahman package.
```{r, eval=F}
data( package = "Lahman" ) # All datasets in package "Lahman":
```
| DATASET NAME | DESCRIPTION |
| ------------------- | ------------------------- |
| AllstarFull | AllstarFull table |
| Appearances | Appearances table |
| AwardsManagers | AwardsManagers table |
| AwardsPlayers | AwardsPlayers table |
| AwardsShareManagers | AwardsShareManagers table |
| AwardsSharePlayers | AwardsSharePlayers table |
| Batting | Batting table |
| BattingPost | BattingPost table |
| CollegePlaying | CollegePlaying table |
| Fielding | Fielding table |
| FieldingOF | FieldingOF table |
| FieldingOFsplit | FieldingOFsplit table |
| FieldingPost | FieldingPost data |
| HallOfFame | Hall of Fame Voting Data |
| HomeGames | HomeGames table |
| LahmanData | Lahman Datasets |
| Managers | Managers table |
| ManagersHalf | ManagersHalf table |
| Parks | Parks table |
| People | People table |
| Pitching | Pitching table |
| PitchingPost | PitchingPost table |
| Salaries | Salaries table |
| Schools | Schools table |
| SeriesPost | SeriesPost table |
| Teams | Teams table |
| TeamsFranchises | TeamFranchises table |
| TeamsHalf | TeamsHalf table |
| battingLabels | Variable Labels |
| fieldingLabels | Variable Labels |
| pitchingLabels | Variable Labels |
We see that we have lots of datasets to choose from here. I will use the **People** dataset, which is a list of all of the Major League Baseball players over the past century and their personal information.
```{r, eval=F}
library( Lahman ) # loads Lahman package
data( People ) # loads the People dataset from Lahman
head( People ) # preview dataset
```
Here are some common functions for exploring datasets:
```{r, eval=F}
names( People ) # variable names
nrow( People ) # number of players (rows) in dataset
summary( People ) # descriptive statistics for each variable
```
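A dataset can also be reached without attaching the whole package by using the `::` operator. The sketch below is guarded so that it only runs the Lahman steps when the package is installed; the column names selected are assumed from the People documentation and may differ across package versions.

```{r, eval=F}
if ( requireNamespace( "Lahman", quietly=TRUE ) ) {
  people <- Lahman::People     # access the data frame without library()
  print( dim( people ) )       # rows and columns
  # Assumed column names; check names( people ) if this fails.
  print( head( people[ , c("playerID","nameFirst","nameLast","birthYear") ] ) )
} else {
  message( "Lahman is not installed; run install.packages( 'Lahman' ) first." )
}
```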
We can use **help(People)** to get information about the dataset, including a data dictionary.
```{r, eval=F}
help( People ) # players dataset
```
<br>
> Start helpfile:
<hr>
<br>
```{r, echo=FALSE, results="asis"}
library( Lahman )
help_console <- function(topic, format=c("text", "html", "latex", "Rd"),
                         lines=NULL, before=NULL, after=NULL) {
  format <- match.arg(format)
  if (!is.character(topic)) topic <- deparse(substitute(topic))
  helpfile <- utils:::.getHelpFile(help(topic))
  hs <- capture.output(switch(format,
    text=tools:::Rd2txt(helpfile),
    html=tools:::Rd2HTML(helpfile),
    latex=tools:::Rd2latex(helpfile),
    Rd=tools:::prepare_Rd(helpfile)
  ))
  if (!is.null(lines)) hs <- hs[lines]
  hs <- c(before, hs, after)
  # cat(hs, sep="\n")
  return(hs)
}
hs <- help_console( topic="People", format="html" )
hs <- gsub( "<!DOCTYPE html>", "", hs )
hs <- gsub( '<table style="width: 100%;"><tr><td>People</td><td style="text-align: right;">R Documentation</td></tr></table>', '', hs )
cat( hs, sep="\n" )
```
<br>
> End helpfile
<hr>
<br>