---
title: "R introduction and dplyr"
author: "Gregor Pirs, Jure Demsar and Erik Strumbelj"
date: "25/7/2019"
output:
  prettydoc::html_pretty:
    highlight: github
    theme: architect
    toc: true
    toc_depth: 2
---
<div style="text-align:center">
<img src="./bstatcomp.png" alt="drawing" width="128"/>
</div>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# R and Rstudio
R (https://www.r-project.org/) is free open-source software for statistical
computing.
The basic interface to R is the console, which is quite rigid. RStudio
(https://www.rstudio.com) provides
us with a better user interface and additional functionality
(R notebooks, R Markdown, ...).
By default, the user interface in RStudio is split into four parts.
The upper left part is used for scripts. These are R files (or similar) which
include our code and represent the main building blocks of our programs.
The lower left part is the console, equivalent to the basic console interface of
R. The upper right part is dedicated to the environment and history. The lower
right part shows our files, plots, packages, and help.
To create a new script, go to File -> New File -> R Script. To run the code,
highlight the desired part of the code and press Ctrl + Enter. Alternatively,
you can run the code by clicking the run icon in the top-right corner of
the script.
To specify the working directory, use the `setwd()` function and pass
the path of the directory in parentheses. For example, to set the
working directory to C:/Author, you would call
```{r, eval=FALSE}
setwd("C:/Author")
```
# Variables
Variables are the basic building blocks of every program. In R, we assign values
to variables with the operator `<-`. We do not need to declare the type of
a variable, as R infers it from the assigned value. We denote strings with `""`.
Comments are written with `#`.
Let's create some variables.
```{r, error=TRUE}
n <- 20
x <- 2.7
m <- n # m gets value 20
my_flag <- TRUE
student_name <- "Luke"
student_name <- Luke # because there is no variable named Luke, it returns an error
```
By using the function `typeof()` we can check the type of a variable.
```{r}
typeof(n)
typeof(student_name)
typeof(my_flag)
```
We can change the types of variables with the `as.<type>()` family of functions
(for example `as.integer()` or `as.character()`). The main
types are **integer**, **double**, **character** (strings), and **logical**.
Note that the type character is used for strings and we do not have a separate
type for single characters.
```{r, error=TRUE}
typeof(as.integer(n))
typeof(as.character(n))
```
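As a small illustration of the conversion functions above, the sketch below checks the type before and after a conversion. The variable names (`count`, `count_int`, `price_text`) are made up for this example. Note that a plain number such as `20` is stored as a double, while the suffix `L` creates an integer.

```{r}
count <- 20                      # a plain number is stored as a double
typeof(count)
count_int <- as.integer(count)   # explicit conversion to integer
typeof(count_int)
typeof(20L)                      # the L suffix creates an integer directly
price_text <- "3.14"             # a number stored as a character string
price <- as.numeric(price_text)  # character to double
price + 1
as.logical("TRUE")               # character to logical
```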
Another common type is date. We can convert a character string to a date with
the `as.Date()` function. When using this function, we have to be careful to
provide the correct format of the date.
```{r, error=TRUE}
some_date <- as.Date("2019-01-01", format = "%Y-%m-%d")
some_date
```
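To show why the format matters, here is a minimal sketch with a date written in day/month/year order. The dates and variable names are chosen only for illustration. Once a value is of type date, we can subtract dates and print them in another format.

```{r}
# hypothetical date given in day/month/year order
report_date <- as.Date("25/07/2019", format = "%d/%m/%Y")
report_date
start_date <- as.Date("2019-01-01")  # the year-month-day form needs no format
report_date - start_date             # difference in days
format(report_date, "%d.%m.%Y")      # back to a character string
```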
To access the values of the variables, we use variable names.
```{r error=TRUE}
n
m
my_flag
student_name
```
We can apply arithmetic operations on numerical variables.
```{r}
n + x
n - x
diff <- n - x # variable diff gets the difference between n and x
diff
n * x
n / x
x^2
sqrt(x)
n > 2 * n # logical is greater
n == n # equals
n == 2 * n
n != n # not equals
```
We can concatenate strings with functions `paste()` and `paste0()`. The
difference between these functions is that the first one separates the inputs
with a space by default, while the second one uses no separator.
```{r}
paste(student_name, "is", n, "years old")
paste0(student_name, "is", n, "years old")
L_username <- paste0(student_name, n)
```
The function `paste()` accepts an additional argument `sep`, which is the string
placed between the inputs. If we want to find out more about a function,
we put a question mark before the function's name in the console.
```{r}
# ?paste
paste(student_name, "is", n, "years_old", sep = "_")
```
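`paste()` also works element-wise on vectors, and its argument `collapse` joins a whole vector into a single string. The sketch below uses made-up names and ids to show the difference between `sep` and `collapse`.

```{r}
first_names <- c("Luke", "Jen", "Mike")           # example values
ids <- 1:3
name_ids <- paste(first_names, ids, sep = "_")    # element-wise: one string per pair
name_ids
all_names <- paste(first_names, collapse = ", ")  # one string from the whole vector
all_names
```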
# Basic data structures
## Vector
Vectors are the most common data structure in R. They consist of several
elements of the same type. We create them with the function `c()` (combine).
```{r}
student_ages <- c(20, 23, 21)
student_names <- c("Luke", "Jen", "Mike")
passed <- c(TRUE, TRUE, FALSE)
```
To access individual elements of vectors we use square brackets with the
sequential number of the elements we want.
**The indexing in R starts with 1**, as opposed to 0 (C++, Java,...).
```{r}
student_ages[2]
student_names[2]
passed[2]
```
To get the length of the vector use `length()`.
```{r}
length(student_names)
```
We can use element-wise arithmetic operations on vectors, and we can compute
the scalar product with `%*%`.
Note that you have to be careful with vector lengths.
If an operation involves two vectors that are not of the same length,
the shorter one is recycled (repeated periodically) until it reaches
the length of the longer one. If the length of the longer vector is not a
multiple of the length of the shorter one, R also gives us a warning.
```{r error = TRUE}
a <- c(1, 3, 5)
b <- c(2, 2, 1)
d <- c(6, 7)
a + b
a * b
a + d # not the same length, d becomes (6, 7, 6)
a + 2 * b
a %*% b # scalar product
a > b # logical relations between elements
b == a
```
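The recycling rule can be checked on a few small vectors. This is only a sketch with made-up values; `tryCatch()` is used here to capture the warning message instead of printing it.

```{r}
long_vec <- c(1, 2, 3, 4)
short_vec <- c(10, 20)
long_vec + short_vec   # lengths 4 and 2: short_vec is used twice, no warning
odd_vec <- c(1, 2, 3)
# lengths 3 and 2: recycling still happens, but R signals a warning
result <- tryCatch(odd_vec + short_vec,
                   warning = function(w) conditionMessage(w))
result
long_vec * 2           # a single number is recycled over the whole vector
```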
We often want to select only specific elements of a vector. There are several
ways to do that---for example all of the calls below return the first two
elements of vector `a`.
```{r error = TRUE}
a[c(TRUE, TRUE, FALSE)] # selection based on logical vector
a[c(1,2)] # selection based on indexes
a[a < 5] # selection based on logical condition
```
We can also use several conditions. If we want both conditions to hold, we use
and (`&`); if only one has to hold, we use or (`|`). Note that for element-wise
conditions on vectors we use a single
symbol for each, as opposed to some other programming languages
that use two.
```{r error = TRUE}
a[a > 2 & a < 4]
a[a < 2 | a > 4]
```
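R also has the doubled operators `&&` and `||`, which are meant for single values, for example inside `if`. The sketch below, with made-up values, contrasts the two forms and adds the helpers `any()` and `all()`.

```{r}
values <- c(1, 3, 5)
in_range <- values > 2 & values < 6    # element-wise: one result per element
in_range
values[in_range]
threshold <- 4                         # a single (hypothetical) number
if (threshold > 2 && threshold < 6) {  # && compares single values
  message("threshold is between 2 and 6")
}
any(values > 4)   # is at least one element greater than 4?
all(values > 0)   # are all elements positive?
```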
## Factor
Factors are used for coding categorical variables, which can only take
a finite number of predetermined values. We can further divide categorical
variables into nominal and ordinal. Nominal variables don't have an ordering
(for example car brand), while ordinal variables do (for example
frequency: never, rarely, sometimes, often, always). Ordinal variables
have an ordering, but usually we cannot quantify the differences between their
values (for example sometimes is more than rarely, but we do not know how much more).
In R we create factors with the function `factor()`. When creating factors, we can
determine in advance which values the factor can take with the argument
`levels`. If we try to assign a value that is not among the levels, R
turns it into NA and gives a warning.
```{r error = TRUE}
car_brand <- factor(c("Audi", "BMW", "Mercedes", "BMW"), ordered = FALSE)
car_brand
freq <- factor(x = NA,
               levels = c("never","rarely","sometimes","often","always"),
               ordered = TRUE)
freq[1:3] <- c("rarely", "sometimes", "rarely")
freq
freq[4] <- "quite_often" # non-existing level, returns NA
freq
```
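A few base R functions help us inspect a factor. The following sketch uses a made-up vector of answers; it shows the levels, the internal integer codes, the counts per level, and comparisons, which are only available for ordered factors.

```{r}
answer_levels <- c("never", "rarely", "sometimes", "often", "always")
answers <- factor(c("often", "never", "sometimes", "often"),
                  levels = answer_levels, ordered = TRUE)
levels(answers)            # the allowed values, in order
as.integer(answers)        # internal integer codes (position of each level)
table(answers)             # counts per level, including unused levels
answers[1] > answers[2]    # comparisons work because the factor is ordered
answers[answers >= "sometimes"]
```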
## Matrix
Two-dimensional generalizations of vectors are matrices. We create them
with the function `matrix()`, where we have to provide the values and either
the number of rows or columns. Additionally, the argument `byrow = TRUE`
fills the matrix with provided elements by rows (default is by columns).
```{r}
my_matrix <- matrix(c(1, 2, 1,
                      5, 4, 2),
                    nrow = 2,
                    byrow = TRUE)
my_matrix
my_square_matrix <- matrix(c(1, 3,
                             2, 3),
                           nrow = 2)
my_square_matrix
```
To access individual elements we use square brackets, where we divide the
dimensions by a comma.
```{r}
my_matrix[1,2] # first row, second column
my_matrix[2, ] # second row
my_matrix[ ,3] # third column
```
Some useful functions for matrices.
```{r}
nrow(my_matrix) # number of matrix rows
ncol(my_matrix) # number of matrix columns
dim(my_matrix) # matrix dimension
t(my_matrix) # transpose
diag(my_matrix) # the diagonal of the matrix as vector
diag(1, nrow = 3) # creates a diagonal matrix
det(my_square_matrix) # matrix determinant
```
We can also use arithmetic operations on matrices. Note that we have to be
careful with matrix dimensions. For matrix multiplication, we use `%*%`.
```{r error = TRUE}
my_matrix + 2 * my_matrix
my_matrix * my_matrix # element-wise multiplication
my_matrix %*% t(my_matrix) # matrix multiplication
my_square_matrix %*% my_matrix
my_matrix %*% my_square_matrix # wrong dimensions
```
We can transform a matrix into a vector.
```{r error = TRUE}
my_vec <- as.vector(my_matrix)
my_vec
```
## Array
Multi-dimensional generalizations of matrices are arrays.
```{r}
my_array <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 2, 2))
my_array[1, 1, 1]
my_array[2, 2, 1]
my_array[1, , ]
dim(my_array)
```
## Data frame
Data frames are the basic data structure used in R for data analysis.
A data frame has the form of a table, where columns represent individual
variables, and rows represent observations. Data frames differ from matrices
in that the columns can be of different types. We access elements the same way
as in matrices.
We can combine vectors into data frames with `data.frame()`. In R versions
before 4.0.0, the function transforms variables of type character into factors
by default; since R 4.0.0 they remain character. To control this explicitly,
we set the argument `stringsAsFactors`.
We can assign column names with the function `colnames()`.
```{r}
student_data <- data.frame(student_names, student_ages, passed,
                           stringsAsFactors = FALSE)
colnames(student_data) <- c("Name", "Age", "Pass")
student_data
```
We can also assign column names directly, when creating a data frame.
```{r}
student_data <- data.frame("Name" = student_names,
                           "Age" = student_ages,
                           "Pass" = passed)
student_data
```
Similar to vectors, we can access the elements in data frames (and matrices)
with logical calls. Here we need to be careful if we are selecting rows or
columns. To access specific columns, we can also use the name of the column
preceded by `$`.
```{r}
student_data[ ,colnames(student_data) %in% c("Name", "Pass")]
student_data[student_data$Pass == TRUE, ]
student_data$Pass
```
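The same access patterns can be tried on a small stand-alone data frame. The data frame `pets` and its values are invented for this sketch. It also shows the double square brackets, which return a single column, and how to add a new column with `$`.

```{r}
# small example data frame with made-up values
pets <- data.frame(name = c("Rex", "Tom", "Kiki"),
                   age = c(3, 7, 2),
                   vaccinated = c(TRUE, FALSE, TRUE))
pets[pets$age > 2, ]                 # rows that satisfy a condition
pets[pets$vaccinated, "name"]        # one column of the selected rows
pets[["age"]]                        # same as pets$age
pets$age_in_months <- pets$age * 12  # add a new column
pets
```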
## List
Lists are a very useful data structure, especially when we are dealing with
different data sets and data structures. We can imagine a list as a vector,
where each element can be a different data structure. For example, a list can
have a vector stored at index 1, a matrix at index 2, and a data frame at
index 3. Moreover, a list can be an element of a list and so on.
```{r}
first_list <- list(student_ages, my_matrix, student_data)
second_list <- list(student_ages, my_matrix, student_data, first_list)
```
We access the elements of a list with double square brackets.
```{r}
first_list[[1]]
second_list[[4]]
second_list[[4]][[1]] # first element of the fourth element of second_list
```
We can also apply `length()` to get the number of elements in the list.
```{r}
length(second_list)
```
To append an element to a list, we use the call below.
```{r}
second_list[[length(second_list) + 1]] <- "add_me"
second_list[[length(second_list)]] # check, what is on the last index
```
Additionally, we can name the elements of the list, and access them by name.
For that we use the `names()` function.
```{r}
names(first_list) <- c("Age", "Matrix", "Data")
first_list$Age
```
# Packages
R is an open-source programming language and anyone can contribute to its
development. Many packages exist that make our work in R easier.
Additionally, some packages include different
statistical models, some of which are implemented in other languages for
efficiency (for example C++). The open-source repository CRAN contains most
packages that you are going to need. To install a specific package, we use
the function `install.packages()`, or we can use RStudio's UI. Once a
package is installed, we can load it into our workspace with `library()`.
We will get to know several useful packages during this workshop.
```{r eval = FALSE}
install.packages("dplyr") # install a package from CRAN
library(dplyr) # load the package into workspace
```
# Data import {#bpod}
We often encounter data in the csv (comma-separated values) format. Different
packages in R allow us to read data from csv, txt, xlsx, and other formats.
Here we will go through reading data from csv and xlsx formats.
To read csv data, use `read.csv()` from the
package `utils`. Before we read the data, we need to check two things.
First, which character separates the columns and how the decimal
places are denoted (comma or dot).
Second, whether the data have a header (does the first row contain column names?).
The function returns a data frame. By default, `read.csv()` assumes that
a comma is the column separator and a dot is the decimal mark. However, we can
change these defaults with the arguments `sep` and `dec`. It
also assumes that we have a header by default. When saving your data in the
csv format, we recommend using a semi-colon as the separator, as a comma is
often used a) in text, b) as the decimal separator, or c) as the
thousands separator.
In our **data** folder, we have a medical insurance data set acquired from Kaggle
(https://www.kaggle.com/easonlai/sample-insurance-claim-prediction-dataset/).
To show different reading functions, we saved the data set in three
different formats: csv with a comma separator, csv with a semi-colon
separator, and an xlsx file. The files also contain a
header. The function `head()` returns
the first six rows of the data frame.
```{r}
library(utils)
claim_data <- read.csv("./data/insurance01.csv")
head(claim_data)
```
The dot in the path represents the current working directory. In R versions
before 4.0.0, `read.csv()` automatically converts string variables (sex, smoker,
region) to factors. In our case this is sensible. However, sometimes we want
strings to remain strings. In those cases, set the argument
`stringsAsFactors = FALSE`.
Along with a semi-colon as the separator, the second file uses
a decimal comma. Therefore we set the arguments `sep` and `dec`.
```{r eval = FALSE}
claim_data <- read.csv("./data/insurance02.csv", sep = ";", dec = ",")
```
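To see the effect of `sep` and `dec` without the workshop files, we can write a small data frame to a temporary file and read it back. The data frame `example_df` and its values are made up; `write.csv2()` and `read.csv2()` are the base R variants that use a semi-colon separator and a decimal comma.

```{r}
# made-up data, written to a temporary file
example_df <- data.frame(id = 1:3, bmi = c(27.9, 33.77, 22.705))
tmp_file <- tempfile(fileext = ".csv")
write.csv2(example_df, tmp_file, row.names = FALSE)  # ";" separator, "," decimal
readLines(tmp_file)                                  # inspect the raw text
imported <- read.csv(tmp_file, sep = ";", dec = ",")
str(imported)                                        # bmi is read as a number
imported2 <- read.csv2(tmp_file)                     # shortcut with the same settings
```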
Data are often saved as xlsx. To read data from xlsx, we use the `read.xlsx`
function from the package __xlsx__.
However, this function can be quite slow, so if you are dealing with
large data frames, it might be better to save the Excel file as a csv file
and then read it as csv.
```{r eval = FALSE}
library(xlsx)
claim_data <- read.xlsx("./data/insurance03.xlsx", sheetIndex = 1) # assumes the data are in the first sheet
```
# If statement
We often want to execute code based on some condition. For that we use
the `if`-`else` pair.
```{r}
x <- 5
if (x < 0) {
  print("x is smaller than 0")
} else if (x == 0) {
  print("x is 0")
} else {
  print("x is greater than 0")
}
```
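The `if` statement expects a single logical value. When we want to apply a condition to every element of a vector, one option in base R is the function `ifelse()`. The sketch below repeats the comparison with zero for a made-up vector `numbers`.

```{r}
numbers <- c(-2, 0, 3)   # example values
# ifelse() evaluates the condition for every element
signs <- ifelse(numbers < 0, "negative",
                ifelse(numbers == 0, "zero", "positive"))
signs
```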
# Loops
The most useful loop in R is the for loop. In the for loop we have to define
a new variable, which will represent the different iterations of the loop.
Then we have to define the values over which that variable will iterate. Often,
these are sequential numbers. For example, let us add the first 10 natural numbers.
```{r}
my_sum <- 0
for (i in 1:10) { # 1:10 returns a vector of natural numbers between 1 and 10
  my_sum <- my_sum + i
}
my_sum
```
The values in a for loop do not have to be sequential numbers.
```{r}
my_sum <- 0
some_numbers <- c(2, 3.5, 6, 100)
for (i in some_numbers) {
  my_sum <- my_sum + i
}
my_sum
```
For example, let us calculate the average charges per region on our data set.
```{r}
regions <- unique(claim_data$region) # returns unique values in region column
for (reg in regions) {
  tmp_data <- claim_data[claim_data$region == reg, ]
  charges <- tmp_data$charges
  print(paste0("Region: ", reg,
               ", average charges: ", mean(charges)))
}
```
# Functions
Base R includes several functions intended for easier work with data, for
example `length()`, `dim()`, `colnames()`, ... We can extend the set of functions
with packages. For example, the package **stats** allows us to create statistical
models with a single function call, such as the linear model `lm()`.
Here we will present some useful functions; more complex functions will
follow in later chapters. Remember, if we want additional information about
a function, we can type its name in the console, preceded by
a question mark (for example `?length`).
```{r}
1:10 # the colon operator returns a sequence of integers
sum(1:10) # sum of the first 10 natural numbers
sum(c(3,5,6,3))
rep(1, times = 5) # returns a vector of length 5, where all values are 1
rep(c(1,2), times = 5) # returns a vector of length 10, where 1 and 2 alternate
seq(0, 2, by = 0.5) # vector from 0 to 2 in steps of 0.5
prod(1:10) # product of the first 10 natural numbers
round(5.24)
5^5 # 5 to the power of 5
sqrt(16) # square root
as.character(c(1,6,3)) # transforms a numerical vector to a character vector
```
We often want a summary of our data. We can get it with `summary()`. We
can use it on vectors and on data frames. The returned values are dependent
on the types of variables.
```{r}
summary(student_ages)
summary(student_names)
summary(passed)
summary(car_brand)
summary(freq)
summary(student_data) # summary of the whole data frame
```
## Writing functions
We can write our own functions with `function()`. In the parentheses, we
define the parameters the function gets, and in curly brackets we define what
the function does. We use `return()` to return values.
```{r}
sum_first_n_elements <- function(n) {
  my_sum <- 0
  for (i in 1:n) {
    my_sum <- my_sum + i
  }
  return(my_sum)
}
sum_first_n_elements(10)
```
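A possible variation of the function above, given here only as a sketch: the name `sum_first_n` and the default value of `n` are our own choices. It adds a default argument and uses `seq_len()` instead of `1:n`, because `1:0` counts down from 1 to 0 rather than giving an empty sequence.

```{r}
sum_first_n <- function(n = 10) {   # n defaults to 10 if it is not provided
  my_sum <- 0
  for (i in seq_len(n)) {           # seq_len(0) is empty, so the loop is skipped
    my_sum <- my_sum + i
  }
  return(my_sum)
}
sum_first_n()      # uses the default value of n
sum_first_n(100)
sum_first_n(0)
1:0                # counts down, which is why 1:n misbehaves for n = 0
```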
If we want a function to return several different data structures,
we use a list.
For example, let us look at a function which gets a matrix
as input, and returns its transpose and determinant.
```{r}
get_transpose_and_det <- function(mat) {
  trans_mat <- t(mat)
  det_mat <- det(mat)
  out <- list("transposed" = trans_mat,
              "determinant" = det_mat)
  return(out)
}
mat_vals <- get_transpose_and_det(my_square_matrix)
mat_vals$transposed
mat_vals$determinant
```
## Other useful functions for data summarizing
There are several functions that are useful when working with data. We already
mentioned the `summary()` function. Let's look at some other functions.
To generate random numbers we can use a variety of random number generators.
Which one we select depends on the data that we wish to generate. Usually, we
want to be able to replicate our analysis exactly, therefore we recommend the
use of a seed, which makes R generate the same random numbers every time we
run the code. There is a function for that in R called `set.seed()`.
```{r}
set.seed(0)
norm_dat <- rnorm(1000, 5, 6)  # generate 1000 samples from the normal
                               # distribution with mean 5 and standard deviation 6
count_dat <- rpois(2000, 8)    # generate 2000 samples from the Poisson
                               # distribution with mean 8
unif_dat <- runif(1000, -2, 5) # generate 1000 samples from the uniform
                               # distribution from -2 to 5
```
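The effect of the seed can be checked directly. In this sketch the seed value 42 is arbitrary: resetting the seed restarts the sequence of random numbers, while drawing again without resetting continues it.

```{r}
set.seed(42)               # the seed value itself is arbitrary
first_draw <- rnorm(3)
set.seed(42)               # resetting the seed restarts the sequence
second_draw <- rnorm(3)
identical(first_draw, second_draw)
third_draw <- rnorm(3)     # without resetting, the generator moves on
identical(first_draw, third_draw)
```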
In data science, we often work with statistics, so let's look at some functions
which provide us with meaningful information about our data.
```{r}
mean(norm_dat)
var(norm_dat) # variance
sd(norm_dat) # standard deviation
max(norm_dat)
min(norm_dat)
quantile(norm_dat) # calculates 5 quantiles of the data
```
We often want to standardize the data, before doing analysis. We can do that
manually, or we can use R's `scale()` function.
```{r}
st_dat <- scale(norm_dat)
mean(st_dat)
var(st_dat)
```
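For comparison, the manual version of standardization subtracts the mean and divides by the standard deviation. The sketch below, on freshly generated example data, checks that it agrees with `scale()`. Note that `scale()` returns a one-column matrix, so we convert it to a vector before comparing.

```{r}
set.seed(1)
raw <- rnorm(100, mean = 5, sd = 6)      # example data
manual <- (raw - mean(raw)) / sd(raw)    # subtract the mean, divide by the sd
scaled <- as.vector(scale(raw))          # scale() returns a one-column matrix
all.equal(manual, scaled)
c(mean = mean(manual), sd = sd(manual))  # approximately 0 and 1
```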
# Debugging
For debugging in R we will use the `browser()` function. It stops the
execution of the code, and we can inspect the variables in the environment at
the moment `browser()` was called.
For browser commands see `?browser` or type `help` when the browser is active.
# Data wrangling with dplyr
dplyr is a package for easier data manipulation. It is part of a collection of
packages called **tidyverse**, which consists of several R packages intended for
data science. dplyr is especially useful for data frame manipulation.
The main data format in tidyverse is a **tibble**. This data
structure is very similar to base R's data frame; however, it is designed for
easier work with other packages in tidyverse and also provides a different
print output. Let's look at it on our insurance data set.
```{r message=FALSE, warning=FALSE}
library(dplyr)
```
```{r}
claim_data <- read.csv("./data/insurance01.csv")
head(claim_data)
claim_data <- as_tibble(claim_data)
claim_data
```
A tibble only shows the first 10 rows of the data set for clarity. Additionally,
it only prints as many columns as fit into a page, and lists other columns
below. If we wish to see all of the tibble, we can use the function `View()`.
Under the variable names, a tibble shows the type of the variables.
Now that we have our starting data set, we can begin manipulating it. This
usually consists of selecting specific rows and columns, and adding statistics
derived from variables in the data frame. Below we describe five functions
that enable flexible data set manipulation.
## Filter
The function `filter()` allows us to select rows, based on values of the
variables. As input it gets a tibble and the conditions and it outputs a new
tibble that consists only of desired rows.
```{r warning=FALSE}
filter(claim_data, region == "southwest")
filter(claim_data, region == "southwest", age >= 30)
```
The conditions in `filter()` are combined with and: all conditions have to be
satisfied. If we want to use or, we have to separate them with a vertical bar `|`.
```{r}
filter(claim_data, region == "southwest" | region == "northwest")
```
Or, the same can be achieved by using the operator %in%.
```{r}
filter(claim_data, region %in% c("southwest", "northwest"))
```
For example, let's say we are interested in doing further analysis on
people older than 29, who live in the south. We can construct a new tibble,
where we filter out the unnecessary rows.
```{r}
claim_df <- filter(claim_data, region %in% c("southwest", "southeast"),
age >= 30)
```
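The rules for combining conditions can be tried on a tiny tibble. The tibble `toy_claims` and its values are invented and only mimic the columns used above. Separating conditions with a comma gives the same rows as joining them with `&`, and `!` negates a condition.

```{r message=FALSE, warning=FALSE}
library(dplyr)
# made-up rows that mimic the columns used above
toy_claims <- tibble(age = c(25, 34, 41, 29),
                     region = c("southwest", "northeast", "southeast", "southwest"),
                     charges = c(1200, 3400, 5600, 800))
with_comma <- filter(toy_claims, region %in% c("southwest", "southeast"), age >= 30)
with_and <- filter(toy_claims, region %in% c("southwest", "southeast") & age >= 30)
with_comma
all.equal(with_comma, with_and)
filter(toy_claims, !(region %in% c("southwest", "southeast")))  # negation with !
```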
## Arrange
To arrange data we use dplyr's function `arrange()`, which gets a tibble and the
variables on which to arrange. If we want a descending arrangement, we have to
use function `desc()`.
```{r}
arrange(claim_df, age)
arrange(claim_df, age, desc(charges))
```
## Select
In our current data set we have a relatively small number of columns, so
working with our tibble is not too complicated. However, we often encounter
data sets with large numbers of columns. In such situations, we might want
to select a subset of columns. For that we have the function `select()`.
To select certain columns, pass their names to `select()`.
```{r}
select(claim_df, age, sex)
```
We can also select all columns between two columns with a colon. Using a
minus sign will select all columns except the ones in the expression.
```{r}
select(claim_df, bmi:region)
select(claim_df, -(bmi:region))
```
There are several utility functions that let us select columns based on
their names, for example `ends_with`, `starts_with`, or `contains`.
```{r}
select(claim_df, starts_with("c"))
```
## Mutate
To create new variables in the data frame, dependent on the existing variables,
we can use the `mutate()` function. For example, let's create a new variable,
which will consist of charges per insured person.
```{r}
claim_df <- mutate(claim_df, charges_per_person = charges / (children + 1))
claim_df
```
We can also use our own functions when creating new variables.
For example, let us create a new variable, which will classify the
insured according to the standard BMI categories.
```{r}
classify_bmi <- function(bmi) {
  bmi_classes <- rep("underweight", times = length(bmi))
  bmi_classes[bmi >= 18.5 & bmi < 25] <- "normal"
  bmi_classes[bmi >= 25] <- "overweight"
  bmi_classes <- factor(bmi_classes, levels = c("underweight",
                                                "normal",
                                                "overweight"),
                        ordered = TRUE)
  return(bmi_classes)
}
claim_df <- mutate(claim_df, bmi_class = classify_bmi(bmi))
claim_df
```
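As an alternative sketch, base R's `cut()` can produce the same three classes without writing a helper function. The values in `bmi_values` are made up. The argument `right = FALSE` makes the intervals closed on the left, which matches the conditions `bmi >= 18.5` and `bmi >= 25` used above.

```{r}
bmi_values <- c(17, 18.5, 24.9, 25, 31.2)   # example values
bmi_class <- cut(bmi_values,
                 breaks = c(-Inf, 18.5, 25, Inf),
                 labels = c("underweight", "normal", "overweight"),
                 right = FALSE,             # intervals closed on the left: [18.5, 25)
                 ordered_result = TRUE)
data.frame(bmi_values, bmi_class)
```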
The tibble is too wide to show all variables. Let us use select to
check the values of our new variable.
```{r}
select(claim_df, bmi, bmi_class)
```
## Summarise
The `summarise()` function aggregates the data by groups.
The groups are defined with the function `group_by()`. If no groups are defined,
the data are aggregated over the whole tibble.
```{r}
summarise(claim_df, mean_age = mean(age), mean_charges = mean(charges))
```
To get something more meaningful, we first need to group the data.
For example, let us look at the mean charges, depending on whether
the insured is a smoker and on their BMI class.
```{r}
g_data <- group_by(claim_df, smoker, bmi_class)
summarise(g_data, mean_charges = mean(charges))
```
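We can compute several statistics per group in a single call. The sketch below uses an invented tibble `toy_claims`; `n()` is a dplyr helper that returns the number of rows in the current group.

```{r message=FALSE, warning=FALSE}
library(dplyr)
# made-up charges for smokers and non-smokers
toy_claims <- tibble(smoker = c("yes", "no", "no", "yes", "no"),
                     charges = c(30000, 4000, 6000, 20000, 5000))
by_smoker <- summarise(group_by(toy_claims, smoker),
                       mean_charges = mean(charges),
                       max_charges = max(charges),
                       n = n())   # number of rows in each group
by_smoker
```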
## The pipe
To arrive at the above results we made several changes to the original data
set. However, we can use the pipe `%>%` to chain all these calls sequentially,
without creating an additional data set or changing the original.
Let us demonstrate how to get the same result as above with the use of the pipe.
```{r}
claim_data %>%
  filter(age >= 30, region %in% c("southwest", "southeast")) %>%
  mutate(bmi_class = classify_bmi(bmi)) %>%
  group_by(smoker, bmi_class) %>%
  summarise(mean_charges = mean(charges))
```
To count the number of cases in each group, use `count()`.
```{r}
claim_data %>%
  filter(age >= 30, region %in% c("southwest", "southeast")) %>%
  mutate(bmi_class = classify_bmi(bmi)) %>%
  group_by(smoker, bmi_class) %>%
  count()
```
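The pipe passes the object on its left as the first argument of the function on its right, so a chain of pipes is equivalent to nested calls. The sketch below checks this on an invented tibble `toy_claims`.

```{r message=FALSE, warning=FALSE}
library(dplyr)
# made-up values for illustration
toy_claims <- tibble(age = c(25, 34, 41, 52),
                     charges = c(1200, 3400, 5600, 800))
# nested calls are read from the inside out
nested <- arrange(filter(toy_claims, age >= 30), desc(charges))
# the pipe passes the left-hand side as the first argument of the next call
piped <- toy_claims %>%
  filter(age >= 30) %>%
  arrange(desc(charges))
all.equal(nested, piped)
```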
# Long and wide data formats
Usually we encounter data in a wide format. A wide format of data is
a format where each row represents an object, some columns
represent identifiers of this object, and several columns contain
measurements associated with this object. On the other hand, in a long format
each row represents a measurement. In other words, the columns that
contain object identifiers remain unchanged, but we get a new row for
each of the measured values. The long format is usually easier to process,
while the wide format is easier to comprehend. Also several R functions
(for example `ggplot`) require a long data format.
The functions for conversion between the formats in __tidyr__ are
`gather` (wide to long) and `spread` (long to wide). Let us look at how to
use them on stock market data (acquired from the R package __datasets__).
Here we have the daily closing prices of four major European stock indices
between the years 1991 and 1998. Each row represents an object: the day
of the closing prices. Then we have four measurements (prices). This
data frame is therefore in a wide format. Let us convert it to a long format,
and then back to wide, to see how to use `gather` and `spread`.
```{r, warning = FALSE}
library(tidyr)
stock_df <- datasets::EuStockMarkets
stock_df <- as_tibble(data.frame(X = as.matrix(stock_df), time=time(stock_df)))
stock_df
df_long <- gather(stock_df, key = "stock", value = "price", -time)
df_long
df_wide <- spread(df_long, key = "stock", value = "price")
df_wide
```
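Newer versions of tidyr (1.0.0 and later) also provide `pivot_longer()` and `pivot_wider()` for the same two conversions. The sketch below assumes such a version is installed and uses a small made-up data frame `toy_wide` instead of the full stock data.

```{r message=FALSE, warning=FALSE}
library(tidyr)
# made-up prices for two indices on two days
toy_wide <- data.frame(time = c(1, 2),
                       DAX = c(1628.75, 1613.63),
                       SMI = c(1678.1, 1688.5))
# wide to long: every column except time becomes a (stock, price) pair
toy_long <- pivot_longer(toy_wide, cols = -time,
                         names_to = "stock", values_to = "price")
toy_long
# long to wide: one column per stock again
toy_back <- pivot_wider(toy_long, names_from = stock, values_from = price)
toy_back
```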