generated from jtr13/cctemplate
-
Notifications
You must be signed in to change notification settings - Fork 78
/
r_data_cleaning.Rmd
725 lines (574 loc) · 33 KB
/
r_data_cleaning.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
# Data cleaning with r
Jincheng Liu and Yudu Chen
```{r, echo = TRUE}
#install.packages(openintro)
library(openintro)
#install.packages('dplyr')
library(dplyr)
#install.packages('tidyr')
library(tidyr)
#install.packages(imputeTS)
library(imputeTS)
#install.packages(caret)
library(caret)
```
## Data cleaning with R
In the real world, the data sets we are handling are often not data scientists can readily use. They might contain duplicate entries when entries are supposed to be unique. They might contain missing values, which is problematic for tasks such as training predicative modeling. The data features might be at vastly different scale, which both induces instability of float point arithmatics and inaccurate measurement of feature importance in training machine learning model. We will discuss how to use R to clean Data sets where situations above take place.
## Import data set
We will explore techniques of data cleaning using "Airquality" dataset from base R and duke_forest data set from openintro. The data set is available in baseR and does not require import from outside source.
```{r, echo=TRUE}
airQuality_preProcessing <- airquality
head(airQuality_preProcessing, 10)
duke_forest_copy <- duke_forest
```
## Introduction
In the real world, the data sets we are handling are often not data scientists can readily use. They might contain duplicate entries when entries are supposed to be unique. They might contain missing values, which is problematic for tasks such as training predicative modeling. The data features might be at vastly different scale, which both induces instability of float point arithmatics and inaccurate measurement of feature importance in training machine learning model. We will discuss how to use R to clean Data sets where situations above take place.
## Data frame modification (general modification)
For the most of the time, we will have to engineer our data sets before we do some data analyzation on them. Here I present some tricks to modify our dat sets in R.
### Create DataFrame by arrays
Creating DataFrame by two arrays, These two arrays will by columns of df.Array length should be the same.order matters.
```{r, echo=TRUE}
first_column <- c("value_1", "value_2", "value_3")
second_column <- c("value_4", "value_5", "value_6")
third_column <- c("value_7", "value_8", "value_9")
df <- data.frame(first_column, second_column)
df
```
another way to create dataFrame, both method produce the same result.
```{r, echo=TRUE}
df <- data.frame (first_column = c("value_1", "value_2", "value_3"),
second_column = c("value_4", "value_5", "value_6"),
third_column = c("value_7", "value_8", "value_9")
)
df
```
### Create DataFrame by combining two dataFrame(using cbind/rbind)
```{r, echo=TRUE}
# create a new dataFrame
df1 <- data.frame (first_column = c("value_1", "value_2", "value_3"),
second_column = c("value_4", "value_5", "value_6"),
third_column = c("value_7", "value_8", "value_9")
)
df2 <- data.frame (third_column = c("value_7", "value_8", "value_9"),
fourth_column = c("value_10", "value_11", "value_12")
)
```
combine two dataframe by columns,order matters.
```{r, echo=TRUE}
bigger_df <- cbind(df1, df2)
print(bigger_df)
```
combine two dataframe by rows, two DataFrames must have the same column name and order.
```{r, echo=TRUE}
df3 <- data.frame (third_column = c("value_7", "value_8", "value_9"),
fourth_column = c("value_10", "value_11", "value_12")
)
bigger_df2 <- rbind(df2,df3)
print(bigger_df2)
```
### adding/change row name
```{r, echo=TRUE}
print(bigger_df2)
rownames(bigger_df2) <- LETTERS[16:21]
bigger_df2 <- data.frame(bigger_df2,
row.names = LETTERS[16:21])
# both ways produce the same result
print(bigger_df2)
# rownames can change the row name.
rownames(bigger_df2) <- 1:6
print(bigger_df2)
## get row names
rownames(bigger_df2)
```
### changing column names
```{r, echo=TRUE}
print(df)
# change all column names
colnames(df) <- c('C1','C2')
print(df)
# change specific column name
colnames(df)[1] <- c('new column name')
print(df)
#get column names
colnames(df)
```
### Adding array to to dataFrame as new column
```{r, echo=TRUE}
print(bigger_df2)
bigger_df_new = bigger_df2
bigger_df_new$new <- c(3, 3, 6, 7, 8, 12)
bigger_df_new['new'] <- c(3, 3, 6, 7, 8, 12)
new <- c(3, 3, 6, 7, 8, 12)
bigger_df_new <- cbind(df, new)
# three ways produce the same result.
print(bigger_df_new)
```
### adding column by combination of other columns
```{r, echo=TRUE}
# using mutate() from dplyr
df_math <- data.frame (first_column = c(1, 2, 3),
second_column = c(4, 5, 6)
)
## add new column as mathematical operation of other columns.
output <- mutate(df_math,
sum = (first_column + second_column) / 2)
print(output)
## add new column by boolean operation of other columns.
output <- mutate(df_math,
divisible_by2 = case_when(
first_column%%2 == 0 ~ "yes",
TRUE ~ "No" # otherwise not divisiable by 2.
))
print(output)
```
### joining two table **[2]**
Bewaring that the order of dataframe in the merge function matters(right join left)
```{r, echo=TRUE}
emp_df=data.frame(
emp_id=c(1,2,3,4,5,6),
name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
superior_emp_id=c(-1,1,1,2,2,2),
dept_id=c(10,20,10,10,40,50),
dept_branch_id= c(101,102,101,101,104,105)
)
dept_df=data.frame(
dept_id=c(10,20,30,40),
dept_name=c("Finance","Marketing","Sales","IT"),
dept_branch_id= c(101,102,103,104)
)
# inner join
print(merge(x = emp_df, y = dept_df, by = "dept_id"))
# outer join
print(merge(x = emp_df, y = dept_df, by = "dept_id", all = TRUE)) # all means containing all rows
# left join
print(merge(x = emp_df, y = dept_df, by = "dept_id", all.x = TRUE)) # all.x means containing all rows in x
# right join
print(merge(x = emp_df, y = dept_df, by = "dept_id", all.y = TRUE)) # all.y means containing all rows in y
```
### pivoting dataframe, understanding pivot_longer
```{r, echo=TRUE}
# pivot_longer transform a data frame from a wide format to a long format by converting feature names to a categorical feature.
# pivot_longer is from the package tidyr.
needs_pivoting <- data.frame(sticker_type=c('A', 'B', 'C', 'D'),
sparrow=c(12, 15, 19, 19),
eagle=c(22, 29, 18, 12))
print(needs_pivoting)
needs_pivoting %>% pivot_longer(cols=c('sparrow', 'eagle'), # the columns(feature names) to be pivoted
names_to='species', # the name of the column of features
values_to='price') # the name of the column of values
```
#### binning using cut()
This is also a good way to categorize feature
```{r, echo=TRUE}
hour_df <- data.frame(shop_name=c('MAC', 'Tangro', 'cummington', 'Burger King', 'judgement', 'KFC', 'ye', 'Dungeon', 'Razer', 'yeah sir', 'Koban wife', 'string'),
operating_hours=c(2, 5, 4, 7, 7, 8, 5, 4, 5, 11, 13, 8),
rebounds=c(7, 7, 4, 6, 3, 8, 9, 9, 12, 11, 8, 9))
hour_df
new_hour_df <- hour_df %>% mutate(operating_hours_bin = cut(operating_hours, breaks=c(0,5,9,13)))
new_hour_df
```
#### binning using case()
```{r, echo=TRUE}
hour_df2 <- data.frame(shop_name=c('MAC', 'Tangro', 'cummington', 'Burger King', 'judgement', 'KFC', 'ye', 'Dungeon', 'Razer', 'yeah sir', 'Koban wife', 'string'),
operating_hours=c(2, 5, 4, 7, 7, 8, 5, 4, 5, 11, 13, 8),
rebounds=c(7, 7, 4, 6, 3, 8, 9, 9, 12, 11, 8, 9))
hour_df
new_hour_df <- hour_df2 %>% mutate(operating_hours_bin = case_when( # logistics
operating_hours <= 3 ~ 'very short',
operating_hours <= 6 & operating_hours > 3 ~ 'short',
operating_hours <= 10 & operating_hours > 6 ~ 'median',
operating_hours > 10 ~ 'long',
TRUE ~ 'what else will this be?'
))
new_hour_df
```
### ordering
```{r, echo=TRUE}
student_result_wild=data.frame(name=c("Ram","Geeta","John","Paul",
"Cassie","Geeta","Paul"),
maths=c(7,8,8,9,10,8,9),
science=c(5,7,6,8,9,7,8),
history=c(7,7,7,7,7,7,7),
id = c(9,2,3,5,13,2,5))
student_result_wild # data with out ordering
# order data by certain variable (ascending)
student_result_wild[order(student_result_wild$id),]
# order data by certain variable (descending)
student_result_wild[order(-student_result_wild$id),]
# order data by multiple variable, the second orders the duplicates in first variable
student_result_wild[order(-student_result_wild$id, student_result_wild$science),]
```
## Duplicated Values
In many cases, we observe duplicated values in a data set where every instances are supposed to be unique. We will now discussed how to handle duplicates.
### Data set with row-wise duplicates
Although the dataset we use in this example, airquality data, does not have duplicate, we will use it illustrate the techniques of handling duplicate data.
```{r, echo = TRUE}
data_duplicated_values = airQuality_preProcessing
```
We see the data frame should have no duplicate rows to begin with.
```{r, echo = TRUE}
print(paste0('number of duplicated rows in the data is ',sum(duplicated(data_duplicated_values))))
```
Now we randomly pick 5 instances from the data frame and insert them into the data frame to induce duplication
```{r, , echo = TRUE}
for (i in 1: 5)
{
data_duplicated_values[nrow(data_duplicated_values)+1,] = data_duplicated_values[floor(runif(1, min = 1, max = nrow(data_duplicated_values))),]
}
```
We should now have 5 duplicated instances:
```{r, echo = TRUE}
print(paste0('number of duplicated rows in the data after insertion is ',sum(duplicated(data_duplicated_values))))
```
The "duplicated(df)" returns a boolean array where each value at each index indicates if row at the same index in original data frame is duplicated or not. We can use 'duplicated(df)" to extract duplicated rows:
```{r, echo = TRUE}
data_duplicated_values[duplicated(data_duplicated_values),]
```
By adding "!" before "duplicated(df)", we can negate logics in "duplicated(df)" and access non duplicate rows as a data frame:
```{r, echo = TRUE}
head(data_duplicated_values[!duplicated(data_duplicated_values),], 10)
```
There should be no duplicates
```{r, echo = TRUE}
print(paste0('duplicates in data frame formed by unique rows in original data frame is ',sum(duplicated(data_duplicated_values[!duplicated(data_duplicated_values),]))))
```
We can create a new reference to the data set with no duplicates. For the purpose of reusing "data_duplicated_values", I will just assign the new reference to itself:
```{r, echo = TRUE}
data_duplicated_values <- data_duplicated_values[!duplicated(data_duplicated_values),]
```
Now "data_duplicated_values" should have no duplicates:
```{r, echo = TRUE}
print(paste0('number of duplicated value after re setting reference is ', sum(duplicated(data_duplicated_values))))
```
Another way is to use "distinct()" function from tidyverse package.
```{r, echo = TRUE}
for (i in 1: 5)
{
data_duplicated_values[nrow(data_duplicated_values)+1,] = data_duplicated_values[floor(runif(1, min = 1, max = nrow(data_duplicated_values))),]
}
print(paste0('number of duplicate value in data frame before calling distinct is', sum(duplicated(data_duplicated_values))))
data_duplicated_values <- data_duplicated_values %>% distinct()
print(paste0('number of duplicate value in data frame after calling distinct is', sum(duplicated(data_duplicated_values))))
```
### Duplicates based on specific columns
Some times duplication of elements in certain column/columns is not desirable. We want to be able to remove rows with duplication in specified columns. We first insert 5 rows with duplicates in "Day" and "Month", we should see "158" rows after insertion:
```{r, echo = TRUE}
print(paste0('number of rows before adding 5 duplicated rows on Day and Month is', nrow(data_duplicated_values)))
for (i in 1: 5)
{
Ozone <- floor(runif(1,min = 0, max = 50))
Solor <- floor(runif(1,min = 0, max = 300))
Wind <- round(runif(1,min = 0, max = 20), 2)
Temp <- floor(runif(1,min = 0, max = 20))
random_index = floor(runif(1, min = 1, max = nrow(data_duplicated_values)))
Month <- data_duplicated_values[random_index, 'Month']
Day <- data_duplicated_values[random_index, 'Day']
data_duplicated_values[nrow(data_duplicated_values)+1,] = c(Ozone, Solor, Wind, Temp, Month, Day)
}
print(paste0('number of rows after adding 5 duplicated rows on Day and Month is', nrow(data_duplicated_values)))
```
We can remove rows with duplicated "Day" and "Month" combination using "distinct()". The ".keep_all = TRUE" argument makes sure that for each duplicated combination of Day and Month, we keep the first row for every duplicated combination and we keep all variables in the Data. We should observe the number of rows after dropping dupicated columns going back to "153", which is the size of data frame before we insert rows with duplicated "Day" and "Month" combinations.
```{r, echo = TRUE}
data_duplicated_values<- data_duplicated_values %>% distinct(Day, Month, .keep_all = TRUE)
print(paste0('number of rows after removing 5 duplicated rows on Dat and Month is', nrow(data_duplicated_values)))
```
## Missing Values
Encountering missing data in dataset is not uncommon. When collecting temperature data the sensor might be broken and unable to measure temperature. When conduct public opinion surveying, the interviewee might forget filling entires on questionaires. Many data science job requires completeness of data, such as training predicative model based on numerical/categorical data. We will now discuss doing data cleaning in R.
### NA in R
In R, a missing value is represented by symbol "NA".
```{r, echo = TRUE}
head(airquality, 10)
```
We can observe, at index 5, the instance has missing value in "Ozone" and "Solar.R" feasure, represented by symbol "NA".
### Dropping rows with missing data
A simple approach is to drop rows with missing data. If our task requires using all features in dataset, we can opt to drop all rows with at least one missing value. However, if our task required using some features, we only need to drop rows with missing values in feastures we specify. We will discuss both cases.
#### Drop all rows containing missing Data
Suppose we want to use all features in "airquality" dataset and we want to drop all rows with at least 1 missing values. We first check number of rows with missing data in the dataset. The 'is.na(df)' command returns a boolean array, where truth value at each index indicate if data element at the same index in data frame df is NA or not. By summing number of "TRUE" values in the 'is.na(df)', we know total number of missing values in the data set.
```{r, echo = TRUE}
print(paste0('total number of missing values is ', sum(is.na(airQuality_preProcessing))))
```
We see we have totally 44 missing with at least 1 missing values.
To know number of rows with missing value, we use "complete.cases(df)". The return an array of boolean where boolean value at each index indicate if data instance/case at the corresponding row index in data frame df is complete (having no NA the missing value) or not. By summing the number of false in the array, we get the total number of rows that has at least 1 missing value.
```{r, echo = TRUE}
print(paste0("number of rows with at least 1 missing value is ", sum(!complete.cases(airQuality_preProcessing))))
```
To visualize all rows with missing value, we can access them using the boolean array from "complete.cases()". Due to large number of rows with missing value, we only put 10 here:
```{r, echo = TRUE, eval=FALSE}
head(airQuality_preProcessing[!complete.cases(airQuality_preProcessing), ], 10)
```
We now remove rows with na. We use "drop_na()" function from tidyr library to remove all rows with missing values. There are multiple ways to drop rows with missing value, but we will not go exhaustive on this.
```{r, echo = TRUE}
airQuality_na_droped <- airQuality_preProcessing %>% drop_na()
print(paste0("number of missing values after we drop all NA is ",sum(is.na(airQuality_na_droped))))
print(paste0("number of rows in data frame after extracting rows with NA is ", nrow(airQuality_na_droped)))
```
After removing all 42 rows with missing values, we only have 111 rows left, with 42 rows removed from total 153 rows in original data frame.
#### Drop rows with missing data in specified columns
##### NA in single column
We can see dropping all columns with missing data could be wasteful if we don't use all features in the dataset. For example, if our job does not require feature "Ozone", we don't really care if "Ozone" value is missing or not as it is irrelevant to our task. We will explore how to drop rows with na in specific columns.
We see "is.na(df)" evaluate presence of NA(missing value) in the entire data set. If we change its argument from df, the entire data set, into a specific column, the function will only evaluate specific columns in the data set and return a boolean array that only reflects presence of missing value at the columns. For example, if we want to see number of rows with missing value in "Ozone" column:
```{r, echo = TRUE}
print(paste0('number of NA in Ozone column is ', sum(is.na(airQuality_preProcessing$Ozone))))
```
There are 37 rows with NA in "Ozone" column. To drop rows with missing value in "Ozone", we put column name "Ozone" as argument:
```{r, echo = TRUE}
airQuality_na_droped_Ozone <- airQuality_preProcessing %>% drop_na(Ozone)
print(nrow(airQuality_na_droped_Ozone))
```
37 rows are extracted from 153 columns, resulting in 116 rows.
##### NA in several columns (And)
If we want to drop rows with NA in specific column, for example, rows with NA in both "Ozone" and "Solar.R" columns, we can take advantage of the fact that "is.na(df)" is a boolean array, and we can perform element wise boolean operation in two arrays of same dimension:
```{r, echo = TRUE}
sum(is.na(airQuality_preProcessing$Ozone) & is.na(airQuality_preProcessing$Solar.R))
```
We see number of rows with missing value in both "Ozone" and "Solar.R" is 2. We now visualize two rows:
```{r, echo = TRUE}
airQuality_preProcessing[is.na(airQuality_preProcessing$Ozone) & is.na(airQuality_preProcessing$Solar.R), ]
```
Having the boolean array, we can negate the logics and obtain all rows where we don't have missing values in "Ozone" and "Solar.R" at the same time.
```{r, echo = TRUE}
airQuality_NA_Ozone_and_Solor <- airQuality_preProcessing[!(is.na(airQuality_preProcessing$Ozone) & is.na(airQuality_preProcessing$Solar.R)), ]
nrow(airQuality_NA_Ozone_and_Solor)
```
We have droped two rows from 153 rows in original dataset, so we have 151 rows left.
##### NA in several columns (Or)
What if we want to drop columns with NA in "Ozone" or "Solar.R", such that if any of the two columns contains missing values. We could have used boolean operation between several boolean arrays as we did, but it could be tedious as we have more feastures to consider. We could use "complete.cases()" but use specific columns in dataset as input. The "complete.cases()" returns false if a single column in the rows in data frame we pass in has a missing value, so we could perform "or" operation with it.
```{r, echo = TRUE}
sum(!(complete.cases(airQuality_preProcessing[,c("Ozone","Solar.R")])))
```
There are 42 rows with missing values in either "Ozone" or "Solar.R", or both. In fact, missing values only appear in those two columns, such that we get the total number of rows with missing values in the data set. We can obtain all datas that does not having missing value in "Ozone" or "Solar.R".
```{r, echo = TRUE}
airQuality_NA_Ozone_or_Solor<- airQuality_preProcessing[(complete.cases(airQuality_preProcessing[,c("Ozone","Solar.R")])), ]
nrow(airQuality_NA_Ozone_or_Solor)
```
We have 111 rows after removing 42 rows with missing value on either "Ozone" or "Solar.R"
We can also use drop_na() to with columns specified in function arguement:
```{r, echo = TRUE}
airQuality_NA_Ozone_or_Solor<- airQuality_preProcessing %>% drop_na(c("Ozone", "Solar.R"))
nrow(airQuality_NA_Ozone_or_Solor)
```
##### Drop columns with certain number of NA
This might be uncommon, but we might want to drop rows with more than certain number of NA, such as rows with more than 3 NA and even rows with number of NA equal to number of columns, which means the row has all column value being NA. To know number of NA in each row, we can use "rowSums()" functions:
```{r, echo = TRUE}
head(rowSums(is.na(airQuality_preProcessing)),10)
```
This means for row 1 to row 10, row 5 has 2 NA and row 6 and 10 has 1 NA. We can drop all rows with number of NA above certain threshold. We now insert 3 rows with 3 NA into the data frame.
```{r, echo = TRUE}
airQuality_3NA_Inserted <- airQuality_preProcessing
for (i in 1: 3)
{
Ozone <- floor(runif(1,min = 0, max = 50))
Solor <- floor(runif(1,min = 0, max = 300))
Wind <- round(runif(1,min = 0, max = 20), 2)
Temp <- floor(runif(1,min = 0, max = 20))
random_index = floor(runif(1, min = 1, max = nrow(airQuality_3NA_Inserted)))
Month <- airQuality_3NA_Inserted [random_index, 'Month']
Day <- airQuality_3NA_Inserted [random_index, 'Day']
toInsert <- c(Ozone, Solor, Wind, Temp, Month, Day)
for (i in 1:3) {
random_index = floor(runif(1, min = 1, max = 6))
if (is.na(toInsert[random_index])) {
random_index = floor(runif(1, min = 1, max = 6))
}
toInsert <- replace(toInsert, random_index, NA)
}
airQuality_3NA_Inserted [nrow(airQuality_3NA_Inserted )+1,] <- toInsert
}
tail(airQuality_3NA_Inserted, 3)
print(paste('number of rows in airQuality_3NA_Inserted is', nrow(airQuality_3NA_Inserted)))
```
Now last 3 rows have 3 NA at tail of airQuality_3NA_Inserted. We now check number of rows with more than or equal to 3 NA:
```{r, echo = TRUE}
print(paste0("number of rows with at least 3 NA in airQuality_3NA_Inserted is ", sum(rowSums(is.na(airQuality_3NA_Inserted)) >= 3)))
```
Now we have 3 rows with at least 3 NA. We now drop rows with at least 3 NA from airQuality_3NA_Inserted by selecting rows with NA less than 3 and reset reference of airQuality_3NA_Inserted:
```{r, echo = TRUE}
airQuality_3NA_Inserted <- airQuality_3NA_Inserted[!(rowSums(is.na(airQuality_3NA_Inserted)) >= 3), ]
print(paste0("number of rows in airQuality_3NA_Inserted after dropping is ", nrow(airQuality_3NA_Inserted)))
```
We have drop all 3 rows with at least 3 NA, so the size of airQuality_3NA_Inserted drops from 156 to 153.
### Impute
Dropping rows is a straight forward approach. However, dropping too many drows would induce huge data loss, which would be detrimental to our task. We can insert values we specify into entries with missing values. This will minimize data loss. Though we would never know if imputed values could reflect the true pattern of the data.
#### Impute numerical values
##### Impute with constant
For a column with missing value,we select a constant we think will reflect the real data and fill missing entry in the column with the constant we have selected.
Let's say we use constant 30 to fill "Ozone" entries with missing values.
```{r, echo = TRUE}
airQuality_const_filled <- airQuality_preProcessing
print(paste0("number of missing values in Ozone column is ", sum(is.na(airQuality_const_filled$Ozone))))
head(airQuality_const_filled , 5)
```
We now fill:
```{r, echo = TRUE}
airQuality_const_filled$Ozone <- na_replace(airQuality_const_filled$Ozone, fill = 30, maxgap = Inf)
print(paste0("number of missing values in Ozone column is ", sum(is.na(airQuality_const_filled$Ozone))))
head(airQuality_const_filled , 5)
```
We see NA in Ozone column of row 5 is imputed by constant 30.
* Pro
+ Easy to understand.
* Con
+ Value assigned by human intuition is too arbitrary and might be unrealistic.
##### Impute with Sample statistics
We can impute missing values in a column using sample statistics of the columns such as mean and median. This way we fill the missing value and does not alter data statistics in the column. We now fill "Ozone" column with current "Mean" of "Ozone" column and "Solar.R" with median of "Solar.R" column. Note the "na.rm = True" arguement is to neglect missing value when computing sample statistics so that we get a real value answer.
```{r, echo = TRUE}
airQuality_sstat_impute <- airQuality_preProcessing
print(paste0("Mean of Ozone before impute is ", mean(airQuality_sstat_impute$Ozone, na.rm = TRUE)))
print(paste0("Median of Solar.R before impute is ", median(airQuality_sstat_impute$Solar.R, na.rm = TRUE)))
print(paste0("number of missing values in Ozone column before impute is ", sum(is.na(airQuality_sstat_impute$Ozone))))
print(paste0("number of missing values in Solar.R column before impute is ", sum(is.na(airQuality_sstat_impute$Solar.R))))
airQuality_sstat_impute$Ozone <- na_mean(airQuality_sstat_impute$Ozone, option = "mean", maxgap = Inf)
airQuality_sstat_impute$Solar.R <- na_mean(airQuality_sstat_impute$Solar, option = "median", maxgap = Inf)
print(paste0("Mean of Ozone after impute is ", mean(airQuality_sstat_impute$Ozone)))
print(paste0("Median of Solar.R after impute is ", median(airQuality_sstat_impute$Solar.R)))
print(paste0("number of missing values in Ozone column after impute is ", sum(is.na(airQuality_sstat_impute$Ozone))))
print(paste0("number of missing values in Solar.R column after impute is ", sum(is.na(airQuality_sstat_impute$Solar.R))))
```
We see we no longer have missing values in both columns and mean in "Ozone" column and median in "Solar.R" are both unchanging from before the impute.
* Pro
+ Easy to understand.
+ Does not change sample statistics
* Con
+ Maybe the distribution of valid values in columns are skewed to left or right or have many outliers such that sample statistics do not reflect actual pattern of data
##### Impute by value adjacent to Missing value
Often in time-series data, value on a time point is highly assosicated with most adjacent valid values. Therefore, we want to fill a missing entries using valid data values before or after it.
```{r, echo = TRUE}
airQuality_Adjacent_impute <- airQuality_preProcessing
head(airQuality_Adjacent_impute, 10)
```
We first fill missing values in Ozone column with last observed valid value:
```{r}
airQuality_Adjacent_impute$Ozone <- na_locf(airQuality_Adjacent_impute$Ozone, option = "locf")
```
We then fill missing values in Solar.R column with next observed valid value
```{r}
airQuality_Adjacent_impute$Solar.R <- na_locf(airQuality_Adjacent_impute$Solar.R, option = "nocb")
head(airQuality_Adjacent_impute, 10)
```
If we have missing value at beginning or end of data frame, which is rare, we can remedy using "na_remaining" argument in "na_locf()" to handle remaining missing value after filling. More details can be found in imputeTS documentation.
* Pro
+ For time series data, using adjacent value that are close could capture time dependent pattern
* Con
+ If cloested adjacent value is too far away, filling adjacent value would not sustain time dependent pattern.
##### Impute using predicative model
If we believe certain features in the data frame depends on some other features, we can use other features to fit a predicative model to predict missing value in certain feature. We need to make sure the "other features" our prediction is based on are all valid entries, which we need to impute using some method. We now assume "Ozone" and "Solar.R" features depends on climate related features such as "Wind" and "Temperature", and we build linear regression models for "Ozone" and "Solar.R" based on "Wind" and "Temperature".
```{r, echo = TRUE}
airQuality_lm_impute <- airQuality_preProcessing
head(airQuality_lm_impute, 10)
print(paste0('number of missing values in Ozone column before impute is ', sum(is.na(airQuality_lm_impute$Ozone))))
print(paste0('number of missing values in Solar.R column before impute is ', sum(is.na(airQuality_lm_impute$Solar.R))))
```
```{r, echo = TRUE}
linear_model_Ozone <- lm(Ozone ~ Wind + Temp, data = airQuality_lm_impute)
linear_model_Solar.R <- lm(Solar.R ~ Wind + Temp, data = airQuality_lm_impute)
airQuality_lm_impute$Ozone[is.na(airQuality_lm_impute$Ozone)] <- predict(linear_model_Ozone,
newdata = airQuality_lm_impute[is.na(airQuality_lm_impute$Ozone),c('Wind', 'Temp')])
airQuality_lm_impute$Solar.R[is.na(airQuality_lm_impute$Solar.R)] <- predict(linear_model_Solar.R,
newdata = airQuality_lm_impute[is.na(airQuality_lm_impute$Solar.R),c('Wind', 'Temp')])
head(airQuality_lm_impute, 10)
```
Since the original data are integer, We now convert "Ozone" and "Solar.R" back to integer from double.
```{r, echo = TRUE}
airQuality_lm_impute$Ozone <- as.integer(airQuality_lm_impute$Ozone)
airQuality_lm_impute$Solar.R <- as.integer(airQuality_lm_impute$Solar.R)
head(airQuality_lm_impute, 10)
linear_model_Ozone
linear_model_Solar.R
```
We have now filled missing values using linear regresison model prediction from valid values from other columns.
```{r}
print(paste0('number of missing values in Ozone column after impute is ', sum(is.na(airQuality_lm_impute$Ozone))))
print(paste0('number of missing values in Solar.R column after impute is ', sum(is.na(airQuality_lm_impute$Solar.R))))
```
* Pro
+ If the feature predict indeed depend on other features of the data set, the missing value will be replace by values that reflect the true pattern.
* Con
+ Depending on what predicative model we use, the training and prediction could be computationally expensive.
+ If the feature we predict does not depend on other features in data set we select to construct predicative model, the filled missing value will be mis leading for our future analysis.
#### Impute categorical features
As for imputing categorical features,
Replacing NA in categorical features with string
```{r}
na_df <- data.frame(A = c(NA, 7, 8, 5, 3),
B = c(4, 10, NA, 7, 4),
fantasy = c("sad", "we", NA, 'adf', 'NA'),
C = c(1, 0, NA, 9, NA),
D = c("tangro", "ok", NA, 'yes', 'NA'))
i1 <- !sapply(na_df, is.numeric)
na_df[i1] <- lapply(na_df[i1], function(x)
replace(x, is.na(x), 'MISSING'))
na_df
```
Replacing NA in categorical features with mode **[1]**
```{r}
na_df <- data.frame(A = c(NA, 7, 8, 5, 3),
B = c(4, 10, NA, 7, 4),
fantasy = c("sad", "we", NA, 'adf', 'NA'),
C = c(1, 0, NA, 9, NA),
D = c("tangro", "ok", NA, 'yes', 'NA'))
i1 <- !sapply(na_df, is.numeric)
Mode <- function(x) {
ux <- sort(unique(x))
ux[which.max(tabulate(match(x, ux)))]
}
na_df[i1] <- lapply(na_df[i1], function(x)
replace(x, is.na(x), Mode(x[!is.na(x)])))
na_df
```
## Data scaling
Scaling is important for data analysis and exploration, Data without scaling can produce misleading result. Also scaling is beneficial for mathematics computation and machine learning.
### standardization
Standardization unify all the samples' mean and std to 0 and 1, so we could compare the distribution among the scaled samples.
```{r, echo=TRUE}
# we can use the scale from dplyr
library(dplyr)
library(openintro)
scale_df = duke_forest_copy[, 2:5]
# what happen if we did not scale.
boxplot(scale_df)
# after scaling
scale_df %>% mutate_all(~(scale(.) %>% as.vector)) %>% boxplot() # apply scale() to every column
```
### Min Max scaling
MINMAX scaling is good for svm, so don't hesistate to use it :)
```{r, echo=TRUE}
minmax <- function(x, na.rm = TRUE) {
return((x- min(x)) /(max(x)-min(x)))
}
scale_df = duke_forest_copy[, 2:5]
# what happen if we did not scale.
boxplot(scale_df)
scale_df %>% mutate_all(~(minmax(.) %>% as.vector)) %>% boxplot() # apply minmax() to every column
```
## data encoding
### ordinal encoding
```{r, echo=TRUE}
ordinal <- function(x, order= unique(x)) {
x <- as.numeric(factor(x, levels = order, exclude = NULL))
x
}
encode_df <- data.frame(A = c(1, 7, 8, 5, 3),
B = c(4, 10, NA, 7, 4),
fantasy = c("sad", "we", NA, 'adf', 'NA'),
C = c(1, 0, NA, 9, NA),
D = c("tangro", "ok", 'NA', 'yes', 'NA'))
print('origin:')
c(encode_df["D"])
print('encoded:')
ordinal(encode_df[["D"]])
```
### one-hot encoding **[3]**
```{r, echo=TRUE}
encode_df <- data.frame(A = c(1, 7, 8, 5, 3),
B = c(4, 10, NA, 7, 4),
fantasy = c("sad", "we", "seaweed", 'adf', 'NA'),
C = c(1, 0, NA, 9, NA),
D = c("tangro", "ok", 'NA', 'yes', 'NA'))
dummy <- dummyVars(" ~ .", data=encode_df)
newdata <- data.frame(predict(dummy, newdata = encode_df))
newdata
```
## Reference
[1] https://sparkbyexamples.com/r-programming/how-to-do-left-join-in-r/#:~:text=How%20to%20do%20left%20join%20on%20data%20frames%20in%20R,join%20data%20frames%20in%20R.
[2]https://stackoverflow.com/questions/36377813/impute-most-frequent-categorical-value-in-all-columns-in-data-frame
[3] https://datatricks.co.uk/one-hot-encoding-in-r-three-simple-methods