forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
66 lines (51 loc) · 2.73 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
# Reproducible Research: Peer Assessment 1
## Loading and preprocessing the data
```{r echo=TRUE}
data = read.csv("activity.csv", colClasses=c("numeric", "Date", "numeric"))
```
## What is mean total number of steps taken per day?
Here is a histogram of the total number of steps taken each day.
```{r echo=TRUE}
total_steps=aggregate(steps ~ date, data, sum)
hist(total_steps$steps)
```
The mean and median total number of steps taken per day is `r format(mean(total_steps$steps), nsmall=0)` and `r format(median(total_steps$steps), nsmall=0)`, respectively.
## What is the average daily activity pattern?
From the following chart, we can see that in average people are active between 6:00 and 21:00.
```{r echo=TRUE}
interval_steps = aggregate(steps ~ interval, data, mean, rm.na=TRUE)
with(interval_steps, plot(interval, steps, type="l", xlab="Time"))
```
People are most active at interval `r with(interval_steps, interval[which(steps==max(steps))])`.
## Imputing missing values
The total number of missing values in the dataset is `r sum(is.na(data))`.
Fill up the missing values with mean for that 5-minute interva.
```{r echo=TRUE}
fixed_na = merge(data, interval_steps, by.x="interval", by.y="interval", all.x=TRUE)
fixed_na$steps = sapply(as.numeric(row.names(fixed_na)), function(index) {
with(fixed_na, ifelse(is.na(steps.x[index]), steps.y[index], steps.x[index]) )
})
```
Create a new dataset that is equal to the original dataset but with the missing data filled in.
```{r echo=TRUE}
fixed_na = fixed_na[c("steps", "date", "interval")]
```
Here is a histogram of the total number of steps taken each day.
```{r echo=TRUE}
total_steps=aggregate(steps ~ date, data, sum)
hist(total_steps$steps)
```
The mean and median total number of steps taken per day is `r format(mean(total_steps$steps), nsmall=0)` and `r format(median(total_steps$steps), nsmall=0)`, respectively. Comparing to the original one, we can see the the impact of missing data is very minor.
## Are there differences in activity patterns between weekdays and weekends?
Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
```{r echo=TRUE}
fixed_na$flag = as.factor(ifelse(weekdays(fixed_na$date) %in% c("Saturday", "Sunday"), "weekend", "weekday"))
```
Through the chart below, we can see that people start active in weekend later than in weekday probably because they wake up later, but are more active after 9 AM.
```{r echo=TRUE}
library(ggplot2)
a = aggregate(steps ~ interval + flag, fixed_na, mean, rm.na=TRUE)
qplot(interval, steps, data=a, geom="line", colour=I("blue")) +
facet_wrap( ~ flag, nrow=2) +
theme(panel.background = element_rect(fill='white', colour='black'))
```