---
title: "Formatting dates"
output:
  html_document:
    toc: TRUE
---
Working with time-series data can be a challenge for new and experienced R users alike. You will often have to format the date, time, and time zone when working with raw data. R does not automatically recognize date-times stored as text, and there are many ways to write them (e.g., yyyy-mm-dd, mm-dd-yy, mm/dd/yyyy hh:mm:ss).
`lubridate` is a handy package that is installed as part of the `tidyverse` installation. In versions of the `tidyverse` before 2.0.0 it is not attached when you call `library(tidyverse)`, so load it explicitly with `library(lubridate)` when you need it.
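As a minimal sketch of how `lubridate` handles the formats listed above, each parsing function is named for the order of the components in the text. The date strings here are made up for illustration and are not taken from the Biketown data.

```{r parse_formats}
library(lubridate)

# Each parser is named for the order of the components in the text
d1 <- ymd("2018-06-15")                 # yyyy-mm-dd
d2 <- mdy("06-15-18")                   # mm-dd-yy
d3 <- mdy_hms("06/15/2018 14:30:00",    # mm/dd/yyyy hh:mm:ss
              tz = "America/Los_Angeles")

class(d1)   # Date
class(d3)   # POSIXct POSIXt
d1 == d2    # both strings parse to the same calendar date
```

Date-only parsers return `Date` objects, while parsers that include a time return `POSIXct` date-times, which is why the time zone only matters for the last call.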
## Lesson
Use one of the Biketown data files you downloaded the other day in the writing-functions exercise. Or `source` the script from that exercise and use its function to get Biketown data for 06/2018 through 08/2018.
```{r setup}
library(lubridate)
library(dplyr)
library(ggplot2)
source("/home/tammy/Documents/ds19-class/code/biketown-example.R")
get_data(start = "06/2018", end = "08/2018")
# Function to read in all files and combine them into one data frame.
# This only works if you explicitly set the working directory to
# where the data are stored.
# After running the function, make sure to set the working directory
# back to the folder where your .Rproj file is stored.
setwd("/home/tammy/Documents/ds19-class/data/biketown")
folder <- "/home/tammy/Documents/ds19-class/data/biketown"
# `pattern` is a regular expression, not a glob: "\\.csv$" matches names ending in .csv
filenames <- list.files(path = folder, pattern = "\\.csv$", all.files = FALSE,
                        full.names = FALSE, recursive = FALSE, ignore.case = FALSE)
# Read one CSV file and record which file each row came from
read_csv_filename <- function(filename) {
  ret <- read.csv(filename, stringsAsFactors = FALSE,
                  strip.white = TRUE, na.strings = "")
  ret$Source <- filename
  ret
}
bike_raw <- plyr::ldply(filenames, read_csv_filename)
setwd("/home/tammy/TREC/datascience2019/")
```
```{r data_wrangling}
# check data structure
str(bike_raw)
# create new columns `start.datetime` and `end.datetime`
bike_df1 <- bike_raw %>%
  mutate(start.datetime = paste(StartDate, StartTime, sep = " "),
         end.datetime = paste(EndDate, EndTime, sep = " "))
# convert `start.datetime` and `end.datetime` into date time format with appropriate timezone
bike_df1$start.datetime <- mdy_hm(bike_df1$start.datetime, tz = "America/Los_Angeles")
bike_df1$end.datetime <- mdy_hm(bike_df1$end.datetime, tz = "America/Los_Angeles")
# convert `Duration` into a usable format (`hms()` returns a lubridate period)
bike_df1$Duration <- hms(bike_df1$Duration)
# this throws a warning about NAs
# check for NAs in `bike_raw$Duration`
sum(is.na(bike_raw$Duration))
```
`lubridate` has three kinds of time span that seem synonymous but behave very differently:
1. duration: an exact span of time measured in seconds, with no start date involved
1. interval: the span between two specific time points, so it keeps its start and end
1. period: a span in clock and calendar units (e.g., hours, days, months) whose exact length depends on when it starts, which is handy when accounting for daylight saving time and leap years (`hms()` in the chunk above returns a period)
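The difference between the three is easiest to see across a daylight saving change. The sketch below uses a made-up start time (not from the Biketown data) on the day before US daylight saving time began in 2018.

```{r span_types}
library(lubridate)

# US daylight saving time started on 2018-03-11 in this time zone,
# so that calendar day was only 23 hours long
start <- ymd_hms("2018-03-10 12:00:00", tz = "America/Los_Angeles")

start + ddays(1)   # duration: exactly 86400 seconds later (13:00 on the 11th)
start + days(1)    # period: same clock time on the next day (12:00 on the 11th)

# interval: anchored to two instants, so it can be converted either way
span <- interval(start, start + days(1))
as.duration(span)  # 82800 seconds, i.e. 23 hours
as.period(span)    # 1 day
```

Because an interval knows its start and end, it can be converted to either an exact duration or a calendar-aware period.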
```{r datetime}
# calculate interval
bike_df1$interval <- interval(bike_df1$start.datetime, bike_df1$end.datetime)
head(bike_df1$interval)
# calculate duration
bike_df1$duration.all <- as.duration(bike_df1$interval)
head(bike_df1$duration.all)
# calculate period
bike_df1$period <- as.period(bike_df1$duration.all)
head(bike_df1$period)
```
```{r aggregate_datetime}
# using floor_date to help aggregate data
# want weekly mean distance traveled
bike_wkagg <- bike_df1 %>%
  mutate(week.datetime = floor_date(start.datetime, unit = "week")) %>%
  group_by(week.datetime) %>%
  summarise(weekly.meandist = mean(Distance_Miles))
str(bike_wkagg)
bike_wkagg$week.datetime <- as.Date(bike_wkagg$week.datetime)
weekly_meandist_fig <- bike_wkagg %>%
  ggplot(aes(x = week.datetime, y = weekly.meandist)) +
  geom_bar(stat = "identity", fill = "orange") +
  scale_x_date(date_breaks = "1 week") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
weekly_meandist_fig
```
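To see what `floor_date` does before applying it to a whole data frame, here is a small sketch on three made-up timestamps. It assumes the default `lubridate` setting in which weeks start on Sunday.

```{r floor_date_example}
library(lubridate)

# Three made-up timestamps: a Wednesday, a Saturday, and the following Sunday
x <- ymd_hms(c("2018-06-13 08:15:00",
               "2018-06-16 17:40:00",
               "2018-06-17 09:05:00"),
             tz = "America/Los_Angeles")

# By default weeks start on Sunday, so the first two timestamps are
# rounded down to 2018-06-10 and the third to 2018-06-17
floor_date(x, unit = "week")
```

Every timestamp in the same week gets the same rounded value, which is what makes the result usable as a grouping variable.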
There are a few base R regular expression (regex) functions that are handy to know when working with strings:
1. `grep` and `grepl`: `grep` looks for a match in each element of a character vector and returns the indices of the elements that match. `grepl` does the same search but returns a logical vector with one `TRUE` or `FALSE` per element.
1. `sub` and `gsub`: `sub` replaces only the *first* match of a pattern in each element with a specified replacement. `gsub` replaces *all* matches.
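The four functions can be compared side by side on a small vector. The hub names and date string below are made up for illustration.

```{r regex_basics}
# Made-up hub names for illustration
hubs <- c("SE Ankeny at 6th", "Community Bike Rack", "NW 23rd at Irving")

grep("Community", hubs)    # index of the matching element: 2
grepl("Community", hubs)   # FALSE TRUE FALSE

# Made-up date string to show first-match versus all-match replacement
date_txt <- "06/15/2018"
sub("/", "-", date_txt)    # "06-15/2018"
gsub("/", "-", date_txt)   # "06-15-2018"
```

`grepl` is usually the better fit inside `mutate` or `if_else`, because it returns one value per row.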
```{r regular expressions}
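# Create three different station categories for the start and end stations.
# An empty pattern ("") matches every non-missing string, so test for missing
# hub names with is.na() instead. This assumes blank hub names (read in as NA
# via na.strings = "") mark trips that started or ended outside a station.
bike_df2 <- bike_df1 %>%
  mutate(start.station.category = if_else(grepl("Community", StartHub), "Community Station",
                                  if_else(is.na(StartHub), "Outside Station",
                                          "BIKETOWN Station"))) %>%
  mutate(end.station.category = if_else(grepl("Community", EndHub), "Community Station",
                                if_else(is.na(EndHub), "Outside Station",
                                        "BIKETOWN Station")))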
table(bike_df2$start.station.category)
table(bike_df2$end.station.category)
```
## Resources
Steve Fick's [Regular Expressions](https://d-rug.github.io/blog/2015/regex.fick)