forked from jtr13/cc22tt
-
Notifications
You must be signed in to change notification settings - Fork 0
/
visualization_time_series.Rmd
257 lines (209 loc) · 7.87 KB
/
visualization_time_series.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
# Visualizing Time Series Data
Kate Lassiter
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(lubridate)
suppressMessages(library(ggfortify))
library(gridExtra)
library(grid)
library(timevis)
suppressMessages(library(forecast))
suppressMessages(library(plotly))
library(tidyr)
library(zoo)
```
#### Starting Point
Same exploratory questions as with any new data set:\
* Strongly correlated columns
* Variable means
* Sample variance, etc.\
Use familiar techniques:\
* Summary statistics
* Histograms
* Scatter plots, etc.
Be very careful of lookahead!\
* Incorporating information from the future into past smoothing, prediction, etc. when you shouldn't know it yet
* Can happen when time-shifting, smoothing, imputing data
* Can bias your model and make predictions worthless
#### Working with time series (ts) objects
Integration of ts() objects with ggplot2:\
* ggfortify package
* autoplot()
* All the same customizations as ggplot2
* Don't have to convert from ts to dataframe format
* gridExtra package
* Arrange the 4 ggplot plots as a 4-panel grid
* grid package
* Add title to the grid arrangement
```{r,warnings=FALSE,messages=FALSE}
dax=autoplot(EuStockMarkets[,"DAX"])+
ylab("Price")+
xlab("Time")
cac=autoplot(EuStockMarkets[,"CAC"])+
ylab("Price")+
xlab("Time")
smi=autoplot(EuStockMarkets[,"SMI"])+
ylab("Price")+
xlab("Time")
ftse=autoplot(EuStockMarkets[,"FTSE"])+
ylab("Price")+
xlab("Time")
grid.arrange(dax,cac,smi,ftse,top=textGrob("Stock market prices"))
```
#### Time series relevant plotting:\
Working with the data:
* Directly transforming ts() objects for use with ggplot2:\
* complete.cases() to easily remove NA rows - prevent ggplot warning
* avoid irritations of working with ts objects
Looking at changes over time:
* Plot differenced values
* Histogram/scatter plot of the lagged data
* Shows change in values, how values change together
* Trend can hide true relationship, make two series appear highly predictive of one another when they move together
* Use base package diff(), calculates difference between point at time _t_ and _t+1_
```{r,warnings=FALSE}
new=as.data.frame(EuStockMarkets)
new$SMI_diff=c(NA,diff(new$SMI))
new$DAX_diff=c(NA,diff(new$DAX))
p1 <- ggplot(new, aes(SMI,DAX))+
geom_point(shape = 21, colour = "black", fill = "white")
p2 <- ggplot(new[complete.cases(new),], aes(SMI_diff,DAX_diff))+
geom_point(shape = 21, colour = "black", fill = "white")
grid.arrange(p1,p2,top=textGrob("SMI vs DAX"))
```
Exploring Time Lags:
* Lagged differences:\
* Time series analysis: focused on predicting future values from past
* Concerned whether a change in one variable at time _t_ predicts change in another variable at time _t+1_
* lag() to shift forward by one
* Showing density using alpha
```{r}
new$SMI_lag_diff=c(NA,lag(diff(new$SMI),1))
ggplot(new[complete.cases(new),], aes(SMI_lag_diff,DAX_diff))+
geom_point(shape = 21, colour = "black", fill = "white",alpha=0.4,size=2)
```
Now there is no apparent relationship: positive change in SMI today won't predict positive change in DAX tomorrow. There is a positive trend over the long term, but this does little to predict in the short term
Observations:\
* Careful with time series data: use same techniques, but reshape data
* Change in values from one time to another is vital concept
### Dynamics of Time Series Data
#### Seasonality, Cycle, Trend
Three aspects of time series data:
* Seasonal:
* Recurs over a fixed period
* Cycle:
* Recurrent behaviors, variable time period
* Trend:
* Overall drift to higher/lower values over a long time period\
These can be gathered through visual inspection:\
* Line plot
* Clear trend
* Consider log transform or differencing
* Increasing variance
* Multiplicative seasonality
* Seasonal swings grow along with overall values
```{r}
autoplot(AirPassengers)+
xlab("Year")+
ylab("Passengers")
```
* Time series decomposition:
* Break data into seasonal, trend, and remainder components
* Seasonal component:
* LOESS smoothing of all January values, February values, etc.
* Moving window estimate of smoothed value based on point's neighbors
* stats package
* stl()\
```{r}
autoplot(stl(AirPassengers, s.window = 'periodic'), ts.colour = 'red')+
xlab("Year")
```
\
* Observations
* Clear rising trend
* Obvious seasonality
* Difference between the two methods:
* This particular decomposition shows additive, not multiplicative seasonality
* But start and end time series have highest residuals
* Settled on the average seasonal variance
* Both reveal information on patterns that need to be identified and potentially dealt with before forecasting can occur
#### Plotting: exploiting the temporal axis
##### Gannt charts
* Shows overlapping time periods, duration of event relative to others
* timevis package
```{r}
dates=sample(seq(as.Date('1998-01-01'), as.Date('2000-01-01'), by="day"), 16)
dates=dates[order(dates)]
projects = paste0("Project ",seq(1,8))
data <- data.frame(content = projects,
start = dates[1:8],
end = dates[9:16])
timevis(data)
```
##### Using month and year creatively in line plots
* forecast package
* ggseasonplot()
* ggmonthplot()
* suppressMessages() prevents printing information outputted when loading a package
```{r}
ggseasonplot(AirPassengers)
ggmonthplot(AirPassengers)
```
* Observations
* Some months increased more over time than others
* Passenger numbers peak in July or August
* Local peak in March most years
* Overall increase across months over the years
* Growth trend increasing (rate of increase increasing)\
\
##### 3-D Visualizations: plotly package\
* Convert to a format plotly will understand
* Avoid using ts() object
* Dataframe with datetime, numeric columns
* lubridate package for date manipulation
```{r}
new = data.frame(AirPassengers)
new$year=year(seq(as.Date("1949-01-01"),as.Date("1960-12-01"),by="month"))
new$month=lubridate::month(seq(as.Date("1949-01-01"),as.Date("1960-12-01"),by="month"),label=TRUE)
plot_ly(new, x = ~month, y = ~year, z = ~AirPassengers,
color = ~as.factor(month)) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Month'),
yaxis = list(title = 'Year'),
zaxis = list(title = 'Passenger Count')))
```
* Allows a better view of the relationships between month and year
#### Data Smoothing
* Usually need to smooth the data before starting analysis or visualization
* Allows better storytelling
* Irrelevant spikes dominate the narrative
* Methods:/
* Moving average/median
* Good for noisy data
* Rolling mean reduces variance
* Keep in mind: affects accuracy, R² statistics, etc.
* Zoo package rollmean()
* Prevent lookahead, use past values as the window (align="right")
* k = 6 is a 6 month rolling window
* gsub() substitute series names for the clearer legend
* tidyr package gather()
* Convert from wide to long, use this as color/group in ggplot
```{r}
new = data.frame(AirPassengers)
new$AirPassengers=as.numeric(new$AirPassengers)
new$year=seq(as.Date("1949-01-01"),as.Date("1960-12-01"),by="month")
new = new %>%
mutate(roll_mean = rollmean(new$AirPassengers,k=6,align="right",fill = NA))
df <- gather(new, key = year, value = Rate,
c("roll_mean", "AirPassengers"))
df$year=gsub("AirPassengers","series",df$year)
df$year=gsub("roll_mean","rolling mean",df$year)
df$date = rep(new$year,2)
ggplot(df, aes(x=date, y = Rate, group = year, colour = year)) +
geom_line()
```
* Exponentially weighted moving averages
* Weigh long past values less than recent
* Geometric mean
* Combats strong serial correlation
* Good for series with data that compounds greatly as time goes on