-
Notifications
You must be signed in to change notification settings - Fork 0
/
data_exploration.qmd
500 lines (393 loc) · 16.5 KB
/
data_exploration.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
---
title-block-style: default
---
# Data Exploration {#sec-dataexplore}
This is a basic introduction to importing and tidying, plotting, summarizing, and exporting data using the *tidyverse* family of functions. This is a very basic primer, for more indepth discussion see: ...
To get started, make sure you have the [tidyverse](https://www.tidyverse.org/) installed. The tidyverse is actually a set of 8 packages designed to work together using a common set of design philosophies. You can doo all the things described below, without tidyverse packages, but I think it is easier to review code that is written using the tidyverse approach.
Install the required packages:
```{r}
#| code-overflow: wrap
#| eval: false
options(repos = c(
txwri = 'https://txwri.r-universe.dev',
CRAN = 'https://cloud.r-project.org'))
install.packages("tidyverse")
install.packages("twriTemplates")
```
Load the packages (you will get a message indicating some function conflicts, don't worry about that for now:
```{r}
#| code-overflow: wrap
library(tidyverse)
library(twriTemplates)
```
When you load tidyverse, you will see the 8 different packages.
- *ggplot2* is primarily used to plot data.
- *tibble* is a data structure that extends the native data.frame structure in base R.
- *tidyr* provides tools to orient or transform data from wide to narrow formats and vice versa. It also allows nesting and unnesting of datasets.
- *readr* is used to read (and write) data into R from CSV, TSV and other delimited text files.
- *purrr* is a functional programming toolkit (don't worry about this for now).
- *dplyr* provides many functions (creating new variables, grouping by variables, joining data, summarizing data, etc.) for working with tibbles and dataframes.
- *stringr* has functions for easily manipulating character strings.
- *forcats* functions for creating and reordering "factor" variables.
## Import Data
One of the first steps in any data analysis project is importing or reading
data into R.
For now, we will focus on reading in data from comma separated
value (CSV), tab seperated value (TSV), and similar delimited text files.
For most projects, it is useful to work off a singular data snapshot.
So once you download your data and have it in your project, *do not* manipulate
the data file^[Later on we will discuss loading web data sources directly into
R, but for most of our work your best bet is to store raw data files inside
the project].
:::{.callout-tip}
## Example Data
The [`twriTemplates`](https://txwri.github.io/twriTemplates/) package includes
the example data used here. The `path_to_file()` function returns the file path
to example data csv and text files in the package.
The output from `path_to_file()` can be used directly in the functions used to
read or import csv and text files.
:::
Use the `read_csv()` function from the *readr* package to import the data and
assign it to a variable^[Try downloading and extracting this [example data](https://github.com/TxWRI/r-manual/raw/main/data/tutorial.zip)
to your R project *data* folder.
Instead of using the `path_to_file()` function, type to path to the data folder
and the filename for your first argument in `read_csv()`.
It should look something like: `read_csv("data/easterwood.csv"`].
The console will print some information telling you what type of variable it
made each column of the csv and if there are any problems.
```{r}
## this is a shortcut function in the twriTemplates package
## that returns the file path to example data used in this documentation
example_file_path <- path_to_file("easterwood.csv")
## return a dataframe from the read_csv function with the file path
## as the first argument
df <- read_csv(file = example_file_path)
```
`read_csv()` guesses the column types and sometimes it can guess wrong, so the
safest option is to tell it what to expect using the `col_types` argument.
This argument takes a string where each character represents one column.
Here we will tell it to expect a `character, Date, character, number, number`
which is represented as `cDcnn`.
```{r}
df <- read_csv(file = example_file_path,
col_types = "cDcnn")
df
```
TSV and other delimited files are read in the same way but with `read_delim()`
or `read_tsv()`. If your data is in `.xlsx` format, the *readxl* pacakge is
required.
In *readxl* there is a `read_xlsx()` function that works similar to the
`read_csv()` function except you can specify the sheet and range of cells to
read from.
The `col_types` argument also needs to be spelled out as a character vector
such as `col_types = c("text", "date", "text", "numeric", "numeric")`.
If you have multiple files with the same column names you can read them into
the same data frame.
Here we read in multiple files and are more explicit about defining the column
types:
```{r}
#| message: false
## returns two different file paths
example_file_paths <- path_to_file(c("marabwq2021.csv", "marcewq2021.csv"))
## returns one dataframe from two files
df <- read_csv(file = example_file_paths,
col_types = cols_only(
StationCode = col_factor(),
DateTimeStamp = col_datetime(format = "%m/%e/%Y %H:%M"),
Temp = col_number(),
F_Temp = col_character(),
SpCond = col_number(),
F_SpCond = col_character(),
Sal = col_number(),
F_Sal = col_character(),
DO_mgl = col_number(),
F_DO_mgl = col_character()
))
df
```
::: {.column-margin}
This dataset is from the NOAA NERRS Centralized Data Management Office and
provides measures of water quality parameters in the Mission and Aransas (MAR)
National Estuarine Research Reserve (NERR)[@noaanationalestuarineresearchreservesystemnerrsSystemwideMonitoringProgram2022].
:::
In this data we have a row for each date/time observation with associated
columns for Station, Temperature, Specific Conductance, Salinity, and Dissolved
Oxygen.
The columns with the F_ prefix are qa/qc flags, `<0>` is accepted data.
## Plot and Clean Data {#sec-plotclean}
[*ggplot2*](https://ggplot2.tidyverse.org/) is the graphics package for
producing layered plots.
The underlying philosophy of *ggplot2* is to iteratively build your plots layer
by layer which allows for some sophisticated plots. I won't go into full
details of using *ggplot2* but lots of information is available in
the [ggplot2 book](https://ggplot2-book.org/index.html).
Three key things you need to learn: data, aesthetics, geoms.
- data: a data frame or tibble with the data you want to plot.
- aesthetics mappings (or aes): specify how variables will be visually depicted (x, y, color, shape, size, etc.).
- geoms: are the layers that define how each layer is rendered (points, lines, bars, etc.).
Using the data we imported from above we can quickly create a basic scatter
plot.
Note the use of the `+` symbol to iteratively add layers to our plot.
First we specify the data, then the geometry, and the aesthetic for that geom:
```{r}
p1 <- ggplot(data = df, aes(x = DateTimeStamp, y = Temp)) +
geom_point()
p1
```
The above figure demonstrates the importance of plotting your data, the
-100 value is clearly an issue.
One thing we can do is filter the data to try and track down the issue.
The dplyr package provides a number of functions to explore your data.
Here, the data is filtered on the Temp column to include data less than -99:
```{r}
df |>
filter(Temp < -99)
```
The F_Temp qa flag is `< -3 >` indicating QA rejected the data point.
Let's update the data to remove data that doesn't have the `< 0 >` data flag:
```{r}
df <- df |>
filter(F_Temp == "<0>")
df
```
This removes about 29,263 observations.
Try plotting again:
```{r}
p1 <- ggplot(data = df, aes(x = DateTimeStamp, y = Temp)) +
geom_point()
p1
```
We can use different geoms to explore the data:
```{r}
p2 <- ggplot(data = df, aes(x = Temp)) +
geom_histogram(binwidth = 1)
p2
```
```{r}
ggplot(data = df, aes(y = Temp, x = StationCode)) +
geom_boxplot()
```
To explore relationships between two variables, use geom_point and geom_smooth
with each variable mapped to x and y.
In this example, ggplot2 prints a message indicating 834 rows of data had
missing or NA values that could not be plotted.
`geom_smooth()` will plot the smooth line (using a loess or gam smooth) between
two variables.
If you want the linear regression drawn use the argument `method = "lm"`.
```{r}
p3 <- ggplot(data = df, aes(x = Temp, y = DO_mgl)) +
geom_point(alpha = 0.2) +
geom_smooth(se = FALSE)
p3
```
This is a good example to introduce the other important aesthetics in *ggplot2*.
There is a clear negative relationship between Temperature and Dissolved Oxygen.
Our data includes two sites, is there a difference between the two sites?
We can map color to the site variable which will color each observation and
each smooth a different color.
Although shape and color can be used inside the `aes()` mapping function, you
can assign them a value in the geom directly. Here we asign values for the
shape and alpha properties in the point geom.
```{r}
p3 <- ggplot(data = df, aes(x = Temp, y = DO_mgl, color = StationCode)) +
geom_point(shape = 21, alpha = 0.05) +
geom_smooth(se = FALSE)
p3
```
There was no reason to expect a difference and there isn't. However the plot is
muddy because there are so many overlying data points.
We can also facet the graph by a variable. The following code also shows how
you can just add another layer to your existing ggplot object:
```{r}
p3 <- p3 +
facet_wrap(~StationCode)
p3
```
As you can see, ggplot2 allows you to rapidly iterate plots to explore data.
When exploring the data, the formatting might not matter much, but if you want
to export plots, we also need to take care of labels and general plot visual
preferences.
```{r}
p3 <- p3 +
labs(x = "Temperature [°F]", y = "Dissolved Oxygen [mg/L]") +
scale_colour_brewer(name = "Stations",
palette = "Dark2",
labels = c("Aransas Bay", "Copano East"))
p3
```
### Themes
ggplot2 has a number of built-in themes.
For most of our use cases (reports and papers), the grey background and white
lines are not great choices.
We recommend using the `theme_bw()` function at the bare minimum:
```{r}
p3 +
theme_bw()
```
The twriTemplates package is available with a custom theme and some color
palettes that are consistent with the Institute's branding:
```{r}
#| eval: false
install.packages("twriTemplates",
repos = c(txwri = 'https://txwri.r-universe.dev',
CRAN = 'https://cloud.r-project.org'))
```
```{r}
#| message: false
library(twriTemplates)
p3 +
theme_TWRI_print() +
scale_color_discrete_twri(name = "Stations",
labels = c("Aransas Bay", "Copano East"))
```
The facet labels are still not very useful. There isn't an easy way to change
this using *ggplot2*.
Instead we can create a new variable using the `mutate()` function in *dplyr*.
There is more discussion about *dplyr* in the next section.
```{r}
#| message: false
df <- df |>
mutate(StationName = case_when(
StationCode == "marabwq" ~ "Aransas Bay",
StationCode == "marcewq" ~ "Copano Bay East"
))
ggplot(data = df,
aes(x = Temp, y = DO_mgl, color = StationName)) +
geom_point(shape = 21, alpha = 0.05) +
geom_smooth(se = FALSE) +
facet_wrap(~StationName) +
labs(x = "Temperature [°F]", y = "Dissolved Oxygen [mg/L]") +
scale_color_discrete_twri(name = "Stations") +
theme_TWRI_print()
```
## Manipulate and Summarize Data
[*dplyr*](https://dplyr.tidyverse.org/articles/dplyr.html) is a powerful package
designed for working with dataframes and tibbles.
Read the [*dplyr* introduction](https://dplyr.tidyverse.org/articles/dplyr.html)
to get a better sense of all the available functions.
Essentially *dplyr* provides "verb" functions that correspond to how you want
to manipulate the dataframe.
Commonly used verbs include:
- `filter()`: subset the rows based on value
- `mutate()`: create a new column
- `select()`: subset columns based on names or types
- `group_by()`: group the data by variable values
- `summarise()`: summarise the grouped or ungrouped data based on user provided functions
There are many other functions, but these cover many use cases for people just
getting started. *dplyr* also provides functions for joining datasets based on
common variables, changing the order of a dataframe, calculating cumulative
statistics, calculating lag and leading values, and rolling or window rank
functions among others.
All of these can be very useful to become familiar with.
In this example I want to explore seasonal differences in the water quality
parameters.
`mutate()` is used to create new variables.
```{r}
summary_plot <- df |>
# returns DateTimeStamp Value as the month only
# see ?strptime() for more info
mutate(Month = as_factor(format(DateTimeStamp, "%B")),
StationName = case_when(
StationCode == "marabwq" ~ "Aransas Bay",
StationCode == "marcewq" ~ "Copano Bay East")) |>
## months have an inherent order, right now it is alphabetical
## this adds the correct order so it plots nicely
mutate(Month = lvls_revalue(Month, month.name)) |>
ggplot() +
geom_boxplot(aes(x = Month, y = DO_mgl, fill = StationName)) +
scale_fill_discrete_twri(name = "Stations") +
theme_TWRI_print()
summary_plot
```
Now we have a summary figure, but what about the data?
We can use the `summarize()` or `summarise()` function (depending on your
English preference) to calculate some summary statistics for each group.
```{r}
summary_table <- df |>
mutate(Month = as_factor(format(DateTimeStamp, "%B")),
StationName = case_when(
StationCode == "marabwq" ~ "Aransas Bay",
StationCode == "marcewq" ~ "Copano Bay East")) |>
mutate(Month = lvls_revalue(Month, month.name)) |>
## create groups to calculate summary stats on
group_by(StationName, Month) |>
## calculate the summary stats
## returns a column for each stat with a row for each group
summarize(Mean = mean(DO_mgl),
Max = max(DO_mgl),
Min = min(DO_mgl),
## no standard error function in base R
## can calculate manually here
SE = sd(DO_mgl)/sqrt(length(DO_mgl)))
summary_table
```
## Export Data
### Figures
You have created a figure and a table in R. Now you probably need to put it in
a report, website, or other self contained file for your boss to look at.
For figures created with *ggplot2*, the `ggsave()` option is the easiest
option.
First I recommend installing, the *ragg* package because if it available,
`ggsave()` will use it.
*ragg* provides higher performance and higher quality images than the
graphics devices available in base R.
```{r}
#| eval: false
# install ragg
install.packages("ragg")
library(ragg)
## ggsave defaults to the last plot displayed
## it is safer to specify which plot you want saved
ggsave(filename = "figures/boxplot_mare.png",
plot = summary_plot,
width = 8,
height = 6,
dpi = 300,
units = "in")
```
If you have a figure created from something other than a ggplot object, then we
need to use the ragg graphics device to save it as a file (this also works with
ggplot as demonstrated here).
If you don't have *ragg* installed, substitute `png()` for `agg_png()`.
```{r}
#| eval: false
# first function creates the device
agg_png(filename = "figures/boxplot_mare.png",
width = 8,
height = 6,
res = 300,
units = "in")
# copy the plot to the file
summary_plot
# turns off the device
dev.off()
```
### Tables
*readr* and [*writexl*](https://docs.ropensci.org/writexl/) packages provide
the functions required to export data frames to text delimited or Microsoft
Excel files.
```{r}
#| eval: false
## write a csv
write_csv(x = summary_table,
file = "export_data/monthly_summary_mare.csv")
## or an Excel file
## uncomment the next line if writexl is not installed
# install.packages("writexl")
library(writexl)
# one sheet
write_xlsx(x = summary_table,
path = "export_data/monthly_summary_mare.xlsx")
# or specify sheets
write_xlsx(x = list(summarysheet = summary_table,
rawdatasheet = df),
path = "export_data/monthly_summary_mare.xlsx")
```
## Workflow
:::{.callout-note}
## Incomplete
Plan to add information on proper workflow.
One script to download, clean, and save data that analysis is conducted on.
One script to run the analysis and export data/tables/figures
Metadata, readme, etc.
:::