data_exploration.qmd

---
title-block-style: default
---

# Data Exploration {#sec-dataexplore}

This is a basic introduction to importing and tidying, plotting,  summarizing, and exporting data using the *tidyverse* family of functions. This is a very basic primer, for more indepth discussion see: ...

To get started, make sure you have the [tidyverse](https://www.tidyverse.org/) installed. The tidyverse is actually a set of 8 packages designed to work together using a common set of design philosophies. You can doo all the things described below, without tidyverse packages, but I think it is easier to review code that is written using the tidyverse approach.

Install the required packages:

```{r}
#| code-overflow: wrap
#| eval: false

options(repos = c(
  txwri = 'https://txwri.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'))
install.packages("tidyverse")
install.packages("twriTemplates")
```

Load the packages (you will get a message indicating some function conflicts, don't worry about that for now:

```{r}
#| code-overflow: wrap

library(tidyverse)
library(twriTemplates)
```

When you load tidyverse, you will see the 8 different packages. 

  - *ggplot2* is primarily used to plot data. 
  - *tibble* is a data structure that extends the native data.frame structure in base R.
  - *tidyr* provides tools to orient or transform data from wide to narrow formats and vice versa. It also allows nesting and unnesting of datasets.
  - *readr* is used to read (and write) data into R from CSV, TSV and other delimited text files.
  - *purrr* is a functional programming toolkit (don't worry about this for now).
  - *dplyr* provides many functions (creating new variables, grouping by variables, joining data, summarizing data, etc.) for working with tibbles and dataframes.
  - *stringr* has functions for easily manipulating character strings.
  - *forcats* functions for creating and reordering "factor" variables.

## Import Data

One of the first steps in any data analysis project is importing or reading 
data into R. 
For now, we will focus on reading in data from comma separated 
value (CSV), tab seperated value (TSV), and similar delimited text files. 
For most projects, it is useful to work off a singular data snapshot. 
So once you download your data and have it in your project, *do not* manipulate 
the data file^[Later on we will discuss loading web data sources directly into 
R, but for most of our work your best bet is to store raw data files inside 
the project].

:::{.callout-tip}
## Example Data
The [`twriTemplates`](https://txwri.github.io/twriTemplates/) package includes 
the example data used here. The `path_to_file()` function returns the file path
to example data csv and text files in the package. 
The output from `path_to_file()` can be used directly in the functions used to 
read or import csv and text files.
:::

Use the `read_csv()` function from the *readr* package to import the data and 
assign it to a variable^[Try downloading and extracting this [example data](https://github.com/TxWRI/r-manual/raw/main/data/tutorial.zip) 
to your R project *data* folder. 
Instead of using the `path_to_file()` function, type to path to the data folder 
and the filename for your first argument in `read_csv()`. 
It should look something like: `read_csv("data/easterwood.csv"`]. 
The console will print some information telling you what type of variable it 
made each column of the csv and if there are any problems.

```{r}
## this is a shortcut function in the twriTemplates package
## that returns the file path to example data used in this documentation
example_file_path <- path_to_file("easterwood.csv")

## return a dataframe from the read_csv function with the file path
## as the first argument
df <- read_csv(file = example_file_path)
```

`read_csv()` guesses the column types and sometimes it can guess wrong, so the 
safest option is to tell it what to expect using the `col_types` argument. 
This argument takes a string where each character represents one column. 
Here we will tell it to expect a `character, Date, character, number, number` 
which is represented as `cDcnn`.

```{r}
df <- read_csv(file = example_file_path,
               col_types = "cDcnn")
df
```

TSV and other delimited files are read in the same way but with `read_delim()` 
or `read_tsv()`. If your data is in `.xlsx` format, the *readxl* pacakge is 
required. 
In *readxl* there is a `read_xlsx()` function that works similar to the 
`read_csv()` function except you can specify the sheet and range of cells to 
read from. 
The `col_types` argument also needs to be spelled out as a character vector 
such as `col_types = c("text", "date", "text", "numeric", "numeric")`.

If you have multiple files with the same column names you can read them into 
the same data frame. 
Here we read in multiple files and are more explicit about defining the column 
types:

```{r}
#| message: false

## returns two different file paths
example_file_paths <- path_to_file(c("marabwq2021.csv", "marcewq2021.csv"))

## returns one dataframe from two files
df <- read_csv(file = example_file_paths, 
               col_types = cols_only(
                 StationCode = col_factor(),
                 DateTimeStamp = col_datetime(format = "%m/%e/%Y %H:%M"),
                 Temp = col_number(),
                 F_Temp = col_character(),
                 SpCond = col_number(),
                 F_SpCond = col_character(),
                 Sal = col_number(),
                 F_Sal = col_character(),
                 DO_mgl = col_number(),
                 F_DO_mgl = col_character()
               ))
df
```

::: {.column-margin}
This dataset is from the NOAA NERRS Centralized Data Management Office and 
provides measures of water quality parameters in the Mission and Aransas (MAR) 
National Estuarine Research Reserve (NERR)[@noaanationalestuarineresearchreservesystemnerrsSystemwideMonitoringProgram2022].
:::

In this data we have a row for each date/time observation with associated 
columns for Station, Temperature, Specific Conductance, Salinity, and Dissolved 
Oxygen. 
The columns with the F_ prefix are qa/qc flags, `<0>` is accepted data. 

## Plot and Clean Data {#sec-plotclean}

[*ggplot2*](https://ggplot2.tidyverse.org/) is the graphics package for 
producing layered plots. 
The underlying philosophy of *ggplot2* is to iteratively build your plots layer 
by layer which allows for some sophisticated plots. I won't go into full 
details of using *ggplot2* but lots of information is available in 
the [ggplot2 book](https://ggplot2-book.org/index.html).

Three key things you need to learn: data, aesthetics, geoms.

 - data: a data frame or tibble with the data you want to plot. 
 
 - aesthetics mappings (or aes): specify how variables will be visually depicted (x, y, color, shape, size, etc.).
 
 - geoms: are the layers that define how each layer is rendered (points, lines, bars, etc.).
 
Using the data we imported from above we can quickly create a basic scatter 
plot. 
Note the use of the `+` symbol to iteratively add layers to our plot. 
First we specify the data, then the geometry, and the aesthetic for that geom: 

```{r}
p1 <- ggplot(data = df, aes(x = DateTimeStamp, y = Temp)) +
  geom_point()
p1
```

The above figure demonstrates the importance of plotting your data, the 
-100 value is clearly an issue. 
One thing we can do is filter the data to try and track down the issue.
The dplyr package provides a number of functions to explore your data. 
Here, the data is filtered on the Temp column to include data less than -99:

```{r}
df |> 
  filter(Temp < -99)
```

The F_Temp qa flag is `< -3 >` indicating QA rejected the data point. 
Let's update the data to remove data that doesn't have the `< 0 >` data flag:

```{r}
df <- df |> 
  filter(F_Temp == "<0>")
df
```
This removes about 29,263 observations. 
Try plotting again:

```{r}
p1 <- ggplot(data = df, aes(x = DateTimeStamp, y = Temp)) +
  geom_point()
p1
```

We can use different geoms to explore the data:

```{r}
p2 <- ggplot(data = df, aes(x = Temp)) +
  geom_histogram(binwidth = 1)
p2
```

```{r}
ggplot(data = df, aes(y = Temp, x = StationCode)) +
  geom_boxplot()
```


To explore relationships between two variables, use geom_point and geom_smooth 
with each variable mapped to x and y. 
In this example, ggplot2 prints a message indicating 834 rows of data had 
missing or NA values that could not be plotted. 
`geom_smooth()` will plot the smooth line (using a loess or gam smooth) between 
two variables. 
If you want the linear regression drawn use the argument `method = "lm"`.

```{r}
p3 <- ggplot(data = df, aes(x = Temp, y = DO_mgl)) +
  geom_point(alpha = 0.2) +
  geom_smooth(se = FALSE)
p3
```

This is a good example to introduce the other important aesthetics in *ggplot2*. 
There is a clear negative relationship between Temperature and Dissolved Oxygen. 
Our data includes two sites, is there a difference between the two sites? 
We can map color to the site variable which will color each observation and 
each smooth a different color. 
Although shape and color can be used inside the `aes()` mapping function, you 
can assign them a value in the geom directly. Here we asign values for the 
shape and alpha properties in the point geom.

```{r}
p3 <- ggplot(data = df, aes(x = Temp, y = DO_mgl, color = StationCode)) +
  geom_point(shape = 21, alpha = 0.05) +
  geom_smooth(se = FALSE)
p3
```

There was no reason to expect a difference and there isn't. However the plot is 
muddy because there are so many overlying data points. 
We can also facet the graph by a variable. The following code also shows how 
you can just add another layer to your existing ggplot object:

```{r}
p3 <- p3 +
  facet_wrap(~StationCode)

p3
```

As you can see, ggplot2 allows you to rapidly iterate plots to explore data. 
When exploring the data, the formatting might not matter much, but if you want 
to export plots, we also need to take care of labels and general plot visual 
preferences.

```{r}
p3 <- p3 +
  labs(x = "Temperature [°F]", y = "Dissolved Oxygen [mg/L]") +
  scale_colour_brewer(name = "Stations",
                      palette = "Dark2",
                      labels = c("Aransas Bay", "Copano East"))
p3
```

### Themes

ggplot2 has a number of built-in themes. 
For most of our use cases (reports and papers), the grey background and white 
lines are not great choices. 
We recommend using the `theme_bw()` function at the bare minimum:

```{r}
p3 +
  theme_bw()
```


The twriTemplates package is available with a custom theme and some color 
palettes that are consistent with the Institute's branding:

```{r}
#| eval: false

install.packages("twriTemplates", 
                 repos = c(txwri = 'https://txwri.r-universe.dev',
                           CRAN = 'https://cloud.r-project.org'))

```


```{r}
#| message: false

library(twriTemplates)
p3 +
  theme_TWRI_print() +
  scale_color_discrete_twri(name = "Stations",
                            labels = c("Aransas Bay", "Copano East"))
```


The facet labels are still not very useful. There isn't an easy way to change 
this using *ggplot2*. 
Instead we can create a new variable using the `mutate()` function in *dplyr*. 
There is more discussion about *dplyr* in the next section.

```{r}
#| message: false


df <- df |> 
  mutate(StationName = case_when(
    StationCode == "marabwq" ~ "Aransas Bay",
    StationCode == "marcewq" ~ "Copano Bay East"
  ))

ggplot(data = df,
       aes(x = Temp, y = DO_mgl, color = StationName)) +
  geom_point(shape = 21, alpha = 0.05) +
  geom_smooth(se = FALSE) +
  facet_wrap(~StationName) +
  labs(x = "Temperature [°F]", y = "Dissolved Oxygen [mg/L]") +
  scale_color_discrete_twri(name = "Stations") +
  theme_TWRI_print()
  
```


## Manipulate and Summarize Data

[*dplyr*](https://dplyr.tidyverse.org/articles/dplyr.html) is a powerful package 
designed for working with dataframes and tibbles. 
Read the [*dplyr* introduction](https://dplyr.tidyverse.org/articles/dplyr.html) 
to get a better sense of all the available functions. 
Essentially *dplyr* provides "verb" functions that correspond to how you want 
to manipulate the dataframe.

Commonly used verbs include:

 - `filter()`: subset the rows based on value
 - `mutate()`: create a new column
 - `select()`: subset columns based on names or types
 - `group_by()`: group the data by variable values
 - `summarise()`: summarise the grouped or ungrouped data based on user provided functions
 
There are many other functions, but these cover many use cases for people just 
getting started. *dplyr* also provides functions for joining datasets based on 
common variables, changing the order of a dataframe, calculating cumulative 
statistics, calculating lag and leading values, and rolling or window rank 
functions among others. 
All of these can be very useful to become familiar with.

In this example I want to explore seasonal differences in the water quality 
parameters. 
`mutate()` is used to create new variables. 
```{r}
summary_plot <- df |> 
  # returns DateTimeStamp Value as the month only
  # see ?strptime() for more info
  mutate(Month = as_factor(format(DateTimeStamp, "%B")),
         StationName = case_when(
           StationCode == "marabwq" ~ "Aransas Bay",
           StationCode == "marcewq" ~ "Copano Bay East")) |> 
  ## months have an inherent order, right now it is alphabetical
  ## this adds the correct order so it plots nicely
  mutate(Month = lvls_revalue(Month, month.name)) |> 
  ggplot() +
  geom_boxplot(aes(x = Month, y = DO_mgl, fill = StationName)) +
  scale_fill_discrete_twri(name = "Stations") +
  theme_TWRI_print()
summary_plot
```
 

Now we have a summary figure, but what about the data? 
We can use the `summarize()` or `summarise()` function (depending on your 
English preference) to calculate some summary statistics for each group.

```{r}
summary_table <- df |> 
  mutate(Month = as_factor(format(DateTimeStamp, "%B")),
         StationName = case_when(
           StationCode == "marabwq" ~ "Aransas Bay",
           StationCode == "marcewq" ~ "Copano Bay East")) |> 
  mutate(Month = lvls_revalue(Month, month.name)) |> 
  ## create groups to calculate summary stats on
  group_by(StationName, Month) |> 
  ## calculate the summary stats
  ## returns a column for each stat with a row for each group
  summarize(Mean = mean(DO_mgl),
            Max = max(DO_mgl),
            Min = min(DO_mgl),
            ## no standard error function in base R
            ## can calculate manually here
            SE = sd(DO_mgl)/sqrt(length(DO_mgl)))

summary_table
```


## Export Data

### Figures

You have created a figure and a table in R. Now you probably need to put it in 
a report, website, or other self contained file for your boss to look at. 
For figures created with *ggplot2*, the `ggsave()` option is the easiest 
option. 
First I recommend installing, the *ragg* package because if it available, 
`ggsave()` will use it. 
*ragg* provides higher performance and higher quality images than the 
graphics devices available in base R. 

```{r}
#| eval: false

# install ragg
install.packages("ragg")
library(ragg)

## ggsave defaults to the last plot displayed
## it is safer to specify which plot you want saved
ggsave(filename = "figures/boxplot_mare.png",
       plot = summary_plot,
       width = 8,
       height = 6,
       dpi = 300,
       units = "in")
```

If you have a figure created from something other than a ggplot object, then we 
need to use the ragg graphics device to save it as a file (this also works with 
ggplot as demonstrated here). 
If you don't have *ragg* installed, substitute `png()` for `agg_png()`. 

```{r}
#| eval: false

# first function creates the device
agg_png(filename = "figures/boxplot_mare.png",
    width = 8,
    height = 6,
    res = 300,
    units = "in")
# copy the plot to the file
summary_plot
# turns off the device
dev.off()
```


### Tables

*readr* and [*writexl*](https://docs.ropensci.org/writexl/) packages provide 
the functions required to export data frames to text delimited or Microsoft 
Excel files.

```{r}
#| eval: false

## write a csv
write_csv(x = summary_table,
          file = "export_data/monthly_summary_mare.csv")

## or an Excel file
## uncomment the next line if writexl is not installed
# install.packages("writexl")

library(writexl)
# one sheet
write_xlsx(x = summary_table,
           path = "export_data/monthly_summary_mare.xlsx")

# or specify sheets
write_xlsx(x = list(summarysheet = summary_table,
                    rawdatasheet = df),
           path = "export_data/monthly_summary_mare.xlsx")
```

## Workflow

:::{.callout-note}

## Incomplete

Plan to add information on proper workflow.
One script to download, clean, and save data that analysis is conducted on.

One script to run the analysis and export data/tables/figures

Metadata, readme, etc.

:::