# Exploring Data

<br>
<br>

<div class="tip">

## Key Concepts

In this chapter, we'll explore the following key concepts and functions:

* Exploratory Data Analysis (EDA) & Initial Data Analysis (IDA)
    - `str()`
    - `nrow()`
    - `ncol()`
    - `dim()`
    - `length()`
    - `rownames()`
    - `colnames()`
    - `names()`
    - `class()`
    - `levels()`
    - `head()`
    - `tail()`
    - `summary()`
    - `View()`
    - `?` & `help()`
    - `Desc()`
    - `glimpse()`
* Contingency Tables for Frequency & Proportionality
    - `table()`
    - `prop.table()`
    - `ftable()`
* Exploratory Data Visualization (EDV)
    - `hist()`
    - `boxplot()`
    - `plot()`
    - `pairs()`
    - `par()`
    - `ggpairs()` (**GGally**)
* Functions for Descriptive Stats: 
    - `mean()`
    - `median()`
    - `min()`
    - `max()`
    - `var()`
    - `quantile()`
* Tabulating Summary Output
* Grouping & Summary Operations with **dplyr**
    - `%>%`
    - `filter()`
    - `select()`
    - `group_by()`
    - `summarize()`
    - `ungroup()`

## New Packages

This chapter uses the following packages (in order of appearance):

* [DescTools](https://cran.r-project.org/web/packages/DescTools/index.html)
* [GGally](https://cran.r-project.org/web/packages/GGally/index.html)
* [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html)

## Key Takeaways

Too long; didn't read? Here's what you need to know:

* Initial and Exploratory Data Analysis (IDA; EDA) are key to exploring new data
    - Key functions: `str()` for structure, `summary()` for descriptive stats
* Exploratory Data Visualization (EDV) is critical for exploratory analysis
    - Key functions: `plot()`, `hist()`, `boxplot()`, `pairs()`
* Descriptive statistics functions describe quantitative data
    - Key functions: `mean()`, `median()`, `min()`, `max()`
* Package **dplyr** makes it easy to summarize grouped variables
    - Key functions: `%>%` pipes data, `group_by()`, `summarize()`
    - Example: `data %>% group_by(variable) %>% summarize(count = n())`
    - Don't forget to use `ungroup()`

<br>
<br>
<br>

</div>

```{r echo=F}

# ATTENTION : GLOBAL CHUNK DEFAULTS

knitr::opts_chunk$set(message = FALSE, 
                      warning = FALSE)

```

```{r include=F}

tutorial::go_interactive(greedy = FALSE)

```

<br>
<br>

## Exploratory Data Analysis

**Exploratory Data Analysis** (**EDA**) is the implementation of exploratory techniques to better understand *new data*. Typically, **EDA** uses visualizing and summarizing functions to detect patterns and anomalies in data beyond initial hypotheses and research questions.

<br>

**Practice Data:** To demonstrate, we'll use [state park annual attendance](https://data.ny.gov/Recreation/State-Park-Annual-Attendance-Figures-by-Facility-B/8f3n-xj78) from the State of New York's *Office for Parks, Recreation, and Historic Preservation* (OPRHP). 

```{r cache=T}

library(readr)

url <- paste0("https://data.ny.gov/api/views/8f3n",
              "-xj78/rows.csv?accessType=DOWNLOAD")   # Assign URL: "url"

parks <- read_csv(url)                                # Read in data: "parks"

```

<br>
<br>

### Common Initial Analysis Techniques

**Base R** has a litany of functions commonly used in **Initial Data Analysis**, or **IDA**.

* **IDA** is the opening salvo of functions in **Exploratory Data Analysis**.
* **IDA** techniques aid in understanding the nuances of your data. 


<br>

**Data Structure:** Function `str()` is a go-to function for understanding:

* The *class* of the dataset, e.g. **matrix** or **data.frame**
* The *dimensions* of a dataset (rows and columns)
* The *class* of each variable in the dataset
* The first several values of each variable
* The **levels** in each **factor** variable

```{r}

str(parks)

```

<br>

**Dimensions:** Just as we can measure width and height, we can measure the rows and columns of a dataset:

* Function `nrow()` prints the total number of rows
* Function `ncol()` prints the total number of columns
* Function `dim()` prints the total number of rows and columns

Recall that in R, dimensions are printed or specified with rows first, then columns.

```{r}

nrow(parks)     # Print total rows

ncol(parks)     # Print total columns

dim(parks)      # Print rows and columns

```

<br>

**Length:** Function `length()` prints the number of values for a single variable or vector.

```{r}

length(parks$Facility)

```

<br>

**Row & Column Names:** Three functions are ideal for printing row and column names:

* Function `rownames()` prints the names of each row, though rows are rarely named
* Function `colnames()` prints the names of each column (i.e. variable)
* Function `names()` also prints the names of each variable
* Rename variables by assigning new names to their output

```{r}

rownames(parks)[1:5]                              # Print row names 1-5

colnames(parks)                                   # Print variable names

names(parks)                                      # Print variable names

names(parks) <- c("Year", "Region", "County", 
                  "Facility", "Attendance")       # Reassign new names

names(parks)                                      # Print new names

```

<br>

**Classes:** We can determine the class of any object using function `class()`.

* Determine classes of entire datasets
* Determine classes of individual variables
* Determine classes of other objects, e.g. models

```{r}

class(parks)                                      # Dataset class

class(parks$Year)                                 # Variable class

model <- lm(Attendance ~ Year + Region, 
            data = parks)                         # Assign linear model

class(model)                                      # Model class

```

<br>

**Categorical Levels:** Print each category ("level") of **factor** variables with `levels()`:

```{r}

fctr <- as.factor(parks$Region)                   # Coerce to "factor"

levels(fctr)                                      # Print levels

```

<br>

**First & Last Observations**: Functions `head()` and `tail()` print first and last rows:

* Function `head()` prints the first rows of your data
* Function `tail()` prints the last rows of your data
* Specify the number of rows with argument `n =`
* By default, six rows are printed

```{r}

head(parks, n = 3)      # Print first 3 rows

tail(parks, n = 3)      # Print last 3 rows

```

<br>

**Summaries:** Function `summary()` describes individual variables according to their class:

* Class **numeric**, **integer**, or **double** prints descriptive statistics
* Class **character** prints only the length, class, and mode
* Class **factor** tallies the total occurrences in each level

```{r}

summary(parks)

```
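
<br>

To see how `summary()` adapts to each class, here is a minimal sketch on a small, hypothetical data frame. The object `toy` and its values are invented for illustration and are not part of the `parks` data.

```{r}

# Hypothetical toy data: one variable per class
toy <- data.frame(
  visits = c(120, 340, 95, NA),                          # Class "numeric"
  region = c("North", "South", "North", "West"),         # Class "character"
  stringsAsFactors = FALSE
)

toy$type <- factor(c("Park", "Beach", "Park", "Park"))   # Class "factor"

summary(toy$visits)     # Descriptive statistics, plus a count of NA values

summary(toy$region)     # Length, class, and mode only

summary(toy$type)       # Tally of occurrences in each level

```

If a **character** variable is really categorical, coercing it with `as.factor()` before calling `summary()` tends to give a more useful tally.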

<br>

**View Interactively:** In RStudio, function `View()` presents data in an interactive table.

```{r eval=F}

View(parks)

```

<center>

```{r echo=F, fig.align="center", fig.cap="*An interactive table resulting from function `View()` in RStudio's IDE.*", out.width="90%"}

knitr::include_graphics("figures/function_view.jpg")

```

</center>

<br>

**Documentation:** If data are from an R package, `?` or `help()` opens documentation.

```{r eval=F}

library(ggplot2)            # Load package containing data

?economics                  # Open documentation with `?`

help(economics)             # Open documentation with help()

```

<center>

```{r echo=F, fig.align="center", fig.cap="*Interactive documentation in RStudio using `?` or `help()`.*", out.width="90%"}

knitr::include_graphics("figures/help_documentation.jpg")

```

</center>

<br>
<br>

### Techniques for Tallies & Proportions

Many functions allow tallying frequencies and proportions for **character** and **factor** variables.

<br>

**Contingency Tables:** Function `table()` prints the total occurrences of each qualitative value.

These tables are also called **Contingency Tables**. 

```{r}

table(parks$Region)

```

<br>

**Proportionality:** Function `prop.table()`, with `table()` output, shows proportionality.

```{r}

regions <- table(parks$Region)      # Assign `table()` output: "regions"

prop.table(regions)                 # Print proportionality

```

<br>

Functions `table()` and `prop.table()` can also cross-tabulate two variables against each other.

```{r}

subset <- parks[, c("Year", "Region")]    # Subset two variables

table(subset)[, 1:5]                      # Frequency of "regions" 1-5

output <- table(subset)                   # Assign `table()` output

prop.table(output)[, 1:5]                 # Proportionality of "regions" 1-5

```
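
<br>

By default, `prop.table()` divides each cell by the grand total. As a hedged sketch, base R's `margin =` argument switches this to row or column proportions. We use the built-in `mtcars` data here, purely for illustration, so the example runs without the `parks` download.

```{r}

# Cross-tabulate cylinders (rows) by gears (columns) in built-in `mtcars`
tab <- table(Cylinders = mtcars$cyl, Gears = mtcars$gear)

prop.table(tab)                 # Each cell as a share of the grand total

prop.table(tab, margin = 1)     # Each cell as a share of its row total

prop.table(tab, margin = 2)     # Each cell as a share of its column total

```

Row proportions answer "within each row category, how are the columns distributed?", while column proportions answer the reverse.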

<br>
<br>

### Initial Analysis Techniques from Packages

Many R packages are helpful in **Initial Data Analysis**, e.g. **DescTools** and **dplyr**.

<br>

**Advanced Summaries:** In **DescTools**, function `Desc()` is an enhanced `summary()`.

```{r}

library(DescTools)

Desc(parks$Year)          # Function `Desc()` on a quantitative variable

Desc(parks$Region)        # Function `Desc()` on a qualitative variable

```

<br>

**Advanced Structures:** In **dplyr**, function `glimpse()` is a more organized `str()`.

```{r}

library(dplyr)

glimpse(parks)

```

<br>
<br>

## Exploratory Data Visualization

**Exploratory Data Visualization** or **EDV** is critical to exploratory analyses.

* Allows "quick and dirty" visualizations of your new data's variables
* Used internally to benefit yourself, collaborators, or specialized audiences
* Assists analysts in decoding and identifying patterns and anomalies in new data 

<br>
<br>

### Common Exploratory Visualization Techniques

Several functions exist for exploring data visually in base R.

<br>

**Histograms:** Quickly view the distribution of quantitative variables with `hist()`.

* Histograms are univariate and show the frequency of a range of numeric values
* Increase their resolution by increasing the number of ranges (`breaks =`)

```{r}

hist(parks$Attendance,        # Specify a single variable
     breaks = 100)            # Specify number of breaks and "bins"

```

<br>

**Box Plots:** View several distributions across categorical variables with `boxplot()`.

* The two ends of each box represent the first and third quartiles, resp.
* The length of the box, itself, is the **Interquartile Range**, or **IQR**
* The line in the middle of each box represents the median (50%)
* "Whiskers" extend to the most extreme values within `1.5 * IQR` of the box
* Outliers are demarcated beyond whiskers
* Both variables are separated with `~`

```{r}

boxplot(parks$Attendance ~ parks$Region)

```
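
<br>

The quantities behind a box plot can be sketched by hand. The example below uses the built-in `mtcars` data for illustration only, and the names `lower_fence` and `upper_fence` are our own. Note that `boxplot()` computes its box from "hinges", which can differ slightly from the quartiles returned by `quantile()`.

```{r}

x <- mtcars$mpg                           # Built-in data, for illustration only

q1  <- quantile(x, 0.25, names = FALSE)   # First quartile
q3  <- quantile(x, 0.75, names = FALSE)   # Third quartile
iqr <- IQR(x)                             # Interquartile range

lower_fence <- q1 - 1.5 * iqr             # Values below are candidate outliers
upper_fence <- q3 + 1.5 * iqr             # Values above are candidate outliers

x[x < lower_fence | x > upper_fence]      # Candidate outliers, by this rule

boxplot.stats(x)$stats                    # Whisker ends, hinges, and median

```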

<br>

**Scatter Plots:** View relationships between quantitative variables with `plot()`.

Since `parks` only contains one quantitative variable, we use `economics` from **ggplot2**.

```{r}

library(ggplot2)

plot(x = economics$uempmed,     # Median duration of unemployment, in weeks
     y = economics$unemploy)    # Number of unemployed, in thousands

```

<br>

**Pairs Plots:** Pairs plots create a matrix of small multiples for each variable.

* Small multiples allow multiple side-by-side comparisons of plots on common axes
* Depending on the **class** of each variable, different plot methods are used

Again, for want of class **numeric** variables, we use `economics` from **ggplot2**.

```{r}

library(ggplot2)

pairs(x = economics)

```

<br>

**Model Summaries:** Function `plot()`, used with a model, produces four summary plots.

* By adjusting the global graphics parameters of base R, we can print all four
* In function `par()`, specify total rows and columns in function `c()`
* Argument `mfrow =` accepts these two values in function `par()`

```{r}

model <- lm(Attendance ~ Year + Region, 
            data = parks)                   # Create linear model: "model"

par(mfrow = c(2, 2))                        # Specify dimensions in `par()`

plot(model)                                 # Call `plot()` on model

```
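
<br>

Because `par()` changes graphics settings globally, later plots keep the 2 x 2 layout until it is reset. One common pattern, sketched here with a hypothetical model `fit` on the built-in `mtcars` data, is to store the previous settings and restore them afterwards.

```{r}

fit <- lm(mpg ~ wt, data = mtcars)    # Hypothetical model on built-in data

old_par <- par(mfrow = c(2, 2))       # Set a 2 x 2 grid; store prior settings

plot(fit)                             # Four diagnostic plots fill the grid

par(old_par)                          # Restore the prior settings

```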

<br>

**Advanced Pairs Plots:** Use package **ggplot2** extension **GGally** and `ggpairs()`.

As a more colorful example, we'll use base R dataset `iris`.

```{r eval=F}

library(ggplot2)
library(GGally)                     # Load packages

ggpairs(iris,                       # Specify dataset
        aes(color = Species)) +     # Map colors to variable "Species"
  theme_minimal()                   # Preset theme cleans output

```

```{r include=F, cache=T}

library(ggplot2)
library(GGally)

plot <- ggpairs(iris,
                aes(color = Species)) +
  theme_minimal()

```

```{r echo=F}

library(ggplot2)
library(GGally)

plot

```

<br>
<br>

## Descriptive Statistics

**Descriptive** or **Summary Statistics** concisely *describe* datasets or individual variables with summary information, e.g. mean, median, mode, minimum value, maximum value, variance, and more.

While **descriptive statistics** can be the be-all and end-all of a descriptive analysis, they're also integral to **exploratory data analysis**. 

<br>
<br>

### Common Functions for Descriptive Statistics

Again, base R has no shortage of functions for **descriptive** or **summary statistics**.

<br>

**Mean:** The average or **mean** value of quantitative data is calculated with `mean()`.

```{r}

mean(parks$Attendance)

```

<br>

**Median:** Find the value of the 50th percentile, or **median**, with `median()`.

```{r}

median(parks$Attendance)

```

<br>

**Minima & Maxima:** Find the smallest and largest values with `min()` and `max()`.

```{r}

min(parks$Attendance)   # The smallest value in variable "Attendance"

max(parks$Attendance)   # The largest value in variable "Attendance"

```

<br>

**Variance:** Determine the **variance** of quantitative values with `var()`.

```{r}

var(parks$Attendance)

```

<br>

**Quantiles:** Get **quantiles**, or the value at 0, 25, 50, 75, and 100%, with `quantile()`.

```{r}

quantile(parks$Attendance)

```
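
<br>

The defaults above can be changed. As a brief sketch on the built-in `mtcars` data (used only for illustration), argument `probs =` accepts any proportions between 0 and 1.

```{r}

x <- mtcars$mpg                               # Built-in data, for illustration

quantile(x, probs = c(0.10, 0.50, 0.90))      # 10th, 50th, and 90th percentiles

quantile(x, probs = seq(0, 1, by = 0.2))      # Boundaries of five equal groups

```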

<br>
<br>

<div class="warning">

## WARNING: SUMMARY STATISTICS & MISSING VALUES

* If the quantitative data you intend to summarize contain missing values (`NA`), functions like `mean()` and `max()` return `NA` rather than a number.

* To tell R that missing values exist, and to exclude them from calculation, simply set argument `na.rm =` to `TRUE`. 

</div>
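
<br>

To illustrate the warning, here is a short sketch using `Ozone` from the built-in `airquality` data, which is known to contain missing values. It is not part of the `parks` data.

```{r}

ozone <- airquality$Ozone       # Built-in variable that contains NA values

sum(is.na(ozone))               # Count the missing values first

mean(ozone)                     # Returns NA because values are missing

mean(ozone, na.rm = TRUE)       # Excludes missing values before averaging

```

Counting missing values before removing them is a useful habit, since a summary based on only a few remaining values may be misleading.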

<br>
<br>

### Tabulating Descriptive Statistics

Tabulate descriptive statistics from `summary()` output with `data.frame()`.

*Why?* This provides an easy method to tabulate and write summary statistics to a file.

```{r}

sumstats <- summary(parks)          # Assign summary() output: "sumstats"

sumstats <- data.frame(sumstats)    # Coerce to data frame

sumstats[, 2:3]                     # Print data frame

```
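
<br>

An alternative, offered only as a sketch, is to build the table directly with `sapply()`, which yields one row per variable and one column per statistic. The built-in `mtcars` data stand in for your own, and the file name in the final comment is hypothetical.

```{r}

nums <- mtcars[, c("mpg", "hp", "wt")]      # Built-in numeric variables

stats <- data.frame(Variable = names(nums),
                    Mean     = sapply(nums, mean),
                    Median   = sapply(nums, median),
                    Minimum  = sapply(nums, min),
                    Maximum  = sapply(nums, max))

rownames(stats) <- NULL                     # Drop redundant row names

stats                                       # Print data frame

# write.csv(stats, "summary_stats.csv", row.names = FALSE)   # Hypothetical file

```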

<br>
<br>

## Group-Wise Summaries in Base R

So far, we've looked at a variety of ways to explore and summarize datasets and individual variables.

However, you may often seek to summarize and compare subsets of data that are **grouped** by some common value, category, or label.  The following explores how to group and describe data by one or more specified characteristics.

<br>
<br>

### Tabulating Contingency Tables

In *Section 1.1: Exploratory Data Analysis*, we learned about **contingency tables**.

* **Contingency tables** tally the frequency of values for each category in your data
    - Calculated with function `table()`
* **Proportional contingency tables** tally the proportion of each category
    - Calculated with the output of `table()` in function `prop.table()`

<br>

In order to tabulate **contingency tables** in their own data frames: 

1. Flatten them with `ftable()` instead of `table()`
2. Convert the output to a data frame with `data.frame()`

```{r}

reg_freq <- ftable(parks$Region)    # Assign `ftable()` output: "reg_freq"

data.frame(reg_freq)                # Enter output in `data.frame()`

```

<br>

Likewise, for **proportional contingency tables**:

1. Flatten them with `ftable()` instead of `table()`
2. Call `prop.table()` on the output of `ftable()`
3. Convert to a data frame with `data.frame()`

```{r}

reg_freq <- ftable(parks$Region)    # Assign `ftable()` output: "reg_freq"

reg_prop <- prop.table(reg_freq)    # Assign `prop.table()` output: "reg_prop"

data.frame(reg_prop)                # Enter output in `data.frame()`

```
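
<br>

As a hedged alternative to the steps above, frequencies and proportions can sit side by side in one data frame by converting `table()` output with `as.data.frame()` and dividing by the total. The built-in `mtcars` data and the column name `Prop` are used here only for illustration.

```{r}

freq <- as.data.frame(table(Cylinders = mtcars$cyl))    # Columns: "Cylinders", "Freq"

freq$Prop <- freq$Freq / sum(freq$Freq)                 # Add proportions: "Prop"

freq                                                    # Print data frame

```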

<br>
<br>

### Contingency Tables & Group-Wise Frequencies

Coming soon...

<br>
<br>


### Apply Functions & Group-Wise Operations

Coming soon...

<br>
<br>

### Aggregation & Group-Wise Operations

Coming soon...

<br>
<br>

## Group-Wise Operations with dplyr

Package **dplyr** is a unified framework built explicitly for data manipulation in R, e.g.:

* Reordering rows based on one or more variables
* Performing complex filtering and additive joins
* Selecting, reordering, and renaming variables in a data frame
* Filtering rows by specified conditional statements and logical operators
* Grouping rows by one or more specified variables and summarizing their values

We explore most of these elsewhere. Here, we focus on group-wise operations.

But first, we'll provide a brief overview of **dplyr** syntax.

<br>
<br>

### Package dplyr Syntax

Package **dplyr** has a somewhat nuanced syntax that is easy to master. Pay attention:

<br>

**Piping:** Package **dplyr** uses the **pipe operator**, or `%>%`, which:

* Passes data frames through some new function, emerging as an altered data frame
* Begins with a data frame input in the left hand side
* Ends with a data frame output from the right hand side

```{r}

parks %>%                                         # Specify data frame and pipe
  filter(Facility == "Allegany Red House Area")   # Pass via function `filter()`

```
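
<br>

It may help to see that a pipe is just another way of writing a function call: the data frame on the left becomes the first argument of the function on the right. The sketch below uses the built-in `mtcars` data, and the names `piped` and `nested` are our own.

```{r}

library(dplyr)

piped  <- mtcars %>% filter(cyl == 4)     # Data frame piped into `filter()`

nested <- filter(mtcars, cyl == 4)        # Same call, written without the pipe

identical(piped, nested)                  # Check that both give the same result

```

The piped form tends to read more naturally once several functions are chained, since each step appears in the order it is performed.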

<br>

**Bare Variable Names:** Once the dataset is named, you need not type it again.

* R recognizes when the data frame has been called
* Therefore, variables need only be named, without the `dataset$x` notation

```{r}

names(parks)

parks %>%                             # Call dataset object name
  select(Year, Region, Attendance)    # Use bare variable names

```

<br>

**Tibbles:** Data frames read with `read_csv()`, like `parks`, are **tibbles**, and **dplyr** functions preserve them.

* **Tibbles** are data frames that print in a truncated form
* Typically, **tibbles** print the first ten observations
* **Tibbles** also provide both dimensions and variable classes
* Any unprinted observations are summarized underneath the first ten
* With a large number of variables, **tibbles** print only what fits on-screen
* Like unprinted observations, variables that do not fit on-screen are summarized

<br>
<br>

### Grouping by Variables

In **dplyr**, function `group_by()` accepts the bare names of one or more variables.

Notably, **grouping** does nothing by itself. Data must be **piped** into a new function.

```{r eval=F}

parks %>%
  group_by(Year)            # Grouping by a single variable: "Year"

parks %>%
  group_by(Region, Year)    # Grouping by two variables: "Region", "Year"

```
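
<br>

To see what grouping does and does not change, here is a small sketch on the built-in `mtcars` data (for illustration only). Functions `group_vars()` and `n_groups()` from **dplyr** report the grouping, while the rows and columns themselves are untouched.

```{r}

library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl)               # Group built-in `mtcars` by cylinders

group_vars(by_cyl)            # Names of the grouping variables

n_groups(by_cyl)              # Number of groups

dim(by_cyl)                   # Same rows and columns as `mtcars`

```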

<br>
<br>

### Group By-Summarize Operations

As noted, we must use a function to operate on **grouped** data.

* Function `summarize()` allows us to make new variables on grouped data
* Within `summarize()`, the basic formula is: `new_variable = function(existing_variable)`

Here, we create new variable `Average` from existing variable `Attendance`:

```{r}

parks %>%                                 # Invoke "parks"
  group_by(Year) %>%                      # Group by "Year"
  summarize(Average = mean(Attendance))   # Create "Average" with `mean()`

```

<br>


**Multiple Summary Variables:** Create multiple new variables in one `summarize()` call.

```{r}

parks %>%
  group_by(Year) %>%
  summarize(Mean = mean(Attendance),
            Median = median(Attendance),
            Maximum = max(Attendance),
            Records = n())                # Create multiple new variables

```

<br>

**Multiple Grouping Variables:** We can use multiple variables in `group_by()`.

* Creates summaries for each combination of unique values found in the data
    - Suppose we group by one variable with 5 categories and one with 10
    - Total combinations equal `5 * 10`, or at most 50 groups
* Note, also, that `summarize()` *drops variables that are neither grouped nor summarized*
    - In other words, grouping by `X`, not `Y`, means `Y` is then excluded

```{r}

parks %>%
  group_by(Year, Region) %>%              # Group on variables "Year", "Region"
  summarize(Mean = mean(Attendance),
            Median = median(Attendance))

```
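
<br>

The arithmetic above gives an upper bound, since only combinations that actually occur in the data form groups. A brief sketch on the built-in `mtcars` data, with the illustrative names `n_cyl` and `n_gear`, compares the two counts.

```{r}

library(dplyr)

n_cyl  <- n_distinct(mtcars$cyl)      # Unique values of "cyl"
n_gear <- n_distinct(mtcars$gear)     # Unique values of "gear"

n_cyl * n_gear                        # Upper bound on the number of groups

mtcars %>%
  group_by(cyl, gear) %>%
  n_groups()                          # Groups actually observed in the data

```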

<br>

**Creating Summaries of Summaries:** Use variables from `summarize()` in the same call!

```{r}

parks %>%
  group_by(Year) %>%
  summarize(Mean = mean(Attendance),
            Total = sum(Attendance),                      # Create "Total"
            Proportion = Total / sum(parks$Attendance))   # Use "Total" in formula

```

<br>

**Assigning Summary Output:** Preface `summarize()` calls with assignment, `<-`.

```{r}

mean_att <- parks %>%                 # Assign expression to "mean_att"
  group_by(Year) %>%
  summarize(Mean = mean(Attendance))

mean_att                              # Autoprint results

```

<br>

**Ungrouping:** As a rule, consider whether to use function `ungroup()` after `group_by()`.

* If you use **grouped** summaries for later analysis, they can remain grouped under the hood (by default, `summarize()` removes only the last grouping variable)
* This is particularly annoying to troubleshoot when far downstream in analyses
* Don't suffer as so many have - use `ungroup()`

```{r}

mean_att <- parks %>%
  group_by(Year) %>%
  summarize(Mean = mean(Attendance)) %>%
  ungroup()                               # Don't enter a world of pain

```
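
<br>

To check whether a summary is still grouped, `group_vars()` is handy. The following sketch uses the built-in `mtcars` data and a hypothetical object `by_two`. It assumes the default behavior of `summarize()`, which drops only the last grouping variable.

```{r}

library(dplyr)

by_two <- mtcars %>%
  group_by(cyl, gear) %>%             # Group by two variables
  summarize(Mean = mean(mpg))         # Drops only the last grouping level

group_vars(by_two)                    # Still grouped by "cyl"

by_two %>%
  ungroup() %>%
  group_vars()                        # No grouping variables remain

```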

<br>
<br>

<div class="warning">

## WARNING: SERIOUSLY, DON'T FORGET UNGROUP()

* Want to know why your summary data aren't merging correctly?
* There's a litany of possible reasons.
* Often: You didn't use `ungroup()`.

Use `ungroup()`.

</div>

<br>
<br>

## Further Resources

The following resources may prove helpful to the curious learner.

* [Bryan, J. "Single Table dplyr Functions" STAT 545.](https://stat545.com/dplyr-single.html#group_by-is-a-mighty-weapon)
* [Crawford, J (2019). "Intro to R: Data Manipulation".](http://rpubs.com/JamisonCrawford/manipulation) 
* [RStudio (2015). "Data Wrangling Cheat Sheet".](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)