external-data.qmd

---
title: "Starting with Data"
engine: knitr
code-annotations: hover
---

## Library Loading

Begin by loading any needed libraries.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

Load the `tidyverse` meta-package.

```{r r-start, warning = F, message = F}
# Load needed library
library(tidyverse)
```

## [{{< fa brands python >}} Python]{.py}

Load the `pandas` and `os` libraries.

```{python py-libs}
# Load needed libraries
import os
import pandas as pd
```
:::

## Data Import

We can now load an external dataset derived from the `lterdatasampler` [{{< fa brands r-project >}} R]{.r} package (see [here](https://lter.github.io/lterdatasampler/)) with both [{{< fa brands r-project >}} R]{.r} and [{{< fa brands python >}} Python]{.py}. This relatively simple operation is also a nice chance to showcase how 'namespacing' (i.e., indicating which package a given function comes from) differs between the two languages.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

Namespacing in [{{< fa brands r-project >}} R]{.r} is accomplished by doing `package_name::function_name` and is optional (though, in my opinion, good practice!). Note that we use the assignment operator (`<-`) to assign the contents of the CSV to an object.

```{r r-ext-data}
# Read in vertebrate data CSV
vert_r <- utils::read.csv(file = file.path("data", "verts.csv"))
```

Recall from our file path module that the `file.path` function accounts for computer operating system differences.

## [{{< fa brands python >}} Python]{.py}

In [{{< fa brands python >}} Python]{.py}, namespacing is required and is done via `package_name.function_name`. Note that we use the assignment operator (`=`) to assign the contents of the CSV to a variable.

```{python py-ext-data}
# Read in vertebrate data CSV
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))
```

Recall from our file path module that the `join` function accounts for computer operating system differences.
:::

## Tabular Data [Type]{.py}/[Class]{.r}

Data stored in CSVs (and similar data formats like Microsoft Excel, etc.) has a unique [type]{.py}/[class]{.r} that differs from some of the categories we covered in the "Fundamentals" section.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

In [{{< fa brands r-project >}} R]{.r}, such data are class `data.frame`

```{r r-class}
# Check class of a data object
class(vert_r)
```

## [{{< fa brands python >}} Python]{.py}

In [{{< fa brands python >}} Python]{.py}, such data are type `DataFrame` and this variable type is defined by the `pandas` library. This is the standard type returned by the `pandas` `read_csv` function.

```{python py-class}
# Check type of a data object
type(vert_py)
```
:::

## Making Heads or Tails of Data

Checking the 'head' or 'tail' of the data (i.e., the first or last few rows of the data respectively) is a nice way of getting a sense for the general format of the dataframe being assessed.

::: panel-tabset
## `r fontawesome::fa("r-project", fill = "#FF9B00", a11y = "sem")` R

In [{{< fa brands r-project >}} R]{.r}, we use the `head` or `tail` function to return the first or last rows respectively.

```{r r-head}
# Check out head
utils::head(vert_r, n = 2) # <1>
```
1. Note that we're using the _optional_ `n` argument to specify the number of rows to return

```{r r-tail}
# Check out tail
utils::tail(vert_r, n = 3)
```

## `r fontawesome::fa("python", fill = "#077DC2", a11y ="sem")` Python

In [{{< fa brands python >}} Python]{.py}, we use the `head` or `tail` method to return the first or last rows respectively. Note that these methods are only available to variables of type `DataFrame`. All methods are appended to the end of the variable of the appropriate type separated by a period.

```{python py-head}
# Check out head
vert_py.head(3)
```

```{python py-tail}
# Check out tail
vert_py.tail(2)
```
:::

## Data Structure

While it is nice to know the [type]{.py}/[class]{.r} of the data table generally, we often need to know the [type]{.py}/[class]{.r} of the columns within those data tables. Our operations are typically aimed at modifying particular columns and pre-requisite to that is knowing the [type]{.py}/[class]{.r} of the column to know what actions are available to us.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

[{{< fa brands r-project >}} R]{.r} uses the `str` function to assess data structure. Structure includes the dimensions of the data (i.e., number of rows and columns) as well as the class of the data (`data.frame`) and the class of each column.

```{r r-structure}
utils::str(vert_r)
```

Be careful to not confuse this with the [{{< fa brands python >}} Python]{.py} function `str` that coerces values to type "string"!

## [{{< fa brands python >}} Python]{.py}

When we want to know the type of each column in a [{{< fa brands python >}} Python]{.py} `DataFrame`, we can use the `dtypes` "attribute". Attributes are akin to a method but they completely lack arguments that might modify their behavior. As a consequence they are extremely precisely defined. The `dtypes` attribute returns the type of each column.

```{python py-structure}
vert_py.dtypes
```
:::

## Data Summaries

We often want to begin our exploration of a given dataset by getting a summary of each column. This is rarely what we actually need for statistics or visualization but it is a nice high-level way of getting a sense for the composition of the data.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

[{{< fa brands r-project >}} R]{.r} provides the `summary` function to summarize all columns in a dataset. Note that it is not terribly informative for columns of class character.

```{r r-describe}
# Get summary of data
summary(vert_r)
```

## [{{< fa brands python >}} Python]{.py}

[{{< fa brands python >}} Python]{.py} has the `describe` method for getting similar information about each column of a `DataFrame`.

```{python py-describe}
# Describe data
vert_py.describe()
```

[{{< fa brands python >}} Python]{.py} also offers the `info` method to identify only the count of non-null entries in each column as well as the type of each column.

```{python py-describe2}
# Check info of data
vert_py.info()
```
:::

## Identifying Columns

If we want to skip these steps and just identify the full set of column [labels]{.py}/[names]{.r} there is a small operation for stripping that information out in both languages.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

[{{< fa brands r-project >}} R]{.r} has the `names` function to quickly return a vector of the column names in a given `data.frame` object.

```{r r-names}
# Check column names
names(vert_r)
```

## [{{< fa brands python >}} Python]{.py}

[{{< fa brands python >}} Python]{.py} has the `columns` attributes for returning a simple list of column labels in a given `DataFrame` variable.

```{python py-names}
# Check column labels
vert_py.columns
```
:::

## Accessing Column(s)

We may want to access a particular column or set of columns. There are several approaches we might use to access a particular column in either [{{< fa brands python >}} Python]{.py} or [{{< fa brands r-project >}} R]{.r}. They are:

- Indexing the column by its location
- Indexing the column by its [label]{.py}/[name]{.r}
- Indexing the column with an operator

In all of the following examples we'll use the `head` [method]{.py}/[function]{.r} to simplify the output.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

If we know the order of the columns we can use the same syntax as when we indexed a vector.

```{r r-cols1}
# Use column indexing
head(vert_r[1])
```

Note that if we also want to account for rows we would put the desired column number on the right and the desired row number on the left.

```{r r-cols1b}
# Use row/column indexing
head(vert_r[, 1])
```

If preferred we could instead substitute the number for the column name.

```{r r-cols2}
head(vert_r["year"])
```

In [{{< fa brands r-project >}} R]{.r}, the column operator is a `$` and we place that character between the object and column names (e.g., `data_name$column_name`).

```{r r-cols3}
head(vert_r$year)
```

## [{{< fa brands python >}} Python]{.py}

If we know the index position of the desired column we can use the `iloc` method (short for "integer location"). Note that the `iloc` method uses square brackets instead of parentheses (typical methods use parentheses). We must include a colon with either nothing or the start/stop bounds (see "Slicing" in "Fundamentals"). A colon without specifying bounds returns all rows.

```{python py-cols1}
# Use row/column indexing
vert_py.iloc[: , 0].head()
```

If preferred we could instead substitute the number for the column label.

```{python py-cols2}
vert_py["year"].head()
```

In [{{< fa brands python >}} Python]{.py}, the column operator is a `.` and we place that character between the variable name and column label (e.g., `data_name.column_name`).

```{python py-cols3}
vert_py.year.head()
```
:::

When we want to access multiple columns we can still use either index positions or column [labels]{.py}/[names]{.r} but we _cannot_ use the column operator.

We'll continue to use the `head` [method]{.py}/[function]{.r} to simplify outputs.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

In [{{< fa brands r-project >}} R]{.r} we could use a concatenated vector of index positions to specify particular columns by their positions.

```{r r-multicols1}
# Use multiple column index positions
head(vert_r[, c(1:2, 13)])
```

Instead we could use a vector of column names for extra precision.

```{r r-multicols2}
# Use multiple column names
head(vert_r[, c("year", "sitecode", "weight_g")])
```

## [{{< fa brands python >}} Python]{.py}

In [{{< fa brands python >}} Python]{.py} we could use the `iloc` method for multiple columns and simply supply a list of the column index positions in which we are interested. Note that this method is different from many of the other methods we've covered so far because it uses square brackets instead of parentheses.

```{python py-multicols1}
# Use multiple column index positions
vert_py.iloc[:, [0, 1, 12]].head()
```

Or we could use the `.loc` method and supply a list of column labels. Note that the `loc` method also uses square brackets instead of parentheses and requires a colon to the left of the comma in order to return all rows.

```{python py-multicols2}
vert_py.loc[:, ["year", "sitecode", "weight_g"]].head()
```
:::