# Environmental data
## Bowerbird/blueant
Very commonly, we want to know about the environmental conditions at our points of interest. For the remote and vast Southern Ocean these data typically come from satellite or model sources. Some data centres provide extraction tools that will pull out a subset of data to suit your requirements, but often it makes more sense to cache entire data collections locally first and then work with them from there.
[bowerbird](https://github.com/ropensci/bowerbird) provides a framework for downloading data files to a local collection, and keeping it up to date. The companion [blueant](https://github.com/AustralianAntarcticDivision/blueant) package provides a suite of definitions for Southern Ocean and Antarctic data sources that can be used with `bowerbird`. It encompasses data such as sea ice, bathymetry and land topography, oceanography, and atmospheric reanalysis and weather predictions, from providers such as NASA, NOAA, Copernicus, NSIDC, and Ifremer.
Why might you want to maintain local copies of entire data sets, instead of just fetching subsets of data from providers as needed?
- many analyses make use of data from a variety of providers (in which case there may not be dynamic extraction tools for all of them),
- analyses might need to crunch through a whole collection of data in order to calculate appropriate statistics (temperature anomalies with respect to a long-term mean, for example),
- different parts of the same data set are used in different analyses, in which case making one copy of the whole thing may be easier to manage than having different subsets for different projects,
- a common suite of data are routinely used by a local research community, in which case it makes more sense to keep a local copy for everyone to use, rather than multiple copies being downloaded by different individuals.
In these cases, maintaining local copies of a range of data from third-party providers can be extremely beneficial, especially if that collection is hosted with a fast connection to local compute resources (virtual machines or high-performance computational facilities).
Install from GitHub:
```{r ed_install_pkg, eval = FALSE}
remotes::install_github("AustralianAntarcticDivision/blueant")
```
And load the package before use:
```{r ed1, cache = FALSE}
library(blueant)
library(dplyr) ## for bind_rows() and %>% used later in this section
```
### Available data sets
First, we can see the available data sets via the `sources` function.
```{r ed2}
srcs <- blueant::sources()
## the names of the first few
head(srcs$name)
## the full details of the first one
srcs[1, ]
```
### Usage
Choose a directory into which to download the data. Usually this would be a persistent directory on your machine so that data sets downloaded in one session would remain available for use in later sessions, and not need re-downloading. A persistent directory could be something like `c:\data\` (on Windows), or you could use the `rappdirs` package (the `user_cache_dir` function) to suggest a suitable directory (cross-platform).
```{r ed0hidden, echo = FALSE, results = "asis", cache = FALSE}
if (grepl("ben_ray", tempdir())) {
cat("Here we'll use the `c:/data/cache` directory:\n")
my_data_dir <- "/data/cache"
} else {
cat("Here we'll use a temporary directory:\n")
my_data_dir <- tempdir()
}
```
```{r ed0a, include = grepl("ben_ray", tempdir()), eval = FALSE, cache = FALSE}
my_data_dir <- "/data/cache"
```
```{r ed0b, include = !grepl("ben_ray", tempdir()), eval = FALSE, cache = FALSE}
my_data_dir <- tempdir()
```
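For a persistent, cross-platform location you could instead let `rappdirs` suggest a per-user cache directory. A minimal sketch, assuming you are happy with the default location it proposes (the `"blueant"` application name here is just an illustrative label):
```{r ed0c, eval = FALSE}
## a persistent, per-user cache directory suggested by rappdirs
## ("blueant" is just an illustrative application name)
my_data_dir <- rappdirs::user_cache_dir("blueant")
if (!dir.exists(my_data_dir)) dir.create(my_data_dir, recursive = TRUE)
```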
Select the data source that we want:
```{r ed4}
data_source <- sources("Southern Ocean marine environmental data")
```
Note that it's a good idea to check the data set size before downloading it, as some are quite large! (Though if you are running the download interactively, it will ask you to confirm before downloading a large data set.)
```{r ed5}
data_source$collection_size ## size in GB
```
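In a non-interactive script that confirmation prompt won't help you, so you may want to guard the download yourself. A minimal sketch, with a 10 GB threshold chosen purely for illustration:
```{r ed5b, eval = FALSE}
## only fetch the data if the collection is small enough
## (the 10 GB limit is arbitrary, for illustration only)
if (isTRUE(data_source$collection_size < 10)) {
  result <- bb_get(data_source, local_file_root = my_data_dir, verbose = TRUE)
} else {
  warning("collection larger than 10 GB, skipping download")
}
```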
And fetch the data:
```{r ed6}
result <- bb_get(data_source, local_file_root = my_data_dir, verbose = TRUE)
```
Now we have a local copy of our data. The sync can be run daily so that the local collection is always up to date - it will only download new files, or files that have changed since the last download. For more information on `bowerbird`, see the [package vignette](https://ropensci.github.io/bowerbird/articles/bowerbird.html).
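To keep the collection current you would typically put the synchronization into a small standalone script and run it on a schedule (cron, Windows Task Scheduler, or similar). A minimal sketch using `bowerbird`'s configuration interface (`bb_config`/`bb_add`/`bb_sync`); the data directory path is a placeholder, and the vignette linked above covers the details:
```{r ed_sync, eval = FALSE}
## a standalone script suitable for a daily scheduled job
library(blueant)
my_data_dir <- "/path/to/your/data/collection" ## placeholder: use your persistent directory
cf <- bb_config(local_file_root = my_data_dir)
cf <- bb_add(cf, sources("Southern Ocean marine environmental data"))
status <- bb_sync(cf, verbose = TRUE) ## downloads only new or changed files
```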
The `result` object holds information about the data that we downloaded:
```{r ed7}
result
```
The `result$files` element tells us about the files:
```{r ed8}
head(result$files[[1]])
```
These particular files are netCDF, and so could be read using e.g. the `raster` or `ncdf4` packages. However, different data from different providers will be different in terms of grids, resolutions, projections, variable-naming conventions, and other facets, which tends to complicate these operations. In the next section we'll look at the `raadtools` package, which provides a set of tools for doing common operations on these types of data.
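As a quick aside before moving on: assuming the first entry in `result$files[[1]]` is one of the netCDF grids, such a file could be opened directly with `raster`:
```{r ed9, eval = FALSE}
## open one of the downloaded netCDF files as a raster layer
## (assumes a single-layer grid; multi-layer files may need raster::brick() or ncdf4)
nc_file <- result$files[[1]]$file[1]
x <- raster::raster(nc_file)
plot(x)
```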
## RAADtools
The [`raadtools`](https://github.com/AustralianAntarcticDivision/raadtools) package provides a consistent interface to a range of environmental and similar data, and tools for working with them. It is designed to work with data collections maintained by the `bowerbird`/`blueant` packages, and builds on R's existing ecosystem of packages for working with spatial, raster, and multidimensional data.
Here we'll use two different environmental data sets: sea ice and water depth. Water depth does not change with time but sea ice is provided at daily time resolution.
First download daily sea ice data (from 2013 only), and the ETOPO2 bathymetric data set. ETOPO2 is somewhat dated and low-resolution compared to more recent data, but it will do as a small data set for demo purposes. This may take a few minutes, depending on your connection speed:
```{r raadt1, eval = FALSE}
src <- bind_rows(
sources("NSIDC SMMR-SSM/I Nasateam sea ice concentration", hemisphere = "south", time_resolutions = "day",
years = 2013),
sources("ETOPO2 bathymetry"))
result <- bb_get(src, local_file_root = my_data_dir, clobber = 0, verbose = TRUE, confirm = NULL)
```
```{r raadt2, echo = FALSE, message = FALSE}
## code not shown in output: capture the output and trim it down a bit
src <- bind_rows(
sources("NSIDC SMMR-SSM/I Nasateam sea ice concentration", hemisphere = "south", time_resolutions = "day", years = 2013),
sources("ETOPO2 bathymetry"))
op <- capture.output(result <- bb_get(src, local_file_root = my_data_dir, clobber = 0, verbose = TRUE, confirm = NULL))
op[4] <- ""
op[5] <- " [... output truncated]"
for (oo in op[1:5]) cat(oo, "\n")
```
Now load the `raadtools` package and tell it where our data collection has been stored:
```{r raadt3, results = "hold", cache = FALSE}
library(raadtools)
set_data_roots(my_data_dir)
```
Let's say that we have some points of interest in the Southern Ocean — perhaps a ship track, or some stations where we took marine samples, or as we'll use here, the [track of an elephant seal](http://www.meop.net/) as it moves from the Kerguelen Islands to Antarctica and back again (Data from IMOS 2018[^1], provided as part of the `SOmap` package).
```{r get_track_data, message = FALSE, warning = FALSE, cache = FALSE}
data("SOmap_data", package = "SOmap")
ele <- SOmap_data$mirounga_leonina %>% dplyr::filter(id == "ct96-05-13")
```
Define our spatial region of interest and extract the bathymetry data from this region, using the ETOPO2 files we just downloaded:
```{r raadt4}
roi <- round(c(range(ele$lon), range(ele$lat)) + c(-2, 2, -2, 2))
bx <- readtopo("etopo2", xylim = roi)
```
And now we can make a simple plot of our track superimposed on the bathymetry:
```{r raadt5}
plot(bx)
lines(ele$lon, ele$lat)
```
The real power of `raadtools` comes from its extraction functions. We can extract the depth values along our track using the `raadtools::extract()` function. We pass it the data-reader function to use (`readtopo`), the data to apply it to (`ele[, c("lon", "lat")]`), and any other options to pass to the reader function (in this case, specifying the topographic data source `topo = "etopo2"`):
```{r raadt6}
ele$depth <- raadtools::extract(readtopo, ele[, c("lon", "lat")], topo = "etopo2")
```
Plot the histogram of depth values, showing that most of the track points are located in relatively shallow waters:
```{r raadt7}
with(ele, hist(depth, breaks = 20))
```
This type of extraction will also work with time-varying data — for example, we can extract the sea-ice conditions along our track, based on each track point's location and time:
```{r raadt8, results = "hide", message = FALSE, warning = FALSE}
ele$ice <- raadtools::extract(readice, ele[, c("lon", "lat", "date")])
```
```{r raadt9}
## points outside the ice grid will have missing ice values, so fill them with zeros
ele$ice[is.na(ele$ice)] <- 0
with(ele, plot(date, ice, type = "l"))
```
## Other useful packages
- the [PolarWatch](https://polarwatch.noaa.gov/) project aims to enable data discovery and broader use of high-latitude ocean remote sensing data sets. The dedicated ERDDAP server (https://polarwatch.noaa.gov/erddap) is accessible to R users with [rerddap](https://cran.r-project.org/package=rerddap); a minimal example is sketched after this list.
- [rsoi](https://cran.r-project.org/package=rsoi) downloads the most up-to-date Southern Oscillation Index, Oceanic Niño Index, and North Pacific Gyre Oscillation data.
- satellite reflectance data are a common basis for estimating chlorophyll-*a* and other phytoplankton parameters at ocean-basin scales. Global products are widely available; however, Southern-Ocean specific algorithms are likely to provide better estimates in these regions. [croc](https://github.com/sosoc/croc) implements the Johnson et al. (2013) Southern Ocean algorithm.
- more broadly, [oce](https://cran.r-project.org/package=oce) provides a wide range of tools for reading, processing, and displaying oceanographic data, including measurements from Argo floats and CTD casts, sectional data, sea-level time series, and coastline and topographic data.
- [fda.oce](https://github.com/EPauthenet/fda.oce) provides functional data analysis of oceanographic profiles for front detection, water mass identification, unsupervised or supervised classification, model comparison, data calibration, and more.
- [distancetocoast](https://github.com/mdsumner/distancetocoast) provides "distance to coastline" data for longitude and latitude coordinates.
- [geodist](https://github.com/hypertidy/geodist) for very fast calculation of geodesic distances.
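As a minimal sketch of the `rerddap` route mentioned in the PolarWatch item above (the search term is just an example; browse the PolarWatch ERDDAP catalogue for specific dataset IDs):
```{r other_rerddap, eval = FALSE}
## search the PolarWatch ERDDAP server for gridded sea ice datasets
library(rerddap)
pw_url <- "https://polarwatch.noaa.gov/erddap/"
ed_search(query = "sea ice concentration", url = pw_url, which = "griddap")
## once you have a dataset ID, info(dataset_id, url = pw_url) returns its metadata,
## and griddap() can then be used to download a spatial/temporal subset
```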