diff --git a/notebooks/R/remove_records_on_land.ipynb b/notebooks/R/remove_records_on_land.ipynb new file mode 100644 index 0000000..82ea611 --- /dev/null +++ b/notebooks/R/remove_records_on_land.ipynb @@ -0,0 +1,396 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Removing records on land\n", + "\n", + "When you download data from OBIS, one of the first steps in the data cleaning is usually to remove records on land. A species might be recorded on land by many reasons: positional uncertainty, misidentification, records on museums or aquariums, etc.\n", + "\n", + "In this short notebook we will explore how to remove records on land using two different tools:\n", + "\n", + "1. The `obistools` package\n", + "2. A raster file containing environmental information\n", + "\n", + "We will also explore how we can approximate records on land (what might be useful in situations like intertidal species)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The `obistools` package\n", + "\n", + "The `obistools` R package provide tools for data enhancement and quality control. You can find it [here]() and install using:\n", + "\n", + "`devtools::install_github(\"iobis/obistools\")`\n", + "\n", + "Many functions are intended for improving data **before** submitting it to OBIS, but others can be used after data download. We will specifically use the function `obistools::check_onland()`\n", + "\n", + "You need to supply only one argument (`data`), a `data.frame` with columns `decimalLongitude` and `decimalLatitude`. Additionally you can supply your own polygon for the land (in `sf` format).\n", + "\n", + "We will first download some data to play with. We will use the fiddler crab _Minuca rapax_ as an example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "library(obistools)\n", + "library(patchwork)\n", + "library(ggplot2)\n", + "\n", + "crab <- robis::occurrence(\"Minuca rapax\")\n", + "\n", + "head(crab)\n", + "\n", + "plot_map(crab)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can check which records are on land." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "crab_on_land <- check_onland(crab, report = TRUE)\n", + "print(crab_on_land)\n", + "nrow(crab_on_land)\n", + "nrow(crab)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Almost all records, but that is expected for an intertidal species. That's why we can also consider a buffer. But first, let's see how we can remove those records on land:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "crab_on_sea <- crab[-1 * crab_on_land$row,]\n", + "nrow(crab_on_sea)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The function also has a `buffer` argument, expressed in meters from the shoreline. Let's try with a buffer of 200m" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "crab_on_land_buff <- check_onland(crab, report = TRUE, buffer = 200)\n", + "nrow(crab_on_land_buff)\n", + "nrow(crab)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A big difference. Let's also produce a version with those records removed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "crab_on_sea_buff <- crab[-1 * crab_on_land_buff$row,]\n", + "nrow(crab_on_sea_buff)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "p1 <- plot_map(crab) + ggtitle(\"Original\") + coord_sf(xlim = c(-100, -20))\n", + "p2 <- plot_map(crab_on_sea) + ggtitle(\"No buffer\") + coord_sf(xlim = c(-100, -20))\n", + "p3 <- plot_map(crab_on_sea_buff) + ggtitle(\"Buffer\") + coord_sf(xlim = c(-100, -20))\n", + "\n", + "p1+p2+p3+plot_layout(nrow = 1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using a raster layer\n", + "\n", + "We can also use a raster layer to remove records on land. This is actually a very easy way to clean up the data and ensure you have only points for which you have coverage in your environmental dataset.\n", + "\n", + "We start by downloading a raster layer. We will download an SST layer from the NOAA CoralTemp product available [here](https://coastwatch.pfeg.noaa.gov/erddap/files/NOAA_DHW_monthly/).\n", + "\n", + "All processing will be done using the package `terra`, but you can achieve the same results with package `stars`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "library(terra)\n", + "\n", + "# Download file\n", + "download.file(\"https://coastwatch.pfeg.noaa.gov/erddap/files/NOAA_DHW_monthly/ct5km_sst_ssta_monthly_v31_202401.nc\",\n", + "\"sst_202401.nc\", method = \"wget\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "sst <- rast(\"sst_202401.nc\")\n", + "\n", + "sst\n", + "\n", + "sst <- sst$sea_surface_temperature\n", + "\n", + "plot(sst)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To remove records with no valid data, we just need to extract the values according to the longitude and latitude." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# It is always interesting to add terra:: before the extract function\n", + "# because extract is a common name for function on the tidyverse and some other packages.\n", + "crab_extract <- terra::extract(sst, crab[,c(\"decimalLongitude\", \"decimalLatitude\")])\n", + "\n", + "# If you encounter an error, convert crab[,c(\"decimalLongitude\", \"decimalLatitude\")] into\n", + "# a data.frame using data.frame(crab[,c(\"decimalLongitude\", \"decimalLatitude\")])\n", + "\n", + "head(crab_extract)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that we have many NAs, that is points for which there is no environmental information. We remove those." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "crab_on_sea_sst <- crab[!is.na(crab_extract$sea_surface_temperature),]\n", + "\n", + "nrow(crab_on_sea_sst)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Plot\n", + "crab_plot <- crab\n", + "crab_plot$status <- ifelse(is.na(crab_extract$sea_surface_temperature), \"On land\", \"On sea\")\n", + "\n", + "sst_crop <- crop(sst, ext(vect(crab_plot, geom = c(\"decimalLongitude\", \"decimalLatitude\"))))\n", + "sst_crop <- as.data.frame(sst_crop, xy = T)\n", + "\n", + "ggplot() +\n", + " geom_raster(data = sst_crop, aes(x = x, y = y, fill = sea_surface_temperature)) +\n", + " geom_point(data = crab_plot,\n", + " aes(x = decimalLongitude, y = decimalLatitude, \n", + " color = status), size = 2) +\n", + " scale_fill_viridis_c() +\n", + " scale_color_manual(values = c(\"#120101\", \"#b100cc\")) +\n", + " theme_light() +\n", + " coord_equal()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Approximating points to land\n", + "\n", + "We can use that same raster layer to approximate points to land, that is, bring an invalid point to the closest valid cell.\n", + "\n", + "There are many ways of doing that, and in fact you can also approximate points to the nearest point on a vector (shoreline). Here we will use a very simple approach based on nearby cells.\n", + "\n", + "We start by getting the cell indices of the NA points. Then we see if any of the adjacent cells is valid (using the queen movement, which looks into the 8 adjacent cells). If so, we chose the one that is geographically close (considering distance)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "crab$ID <- 1:nrow(crab)\n", + "coordnames <- c(\"decimalLongitude\", \"decimalLatitude\")\n", + "\n", + "non_valid_coord <- crab[is.na(crab_extract$sea_surface_temperature), c(coordnames, \"ID\")]\n", + "\n", + "for (i in seq_len(nrow(non_valid_coord))) {\n", + " tcell <- cellFromXY(sst, data.frame(non_valid_coord[i,1:2]))\n", + " adj_cells <- adjacent(sst, tcell, \"queen\")\n", + " adj_values <- terra::extract(sst, as.vector(adj_cells))\n", + " valid_cells <- adj_cells[!is.na(as.vector(adj_values[,1]))]\n", + " if (length(valid_cells) > 0) {\n", + " if (length(valid_cells) == 1) {\n", + " non_valid_coord[i, coordnames] <- xyFromCell(sst, valid_cells)\n", + " } else {\n", + " valid_xy <- xyFromCell(sst, valid_cells)\n", + " valid_dists <- distance(\n", + " vect(data.frame(non_valid_coord[i,1:2]), geom = coordnames, crs = \"EPSG:4326\"),\n", + " vect(data.frame(valid_xy), geom = c(\"x\", \"y\"), crs = \"EPSG:4326\")\n", + " )\n", + " nearest_pt <- valid_xy[order(valid_dists),][1,]\n", + " non_valid_coord[i, coordnames] <- data.frame(x = nearest_pt[1], y = nearest_pt[2])\n", + " }\n", + " } \n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "crab_merged <- merge(crab, non_valid_coord, by = \"ID\", all.x = TRUE, suffixes = c(\"\", \"_updated\"))\n", + "\n", + "crab_merged$decimalLongitude <- ifelse(is.na(crab_merged$decimalLongitude_updated),\n", + " crab_merged$decimalLongitude,\n", + " crab_merged$decimalLongitude_updated)\n", + "crab_merged$decimalLatitude <- ifelse(is.na(crab_merged$decimalLatitude_updated),\n", + " crab_merged$decimalLatitude,\n", + " crab_merged$decimalLatitude_updated)\n", + "\n", + "crab_merged <- crab_merged[, !grepl(\"_updated\", colnames(crab_merged))]\n", + "\n", + "head(crab_merged)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Extract the data again\n", + "crab_extract_prox <- terra::extract(sst, crab_merged[,c(\"decimalLongitude\", \"decimalLatitude\")])\n", + "\n", + "crab_on_sea_sst_prox <- crab[!is.na(crab_extract_prox$sea_surface_temperature),]\n", + "\n", + "nrow(crab_on_sea_sst_prox)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "R", + "language": "R", + "name": "ir" + }, + "language_info": { + "codemirror_mode": "r", + "file_extension": ".r", + "mimetype": "text/x-r-source", + "name": "R", + "pygments_lexer": "r", + "version": "4.3.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}