Create remove_records_on_land.ipynb

iobis · Oct 9, 2024 · 55e55c3 · 55e55c3
1 parent d37a40c
commit 55e55c3
Showing 1 changed file with 396 additions and 0 deletions.
diff --git a/notebooks/R/remove_records_on_land.ipynb b/notebooks/R/remove_records_on_land.ipynb
@@ -0,0 +1,396 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Removing records on land\n",
+    "\n",
+    "When you download data from OBIS, one of the first steps in the data cleaning is usually to remove records on land. A species might be recorded on land by many reasons: positional uncertainty, misidentification, records on museums or aquariums, etc.\n",
+    "\n",
+    "In this short notebook we will explore how to remove records on land using two different tools:\n",
+    "\n",
+    "1. The `obistools` package\n",
+    "2. A raster file containing environmental information\n",
+    "\n",
+    "We will also explore how we can approximate records on land (what might be useful in situations like intertidal species)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The `obistools` package\n",
+    "\n",
+    "The `obistools` R package provide tools for data enhancement and quality control. You can find it [here]() and install using:\n",
+    "\n",
+    "`devtools::install_github(\"iobis/obistools\")`\n",
+    "\n",
+    "Many functions are intended for improving data **before** submitting it to OBIS, but others can be used after data download. We will specifically use the function `obistools::check_onland()`\n",
+    "\n",
+    "You need to supply only one argument (`data`), a `data.frame`  with columns `decimalLongitude` and `decimalLatitude`. Additionally you can supply your own polygon for the land (in `sf` format).\n",
+    "\n",
+    "We will first download some data to play with. We will use the fiddler crab _Minuca rapax_ as an example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "library(obistools)\n",
+    "library(patchwork)\n",
+    "library(ggplot2)\n",
+    "\n",
+    "crab <- robis::occurrence(\"Minuca rapax\")\n",
+    "\n",
+    "head(crab)\n",
+    "\n",
+    "plot_map(crab)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we can check which records are on land."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "crab_on_land <- check_onland(crab, report = TRUE)\n",
+    "print(crab_on_land)\n",
+    "nrow(crab_on_land)\n",
+    "nrow(crab)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Almost all records, but that is expected for an intertidal species. That's why we can also consider a buffer. But first, let's see how we can remove those records on land:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "crab_on_sea <- crab[-1 * crab_on_land$row,]\n",
+    "nrow(crab_on_sea)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The function also has a `buffer` argument, expressed in meters from the shoreline. Let's try with a buffer of 200m"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "crab_on_land_buff <- check_onland(crab, report = TRUE, buffer = 200)\n",
+    "nrow(crab_on_land_buff)\n",
+    "nrow(crab)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A big difference. Let's also produce a version with those records removed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "crab_on_sea_buff <- crab[-1 * crab_on_land_buff$row,]\n",
+    "nrow(crab_on_sea_buff)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "p1 <- plot_map(crab) + ggtitle(\"Original\") + coord_sf(xlim = c(-100, -20))\n",
+    "p2 <- plot_map(crab_on_sea) + ggtitle(\"No buffer\") + coord_sf(xlim = c(-100, -20))\n",
+    "p3 <- plot_map(crab_on_sea_buff) + ggtitle(\"Buffer\") + coord_sf(xlim = c(-100, -20))\n",
+    "\n",
+    "p1+p2+p3+plot_layout(nrow = 1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Using a raster layer\n",
+    "\n",
+    "We can also use a raster layer to remove records on land. This is actually a very easy way to clean up the data and ensure you have only points for which you have coverage in your environmental dataset.\n",
+    "\n",
+    "We start by downloading a raster layer. We will download an SST layer from the NOAA CoralTemp product available [here](https://coastwatch.pfeg.noaa.gov/erddap/files/NOAA_DHW_monthly/).\n",
+    "\n",
+    "All processing will be done using the package `terra`, but you can achieve the same results with package `stars`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "library(terra)\n",
+    "\n",
+    "# Download file\n",
+    "download.file(\"https://coastwatch.pfeg.noaa.gov/erddap/files/NOAA_DHW_monthly/ct5km_sst_ssta_monthly_v31_202401.nc\",\n",
+    "\"sst_202401.nc\", method = \"wget\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "sst <- rast(\"sst_202401.nc\")\n",
+    "\n",
+    "sst\n",
+    "\n",
+    "sst <- sst$sea_surface_temperature\n",
+    "\n",
+    "plot(sst)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To remove records with no valid data, we just need to extract the values according to the longitude and latitude."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# It is always interesting to add terra:: before the extract function\n",
+    "# because extract is a common name for function on the tidyverse and some other packages.\n",
+    "crab_extract <- terra::extract(sst, crab[,c(\"decimalLongitude\", \"decimalLatitude\")])\n",
+    "\n",
+    "# If you encounter an error, convert crab[,c(\"decimalLongitude\", \"decimalLatitude\")] into\n",
+    "# a data.frame using data.frame(crab[,c(\"decimalLongitude\", \"decimalLatitude\")])\n",
+    "\n",
+    "head(crab_extract)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We see that we have many NAs, that is points for which there is no environmental information. We remove those."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "crab_on_sea_sst <- crab[!is.na(crab_extract$sea_surface_temperature),]\n",
+    "\n",
+    "nrow(crab_on_sea_sst)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Plot\n",
+    "crab_plot <- crab\n",
+    "crab_plot$status <- ifelse(is.na(crab_extract$sea_surface_temperature), \"On land\", \"On sea\")\n",
+    "\n",
+    "sst_crop <- crop(sst, ext(vect(crab_plot, geom = c(\"decimalLongitude\", \"decimalLatitude\"))))\n",
+    "sst_crop <- as.data.frame(sst_crop, xy = T)\n",
+    "\n",
+    "ggplot() +\n",
+    "    geom_raster(data = sst_crop, aes(x = x, y = y, fill = sea_surface_temperature)) +\n",
+    "    geom_point(data = crab_plot,\n",
+    "     aes(x = decimalLongitude, y = decimalLatitude, \n",
+    "     color = status), size = 2) +\n",
+    "    scale_fill_viridis_c() +\n",
+    "    scale_color_manual(values = c(\"#120101\", \"#b100cc\")) +\n",
+    "    theme_light() +\n",
+    "    coord_equal()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Approximating points to land\n",
+    "\n",
+    "We can use that same raster layer to approximate points to land, that is, bring an invalid point to the closest valid cell.\n",
+    "\n",
+    "There are many ways of doing that, and in fact you can also approximate points to the nearest point on a vector (shoreline). Here we will use a very simple approach based on nearby cells.\n",
+    "\n",
+    "We start by getting the cell indices of the NA points. Then we see if any of the adjacent cells is valid (using the queen movement, which looks into the 8 adjacent cells). If so, we chose the one that is geographically close (considering distance)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "crab$ID <- 1:nrow(crab)\n",
+    "coordnames <- c(\"decimalLongitude\", \"decimalLatitude\")\n",
+    "\n",
+    "non_valid_coord <- crab[is.na(crab_extract$sea_surface_temperature), c(coordnames, \"ID\")]\n",
+    "\n",
+    "for (i in seq_len(nrow(non_valid_coord))) {\n",
+    "    tcell <- cellFromXY(sst, data.frame(non_valid_coord[i,1:2]))\n",
+    "    adj_cells <- adjacent(sst, tcell, \"queen\")\n",
+    "    adj_values <- terra::extract(sst, as.vector(adj_cells))\n",
+    "    valid_cells <- adj_cells[!is.na(as.vector(adj_values[,1]))]\n",
+    "    if (length(valid_cells) > 0) {\n",
+    "        if (length(valid_cells) == 1) {\n",
+    "            non_valid_coord[i, coordnames] <- xyFromCell(sst, valid_cells)\n",
+    "        } else {\n",
+    "            valid_xy <- xyFromCell(sst, valid_cells)\n",
+    "            valid_dists <- distance(\n",
+    "                vect(data.frame(non_valid_coord[i,1:2]), geom = coordnames, crs = \"EPSG:4326\"),\n",
+    "                vect(data.frame(valid_xy), geom = c(\"x\", \"y\"), crs = \"EPSG:4326\")\n",
+    "            )\n",
+    "            nearest_pt <- valid_xy[order(valid_dists),][1,]\n",
+    "            non_valid_coord[i, coordnames] <- data.frame(x = nearest_pt[1], y = nearest_pt[2])\n",
+    "        }\n",
+    "    } \n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "crab_merged <- merge(crab, non_valid_coord, by = \"ID\", all.x = TRUE, suffixes = c(\"\", \"_updated\"))\n",
+    "\n",
+    "crab_merged$decimalLongitude <- ifelse(is.na(crab_merged$decimalLongitude_updated),\n",
+    "    crab_merged$decimalLongitude,\n",
+    "    crab_merged$decimalLongitude_updated)\n",
+    "crab_merged$decimalLatitude <- ifelse(is.na(crab_merged$decimalLatitude_updated),\n",
+    "    crab_merged$decimalLatitude,\n",
+    "    crab_merged$decimalLatitude_updated)\n",
+    "\n",
+    "crab_merged <- crab_merged[, !grepl(\"_updated\", colnames(crab_merged))]\n",
+    "\n",
+    "head(crab_merged)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Extract the data again\n",
+    "crab_extract_prox <- terra::extract(sst, crab_merged[,c(\"decimalLongitude\", \"decimalLatitude\")])\n",
+    "\n",
+    "crab_on_sea_sst_prox <- crab[!is.na(crab_extract_prox$sea_surface_temperature),]\n",
+    "\n",
+    "nrow(crab_on_sea_sst_prox)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "R",
+   "language": "R",
+   "name": "ir"
+  },
+  "language_info": {
+   "codemirror_mode": "r",
+   "file_extension": ".r",
+   "mimetype": "text/x-r-source",
+   "name": "R",
+   "pygments_lexer": "r",
+   "version": "4.3.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}