Skip to content

Commit

Permalink
Create remove_records_on_land.ipynb
Browse files Browse the repository at this point in the history
  • Loading branch information
silasprincipe committed Oct 9, 2024
1 parent d37a40c commit 55e55c3
Showing 1 changed file with 396 additions and 0 deletions.
396 changes: 396 additions & 0 deletions notebooks/R/remove_records_on_land.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,396 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Removing records on land\n",
"\n",
"When you download data from OBIS, one of the first steps in the data cleaning is usually to remove records on land. A species might be recorded on land by many reasons: positional uncertainty, misidentification, records on museums or aquariums, etc.\n",
"\n",
"In this short notebook we will explore how to remove records on land using two different tools:\n",
"\n",
"1. The `obistools` package\n",
"2. A raster file containing environmental information\n",
"\n",
"We will also explore how we can approximate records on land (what might be useful in situations like intertidal species)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `obistools` package\n",
"\n",
"The `obistools` R package provide tools for data enhancement and quality control. You can find it [here]() and install using:\n",
"\n",
"`devtools::install_github(\"iobis/obistools\")`\n",
"\n",
"Many functions are intended for improving data **before** submitting it to OBIS, but others can be used after data download. We will specifically use the function `obistools::check_onland()`\n",
"\n",
"You need to supply only one argument (`data`), a `data.frame` with columns `decimalLongitude` and `decimalLatitude`. Additionally you can supply your own polygon for the land (in `sf` format).\n",
"\n",
"We will first download some data to play with. We will use the fiddler crab _Minuca rapax_ as an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"library(obistools)\n",
"library(patchwork)\n",
"library(ggplot2)\n",
"\n",
"crab <- robis::occurrence(\"Minuca rapax\")\n",
"\n",
"head(crab)\n",
"\n",
"plot_map(crab)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can check which records are on land."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"crab_on_land <- check_onland(crab, report = TRUE)\n",
"print(crab_on_land)\n",
"nrow(crab_on_land)\n",
"nrow(crab)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Almost all records, but that is expected for an intertidal species. That's why we can also consider a buffer. But first, let's see how we can remove those records on land:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"crab_on_sea <- crab[-1 * crab_on_land$row,]\n",
"nrow(crab_on_sea)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function also has a `buffer` argument, expressed in meters from the shoreline. Let's try with a buffer of 200m"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"crab_on_land_buff <- check_onland(crab, report = TRUE, buffer = 200)\n",
"nrow(crab_on_land_buff)\n",
"nrow(crab)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A big difference. Let's also produce a version with those records removed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"crab_on_sea_buff <- crab[-1 * crab_on_land_buff$row,]\n",
"nrow(crab_on_sea_buff)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"p1 <- plot_map(crab) + ggtitle(\"Original\") + coord_sf(xlim = c(-100, -20))\n",
"p2 <- plot_map(crab_on_sea) + ggtitle(\"No buffer\") + coord_sf(xlim = c(-100, -20))\n",
"p3 <- plot_map(crab_on_sea_buff) + ggtitle(\"Buffer\") + coord_sf(xlim = c(-100, -20))\n",
"\n",
"p1+p2+p3+plot_layout(nrow = 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using a raster layer\n",
"\n",
"We can also use a raster layer to remove records on land. This is actually a very easy way to clean up the data and ensure you have only points for which you have coverage in your environmental dataset.\n",
"\n",
"We start by downloading a raster layer. We will download an SST layer from the NOAA CoralTemp product available [here](https://coastwatch.pfeg.noaa.gov/erddap/files/NOAA_DHW_monthly/).\n",
"\n",
"All processing will be done using the package `terra`, but you can achieve the same results with package `stars`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"library(terra)\n",
"\n",
"# Download file\n",
"download.file(\"https://coastwatch.pfeg.noaa.gov/erddap/files/NOAA_DHW_monthly/ct5km_sst_ssta_monthly_v31_202401.nc\",\n",
"\"sst_202401.nc\", method = \"wget\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"sst <- rast(\"sst_202401.nc\")\n",
"\n",
"sst\n",
"\n",
"sst <- sst$sea_surface_temperature\n",
"\n",
"plot(sst)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To remove records with no valid data, we just need to extract the values according to the longitude and latitude."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# It is always interesting to add terra:: before the extract function\n",
"# because extract is a common name for function on the tidyverse and some other packages.\n",
"crab_extract <- terra::extract(sst, crab[,c(\"decimalLongitude\", \"decimalLatitude\")])\n",
"\n",
"# If you encounter an error, convert crab[,c(\"decimalLongitude\", \"decimalLatitude\")] into\n",
"# a data.frame using data.frame(crab[,c(\"decimalLongitude\", \"decimalLatitude\")])\n",
"\n",
"head(crab_extract)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that we have many NAs, that is points for which there is no environmental information. We remove those."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"crab_on_sea_sst <- crab[!is.na(crab_extract$sea_surface_temperature),]\n",
"\n",
"nrow(crab_on_sea_sst)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Plot\n",
"crab_plot <- crab\n",
"crab_plot$status <- ifelse(is.na(crab_extract$sea_surface_temperature), \"On land\", \"On sea\")\n",
"\n",
"sst_crop <- crop(sst, ext(vect(crab_plot, geom = c(\"decimalLongitude\", \"decimalLatitude\"))))\n",
"sst_crop <- as.data.frame(sst_crop, xy = T)\n",
"\n",
"ggplot() +\n",
" geom_raster(data = sst_crop, aes(x = x, y = y, fill = sea_surface_temperature)) +\n",
" geom_point(data = crab_plot,\n",
" aes(x = decimalLongitude, y = decimalLatitude, \n",
" color = status), size = 2) +\n",
" scale_fill_viridis_c() +\n",
" scale_color_manual(values = c(\"#120101\", \"#b100cc\")) +\n",
" theme_light() +\n",
" coord_equal()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Approximating points to land\n",
"\n",
"We can use that same raster layer to approximate points to land, that is, bring an invalid point to the closest valid cell.\n",
"\n",
"There are many ways of doing that, and in fact you can also approximate points to the nearest point on a vector (shoreline). Here we will use a very simple approach based on nearby cells.\n",
"\n",
"We start by getting the cell indices of the NA points. Then we see if any of the adjacent cells is valid (using the queen movement, which looks into the 8 adjacent cells). If so, we chose the one that is geographically close (considering distance)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"crab$ID <- 1:nrow(crab)\n",
"coordnames <- c(\"decimalLongitude\", \"decimalLatitude\")\n",
"\n",
"non_valid_coord <- crab[is.na(crab_extract$sea_surface_temperature), c(coordnames, \"ID\")]\n",
"\n",
"for (i in seq_len(nrow(non_valid_coord))) {\n",
" tcell <- cellFromXY(sst, data.frame(non_valid_coord[i,1:2]))\n",
" adj_cells <- adjacent(sst, tcell, \"queen\")\n",
" adj_values <- terra::extract(sst, as.vector(adj_cells))\n",
" valid_cells <- adj_cells[!is.na(as.vector(adj_values[,1]))]\n",
" if (length(valid_cells) > 0) {\n",
" if (length(valid_cells) == 1) {\n",
" non_valid_coord[i, coordnames] <- xyFromCell(sst, valid_cells)\n",
" } else {\n",
" valid_xy <- xyFromCell(sst, valid_cells)\n",
" valid_dists <- distance(\n",
" vect(data.frame(non_valid_coord[i,1:2]), geom = coordnames, crs = \"EPSG:4326\"),\n",
" vect(data.frame(valid_xy), geom = c(\"x\", \"y\"), crs = \"EPSG:4326\")\n",
" )\n",
" nearest_pt <- valid_xy[order(valid_dists),][1,]\n",
" non_valid_coord[i, coordnames] <- data.frame(x = nearest_pt[1], y = nearest_pt[2])\n",
" }\n",
" } \n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"crab_merged <- merge(crab, non_valid_coord, by = \"ID\", all.x = TRUE, suffixes = c(\"\", \"_updated\"))\n",
"\n",
"crab_merged$decimalLongitude <- ifelse(is.na(crab_merged$decimalLongitude_updated),\n",
" crab_merged$decimalLongitude,\n",
" crab_merged$decimalLongitude_updated)\n",
"crab_merged$decimalLatitude <- ifelse(is.na(crab_merged$decimalLatitude_updated),\n",
" crab_merged$decimalLatitude,\n",
" crab_merged$decimalLatitude_updated)\n",
"\n",
"crab_merged <- crab_merged[, !grepl(\"_updated\", colnames(crab_merged))]\n",
"\n",
"head(crab_merged)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Extract the data again\n",
"crab_extract_prox <- terra::extract(sst, crab_merged[,c(\"decimalLongitude\", \"decimalLatitude\")])\n",
"\n",
"crab_on_sea_sst_prox <- crab[!is.na(crab_extract_prox$sea_surface_temperature),]\n",
"\n",
"nrow(crab_on_sea_sst_prox)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.3.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

0 comments on commit 55e55c3

Please sign in to comment.