-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
d37a40c
commit 55e55c3
Showing
1 changed file
with
396 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,396 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Removing records on land\n", | ||
"\n", | ||
"When you download data from OBIS, one of the first steps in the data cleaning is usually to remove records on land. A species might be recorded on land by many reasons: positional uncertainty, misidentification, records on museums or aquariums, etc.\n", | ||
"\n", | ||
"In this short notebook we will explore how to remove records on land using two different tools:\n", | ||
"\n", | ||
"1. The `obistools` package\n", | ||
"2. A raster file containing environmental information\n", | ||
"\n", | ||
"We will also explore how we can approximate records on land (what might be useful in situations like intertidal species)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## The `obistools` package\n", | ||
"\n", | ||
"The `obistools` R package provide tools for data enhancement and quality control. You can find it [here]() and install using:\n", | ||
"\n", | ||
"`devtools::install_github(\"iobis/obistools\")`\n", | ||
"\n", | ||
"Many functions are intended for improving data **before** submitting it to OBIS, but others can be used after data download. We will specifically use the function `obistools::check_onland()`\n", | ||
"\n", | ||
"You need to supply only one argument (`data`), a `data.frame` with columns `decimalLongitude` and `decimalLatitude`. Additionally you can supply your own polygon for the land (in `sf` format).\n", | ||
"\n", | ||
"We will first download some data to play with. We will use the fiddler crab _Minuca rapax_ as an example." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"library(obistools)\n", | ||
"library(patchwork)\n", | ||
"library(ggplot2)\n", | ||
"\n", | ||
"crab <- robis::occurrence(\"Minuca rapax\")\n", | ||
"\n", | ||
"head(crab)\n", | ||
"\n", | ||
"plot_map(crab)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now we can check which records are on land." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"crab_on_land <- check_onland(crab, report = TRUE)\n", | ||
"print(crab_on_land)\n", | ||
"nrow(crab_on_land)\n", | ||
"nrow(crab)\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Almost all records, but that is expected for an intertidal species. That's why we can also consider a buffer. But first, let's see how we can remove those records on land:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"crab_on_sea <- crab[-1 * crab_on_land$row,]\n", | ||
"nrow(crab_on_sea)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The function also has a `buffer` argument, expressed in meters from the shoreline. Let's try with a buffer of 200m" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"crab_on_land_buff <- check_onland(crab, report = TRUE, buffer = 200)\n", | ||
"nrow(crab_on_land_buff)\n", | ||
"nrow(crab)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"A big difference. Let's also produce a version with those records removed." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"crab_on_sea_buff <- crab[-1 * crab_on_land_buff$row,]\n", | ||
"nrow(crab_on_sea_buff)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"p1 <- plot_map(crab) + ggtitle(\"Original\") + coord_sf(xlim = c(-100, -20))\n", | ||
"p2 <- plot_map(crab_on_sea) + ggtitle(\"No buffer\") + coord_sf(xlim = c(-100, -20))\n", | ||
"p3 <- plot_map(crab_on_sea_buff) + ggtitle(\"Buffer\") + coord_sf(xlim = c(-100, -20))\n", | ||
"\n", | ||
"p1+p2+p3+plot_layout(nrow = 1)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Using a raster layer\n", | ||
"\n", | ||
"We can also use a raster layer to remove records on land. This is actually a very easy way to clean up the data and ensure you have only points for which you have coverage in your environmental dataset.\n", | ||
"\n", | ||
"We start by downloading a raster layer. We will download an SST layer from the NOAA CoralTemp product available [here](https://coastwatch.pfeg.noaa.gov/erddap/files/NOAA_DHW_monthly/).\n", | ||
"\n", | ||
"All processing will be done using the package `terra`, but you can achieve the same results with package `stars`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"library(terra)\n", | ||
"\n", | ||
"# Download file\n", | ||
"download.file(\"https://coastwatch.pfeg.noaa.gov/erddap/files/NOAA_DHW_monthly/ct5km_sst_ssta_monthly_v31_202401.nc\",\n", | ||
"\"sst_202401.nc\", method = \"wget\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"sst <- rast(\"sst_202401.nc\")\n", | ||
"\n", | ||
"sst\n", | ||
"\n", | ||
"sst <- sst$sea_surface_temperature\n", | ||
"\n", | ||
"plot(sst)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"To remove records with no valid data, we just need to extract the values according to the longitude and latitude." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# It is always interesting to add terra:: before the extract function\n", | ||
"# because extract is a common name for function on the tidyverse and some other packages.\n", | ||
"crab_extract <- terra::extract(sst, crab[,c(\"decimalLongitude\", \"decimalLatitude\")])\n", | ||
"\n", | ||
"# If you encounter an error, convert crab[,c(\"decimalLongitude\", \"decimalLatitude\")] into\n", | ||
"# a data.frame using data.frame(crab[,c(\"decimalLongitude\", \"decimalLatitude\")])\n", | ||
"\n", | ||
"head(crab_extract)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We see that we have many NAs, that is points for which there is no environmental information. We remove those." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"crab_on_sea_sst <- crab[!is.na(crab_extract$sea_surface_temperature),]\n", | ||
"\n", | ||
"nrow(crab_on_sea_sst)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Plot\n", | ||
"crab_plot <- crab\n", | ||
"crab_plot$status <- ifelse(is.na(crab_extract$sea_surface_temperature), \"On land\", \"On sea\")\n", | ||
"\n", | ||
"sst_crop <- crop(sst, ext(vect(crab_plot, geom = c(\"decimalLongitude\", \"decimalLatitude\"))))\n", | ||
"sst_crop <- as.data.frame(sst_crop, xy = T)\n", | ||
"\n", | ||
"ggplot() +\n", | ||
" geom_raster(data = sst_crop, aes(x = x, y = y, fill = sea_surface_temperature)) +\n", | ||
" geom_point(data = crab_plot,\n", | ||
" aes(x = decimalLongitude, y = decimalLatitude, \n", | ||
" color = status), size = 2) +\n", | ||
" scale_fill_viridis_c() +\n", | ||
" scale_color_manual(values = c(\"#120101\", \"#b100cc\")) +\n", | ||
" theme_light() +\n", | ||
" coord_equal()\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Approximating points to land\n", | ||
"\n", | ||
"We can use that same raster layer to approximate points to land, that is, bring an invalid point to the closest valid cell.\n", | ||
"\n", | ||
"There are many ways of doing that, and in fact you can also approximate points to the nearest point on a vector (shoreline). Here we will use a very simple approach based on nearby cells.\n", | ||
"\n", | ||
"We start by getting the cell indices of the NA points. Then we see if any of the adjacent cells is valid (using the queen movement, which looks into the 8 adjacent cells). If so, we chose the one that is geographically close (considering distance)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"crab$ID <- 1:nrow(crab)\n", | ||
"coordnames <- c(\"decimalLongitude\", \"decimalLatitude\")\n", | ||
"\n", | ||
"non_valid_coord <- crab[is.na(crab_extract$sea_surface_temperature), c(coordnames, \"ID\")]\n", | ||
"\n", | ||
"for (i in seq_len(nrow(non_valid_coord))) {\n", | ||
" tcell <- cellFromXY(sst, data.frame(non_valid_coord[i,1:2]))\n", | ||
" adj_cells <- adjacent(sst, tcell, \"queen\")\n", | ||
" adj_values <- terra::extract(sst, as.vector(adj_cells))\n", | ||
" valid_cells <- adj_cells[!is.na(as.vector(adj_values[,1]))]\n", | ||
" if (length(valid_cells) > 0) {\n", | ||
" if (length(valid_cells) == 1) {\n", | ||
" non_valid_coord[i, coordnames] <- xyFromCell(sst, valid_cells)\n", | ||
" } else {\n", | ||
" valid_xy <- xyFromCell(sst, valid_cells)\n", | ||
" valid_dists <- distance(\n", | ||
" vect(data.frame(non_valid_coord[i,1:2]), geom = coordnames, crs = \"EPSG:4326\"),\n", | ||
" vect(data.frame(valid_xy), geom = c(\"x\", \"y\"), crs = \"EPSG:4326\")\n", | ||
" )\n", | ||
" nearest_pt <- valid_xy[order(valid_dists),][1,]\n", | ||
" non_valid_coord[i, coordnames] <- data.frame(x = nearest_pt[1], y = nearest_pt[2])\n", | ||
" }\n", | ||
" } \n", | ||
"}" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"crab_merged <- merge(crab, non_valid_coord, by = \"ID\", all.x = TRUE, suffixes = c(\"\", \"_updated\"))\n", | ||
"\n", | ||
"crab_merged$decimalLongitude <- ifelse(is.na(crab_merged$decimalLongitude_updated),\n", | ||
" crab_merged$decimalLongitude,\n", | ||
" crab_merged$decimalLongitude_updated)\n", | ||
"crab_merged$decimalLatitude <- ifelse(is.na(crab_merged$decimalLatitude_updated),\n", | ||
" crab_merged$decimalLatitude,\n", | ||
" crab_merged$decimalLatitude_updated)\n", | ||
"\n", | ||
"crab_merged <- crab_merged[, !grepl(\"_updated\", colnames(crab_merged))]\n", | ||
"\n", | ||
"head(crab_merged)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "r" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Extract the data again\n", | ||
"crab_extract_prox <- terra::extract(sst, crab_merged[,c(\"decimalLongitude\", \"decimalLatitude\")])\n", | ||
"\n", | ||
"crab_on_sea_sst_prox <- crab[!is.na(crab_extract_prox$sea_surface_temperature),]\n", | ||
"\n", | ||
"nrow(crab_on_sea_sst_prox)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "R", | ||
"language": "R", | ||
"name": "ir" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": "r", | ||
"file_extension": ".r", | ||
"mimetype": "text/x-r-source", | ||
"name": "R", | ||
"pygments_lexer": "r", | ||
"version": "4.3.1" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |