src/dwc_mapping.Rmd

---
title: "Darwin Core mapping of VMM macrophytes data"
author:
- Damiano Oldoni
- Peter Desmet
date: "`r Sys.Date()`"
output:
  html_document:
    df_print: paged
    number_sections: yes
    toc: yes
    toc_depth: 3
    toc_float: yes
---

# Setup 

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = TRUE)
```

Install required libraries (if not yet installed):

```{r install_pkgs}
installed <- rownames(installed.packages())
required <- c("dplyr", "here", "readr", "glue", 
              "readxl", "DBI", "RSQLite", "sf", "digest")
if (!all(required %in% installed)) {
  install.packages(required[!required %in% installed])
}
```

Load libraries:

```{r load_pkgs, message = FALSE}
library(dplyr)          # To transform data
library(readr)          # To read and write tabular text files
library(here)           # To find files
library(glue)           # To insert variables in strings
library(readxl)         # To read Excel files
library(DBI)            # To create and query databases
library(RSQLite)        # To work with SQLite databases
library(sf)             # To work with spatial data
library(digest)         # To generate hashes
```

# Read source data

The source data were provided manually and stored in `data/raw`. If this workflow can be fully automated, replace with querying from VMM endpoints.

Read source files:

```{r read_raw_data}
raw_data_filename <- "macrofyten 2022-10-04T08_39_42.720Z.xlsx"
observations <-
  readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "opnames_lijst") %>%
  mutate(across(everything(), as.character))
vegetations <-
  readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "vegetatie-ontwikkeling") %>%
  mutate(across(everything(), as.character))
locations <-
  readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "meetplaatsen") %>%
  mutate(across(everything(), as.character))
features <-
  readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "kenmerken") %>%
  mutate(across(everything(), as.character))
measurements <-
  readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "veldmetingen (FC)") %>%
  mutate(across(everything(), as.character))
ecologic_coeffs <-
  readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "EKC MAF") %>%
  mutate(across(everything(), as.character))
```

## Calculate geographical coordinates in WGS 84

We retrieved the coordinates in [WGS 84](https://epsg.io/4326) (EPSG:4326) based on the original coordinates in [Belgian Lambert 72](https://epsg.io/31370) (EPSG:31370):

```{r transform_to_wgs84}
locations <-
  locations %>%
  st_as_sf(coords = c("X", "Y"), crs = 31370) %>%
  st_transform(crs = 4326) %>%
  st_coordinates() %>%
  as.data.frame() %>%
  rename(decimalLongitude = X, decimalLatitude = Y) %>%
  bind_cols(locations) %>%
  relocate(colnames(locations))
head(locations)
```

## Generate hashes for species names and growth forms

We generate a hash based on the combination of species name as saved in column `Macrofyt Naam` and growth form as saved in column `groeivorm`. This is needed to create a unique `occurrenceID` of the form `eventID:hash` where `eventID` is the unique identifier of the event as defined in column `deelmonster_id`. As long as the combination of species name and growth form doesn't change, the hash and so the `occurrenceID` will be stable:

```{r generate_hashes}
vdigest <- Vectorize(digest)
# Generate hashes
observations <-
  observations %>% 
  mutate(species_name_hash = vdigest(paste0(.data$`Macrofyt Naam`,
                                            .data$groeivorm),
                                     algo = "md5"))
```

# Create database

Create a SQLite database with the source data, so it can be queried with SQL in the next steps:

```{r create_db}
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Import data
DBI::dbWriteTable(con, "observations", observations)
DBI::dbWriteTable(con, "vegetations", vegetations)
DBI::dbWriteTable(con, "locations", locations)
DBI::dbWriteTable(con, "features", features)
DBI::dbWriteTable(con, "measurements", measurements)
DBI::dbWriteTable(con, "ecologic_coeffs", ecologic_coeffs)
```

# Darwin Core mapping

Create [Event](https://rs.gbif.org/core/dwc_event_2022-02-02.xml) core:

```{r event}
dwc_event_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_event.sql")), .con = con)
dwc_event <- DBI::dbGetQuery(con, dwc_event_sql)
```

Create [Occurrence](https://rs.gbif.org/core/dwc_occurrence_2022-02-02.xml) extension:

```{r occurrence}
dwc_occurrence_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_occurrence.sql")), .con = con)
dwc_occurrence <- DBI::dbGetQuery(con, dwc_occurrence_sql)
```

Create [Measurement Or Facts](https://rs.gbif.org/extension/dwc/measurements_or_facts_2022-02-02.xml) extension:

```{r}
dwc_mof_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_mof.sql")), .con = con)
dwc_mof <- DBI::dbGetQuery(con, dwc_mof_sql)
```

Disconnect from SQLite database:

```{r disconnect}
DBI::dbDisconnect(con)
```

# Save data to CSV

```{r save_csv}
write_csv(dwc_event, here::here("data", "processed", "event.csv"), na = "")
write_csv(dwc_occurrence, here::here("data", "processed", "occurrence.csv"), na = "")
write_csv(dwc_mof, here::here("data", "processed", "mof.csv"), na = "")
```

# Test output

Load tests and run them to validate the DwC mapping:

```{r run_tests, message=FALSE}
source(here::here("tests", "test_dwc_event_occurrence_mof.R"))
```