generated from trias-project/checklist-recipe
-
Notifications
You must be signed in to change notification settings - Fork 0
/
dwc_mapping.Rmd
164 lines (131 loc) · 5.44 KB
/
dwc_mapping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
---
title: "Darwin Core mapping of VMM macrophytes data"
author:
- Damiano Oldoni
- Peter Desmet
date: "`r Sys.Date()`"
output:
html_document:
df_print: paged
number_sections: yes
toc: yes
toc_depth: 3
toc_float: yes
---
# Setup
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = TRUE)
```
Install required libraries (if not yet installed):
```{r install_pkgs}
installed <- rownames(installed.packages())
required <- c("dplyr", "here", "readr", "glue",
"readxl", "DBI", "RSQLite", "sf", "digest")
if (!all(required %in% installed)) {
install.packages(required[!required %in% installed])
}
```
Load libraries:
```{r load_pkgs, message = FALSE}
library(dplyr) # To transform data
library(readr) # To read and write tabular text files
library(here) # To find files
library(glue) # To insert variables in strings
library(readxl) # To read Excel files
library(DBI) # To create and query databases
library(RSQLite) # To work with SQLite databases
library(sf) # To work with spatial data
library(digest) # To generate hashes
```
# Read source data
The source data were provided manually and stored in `data/raw`. If this workflow can be fully automated, replace with querying from VMM endpoints.
Read source files:
```{r read_raw_data}
raw_data_filename <- "macrofyten 2022-10-04T08_39_42.720Z.xlsx"
observations <-
readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "opnames_lijst") %>%
mutate(across(everything(), as.character))
vegetations <-
readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "vegetatie-ontwikkeling") %>%
mutate(across(everything(), as.character))
locations <-
readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "meetplaatsen") %>%
mutate(across(everything(), as.character))
features <-
readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "kenmerken") %>%
mutate(across(everything(), as.character))
measurements <-
readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "veldmetingen (FC)") %>%
mutate(across(everything(), as.character))
ecologic_coeffs <-
readxl::read_xlsx(here::here("data", "raw", raw_data_filename), sheet = "EKC MAF") %>%
mutate(across(everything(), as.character))
```
## Calculate geographical coordinates in WGS 84
We retrieved the coordinates in [WGS 84](https://epsg.io/4326) (EPSG:4326) based on the original coordinates in [Belgian Lambert 72](https://epsg.io/31370) (EPSG:31370):
```{r transform_to_wgs84}
locations <-
locations %>%
st_as_sf(coords = c("X", "Y"), crs = 31370) %>%
st_transform(crs = 4326) %>%
st_coordinates() %>%
as.data.frame() %>%
rename(decimalLongitude = X, decimalLatitude = Y) %>%
bind_cols(locations) %>%
relocate(colnames(locations))
head(locations)
```
## Generate hashes for species names and growth forms
We generate a hash based on the combination of species name as saved in column `Macrofyt Naam` and growth form as saved in column `groeivorm`. This is needed to create a unique `occurrenceID` of the form `eventID:hash` where `eventID` is the unique identifier of the event as defined in column `deelmonster_id`. As long as the combination of species name and growth form doesn't change, the hash and so the `occurrenceID` will be stable:
```{r generate_hashes}
vdigest <- Vectorize(digest)
# Generate hashes
observations <-
observations %>%
mutate(species_name_hash = vdigest(paste0(.data$`Macrofyt Naam`,
.data$groeivorm),
algo = "md5"))
```
# Create database
Create a SQLite database with the source data, so it can be queried with SQL in the next steps:
```{r create_db}
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Import data
DBI::dbWriteTable(con, "observations", observations)
DBI::dbWriteTable(con, "vegetations", vegetations)
DBI::dbWriteTable(con, "locations", locations)
DBI::dbWriteTable(con, "features", features)
DBI::dbWriteTable(con, "measurements", measurements)
DBI::dbWriteTable(con, "ecologic_coeffs", ecologic_coeffs)
```
# Darwin Core mapping
Create [Event](https://rs.gbif.org/core/dwc_event_2022-02-02.xml) core:
```{r event}
dwc_event_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_event.sql")), .con = con)
dwc_event <- DBI::dbGetQuery(con, dwc_event_sql)
```
Create [Occurrence](https://rs.gbif.org/core/dwc_occurrence_2022-02-02.xml) extension:
```{r occurrence}
dwc_occurrence_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_occurrence.sql")), .con = con)
dwc_occurrence <- DBI::dbGetQuery(con, dwc_occurrence_sql)
```
Create [Measurement Or Facts](https://rs.gbif.org/extension/dwc/measurements_or_facts_2022-02-02.xml) extension:
```{r}
dwc_mof_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_mof.sql")), .con = con)
dwc_mof <- DBI::dbGetQuery(con, dwc_mof_sql)
```
Disconnect from SQLite database:
```{r disconnect}
DBI::dbDisconnect(con)
```
# Save data to CSV
```{r save_csv}
write_csv(dwc_event, here::here("data", "processed", "event.csv"), na = "")
write_csv(dwc_occurrence, here::here("data", "processed", "occurrence.csv"), na = "")
write_csv(dwc_mof, here::here("data", "processed", "mof.csv"), na = "")
```
# Test output
Load tests and run them to validate the DwC mapping:
```{r run_tests, message=FALSE}
source(here::here("tests", "test_dwc_event_occurrence_mof.R"))
```