update data formatting pages & vocab

* added an intro page to orient users to the data formatting section
* moved info on mapping fields to DwC to its own page earlier in the ToC
* updated taxon matching guidelines with code and clarified text
* fixed MBON broken link
iobis · Dec 2, 2024 · a742596
1 parent 7483d81
commit a742596
Showing 9 changed files with 111 additions and 68 deletions.
diff --git a/_bookdown.yml b/_bookdown.yml
@@ -14,10 +14,12 @@ rmd_files:
   - relational_db.md
   - eml.md
   - nodes.md
+  - formatting_intro.md
   - formatting.md
   - identifiers.md
-  - checklist.md
   - name_matching.md
+  - data_map.md
+  - checklist.md
   - format_occurrence.md
   - format_event.md
   - format_emof.md

diff --git a/data_map.md b/data_map.md
@@ -0,0 +1,39 @@
+## Map data fields to Darwin Core
+
+There are many possible ways of setting up your datasheets, and if you are new to OBIS you likely did not use standardized Darwin Core (DwC) or BODC vocabulary before samples were collected. In mapping your data fields to DwC we recommend documenting your choices so you have a reference to go back to should the need arise. In such a document you should take notes on the choices you made, as well as any actions you had to take (e.g. separate one column into many, convert dates or coordinates, etc.).
+
+For example, a DwC mapping reference table could look like the following:
+
+| Verbatim field name | Mapped DwC term | Actions taken | Notes |
+|:-------|:-------|:------------|:--------------|
+| date | eventDate | convert dates to ISO |  |
+| coordinates | decimalLongitude, decimalLatitude | convert ddmmss to decimal degrees; separate one column into two for longitude and latitude | put original coordinates into verbatimCoordinates |
+
+To help you map your data to DwC terms, the table below outlines some common data fields, their associated Darwin Core terms, and the data table each field is likely to go in:
+
+| Common Raw Terms | DwC Field | Data table |
+|:----------------- |:----------------------------- |:---------|
+| Date, Time | eventDate | Event, Occurrence |
+| Species, g_s, taxa | scientificName | Occurrence |
+| Any biotic/abiotic measurements* | measurementType, measurementValue, measurementUnit* | eMoF |
+| Depth | maximumDepthInMeters or minimumDepthInMeters | Event, Occurrence |
+| Lat/Latitude, Lon/Long/Longitude, dd | decimalLatitude, decimalLongitude | Event, Occurrence |
+| Sampling method | samplingProtocol | Event, eMoF |
+| Sample size, N, #, No. | sampleSizeValue | Event, eMoF |
+| Location | locality | Event |
+| Presence, absence | occurrenceStatus | Occurrence |
+| Type of record/specimen | basisOfRecord | Occurrence |
+| Person/people that recorded the original Occurrence | recordedBy | Occurrence |
+| ORCID of person/people that recorded the original Occurrence | recordedByID | Occurrence |
+| Person/people that identified the organism | identifiedBy | Occurrence |
+| ORCID of person/people that identified the organism | identifiedByID | Occurrence |
+| Data collector, data creator | recordedBy | Event, Occurrence |
+| Taxonomist, identifier | identifiedBy | Occurrence |
+| Record number, sample number, observation number | occurrenceID (either ID or incorporated into ID) | Occurrence |
+
+<div class=callbox-blue>
+
+`r fontawesome::fa("flag", fill="darkblue", prefer_type="solid")` Note that abiotic/biotic measurement fields (sex, temperature, abundance, lengths, etc.) are mapped within the [extendedMeasurementOrFact extension](format_emof.html). There, each measurement moves from its own column into the `measurementType` and `measurementValue` fields.
+</div>
+
+The `obistools` R package also has the [`map_fields` function](https://github.com/iobis/obistools#map-column-names-to-darwin-core-terms) that you can use to map your dataset fields to a DwC term.
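
The mapping reference table described in `data_map.md` can also be applied in code. The following is a minimal base R sketch, not the `obistools::map_fields` workflow: the raw column names, the values, and the date and coordinate formats are all invented for illustration, and the coordinates are assumed to be decimal already, so no ddmmss conversion is shown.

```r
# Hypothetical raw datasheet; column names and values are invented for illustration
raw <- data.frame(
  date = c("03/15/2021", "03/16/2021"),              # assumed month/day/year
  coordinates = c("-70.25, 43.5", "-70.3, 43.55"),   # assumed "longitude, latitude"
  Species = c("Ginglymostoma cirratum", "Luidia maculata"),
  stringsAsFactors = FALSE
)

# Mapping reference: verbatim field name -> Darwin Core term
field_map <- c(date = "eventDate", Species = "scientificName")

dwc <- raw

# Keep the original coordinate string before splitting it
dwc$verbatimCoordinates <- raw$coordinates
parts <- strsplit(raw$coordinates, ",\\s*")
dwc$decimalLongitude <- as.numeric(sapply(parts, `[`, 1))
dwc$decimalLatitude <- as.numeric(sapply(parts, `[`, 2))
dwc$coordinates <- NULL

# Convert dates to ISO 8601 (YYYY-MM-DD)
dwc$date <- format(as.Date(dwc$date, format = "%m/%d/%Y"), "%Y-%m-%d")

# Rename the mapped columns to their Darwin Core terms
idx <- match(names(field_map), names(dwc))
names(dwc)[idx] <- unname(field_map)

dwc
```

Keeping the mapping in a named vector such as `field_map` doubles as documentation of the choices made, which is the purpose of the reference table.
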
diff --git a/data_qc.md b/data_qc.md
@@ -60,7 +60,7 @@ If you have difficulty installing `obistools`, please try updating your R packag
 
 To use `obistools` to conduct quality control, you can follow the general order below. Please see the [`obistools` GitHub](https://github.com/iobis/obistools) for examples of how to use the functions.
 
-1. Check that the taxa match with WoRMS
+1. Check that the taxon names [match with WoRMS](name_matching.html)
     * [`obistools::match_taxa`](https://github.com/iobis/obistools#taxon-matching)
 2. Check that all required fields are present in the occurrence table
     * [`obistools::check_fields`](https://github.com/iobis/obistools#check-required-fields)

diff --git a/formatting.md b/formatting.md
@@ -1,5 +1,3 @@
-# (PART\*) Data Formatting {-}
-
 # Dataset structure
 
 <div class="callbox-blue">
@@ -8,8 +6,6 @@
 
 </div>
 
-Formatting data can be challenging. This section of the manual deals with how to format data for OBIS, beginning with an overview of dataset structure.
-
 Determining how your dataset will be structured is one of the first steps towards getting your data ready for publishing. At this first step it is important to determine which structure best suits your dataset before proceeding because it will determine which Darwin Core fields will need to be included in your data. Once you have decided on the dataset structure, you can continue formatting the dataset.
 
 We have created the following flow chart for an overview on how to determine what structure best suits your data.

diff --git a/formatting_intro.md b/formatting_intro.md
@@ -0,0 +1,24 @@
+# (PART\*) Data Formatting {-}
+
+# Data formatting workflow
+
+Preparing data can be a challenging process at first. This section of the manual provides guidance on formatting data for OBIS so that it complies with Darwin Core.
+
+The general data formatting workflow you can follow is:
+
+1. Identify your [dataset structure](formatting.html)
+   * Understand the structure of your data and how it fits into Darwin Core
+2. Create [unique identifiers](identifiers.html)
+   * Assign unique identifiers to distinguish between events, nested events, and biological occurrences
+3. [Match](name_matching.html) taxon names
+   * Align taxonomic names to [World Register of Marine Species (WoRMS)](https://www.marinespecies.org/) to ensure consistency and retrieve their associated identifiers
+4. [Map data column](data_map.html) names to Darwin Core
+   * Rename data columns to match [Darwin Core terms](https://dwc.tdwg.org/terms/)
+5. Organize measurements, facts, and information
+   * Structure measurement and fact data in long format in the [extendedMeasurementOrFact table](format_emof.html)
+6. Identify [controlled vocabularies](vocabulary.html) to include with your measurements
+   * Ensure measurements and facts reference appropriate controlled vocabularies for interoperability and clarity
+7. Standardize [other fields](common_formatissues.html)
+   * Verify that all other fields, such as dates and coordinates, conform to standards
+
+The following pages provide a detailed breakdown of each step, including examples and tips to help you through the formatting process. Remember, the OBIS Helpdesk and OBIS Nodes are available to help you.
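
Step 5 of the workflow in `formatting_intro.md` (structuring measurements in long format) can be sketched in base R as follows. This is an illustration under stated assumptions: the wide table, its column names, and the `meas` lookup are hypothetical, and a real eMoF table would normally carry further fields, such as the controlled vocabulary references from step 6.

```r
# Hypothetical wide table: one column per measurement (names are invented)
wide <- data.frame(
  occurrenceID = c("occ_1", "occ_2"),
  length_cm = c(12.5, 30.1),
  weight_g = c(40, 210),
  stringsAsFactors = FALSE
)

# Hypothetical lookup describing each measurement column
meas <- data.frame(
  column = c("length_cm", "weight_g"),
  measurementType = c("length", "weight"),
  measurementUnit = c("cm", "g"),
  stringsAsFactors = FALSE
)

# Build one block of long-format rows per measurement column, then stack them
emof <- do.call(rbind, lapply(seq_len(nrow(meas)), function(i) {
  data.frame(
    occurrenceID = wide$occurrenceID,
    measurementType = meas$measurementType[i],
    measurementValue = wide[[meas$column[i]]],
    measurementUnit = meas$measurementUnit[i],
    stringsAsFactors = FALSE
  )
}))

emof
```

Each measurement column in `wide` becomes one row per occurrence in `emof`, so adding a new measurement only requires a new row in the lookup rather than a new column in the output.
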
diff --git a/identifiers.md b/identifiers.md
@@ -1,4 +1,4 @@
-## Constructing and using identifier codes
+# Constructing and using identifiers
 
 **Content**
 

diff --git a/name_matching.md b/name_matching.md
@@ -1,4 +1,4 @@
-## Name matching for taxonomic quality control 
+# Match Taxonomic Names
 
 OBIS requires all your specimens to be classified and matched against an authoritative taxonomic register. This effectively attaches unique stable identifiers (and digitally traceable) to each of your species. Meaning, if a taxonomic ranking or a species name changes in the future, there will be no question as to which species your dataset is actually referring to. Matching to registers also helps to avoid misspelled or unused terms.
 
@@ -67,7 +67,7 @@ A complete online manual is available at [http://www.marinespecies.org/tutorial/
 
 **R script for attaching Taxon Lists to ID Lists:**
 
-If you are familiar enough with R, you can use the [`merge`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge) function to attach the two lists to your data. We provide a short example of how to use this function below.
+If you are familiar enough with R, you can use the [`merge`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge) function to attach the two lists to your data. See a short example of how to use this function below.
 
 ```R
 #Generate example data table with species occurences, for this example we will only have one column with the scientificName
@@ -78,13 +78,39 @@ lsids<- data.frame(scientificName=c("Ginglymostoma cirratum","Luidia maculata","
  LSID = c("urn:lsid:marinespecies.org:taxname:105846", "urn:lsid:marinespecies.org:taxname:213112","urn:lsid:marinespecies.org:taxname:127029","urn:lsid:marinespecies.org:taxname:105847"))
 
 #merge data frames together
-matched_data<-merge(data, lsids, by = "scientificName")
+matched_data<-merge(data, lsids, by = "scientificName", all=TRUE)
 matched_data
 ```
 
+Setting `all = TRUE` in the `merge` function keeps all rows from both data frames in the merged object (`matched_data`), even when a `scientificName` value in `data` has no exact match in `lsids`. This is particularly helpful when `scientificName` may contain typos. For example, a mismatch like "Thunus" instead of "Thunnus" would prevent the record from being linked to its LSID. With `all = TRUE`, unmatched rows still appear in the output, with `NA` in the columns that come from the other table, making it easier to identify and review discrepancies or extra rows that need correction.
+
+### R packages for taxon matching {.unlisted .unnumbered}
+
+There are several R packages available to assist you with taxon matching:
+
+1. [obistools](https://github.com/iobis/obistools#taxon-matching): use the `match_taxa` function to conduct taxon matching for a dataset in R
+2. [worrms](https://cran.r-project.org/web/packages/worrms/index.html): use the `wm_records_taxamatch` function to access the WoRMS API for taxon matching
+
+Both packages provide tools for resolving taxonomic names and ensuring your dataset aligns with accepted nomenclature. However, always verify ambiguous matches using other registers or manual checks to confirm accuracy.
+
+See below for example R code.
+
+```r
+library(obistools)
+
+# Read in occurrence table
+occur<-read.csv("occurence_table.csv")
+# Conduct taxon matching on only the unique taxon names
+worms <- match_taxa(unique(occur$scientificName), ask = TRUE)
+# Merge the matched names back with the occurrence data
+occur_match <- merge(occur, worms, by = "scientificName", all = TRUE)
+```
+
+Note that we have included the `ask` parameter in `match_taxa`. This parameter triggers interactive prompts during the taxon matching process, which lets you manually select the correct taxon when multiple matches are found.
+
 #### How to fetch a full classification for a list of species from WoRMS?
 
-When setting up your WoRMS taxon match, to obtain the full classification for your list of species, simply check the box labeled “Classification”. This will add classification output in addition to the requested identifiers to your taxon match file, including Kingdom, Phylum, Class, Order, Family, Genus, Subgenus, Species, and Subspecies.
+When setting up your WoRMS taxon match from the web interface, to obtain the full classification for your list of species, simply check the box labeled “Classification”. This will add classification output in addition to the requested identifiers to your taxon match file, including Kingdom, Phylum, Class, Order, Family, Genus, Subgenus, Species, and Subspecies.
 
 ![WoRMS classification box](images/WoRMS_classification.png){width=70%}
 
@@ -136,12 +162,6 @@ Currently, this web service matches the scientific names with the following taxo
 
 The Interim Register of Marine and Non-marine Genera (IRMNG) matching services are available through [http://www.irmng.org/](http://www.irmng.org/), as well as through the [LifeWatch taxon match](http://www.lifewatch.be/data-services/). This service allows you to search for a genus (or other taxonomic rank when you uncheck the “genera” box) to check if it is known to be marine, brackish, freshwater, or terrestrial. You can find this information in the row labeled “Environment”. If the taxa is marine, you may have to contact the WoRMS data management team (<[email protected]>) to have the taxon added to the WoRMS register (note you may have to provide supporting information confirming taxonomic and marine status).
 
-### R packages for taxon matching {.unlisted .unnumbered}
-
-If you are familiar with R, you may use the [obistools](https://github.com/iobis/obistools#taxon-matching) function `match_taxa` to conduct taxon matching for your dataset. There is also a WoRMS package called [worrms](https://cran.r-project.org/web/packages/worrms/index.html) that has a function called `wm_records_taxamatch` you can use to conduct taxon matching.
-
-The output will be the same as that from the WoRMS tool, so you should check ambiguous matches as described above, confirming with other registers as necessary.
-
 ### Taxon Match Tools Overview {.unlisted .unnumbered}
 
 See the table below for a summary of the different tools available.
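
The `merge(..., all = TRUE)` change in the `name_matching.md` hunk above is easiest to judge with a small offline example. The sketch below uses base R only. The occurrence IDs are hypothetical, and the second LSID is a placeholder rather than a real WoRMS identifier.

```r
# Hypothetical occurrence data containing one misspelled name ("Thunus")
data <- data.frame(
  occurrenceID = c("occ_1", "occ_2"),
  scientificName = c("Ginglymostoma cirratum", "Thunus thynnus"),
  stringsAsFactors = FALSE
)

# Lookup of names and LSIDs; the second LSID is a placeholder, not a real identifier
lsids <- data.frame(
  scientificName = c("Ginglymostoma cirratum", "Thunnus thynnus"),
  LSID = c("urn:lsid:marinespecies.org:taxname:105846",
           "urn:lsid:marinespecies.org:taxname:PLACEHOLDER"),
  stringsAsFactors = FALSE
)

# all = TRUE keeps rows from both tables, filling the gaps with NA
matched_data <- merge(data, lsids, by = "scientificName", all = TRUE)

# Names in the data that found no LSID (likely typos or missing lookups)
no_lsid <- matched_data[is.na(matched_data$LSID), ]

# Lookup names that no occurrence record uses
unused_lsid <- matched_data[is.na(matched_data$occurrenceID), ]

no_lsid$scientificName
unused_lsid$scientificName
```

With the default `all = FALSE`, the misspelled record would be dropped from `matched_data` without warning, so the typo would be much harder to spot.
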

diff --git a/other_resources.md b/other_resources.md
@@ -8,7 +8,7 @@ In this section we highlight useful resources created by collaborators and other
 
 - <https://www.youtube.com/watch?v=teJhfsSWonE>
 
-This tutorial was created by the [MBON Pole to Pole project](https://marinebon.org/p2p/index.html) to help guide people through the process of transforming datasets to Darwin Core using [tools](https://marinebon.org/p2p/methods_data_science.html) MBON Pole to Pole has developed.
+This tutorial was created by the [MBON Pole to Pole project](https://marinebon.github.io/p2p/index.html) to help guide people through the process of transforming datasets to Darwin Core using [tools](https://marinebon.github.io/p2p/methods_data_science.html) MBON Pole to Pole has developed.
 
 ## IOOS Darwin Core Guide
 
@@ -28,7 +28,6 @@ There is an [Excel template generator](https://www.nordatanet.no/aen/template-ge
 
 There is also an [Excel to Darwin Core macro tool](https://zenodo.org/record/6453921#.Y9KsQkHMKmU) developed by GBIF Norway that you can download for use in Microsoft Excel. This macro can help you set up Event, Occurrence, and eMoF tables by selecting all relevant DwC fields from a list, or by importing data from another spreadsheet. It allows for auto-generation of identifiers (e.g. eventID, occurrenceID) if macros are enabled, and can also auto-populate the eMoF when measurement fields in the Occurrence table are populated.
 
-
 ## Bionomia
 
 - <https://bionomia.net/>