Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
update data formatting pages & vocab
* added an intro page to orient users to the data formatting section
* moved info on mapping fields to DwC to its own page earlier in the ToC
* updated taxon matching guidelines with code and clarified text
* fixed MBON broken link
1 parent
7483d81
commit a742596
Showing
9 changed files
with
111 additions
and
68 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
## Map data fields to Darwin Core | ||
|
||
There are many possible ways of setting up your datasheets, and if you are new to OBIS you likely did not use standardized Darwin Core (DwC) or BODC vocabulary before samples were collected. In mapping your data fields to DwC we recommend documenting your choices so you have a reference to go back to should the need arise. In such a document you should take notes on the choices you made, as well as any actions you had to take (e.g. separate one column into many, convert dates or coordinates, etc.). | ||
|
||
For example, a DwC mapping reference table could look like the following: | ||
|
||
| Verbatim field name | Mapped DwC term | Actions taken | Notes | | ||
|:-------|:-------|:------------|:--------------| | ||
| date | eventDate | convert dates to ISO | | | ||
| coordinates| decimalLongitude, decimalLatitude | convert ddmmss to decimal degrees, separated one column into 2 for longitude and latitude | put original coordinates into verbatimCoordinates | | ||
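
As an illustration of the actions listed in the table above, the following base R sketch converts a date to ISO 8601 and a degrees-minutes-seconds coordinate to decimal degrees. The column names (`date`, `lat_dms`, `lon_dms`), the input formats, and the helper `dms_to_decimal` are assumptions made for this example and will need adapting to your own datasheet.

```r
# Hypothetical raw data: day/month/year dates, coordinates as "DD MM SS H"
raw <- data.frame(
  date    = c("14/03/2019", "02/11/2020"),
  lat_dms = c("51 30 00 N", "33 51 54 S"),
  lon_dms = c("000 07 30 W", "151 12 36 E"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "DD MM SS H" strings to signed decimal degrees
dms_to_decimal <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")
  vapply(parts, function(p) {
    value <- as.numeric(p[1]) + as.numeric(p[2]) / 60 + as.numeric(p[3]) / 3600
    # Southern and western hemispheres are negative
    if (p[4] %in% c("S", "W")) -value else value
  }, numeric(1))
}

dwc <- data.frame(
  eventDate = format(as.Date(raw$date, format = "%d/%m/%Y"), "%Y-%m-%d"),
  decimalLatitude = dms_to_decimal(raw$lat_dms),
  decimalLongitude = dms_to_decimal(raw$lon_dms),
  # Keep the original coordinate strings, as noted in the reference table
  verbatimCoordinates = paste(raw$lat_dms, raw$lon_dms),
  stringsAsFactors = FALSE
)
dwc
```

Recording each of these steps in the "Actions taken" column keeps the conversion traceable if you need to revisit it later.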
|
||
In order to help you map your data to DwC terms, we have provided the table below which outlines some common data fields, their associated Darwin Core vocabulary, and which data table the field is likely to go in: | ||
|
||
| Common Raw Terms | DwC Field | Data table | | ||
|:----------------- |:----------------------------- |:---------| | ||
| Date, Time | eventDate | Event, Occurrence | | ||
| Species, g_s, taxa | scientificName | Occurrence | | ||
| Any biotic/abiotic measurements* | measurementType, measurementValue, measurementUnit* | eMoF | | ||
| Depth | maximumDepthInMeters or minimumDepthInMeters | Event, Occurrence | | ||
| Lat/Latitude, Lon/Long/Longitude, dd | decimalLatitude, decimalLongitude | Event, Occurrence | | ||
| Sampling method | samplingProtocol | Event, eMoF | | ||
| Sample size, N, #, No. | sampleSizeValue | Event, eMoF | | ||
| Location | locality | Event | | ||
| Presence, absence | occurrenceStatus | Occurrence | | ||
| Type of record/ specimen | basisOfRecord | Occurrence | | ||
| Person/ people that recorded the original Occurrence | recordedBy | Occurrence | | ||
| ORCID of person/ people that recorded the original Occurrence | recordedByID | Occurrence | | ||
| Person/ people that identified the organism | identifiedBy | Occurrence | | ||
| ORCID of person/ people that identified the organism | identifiedByID | Occurrence | | ||
| Data collector, data creator | recordedBy | Event, Occurrence | | ||
| Taxonomist, identifier | identifiedBy | Occurrence | | ||
| Record number, sample number, observation number | occurrenceID (either ID or incorporated into ID) | Occurrence | | ||
|
||
<div class=callbox-blue> | ||
|
||
`r fontawesome::fa("flag", fill="darkblue", prefer_type="solid")` Note that mapping abiotic/biotic measurement fields (sex, temperature, abundance, lengths, etc.) occurs within the [extendedMeasurementOrFact extension](format_emof.html). There, each measurement goes from being a separate column to being condensed into the `measurementType` and `measurementValue` fields. | ||
</div> | ||
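
To make the note above concrete, here is a minimal base R sketch of condensing measurement columns into long format. The column names (`sex`, `length_cm`), the identifiers, and the values are invented for illustration; the eMoF page describes the full set of fields and vocabularies.

```r
# Hypothetical occurrence-level data with measurements in separate columns
wide <- data.frame(
  occurrenceID = c("occ_1", "occ_2"),
  sex = c("female", "male"),
  length_cm = c(12.5, 10.2),
  stringsAsFactors = FALSE
)

measurement_cols <- c("sex", "length_cm")

# Stack each measurement column into measurementType / measurementValue rows
emof <- do.call(rbind, lapply(measurement_cols, function(col) {
  data.frame(
    occurrenceID = wide$occurrenceID,
    measurementType = col,
    measurementValue = as.character(wide[[col]]),
    stringsAsFactors = FALSE
  )
}))

# Fill units per measurement type (sex has no unit), then tidy the type label
emof$measurementUnit <- ifelse(emof$measurementType == "length_cm", "cm", NA)
emof$measurementType[emof$measurementType == "length_cm"] <- "length"
emof
```

Each occurrence now contributes one row per measurement rather than one column per measurement.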
|
||
The `obistools` R package also has the [`map_fields` function](https://github.com/iobis/obistools#map-column-names-to-darwin-core-terms) that you can use to map your dataset fields to a DwC term. |
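
If you prefer not to install an extra package, the same renaming can be sketched in base R. The example below is not the `map_fields` implementation; the data frame and its column names (`id`, `date`, `lon`, `lat`) are hypothetical, and the mapping is written as DwC term = verbatim field name.

```r
# Hypothetical raw data with non-standard column names
raw <- data.frame(
  id = c("sample_1", "sample_2"),
  date = c("2019-03-14", "2020-11-02"),
  lon = c(-0.125, 151.21),
  lat = c(51.5, -33.865),
  stringsAsFactors = FALSE
)

# Mapping documented as DwC term = verbatim field name
mapping <- c(
  eventID = "id",
  eventDate = "date",
  decimalLongitude = "lon",
  decimalLatitude = "lat"
)

# Fail early if the mapping refers to a column that is not in the data
missing_cols <- setdiff(mapping, names(raw))
stopifnot(length(missing_cols) == 0)

# Rename only the mapped columns; any other columns keep their names
mapped <- raw
idx <- match(mapping, names(mapped))
names(mapped)[idx] <- names(mapping)
names(mapped)
```

Keeping the mapping in a single named vector means it can double as the machine-readable version of your mapping reference table.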
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
# (PART\*) Data Formatting {-} | ||
|
||
# Data formatting workflow | ||
|
||
Preparing data can be a challenging process at first. This section of the manual provides guidance on formatting data for OBIS so that it complies with Darwin Core. | ||
|
||
The general data formatting workflow you can follow is: | ||
|
||
1. Identify your [dataset structure](formatting.html) | ||
* Understand the structure of your data and how it fits into Darwin Core | ||
2. Create [unique identifiers](identifiers.html) | ||
* Assign unique identifiers to distinguish between events, nested events, and biological occurrences | ||
3. [Match](name_matching.html) taxon names | ||
* Align taxonomic names to [World Register of Marine Species (WoRMS)](https://www.marinespecies.org/) to ensure consistency and retrieve their associated identifiers | ||
4. [Map data column](data_map.html) names to Darwin Core | ||
* Rename data columns to match [Darwin Core terms](https://dwc.tdwg.org/terms/) | ||
5. Organize measurements, facts, and information | ||
* Structure measurement and fact data in long format in the [extendedMeasurementOrFact table](format_emof.html) | ||
6. Identify [controlled vocabularies](vocabulary.md) to include with your measurements | ||
* Ensure measurements and facts reference appropriate controlled vocabularies for interoperability and clarity | ||
7. Standardize [other fields](common_formatissues.html) | ||
* Verify that all other fields, such as dates and coordinates, conform to standards | ||
|
||
The following pages provide a detailed breakdown of each step, including examples and tips to help you through the formatting process. Remember, the OBIS Helpdesk and OBIS Nodes are available to help you. |
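
As a small preview of step 2, the sketch below builds nested event identifiers in base R. The column names (`cruise`, `station`, `sample`) and the colon-separated scheme are assumptions for illustration only; the identifiers page explains how to construct identifiers for your own dataset.

```r
# Hypothetical sampling hierarchy: cruise > station > sample
samples <- data.frame(
  cruise = c("cruiseA", "cruiseA", "cruiseA"),
  station = c("st1", "st1", "st2"),
  sample = c("s1", "s2", "s1"),
  stringsAsFactors = FALSE
)

# Nested event identifiers: each level extends its parent
samples$parentEventID <- paste(samples$cruise, samples$station, sep = ":")
samples$eventID <- paste(samples$parentEventID, samples$sample, sep = ":")

# Identifiers must be unique within the dataset
stopifnot(!any(duplicated(samples$eventID)))
samples$eventID
```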
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
## Constructing and using identifier codes | ||
# Constructing and using identifiers | ||
|
||
**Content** | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
## Name matching for taxonomic quality control | ||
# Match Taxonomic Names | ||
|
||
OBIS requires all your specimens to be classified and matched against an authoritative taxonomic register. This effectively attaches unique, stable, and digitally traceable identifiers to each of your species, meaning that if a taxonomic ranking or a species name changes in the future, there will be no question as to which species your dataset refers to. Matching to registers also helps to avoid misspelled or unused terms. | ||
|
||
|
@@ -67,7 +67,7 @@ A complete online manual is available at [http://www.marinespecies.org/tutorial/ | |
|
||
**R script for attaching Taxon Lists to ID Lists:** | ||
|
||
If you are familiar enough with R, you can use the [`merge`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge) function to attach the two lists to your data. We provide a short example of how to use this function below. | ||
If you are familiar enough with R, you can use the [`merge`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge) function to attach the two lists to your data. See a short example of how to use this function below. | ||
|
||
```R | ||
# Generate example data table with species occurrences; for this example we will only have one column with the scientificName | ||
|
@@ -78,13 +78,39 @@ lsids<- data.frame(scientificName=c("Ginglymostoma cirratum","Luidia maculata"," | |
LSID = c("urn:lsid:marinespecies.org:taxname:105846", "urn:lsid:marinespecies.org:taxname:213112","urn:lsid:marinespecies.org:taxname:127029","urn:lsid:marinespecies.org:taxname:105847")) | ||
|
||
#merge data frames together | ||
matched_data<-merge(data, lsids, by = "scientificName") | ||
matched_data<-merge(data, lsids, by = "scientificName", all=TRUE) | ||
matched_data | ||
``` | ||
|
||
Including `all = TRUE` in the merge function ensures that all rows from both data frames are retained in the resulting merged object (`matched_data`), even when a `scientificName` value in `data` has no exact match in `lsids`. This is particularly helpful where there may be typos in `scientificName`: a misspelling like "Thunus" instead of "Thunnus" would prevent the record from being linked to its LSID. With `all = TRUE`, unmatched rows still appear in the output with `NA` in the columns from the other data frame, making it easier to identify and review any discrepancies or extra rows that may need correction. | ||
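
One way to pull out the rows that need review after such a merge is to filter on missing LSIDs. The sketch below is self-contained and uses a deliberately misspelled, made-up entry ("Thunus thynnus") alongside two names from the example above; the object name `unmatched` is our own.

```r
# Occurrence names, one of which contains a typo
data <- data.frame(
  scientificName = c("Ginglymostoma cirratum", "Luidia maculata", "Thunus thynnus"),
  stringsAsFactors = FALSE
)

# Taxon match output for the names that were found
lsids <- data.frame(
  scientificName = c("Ginglymostoma cirratum", "Luidia maculata"),
  LSID = c("urn:lsid:marinespecies.org:taxname:105846",
           "urn:lsid:marinespecies.org:taxname:213112"),
  stringsAsFactors = FALSE
)

# Keep all rows from both tables, then list the names left without an LSID
matched_data <- merge(data, lsids, by = "scientificName", all = TRUE)
unmatched <- matched_data[is.na(matched_data$LSID), ]
unmatched$scientificName
```

Any name returned here should be checked for spelling and matched again before the dataset is published.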
|
||
### R packages for taxon matching {.unlisted .unnumbered} | ||
|
||
There are several R packages available to assist you with taxon matching: | ||
|
||
1. [obistools](https://github.com/iobis/obistools#taxon-matching): use the `match_taxa` function to conduct taxon matching for a dataset in R | ||
2. [worrms](https://cran.r-project.org/web/packages/worrms/index.html): use the `wm_records_taxamatch` function to access the WoRMS API for taxon matching | ||
|
||
Both packages provide tools for resolving taxonomic names and ensuring your dataset aligns with accepted nomenclature. However, always verify ambiguous matches using other registers or manual checks to confirm accuracy. | ||
|
||
See below for example R code. | ||
|
||
```r | ||
library(obistools) | ||
|
||
# Read in occurrence table | ||
occur <- read.csv("occurence_table.csv") | ||
# Conduct taxon matching on only the unique instances of each taxon's name | ||
worms <- match_taxa(unique(occur$scientificName), ask = TRUE) | ||
# Merge the matched names back with occurrence data | ||
occur_match <- merge(occur, worms, by = "scientificName", all = TRUE) | ||
``` | ||
|
||
Note that we have included the `ask` parameter in `match_taxa`. This parameter triggers interactive prompts during the taxon matching process, which is useful when multiple matches are found because it allows you to manually select the correct taxon. | ||
|
||
#### How to fetch a full classification for a list of species from WoRMS? | ||
|
||
When setting up your WoRMS taxon match, to obtain the full classification for your list of species, simply check the box labeled “Classification”. This will add classification output in addition to the requested identifiers to your taxon match file, including Kingdom, Phylum, Class, Order, Family, Genus, Subgenus, Species, and Subspecies. | ||
When setting up your WoRMS taxon match from the web interface, to obtain the full classification for your list of species, simply check the box labeled “Classification”. This will add classification output in addition to the requested identifiers to your taxon match file, including Kingdom, Phylum, Class, Order, Family, Genus, Subgenus, Species, and Subspecies. | ||
|
||
![WoRMS classification box](images/WoRMS_classification.png){width=70%} | ||
|
||
|
@@ -136,12 +162,6 @@ Currently, this web service matches the scientific names with the following taxo | |
|
||
The Interim Register of Marine and Non-marine Genera (IRMNG) matching services are available through [http://www.irmng.org/](http://www.irmng.org/), as well as through the [LifeWatch taxon match](http://www.lifewatch.be/data-services/). This service allows you to search for a genus (or other taxonomic rank when you uncheck the “genera” box) to check if it is known to be marine, brackish, freshwater, or terrestrial. You can find this information in the row labeled “Environment”. If the taxon is marine, you may have to contact the WoRMS data management team (<[email protected]>) to have the taxon added to the WoRMS register (note you may have to provide supporting information confirming taxonomic and marine status). | ||
|
||
### R packages for taxon matching {.unlisted .unnumbered} | ||
|
||
If you are familiar with R, you may use the [obistools](https://github.com/iobis/obistools#taxon-matching) function `match_taxa` to conduct taxon matching for your dataset. There is also a WoRMS package called [worrms](https://cran.r-project.org/web/packages/worrms/index.html) that has a function called `wm_records_taxamatch` you can use to conduct taxon matching. | ||
|
||
The output will be the same as that from the WoRMS tool, so you should check ambiguous matches as described above, confirming with other registers as necessary. | ||
|
||
### Taxon Match Tools Overview {.unlisted .unnumbered} | ||
|
||
See the table below for a summary of the different tools available. | ||
|