advice for sharing ORCIDs via dwc - question submitted via dwchour form 7/31/2019 12:47:41 #144

Open
iDigBioBot opened this issue Jul 31, 2019 · 16 comments
Labels
answered form submission term - Occurrence Pertaining to a term organized in the Occurrence class.

Comments

@iDigBioBot
Collaborator

A user submitted this information via the Darwin Core Hour webform:
Timestamp: 7/31/2019 12:47:41
Please provide a topic of interest: How to put an ORCID in dwc for dwc:recordedBy
Are you capable of and interested in participating: Yes
Who else would you recommend to participate in the presentation: David Shorthouse, Rod Page, John Wieczorek, Quentin Groom, Steve Baskauf, Stan Blum
What resources can you point to: See https://docs.google.com/spreadsheets/d/1E9SZCb8Yvjf4xLlSDW6JHxV971eNV1CDcni8OaAGOFI/edit#gid=0 and this tweet https://twitter.com/FrostMuseum/status/1153732132591853570
Your name: Debbie Paul
Your email: [email protected]
Your GitHub username: @debpaul

@debpaul
Contributor

debpaul commented Jul 31, 2019

Hi @tdwg/dwc-qa @stanblum @dshorthouse Rod Page, @baskaufs, please see this tweet https://twitter.com/FrostMuseum/status/1153732132591853570 from entomologist Andy Deans at the Frost Museum (Penn State). It is a great move toward a better standard of practice for sharing and documenting vouchering expectations and guidelines with researchers. In the guidelines, Andy shares a link to a sample data collection sheet he recommends. Note this string in the spreadsheet: "orchidID":"https://orcid.org/0000-0002-2119-4663". My question is about the o-r-c-h-i-d part: is this key name ("orchidID") standard?

@debpaul debpaul changed the title Darwin Core Hour Input Form 7/31/2019 12:47:41 advice for sharing ORCIDs via dwc - question submitted via dwchour form 7/31/2019 12:47:41 Jul 31, 2019
@kcopas

kcopas commented Jul 31, 2019

Not standard. The issuing organization (facing a clear branding/identity challenge) is clear that these identifiers are intended to be referred to as ‘ORCID IDs’.

Sent with GitHawk

@debpaul
Contributor

debpaul commented Jul 31, 2019

So what recommendation do we make to Andy Deans for his protocol form? He's encouraging use of an ORCID, yay. Where does it go in DwC? Does this require dwciri @baskaufs?

@dshorthouse

There's the ORCID branding issue - they prefer it be called ORCID ID when referring to the identifier and not the organization - but there's also how to best express the content of data cells for our own uses.

A key:value pair as indicated in Andy's sample spreadsheet in dwc:recordedBy or dwc:identifiedBy would be buried because we don't expect it (many as an array?) to be present. It's unlikely anyone or any machine would take action on these unless a consumer were to write a custom regex.
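For illustration only, here is a minimal Python sketch of the kind of custom regex a consumer would have to write to dig ORCID IDs out of a free-text recordedBy cell. The function names are hypothetical and not part of any DwC tooling; the check-character test follows the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers.

```python
import re
from typing import List

# Matches a 16-character ORCID ID, with or without the https://orcid.org/ prefix.
ORCID_RE = re.compile(r"(?:https?://orcid\.org/)?(\d{4}-\d{4}-\d{4}-\d{3}[\dX])")


def orcid_checksum_ok(orcid: str) -> bool:
    """Verify the final character with ISO 7064 MOD 11-2."""
    chars = orcid.replace("-", "")
    total = 0
    for ch in chars[:-1]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[-1] == expected


def extract_orcids(cell: str) -> List[str]:
    """Return the checksum-valid ORCID IDs found anywhere in a free-text cell."""
    return [m for m in ORCID_RE.findall(cell) if orcid_checksum_ok(m)]
```

Run against the string from the sample spreadsheet, extract_orcids('"orchidID":"https://orcid.org/0000-0002-2119-4663"') picks out the bare identifier, while a mistyped final digit is rejected by the checksum. Every consumer would need to know to do this, which is the point being made above.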

We do have dwciri:recordedBy and dwciri:identifiedBy, a namespace for non-literal objects in which we CAN put content like https://orcid.org/0000-0002-2119-4663. That allows us to have many collectors per specimen record and subsequently permits something like a JSON-LD representation, eg https://bloodhound-tracker.net/occurrence/477976412.json. AFAIK, no one is doing anything with these dwciri non-literal equivalents, including GBIF.

But...we're slipping into 1:many territory here. Andy desires a single spreadsheet view so as to simplify the task for users and processors of his spreadsheet. I recommend that he stick with the literal strings as is always done in recordedBy, eg "Andy Deans; Daniel H. Janzen". If he desires more to help push along our need for formal, machine-readable recognition thru ORCID IDs (or other), a spreadsheet is not the best place to capture this.

As it happens, I have finally been (quietly) plodding on a DwC-A extension for AgentActions to be used in an IPT. Anne Thessen @diatomsRcool would like to see this task completed as a product of the RDA/TDWG Interest Group. This is nowhere near ready for prime-time. You'll at least see where we're going with: https://github.com/tdwg/attribution/tree/master/dwc. I'm hoping this will be done in time for a demo at the biodiversity_next pre-conference workshop on Authority Management of People Names, WT65 https://biodiversitynext.org/pre-conference/

@debpaul
Contributor

debpaul commented Jul 31, 2019 via email

@MattBlissett
Member

MattBlissett commented Aug 1, 2019

It doesn't answer your questions, but iNaturalist and GBIF have some discussion on this here: gbif/occurrence#89, and a plan for an interim GBIF-namespace term.

Today, iNaturalist's exported research-grade observations contain 93 unique ORCIDs across around 74,000 observations.

@qgroom
Member

qgroom commented Aug 1, 2019

One of the tasks for the pre-conference workshop on people's names is to scope out the need for a TDWG task group on this subject. The question being, do we need to make changes to the existing standards to better support name information? It seems from this conversation that the answer is yes.

@debpaul
Contributor

debpaul commented Aug 1, 2019

Thanks @MattBlissett @timrobertson100. I'm wondering if we need an "interim" protocol/method for TDWG standards. I get why GBIF came up with an interim solution using a GBIF-namespace term (we did similar at iDigBio). But it seems that TDWG could have a protocol for doing just this when needed. (Something to discuss.)

@debpaul
Contributor

debpaul commented Aug 1, 2019

To @adeans, note the above conversation relevant to your ORCID ID capture in your spreadsheet. Also see gbif/occurrence#89 and gbif/portal16#342 for some insights on ORCID ID data capture and potential use.

@dshorthouse

dshorthouse commented Aug 1, 2019

@MattBlissett I'm surprised that iNaturalist and GBIF have taken this route. While I think incorporation of ORCID IDs in occurrence data is exactly what we want, a branded, DwC look-alike is best avoided (why wasn't this term called "ORCID ID"?!). It may have set a precedent here for Andy Deans's approach and it confuses the DwC standard.

We can perhaps get away with it with iNaturalist => GBIF because the former makes use of OAuth2 & the response from ORCID transparently provides that ORCID ID on the user's behalf. iNaturalist users need not ever know their ORCID ID - it's a horrible string to type. Plus, I'm assuming that there will only ever be one agent in iNaturalist's recordedByOrcid. But, what about that which makes iNaturalist observations "research grade"? - wouldn't it be great if those ORCID IDs for people who confirmed the identification of others' observations also flow to GBIF? Would you then make a 1:many identifiedByOrcid for the 3+ confirmed determinations? This may get messy in a real hurry.

In Andy's case here, users will need to copy/paste their ORCID IDs & so mistakes will be made. Andy will also have to deal with 1:many in recordedBy and identifiedBy. And, he'll also then have to deal with collector numbers when botanists want to play too. What then about others who want credit for other ways specimens have been handled or prepared?

I realize the above is messy and there isn't a solution now. But, I think we best do this with care. A DwC extension appears to be the best way to do this even though the 1:many issue is not pleasing for a spreadsheet tdwg/dwc#101

@debpaul
Contributor

debpaul commented Aug 1, 2019

Great insights @dshorthouse, thanks for elaborating on the (current) workflow and potential issues. see also my related comments gbif/occurrence#89 (comment)

@adeans

adeans commented Aug 1, 2019

Cool, cool. Watching this and other spaces for recommendations. I think for now I will use names: "Andrew R. Deans | D. H. Janzen", etc.

@debpaul
Contributor

debpaul commented Aug 1, 2019

To @adeans, don't stop collecting ORCID IDs though! Just create a new column for now. And if possible, store these in your collection mgmt software. Each person (Agent) in your CMS could have an ORCID ID. That way, when applications can make use of them, you'll have them to share. Maybe your column name for now is ORCID_ID (no spaces), and you can put multiple ORCID IDs in there too (separated by | as well). (Yes, copy/paste errors might happen as @dshorthouse points out, but that's a better problem to have than no ORCID_ID at all.) The other challenges (how to share them on export, for example) can be addressed serially. You can't share ORCID IDs if you aren't collecting them. See tdwg/dwc#101 for even more of the challenges surrounding gathering and using people IDs like ORCID ID (or other similar).
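As a rough sketch of how a separate ORCID_ID column could later be lined up with the names in recordedBy, here is some Python. It assumes (and this is only an assumption, not an agreed convention) that both cells use | as the separator and that positions correspond, with an empty slot meaning "no ORCID ID known". The function name pair_agents is hypothetical, and the identifier in the usage note is ORCID's own fictitious example person, not a real collector.

```python
from typing import Dict, List, Optional


def pair_agents(recorded_by: str, orcid_ids: str, sep: str = "|") -> List[Dict[str, Optional[str]]]:
    """Pair each name in recordedBy with the ORCID ID in the same position."""
    names = [n.strip() for n in recorded_by.split(sep)]
    # An entirely empty ORCID_ID cell means nobody has an identifier yet.
    ids = [i.strip() for i in orcid_ids.split(sep)] if orcid_ids.strip() else []
    if ids and len(ids) != len(names):
        # A mismatch is usually a copy/paste slip, so fail loudly rather than guess.
        raise ValueError(f"{len(names)} names but {len(ids)} ORCID_ID slots")
    ids = ids or [""] * len(names)
    return [{"name": n, "orcid": i or None} for n, i in zip(names, ids)]
```

For example, pair_agents("Josiah Carberry | Collector B", "0000-0002-1825-0097 | ") gives the first person an identifier and leaves the second as None, so the gap can be filled in later without disturbing the names.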

@baskaufs

baskaufs commented Aug 1, 2019

I am happy to read these comments on this interesting and important topic.

With respect to use of dwciri:recordedBy, the relevant specification is Section 2.5.1 of the Darwin Core RDF Guide. To paraphrase:

  1. The value of a dwciri: property will be a single resource identified by an IRI (a.k.a. URI; blank nodes without an IRI are also allowed).
  2. If there are multiple values for the property, it can be repeated for each value (see Example 20). However, this approach does not provide any easy way to indicate ordering, or the exact role of the person identified by the IRI.
  3. "Alternatively, a single triple can be used to describe the subject if the object is a single resource composed of component resources described using additional RDF triples.", i.e. there is a single value for the property and that value can represent a group. The composition of the group and relationship among the members of the group would be described by other machine-readable statements (not specified by the guide).

So I think it's this last option that we are talking about here: using dwciri:recordedBy to link to a description of a group of people and their relationships. There are probably a number of ways to create that description; the important thing is to get consensus on how we will all do it. I'm of the general opinion that if there's a way to create a DwC extension for something, there's probably an easy way to convert it to machine-readable RDF or JSON-LD (but not necessarily the reverse). So it would make sense to me to develop the spreadsheet extension simultaneously with the graph model for linked data.
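To make the simpler second option concrete (repeating dwciri:recordedBy once per person), here is a hedged Python sketch that builds a small JSON-LD document. The function name and the example.org occurrence IRI are hypothetical; the dwc: and dwciri: namespace IRIs are the published TDWG ones. Note that a plain JSON-LD array is an unordered set, so this form carries no collector order or role, which is exactly the limitation noted in point 2.

```python
import json
from typing import Dict, List


def occurrence_jsonld(occurrence_iri: str, recorded_by_literal: str, collector_iris: List[str]) -> Dict:
    """Build a JSON-LD description with one dwciri:recordedBy value per collector."""
    return {
        "@context": {
            "dwc": "http://rs.tdwg.org/dwc/terms/",
            "dwciri": "http://rs.tdwg.org/dwc/iri/",
        },
        "@id": occurrence_iri,
        "@type": "dwc:Occurrence",
        # The literal string stays in dwc:recordedBy for human readers.
        "dwc:recordedBy": recorded_by_literal,
        # Each IRI becomes its own value of the non-literal property.
        "dwciri:recordedBy": [{"@id": iri} for iri in collector_iris],
    }


example = occurrence_jsonld(
    "https://example.org/occurrence/1",  # hypothetical occurrence IRI
    "Collector A | Collector B",
    ["https://orcid.org/0000-0002-1825-0097", "https://orcid.org/0000-0002-2119-4663"],
)
serialized = json.dumps(example, indent=2)
```

The third option would replace the array with a single node describing the group, which is where a consensus model is still needed.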

One final note about the dwciri: properties. Despite their name, the value of those terms doesn't have to be identified by an IRI. They can link to a blank (a.k.a. anonymous) node that is then subsequently described. See Example 16 in the guide. The important point here is that if we create a process for converting something like a DwC extension using spreadsheets to Linked Data, that doesn't necessarily require creating an infrastructure for minting and maintaining identifiers for groups of people. The links and relationships among the (hopefully ORCID ID-identified) people can be established without requiring that.

@dshorthouse

"The important point here is that if we create a process for converting something like a DwC extension using spreadsheets to Linked Data, that doesn't necessarily require creating an infrastructure for minting and maintaining identifiers for groups of people."

Phew! Thank goodness. We may still need to declare one's role in the context of the action executed. Botanists have a primary collector and others listed in recordedBy sat in the truck :)~ We could use a "role" here to contain an integer that describes that pecking order.

@baskaufs

baskaufs commented Aug 1, 2019

As an example of an approach for handling the problem of ordering of machine-readable data, we can look at the Getty Thesaurus of Geographic Names (TGN) record for China. The TGN maintains a particular order in which names should be displayed, and also notes whether a name is preferred. This is kind of an analog for ordered lists of collectors or authors, where there is a special note of the primary collector or first author. In the RDF we can see that there is a skosxl:prefLabel link to each of the names, and the RDF describing those names includes a gvp:displayOrder property that has a positive integer as its value. For example, tgn_term:159-zh-Latn which has the literal form "Zhongguo" has displayOrder = 1 and we see it as the first item on the list on the human-readable page. You can also see that in the description of the subject resource tgn:1000111, there is the property:value pair gvp:prefLabelGVP tgn_term:159-zh-Latn and that's how we can know that "Zhongguo" should get labeled as "preferred" on the human-readable page.

The point here is that there are relatively simple approaches to making up for the deficiencies that RDF has in describing the order and special characteristics of items on a list. The TGN doesn't live natively as RDF - the RDF is generated from a relational database (I believe). But the Getty has nevertheless managed to expose a relatively large dataset as Linked Data and via a public SPARQL endpoint. With a bit of effort, we could, too.
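A minimal sketch of the same idea applied to collectors, in Python. It assumes a hypothetical 1:many extension table in which each row carries an occurrenceID, an agent IRI, and an integer order playing the role of gvp:displayOrder. These column names are invented for illustration and are not terms from the attribution extension or from DwC.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def ordered_collectors(rows: Iterable[Dict]) -> Dict[str, List[str]]:
    """Group extension rows by occurrence and sort each group by its integer order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["occurrenceID"]].append(row)
    return {
        occ: [r["agent"] for r in sorted(group, key=lambda r: r["order"])]
        for occ, group in grouped.items()
    }


# Hypothetical rows; they can arrive in any sequence, as RDF triples would.
rows = [
    {"occurrenceID": "occ-1", "agent": "https://orcid.org/0000-0002-2119-4663", "order": 2},
    {"occurrenceID": "occ-1", "agent": "https://orcid.org/0000-0002-1825-0097", "order": 1},
]
collectors = ordered_collectors(rows)
primary = collectors["occ-1"][0]  # order = 1 is treated as the primary collector
```

The order lives in the data rather than in the position of the statement, so it survives a round trip through an unordered representation.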

@tucotuco tucotuco added answered term - Occurrence Pertaining to a term organized in the Occurrence class. labels Sep 6, 2019
@tucotuco tucotuco removed the new label Sep 2, 2020

9 participants