Enable search by ORCID ID #251

Closed
timrobertson100 opened this issue Jan 27, 2020 · 22 comments

@timrobertson100
Member

timrobertson100 commented Jan 27, 2020

This text is in draft; external review will be sought and this description updated as comments are made.

GBIF wishes to promote wider use of ORCID IDs; the first step is to enable search and download capabilities by ORCID IDs for those datasets that follow guidelines in how to use ORCID IDs.

I propose that these guidelines include:

  1. Support of new fields of recordedByOrcidID and identifiedByOrcidID for use in CSV (i.e. header row) and in the Occurrence files in DwC-A files (in the meta.xml in a GBIF namespace). These fields will allow for multivalue using a delimiter (recommended as |). Where multiple values exist, these are required to align with recordedBy and identifiedBy. If proven useful, these terms can be proposed to DwC.
  2. Support for the proposed DwC-A agent-actions extension. I propose that if 3 institutions demonstrate data mapped to this structure, it be promoted to production as a version 1 extension. It can then be revised if necessary, knowing significant changes may require early adopters to remap the data.
  3. Support for mapping ABCD version 3 will be explored by the BGBM. (Discussed; to be reconsidered in 2021.)

iNaturalist currently share data using a recordedByOrcid (note: not recordedByOrcidID). Once enabled we should request they change that, or we accommodate that as meaning recordedByOrcidID.

The initial functionality needs to support:

  • Find by ORCID ID
  • Find by ORCID ID and Action X
@debpaul

debpaul commented Jan 27, 2020

I believe ARCTOS folks may have ORCID data for at least some of their collectors/determiners, as potential people who would provide this data in a first go round.

@dshorthouse

Where multiple values exist, these are required to align with recordedBy and identifiedBy

Does this statement suggest that alignment will be verified and enforced or is it merely a suggestion? If the former, you're going to bump into a heap of problems, requiring a very robust parser. There will be plenty of examples for these terms with strings of many people, some of whom will have ORCID IDs and others not and so placement of pipes would have to be judicious.

recordedBy: Deb Paul, Q. Groom, Michael Smith, D.P. Shorthouse
recordedByOrcidID: https://orcid.org/0000-0003-2639-7520|||https://orcid.org/0000-0001-7618-5230

Because you can use ORCID to retrieve a name, I question the immediate value of attempting to align. If it's ordering of names that is of interest, I doubt this first pass will get us there.
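
The positional alignment in the example above can be sketched as follows. This is a minimal illustration and not GBIF's parser: the function name `align_agents` is hypothetical, and it assumes names are comma-separated as in the example. Real `recordedBy` strings often break that assumption (a value such as "Smith, J." would split into two names), which is the kind of problem raised here.

```python
from typing import List, Optional, Tuple


def align_agents(
    names: str, ids: str, name_sep: str = ",", id_sep: str = "|"
) -> List[Tuple[str, Optional[str]]]:
    """Pair each name with the identifier in the same position (None if empty).

    Hypothetical helper: assumes one delimiter slot per name, so people
    without an identifier are represented by an empty slot between pipes.
    """
    name_list = [n.strip() for n in names.split(name_sep)]
    id_list = [i.strip() or None for i in ids.split(id_sep)]
    if len(name_list) != len(id_list):
        raise ValueError(
            f"{len(name_list)} names but {len(id_list)} identifier slots"
        )
    return list(zip(name_list, id_list))


pairs = align_agents(
    "Deb Paul, Q. Groom, Michael Smith, D.P. Shorthouse",
    "https://orcid.org/0000-0003-2639-7520|||https://orcid.org/0000-0001-7618-5230",
)
for name, agent_id in pairs:
    print(name, "->", agent_id)
```

Raising on a count mismatch is only one possible policy; a lenient parser could instead treat the two fields as independent lists.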

Are values for identifiedByOrcidID and recordedByOrcidID meant to be URIs?

@timrobertson100
Member Author

timrobertson100 commented Jan 27, 2020

Does this statement suggest that alignment will be verified and enforced or is it merely a suggestion?

I had intended clear guidance, and people would have to be judicious. I'm happy to also relax the guidelines and treat them as independent terms, or just recommend that they align.

Are values for identifiedByOrcidID and recordedByOrcidID meant to be URIs?

ORCID recommends full URI as far as I can tell, but at GBIF we'd parse both formats on indexing (we aim to accommodate variable data). We could also change the term to recordedByID if we wanted to support more identifier schemes.
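
As a hedged illustration of accepting both formats, the sketch below normalizes either a bare ORCID iD or a full URI to the URI form, and verifies the final check character with the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers. The function names are hypothetical and this is not GBIF's indexing code.

```python
import re
from typing import Optional

# Accepts "0000-0001-6215-3617" or "https://orcid.org/0000-0001-6215-3617".
ORCID_PATTERN = re.compile(
    r"^(?:https?://orcid\.org/)?(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$",
    re.IGNORECASE,
)


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def normalize_orcid(value: str) -> Optional[str]:
    """Return the full ORCID URI, or None if the value is not a valid ORCID iD."""
    match = ORCID_PATTERN.match(value.strip())
    if not match:
        return None
    orcid = match.group(1).upper()
    digits = orcid.replace("-", "")
    if orcid_check_character(digits[:15]) != digits[15]:
        return None
    return "https://orcid.org/" + orcid


print(normalize_orcid("0000-0001-6215-3617"))
print(normalize_orcid("https://orcid.org/0000-0001-6215-3617"))
```

Returning `None` for anything unrecognized leaves room for other identifier schemes to be tried afterwards.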

Bear in mind those terms are for the Excel crowd (probably the largest community of data users) and I'd also expect a statement along the lines of "For more complex needs use the Agent-Action extension".

@debpaul

debpaul commented Jan 27, 2020

So with recordedByID we could use Wikidata Q# too, yes? Or do you envision these must go in the Agent-Action extension?

@timrobertson100
Member Author

So with recordedByID we could use Wikidata Q# too, yes? Or do you envision these must go in the Agent-Action extension?

Correct. VIAF, HUH, Wikidata, etc. It can become harder to parse but an ORCID could still be found if people used the full URI. We are free to propose what we want in the guidelines.

@debpaul

debpaul commented Jan 27, 2020

To all, please add to our to-do list, a webinar (Darwin Core Hour, or Biodiversity Data Standards Hour (new and coming soon), or joint Alliance Webinar, ...) on the topics of ORCID ID, Wikidata Q#, VIAF, HUH, etc, and this new feature / function / implementation, so that we can foster adoption.

@dshorthouse

Do we need to mention what relation this will have (if any) to the identifiedBy term in the Identification History extension?

@qgroom

qgroom commented Jan 28, 2020

Hi Tim and all,
I'm very supportive!
I prefer recordedByOrcidID to recordedByID, because we would get all sorts of IDs otherwise. Sometimes you can have teams where each member has only a different kind of ID. Also, sometimes you have several IDs for the same person and want to expose that. recordedByOrcidID might also be useful within the attribution extension where there could be a series of recordedBy......IDs.
For contemporary data Orcid IDs are very important and it helps build the argument for the attribution extension and the wider adoption of people IDs.
Furthermore, implementing this is a piece of cake!

@ben3000

ben3000 commented Jan 28, 2020

I think I would prefer that the terms avoid specifying the digital identifier type (e.g. recordedByID rather than recordedByOrcidID) because that locks the usage down to only ORCID IDs and we'd have to invent similar terms for the other IDs such as Wikidata Q IDs. This would then require that they are full URIs (or IRIs) so that it was clear what type of identifier each URI was.

In Western Australia, we could better support the general move to person identifiers if we could supply Wikidata Q IDs (as well as ORCID IDs) because some former staff have those, but now that they've retired, will likely never have an ORCID ID.

I'd like it if applications were told that they should be dereferencing these URIs to gather the full data, rather than rely on the person's name as it was supplied in the data file.

It would be nice if it was eventually possible to supply a set of IDs instead of values in recordedBy to avoid the need for alignment (e.g. using David's earlier example), as follows:

recordedBy: Q. Groom, Michael Smith
recordedByID: https://orcid.org/0000-0003-2639-7520|https://orcid.org/0000-0001-7618-5230

But that is probably a discussion for another time because it makes it more difficult to know which person is the primary collector.

@qgroom

qgroom commented Jan 28, 2020

@ben3000 what you want to do is also our end game with an attribution extension to Darwin Core. However, we are a way off getting this extension finalized and ratified as a standard. So this small step is just a proof of concept and its simplicity is a major advantage. Practically anyone could implement recordedByOrcidID.
@dshorthouse has just had a Task Group on person identifiers approved by the TDWG Executive. Perhaps we can add you to the mailing list of this group?
Quentin

@dshorthouse

I'll put this out there, merely to re-express my concern with this approach. I appreciate that GBIF is under pressure to do something on this front & it is taking far too long to resolve this with a workable solution. I'll remind everyone that we started down this path more than 4 years ago! Egads! See tdwg/dwc#101 as one such thread. Both @tucotuco and @mdoering raised their concerns over exactly the approach we seem to have settled on here. They recommended that we put some effort in an extension. Although an extension to DwC-A also has its shortcomings because of the limitations in DwC-A's star schema, it is closer to what the collections community desires. Continuing to use a series of terms like identifiedBy and recordedBy that do not separate the action from the agent & now making it even more ropy by additionally holding identifiers like ORCID feels like piling on new problems. Why not instead put all our efforts in the extension now?

Although we're couching this as an interim solution, one that is specific to those who share flat(-ish) data such as iNaturalist and providers who have spreadsheets, I fear that it will (a) set a precedent for other DwC fields that could also do with URIs and not strings or lists of strings; and (b) become near impossible to extricate ourselves when a more thorough solution finally materializes. Do we really want to maintain backward compatibility with this? Will we ignore this in the IPT as a solution that is not (yet?) endorsed by TDWG? For how long?

Closer to home for me, this proposed approach will obligate me to write code around it for Bloodhound. Although this will not be particularly difficult, people will ask why their claims do not appear there. I cannot ignore it.

I'd be convinced that this easy first step is moving us in the right direction if I could see this as a step forward.

@timrobertson100
Member Author

Thanks, @dshorthouse - I genuinely appreciate your candor.

...concerns over exactly the approach we seem to have settled on here...

Just to clarify - I'm not sure we've settled, hence my reaching out to you all to comment.

Some of my thoughts:

My personal feeling is that enabling the simplest solution for someone to capture their ORCID ID in CSV/Excel for their observations is beneficial. From my own networking, I suspect Excel-like applications still represent some of the most used tools in the community and large amounts of data are flat tables. recordedByOrcidID is to my mind the simplest way to achieve this.

For more complex data (multiple people, actions/roles etc) I agree that shoe-horning into flat CSV is not advisable, and in DwC the extension approach is better. I was following precedent from other terms to accommodate multi-values. I'd be supportive of promoting the extension approach for collections-based data if it is your recommendation (but still believe recordedByOrcidID to be valuable).

DwC has explicit ID fields for collections, datasets, organism, material sample, event, location, identification, a variety of name-related concepts, resources etc. Some of these identify a digital object and others identify the output of a process (an identification). This is enough precedent to me to add one for people (or the result of an action from a person).

Closer to home for me, this proposed approach will obligate me to write code around it for Bloodhound. Although this will not be particularly difficult, people will ask why their claims do not appear there. I cannot ignore it.

In-part this work is exactly to help Bloodhound by removing the need to claim data if already referenced... so this is important. Can you elaborate on why this would be complex though?

@dshorthouse

dshorthouse commented Jan 28, 2020

In-part this work is exactly to help Bloodhound by removing the need to claim data if already referenced... so this is important. Can you elaborate on why this would be complex though?

It will in effect jump-start what I need to eventually do so this is a good thing. I record who made the assertion that collector with ORCID 'x' (or Wikidata Q number 'x') collected specimen 'y' + when they did it and then share this along with the specimen data in frictionless data downloads for every GBIF dataset. My expectation was that collections managers need some handle on how to trust the assertions from Bloodhound and how to discover when they were made before deciding to incorporate into their in-house collections data. Many of the people who made the assertions could be staff members & so those may be more 'trustworthy' than an unknown attributor. When these collector <=> occurrence linkages via recordedByOrcidID are eventually included in GBIF DwC-A downloads, I'll have no means of reflecting any of those assertions in the same way & so have to differentiate them for users in Bloodhound. The latter permits anyone to correct assertions. Do I lock those records and prevent that activity when the assertions originate from recordedByOrcidID? Do I verify that existing assertions in Bloodhound align exactly with what's in recordedByOrcidID? What if they conflict? What if the assertions in Bloodhound are made by the collector him/herself whereas what comes from the collections is incorrect because it was made by a volunteer? This is not your problem, but is nonetheless a window into my machinations and gymnastics. As I said, I have to eventually deal with this anyway. Or not.

But...

When/if we do have an agreed extension for attributions/actions, that too will appear as an additional csv file in each of GBIF's DwC-A downloads, raising the possibility of internal conflicts or ambiguity when there could be use of recordedByOrcidID in the core that does not align with what is in the extension. And so we must make a decision. Ignore the content in recordedByOrcidID when functionally equivalent entries exist for that occurrence record in the extension? Will that arbitrary decision result in someone not receiving due credit for their efforts?

All the above is rather theoretical and admittedly overblown. But, remember the goal and bear in mind the outcome. It's not merely another class of data object we're learning how to play with. We're walking down a path that will lead to a new credit system for people.

@timrobertson100
Member Author

timrobertson100 commented Jan 29, 2020

Thanks again for the context. What you outline is one reason why GBIF never introduced an annotation system - determining if an annotation is still valid when the source data changes is a challenge.

raising the possibility of internal conflicts or ambiguity

This is everywhere already (higher classifications and scientificName, geography/country and coordinates, scientificNameID and scientificName, eventDate and day, month, year) and disambiguating that is a large part of what GBIF.org flags. For people, won't there be possible conflicts and uncertainties surrounding values found in recordedBy and recordedByID no matter where the content is captured?

that will lead to a new credit system for people

This is something to aspire to but I propose a more modest objective: those who make the effort to add an ORCID ID to their data can subsequently find it using their ORCID ID (possibly for other ID schemes too).

@timrobertson100
Member Author

We are asked again by iNaturalist what our recommendation is and I am concerned this thread has stalled similar to previous attempts to progress this topic.

I propose we move to enable users of spreadsheets and DwC-A to provide a column containing IDs for a person, or people in recordedByID and identifiedByID fields. It will not satisfy all scenarios but will allow for the majority of simple cases where a recording system captures the ORCID of the user (e.g. tied to their login account) and GBIF can then enable them to use that to find their content unambiguously. The Agent extension, when ready, will enable all the more complex scenarios for those who wish them and would like to work on that solution.

Conflicting opinions have been expressed on whether a generic field (e.g. recordedByID) or a specific one (e.g. recordedByOrcidID) would be preferable. Due to the simplicity of recognizing ORCID and Wikidata identifiers, I am inclined to propose recordedByID and identifiedByID terms in a GBIF namespace for the reasons @ben3000 outlines. I expect in time the terms will be replaced by more formal homes.

Is this something that people can live with, please?

@kueda @ben3000 @dshorthouse @qgroom

@dshorthouse

As for the draft "AgentActions" extension for this (see https://github.com/tdwg/attribution/tree/master/dwc), we've been trying to stitch our action terms into the VIVO ontology & have been subject to that group's workflow and timeframe. The intent is to have some URIs to reference in our vocabs & not do this in a vacuum. See vivo-project/VIVO#141 in which @diatomsRcool is the point person. In the meantime, recordedByID and identifiedByID are logical choices & I'd recommend http URIs separated by pipes. When it comes to display on GBIF, you'll need to additionally grab people names via ORCID's API (or not) w/o reference to the mess of strings in recordedBy or identifiedBy.

@debpaul

debpaul commented Mar 6, 2020 via email

@timrobertson100
Member Author

Thanks @debpaul

Since there are many fields that support multi-value already I'd propose we just follow the same process e.g.

{
  "recordedByID" : "https://orcid.org/0000-0001-6215-3617",
  "identifiedByID" : "https://orcid.org/0000-0002-8442-8025 | https://www.wikidata.org/entity/Q553155" 
}

I anticipate GBIF processing would recognise identifiers given without the full URI (Q553155 and 0000-0001-6215-3617), but the recommendation would be to include the full URI.
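
A minimal sketch of that multi-value handling, under stated assumptions: the function name `expand_agent_ids` is hypothetical, whitespace around the pipe is tolerated as in the JSON above, and bare Wikidata Q IDs are expanded to the `https://www.wikidata.org/entity/` form used in this thread. It is an illustration only, not the GBIF implementation.

```python
import re
from typing import List

ORCID_BARE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
WIKIDATA_BARE = re.compile(r"^Q\d+$")


def expand_agent_ids(raw: str, delimiter: str = "|") -> List[str]:
    """Split a multi-value field and expand bare identifiers to full URIs.

    Values that are neither a bare ORCID iD nor a bare Wikidata Q ID are
    passed through unchanged (for example, values that are already URIs).
    """
    expanded = []
    for part in raw.split(delimiter):
        value = part.strip()
        if not value:
            continue  # ignore empty slots between delimiters
        if ORCID_BARE.match(value):
            value = "https://orcid.org/" + value
        elif WIKIDATA_BARE.match(value):
            value = "https://www.wikidata.org/entity/" + value
        expanded.append(value)
    return expanded


record = {
    "recordedByID": "https://orcid.org/0000-0001-6215-3617",
    "identifiedByID": "https://orcid.org/0000-0002-8442-8025 | https://www.wikidata.org/entity/Q553155",
}
for term, raw in record.items():
    print(term, expand_agent_ids(raw))
```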

@kueda confirms iNaturalist only seek single values in recordedByID for the time being and thought would be needed on what to put into identifiedBy and identifiedByID fields anyway.

@debpaul

debpaul commented Mar 6, 2020 via email

@qgroom

qgroom commented Mar 6, 2020

Hi Tim,
yes, I can certainly live with this and strongly encourage it.

Regarding the fine details. Dealing with teams would be nice, but it presents so many difficulties I recommend not including it for now.

@timrobertson100
Member Author

For the GBIF pipeline parsing, we will then expect:

http://rs.gbif.org/terms/1.0/recordedByID
http://rs.gbif.org/terms/1.0/identifiedByID

and parse them for ORCID and Wikidata Q IDs (expecting full URIs, but accepting other forms where possible) and support | (pipe) delimiters for multi-values. The guidelines should always recommend the full URI, and for more complex scenarios we'll await the agent extension.
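
To illustrate how a consumer might locate these two terms in a Darwin Core Archive, the sketch below reads column indexes from a `meta.xml` descriptor and splits the matching CSV columns on the pipe. The descriptor, the data row, the collector name and the function names are all hypothetical, made up for this example; only the two term URIs and the delimiter come from the comment above.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Dict, List

DWCA_NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}
AGENT_ID_TERMS = (
    "http://rs.gbif.org/terms/1.0/recordedByID",
    "http://rs.gbif.org/terms/1.0/identifiedByID",
)

# Hypothetical archive descriptor and data file, for illustration only.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
        fieldsTerminatedBy="," ignoreHeaderLines="1">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/recordedBy"/>
    <field index="2" term="http://rs.gbif.org/terms/1.0/recordedByID"/>
    <field index="3" term="http://rs.gbif.org/terms/1.0/identifiedByID"/>
  </core>
</archive>"""

OCCURRENCE_CSV = (
    "id,recordedBy,recordedByID,identifiedByID\n"
    "1,A. Collector,https://orcid.org/0000-0001-6215-3617,"
    "https://orcid.org/0000-0002-8442-8025|https://www.wikidata.org/entity/Q553155\n"
)


def agent_id_columns(meta_xml: str) -> Dict[str, int]:
    """Map each agent-ID term URI found in the core to its column index."""
    root = ET.fromstring(meta_xml)
    columns = {}
    for field in root.findall("dwca:core/dwca:field", DWCA_NS):
        term = field.get("term")
        if term in AGENT_ID_TERMS:
            columns[term] = int(field.get("index"))
    return columns


def read_agent_ids(meta_xml: str, csv_text: str) -> List[Dict[str, List[str]]]:
    """Return, per data row, the pipe-separated values for each agent-ID term."""
    columns = agent_id_columns(meta_xml)
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the single header row assumed by this sketch
    return [
        {
            term: [v.strip() for v in row[index].split("|") if v.strip()]
            for term, index in columns.items()
        }
        for row in reader
    ]


print(read_agent_ids(META_XML, OCCURRENCE_CSV))
```

Keying on the full term URI rather than the column header means the lookup does not depend on how a publisher names the CSV columns.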

muttcg added a commit to gbif/dwc-api that referenced this issue Mar 10, 2020
muttcg added a commit to gbif/dwc-api that referenced this issue Mar 10, 2020
muttcg added a commit to gbif/gbif-api that referenced this issue Mar 10, 2020
Added id types
muttcg added a commit to gbif/gbif-api that referenced this issue Mar 12, 2020
Added UserIdentifier valueOf string parser
muttcg added a commit to gbif/gbif-api that referenced this issue Mar 13, 2020
Removed plural form
muttcg added a commit to gbif/occurrence that referenced this issue Mar 13, 2020
Added recordedByID into es query builder
muttcg added a commit that referenced this issue Mar 13, 2020
Use list of strings for hdfs avro
muttcg added a commit to gbif/gbif-api that referenced this issue Mar 13, 2020
Removed UserIdentifier valueOf string parser
muttcg added a commit to gbif/gbif-api that referenced this issue Mar 13, 2020
Added empty constructor
muttcg added a commit to gbif/occurrence that referenced this issue Mar 13, 2020
Added recordedByID into hdfs query builder and etc.
muttcg added a commit to gbif/dwc-api that referenced this issue Mar 13, 2020
Added agentID term
muttcg added a commit to gbif/gbif-api that referenced this issue Mar 13, 2020
Remaned to AgentIdentifier and etc.
muttcg added a commit that referenced this issue Mar 13, 2020
Renamed User to Agent*
muttcg added a commit to gbif/occurrence that referenced this issue Mar 13, 2020
Added agentId term insted of recordedById
muttcg added a commit to gbif/gbif-api that referenced this issue Mar 16, 2020
Use ID instead of Id
muttcg added a commit to gbif/occurrence that referenced this issue Mar 17, 2020
Use latest dwc-io
muttcg added a commit to gbif/occurrence that referenced this issue Mar 17, 2020
GbifTerm.recordedByID, GbifTerm.identifiedByID added in verbatim list
muttcg added a commit to gbif/occurrence that referenced this issue Mar 17, 2020
Changed terms order
muttcg added a commit that referenced this issue Mar 17, 2020
Use schema generated by gbif/occurrence
muttcg added a commit to gbif/occurrence that referenced this issue Mar 18, 2020
Use release versions
muttcg added a commit that referenced this issue Mar 18, 2020
Use release versions
muttcg added a commit that referenced this issue Mar 19, 2020
Use release versions
muttcg added a commit that referenced this issue Mar 19, 2020
Use release versions
muttcg added a commit to gbif/dwc-api that referenced this issue Mar 20, 2020
Removed agentID
muttcg added a commit to gbif/gbif-api that referenced this issue Mar 20, 2020
Removed agentId as a term
muttcg added a commit that referenced this issue Mar 20, 2020
Removed agentId as a term
muttcg added a commit to gbif/occurrence that referenced this issue Mar 20, 2020
Removed agentId as a term
muttcg added a commit to gbif/occurrence that referenced this issue Mar 23, 2020
Removed terms from simple downloads terms list
muttcg added a commit to gbif/dwc-api that referenced this issue Mar 24, 2020
identifiedByID changed group to GROUP_IDENTIFICATION
muttcg added a commit to gbif/dwc-api that referenced this issue Mar 24, 2020
identifiedByID changed group to GROUP_IDENTIFICATION
@muttcg
Member

muttcg commented Mar 25, 2020

In production

6 participants