Enable search by ORCID ID #251
Comments
I believe ARCTOS folks may have ORCID data for at least some of their collectors/determiners, as potential people who would provide this data in a first go round. |
Does this statement suggest that alignment will be verified and enforced or is it merely a suggestion? If the former, you're going to bump into a heap of problems, requiring a very robust parser. There will be plenty of examples for these terms with strings of many people, some of whom will have ORCID IDs and others not and so placement of pipes would have to be judicious.
Because you can use ORCID to retrieve a name, I question the immediate value of attempting to align. If it's ordering of names that is of interest, I doubt this first pass will get us there. Are values for identifiedByOrcidID and recordedByOrcidID meant to be URIs? |
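To make the alignment concern concrete, here is a minimal sketch of what a strict alignment check might look like. It assumes both fields use `|` as the delimiter and that a person without an identifier is represented by an empty slot; the function name and the collector names are hypothetical, and this is not a description of what GBIF does.

```python
from typing import List, Optional, Tuple


def pair_names_with_ids(
    names: str, ids: str, delimiter: str = "|"
) -> List[Tuple[str, Optional[str]]]:
    """Pair each name with the identifier in the same position.

    Assumption: a person without an identifier is an empty slot,
    e.g. "id1 | | id3". Raises ValueError when the counts differ.
    """
    name_parts = [part.strip() for part in names.split(delimiter)]
    id_parts = [part.strip() for part in ids.split(delimiter)]
    if len(name_parts) != len(id_parts):
        raise ValueError(
            f"{len(name_parts)} names but {len(id_parts)} identifier slots"
        )
    # An empty slot becomes None so callers can tell "no ID" from an ID.
    return [(name, ident or None) for name, ident in zip(name_parts, id_parts)]


# Hypothetical collector names; the ORCID is one quoted later in this thread.
pairs = pair_names_with_ids(
    "A. Collector | B. Collector | C. Collector",
    "https://orcid.org/0000-0001-6215-3617 | | ",
)
```

Even this simple version shows the fragility: one missing or extra pipe in either column makes the whole record fail the check.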
I had intended clear guidance and people would have to be judicious. I'm happy to also relax the guidelines and treat them as independent terms, or just recommend that they align.
ORCID recommends full URI as far as I can tell, but at GBIF we'd parse both formats on indexing (we aim to accommodate variable data). We could also change the term to Bear in mind those terms are for the Excel crowd (probably the largest community of data users) and I'd also expect a statement along the lines of "For more complex needs use the Agent-Action extension". |
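As an illustration of accepting both formats, the sketch below normalizes a bare ORCID iD or a full URI to the `https://orcid.org/` form and verifies the final check character using the ISO 7064 MOD 11-2 scheme that ORCID documents. The function names are hypothetical and this is not GBIF's actual parser.

```python
import re
from typing import Optional

# Accepts a bare iD or an http(s)://orcid.org/ URI.
ORCID_PATTERN = re.compile(
    r"^(?:https?://orcid\.org/)?([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X])$"
)


def orcid_check_character(first_15_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for an ORCID iD."""
    total = 0
    for digit in first_15_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def normalize_orcid(value: str) -> Optional[str]:
    """Return the full ORCID URI, or None if the value is not a valid iD."""
    match = ORCID_PATTERN.match(value.strip())
    if match is None:
        return None
    orcid = match.group(1)
    digits = orcid.replace("-", "")
    if orcid_check_character(digits[:15]) != digits[15]:
        return None  # well-formed but fails the checksum
    return "https://orcid.org/" + orcid
```

For example, `normalize_orcid("0000-0001-6215-3617")` and `normalize_orcid("https://orcid.org/0000-0001-6215-3617")` both yield the same full URI, while a value with a wrong final character is rejected.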
So with recordedByID we could use Wikidata Q# too, yes? Or do you envision these must go in the Agent-Action extension? |
Correct. VIAF, HUH, Wikidata, etc. It can become harder to parse but an ORCID could still be found if people used the full URI. We are free to propose what we want in the guidelines. |
To all, please add to our to-do list, a webinar (Darwin Core Hour, or Biodiversity Data Standards Hour (new and coming soon), or joint Alliance Webinar, ...) on the topics of ORCID ID, Wikidata Q#, VIAF, HUH, etc, and this new feature / function / implementation, so that we can foster adoption. |
Do we need to mention what relation this will have (if any) to the identifiedBy term in the Identification History extension? |
Hi Tim and all, |
I think I would prefer that the terms avoid specifying the digital identifier type (e.g. In Western Australia, we could better support the general move to person identifiers if we could supply Wikidata Q IDs (as well as ORCID IDs) because some former staff have those, but now that they've retired, will likely never have an ORCID ID. I'd like it if applications were told that they should be dereferencing these URIs to gather the full data, rather than rely on the person's name as it was supplied in the data file. It would be nice if it was eventually possible to supply a set of IDs instead of values in
But that is probably a discussion for another time because it makes it more difficult to know which person is the primary collector. |
@ben3000 what you want to do is also our end game with an attribution extension to Darwin Core. However, we are a way off getting this extension finalized and ratified as a standard. So this small step is just a proof of concept and its simplicity is a major advantage. Practically anyone could implement recordedByOrcidID. |
I'll put this out there, merely to re-express my concern with this approach. I appreciate that GBIF is under pressure to do something on this front & it is taking far too long to resolve this with a workable solution. I'll remind everyone that we started down this path more than 4 years ago! Egads! See tdwg/dwc#101 as one such thread. Both @tucotuco and @mdoering raised their concerns over exactly the approach we seem to have settled on here. They recommended that we put some effort in an extension. Although an extension to DwC-A also has its shortcomings because of the limitations in DwC-A's star schema, it is closer to what the collections community desires. Continuing to use a series of terms like identifiedBy and recordedBy that do not separate the action from the agent & now making it even more ropy by additionally holding identifiers like ORCID feels like piling on new problems. Why not instead put all our efforts in the extension now? Although we're couching this as an interim solution, one that is specific to those who share flat(-ish) data such as iNaturalist and providers who have spreadsheets, I fear that it will (a) set a precedent for other DwC fields that could also do with URIs and not strings or lists of strings; and (b) become near impossible to extricate ourselves when a more thorough solution finally materializes. Do we really want to maintain backward compatibility with this? Will we ignore this in the IPT as a solution that is not (yet?) endorsed by TDWG? For how long? Closer to home for me, this proposed approach will obligate me to write code around it for Bloodhound. Although this will not be particularly difficult, people will ask why their claims do not appear there. I cannot ignore it. I'd be convinced that this easy first step is moving us in the right direction if I could see this as a step forward. |
Thanks, @dshorthouse - I genuinely appreciate your candor.
Just to clarify - I'm not sure we've settled, hence my reaching out to you all to comment. Some of my thoughts: My personal feeling is that enabling the simplest solution for someone to capture their ORCID ID in CSV/Excel for their observations is beneficial. From my own networking, I suspect Excel-like applications still represent some of the most used tools in the community and large amounts of data are flat tables. For more complex data (multiple people, actions/roles etc) I agree that shoe-horning into flat CSV is not advisable, and in DwC the extension approach is better. I was following precedent from other terms to accommodate multi-values. I'd be supportive of promoting the extension approach for collections-based data if it is your recommendation (but still believe DwC has explicit ID fields for collections, datasets, organism, material sample, event, location, identification, a variety of name-related concepts, resources etc. Some of these identify a digital object and others identify the output of a process (an identification). This is enough precedent to me to add one for people (or the result of an action from a person).
In part this work is exactly to help Bloodhound by removing the need to claim data if already referenced... so this is important. Can you elaborate on why this would be complex though? |
It will in effect jump-start what I need to eventually do so this is a good thing. I record who made the assertion that collector with ORCID 'x' (or Wikidata Q number 'x') collected specimen 'y' + when they did it and then share this along with the specimen data in frictionless data downloads for every GBIF dataset. My expectation was that collections managers need some handle on how to trust the assertions from Bloodhound and how to discover when they were made before deciding to incorporate into their in-house collections data. Many of the people who made the assertions could be staff members & so those may be more 'trustworthy' than an unknown attributor. When these collector <=> occurrence linkages via recordedByOrcidID are eventually included in GBIF DwC-A downloads, I'll have no means of reflecting any of those assertions in the same way & so have to differentiate them for users in Bloodhound. The latter permits anyone to correct assertions. Do I lock those records and prevent that activity when the assertions originate from recordedByOrcidID? Do I verify that existing assertions in Bloodhound align exactly with what's in recordedByOrcidID? What if they conflict? What if the assertions in Bloodhound are made by the collector him/herself whereas what comes from the collections is incorrect because it was made by a volunteer? This is not your problem, but is nonetheless a window into my machinations and gymnastics. As I said, I have to eventually deal with this anyway. Or not. But... When/if we do have an agreed extension for attributions/actions, that too will appear as an additional csv file in each of GBIF's DwC-A downloads, raising the possibility of internal conflicts or ambiguity when there could be use of recordedByOrcidID in the core that does not align with what is in the extension. And so we must make a decision. Ignore the content in recordedByOrcidID when functionally equivalent entries exist for that occurrence record in the extension? 
Will that arbitrary decision result in someone not receiving due credit for their efforts? All the above is rather theoretical and admittedly overblown. But, remember the goal and bear in mind the outcome. It's not merely another class of data object we're learning how to play with. We're walking down a path that will lead to a new credit system for people. |
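The core-versus-extension conflict described above can be sketched as a simple set comparison. This is only an illustration under stated assumptions: the core value is a `|`-delimited string of identifiers, the extension rows have already been reduced to a list of identifier strings for the same occurrence, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List


def compare_core_and_extension(
    core_value: str, extension_ids: Iterable[str], delimiter: str = "|"
) -> Dict[str, List[str]]:
    """Report which agent identifiers appear in the core field,
    in the extension rows, or in both, for one occurrence record."""
    core_ids = {p.strip() for p in core_value.split(delimiter) if p.strip()}
    ext_ids = {i.strip() for i in extension_ids if i.strip()}
    return {
        "in_both": sorted(core_ids & ext_ids),
        "only_in_core": sorted(core_ids - ext_ids),
        "only_in_extension": sorted(ext_ids - core_ids),
    }


report = compare_core_and_extension(
    "https://orcid.org/0000-0001-6215-3617 | https://orcid.org/0000-0002-8442-8025",
    ["https://orcid.org/0000-0001-6215-3617"],
)
```

A non-empty `only_in_core` or `only_in_extension` list is exactly the situation where a consumer has to decide which source to trust; the code can surface the disagreement but cannot resolve it.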
Thanks again for the context. What you outline is one reason why GBIF never introduced an annotation system - determining if an annotation is still valid when the source data changes is a challenge.
This is everywhere already (higher classifications and scientificName, geography/country and coordinates, scientificNameID and scientificName, eventDate and day,month,year) and disambiguating that is a large part of what GBIF.org flags. For people won't there be possible conflicts and uncertainties surrounding values found in
This is something to aspire to but I propose a more modest objective: those who make the effort to add an ORCID ID to their data can subsequently find it using their ORCID ID (possibly for other ID schemes too). |
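The modest objective amounts to a lookup from identifier to records. A minimal in-memory sketch follows, assuming records are plain dictionaries with an `occurrenceID` key and `|`-delimited ID fields; the occurrence IDs and the function name are hypothetical, and a real index (for example a search cluster) would look quite different.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def build_agent_index(
    records: Sequence[dict],
    fields: Sequence[str] = ("recordedByOrcidID", "identifiedByOrcidID"),
) -> Dict[str, List[str]]:
    """Map each agent identifier to the occurrence IDs that mention it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for field in fields:
            for part in record.get(field, "").split("|"):
                agent_id = part.strip()
                # Skip empty slots and avoid listing a record twice.
                if agent_id and record["occurrenceID"] not in index[agent_id]:
                    index[agent_id].append(record["occurrenceID"])
    return dict(index)


# Hypothetical records for illustration only.
records = [
    {"occurrenceID": "occ-1",
     "recordedByOrcidID": "https://orcid.org/0000-0001-6215-3617"},
    {"occurrenceID": "occ-2",
     "recordedByOrcidID": "https://orcid.org/0000-0001-6215-3617",
     "identifiedByOrcidID": "https://orcid.org/0000-0002-8442-8025"},
]
index = build_agent_index(records)
```

Looking up `index["https://orcid.org/0000-0001-6215-3617"]` then returns both occurrence IDs, which is all the "find my content by my ORCID ID" use case requires.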
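We are asked again by iNaturalist what our recommendation is and I am concerned this thread has stalled similar to previous attempts to progress this topic. I propose we move to enable users of spreadsheets and DwC-A to provide a column containing IDs for a person, or people in recordedByID and identifiedByID fields. It will not satisfy all scenarios but will allow for the majority of simple cases where a recording system captures the ORCID of the user (e.g. tied to their login account) and GBIF can then enable them to use that to find their content unambiguously. The Agent extension, when ready, will enable all the more complex scenarios for those who wish them and would like to work on that solution. Contradicting opinions have been expressed on whether a generic field (e.g. recordedByID) or specific (e.g. recordedByOrcidID) would be preferable. Due to the simplicity of recognizing ORCID and Wikidata identifiers, I am inclined to propose recordedByID and identifiedByID terms in a GBIF namespace for the reasons @ben3000 outlines. I expect in time the terms will be replaced by more formal homes. Is this something that people can live with, please? |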
As for the draft "AgentActions" extension for this (see https://github.com/tdwg/attribution/tree/master/dwc), we've been trying to stitch our action terms into the VIVO ontology & have been subject to that group's workflow and timeframe. The intent is to have some URIs to reference in our vocabs & not do this in a vacuum. See vivo-project/VIVO#141 in which @diatomsRcool is the point person. In the meantime, |
Tim,
To be clear, would you envision recordedByID be a single-value (because
you say "simplicity of [automatically] recognizing ORCID and Wikidata
identifiers")? Or would you anticipate this to be another field in which
a specific separator (like pipe?) might be used because of the 1:many
situation?
I'm for whatever means we can start doing this soonest. Although, I
really would like an extension so that we can do this more elegantly.
The 1:many is clearly an issue for more than iNat.
Deb
…On 2020-03-06 4:48 AM, Tim Robertson wrote:
We are asked again by iNaturalist
<https://github.com/gbif/occurrence/issues/89>
what our recommendation is and I am concerned this thread has stalled
similar to previous attempts to progress this topic.
I propose we move to enable users of spreadsheets and DwC-A to provide
a column containing IDs for a person, or people in |recordedByID| and
|identifiedByID| fields. It will not satisfy all scenarios but will
allow for the majority of simple cases where a recording system
captures the ORCID of the user (e.g. tied to their login account) and
GBIF can then enable them to use that to find their content
unambiguously. The Agent extension, when ready, will enable all the
more complex scenarios for those who wish them and would like to work
on that solution.
Contradicting opinions have been expressed on whether a generic field
(e.g. |recordedByID|) or specific (e.g. |recordedByOrcidID|) would be
preferable. Due to the simplicity of recognizing ORCID and Wikidata
identifiers, I am inclined to propose |recordedByID| and
|identifiedByID| terms in a GBIF namespace for the reasons @ben3000
outlines. I expect in time the terms will be replaced by more formal homes.
Is this something that people can live with, please?
@kueda @ben3000 @dshorthouse @qgroom
--
-- Upcoming iDigBio Events https://www.idigbio.org/calendar
-- Deborah Paul, iDigBio Digitization and Workforce Development Manager
iDigBio -- Steering Committee Member
SPNHC Liaison, Member-At-Large and Member International Relations Committee
ICEDIG External Advisory Board Member https://icedig.eu/
Vice Chair, Biodiversity Information Standards Organisation (TDWG)(2019-2021)
Managing Editor, Biodiversity Information Science and Standards (BISS) https://biss.pensoft.net/board/
Institute for Digital Information, 234 LSB
Florida State University
Tallahassee, Florida 32306
850-644-6366
|
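Thanks @debpaul Since there are many fields that support multi-value already I'd propose we just follow the same process e.g. { "recordedByID" : "https://orcid.org/0000-0001-6215-3617", "identifiedByID" : "https://orcid.org/0000-0002-8442-8025 | https://www.wikidata.org/entity/Q553155" }
I anticipate GBIF processing would recognise those without the full URI (Q553155 and 0000-0001-6215-3617) but recommendations would be to include them. @kueda confirms iNaturalist only seek single values in recordedByID for the time being and thought would be needed on what to put into identifiedBy and identifiedByID fields anyway. |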
Thanks Tim,
I was hoping, in that providing some sort of (even if clunky pipe)
method for 1:many was offered, it would support a smoother transition to
providing this richer data to a more elegant extension in the future. In
other words, give people a way to provide more information if they have
it, rather than sticking it in a notes field in perpetuity.
…On 2020-03-06 12:48 PM, Tim Robertson wrote:
Thanks @debpaul
Since there are many fields that support multi-value already I'd
propose we just follow the same process e.g.
{ "recordedByID" : "https://orcid.org/0000-0001-6215-3617",
"identifiedByID" : "https://orcid.org/0000-0002-8442-8025 |
https://www.wikidata.org/entity/Q553155" }
I anticipate GBIF processing would recognise those without the full
URI (|Q553155| and |0000-0001-6215-3617|) but recommendations would be
to include them.
@kueda confirms
<https://github.com/gbif/occurrence/issues/89#issuecomment-595836145>
iNaturalist only seek single values in |recordedByID| for the time
being and thought would be needed on what to put into |identifiedBy|
and |identifiedByID| fields anyway.
|
Hi Tim, Regarding the fine details. Dealing with teams would be nice, but it presents so many difficulties I recommend not including it for now. |
For the GBIF pipeline parsing, we will then expect: http://rs.gbif.org/terms/1.0/recordedByID and parse them for ORCID and wikidata Q IDs (expecting full URI, but accepting others where possible) and support |
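As a rough illustration of the parsing described here (full URIs expected, bare identifiers accepted where possible), the sketch below splits a `|`-delimited value and classifies each entry as an ORCID, a Wikidata entity, or something else. The type labels and function name are hypothetical; this is not the code in the GBIF pipelines.

```python
import re
from typing import List, Tuple

ORCID_RE = re.compile(
    r"^(?:https?://orcid\.org/)?([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X])$"
)
WIKIDATA_RE = re.compile(
    r"^(?:https?://www\.wikidata\.org/(?:entity|wiki)/)?(Q[0-9]+)$"
)


def parse_agent_ids(raw: str, delimiter: str = "|") -> List[Tuple[str, str]]:
    """Split a multi-value field and return (type, normalized value) pairs.

    Bare identifiers are expanded to full URIs; anything unrecognized
    (e.g. VIAF or HUH identifiers) is kept verbatim as "OTHER".
    """
    results: List[Tuple[str, str]] = []
    for part in raw.split(delimiter):
        value = part.strip()
        if not value:
            continue  # ignore empty slots
        orcid = ORCID_RE.match(value)
        wikidata = WIKIDATA_RE.match(value)
        if orcid:
            results.append(("ORCID", "https://orcid.org/" + orcid.group(1)))
        elif wikidata:
            results.append(
                ("WIKIDATA", "https://www.wikidata.org/entity/" + wikidata.group(1))
            )
        else:
            results.append(("OTHER", value))
    return results


parsed = parse_agent_ids(
    "https://orcid.org/0000-0002-8442-8025 | https://www.wikidata.org/entity/Q553155"
)
```

Keeping unrecognized values as `OTHER` rather than dropping them means identifiers from other schemes survive the parse even if they are not yet searchable.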
Added UserIdentifier valueOf string parser
Added recordedByID into es query builder
Removed UserIdentifier valueOf string parser
Added empty constructor
Added recordedByID into hdfs query builder and etc.
Renamed to AgentIdentifier and etc.
Added agentId term instead of recordedById
GbifTerm.recordedByID, GbifTerm.identifiedByID added in verbatim list
Changed terms order
Use release versions
Removed agentId as a term
Removed terms from simple downloads terms list
identifiedByID changed group to GROUP_IDENTIFICATION
In production |
This text is in draft, external review will be sought and this description updated as comments are made
GBIF wishes to promote wider use of ORCID IDs; the first step is to enable search and download capabilities by ORCID IDs for those datasets that follow guidelines in how to use ORCID IDs.
I propose that these guidelines include:
recordedByOrcidID and identifiedByOrcidID for use in CSV (i.e. header row) and in the Occurrence files in DwC-A files (in the meta.xml in a GBIF namespace). These fields will allow for multivalue using a delimiter (recommended as |). Where multiple values exist, these are required to align with recordedBy and identifiedBy. If proven useful, these terms can be proposed to DwC.
version 1 extension. It can then be revised if necessary, knowing significant changes may require early adopters to remap the data.
Support for mapping the ABCD version 3 will be explored by the BGBM. (Discussed and will be reconsidered in 2021.)
iNaturalist currently share data using a recordedByOrcid (note: not recordedByOrcidID). Once enabled we should request they change that, or we accommodate that as meaning recordedByOrcidID.
.The initial functionality needs to support: