recordedBy and identifiedByID using OrcID, Wikidata, Library of Congress #3623

Closed
Jegelewicz opened this issue May 27, 2021 · 25 comments
Labels
Aggregator issues: e.g., GBIF, iDigBio, etc.
Function-Agents
NeedsDocumentation: When the issue is resolved in Arctos repository, this should be moved to the Documentation-wiki repo.
Priority-Normal (Not urgent): Normal because this needs to get done but not immediately.
@Jegelewicz
Member

See tdwg/dwc#102 (comment)

We should be prepared to share unique IDs for agents as much as possible. It would be a great project to have an intern add these to agents.

@Jegelewicz Jegelewicz added Aggregator issues e.g., GBIF, iDigBio, etc Function-Agents NeedsDocumentation When the issue is resolved in Arctos repository, this should be moved to the Documentation-wiki repo Priority-Normal (Not urgent) Normal because this needs to get done but not immediately. labels May 27, 2021
@mkoo
Member

mkoo commented May 27, 2021

Good suggestion. What's the timeframe on this, do you think? Is this part of the current DwC makeover? (Sorry, didn't dig into the thread too much.) This could be part of Genna's work this summer as she's reviewing projects.

@Jegelewicz
Member Author

This is currently in the comment period for TDWG. https://www.tdwg.org/news/2021/public-review-of-darwin-core-maintenance-proposals/

@dustymc
Contributor

dustymc commented May 27, 2021

share unique IDs for agents

#2141, although that doesn't seem to quite align with what's being proposed for DWC.

Burying that in all of the places we might share Agents could get really big really fast.

We should revisit #2131 in light of #2141 (comment) before we get too crazy with JSON, should our response be getting crazy with JSON.

@dustymc dustymc added this to the Needs Discussion milestone May 27, 2021
@campmlc

This comment was marked as off-topic.

@dustymc

This comment was marked as off-topic.

@Jegelewicz

This comment was marked as off-topic.

@campmlc

This comment was marked as off-topic.

@Jegelewicz
Member Author

This would be one way we could collaborate with @dshorthouse. If we pass an ORCiD or Wikidata identifier in recordedByID and identifiedByID, could he magically assign our records to the correct people in Bionomia?

@ewommack

could he magically assign our records to the correct people in Bionomia?

I like magic. Making magic happen would be great.

@dshorthouse

dshorthouse commented Apr 13, 2022

Yes! I like magic too. There are ~2.5M examples from Harvard, University of Oslo (HT @rukayaj), and elsewhere that now use recordedByID and identifiedByID. Take a close look at the definitions for those terms and the examples, http://rs.tdwg.org/dwc/terms/recordedByID.

Besides ORCID and wikidata, Bionomia also recognizes VIAF, ISNI, ZooBank person IDs, BHL creator IDs, and Library of Congress IDs if shared in recordedByID or identifiedByID and resolves these against wikidata. I flush all these and rebuild them every two weeks, which is the cleanest way I could think to ensure that there are no persistent oopsies. This part of the existing claims processing is getting slower and slower at my end because of the many, many calls to ORCID and wikidata so this gives me incentive to find a faster way to do this.

Addendum: Forgot to mention that if a wikidata URI comes in, I also check for a DOB >120 years ago or a DOD there before it's allowed to slip through. This doesn't mean you cannot use wikidata URIs for the living if you so desire, it's just one of Bionomia's soft rules. Another soft rule is that if an ORCID ID is used in recordedByID or identifiedByID but the profile is not yet made public in Bionomia, the profile remains private until that person logs in and chooses to flip the switch.
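
A minimal sketch of the Wikidata soft rule as it is described above. This is not Bionomia's code: the function name, its parameters, and the choice to reject entries with no dates at all are assumptions made for illustration.

```python
from datetime import date
from typing import Optional


def wikidata_uri_passes_soft_rule(
    birth: Optional[date],
    death: Optional[date],
    today: date,
    min_years_since_birth: int = 120,
) -> bool:
    """Return True if a Wikidata person URI would be let through.

    Accepts when a date of death is recorded, or when the date of birth
    is more than `min_years_since_birth` years before `today`.
    """
    if death is not None:
        return True
    if birth is None:
        # Assumption: with neither date available, do not let it through.
        return False
    try:
        cutoff = today.replace(year=today.year - min_years_since_birth)
    except ValueError:
        # `today` is Feb 29 and the target year is not a leap year.
        cutoff = today.replace(year=today.year - min_years_since_birth, day=28)
    return birth < cutoff
```

The date lookup itself (querying Wikidata for the birth and death dates) is left out on purpose, since the thread does not say how it is done.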

@Jegelewicz
Member Author

@dustymc @dbloom how quickly could we add these two things to our DwC archives on the IPT?

  1. We need to populate the fields (any collector or identification determiner agent with Wikidata, ORCiD or LoC)
  2. Do we need to do anything on VertNet IPT end once we add them?

@Jegelewicz
Member Author

Jegelewicz commented Apr 13, 2022

Since Harvard is doing this already (Arctos), it should be a thing....

https://www.gbif.org/occurrence/3095854930

@Jegelewicz Jegelewicz changed the title recordedBy and using OrcID, Wikidata recordedBy and using OrcID, Wikidata, Library of Congress Apr 13, 2022
@Jegelewicz
Member Author

Brendan,

I am the Arctos Community Coordinator and we have been discussing adding recordedByID and identifiedByID. I am curious to know where you store this data and how you prepare it for publication to IPT.

We have been discussing how the data should appear when there is more than one recordedBy or identifiedBy agent. How do you handle this situation?

If an in person conversation would be easier, let me know. I'm happy to meet up.

Adios,

Teresa J. Mayfield-Meyer

@Jegelewicz
Member Author

#4548 (comment)

@Jegelewicz
Member Author

From Paul at MCZ

Starting with where the information is stored:

We added pairs of standardized fields to hold GUIDs for agent, taxonomy, and geog_auth_rec. One field holds the guid, the other field holds the authority. In AGENT this is AGENTGUID and AGENTGUID_GUID_TYPE:

https://github.com/MCZbase/DDL/blob/master/TABLE/AGENT.sql

The behavior of these pairs of fields is controlled using values in a code table, CTGUID_TYPE

https://github.com/MCZbase/DDL/blob/master/TABLE/CTGUID_TYPE.sql

We allow two and only two guid authorities for agents, ORCID and VIAF. The corresponding entries in the code table are:

"GUID_TYPE","DESCRIPTION","APPLIES_TO","PLACEHOLDER","PATTERN_REGEX","RESOLVER_REGEX","RESOLVER_REPLACEMENT","SEARCH_URI"
"VIAF","OCLC's VIAF (Virtual International Authority File)","agent.agentguid","http://viaf.org/viaf/nnnnn","^http://viaf.org/viaf/[0-9]+$","","","https://viaf.org/viaf/search?sortKeys=holdingscount&recordSchema=BriefVIAF&query=local.personalNames all "
"ORCiD","Open Researcher and Contributor ID","agent.agentguid","https://orcid.org/9999-9999-9999-9999","^https://orcid.org/[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$","","","https://orcid.org/orcid-search/search?searchQuery="

The GUID controls are presented in the user interface for adding/editing as a set, one control to pick which guid authority is to be used (on selection, another control links to a search on that authority for the relevant entity), a control into which to paste the guid, which requires it to match the expected pattern, and a control which shows a current guid, linked out to the resolving authority.

https://github.com/MCZbase/MCZbase/blob/0bced3993c1bfee890afdd316cfdd8f09df50980/editAllAgent.cfm#L193

Thus we store a guid in a form that fits the pattern_regex for the selected guid type, and know from the resolver_regex/resolver_replacement how to translate the stored value into a resolvable reference.
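
To illustrate the pattern check, here is a rough sketch that validates a pasted guid against the PATTERN_REGEX for the selected guid type. The two patterns are taken from the CTGUID_TYPE rows quoted in this email; the dictionary and the function name are hypothetical, and MCZbase itself does this in its own ColdFusion/SQL code rather than in Python.

```python
import re

# PATTERN_REGEX values from the two CTGUID_TYPE rows quoted in the email.
# The dots are unescaped in the source patterns and are kept that way here.
GUID_PATTERNS = {
    "VIAF": r"^http://viaf.org/viaf/[0-9]+$",
    "ORCiD": r"^https://orcid.org/[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$",
}


def is_valid_agent_guid(guid_type: str, guid: str) -> bool:
    """Return True if `guid` matches the pattern registered for `guid_type`."""
    pattern = GUID_PATTERNS.get(guid_type)
    if pattern is None:
        # Unknown authority: nothing to validate against, so reject.
        return False
    # fullmatch also rejects a value with a trailing newline.
    return re.fullmatch(pattern, guid) is not None
```

For example, `is_valid_agent_guid("ORCiD", "0000-0002-1825-0097")` is rejected because the stored form must be the full https://orcid.org/ URI.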

The value of AGENTGUID for a collector and a determiner are mapped into FLAT as RECORDEDBYID and IDENTIFIEDBYID

https://github.com/MCZbase/DDL/blob/cec4447d35c2bd44d07cec2b8cf470777568cb6b/TABLE/FLAT.sql#L147

by invoking a pair of functions, each of which returns the guid for an agent if that agent is the sole collector or the sole determiner:

https://github.com/MCZbase/DDL/blob/master/FUNCTION/GET_SOLE_COLLECTOR_GUID.sql
https://github.com/MCZbase/DDL/blob/master/FUNCTION/GET_SOLE_DETERMINER_GUID.sql

This is done in UPDATE_FLAT, e.g.
https://github.com/MCZbase/DDL/blob/cec4447d35c2bd44d07cec2b8cf470777568cb6b/PROCEDURE/UPDATE_FLAT.sql#L167

And these two fields are carried into FILTERED_FLAT and DIGIR_FILTERED_FLAT, and queried from there in IPT and mapped onto dwc:recordedByID and dwc:identifiedByID in the Occurrence core.

Neither dwc:recordedByID nor dwc:identifiedByID allows for multiplicity within the term, so in the flat Darwin Core of the Occurrence core in IPT, we chose to map the agent guids only if there was a single collector agent or a single determiner agent. It is a much larger task to get agent guids populated, so we decided that it was better to focus on filling in that information where we could do so cleanly than to worry about multiplicity in terms that aren't intended to handle multiple values.
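
The "sole agent" rule can be sketched roughly as follows. This is only an illustration of the logic described here, not a translation of GET_SOLE_COLLECTOR_GUID or GET_SOLE_DETERMINER_GUID; the `Agent` class and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Agent:
    name: str
    guid: Optional[str] = None  # e.g. a full ORCID or VIAF URI, if known


def sole_agent_guid(agents: Sequence[Agent]) -> Optional[str]:
    """Return the agent's guid only when the list holds exactly one agent.

    With zero agents or more than one, nothing is returned, so the flat
    recordedByID / identifiedByID field stays empty.
    """
    if len(agents) != 1:
        return None
    return agents[0].guid
```

The same function would serve for collectors (recordedByID) and determiners (identifiedByID), since the rule is identical for both.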

We haven't (yet) mapped the agent guid for the determiner into the identification history extension, but that would have the same concern: an identification row has one identifiedByID, which takes only a single value.

To handle multiplicity of agents, we could, but haven't yet, map multiple instances of recordedByID and identifiedByID into the (currently minimal, proof of concept) RDF representation of the occurrence that we provide via content negotiation if a mczbase.mcz.harvard.edu/guid/ IRI is requested with an accept header of text/turtle, application/rdf-xml, or application/json-ld having priority over text/html:

https://github.com/MCZbase/MCZbase/blob/master/rdf/Occurrence.cfm
https://github.com/MCZbase/MCZbase/blob/0bced3993c1bfee890afdd316cfdd8f09df50980/errors/missing.cfm#L33
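
As a rough illustration of that kind of content negotiation, the sketch below picks an RDF serialization only when the Accept header ranks it above text/html. It is not the MCZbase implementation (which is ColdFusion); the function names are hypothetical, wildcards such as `*/*` are ignored for simplicity, and the registered media type names application/rdf+xml and application/ld+json are used.

```python
from typing import Dict, Optional

RDF_TYPES = ("text/turtle", "application/rdf+xml", "application/ld+json")


def parse_accept(header: str) -> Dict[str, float]:
    """Map each media type in an Accept header to its q-value (default 1.0)."""
    weights: Dict[str, float] = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q-value as "not acceptable"
        weights[media_type] = q
    return weights


def preferred_rdf_type(accept_header: str) -> Optional[str]:
    """Return the best RDF type if it outranks text/html, else None."""
    weights = parse_accept(accept_header)
    html_q = weights.get("text/html", 0.0)
    best: Optional[str] = None
    best_q = 0.0
    for media_type in RDF_TYPES:
        q = weights.get(media_type, 0.0)
        if q > html_q and q > best_q:
            best, best_q = media_type, q
    return best
```

A return value of `None` means the request falls through to the normal HTML page.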

There we are free to follow the open world and repeat the recordedByID term for an occurrence. What we would likely do for a list of collector agents is return one dwc:recordedBy with the human readable string list of collectors as its value, and then a list of dwc:recordedByID properties for the Occurrence, one for each agent in the list of collectors that has a guid.

Other approaches are possible, particularly if you reference a guid authority (such as the HUH Botanist index) which mints guids for team agents, and then link to team agents as the collector. This approach does risk not retaining the order of collectors. In the HUH, and I believe generally in Botany, the first collector in a sequence is treated as primary, and then the order of subsequent collectors in the list doesn't matter.

-Paul

@Jegelewicz
Member Author

Check out the CETAF Botany Pilot!

https://www.youtube.com/watch?v=W-LFUOKlpe8

@Jegelewicz

This comment was marked as abuse.

@Jegelewicz
Member Author

From Bionomia webinar today - From D. Shorthouse - Please do include full URIs for people identifiers in recordedByID and identifiedByID.

Also - why can't we DO this?

@ewommack

This comment was marked as off-topic.

@dshorthouse

Please do include full URIs for people identifiers in recordedByID and identifiedByID.

What does this mean? Include the full agent profile link rather than a name?

See definitions of these Darwin Core terms that complement their non-ID (string-based) counterparts:

http://rs.tdwg.org/dwc/terms/recordedByID

http://rs.tdwg.org/dwc/terms/identifiedByID

Full URIs for ORCID IDs are in the examples there, but if you were to use Wikidata, the "entity" URI is http://www.wikidata.org/entity/Q5331679 (note the absence of the 's' in http).
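
For anyone preparing an export, a minimal sketch of turning a bare ORCID iD or Wikidata Q number into the full URI forms mentioned here. The function name and the set of accepted input shapes are assumptions, and other identifier types (VIAF, ISNI, and so on) are not handled.

```python
import re
from typing import Optional

_ORCID_RE = re.compile(
    r"(?:https?://orcid\.org/)?([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X])"
)
_WIKIDATA_RE = re.compile(
    r"(?:https?://(?:www\.)?wikidata\.org/(?:entity|wiki)/)?(Q[1-9][0-9]*)"
)


def full_person_uri(identifier: str) -> Optional[str]:
    """Return the full URI for an ORCID iD or Wikidata Q number, else None."""
    value = identifier.strip()
    match = _ORCID_RE.fullmatch(value)
    if match:
        return f"https://orcid.org/{match.group(1)}"
    match = _WIKIDATA_RE.fullmatch(value)
    if match:
        # The Wikidata "entity" URI uses http, not https.
        return f"http://www.wikidata.org/entity/{match.group(1)}"
    return None
```

Anything that is not recognizably one of the two forms returns `None`, so a plain name string never ends up in an ID field.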

@Jegelewicz

This comment was marked as off-topic.

@ewommack

This comment was marked as off-topic.

@Jegelewicz

This comment was marked as off-topic.

@Jegelewicz Jegelewicz changed the title recordedBy and using OrcID, Wikidata, Library of Congress recordedBy and identifiedByID using OrcID, Wikidata, Library of Congress Dec 7, 2023
@dustymc dustymc modified the milestones: Needs Discussion, DWC Apr 9, 2024
@mkoo
Member

mkoo commented Nov 26, 2024

Tagging the IPT mapping project. There is enthusiasm for this, and maybe it can be included in any updated mapping.

@dustymc
Contributor

dustymc commented Dec 2, 2024

Merge --> #7348

@dustymc dustymc closed this as completed Dec 2, 2024
6 participants