Clustering: Improve NCBI (iBOL) record links #733

timrobertson100 · 2022-06-14T14:19:48Z

https://www.gbif.org/occurrence/search?q=KJ139497

I think these Arctos and NCBI records should link (see catalogNumber).

timrobertson100 · 2022-06-21T07:58:38Z

The records are in the candidates table but are dropped in the final comparison: the NCBI record has no location or date information, so none of the rules trigger. Given how sparse the NCBI data often is, we might consider a more relaxed rule that allows a link when (1) the accepted species and identifiers overlap, (2) the date and location don't contradict, and (3) one side is NCBI/EMBL or an iBOL dataset.
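The three conditions of the proposed relaxed rule can be sketched roughly as follows. This is only an illustration, not the pipelines implementation: the `Record` class, the `SEQUENCE_SOURCES` labels, and the "missing or equal" test for non-contradiction are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical source labels; the real rule identifies these datasets differently.
SEQUENCE_SOURCES = {"NCBI", "EMBL", "iBOL"}


@dataclass
class Record:
    source: str
    accepted_species: Optional[str]
    identifiers: set = field(default_factory=set)
    event_date: Optional[str] = None
    location: Optional[tuple] = None  # (lat, lon), assumed already normalised


def non_conflicting(a, b):
    """Values do not contradict when either is missing or both are equal."""
    return a is None or b is None or a == b


def relaxed_link(r1, r2):
    # 1. Same accepted species and at least one shared identifier
    same_species = (
        r1.accepted_species is not None
        and r1.accepted_species == r2.accepted_species
    )
    shared_identifier = bool(r1.identifiers & r2.identifiers)
    # 2. Date and location must not contradict (they may be absent)
    compatible = non_conflicting(r1.event_date, r2.event_date) and non_conflicting(
        r1.location, r2.location
    )
    # 3. At least one side comes from a sparse sequence-derived source
    sequence_side = r1.source in SEQUENCE_SOURCES or r2.source in SEQUENCE_SOURCES
    return same_species and shared_identifier and compatible and sequence_side


# Made-up records: a specimen with full metadata and a sparse sequence record
specimen = Record("Arctos", "Species A", {"ID-1"}, "2001-05-01", (64.8, -147.7))
sequence = Record("NCBI", "Species A", {"ID-1"})
print(relaxed_link(specimen, sequence))
```

With these made-up records the sequence record has no date or location, so nothing contradicts and the shared identifier is enough to link the pair.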

What do you think @ManonGros please?

An alternative might be to try to unpack associatedSequences and relatedResources, which will be populated in some cases (in this record, for example), but the variation in those fields suggests it won't be all that easy.

ManonGros · 2022-06-21T08:37:27Z

I agree with your suggestion. It makes sense when it comes to NCBI/EMBL and iBOL datasets. It might help us pick up a lot more clusters.

The associatedSequences field isn't that reliable. On the other hand, if two occurrences had exactly the same information in that field, it would be a pretty good clue that they are related.
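As a rough sketch of that idea, occurrences could be grouped by their associatedSequences value and only groups with more than one member kept as candidates. The function name and the normalisation (collapsing whitespace and ignoring case) are assumptions for illustration; nothing in this thread says the clustering does this.

```python
from collections import defaultdict


def candidate_groups(occurrences):
    """Group (occurrence_id, associatedSequences) pairs that share the same value.

    Empty or missing values are skipped, since they carry no evidence.
    """
    groups = defaultdict(list)
    for occurrence_id, associated in occurrences:
        # Assumed normalisation: collapse whitespace and ignore case
        key = " ".join(associated.split()).lower() if associated else ""
        if key:
            groups[key].append(occurrence_id)
    # Only values shared by two or more occurrences suggest a relationship
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up identifiers and values
rows = [("occ-1", "SEQ 001"), ("occ-2", "seq  001"), ("occ-3", None), ("occ-4", "SEQ 002")]
print(candidate_groups(rows))
```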

timrobertson100 · 2022-06-21T14:17:41Z

I've implemented this as described and created the Hive table, but have not yet deployed it to Elasticsearch.

Currently 792k NCBI specimen-related records link (with 4.6M edges in total).

After running this, 1,155,931 NCBI records connect (with 5.04M edges in total), including the examples above.

-- Count linked records and total links where the first record belongs to this dataset
SELECT count(DISTINCT id1) AS recs, count(*) AS links
FROM prod_h.occurrence_relationships
WHERE dataset1 = 'd8cd16ba-bb74-4420-821e-083f2bac17c2';

ManonGros · 2022-06-22T07:10:03Z

Let me know when this is implemented in production and I will update the blog post (https://data-blog.gbif.org/post/clustering-occurrences/).

timrobertson100 · 2022-06-23T08:07:11Z

@ManonGros this is live now and implemented as described.

If a record comes from iBOL or EMBL, then we link anything with the same accepted scientific name that shares an identifier (i.e. date and location only need to be non-conflicting; they do not have to match).

gbif/pipelines#733

ManonGros · 2022-06-23T12:17:41Z

Thanks! I updated the blogpost

abubelinha · 2023-06-14T12:39:15Z

Related to this clustering issue: I was looking for information about how to link sequences to specimens, or the other way round, and I found this nice clustering example.

I didn't know about that INSDC Sequences dataset.
As INSDC is a collaboration of the three big Japan/USA/Europe sequence databases, should I understand that the dataset contains (or will at some point contain) a Darwin Core-styled version of those nucleotide sequence databases, including NCBI's GenBank? (That's the one mostly used by my institution's staff.)

EDIT: it seems to be a GBIF-generated dataset, built using the provider's API. Tons of information here, thanks a lot!!

@abubelinha

timrobertson100 · 2023-06-15T12:22:32Z

Thanks @abubelinha - yes, this should be pulling from the Europe mirror which, as I understand it, includes all the data submitted through NCBI GenBank too.

If you find we miss something, please do let us know.

abubelinha · 2023-06-15T19:03:48Z

Thanks for answering.
I have very little knowledge of the internals of those genetic databases, but researchers do use them a lot to publish molecular analyses of preserved specimens, so I try to link our specimens to their "related" sequences.

"Related"? I can't believe how low NCBI's quality requirements are for citing voucher specimens (if any) in their database.

Different researchers do it in different ways: many don't even cite the catalogNumber, others cite their personal field record IDs, and so on. There is no way to resolve those numbers on GBIF's end.

Perhaps some effort could go into extracting institution/collection code information (and mapping it to the Darwin Core equivalent) when GBIF pulls data from the source API.
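To make that suggestion concrete, here is a minimal sketch of such an extraction. It assumes the voucher string follows the structured form that INSDC documents for the specimen_voucher qualifier, `[<institution-code>:[<collection-code>:]]<specimen_id>`. The function name `parse_voucher` and the example values are made up, and treating the specimen id as a catalogNumber is itself an assumption, since (as noted above) many submitters cite field numbers or free text instead.

```python
def parse_voucher(voucher):
    """Split a structured voucher string into Darwin Core-like terms.

    Free-text vouchers without colons end up entirely in catalogNumber.
    """
    parts = [p.strip() for p in voucher.split(":")]
    if len(parts) == 1:
        institution, collection, specimen = None, None, parts[0]
    elif len(parts) == 2:
        institution, collection, specimen = parts[0], None, parts[1]
    else:
        institution, collection = parts[0], parts[1]
        # Keep any further colons as part of the specimen id
        specimen = ":".join(parts[2:])
    return {
        "institutionCode": institution or None,
        "collectionCode": collection or None,
        "catalogNumber": specimen or None,
    }


# Made-up voucher strings
print(parse_voucher("ABC:Herb:12345"))
print(parse_voucher("ABC:12345"))
print(parse_voucher("J. Smith 42"))
```

A parser like this only helps for records that already follow the structured form; it cannot recover codes that were never supplied.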

timrobertson100 · 2023-06-16T09:39:52Z

Thanks @abubelinha - that reflects my own experience too.

I know there are some discussions about improving the guidelines/requirements for submission in the three GenBank mirrors. I'm not sure how far those have progressed.

Thanks for taking the time to write up those ideas in the other issue - I'll respond to that next week when I have time to digest it.

timrobertson100 self-assigned this Jun 20, 2022

timrobertson100 changed the title ~~Explore missing cluster~~ Clustering: Improve NCBI (iBOL) record links Jun 21, 2022

timrobertson100 added a commit that referenced this issue Jun 21, 2022

#733 Improve NCBI clustering

393f624

MattBlissett pushed a commit that referenced this issue Jun 21, 2022

#733 Improve NCBI clustering (#746)

509df46

timrobertson100 closed this as completed Jun 23, 2022

ManonGros added a commit to gbif/data-blog that referenced this issue Jun 23, 2022

Updated documentation to reflects changes

3526414

gbif/pipelines#733

ManonGros added a commit to gbif/data-blog that referenced this issue Jun 23, 2022

Update doc

39fe884

gbif/pipelines#733
