Clustering: Improve NCBI (iBOL) record links #733

Closed
timrobertson100 opened this issue Jun 14, 2022 · 10 comments

@timrobertson100
Member

https://www.gbif.org/occurrence/search?q=KJ139497

I think these Arctos and NCBI records should link (see catalogNumber).

timrobertson100 self-assigned this Jun 20, 2022
@timrobertson100
Member Author

timrobertson100 commented Jun 21, 2022

The records are in the candidates table but are dropped in the final comparison: the NCBI record has no location or date information, so the rules don't trigger. Given how sparse the NCBI data often is, we might consider a more relaxed rule that allows a link when (1) the accepted species and identifiers overlap, (2) the date and location don't contradict, and (3) one side is NCBI/EMBL or an iBOL dataset.

What do you think @ManonGros please?

An alternative might be to try to unpack associatedSequences and relatedResources, which will be populated in some cases (in this record, for example), but the variation in those fields suggests it won't be all that easy.

@ManonGros

I agree with your suggestion. It makes sense for NCBI/EMBL and iBOL datasets, and it might help us pick up a lot more clusters.

The associatedSequences field isn't that reliable. On the other hand, if two occurrences have exactly the same information in that field, it is a pretty good clue that they are related.
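
To make the "same exact information" idea concrete, here is a minimal sketch of how two associatedSequences values could be compared. It is only an illustration, not anything in the clustering code: the function names and the delimiter set (`|`, `;`, `,`, whitespace) are assumptions, since the field is free text and varies a lot between publishers.

```python
import re
from typing import FrozenSet, Optional


def sequence_tokens(value: Optional[str]) -> FrozenSet[str]:
    """Split a free-text associatedSequences value into upper-cased tokens.

    The delimiter set is an assumption; real data may need more cases.
    """
    if not value:
        return frozenset()
    parts = re.split(r"[|;,\s]+", value.strip())
    return frozenset(p.upper() for p in parts if p)


def same_sequences(a: Optional[str], b: Optional[str]) -> bool:
    """True only when both values are populated and carry identical tokens."""
    tokens_a, tokens_b = sequence_tokens(a), sequence_tokens(b)
    return bool(tokens_a) and tokens_a == tokens_b
```

Two empty fields deliberately do not count as a match, so missing data is never treated as evidence of a relationship.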

timrobertson100 changed the title from "Explore missing cluster" to "Clustering: Improve NCBI (iBOL) record links" on Jun 21, 2022
@timrobertson100
Member Author

timrobertson100 commented Jun 21, 2022

I've implemented this as described and created the Hive table, but have not yet deployed it to Elasticsearch.

Currently 792k NCBI specimen-related records link (with 4.6M edges in total).

After running this, 1,155,931 records from NCBI connect (with 5.04M edges in total), including the examples above.

SELECT count(distinct id1) AS recs, count(*) AS links 
FROM prod_h.occurrence_relationships 
WHERE dataset1='d8cd16ba-bb74-4420-821e-083f2bac17c2'

timrobertson100 added a commit that referenced this issue Jun 21, 2022
MattBlissett pushed a commit that referenced this issue Jun 21, 2022
@ManonGros

Let me know when this is implemented in production and I will update the blog post (https://data-blog.gbif.org/post/clustering-occurrences/).

@timrobertson100
Member Author

@ManonGros this is live now and implemented as described.

If a record comes from iBOL or EMBL, then we link it to anything with the same accepted scientific name that shares an identifier (i.e. both date and location are allowed to be non_conflicting).
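
As a rough illustration of the rule described above (not the deployed code), the logic could be sketched like this. Everything here is hypothetical: the `Occurrence` class, the `SEQUENCE_SOURCES` labels, and the use of a single country value as a stand-in for location are assumptions made to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Placeholder labels, not real GBIF dataset keys.
SEQUENCE_SOURCES = {"NCBI/EMBL", "iBOL"}


@dataclass
class Occurrence:
    source: str                       # placeholder label for the publishing dataset
    accepted_name: Optional[str]      # accepted scientific name
    identifiers: Set[str] = field(default_factory=set)  # e.g. catalogNumber values
    event_date: Optional[str] = None  # ISO date string, or None when absent
    country: Optional[str] = None     # simplified stand-in for location


def non_conflicting(a: Optional[str], b: Optional[str]) -> bool:
    """A missing value never conflicts; two present values must agree."""
    return a is None or b is None or a == b


def relaxed_link(x: Occurrence, y: Occurrence) -> bool:
    """Relaxed rule: only applies when at least one side is a sequence source."""
    if x.source not in SEQUENCE_SOURCES and y.source not in SEQUENCE_SOURCES:
        return False
    same_name = x.accepted_name is not None and x.accepted_name == y.accepted_name
    shared_id = bool(x.identifiers & y.identifiers)
    return (
        same_name
        and shared_id
        and non_conflicting(x.event_date, y.event_date)
        and non_conflicting(x.country, y.country)
    )
```

With this sketch, a specimen record with a date and country links to a sequence record that has neither, as long as the accepted name matches and an identifier is shared. Two non-sequence records are left to the stricter existing rules.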

ManonGros added a commit to gbif/data-blog that referenced this issue Jun 23, 2022
ManonGros added a commit to gbif/data-blog that referenced this issue Jun 23, 2022
@ManonGros

Thanks! I updated the blog post.

@abubelinha

abubelinha commented Jun 14, 2023

Related to this clustering issue: I was looking for information on how to link sequences to specimens, or the other way round, and I found this nice clustering example.

I didn't know about that INSDC Sequences dataset.
As INSDC is a collaboration of the three big Japan/USA/Europe sequence databases, should I understand that the dataset contains (or will at some point contain) a Darwin Core-style version of those nucleotide sequence databases, including NCBI's GenBank? (That's the one mostly used by my institution's staff.)

EDIT: it seems to be a GBIF-generated dataset, built using the provider's API: tons of information here, thanks a lot!!

@timrobertson100
Member Author

Thanks @abubelinha. Yes, this should be pulling from the Europe mirror which, as I understand it, also includes all the data submitted through NCBI GenBank.

If you find we miss something, please do let us know.

@abubelinha

Thanks for answering.
I have very little knowledge of the internals of those genetic databases, but researchers use them a lot to publish molecular analyses of preserved specimens, so I try to link our specimens to their "related" sequences.

"Related"? I can't believe how low NCBI's quality requirements are for citing voucher specimens (if any) in their database.

Different researchers do it in different ways: many don't even cite the catalogNumber, others cite their personal field record IDs, and so on. There is no way to resolve those numbers on GBIF's end.

Perhaps some effort could go into extracting institution/collection code information (and mapping it to the DwC equivalent) when GBIF pulls data from the source API.
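
A minimal sketch of what that extraction might look like, assuming the source value follows the structured INSDC specimen_voucher convention of `institution-code:collection-code:specimen_id` (with the first two parts optional). The function name is hypothetical, the example values are made up, and free-text vouchers that don't follow the convention simply fall through as an unparsed catalogNumber.

```python
from typing import Dict, Optional


def parse_voucher(voucher: str) -> Dict[str, Optional[str]]:
    """Map a structured voucher string onto Darwin Core term names.

    Assumes at most two ':' separators; anything after the second
    separator is kept as part of the specimen identifier.
    """
    parts = [p.strip() for p in voucher.strip().split(":", 2)]
    institution: Optional[str] = None
    collection: Optional[str] = None
    if len(parts) == 3:
        institution, collection, specimen = parts
    elif len(parts) == 2:
        institution, specimen = parts
    else:
        specimen = parts[0]
    return {
        "institutionCode": institution or None,
        "collectionCode": collection or None,
        "catalogNumber": specimen or None,
    }
```

For example, `parse_voucher("XYZ:Herp:12345")` yields institutionCode `XYZ`, collectionCode `Herp` and catalogNumber `12345`, while a bare field number such as `JS 2011-07` is returned as catalogNumber only.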

@timrobertson100
Member Author

Thanks @abubelinha - that reflects my own experience too.

I know there are some discussions about improving the guidelines/requirements for submission to the 3 GenBank mirrors. I'm not sure how far those have progressed.

Thanks for taking the time to write up those ideas in the other issue - I'll respond to that next week when I have time to digest it.
