Clustering: Improve NCBI (iBOL) record links #733
Comments
The records are in the candidates table, but are dropped in the final comparison because the NCBI record has no location or date information, so the rules don't trigger. Given how sparse the NCBI data often is, we might consider a more relaxed rule that allows a link when 1. the accepted species and identifiers overlap, 2. the date/location doesn't contradict, and 3. one side is NCBI/EMBL or an iBOL dataset. What do you think @ManonGros please? An alternative might be to try to unpack associatedSequences and relatedResources, which will be populated in some cases (in this record, for example), but the variation in those fields suggests it won't be all that easy.
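For discussion, the relaxed rule proposed above could be sketched roughly as follows. This is only an illustration of the three conditions, not the actual clustering code: the `Record` type, its field names, the `SEQUENCE_SOURCES` labels and the exact-equality comparison of dates and locations are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

# Hypothetical source labels; the real pipeline identifies datasets differently.
SEQUENCE_SOURCES = {"NCBI", "EMBL", "IBOL"}


@dataclass
class Record:
    # All field names are illustrative, not the real schema.
    source: str
    accepted_species: Optional[str]
    identifiers: Set[str] = field(default_factory=set)
    event_date: Optional[str] = None
    location: Optional[Tuple[float, float]] = None


def non_conflicting(a, b) -> bool:
    """A missing value on either side cannot contradict the other side."""
    return a is None or b is None or a == b


def relaxed_link(r1: Record, r2: Record) -> bool:
    # 1. Accepted species match and at least one identifier is shared.
    same_species = (
        r1.accepted_species is not None
        and r1.accepted_species == r2.accepted_species
    )
    shared_identifier = bool(r1.identifiers & r2.identifiers)
    # 2. Date and location do not contradict (missing values are tolerated).
    no_contradiction = non_conflicting(r1.event_date, r2.event_date) and \
        non_conflicting(r1.location, r2.location)
    # 3. At least one side is a sequence-derived record.
    one_side_is_sequence = (
        r1.source.upper() in SEQUENCE_SOURCES
        or r2.source.upper() in SEQUENCE_SOURCES
    )
    return (same_species and shared_identifier
            and no_contradiction and one_side_is_sequence)
```

With made-up values, a specimen record carrying a date and coordinates would link to an NCBI record that shares its species and an identifier but has neither date nor location, while two records with different dates would not.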
I agree with your suggestion. It makes sense when it comes to NCBI/EMBL and iBOL datasets. It might help us pick up a lot more clusters. The associatedSequences field isn't that reliable. On the other hand, if two occurrences had exactly the same information in that field, it would be a pretty good clue that they are related.
I've implemented this as described and created the Hive table, but have not yet deployed it to Elasticsearch. Currently 792k NCBI specimen-related records link (with 4.6M edges in total). After running this, 1,155,931 records from NCBI connect (with 5.04M edges in total), including the examples above.
Let me know when this is implemented in production and I will update the blogpost (https://data-blog.gbif.org/post/clustering-occurrences/).
@ManonGros this is live now and implemented as described. If a record comes from iBOL or EMBL, then we link anything with the same accepted scientific name that shares an identifier (i.e. both date and location are allowed to be non_conflicting).
Thanks! I updated the blogpost.
Related to this clustering issue, I was looking for info about how to link sequences to specimens, or the other way round, and I found this nice clustering example. I didn't know about the INSDC Sequences dataset. EDIT: it seems to be a GBIF-generated dataset, built using the provider's API: tons of information here, thanks a lot!!
Thanks @abubelinha - yes, this should be pulling from the Europe mirror which - I understand - includes all the data submitted through NCBI GenBank too. If you find we miss something, please do let us know.
Thanks for answering. "related"? I can't believe the low quality requirements of NCBI on how voucher specimens (if any) are cited in their database. Different researchers do it in different ways; many don't even cite. Perhaps a little effort could be made to extract institution/collection code info (and map it to the dwc version) when GBIF pulls data from the source API.
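As a rough illustration of that suggestion, here is a minimal sketch that splits a voucher string into Darwin Core style terms. It assumes the voucher follows the structured INSDC `specimen_voucher` form `[institution-code:[collection-code:]]specimen_id`; as noted above, many submissions are free text and would not parse this way. The function name `parse_voucher` and the example values are made up for the sketch.

```python
from typing import Dict, Optional


def parse_voucher(voucher: str) -> Dict[str, Optional[str]]:
    """Split 'INST:COLL:ID', 'INST:ID' or 'ID' into Darwin Core style keys."""
    # Split on at most two colons so any further colons stay in the specimen id.
    parts = [p.strip() for p in voucher.split(":", 2)]
    if len(parts) == 3:
        institution, collection, catalog = parts
    elif len(parts) == 2:
        institution, catalog = parts
        collection = None
    else:
        # No colon at all: treat the whole string as the specimen id.
        institution = collection = None
        catalog = parts[0]
    return {
        "institutionCode": institution or None,
        "collectionCode": collection or None,
        "catalogNumber": catalog or None,
    }
```

For example, `parse_voucher("UAM:Mamm:12345")` would yield an institution code of `UAM`, a collection code of `Mamm` and a catalogue number of `12345`, whereas a bare `12345` would only fill the catalogue number.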
Thanks @abubelinha - that reflects my own experience too. I know there are some discussions about improving the guidelines/requirements for submission to the three GenBank mirrors. I'm not sure how far those have progressed. Thanks for taking the time to write up those ideas in the other issue - I'll respond to that next week when I have time to digest it.
https://www.gbif.org/occurrence/search?q=KJ139497
These Arctos and NCBI records should link I think (see catalogNumber)