regex rules trying to map institutionCode / collectionCode when available #19

Open
abubelinha opened this issue Jun 14, 2023 · 3 comments

Comments

@abubelinha

abubelinha commented Jun 14, 2023

Very useful datasets.

Some time ago I tried NCBI's LinkOut system to link published GenBank sequences back to our specimens.

Now I am interested in doing it the opposite way: search the INSDC Sequences dataset and then link specimens from my institution to those sequences within GBIF.

It would be nice if collectionCode (and/or institutionCode) were available to filter the INSDC Sequences dataset.
Institutions could use a direct link to the subset of sequences attributed to their collections' specimens.

I might be wrong, but those codes seem to be always empty: only catalogNumber is provided.
So it's difficult to find sequences citing voucher specimens of a given collection/institution.

It's a shame, because codes and numbers are often both available in the source data, but they are being mapped together into the dwc catalogNumber field. A few examples:

  1. These are collectionCode+catalogNumber concatenations:

  2. I also found sometimes inverted combinations (catalogNumber before collectionCode):

  3. Also sometimes (collectionCode) goes inside parentheses: https://www.gbif.org/occurrence/3817942152

  4. This shows a collectionCode, wrongly mapped as a catalogNumber: https://www.gbif.org/occurrence/3350565676
    (but I might find out the number in publication and link back anyway)

Perhaps you could implement some regex rules that search for a collectionCode in the cases that are clear enough to resolve?
That is, given a list of candidate collectionCode names in upper case, check whether they are followed or preceded by a catalogNumber.

Whenever you find a non-numeric uppercase string, it might well be just a collectionCode and not a catalogNumber (see the 4th example), especially if it matches one of those candidate collectionCode names.

With candidate collectionCode names I mean using Index Herbariorum codes, or different collectionCode names published in GBIF.
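
A minimal sketch of what such rules could look like, assuming a small hard-coded candidate list (`CANDIDATE_CODES` and `split_catalog_number` are hypothetical names, and the codes and separators are only illustrative; a real list would come from Index Herbariorum or from GBIF):

```python
import re

# Hypothetical candidate codes; a real list might come from Index Herbariorum
# or from collectionCode values already published in GBIF.
CANDIDATE_CODES = {"MO", "SANT", "K"}

# Code before number: "SANT 12345", "SANT-12345", "(SANT) 12345", "SANT12345"
CODE_FIRST = re.compile(r"^\(?([A-Z]{1,10})\)?[\s:\-_]*(\d[\w.\-/]*)$")
# Number before code: "12345 SANT", "12345 (SANT)"
NUMBER_FIRST = re.compile(r"^(\d[\w.\-/]*?)[\s:\-_]*\(?([A-Z]{1,10})\)?$")
# Code alone, wrongly stored as a catalogNumber: "SANT"
CODE_ONLY = re.compile(r"^\(?([A-Z]{1,10})\)?$")


def split_catalog_number(raw, candidates=CANDIDATE_CODES):
    """Try to split a raw catalogNumber into collectionCode and catalogNumber.

    The value is returned unchanged when no rule matches or when the
    uppercase token is not one of the candidate codes.
    """
    text = raw.strip()
    m = CODE_FIRST.match(text)
    if m and m.group(1) in candidates:
        return {"collectionCode": m.group(1), "catalogNumber": m.group(2)}
    m = NUMBER_FIRST.match(text)
    if m and m.group(2) in candidates:
        return {"collectionCode": m.group(2), "catalogNumber": m.group(1)}
    m = CODE_ONLY.match(text)
    if m and m.group(1) in candidates:
        return {"collectionCode": m.group(1), "catalogNumber": None}
    return {"collectionCode": None, "catalogNumber": text}
```

Restricting matches to the candidate list keeps the rules conservative: anything that does not clearly fit is left exactly as it was.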
EDIT: one difficulty of this task is to decide whether the codes shown in INSDC sequences voucher information should be mapped as collectionCode or institutionCode. I'd just map them to both fields because that's impossible to solve:
I have seen different specimens of a given institution and collection, and the same researcher has cited them in 3 different ways: "IC:CC:CN", "IC:CN" (no info about collection) and "CC:CN"
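
To illustrate the "map to both fields" idea for the three citation styles above, here is a rough sketch that assumes colon-delimited voucher strings (`parse_voucher` is a hypothetical helper, not part of any existing pipeline):

```python
def parse_voucher(voucher):
    """Split a colon-delimited voucher string into Darwin Core-style fields.

    "IC:CC:CN" is mapped directly. With only two parts, the code could be
    either an institution or a collection code, so it is copied to both.
    """
    parts = [p.strip() for p in voucher.split(":")]
    if len(parts) == 3:
        ic, cc, cn = parts
    elif len(parts) == 2:
        code, cn = parts
        ic = cc = code  # ambiguous: "IC:CN" and "CC:CN" look the same
    else:
        ic = cc = None
        cn = voucher.strip()
    return {"institutionCode": ic, "collectionCode": cc, "catalogNumber": cn}
```

With this, a filter on either institutionCode or collectionCode would find the record no matter which of the three styles the researcher used.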

Anyway, thanks a lot for publishing this dataset
@abubelinha

@abubelinha abubelinha changed the title regex rules trying to map collectionCode when available regex rules trying to map institutionCode / collectionCode when available Jun 15, 2023
@timrobertson100
Member

Thanks @abubelinha

There is certainly more to do, but some of this is implemented in the clustering. Specifically, it checks combinations of ic:cc:cn and ic:cn when looking for overlaps. I found cc:cn problematic in other dataset comparisons (false positives), but it would likely work for INSDC.
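
As a rough illustration of that kind of key comparison (this is not the actual clustering code; `identifier_keys` and `overlaps` are hypothetical names, and cc:cn is opt-in here because of the false positives mentioned above):

```python
def identifier_keys(record, include_cc_cn=False):
    """Build the set of identifier keys used to compare two records."""
    ic = record.get("institutionCode")
    cc = record.get("collectionCode")
    cn = record.get("catalogNumber")
    keys = set()
    if not cn:
        return keys  # nothing to compare without a catalogNumber
    if ic and cc:
        keys.add(f"{ic}:{cc}:{cn}")
    if ic:
        keys.add(f"{ic}:{cn}")
    if include_cc_cn and cc:
        keys.add(f"{cc}:{cn}")  # riskier: more false positives
    return keys


def overlaps(a, b, include_cc_cn=False):
    """Two records overlap when they share at least one identifier key."""
    return bool(identifier_keys(a, include_cc_cn) & identifier_keys(b, include_cc_cn))
```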

I'll try and find some time to review the INSDC identifiers again, along the lines you propose here.

@abubelinha
Author

abubelinha commented Jun 18, 2023

For my use case I am done (I used the API to grab our institution's data by passing our code in the specimen_voucher request parameter).
But having a way to filter this GBIF dataset by institution/collection codes would be great (so institutions can link to the INSDC dataset for their own data just by passing a filter parameter in their URLs).

Unfortunately, codes cited as vouchers are not exclusive to institutions.
Nothing prevents a researcher from publishing a sequence and citing a self-generated code which matches a real institution code
(e.g. "MO-231" could be a specimen from either Mike Oldfield's or the Missouri Botanical Garden's collection).

So you are right. There will be too many false positives when trying to batch-apply this procedure to a huge list of institution/collection codes, especially for codes that are only 1 or 2 letters long.

But I would still find it useful to filter your dataset by institution or collection code, even with false positives.
Just warn dataset users that they might need to filter those data further.
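
A small sketch of what I mean, assuming records are plain dicts with Darwin Core-style keys (`filter_by_code` and the `needsReview` flag are made-up names, and the 2-letter threshold is just a guess):

```python
def filter_by_code(records, code, short_code_max=2):
    """Return records whose institutionCode or collectionCode equals `code`.

    Every hit carries a hypothetical `needsReview` flag, set when the code
    is short enough that a clash with a researcher's own numbering scheme
    is more likely.
    """
    risky = len(code) <= short_code_max
    hits = []
    for rec in records:
        if code in (rec.get("institutionCode"), rec.get("collectionCode")):
            hits.append({**rec, "needsReview": risky})
    return hits
```

So a query for "MO" would still return the records, but with a flag telling users to double-check them.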

@timrobertson100
Member

It's nice to hear you were able to bring those links back "home" into your system.
