regex rules trying to map institutionCode / collectionCode when available #19

abubelinha · 2023-06-14T21:52:25Z

Very useful datasets.

Some time ago I tried NCBI's LinkOut system to link published GenBank sequences back to our specimens.

Now I am interested in doing it the opposite way: search the INSDC Sequences dataset and then link specimens from my institution to those sequences within GBIF.

It would be nice if collectionCode (and/or institutionCode) were available to filter the INSDC Sequences dataset.
Institutions could use a direct link to the subset of sequences attributed to their collections' specimens.

I might be wrong, but those codes seem to be always empty: only catalogNumber is provided.
So it's difficult to find sequences citing voucher specimens of a given collection/institution.

It's a shame because often both codes and numbers are available in the source data, but they are being mapped together into the dwc catalogNumber field. A few examples:

These are collectionCode+catalogNumber concatenations:
- No separator: https://www.gbif.org/occurrence/3349765325
  Voucher specimen here: https://www.gbif.org/occurrence/2821299116
- ':' separator: https://www.gbif.org/occurrence/3350297248
  Voucher specimen here: https://www.gbif.org/occurrence/895066924
- ' ' separator: https://www.gbif.org/occurrence/4008248179
  Voucher specimen here: https://www.gbif.org/occurrence/895006983
I also sometimes found inverted combinations (catalogNumber before collectionCode):
- No separator: https://www.gbif.org/occurrence/3990282449
  Voucher specimen here: https://www.gbif.org/occurrence/2821301687

Sometimes the collectionCode also appears inside parentheses: https://www.gbif.org/occurrence/3817942152
This one shows a collectionCode wrongly mapped as a catalogNumber: https://www.gbif.org/occurrence/3350565676
(but I might find the number in the publication and link back anyway)

Perhaps you could try to implement some regex rules and search for a collectionCode in those cases which are clear enough to solve?
I.e. given a list of candidate collectionCode names in upper case, check whether they are followed or preceded by a catalogNumber.

Whenever you find a non-numeric uppercase string, it might well be just a collectionCode and not a catalogNumber (see 4th example), especially if it matches one of those candidate collectionCode names.

By candidate collectionCode names I mean Index Herbariorum codes, or the various collectionCode names published in GBIF.
EDIT: one difficulty of this task is deciding whether the codes shown in the voucher information of INSDC sequences should be mapped as collectionCode or institutionCode. I'd just map them to both fields, because that's impossible to resolve:
I have seen different specimens of a given institution and collection, and the same researcher has cited them in 3 different ways: "IC:CC:CN", "IC:CN" (no info about collection) and "CC:CN".
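
The regex idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not anything GBIF has implemented: the candidate list, the three patterns and the `split_voucher` function are all hypothetical, "ABC" is a made-up code, and the single `code` key stands for a value that would be copied to both institutionCode and collectionCode, as suggested in the EDIT.

```python
import re

# Hypothetical candidate codes. In practice these might come from Index
# Herbariorum or from collectionCode values already published in GBIF.
CANDIDATE_CODES = {"MO", "ABC"}

# Code first, optional parentheses, optional ' ', ':' or '-' separator.
CODE_FIRST = re.compile(r"^\(?([A-Z]+)\)?[\s:\-]*([0-9][\w\-./]*)$")
# Inverted form: number first, then the code.
NUMBER_FIRST = re.compile(r"^([0-9]+)[\s:\-]*\(?([A-Z]+)\)?$")
# A bare uppercase string with no number at all.
CODE_ONLY = re.compile(r"^[A-Z]+$")


def split_voucher(raw, candidates=CANDIDATE_CODES):
    """Split a concatenated catalogNumber into a code and a number.

    Only splits when the uppercase part is a known candidate code;
    anything else is returned unchanged as the catalogNumber.
    """
    text = raw.strip()
    m = CODE_FIRST.match(text)
    if m and m.group(1) in candidates:
        return {"code": m.group(1), "catalogNumber": m.group(2)}
    m = NUMBER_FIRST.match(text)
    if m and m.group(2) in candidates:
        return {"code": m.group(2), "catalogNumber": m.group(1)}
    if CODE_ONLY.match(text) and text in candidates:
        # A code wrongly mapped as a catalogNumber: no number available.
        return {"code": text, "catalogNumber": None}
    return {"code": None, "catalogNumber": text}
```

Three-part values such as "IC:CC:CN" are deliberately left unparsed here, since deciding which part is which is exactly the ambiguity described in the EDIT.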

Anyway, thanks a lot for publishing this dataset
@abubelinha


timrobertson100 · 2023-06-17T12:00:14Z

Thanks @abubelinha

There is certainly more to do, but some of this is implemented in the clustering. Specifically, it checks using combinations of ic:cc:cn and ic:cn when making overlaps. I found cc:cn problematic in other dataset comparisons (false positives), but it would likely work for INSDC.
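
To illustrate what comparing ic:cc:cn and ic:cn combinations could look like, here is a minimal sketch. It is not the actual clustering code; the `match_keys` and `may_overlap` helpers and the upper-casing normalisation are assumptions made for the example. The cc:cn key is left out on purpose, given the false positives mentioned above.

```python
def match_keys(institution_code, collection_code, catalog_number):
    """Build normalised ic:cn and ic:cc:cn keys for one record."""
    def norm(value):
        # Treat None and blank strings as missing; compare case-insensitively.
        return value.strip().upper() if value and value.strip() else None

    ic = norm(institution_code)
    cc = norm(collection_code)
    cn = norm(catalog_number)

    keys = set()
    if ic and cn:
        keys.add(f"{ic}:{cn}")
        if cc:
            keys.add(f"{ic}:{cc}:{cn}")
    return keys


def may_overlap(record_a, record_b):
    """True if two (ic, cc, cn) tuples share at least one key."""
    return bool(match_keys(*record_a) & match_keys(*record_b))
```

For example, a specimen `("ABC", "Herb", "12345")` and a sequence record `("abc", None, "12345")` share the key `ABC:12345`, so `may_overlap` would flag them as candidates for linking.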

I'll try and find some time to review the INSDC identifiers again, along the lines you propose here.

abubelinha · 2023-06-18T13:06:59Z

For my use case I am done (I used the API to grab our institution's data by passing our code in the specimen_voucher request parameter).
But having a way to filter this GBIF dataset by institution/collection codes would be great (so institutions could link to the INSDC dataset for their own data just by passing a filter parameter in their URLs).
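
As a sketch of what such a link might look like if the codes were populated: the GBIF occurrence search API accepts `datasetKey`, `institutionCode` and `collectionCode` filters, so the URL could be built as below. The dataset key is a placeholder (not the real INSDC Sequences key), and `institution_subset_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"

# Placeholder only: replace with the real key of the INSDC Sequences dataset.
INSDC_DATASET_KEY = "00000000-0000-0000-0000-000000000000"


def institution_subset_url(institution_code=None, collection_code=None,
                           dataset_key=INSDC_DATASET_KEY):
    """Build a search URL for one institution's or collection's records."""
    params = {"datasetKey": dataset_key}
    if institution_code:
        params["institutionCode"] = institution_code
    if collection_code:
        params["collectionCode"] = collection_code
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(institution_subset_url(institution_code="MO"))
```

This only builds the URL; it would return nothing useful today for records where the codes are empty.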

Unfortunately, codes cited as vouchers are not exclusive to institutions.
Nothing prevents a researcher from publishing a sequence and citing a self-generated code which matches a real institution code.
(e.g. "MO-231" could be a specimen from either Mike Oldfield's or the Missouri Botanical Garden's collection).

So you are right. There will be too many false positives when trying to batch-apply this procedure to a huge list of institution/collection codes, especially for those whose codes are only 1 or 2 letters long.

But I would still find it useful to filter your dataset by institution or collection code, even with false positives.
Just warn dataset users that they might need to filter those data further.

timrobertson100 · 2023-06-18T15:12:09Z

It's nice to hear you were able to bring those links back "home" into your system.

abubelinha changed the title ~~regex rules trying to map collectionCode when available~~ regex rules trying to map institutionCode / collectionCode when available Jun 15, 2023

abubelinha mentioned this issue Jun 15, 2023

Clustering: Improve NCBI (iBOL) record links gbif/pipelines#733

Closed
