regex rules trying to map institutionCode / collectionCode when available #19
Comments
Thanks @abubelinha. There is certainly more to do, but some of this is implemented in the clustering. Specifically, it checks using combinations of I'll try and find some time to review the INSDC identifiers again, along the lines you propose here.
For my use case I am done: I used the API to grab our institution's data by passing our code in the `specimen_voucher` request parameter. Unfortunately, codes cited in vouchers are not exclusive to institutions, so you are right: there will be too many false positives when trying to batch-apply this procedure to a huge list of institution/collection codes, especially for codes that are only 1 or 2 letters long. But I still find it useful to filter your dataset by institution or collection code, even with false positives.
It's nice to hear you were able to bring those links back "home" into your system.
Very useful datasets.
Some time ago I tried NCBI's LinkOut system to link published GenBank sequences back to our specimens.
Now I am interested in doing it the opposite way: searching the INSDC Sequences dataset and then linking specimens from my institution to those sequences within GBIF.
It would be nice if `collectionCode` (and/or `institutionCode`) were available to filter the INSDC sequences dataset. Institutions could use a direct link to the subset of sequences attributed to their collections' specimens.
I might be wrong, but those codes seem to be always empty: only `catalogNumber` is provided. So it's difficult to find sequences citing voucher specimens of a given collection/institution.
It's a shame, because often both codes and numbers are available in the source data, but they are being mapped together into the dwc `catalogNumber` field. A few examples:
These are `collectionCode` + `catalogNumber` concatenations:
Voucher specimen here: https://www.gbif.org/occurrence/2821299116
`':'` separator: https://www.gbif.org/occurrence/3350297248
Voucher specimen here: https://www.gbif.org/occurrence/895066924
`' '` separator: https://www.gbif.org/occurrence/4008248179
Voucher specimen here: https://www.gbif.org/occurrence/895006983
I also sometimes found inverted combinations (`catalogNumber` before `collectionCode`):
Voucher specimen here: https://www.gbif.org/occurrence/2821301687
Also, sometimes (`collectionCode`) goes inside parentheses: https://www.gbif.org/occurrence/3817942152
This shows a `collectionCode` wrongly mapped as a `catalogNumber`: https://www.gbif.org/occurrence/3350565676 (but I might find the number in the publication and link back anyway)
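The layouts in the examples above could be captured with a handful of regular expressions. The following is only a minimal sketch, not how the dataset is actually processed: the function name `split_catalog_number`, the layout labels, and the sample values are all hypothetical, and it assumes a code is 1 to 10 uppercase letters and a number is a plain run of digits, which is a simplification of real catalog numbers.

```python
import re

# Assumptions (not from the dataset): a code is 1-10 uppercase letters,
# a number is a run of digits. Real catalog numbers are often messier.
CODE = r"(?P<code>[A-Z]{1,10})"
NUM = r"(?P<number>\d+)"

# One pattern per layout seen in the examples: no separator, ':' or ' '
# separator, inverted order, code in parentheses, and a bare code.
PATTERNS = [
    ("code_then_number", re.compile(rf"{CODE}[ :]?{NUM}")),
    ("number_then_code", re.compile(rf"{NUM}[ :]?{CODE}")),
    ("number_then_code_in_parentheses", re.compile(rf"{NUM}\s*\({CODE}\)")),
    ("code_in_parentheses_then_number", re.compile(rf"\({CODE}\)\s*{NUM}")),
    ("code_only", re.compile(CODE)),
]


def split_catalog_number(value):
    """Return the first layout that matches the whole value, or None."""
    text = value.strip()
    for layout, pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            groups = match.groupdict()
            return {
                "layout": layout,
                "collectionCode": groups["code"],
                # "code_only" has no number group, so this stays None
                "catalogNumber": groups.get("number"),
            }
    return None


# Invented sample values, for illustration only
for sample in ["ABC12345", "ABC:12345", "ABC 12345", "12345 ABC", "12345 (ABC)", "ABC"]:
    print(sample, "->", split_catalog_number(sample))
```

Values that match none of the layouts return `None`, so they can be left untouched in `catalogNumber` rather than guessed at.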
Perhaps you could try to implement some regex rules and search for `collectionCode` in those cases which might be clear enough to solve?
i.e. having a list of candidate `collectionCode` names in upper case, try to check whether they are followed or preceded by a `catalogNumber`.
Whenever you find a non-numeric uppercase string it might well be just a `collectionCode` and not a `catalogNumber` (see 4th example), especially if it matches one of those candidate `collectionCode` names.
By candidate `collectionCode` names I mean Index Herbariorum codes, or the different `collectionCode` names published in GBIF.
EDIT: one difficulty of this task is deciding whether the codes shown in the voucher information of INSDC sequences should be mapped as `collectionCode` or `institutionCode`. I'd just map them to both fields, because that's impossible to solve:
I have seen different specimens of a given institution and collection that the same researcher has cited in 3 different ways: "IC:CC:CN", "IC:CN" (no info about collection) and "CC:CN".
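To make the "map to both fields" idea concrete, here is a rough sketch of how the three citation styles might be handled. It is an illustration under assumptions, not an existing implementation: `map_voucher` and `CANDIDATE_CODES` are hypothetical names, the codes in the set are invented, and in practice the candidate list might come from Index Herbariorum or from codes published in GBIF.

```python
# Hypothetical candidate codes; invented values for illustration only.
CANDIDATE_CODES = {"ABC", "BOT"}


def map_voucher(voucher, candidates=CANDIDATE_CODES):
    """Map a ':'-separated voucher string to Darwin Core-style fields."""
    parts = [p.strip() for p in voucher.split(":") if p.strip()]
    record = {"institutionCode": None, "collectionCode": None, "catalogNumber": None}

    if len(parts) >= 3:
        # "IC:CC:CN": all three pieces are given explicitly
        record["institutionCode"] = parts[0]
        record["collectionCode"] = parts[1]
        record["catalogNumber"] = ":".join(parts[2:])
    elif len(parts) == 2:
        # "IC:CN" or "CC:CN": the two cannot be told apart,
        # so the same code is written to both fields
        record["institutionCode"] = parts[0]
        record["collectionCode"] = parts[0]
        record["catalogNumber"] = parts[1]
    elif len(parts) == 1:
        if parts[0] in candidates:
            # A bare known code is treated as a code, not as a number
            record["institutionCode"] = parts[0]
            record["collectionCode"] = parts[0]
        else:
            record["catalogNumber"] = parts[0]
    return record


for sample in ["ABC:BOT:123", "ABC:123", "BOT:123", "ABC", "123"]:
    print(sample, "->", map_voucher(sample))
```

Writing the ambiguous code to both fields means a filter on either `institutionCode` or `collectionCode` finds the record, at the cost of the false positives discussed earlier in this thread.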
Anyway, thanks a lot for publishing this dataset
@abubelinha