GBIF human-curated clusters based on dwc:otherCatalogueNumbers #781
BTW, is this the correct github/gbif place to raise this question?
Thanks @abubelinha - this makes a lot of sense (and this repository is a good place to record it). The rules described in the blog refer to identifier overlap. I'll run some queries shortly and try to summarize how the field is used and how well we could rely on `otherCatalogNumbers`. I think it would be preferable to use the IC:CC:CN approach rather than the GBIF dataset key, to avoid this being too GBIF-specific.
This post documents some basic data research on the potential for `otherCatalogNumbers` to inform clusters, using the following tables as a basis: …
There are 30,073,790 records from 1,003 datasets listing 30,337,534 `otherCatalogNumbers` values. Looking at the format of the identifiers, we anticipate some delimitation of institutionCode, collectionCode and catalogNumber (a parsing sketch follows below).
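As an illustration of that anticipated delimitation, here is a minimal parsing sketch; the splitting rules are assumptions for demonstration, not the pipeline's actual logic:

```python
import re

def split_other_catalog_number(value: str):
    """Split a single otherCatalogNumbers value into candidate
    (institutionCode, collectionCode, catalogNumber) parts.
    Hypothetical logic: delimiters and layout vary widely in real data."""
    parts = [p for p in re.split(r"[\s:;,/]+", value.strip()) if p]
    if len(parts) >= 3:
        # e.g. "FMNH:Mammal:1234" -> IC, CC, CN
        return parts[0], parts[1], ":".join(parts[2:])
    if len(parts) == 2:
        # e.g. "NPS YELLO6778" -> IC, CN (no collectionCode)
        return parts[0], None, parts[1]
    # bare catalogNumber, no codes
    return None, None, parts[0] if parts else None

print(split_other_catalog_number("FMNH:Mammal:1234"))  # ('FMNH', 'Mammal', '1234')
print(split_other_catalog_number("NPS YELLO6778"))     # ('NPS', None, 'YELLO6778')
```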
Next up, I'll look at the performance of some of the joins.
(This needs checking, but I'm leaving it here as I pause this for a couple of days.) Assuming the catalogNumber itself is tokenized (e.g. …)
I've been exploring algorithms here, and want to leave a note on why this gets difficult. The algorithm I currently have ignores the delimiters. `UCLA W57-41` and `UCLA : W57-41` is a nice example of where that helps. However, `TNHC2799` and `TNHC:2799` illustrate where this can go wrong (I think; see the sketch below). There is so much variation in how codes are constructed that I think we need more than the identifiers alone. Alternatively, we could explore what a stricter match using only the IC:CC:CN version (no IC:CN) might yield. One would presume a triplet overlap of codes is a strong signal of a relationship - perhaps adding a Kingdom scope (although presumably there are cases that would cross kingdoms, related to e.g. host relationships).
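A minimal sketch (my illustration, not the production algorithm) of a delimiter-insensitive normalization, showing both the case it fixes and the collision it creates:

```python
import re

def normalize(code: str) -> str:
    # Hypothetical: drop delimiters/whitespace and uppercase, so that
    # "UCLA W57-41" and "UCLA : W57-41" compare equal.
    return re.sub(r"[\s:;,/-]+", "", code).upper()

# The good case: delimiter variants of the same identifier unify.
assert normalize("UCLA W57-41") == normalize("UCLA : W57-41")

# The risky case: a catalogNumber that merely *starts* with the
# institution code becomes indistinguishable from an IC:CN pair,
# so unrelated records could end up linked.
assert normalize("TNHC2799") == normalize("TNHC:2799")
```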
@timrobertson100 Yes, I think you are right: the catalogue number overlap alone isn't enough to infer the link. If it is possible at all, I think excluding matches that disagree on other fields would help. Otherwise, having an identifier overlap plus the same recorder or date is a good idea. The locality might be really difficult to compare, but perhaps the stateProvince when it is filled in? Would that be possible?
Sorry, but I feel I asked for something a little bit different from what you are trying to do. Probably my misunderstanding of how the clustering works led me to ask for a feature which is not easy to implement here. These clusters look to me like an example of computer intelligence working to find groups of occurrences that have a high probability of being duplicates of each other. But I want to stress the "human-curated" words in the title. That's the reason I suggested using dataset_key.
I agree @timrobertson100, this is absolutely too GBIF-specific. As a human curator, taking care of generating good links is my own problem: I should take note of the correct GBIF dataset_keys of interest, and also note the way each dataset formats its catalogNumbers when publishing to GBIF. So, even if you develop an algorithm capable of creating clusters by using catalogNumber in combination with other fields ... I keep my question about the other possibility, at the risk of it being labeled off-topic here. Thanks again for all your hard work on this.
Thanks @abubelinha. Just confirming you aren't being misunderstood - the investigation really has been about whether we could issue guidelines on the use of `otherCatalogNumbers`. One worry with using datasetKey is that it is likely less stable than codes over time. People do repackage their datasets from time to time, whereas the collection code may - or may not - be more stable. Another idea we could consider is a publisher providing explicit links to other records.
There is also the …
Thanks for the info @timrobertson100 & @ManonGros.
At first glance, I have the impression that using them might be much more difficult for the average user, because the use case I propose is like this: a collection curator knows that some specimens are duplicates of specimens in another, remote collection (the foreign IC+CC+CN is printed on the local label and stored in the local database). Now I am just guessing how linking them could be done - correct me if I am wrong. For using them (option A): …
Instead (option B), by trusting IC+CC+CN (or dataset_key+CN): …
A1 is much more difficult and time-consuming than B1, I think.
I don't get what you mean by that repackaging. Is that something that would change the `dataset_key`? And yes, it's great that GBIF occurrenceIDs tend to be more and more stable (thanks for the links!). But in the end, that's just statistics: let's say in year 2020 there were changes in 5% of the occurrenceIDs, and in 2021 it was only 2% … It would be good to know the opinion of other database curators. Maybe I am wrong in my assumptions about what would be easier to do in other institutions.
Sorry for being too verbose.
Thanks @abubelinha
By dataset repackaging, I mean the situation where people sometimes replace their datasets completely after e.g. choosing to merge or split collections, or changing how they share them.
Option B is certainly the preferred approach. I've tweaked what I ran before to be more strict: a link is only made when the full institutionCode:collectionCode:catalogNumber triplet matches. This yielded only 1139 additional records over the current system (which is good), and looking at a selection I can see they appear reasonable (e.g. this and this - co-collected Fungi of different species (same branch perhaps) linked explicitly by the recorder). @ManonGros: I think it would therefore be reasonable to add a rule that allows a link to be made for specimens having only this triplet overlap (sketched below). This would be standards-compliant and would provide @abubelinha with what he really seeks - the ability to assert a link to other records using the codes on the label. Formatting of those codes would be the responsibility of the publisher making the links, and we can document that it needs to be a combination of all 3 parts, encouraging people to update GRSciColl and clean their data. I've attached the new relationships. Does this seem reasonable to you please, @ManonGros? @abubelinha - it would mean you populate `otherCatalogNumbers` with the full triplet.
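A sketch of that stricter rule as described (assumed field handling; not the merged production code):

```python
def triplet(record: dict):
    """Build the (institutionCode, collectionCode, catalogNumber) key,
    or None if any part is missing. Field names follow Darwin Core."""
    ic = record.get("institutionCode")
    cc = record.get("collectionCode")
    cn = record.get("catalogNumber")
    return (ic, cc, cn) if ic and cc and cn else None

def asserts_link(source: dict, target: dict) -> bool:
    """True when `source` explicitly lists the full triplet of `target`
    in its otherCatalogNumbers - the IC:CC:CN-only rule (no IC:CN)."""
    target_key = triplet(target)
    if target_key is None:
        return False
    for value in source.get("otherCatalogNumbers", []):
        parts = tuple(p.strip() for p in value.split(":"))
        if len(parts) == 3 and parts == target_key:
            return True
    return False

a = {"institutionCode": "SANT", "collectionCode": "SANT", "catalogNumber": "44553"}
b = {"institutionCode": "MA", "collectionCode": "MA", "catalogNumber": "999",
     "otherCatalogNumbers": ["SANT:SANT:44553"]}
print(asserts_link(b, a))  # True: b explicitly points at a's full triplet
```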
Thanks a lot @timrobertson100. I had not realized until now that GBIF only compares occurrences from different datasets when clustering. I mean, (Q1) if I use `otherCatalogNumbers` to point to other occurrences in my own dataset, would those be clustered as well? Not sure how useful this could become, but at first glance it would be very interesting to know about these clusters. Also, this could reveal more links between institutions which would otherwise remain invisible. Throughout its history, collection A has received yearly exchange specimens from B and C.
When organisation A began creating links, they only paid attention to the numbers on the labels of duplicate specimens received from B and C, which led to links between organisations (in bold).
But other yearly exchanges received from B and C did not include their numbers on the labels (so they were stored at A without any info permitting links to the original B & C specimens). Later on, when curating my A dataset, I discovered by my own means some internal coincidences between A specimens (i.e. collector names & numbers, locality descriptions, dates or whatever).
Looking at clusters 1, 2 and 3 altogether, it is obvious that this is all just one big cluster of six elements, as the sketch below illustrates:
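A small union-find sketch of that transitive merge, using `B:B:5555` and `C:C:6666` from the example plus hypothetical `A:A:*` identifiers for collection A's specimens:

```python
from collections import defaultdict

# Minimal union-find to show how pairwise links (clusters 1, 2, 3)
# collapse transitively into one six-element cluster.
parent = {}

def find(x):
    parent.setdefault(x, x)
    if parent[x] != x:
        parent[x] = find(parent[x])  # path compression
    return parent[x]

def union(a, b):
    parent[find(a)] = find(b)

# Hypothetical links: label-based links to B and C, plus A's internal links.
links = [("A:A:1111", "B:B:5555"),   # cluster 1: exchange received from B
         ("A:A:2222", "C:C:6666"),   # cluster 2: exchange received from C
         ("A:A:1111", "A:A:2222"),   # cluster 3: internal coincidences at A
         ("A:A:1111", "A:A:3333"),
         ("A:A:2222", "A:A:4444")]
for a, b in links:
    union(a, b)

groups = defaultdict(set)
for x in parent:
    groups[find(x)].add(x)
print(list(groups.values()))  # one merged cluster of six occurrences
```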
So, assuming B and C curators are not generating this kind of link, and the GBIF clustering system is not smart enough to catch these relationships (i.e., collector and place names typed differently in each database) ... paying attention to A's internal links would make it possible to link B:B:5555 and C:C:6666 (thus transferring knowledge between those two institutions as well). Apart from my interest in internal links, this leads me to a different, related question.
Knowing that, GBIF could create: …
Thanks
Thanks @ManonGros - I will tidy up the code and merge it in the coming days.
We haven't implemented it yet @abubelinha, so please hold off on testing for now. You can prepare data knowing this is coming, though.
Yes, but let's track that in a separate issue, please
This has just been run in production for the first time. Two things remain to close this: …
I wonder if it makes any sense to use gbif-uat.org as a testing environment for this. BTW, how frequently are clusters generated? Thanks!
Thanks, @abubelinha. I'm afraid our UAT environment is not equipped to do clustering (it's very resource intensive). Clustering at the moment runs periodically, but not on a schedule while it is an experimental feature. We expect it will be run ~weekly in the future, possibly more frequently. |
@timrobertson100 the blogpost/documentation should now be up to date |
This is a bit off-topic, but not that much. By linking I mean actually introducing a set of clickable references in each occurrence, so a user viewing that occurrence can click and navigate to the other occurrences' URLs (and these links should also be available through the API and in downloaded data). So my question could be addressed in different ways: …
I know this probably deserves a new issue, but I doubt nobody else has suggested such a thing before (I searched the GitHub issues but still couldn't find it). For now I post the idea here, as this is doubly interesting for me in the current clustering scenario: …
Thanks a lot @timrobertson100 @ManonGros |
Hi again @ManonGros @timrobertson100. A side question related to this issue: when I tried to get the information about the related occurrences, 5 of those clusters had disappeared (empty JSON). But if you remove the experimental part of the URL, the occurrences themselves are still there, flagged as being in a cluster. I suppose the related elements changed in some way, so the cluster conditions are not met anymore, but the flag was not updated. That's fine. But is there any way to check the past status of a cluster? Is there any way I can do it myself (recover the info about which occurrences were previously in a now-"disappeared" cluster)? Thanks a lot in advance.
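A sketch of the kind of check being described, using the experimental related-occurrences endpoint (the path and response field names reflect the experimental API as currently observed, so treat them as subject to change):

```python
import requests

GBIF_API = "https://api.gbif.org/v1"

def related_occurrences(gbif_id):
    """Fetch the occurrences GBIF currently clusters with this one.
    The /experimental/related path is experimental and may change."""
    r = requests.get(f"{GBIF_API}/occurrence/{gbif_id}/experimental/related")
    r.raise_for_status()
    # "relatedOccurrences" is the field observed today; an empty result
    # here is what a "disappeared" cluster looks like.
    return r.json().get("relatedOccurrences", [])

print(len(related_occurrences(1234567890)))  # hypothetical gbifID
```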
I'm very sorry for the slow reply @abubelinha
That's correct. It's a little convoluted, but we run a batch job daily (normally) to relate similar records and store the links. We don't hold history for this table, so it's not easy to determine what those links may once have been. I know this doesn't help find the broken links, but hopefully it explains what has happened.
Thanks for explaining the reasons @timrobertson100. We are interested in exploring clusters related to our datasets and how those relations change over time.
So we could query such a table to select those occurrences which were related at some point but are not anymore (i.e. 12345 & 987654) ... and check the reasons (if the taxonomic identification has changed in 987654, we might want to review our own identification in 12345). I would do the following (monthly or so):
1. Download all occurrences in our dataset(s) flagged as `isInCluster=true`.
2. Query the API for the occurrences related to each of them.
3. Compare the results against my local table and update it, flagging relations that have disappeared.
Step 2 means running thousands of API queries and downloading much more info than I need, so perhaps the experimental SQL API could help (a sketch of the routine follows below). Being experimental, would our regular download API authentication credentials be enough to test it, or do we need to ask the helpdesk for permission? Thanks!
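A sketch of how that monthly routine (steps 1-3) might look; the endpoint paths, parameters and field names are assumptions based on the experimental API and may change:

```python
import json
import requests

GBIF_API = "https://api.gbif.org/v1"
DATASET_KEY = "00000000-0000-0000-0000-000000000000"  # placeholder

def clustered_occurrences(dataset_key):
    """Step 1: page through the dataset's occurrences flagged isInCluster."""
    offset, page = 0, {"endOfRecords": False}
    while not page["endOfRecords"]:
        page = requests.get(f"{GBIF_API}/occurrence/search",
                            params={"datasetKey": dataset_key,
                                    "isInCluster": "true",
                                    "limit": 300, "offset": offset}).json()
        yield from page["results"]
        offset += 300

# Step 2: one call per record - the thousands of API queries noted above.
current = {}
for occ in clustered_occurrences(DATASET_KEY):
    r = requests.get(f"{GBIF_API}/occurrence/{occ['key']}/experimental/related")
    # Keep the raw related list; the response schema is experimental.
    current[str(occ["key"])] = r.json().get("relatedOccurrences", []) if r.ok else []

# Step 3: diff against last month's snapshot to spot vanished relations.
try:
    with open("clusters_last_month.json") as f:
        previous = json.load(f)
except FileNotFoundError:
    previous = {}
vanished = [k for k, v in previous.items() if v and not current.get(k)]
print(f"{len(vanished)} occurrences lost their cluster since the last run")
with open("clusters_last_month.json", "w") as f:
    json.dump(current, f)
```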
Thanks again. I suggest a step before, to force a refresh of your dataset and ensure the `isInCluster` flag is up to date. You would do this by 1) recreating your DwC-A, or simply changing the timestamp on the file on your webserver (on Linux you'd do a `touch` on the file). Give that process some time to complete (e.g. 1 hr) and then proceed with your 1, 2, 3 steps. I'm afraid the SQL API would only allow you to achieve 1) at the moment, and given it's not a huge dataset, I think your approach is likely the better one. The SQL API would allow something like the query sketched below. As an aside, you can see from your ingestion history that we notice the dataset rarely changes. I don't know if this is as you would expect? We'll help in any way we can.
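For step 1 via the experimental SQL download API, something like the following; the endpoint, the `SQL_TSV_ZIP` format name and the column names are my reading of the SQL API documentation, so verify before use:

```python
import requests

# Hypothetical credentials; SQL downloads are experimental and may need
# to be enabled for your account by the GBIF helpdesk.
AUTH = ("your_gbif_user", "your_password")

query = {
    "format": "SQL_TSV_ZIP",  # format name for SQL downloads, as I recall
    "sql": ("SELECT gbifId, catalogNumber "
            "FROM occurrence "
            "WHERE datasetKey = '00000000-0000-0000-0000-000000000000' "
            "AND isInCluster = TRUE"),
}
r = requests.post("https://api.gbif.org/v1/occurrence/download/request",
                  json=query, auth=AUTH)
print(r.status_code, r.text)  # a download key on success
```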
Thanks again @timrobertson100 for the explanations. I guess you assume that we republish our dataset(s) frequently, and that I want to catch possible new clusters and detect those which have changed. The second part is true indeed, but the first part of the assumption is not true yet (as you noticed from the ingestion history: that's right). So why am I bothering with this right now? … That's why I planned to do this monthly. Of course, when we republish our datasets I'll wait for the crawling before trying to catch cluster changes and update my table, as you suggest; I expect the 5/1500 proportion to be higher in that case. So I understand there is no copy of the "cluster-relations" table which we can query from the SQL API. With one, I could track not only our own datasets' occurrences, but also those which were once upon a time in the right column of my table above (but are not anymore).
Excellent, thanks.
That's right - at least today - and in honesty I don't imagine it'll be there in the short term. As things develop, it would be really interesting for us to learn how much useful information you gain from the clusters - e.g. a new identification you can apply. It's one of those things we imagine might be happening, but never really know...
We are testing the SQL API and I am a bit confused about how to filter by `isInCluster` field values.
That said, yesterday I tried several ways, but none of them seemed to work (I waited a few hours, and each either failed or didn't end). I have now cancelled and re-run them. I am not sure whether using TRUE/true (as a boolean or as a string) should make a difference.
Hi, Thanks for pointing out the problem with the documentation. It seems "IS TRUE" isn't working correctly, but the other two forms should work. Our cluster was busy yesterday processing the updated eBird dataset, and there's currently a bug giving SQL downloads a low priority. These downloads will often be slower than the older predicate-based downloads, but I expect the two that are currently running to complete. |
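For later readers, the three filter spellings under discussion, with their status as reported in the comment above (a snapshot of an experimental feature, not a guarantee; column and literal casing are my assumptions):

```python
# Three ways to express the isInCluster filter in a SQL download query.
# Per the comment above, "IS TRUE" was broken at the time, while the
# other two forms were expected to work.
working_boolean = "SELECT gbifId FROM occurrence WHERE isInCluster = TRUE"
working_string  = "SELECT gbifId FROM occurrence WHERE isInCluster = 'true'"
broken_is_true  = "SELECT gbifId FROM occurrence WHERE isInCluster IS TRUE"  # reportedly failing
```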
Thanks for explaining @MattBlissett. Could any of you confirm whether the UAT-non-clustering situation has changed?
I think the UAT occurrence data is showing clusters now. BTW, regarding SQL downloads, I'd suggest always including the UAT system by default (when the helpdesk gives access to interested users).
It has - we do now cluster in UAT, but note that UAT is now much smaller in data volume than it once was, so it will detect few matches and will give results only for whatever is in the test environment at that time. Please be aware that clustering is run daily in UAT, but that might be reduced at busy times.
Thanks! Clustering in UAT is great, so we can test the usage of `otherCatalogNumbers`. Is own-dataset clustering already possible using that field?
Sorry, it is not at the moment |
OK. I hope you didn't finally decide it was not possible: …
I suspect this was finally forgotten. As for the currently working inter-dataset clustering, I see other issues that don't mention spaces before and after the vertical bars, but I understand they should be used to separate multiple catalogue-number triplets, as per the Darwin Core recommendation.
EDIT: as you suggested, I would use ":" as the internal separator character between the 3 triplet parts. But what if we find some providers already using ":" inside any of those 3 parts? Should we perhaps surround those parts with double quotes, to make clear that all the text inside belongs to that part of the triplet?
I'd suggest updating the blog to show some examples for tricky situations, so that people who provide this information start using this field in a consistent way (a sketch of the formatting rules follows below). Thanks!
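A sketch of the formatting convention proposed in this thread - `:` inside a triplet, ` | ` between values, and the double-quoting idea floated above (the quoting is only a suggestion from this discussion, not an adopted standard):

```python
def format_other_catalog_numbers(triplets) -> str:
    """Join (institutionCode, collectionCode, catalogNumber) triplets into
    one otherCatalogNumbers string: ':' inside a triplet, ' | ' between
    values. Parts containing ':' get double-quoted, per the suggestion
    above (not an adopted standard)."""
    def part(p: str) -> str:
        return f'"{p}"' if ":" in p else p
    return " | ".join(":".join(part(p) for p in t) for t in triplets)

print(format_other_catalog_numbers([
    ("SANT", "SANT", "44553"),
    ("FMNH", "Mammal", "1234"),
    ("NY", "Herbarium", "barcode:00123"),  # hypothetical ':' inside a part
]))
# SANT:SANT:44553 | FMNH:Mammal:1234 | NY:Herbarium:"barcode:00123"
```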
I come from portal-feedback#3973, where I was suggesting a pure dataset + catalogueNumber combination as a way to permit human-curated clusters.
As long as we follow strict rules to fill in those dwc field values with data taken from GBIF, I think those links could be trustable (perhaps the `catalogueNumber` part may also be standardised when clustering, as suggested in #777).
I.e., we could use triplets of `institutionCode`:`collectionCode`:`catalogNumber` to fill in values in `dwc:otherCatalogNumbers`, like this: `SANT:SANT:44553`. GBIF just needs to find and group the occurrences which match these `dwc:otherCatalogNumbers` values in order to put them in a given cluster (together with the occurrence pointed to by those values).
In theory, it might happen that different institution/collection pairs share identical codes, although I don't think it will really happen. But if that's an issue, we could just substitute the first two parts of the triplet (the `institutionCode`:`collectionCode` pair) with the `dataset_key` provided by GBIF. Like this: `dwc:otherCatalogNumbers = 1c334170-7ed1-11df-8c4a-0800200c9a66:44553`.
Of course that means data curators must first check GBIF in order to know the `dataset_key` values for the collections they want to link. But that is trivial using the API, and only needed once per collection (as I understand these GBIF dataset keys are pretty stable). Once we know them, it's very easy to populate our `dwc:otherCatalogNumbers` with all the info we have about specimens we have exchanged with other institutions (or taken on loan from them, if our GBIF dataset is related to a scientific paper which relies on specimens examined from different collections).
These clusters would work no matter how those different institutions have filled in locations ("Spain" vs "España"), collectors ("T.Robertson" vs "Robertson, T.") or even taxonomic names (which may have been corrected in their databases, or in my data paper ... so all datasets in the cluster can benefit from knowing about each other).
I must say all this is just an idea, because my institution is not filling in `dwc:otherCatalogNumbers` yet - precisely because of the uncertainty about the best way to construct the values (triplets? pairs?). Look at the examples in the `dwc:otherCatalogNumbers` definition: `FMNH:Mammal:1234`, `NPS YELLO6778 | MBG 33424`. As you can see, the 1st example is a triplet separated by `:`, whereas the 2nd and 3rd are pairs separated by spaces. Not to mention that the 1st and 2nd are not separated by `|` (the suggested separator). That's not a good example of a standardized way of doing it. As soon as GBIF clarifies a way to fill this in that permits reliable clusters, we would indeed start doing it. The good thing is that we can concatenate as many values as needed (no need to replace current ones).
Perhaps GBIF could accept using `dataset_key` + `catalogNumber` as an additional pair format for that field, so that clusters might be based only on values which use that kind of codification. I am assuming that a GBIF `dataset_key` may be easily recognized by regex, and that we use a known separator (`+` or whatever) before the `catalogNumber` part (a parsing sketch follows below).
Thanks a lot
@abubelinha
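On the regex point: a GBIF `dataset_key` is a UUID, so it is indeed easy to recognize. A sketch of parsing the proposed pair (the choice of `+` or `:` as separator is hypothetical, per the comment above):

```python
import re

# A GBIF dataset_key is a UUID, e.g. 1c334170-7ed1-11df-8c4a-0800200c9a66
UUID = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
PAIR = re.compile(rf"^({UUID})[:+](.+)$")  # ':' or '+' separator - hypothetical

def parse_dataset_pair(value: str):
    """Return (dataset_key, catalogNumber) if the value follows the
    proposed dataset_key + catalogNumber convention, else None."""
    m = PAIR.match(value.strip())
    return (m.group(1).lower(), m.group(2)) if m else None

print(parse_dataset_pair("1c334170-7ed1-11df-8c4a-0800200c9a66:44553"))
# ('1c334170-7ed1-11df-8c4a-0800200c9a66', '44553')
```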