allow forced same-dataset occurrence clustering #1056

Open
abubelinha opened this issue Apr 20, 2024 · 0 comments

As discussed in #781, there are situations where repeated copies of the same occurrence within the same dataset should be allowed to form part of a cluster.
The most typical example I can see is when copies of the same specimen have received different catalogNumbers within the same collection.

  1. This can easily happen when one collection is merged into another and the two have old specimens in common (all of them become part of the same dataset, which is now published in GBIF).
  2. Occasionally, specimens are duplicated on purpose within the same collection (e.g. to keep several copies of type specimens).
  3. It also happens when specimens are exchanged between institutions in different years: if the lists from previous packages are not carefully checked, new packages may contain additional copies of specimens already sent in earlier ones.

Replicated occurrences between different datasets (3) are already being targeted by GBIF clustering algorithms.
And I guess when that happens, all copies will be included in the cluster (even if one of those datasets has more than one copy).

But I think it would be useful for cases 1 and 2 if those specimens were also shown as GBIF-detected clusters, even when no other datasets are involved.
I am aware this can be a problem, since "there are many datasets that would just cluster everything (e.g. gut analysis) that brought a technical consideration with cardinalities, and our feasibility of actually calculating these in a timely manner" (quoting @timrobertson100).

To avoid that while still permitting 1 and 2, I suggest that human-curated otherCatalogNumbers relationships be the only condition that can trigger intra-dataset clustering.
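
To make the suggestion concrete, here is a minimal Python sketch of the rule. It is not GBIF's clustering code, only an illustration under several assumptions: the function name `intra_dataset_clusters` and the dict layout of the records are hypothetical, and otherCatalogNumbers is assumed to hold a `|`-separated list of values. Two records are linked only when they share a dataset and one of them lists the other's catalogNumber in its otherCatalogNumbers.

```python
from collections import defaultdict


def intra_dataset_clusters(records):
    """Group records of the same dataset that are linked through otherCatalogNumbers.

    Hypothetical helper: each record is a dict with the keys "datasetKey",
    "catalogNumber" and (optionally) "otherCatalogNumbers".
    """
    # Look up a record position by (datasetKey, catalogNumber).
    index = {(r["datasetKey"], r["catalogNumber"]): i for i, r in enumerate(records)}
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, r in enumerate(records):
        # Assumption: multiple values are separated by "|".
        raw = r.get("otherCatalogNumbers") or ""
        for other in (v.strip() for v in raw.split("|")):
            if not other:
                continue
            # Only a catalogNumber in the SAME dataset counts as a link.
            j = index.get((r["datasetKey"], other))
            if j is not None and j != i:
                parent[find(i)] = find(j)

    groups = defaultdict(set)
    for i, r in enumerate(records):
        groups[find(i)].add((r["datasetKey"], r["catalogNumber"]))
    # Records without any curated link stay unclustered.
    return [g for g in groups.values() if len(g) > 1]


records = [
    {"datasetKey": "A", "catalogNumber": "H-001", "otherCatalogNumbers": "OLD-17"},
    {"datasetKey": "A", "catalogNumber": "OLD-17"},
    {"datasetKey": "A", "catalogNumber": "H-002"},
    {"datasetKey": "B", "catalogNumber": "OLD-17"},
    {"datasetKey": "A", "catalogNumber": "H-003", "otherCatalogNumbers": "X-9"},
]
clusters = intra_dataset_clusters(records)
```

Because links come only from explicitly curated values, a dataset without otherCatalogNumbers produces no intra-dataset clusters at all, which is what should keep the cardinality concern quoted above under control.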

I hope that is possible (and not too complicated) to implement.
Thanks!