Detect if a species occurrence record is within its expected spatial distribution #255

Open
M-Nicholls opened this issue Oct 20, 2021 · 5 comments
Labels
Data Quality Assertions Anything relating to data quality assertions, including distributions, pipelines and other Enhancement Requests for new feature or improvements to existing features

Comments

@M-Nicholls
Contributor

M-Nicholls commented Oct 20, 2021

Where should this occur - part of the pipelines or a separate process?

check that layers are available for outlier detection

run expert distribution outlier detection - if there is an expert distribution for the species, detect whether the occurrence record's point is inside or outside it

add a field to the record holding the distance of the point inside/outside the expected distribution

add an expert distribution outlier category
(compare the distance inside/outside the distribution boundary to the record's uncertainty)

  • within expected distribution - point and full uncertainty are within the range
  • likely within expected distribution - point is within the range but the uncertainty extends outside it
  • may be within expected distribution - point outside the range and uncertainty overlaps the range
  • outside expected distribution - point outside the range and uncertainty outside the range
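
A minimal sketch of how the four categories above could be derived from the proposed distance field and the record's uncertainty. The sign convention (negative means inside), the metre units, the function name `categorise`, and the treatment of an uncertainty circle that only touches the boundary are all assumptions made for illustration; none of them come from the pipelines or spatial-service code.

```python
from enum import Enum


class DistributionCategory(Enum):
    WITHIN = "within expected distribution"
    LIKELY_WITHIN = "likely within expected distribution"
    MAY_BE_WITHIN = "may be within expected distribution"
    OUTSIDE = "outside expected distribution"


def categorise(signed_distance_m: float, uncertainty_m: float) -> DistributionCategory:
    """Classify a record from its signed distance to the distribution boundary.

    Assumed convention (hypothetical): signed_distance_m <= 0 means the point
    is inside the distribution, > 0 means outside; the magnitude is the
    distance from the point to the nearest boundary.
    """
    if uncertainty_m < 0:
        raise ValueError("uncertainty must be non-negative")

    inside = signed_distance_m <= 0
    boundary_gap = abs(signed_distance_m)

    if inside:
        # The whole uncertainty circle fits inside the range only if the
        # boundary is at least one uncertainty radius away.
        if boundary_gap >= uncertainty_m:
            return DistributionCategory.WITHIN
        return DistributionCategory.LIKELY_WITHIN

    # Point is outside: the uncertainty circle reaches the range when the
    # boundary is no further away than the uncertainty radius (a circle that
    # just touches the boundary is counted as overlapping here).
    if boundary_gap <= uncertainty_m:
        return DistributionCategory.MAY_BE_WITHIN
    return DistributionCategory.OUTSIDE
```

For example, under these assumptions a point 50 m inside the boundary with 100 m uncertainty would be "likely within", while a point 500 m outside with the same uncertainty would be "outside".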

Two scenarios:

1. Calculate all existing occurrences against the existing expert distribution layers - one-time run
2. Recalculate the related species when a new expert distribution layer is added.

Link to pipeline issue: gbif/pipelines#622
Link to Spatial issue: AtlasOfLivingAustralia/spatial-service#186

@M-Nicholls
Contributor Author

M-Nicholls commented Nov 2, 2021

what to do with generalised records
how to take record uncertainty into account

use the size of the distribution to determine how much the uncertainty or generalisation matters?
i.e. for a very small distribution, uncertainty and generalisation will make a big difference as to whether the point is in or out
should a record be considered in or out if its uncertainty puts it in the range but the point is outside the range?

indicate the point is in/out but based on the uncertainty the record may be out/in

categories -
within expected distribution - point and full uncertainty are within the range
likely within expected distribution - point is within the range but the uncertainty extends outside it
may be within expected distribution - point outside the range and uncertainty overlaps the range
outside expected distribution - point outside the range and uncertainty outside the range

using the categories together with the distance outside the distribution provides a thorough combination of metrics

@M-Nicholls
Contributor Author

Add to data pre-filters
update assertion metadata
update support material

@M-Nicholls
Contributor Author

what to do if there are multiple overlapping layers - e.g. likely | maybe layers, or separate east coast/west coast layers (e.g. grey nurse shark)

@qifeng-bai

what to do if there are multiple overlapping layers - e.g. likely | maybe layers, or separate east coast/west coast layers (e.g. grey nurse shark)

Having a single layer or multiple layers does not affect the in/out calculation, but multiple layers make the distance calculation harder.
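
One possible way to handle distance with several layers, sketched under loud assumptions: coordinates are treated as planar (real longitude/latitude would need a projection or geodesic distance), each layer is a simple polygon without holes, and all function names are hypothetical. A point counts as "in" if any layer contains it; outside all layers, the distance is the minimum over all layer boundaries. Inside overlapping layers, the largest per-layer margin is used, which is only a lower bound on the distance to the edge of the combined area.

```python
import math

Point = tuple[float, float]
Polygon = list[Point]


def point_in_polygon(p: Point, poly: Polygon) -> bool:
    """Ray-casting test: count edge crossings of a ray going in +x from p."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's y level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def distance_to_segment(p: Point, a: Point, b: Point) -> float:
    """Shortest planar distance from p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_boundary(p: Point, poly: Polygon) -> float:
    n = len(poly)
    return min(distance_to_segment(p, poly[i], poly[(i + 1) % n]) for i in range(n))


def signed_distance_multi(p: Point, layers: list[Polygon]) -> float:
    """Negative if p is inside any layer, positive if outside all of them."""
    containing = [poly for poly in layers if point_in_polygon(p, poly)]
    if containing:
        # Lower-bound estimate of how far inside the combined area p lies.
        return -max(distance_to_boundary(p, poly) for poly in containing)
    return min(distance_to_boundary(p, poly) for poly in layers)
```

With two separate square layers standing in for east coast and west coast ranges, a point between them gets the distance to whichever layer is nearer, so the in/out answer is unchanged and only the distance rule needs to be agreed on.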

@qifeng-bai

Solution:
Jenkins is scheduled to run the program once a day.

For every run:
Pipelines loads all indexed records
Compare them with the existing outlier records to pick out the newly added records
Calculate outliers for those new records ONLY.

If an expert layer is added or updated, manually delete the existing outlier records; Pipelines will then recalculate all indexed records
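
The per-run filtering step can be sketched as a set difference on record ids. This is only an illustration: the function name `select_records_to_process` and the use of plain string ids are assumptions, not the actual Pipelines implementation.

```python
from typing import Iterable


def select_records_to_process(
    indexed_ids: Iterable[str], existing_outlier_ids: Iterable[str]
) -> list[str]:
    """Return ids of indexed records that have no outlier result yet.

    Hypothetical helper: record ids are assumed to be unique strings.
    """
    already_done = set(existing_outlier_ids)
    # Keep the original order of the indexed records.
    return [rid for rid in indexed_ids if rid not in already_done]
```

Under this scheme, deleting the stored outlier records after a layer change leaves the "existing" set empty, so the next scheduled run selects every indexed record again.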

@acbuyan acbuyan changed the title Detect if a species occurrence record is within it's expected spatial distribution Detect if a species occurrence record is within its expected spatial distribution Nov 26, 2024
@acbuyan acbuyan added Enhancement Requests for new feature or improvements to existing features Data Quality Assertions Anything relating to data quality assertions, including distributions, pipelines and other labels Nov 27, 2024
3 participants