Identify lower quality records... #245

Closed
M-Nicholls opened this issue Sep 1, 2021 · 2 comments
Labels
  • 21 Hazard!: Activity is very complex and should be broken down into smaller activities
  • Data Quality Assertions: Anything relating to data quality assertions, including distributions, pipelines and other
  • DQ Reporting: anything to do with reporting on the data quality of an occurrence

Comments

@M-Nicholls
Contributor

M-Nicholls commented Sep 1, 2021

... so that the data set is trusted

Identify and filter incomplete records:

  • no value supplied in core elements - eventDate, scientificName, decimalLatitude, decimalLongitude

#249
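
A minimal sketch of the completeness check described above, assuming each record is available as a plain dict keyed by Darwin Core term names. The function names and the sample record are hypothetical and are not taken from the ALA pipelines.

```python
# Core elements named in this issue.
CORE_FIELDS = ("eventDate", "scientificName", "decimalLatitude", "decimalLongitude")


def missing_core_fields(record):
    """Return the core fields that are absent, None, or blank in a record."""
    missing = []
    for field in CORE_FIELDS:
        value = record.get(field)
        # Treat None and whitespace-only strings as "no value supplied".
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def is_complete(record):
    """A record is complete when every core field has some value."""
    return not missing_core_fields(record)


# Hypothetical sample record with a blank longitude.
sample = {
    "eventDate": "2021-09-01",
    "scientificName": "Macropus giganteus",
    "decimalLatitude": "-35.3",
    "decimalLongitude": " ",
}
```

Returning the list of missing fields, rather than a bare boolean, would let the same check feed both a filter and a data quality report.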

Identify and filter invalid records:

Values are

  • unreadable due to data type, length or format (this is different from not matching a vocabulary)
  • out of range values e.g. coordinates
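
The two cases above could be separated roughly as follows. This is a sketch, assuming coordinates arrive as strings or numbers in a dict keyed by Darwin Core term names; the function names and the problem labels (`unreadable`, `out_of_range`) are made up for illustration. A missing value is reported as unreadable here, on the assumption that completeness is checked separately.

```python
import math


def parse_coordinate(value, lower, upper):
    """Return (number, problem); problem is None, 'unreadable' or 'out_of_range'."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Wrong data type or format: the value cannot be read as a number.
        return None, "unreadable"
    if math.isnan(number) or math.isinf(number):
        return None, "unreadable"
    if not lower <= number <= upper:
        # Readable, but outside the valid range for this coordinate.
        return number, "out_of_range"
    return number, None


def coordinate_problems(record):
    """Map each coordinate field to its problem, omitting fields that are fine."""
    limits = (
        ("decimalLatitude", -90.0, 90.0),
        ("decimalLongitude", -180.0, 180.0),
    )
    problems = {}
    for field, lower, upper in limits:
        _, problem = parse_coordinate(record.get(field), lower, upper)
        if problem:
            problems[field] = problem
    return problems
```

Keeping the two problem types distinct matters because an unreadable value points at a parsing or supply issue, while an out-of-range value points at the data itself.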

Identify and filter potentially incorrect records:

Automated identification of incorrect records:

Manually identify and filter incorrect records

Identify and filter duplicate records

  • Duplicate detection
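
One simple way to approach duplicate detection is to group records on a normalised key. The sketch below assumes the key is scientific name, event date and coordinates rounded to a fixed number of decimal places. That choice of key, the default of four decimals and the function names are all assumptions for illustration, not a description of how ALA detects duplicates.

```python
from collections import defaultdict


def duplicate_key(record, decimals=4):
    """Build a comparison key; assumes coordinates have already been validated."""
    name = str(record.get("scientificName", "")).strip().lower()
    event_date = str(record.get("eventDate", "")).strip()
    lat = round(float(record["decimalLatitude"]), decimals)
    lon = round(float(record["decimalLongitude"]), decimals)
    return (name, event_date, lat, lon)


def find_duplicate_groups(records, decimals=4):
    """Return lists of record indexes that share the same key."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[duplicate_key(record, decimals)].append(index)
    # Only keys seen more than once represent candidate duplicates.
    return [indexes for indexes in groups.values() if len(indexes) > 1]
```

Groups found this way are only candidates: two genuine observations of the same species at the same place on the same day would share a key.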

Identify and filter less authoritative records

  • Lacking supporting and contextual fields – validation status, georeferenced date and method etc
  • Internally inconsistent – e.g. wrong state for coordinates, coordinates are too accurate
  • Non-standard values – e.g. units, vocabulary matching
  • Records with default values – centre of Aus, centre of state/province, 1st of month, year, century.
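
The default-value case could be flagged with heuristics along these lines. This is a sketch, assuming event dates are ISO `YYYY-MM-DD` strings and that the caller supplies the list of known default points (centroids); no real centroid coordinates are included here. The flag names and the tolerance are hypothetical.

```python
from datetime import date


def default_date_flags(event_date):
    """Return heuristic flags for an ISO 'YYYY-MM-DD' date that may be a placeholder."""
    parsed = date.fromisoformat(event_date)
    flags = []
    if parsed.day == 1:
        flags.append("first_of_month")
        if parsed.month == 1:
            flags.append("first_of_year")
            # Treated here as "year divisible by 100"; an assumption.
            if parsed.year % 100 == 0:
                flags.append("first_of_century")
    return flags


def near_default_point(lat, lon, default_points, tolerance=0.0001):
    """True when (lat, lon) is within tolerance (in degrees) of any supplied point."""
    return any(
        abs(lat - p_lat) <= tolerance and abs(lon - p_lon) <= tolerance
        for p_lat, p_lon in default_points
    )
```

These are flags rather than filters: plenty of genuine records fall on the first of a month, so a flagged record probably needs other evidence before it is excluded.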

Identify not fit for purpose records

  • Precision requirements – coordinate precision and uncertainty, taxonomic rank, temporal precision
  • Remove record types not suitable e.g. cultivated, eDNA, fossil, absence, pre-1700
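
Fitness for purpose depends on the use case, so the thresholds need to be parameters. The sketch below assumes records are dicts using the Darwin Core terms `coordinateUncertaintyInMeters`, `year`, `basisOfRecord` and `occurrenceStatus`. The function name, the reason labels and the default threshold of 10,000 m are assumptions for illustration only.

```python
def unfit_reasons(record, max_uncertainty_m=10000.0, min_year=1700,
                  excluded_basis=("FossilSpecimen",), exclude_absent=True):
    """Return the reasons a record fails the given requirements (empty if fit)."""
    reasons = []

    # Precision requirement: unknown or too-large coordinate uncertainty.
    uncertainty = record.get("coordinateUncertaintyInMeters")
    if uncertainty is None or float(uncertainty) > max_uncertainty_m:
        reasons.append("coordinate_uncertainty")

    # Temporal requirement: undated records or records before min_year.
    year = record.get("year")
    if year is None or int(year) < min_year:
        reasons.append("too_old_or_undated")

    # Record types that do not suit this purpose.
    if record.get("basisOfRecord") in excluded_basis:
        reasons.append("excluded_basis_of_record")
    if exclude_absent and str(record.get("occurrenceStatus", "")).lower() == "absent":
        reasons.append("absence")

    return reasons
```

A different purpose would call the same function with different arguments, for example a tighter `max_uncertainty_m` for fine-scale modelling.
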
M-Nicholls changed the title from "Identify lower quality records" to "Identify lower quality records so they can be filtered out" on Sep 7, 2021
M-Nicholls changed the title from "Identify lower quality records so they can be filtered out" to "Identify lower quality records" on Sep 7, 2021
M-Nicholls changed the title from "Identify lower quality records" to "Identify lower quality records..." on Sep 7, 2021
@M-Nicholls
Contributor Author

M-Nicholls commented Sep 14, 2021

For each filter needed, establish:

  • what needs to be filtered
  • whether there is an assertion or field(s) already
  • if so, whether the assertion works correctly
  • whether the field is processed correctly, e.g. date parsing issues, update field processing
  • if not, design and implement the assertion or field(s) and add them to the data pre-filters

Wherever an assertion is used to filter, check that the assertion is operating correctly, e.g. https://biocache.ala.org.au/occurrences/search?q=assertions:RECORDED_DATE_INVALID

Many of the records returned by that search appear to have valid dates, for example: https://biocache.ala.org.au/occurrences/4eaa0bd1-5bb0-4e40-9452-31e8afb0a040
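
A rough way to audit an assertion like this is to take the flagged records and re-check the underlying field independently. The sketch below assumes the records have already been exported into dicts with hypothetical `id`, `eventDate` and `assertions` keys. That shape is an assumption and is not the biocache response format. It also only recognises ISO dates accepted by Python's `date.fromisoformat`, so it is a lower bound on what counts as a valid date.

```python
from datetime import date


def date_parses(value):
    """True when the value can be read as an ISO date by the standard library."""
    try:
        date.fromisoformat(str(value).strip())
        return True
    except ValueError:
        return False


def audit_invalid_date_assertion(records, assertion="RECORDED_DATE_INVALID"):
    """Return ids of records flagged with the assertion whose date still parses."""
    suspect = []
    for record in records:
        flagged = assertion in record.get("assertions", [])
        # A flagged record with a readable date suggests the assertion misfired.
        if flagged and date_parses(record.get("eventDate", "")):
            suspect.append(record.get("id"))
    return suspect
```

A non-empty result would be a list of records worth inspecting by hand before the assertion is trusted as a pre-filter.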

acbuyan added the "21 Hazard!", "Data Quality Assertions" and "DQ Reporting" labels on Nov 27, 2024
@peggynewman

Ideas captured elsewhere. Closing.


3 participants