Adding a field without data breaks larger occurrences download #930

Open
mjwestgate opened this issue Oct 17, 2024 · 4 comments

Comments

@mjwestgate

This is based on an issue identified using galah here. Basically, when we select a field in our occurrence download, for a query where no records have data in that field, the whole download fails. I've put @daxkellie's summary of the problem below.

To walk through the problem, the following query asks for counts of Acacia aneura grouped by scientificName:

https://api.ala.org.au/occurrences/occurrences/facets?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&facets=scientificName&fsort=count&flimit=10000

It returns this:
[{"fieldName":"scientificName","fieldResult":[{"label":"Acacia aneura","i18nCode":"scientificName.Acacia aneura","count":80,"fq":"scientificName:\"Acacia aneura\""},{"label":"Acacia aneura var. major","i18nCode":"scientificName.Acacia aneura var. major","count":6,"fq":"scientificName:\"Acacia aneura var. major\""},{"label":"Acacia aneura var. aneura","i18nCode":"scientificName.Acacia aneura var. aneura","count":1,"fq":"scientificName:\"Acacia aneura var. aneura\""}],"count":3}]
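As a quick sanity check, the facet counts above can be totalled to see how many records the query matches. The following is a minimal sketch: the data is the response above trimmed to the `label` and `count` keys, and `total_facet_count` is a hypothetical helper name, not part of galah or the ALA API.

```python
# Facet response from the issue, trimmed to the keys used here.
facets = [
    {
        "fieldName": "scientificName",
        "fieldResult": [
            {"label": "Acacia aneura", "count": 80},
            {"label": "Acacia aneura var. major", "count": 6},
            {"label": "Acacia aneura var. aneura", "count": 1},
        ],
        "count": 3,
    }
]


def total_facet_count(facet_response):
    """Sum the record counts over every facet value in the response."""
    return sum(
        value["count"]
        for facet in facet_response
        for value in facet["fieldResult"]
    )


total = total_facet_count(facets)
print(total)
```

The total is the number of records the same filter should yield in a download, so it is a useful figure to compare against what the download service reports.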

Which is great. By changing facets to location, we get no records, suggesting that this field is empty:

https://api.ala.org.au/occurrences/occurrences/facets?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&facets=location&fsort=count&flimit=10000

Again, fine. We then format the request as an occurrence download with several fields, including location:

"https://biocache-ws.ala.org.au/ws/occurrences/offline/download?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&fields=recordID%2CscientificName%2CvernacularName%2Ckingdom%2CeventDate%2CsamplingProtocol%2CindividualCount%2CrecordedBy%2Clocation&qa=none&facet=false&emailNotify=false&sourceTypeId=2004&reasonTypeId=4&email=martinjwestgate%40gmail.com&dwcHeaders=true"
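For readers who want to reproduce the request, here is a rough sketch of how the URL above could be assembled with the Python standard library. The parameter names and values are copied from the URL in this issue; `build_download_url` is a hypothetical helper, and the email address is a placeholder.

```python
from urllib.parse import urlencode

BASE_URL = "https://biocache-ws.ala.org.au/ws/occurrences/offline/download"


def build_download_url(fq, fields, email):
    """Assemble an offline download URL using the parameters seen in the issue."""
    params = {
        "fq": fq,
        "qualityProfile": "ALA",
        "fields": ",".join(fields),  # comma-separated list of field names
        "qa": "none",
        "facet": "false",
        "emailNotify": "false",
        "sourceTypeId": "2004",
        "reasonTypeId": "4",
        "email": email,
        "dwcHeaders": "true",
    }
    # urlencode percent-encodes quotes, colons, slashes and commas.
    return f"{BASE_URL}?{urlencode(params)}"


fq = '(year:"2002")AND(lsid:"https://id.biodiversity.org.au/node/apni/6707550")'
fields = [
    "recordID", "scientificName", "vernacularName", "kingdom", "eventDate",
    "samplingProtocol", "individualCount", "recordedBy", "location",
]
url = build_download_url(fq, fields, "user@example.org")  # placeholder email
print(url)
```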

This runs, stating we expect to receive 87 records:

{"status":"inQueue","totalRecords":87,"queueSize":1,"statusUrl":"https://biocache-ws.ala.org.au/ws/occurrences/offline/status/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1-1729125651839","cancelUrl":"https://biocache-ws.ala.org.au/ws/occurrences/offline/cancel/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1-1729125651839","searchUrl":"https://biocache.ala.org.au/occurrences/search?&q=*%3A*&fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&disableAllQualityFilters=true&fq=-basisOfRecord%3A%22FOSSIL_SPECIMEN%22+AND+-%28basisOfRecord%3A%22MATERIAL_SAMPLE%22+AND+contentTypes%3A%22Environmental+DNA%22%29&fq=-%28duplicate_status%3A%22ASSOCIATED%22+AND+duplicateType%3A%22DIFFERENT_DATASET%22%29&fq=-assertions%3ATAXON_MATCH_NONE+AND+-assertions%3AINVALID_SCIENTIFIC_NAME+AND+-assertions%3ATAXON_HOMONYM+AND+-assertions%3AUNKNOWN_KINGDOM+AND+-assertions%3ATAXON_SCOPE_MISMATCH&fq=-occurrenceStatus%3AABSENT&fq=-identificationVerificationStatus%3A%22needs_id%22&fq=-userAssertions%3A50001+AND+-userAssertions%3A50005&fq=-year%3A%5B*+TO+1700%5D&fq=-establishmentMeans%3A%22MANAGED%22+AND+-decimalLatitude%3A0+AND+-decimalLongitude%3A0+AND+-assertions%3A%22PRESUMED_SWAPPED_COORDINATE%22+AND+-assertions%3A%22COORDINATES_CENTRE_OF_STATEPROVINCE%22+AND+-assertions%3A%22COORDINATES_CENTRE_OF_COUNTRY%22+AND+-assertions%3A%22PRESUMED_NEGATED_LATITUDE%22+AND+-assertions%3A%22PRESUMED_NEGATED_LONGITUDE%22&fq=-outlierLayerCount%3A%5B3+TO+*%5D&fq=-spatiallyValid%3A%22false%22&fq=-coordinateUncertaintyInMeters%3A%5B10001+TO+*%5D"}

Finally, the resulting zip file (https://biocache.ala.org.au/biocache-download/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1/1729125651839/data.zip) has no data in it. What we would expect instead is for all the requested fields to be downloaded, with only NAs in the location column.
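Until the service behaves that way, one possible client-side workaround is to drop fields whose facet query returns nothing before requesting the download. This is only a sketch under assumptions: `fields_with_data` is a hypothetical helper, and it assumes an empty field comes back from the facets endpoint as either an empty list or an entry with an empty `fieldResult`. Note that it removes the column rather than filling it with NAs.

```python
def fields_with_data(requested_fields, facet_responses):
    """Return the requested fields, minus those known to be empty.

    facet_responses maps a field name to the parsed JSON returned by the
    facets endpoint for that field. Fields with no entry in the mapping
    are kept, because nothing is known about them.
    """
    kept = []
    for field in requested_fields:
        if field not in facet_responses:
            kept.append(field)  # not checked, so keep it
            continue
        response = facet_responses[field]
        # Assumption: an empty field yields [] or an empty "fieldResult".
        if any(facet.get("fieldResult") for facet in response):
            kept.append(field)
    return kept


facet_responses = {
    "scientificName": [
        {
            "fieldName": "scientificName",
            "fieldResult": [{"label": "Acacia aneura", "count": 80}],
        }
    ],
    "location": [],  # assumed shape of the empty response
}
safe_fields = fields_with_data(
    ["recordID", "scientificName", "location"], facet_responses
)
print(safe_fields)
```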

@kylie-m

kylie-m commented Oct 17, 2024

Relates to support ticket: https://support.ehelp.edu.au/a/tickets/209037

@adam-collins
Contributor

The contents of the location field are the same as the lat_long field. This is signified by sourceFields in the index/fields service. This location field is not intended for use with the download service. The service is likely misleading because we include the dataType name (instead of class) and indicate that it is stored=true. There is also an intentional lack of other information on the record such as description, downloadDescription, info, class(s), dwcTerm.

The problem that needs fixing is with the biocache-service index/fields service. It is currently exposing fields that are intended for use in search only (not facets, not downloads) but that still report stored=true because that is required for other reasons.

I think fields with the following dataTypes should be removed, as using them requires knowledge of SOLR queries: geohash, packedQuad, quad, location.

The intention is to keep other search only fields in the index/fields response.

There is no intention to include virtual search fields in index/fields.

@mjwestgate
Copy link
Author

OK thanks @adam-collins, that makes sense. It also tallies with our workflows; we only allow users to query fields that are listed in index/fields, so if they aren't in there, the query will get stopped by galah at an earlier stage.

While we're doing that, it might make sense to have a spring clean of other content too. The first three fields listed are _nest_parent_, _nest_path_ and _root_, for example, which don't seem right either.

@adam-collins
Contributor

After the cleanup, index/fields will contain no internal-use fields and no fields with data types deemed too complicated to use. It will include:

  • fields that can be used everywhere (most fields)
  • fields that can only be used for search queries (case insensitive text searching, etc)

To differentiate between the two:

  • stored: true can be downloaded and faceted
  • stored: false cannot be downloaded or faceted
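A client could apply this rule along the following lines. This is a hedged sketch, not the galah implementation: `classify_fields` is a hypothetical helper, the sample entries are illustrative rather than real index/fields output, and the `name`, `dataType` and `stored` keys are assumed to be present on each entry. The dataType filter covers the period before the cleanup, when the complex types are still listed.

```python
# dataTypes named in this thread as unsuitable for general use.
COMPLEX_DATA_TYPES = {"geohash", "packedQuad", "quad", "location"}


def classify_fields(index_fields):
    """Split field entries into (downloadable, search_only) name lists."""
    downloadable, search_only = [], []
    for entry in index_fields:
        if entry.get("dataType") in COMPLEX_DATA_TYPES:
            continue  # skip until these are removed from index/fields
        if entry.get("stored"):
            downloadable.append(entry["name"])  # can be downloaded and faceted
        else:
            search_only.append(entry["name"])  # search queries only
    return downloadable, search_only


# Illustrative entries only; "exampleSearchOnlyField" is made up.
sample_fields = [
    {"name": "scientificName", "dataType": "string", "stored": True},
    {"name": "location", "dataType": "location", "stored": True},
    {"name": "exampleSearchOnlyField", "dataType": "string", "stored": False},
]
downloadable, search_only = classify_fields(sample_fields)
print(downloadable, search_only)
```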
