ODC EP 004 Use alternative index backends
The datacube-core implementation is closely tied to PostgreSQL as the backend for indexing datasets and metadata. This proposal addresses the need for alternative backends to improve ease of use.
Tisham Dhar
- Under Discussion
- In Progress
- Completed
- Rejected
- Deferred
Alternative highly scalable, purely JSON-oriented data stores and embedded serverless databases exist that can work at scale and reduce the dependence on PostgreSQL.
An embedded, purely filesystem-based backend would also lower the barrier to entry for datacube-core: people could get started by simply pip installing datacube and indexing a small amount of data into a file-based DB via CLI tools, without standing up a PostgreSQL server.
Some specific backends are included here with use-cases.
Replace the DB queries performed to PostgreSQL with STAC Search API calls to locate relevant bands/s3-keys and offsets to load data from.
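As a rough illustration of what such a call could look like, the sketch below builds the JSON body for a STAC API item search (a POST to the `/search` endpoint) and pulls band locations out of a returned Item. The function names, the collection name and the band names are hypothetical and are not part of datacube-core; sending the request and paging through results are left out.

```python
import json
from datetime import datetime, timezone

# Timestamp format accepted by the STAC API "datetime" search parameter.
_RFC3339 = "%Y-%m-%dT%H:%M:%SZ"


def build_stac_search(collection, bbox, start, end, limit=100):
    """Build the JSON body for a STAC API item search.

    bbox is (min_lon, min_lat, max_lon, max_lat); start and end must be
    timezone-aware datetimes and are normalised to UTC.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly four values")
    start_utc = start.astimezone(timezone.utc).strftime(_RFC3339)
    end_utc = end.astimezone(timezone.utc).strftime(_RFC3339)
    return {
        "collections": [collection],
        "bbox": list(bbox),
        "datetime": f"{start_utc}/{end_utc}",  # closed interval start/end
        "limit": limit,
    }


def to_request_bytes(body):
    """Serialise the search body so it can be sent as the POST payload."""
    return json.dumps(body).encode("utf-8")


def asset_hrefs(item, bands):
    """Return {band: href} for the requested bands present in a STAC Item dict."""
    assets = item.get("assets", {})
    return {band: assets[band]["href"] for band in bands if band in assets}
```

The hrefs returned by `asset_hrefs` play the role of the s3 keys that a PostgreSQL-backed index would otherwise return; how they map onto datacube measurements is outside the scope of this sketch.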
The common storage model for the datacube currently is COG files in S3 with STAC Metadata. Instead of indexing these STAC files into a PostgreSQL DB, they can be queried in situ using an ad-hoc Athena table.
STAC JSON files can be added to ElasticSearch directly and queried using its spatial and temporal support, so that ElasticSearch functions as a purely document-based index of the contents of S3. An ES backend will allow dynamic scaling of the search cluster for large read loads, and full metadata searching without specific indices.
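A spatial and temporal lookup against such an index might be expressed in the ElasticSearch query DSL roughly as follows. This is a minimal sketch that only builds the query document; it assumes the STAC `geometry` field is mapped as `geo_shape`, `properties.datetime` as `date` and `collection` as `keyword`, none of which is specified by this proposal. The function name is hypothetical.

```python
def build_es_query(bbox, start, end, collection=None):
    """Build an ElasticSearch query body for STAC Items intersecting a bbox.

    bbox is (min_lon, min_lat, max_lon, max_lat); start and end are
    ISO 8601 strings. The time range is half-open: start <= t < end.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    filters = [
        {
            "geo_shape": {
                "geometry": {
                    # An envelope is given as [upper-left, lower-right] corners.
                    "shape": {
                        "type": "envelope",
                        "coordinates": [[min_lon, max_lat], [max_lon, min_lat]],
                    },
                    "relation": "intersects",
                }
            }
        },
        {"range": {"properties.datetime": {"gte": start, "lt": end}}},
    ]
    if collection is not None:
        # Exact match on the (assumed keyword-mapped) collection field.
        filters.append({"term": {"collection": collection}})
    # Filter context skips relevance scoring, which an index lookup does not need.
    return {"query": {"bool": {"filter": filters}}}
```

The resulting dictionary can be passed as the request body of a search against the index holding the STAC documents.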
SQLite / Spatialite (and its successors) perform very well for up to a few gigabytes of data and can be used as indices for smaller products with a few thousand datasets.
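To give a feel for how small such an index can be, here is a minimal sketch using only the `sqlite3` module from the Python standard library. It stores each STAC Item as a JSON document next to its bounding box and timestamp, and answers a product / time / bounding-box query. The table layout and function names are assumptions made for illustration, and plain bounding-box columns stand in for the real geometry support that Spatialite would provide.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS dataset (
    id      TEXT PRIMARY KEY,
    product TEXT NOT NULL,
    time    TEXT NOT NULL,   -- ISO 8601 UTC string, sorts chronologically
    min_lon REAL, min_lat REAL, max_lon REAL, max_lat REAL,
    doc     TEXT NOT NULL    -- full STAC Item as JSON
);
CREATE INDEX IF NOT EXISTS dataset_product_time ON dataset (product, time);
"""


def open_index(path=":memory:"):
    """Open (or create) a file-based index; ':memory:' is handy for testing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_dataset(conn, item):
    """Index one STAC Item dict, replacing any earlier copy with the same id."""
    min_lon, min_lat, max_lon, max_lat = item["bbox"]
    conn.execute(
        "INSERT OR REPLACE INTO dataset VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (item["id"], item["collection"], item["properties"]["datetime"],
         min_lon, min_lat, max_lon, max_lat, json.dumps(item)),
    )
    conn.commit()


def search(conn, product, bbox, start, end):
    """Return Items of a product whose bbox overlaps bbox and start <= time < end."""
    min_lon, min_lat, max_lon, max_lat = bbox
    rows = conn.execute(
        """SELECT doc FROM dataset
           WHERE product = ? AND time >= ? AND time < ?
             AND max_lon >= ? AND min_lon <= ?
             AND max_lat >= ? AND min_lat <= ?
           ORDER BY time""",
        (product, start, end, min_lon, max_lon, min_lat, max_lat),
    )
    return [json.loads(doc) for (doc,) in rows]
```

Passing a file path to `open_index` gives the "index a bit of data into a file" workflow without any database server; comparing timestamps as strings only works if every Item uses the same ISO 8601 format and timezone.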
Similar to SQLite, ESRI has demonstrated indexing STAC documents from S3 into a Geodatabase to create virtual mosaics. This functionality could be made available in Python via datacube-core as well.
As part of the Statistician project, a capability to subset the index and dump it to LMDB was implemented. Its scope can be increased to support the full datacube load API from LMDB.
Kirill: I feel like ODC-EP-003 needs to be completed before any of the above becomes possible within datacube-core. Experimentation with a dc.load-like interface that is backed by some other metadata store can of course happen independently from that. Have a look at the rioxarray library for an example of loading data in a format compatible with what dc.load produces (by compatible I mean things like .geobox working as expected).
- Tisham Dhar - AWS Athena backend
- Imam Alam - STAC Backend
- (TBD) - Geodatabase backend
- (TBD) - LMDB backend
- (TBD) - SQLite backend
- (TBD) - ElasticSearch backend