
ODC EP 004 Use alternative index backends

paulmacey1 edited this page Apr 19, 2021 · 19 revisions

ODC Enhancement: Support non-PostgreSQL Index backend

Overview

The datacube-core implementation is closely tied to PostgreSQL as the backend for indexing datasets and metadata. This proposal addresses the need for alternative index backends that are easier to set up and use.

Proposed By

Tisham Dhar

State

  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

Highly scalable, purely JSON-oriented data stores and embedded serverless databases exist that could be used to work at scale and reduce the dependence on PostgreSQL.

An embedded, purely filesystem-based backend would also lower the barrier to entry for datacube-core: people could get started by simply pip installing datacube and indexing a small amount of data into a file-based DB via the CLI tools, without standing up a PostgreSQL server.

Proposal

Some specific backends are included here with use-cases.

STAC Search API

Replace the DB queries performed to PostgreSQL with STAC Search API calls to locate relevant bands/s3-keys and offsets to load data from.
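As a rough illustration of what such a call could look like, the sketch below builds the JSON body for a STAC API item search (`POST /search`) and pulls asset locations out of the returned FeatureCollection. The function names, the collection name, and the asset key are hypothetical; only the `collections`, `bbox`, `datetime`, and `limit` fields come from the STAC API item search specification. Sending the request (and paging through results) is left out.

```python
from datetime import datetime, timezone

# RFC 3339 timestamp layout used by the STAC "datetime" search field.
_RFC3339 = "%Y-%m-%dT%H:%M:%SZ"


def build_stac_search_body(collection, bbox, start, end, limit=100):
    """Build a JSON-serialisable body for a STAC item search.

    bbox is (min_lon, min_lat, max_lon, max_lat); start and end must be
    timezone-aware datetimes.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must have four values")
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    start_utc = start.astimezone(timezone.utc).strftime(_RFC3339)
    end_utc = end.astimezone(timezone.utc).strftime(_RFC3339)
    return {
        "collections": [collection],
        "bbox": [float(v) for v in bbox],
        # A closed interval is written as "start/end".
        "datetime": f"{start_utc}/{end_utc}",
        "limit": limit,
    }


def asset_hrefs(feature_collection, band):
    """Collect the href of one named asset (band) from each returned item."""
    return [
        item["assets"][band]["href"]
        for item in feature_collection.get("features", [])
        if band in item.get("assets", {})
    ]
```

The hrefs returned by `asset_hrefs` would play the role of the S3 keys that the PostgreSQL index currently supplies to the loader.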

AWS Athena

The common storage model for the datacube currently is COG files in S3 with STAC Metadata. Instead of indexing these STAC files into a PostgreSQL DB, they can be queried in situ using an ad-hoc Athena table.

Amazon DynamoDB

Amazon DynamoDB, a NoSQL database, can serve as an index store. Together with libraries such as dynamodb-geo, it supports querying indexed datasets within circular or rectangular boundaries. Creating composite sort keys within DynamoDB allows for fast, scalable, and flexible querying based on attributes or boundaries. Fine-grained access control can be established so that the subset of data returned depends on each user's level of access. A key constraint of using DynamoDB as the index store is the number of index entries returned by a query, as returning very large result sets may consume a large number of Read Capacity Units. Understanding the size of returned result sets and benchmarking the speed, cost, and performance will determine whether DynamoDB is a viable alternative to other solutions.
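To make the composite sort key idea concrete, here is a minimal sketch in plain Python that does not talk to DynamoDB at all. It assumes a hypothetical key layout of `<tile>#<acquisition date>` with fixed-width components, and simulates the effect of a `BETWEEN` key condition on the sort key by slicing a sorted list. The tile identifiers and function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def make_sort_key(tile, acquired):
    """Join a fixed-width tile id and an ISO date into one sort key.

    Both parts must be fixed-width so that lexicographic order on the
    combined string matches (tile, date) order.
    """
    return f"{tile}#{acquired}"


def query_range(sorted_keys, tile, start, end):
    """Return keys for one tile with start <= date <= end.

    This mimics a sort-key BETWEEN condition: items are stored in
    sort-key order, so a range is one contiguous slice.
    """
    lo = bisect_left(sorted_keys, make_sort_key(tile, start))
    hi = bisect_right(sorted_keys, make_sort_key(tile, end))
    return sorted_keys[lo:hi]


# Hypothetical index entries: (tile id, acquisition date).
entries = [
    ("T055094", "2020-01-05"),
    ("T055094", "2020-02-06"),
    ("T055094", "2020-03-09"),
    ("T055095", "2020-01-05"),
]
keys = sorted(make_sort_key(tile, date) for tile, date in entries)
january_to_february = query_range(keys, "T055094", "2020-01-01", "2020-02-28")
```

Because the range is contiguous, only matching items are read, which is what keeps Read Capacity Unit consumption proportional to the size of the result set rather than the size of the table.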

DynamoDB local is a version of AWS DynamoDB that can be run locally on a PC, laptop, or server.

AWS DynamoDB local AWS DynamoDB local - downloading and running

Google S2 library DynamoDB-geo

Elastic Search

STAC JSON files can be directly added to and indexed in ElasticSearch, and queried using its spatial / temporal support, so that ElasticSearch functions as a purely document-based index of the contents of S3. An ES backend would allow dynamic scaling of the search cluster for large read loads, and full metadata searching without specific indices.
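A spatial / temporal search over indexed STAC documents could be expressed roughly as below. This sketch only builds the query DSL body as a Python dict. It assumes that the STAC `geometry` field has been mapped as a `geo_shape` and `properties.datetime` as a `date`, which is a mapping choice and not something this proposal specifies; the function name is hypothetical.

```python
def build_es_query(bbox, start, end, size=100):
    """Build an Elasticsearch query body for items intersecting a bbox
    within a time window.

    bbox is (min_lon, min_lat, max_lon, max_lat); start and end are ISO
    8601 strings.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    return {
        "size": size,
        "query": {
            "bool": {
                # Filters do not affect scoring and can be cached.
                "filter": [
                    {
                        "geo_shape": {
                            "geometry": {
                                "shape": {
                                    "type": "envelope",
                                    # Envelope order: upper-left, then lower-right.
                                    "coordinates": [
                                        [min_lon, max_lat],
                                        [max_lon, min_lat],
                                    ],
                                },
                                "relation": "intersects",
                            }
                        }
                    },
                    {"range": {"properties.datetime": {"gte": start, "lte": end}}},
                ]
            }
        },
    }
```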

SQLite

SQLite / Spatialite (and its successors) perform very well for up to a few gigabytes of data and can be used as indices for smaller products with a few thousand datasets.
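A minimal sketch of a file-based index, using only the `sqlite3` module from the Python standard library, might look like the following. The table layout, column names, and function names are assumptions made for illustration and are not part of datacube-core; the spatial test is a plain bounding-box overlap rather than anything Spatialite provides.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS dataset (
    id       TEXT PRIMARY KEY,
    product  TEXT NOT NULL,
    acquired TEXT NOT NULL,   -- ISO 8601 string, so text order is time order
    min_lon  REAL NOT NULL,
    min_lat  REAL NOT NULL,
    max_lon  REAL NOT NULL,
    max_lat  REAL NOT NULL,
    document TEXT NOT NULL    -- full STAC item as JSON
);
CREATE INDEX IF NOT EXISTS dataset_product_time ON dataset (product, acquired);
"""


def open_index(path=":memory:"):
    """Open (and if needed create) the index database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_stac_item(conn, item):
    """Index one STAC item, given as a dict."""
    min_lon, min_lat, max_lon, max_lat = item["bbox"]
    conn.execute(
        "INSERT OR REPLACE INTO dataset VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (item["id"], item["collection"], item["properties"]["datetime"],
         min_lon, min_lat, max_lon, max_lat, json.dumps(item)),
    )
    conn.commit()


def find_datasets(conn, product, bbox, start, end):
    """Return STAC items of a product whose bbox overlaps the query bbox
    and whose acquisition time lies between start and end."""
    min_lon, min_lat, max_lon, max_lat = bbox
    rows = conn.execute(
        "SELECT document FROM dataset"
        " WHERE product = ? AND acquired BETWEEN ? AND ?"
        " AND min_lon <= ? AND max_lon >= ?"
        " AND min_lat <= ? AND max_lat >= ?"
        " ORDER BY acquired",
        (product, start, end, max_lon, min_lon, max_lat, min_lat),
    )
    return [json.loads(row[0]) for row in rows]
```

Passing a file path to `open_index` instead of `":memory:"` gives a single-file index that needs no server.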

File GeoDatabase

Similar to SQLite, ESRI has demonstrated indexing STAC documents into Geodatabase from S3 to create virtual mosaics. This functionality could be made available in Python via datacube-core as well.

LMDB

As part of the Statistician project, a capability to subset and dump to LMDB was implemented. Its scope could be increased to support the full datacube load API from LMDB.

Feedback

Kirill: I feel like ODC-EP-003 needs to be completed before any of the above becomes possible within datacube-core. Experimentation with a dc.load-like interface that is backed by some other metadata store can of course happen independently of that. Have a look at the rioxarray library for an example of loading data in a format compatible with what dc.load produces (by compatible I mean things like .geobox working as expected).

Voting

Enhancement Proposal Team

  • Tisham Dhar - AWS Athena backend
  • Imam Alam - STAC Backend
  • (TBD) - Geodatabase backend
  • (TBD) - LMDB backend
  • (TBD) - SQLite backend
  • (TBD) - ElasticSearch backend

Links
