Skip to content

ODC EP 004 Use alternative index backends

Tisham Dhar edited this page Mar 7, 2022 · 19 revisions

ODC Enhancement: Support non-PostgreSQL Index backend

Overview

The datacube-core implementation is closely tied to PostgreSQL as a backend for indexing datasets and metadata. This proposal addresses the need for alternative backend for ease of use.

Proposed By

Tisham Dhar

State

  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

Alternative highly scalable purely JSON oriented data stores or embedded serverless databases exist which can be used to work at scale and reduce the dependence on PostgreSQL.

Also embedded purely filesystem based backend will reduce the on-ramp to working with datacube-core and let people get started by simply pip installing datacube and indexing a bit of data into a file based DB via CLI tools without standing up PostgreSQL server.

Proposal

Some specific backends are included here with use-cases.

STAC Search API

Replace the DB queries performed to PostgreSQL with STAC Search API calls to locate relevant bands/s3-keys and offsets to load data from.

AWS Athena

The common storage model for the datacube currently is COG files in S3 with STAC Metadata. Instead of indexing these STAC files into a PostgreSQL DB, they can be queried in situ using an ad-hoc Athena table. As Amazon S3 is the most common underlying data store for Amazon Athena, consideration should be made as to the structure of the index files in Amazon S3 to ensure highly performant Athena queries. The first approach to improving query performance is by partitioning data in Amazon S3. This is done by establishing folder structures where each folder segregates or groups related or similiar attributes such as product type, location, or use case. The second approach extends on the folder segregation using partition projections. Partition projections can reduce the runtime of queries against highly partitioned tables by specifying the ranges of partition values and projection types for each partition column in the table properties in the AWS Glue Data Catalog

Amazon DynamoDB

Using Amazon DynamoDB, a NoSQL database, as an index store together with libraries such as dynamodb-geo, querying of indexes based on circular or rectangular boundaries is possible. By creating composite sort keys within DynamoDB will allow for extremely fast, scalable, and flexible querying of indexes based on attributes or boundaries. Fine grained access can be established so that subsets of data can be returned based on each users' level of access. A key constraint with using DynamoDB as the index store is the number of indexes being returned in a query, at it may use a large amount of Read Capacity Units to return very large results. Understanding the size of returned result-sets and benchmarking the speed, cost, and performance will determine whether or not DynamoDB is an applicable alternative to other solutions.

DynamoDB local is a version of AWS DynamoDB that can be run locally on a PC/Laptop/Server.

AWS DynamoDB local

AWS DynamoDB local - downloading and running

Google S2 library

DynamoDB-geo

Elastic Search

STAC JSON files can directly added and indexed in ElasticSearch queries using spatial / temporal support in ElasticSearch to function as a purely document based index of the contents of S3. An ES backend will allow dynamic scaling of the search cluster for large read loads and full metadata searching without specific indices.

Redis

Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams. In addition to core data structures, geospatial indexes can be created and be interacted with using Redis geo commands. These include :

As part of the Redis Module Hub there are 2 modules that are of particular interest that extend the core functionality of Redis. These are RedisJSON and RediSearch.

RedisJSON provides in-memory manipulation of JSON documents at high velocity and volume. With RedisJSON, you can natively store document data in a hierarchical, tree-like format to scale and query documents efficiently, significantly improving performance over storing and manipulating JSON with Lua and core Redis data structures. https://oss.redislabs.com/redisearch/

RediSearch implements a secondary index on top of Redis, but unlike other Redis indexing libraries, it does not use internal data structures such as sorted sets. This also enables more advanced features, such as multi-field queries, aggregation, and full text search capabilites. These capabilities include exact phrase matching and numeric filtering for text queries, something that is neither possible or efficient with traditional Redis indexing approaches. https://oss.redislabs.com/redisearch/

SQLite

SQLite / Spatialite (and its successors) have very good performance for upto a few gigabytes of data and can be used as indices for smaller products with a few thousand datasets.

In-memory Sqlite is available.

File GeoDatabase

Similar to SQLite, ESRI has demonstrated indexing STAC documents into Geodatabase from S3 to create virtual mosaics. This functionality could be made available in Python via datacube-core as well.

LMDB

As part of the Statistician project a subset and dump to LMDB capability was implemented, the scope of this can be increased to support the full datacube load API from LMDB.

Feedback

Kirill: I feel like ODC-EP-003 needs to be completed before any of the above becomes possible within datacube-core. Experimentation with dc.load like interface that is backed by some other metadata store can of course happen independently from that. Have a look at rioxarray library for example of loading data in the format compatible with what dc.load produces (by compatible I mean things like .geobox working as expected).

Ben Lewis: We already store most products organised like s3|http://product-specific-prefix/path/row/date/dataset and have informal product indexes (brief tabulations of product-name, data url prefix, and link to the product-definition yaml) in the dea-config repo. The datacube api is almost exclusively used for queries that specify a product-name and spatiotemporal extent, which hardly utilises DBMS capabilities (like complex queries and writing metadata updates). If we also stored a top-level geojson/gdb in those top level directories (essentially spatial index files to define the maximum envelope/footprint of each subdir) then it would be possible for the api to lazily satisfy most dataset queries, mostly by glob, without ever involving any DBMS. This might reduce latency and would certainly be more scalable, skipping the performance issues of DB cloning and pre-indexing, and bypassing the network security issues of DB servers, so also facilitating public api access.

Voting

Enhancement Proposal Team

  • Tisham Dhar - AWS Athena backend
  • Kirill Kouzoubov - STAC Backend
  • (TBD) - Geodatabase backend
  • (TBD) - LMDB backend
  • (TBD) - SQLite backend
  • (TBD) - ElasticSearch backend

Links

Clone this wiki locally