
ODC EP 013 Index Driver API cleanup

Paul Haesler edited this page Jan 18, 2024 · 21 revisions

Overview

This EP is a proposal for a cleanup and rationalisation of the Index Driver API (i.e. the API that a new index driver is required to implement).

It also details how backwards incompatibility and migration will be handled from 1.8 through 1.9 to 2.0.

Proposed By

Paul Haesler (@SpacemanPaul)

State

  • In draft
  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

The index driver API has evolved organically over time, mostly in an environment where there was only one index driver implementing it.

Now that there are multiple index drivers (and vague plans for more), the technical debt accrued during this ad hoc growth and evolution is starting to present unnecessary obstacles to both the development of future index drivers and the maintenance of existing drivers.

The aim of this EP is to simplify and minimise the effort required to implement a new index driver, and to allow the codebases for existing index drivers to be cleaned up and simplified.

Wherever possible, new methods will be introduced and old methods deprecated in 1.9.x releases, with deprecated methods removed in 2.0.x releases. Backwards compatibility between 1.8.x and 1.9.x releases will be preserved where possible (apart from deprecation warnings).

Proposal

1. AbstractIndexDriver

In 1.8, AbstractIndexDriver defines two abstract methods:

  • connect_to_index: Simply calls from_config() from the driver's AbstractIndex implementation.
  • metadata_type_from_doc: Builds an unpersisted MetadataType model from an MDT document (i.e. a dictionary). Essentially a duplicate of the from_doc() method on the Metadata Resource (see below).

Proposal:

  • index_class: New abstract method, returns the driver's AbstractIndex implementation. (1.9)
  • connect_to_index: no longer abstract. Calls self.index_class().from_config(...) directly (1.9)
  • metadata_type_from_doc: Deprecate in 1.9, remove in 2.0 - recommend migration to index.metadata_types.from_doc()
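The proposed arrangement can be sketched roughly as follows. This is a minimal illustration using stand-in base classes rather than the real datacube ones; `DemoIndex`, `DemoIndexDriver` and the exact `from_config()` arguments are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class AbstractIndex(ABC):
    """Stand-in for the real AbstractIndex; only from_config() is sketched."""

    @classmethod
    @abstractmethod
    def from_config(cls, config_env, application_name=None, validate_connection=True):
        ...


class AbstractIndexDriver(ABC):
    """Stand-in for the proposed 1.9 driver base class."""

    @classmethod
    @abstractmethod
    def index_class(cls):
        """Return the driver's AbstractIndex implementation."""

    @classmethod
    def connect_to_index(cls, config_env, application_name=None, validate_connection=True):
        # No longer abstract: delegates to the index class directly.
        return cls.index_class().from_config(config_env, application_name, validate_connection)


class DemoIndex(AbstractIndex):
    """Hypothetical index implementation that just remembers its config."""

    def __init__(self, config_env):
        self.config_env = config_env

    @classmethod
    def from_config(cls, config_env, application_name=None, validate_connection=True):
        return cls(config_env)


class DemoIndexDriver(AbstractIndexDriver):
    """Hypothetical driver: index_class() is the only method it must supply."""

    @classmethod
    def index_class(cls):
        return DemoIndex


index = DemoIndexDriver.connect_to_index({"index_driver": "demo"})
```

Under this sketch a new driver only has to name its index class; the connection logic is inherited.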

2. AbstractIndex

2a. Boolean "supports" flags

AbstractIndex defines a set of boolean flags which implementations can override to specify which parts of the API they support.

The supports flags are relatively recent (introduced in 1.8.8, October 2022) and are only relevant to users working with different index drivers and developers of new drivers. Strict backwards compatibility is therefore not a driving concern in this case, but backwards incompatible changes are noted.

The basic concept seems sound, but this is an opportunity to clean up and formalise.

Defaults

In 1.8, some flags default to True and some to False, and implementing indexes have to explicitly set only which flags differ from the default.

From 1.9, all flags will default to False. All index implementations must explicitly set flags for all features they support.

Metadata type support flags

These flags indicate which metadata types the index supports. supports_vector is a new addition, the rest already exist in 1.8 - e.g. this is how the postgis driver advertises that it only supports EO3 compatible metadata.

  • supports_legacy: supports legacy (non-eo3) ODC metadata types (e.g. eo, telemetry)
  • supports_eo3: supports eo3 compatible metadata types.
  • supports_nongeo: supports non-geospatial metadata types (e.g. telemetry). No dependency on supports_legacy to allow for future non-geospatial metadata types with eo3 style flattened metadata.
  • supports_vector: supports geospatial non-raster metadata types. Reserved for future use.

Database/storage feature flags

These flags indicate which database/storage capabilities the index supports:

  • supports_write: Supports methods like add, remove and update. E.g. an index driver providing access to a STAC API would set this to False.
  • supports_persistence: Supports persistent storage. Storage writes from previous instantiations will persist into future ones - e.g. the in-memory driver supports write but does not support persistence. Requires supports_write.
  • supports_transactions: Supports database transactions - e.g. the in-memory driver does not support transactions.
  • supports_spatial_indexes: Supports the creation of per-CRS spatial indexes - e.g. the postgis driver supports spatial indexes.

Note backwards incompatible change from 1.8: From 1.9, 1.8's supports_persistence is renamed supports_write and a new supports_persistence flag with a slightly different interpretation is introduced.

User management flag

This flag indicates whether the index supports the user management methods exposed by index.users.

  • supports_users: Supports database user management, e.g. a SQLite index driver would not support users.

This flag is new in 1.9.

Lineage Support Flags

These flags indicate if and how the index driver supports dataset lineage.

  • supports_lineage: Supports some kind of lineage storage - either legacy style (with source_filter option in queries); or external lineage, as per EP-08.
  • supports_external_lineage: If true, supports EP-08 style external lineage API. Requires supports_lineage.
  • supports_external_home: If true, supports external home lineage data, as per EP-08. Requires supports_external_lineage.

In 1.8, there is a supports_source_filters flag. This is removed in 1.9 as it is equivalent to supports_lineage and not supports_external_lineage.
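The flag defaults and the "Requires" relationships described above can be sketched as follows. The base class here is a stand-in, and `FLAG_REQUIRES`, `flag_problems()`, `supports_source_filters()`, `DemoIndex` and `BrokenIndex` are hypothetical names used only for illustration; this EP does not propose such a validation helper.

```python
class AbstractIndex:
    """Stand-in base class: from 1.9 every flag defaults to False."""

    supports_legacy = False
    supports_eo3 = False
    supports_nongeo = False
    supports_vector = False
    supports_write = False
    supports_persistence = False
    supports_transactions = False
    supports_spatial_indexes = False
    supports_users = False
    supports_lineage = False
    supports_external_lineage = False
    supports_external_home = False


# Dependencies stated in the flag descriptions: flag -> flag it requires.
FLAG_REQUIRES = {
    "supports_persistence": "supports_write",
    "supports_external_lineage": "supports_lineage",
    "supports_external_home": "supports_external_lineage",
}


def flag_problems(index_cls):
    """Hypothetical helper: list flags that are set without their prerequisite."""
    return [
        f"{flag} requires {required}"
        for flag, required in FLAG_REQUIRES.items()
        if getattr(index_cls, flag) and not getattr(index_cls, required)
    ]


def supports_source_filters(index_cls):
    """The removed 1.8 flag, expressed via the 1.9 lineage flags."""
    return index_cls.supports_lineage and not index_cls.supports_external_lineage


class DemoIndex(AbstractIndex):
    """Hypothetical driver: explicitly opts in to each supported feature."""

    supports_eo3 = True
    supports_write = True  # writes work, but nothing persists between runs
    supports_lineage = True  # legacy-style lineage only


class BrokenIndex(AbstractIndex):
    """Hypothetical misconfigured driver."""

    supports_persistence = True  # invalid: supports_write is still False


problems = flag_problems(BrokenIndex)
```

Here `flag_problems(DemoIndex)` is empty, while `BrokenIndex` is reported because it claims persistence without write support.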

2b. Other changes

  • The type signature of the from_config() class method changes to take an ODCEnvironment instead of a LocalConfig in 1.9 as per the new config API (see EP-10).
  • Spatial index management methods are added in 1.9 (create_spatial_index, update_spatial_index, drop_spatial_index).

3. User Resource API

No changes proposed for User Resource API, except to make implementation optional by setting supports_users to False, as discussed above.

4. Lineage Resource API

Lineage Resource is new in 1.9, see EP-08.

No changes proposed.

5. Metadata type resource API

Proposed new method:

  • get_with_fields(field_names: Iterable[str]) -> Iterable[MetadataType]: Returns all metadata types that have all the named search fields.

Note that the existing method of the same name in the product resource becomes a wrapper to this.

No other proposed changes.

6. Product resource API

  • get_with_fields(field_names: Iterable[str]) -> Iterable[Product]: Implement in base class as a wrapper around metadata_types.get_with_fields above and get_with_types below.
  • get_with_types(types: Iterable[MetadataType]) -> Iterable[Product]: Proposed new method. Can be implemented in the base class via get_all().
  • get_field_names(product: Product | str | None = None) -> Iterable[str]: Replaces the method of the same name in the dataset resource. Signature expanded to take a Product or a product name. Can be implemented in the base class.

No other proposed changes.
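As a rough sketch of how the product-level get_with_fields() could be composed from the two new methods, the following uses simplified stand-in models and in-memory resources. The dataclasses and the constructor arguments are assumptions for illustration, not the real datacube classes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MetadataType:
    """Simplified stand-in: a name plus the set of search field names."""
    name: str
    search_fields: frozenset


@dataclass(frozen=True)
class Product:
    """Simplified stand-in: a name plus its metadata type."""
    name: str
    metadata_type: MetadataType


class MetadataTypeResource:
    def __init__(self, types):
        self._types = list(types)

    def get_all(self):
        return iter(self._types)

    def get_with_fields(self, field_names: Iterable[str]):
        # Keep only metadata types that have ALL the named search fields.
        wanted = set(field_names)
        return (mdt for mdt in self.get_all() if wanted <= mdt.search_fields)


class ProductResource:
    def __init__(self, metadata_types, products):
        self.metadata_types = metadata_types
        self._products = list(products)

    def get_all(self):
        return iter(self._products)

    def get_with_types(self, types: Iterable[MetadataType]):
        # Base-class implementation via get_all().
        names = {mdt.name for mdt in types}
        return (p for p in self.get_all() if p.metadata_type.name in names)

    def get_with_fields(self, field_names: Iterable[str]):
        # Wrapper: resolve fields to metadata types, then types to products.
        return self.get_with_types(self.metadata_types.get_with_fields(field_names))


eo3 = MetadataType("eo3", frozenset({"platform", "time", "lat", "lon"}))
telemetry = MetadataType("telemetry", frozenset({"platform", "time"}))
products = ProductResource(
    MetadataTypeResource([eo3, telemetry]),
    [Product("ls8_sr", eo3), Product("ls8_raw", telemetry)],
)
spatial = [p.name for p in products.get_with_fields(["lat", "lon"])]
```

With this composition a driver that implements get_all() efficiently gets both methods without writing any extra code.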

7. Dataset resource API

7.1 Atomic read/retrieval methods

  • get_unsafe(id_: UUID | str, include_sources: bool = False) -> Dataset: New method for consistency with the other Resource APIs. Raises a KeyError if the supplied id does not exist.
  • get(id_: UUID, include_sources: bool = False) -> Dataset: Implement in base class via get_unsafe above.

NOTE: The behaviour of get(id_, include_sources=True) differs based on whether the driver supports_external_lineage as per EP-08. This will be implemented from 1.9.

Existing has method unchanged.
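A minimal sketch of the base-class get() built on get_unsafe() follows. It assumes that get() returns None for an unknown id; the stand-in base class and the dictionary-backed `DictDatasetResource` are hypothetical, and include_sources is accepted but ignored here.

```python
from abc import ABC, abstractmethod
from uuid import UUID


class AbstractDatasetResource(ABC):
    """Stand-in for the dataset resource base class."""

    @abstractmethod
    def get_unsafe(self, id_, include_sources=False):
        """Return the dataset, raising KeyError if id_ does not exist."""

    def get(self, id_, include_sources=False):
        # Implemented once in the base class (assumed to map KeyError to None).
        try:
            return self.get_unsafe(id_, include_sources)
        except KeyError:
            return None


class DictDatasetResource(AbstractDatasetResource):
    """Hypothetical driver resource backed by a plain dictionary."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def get_unsafe(self, id_, include_sources=False):
        key = UUID(id_) if isinstance(id_, str) else id_
        return self._datasets[key]  # dict lookup raises KeyError when absent


known = UUID("12345678-1234-5678-1234-567812345678")
resource = DictDatasetResource({known: "dataset-a"})
found = resource.get(str(known))
missing = resource.get(UUID(int=0))
```

The driver only implements the raising variant; the forgiving variant comes for free.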

7.2 Bulk read/write methods

  • Bulk add method used by clone: _add_batch() - no changes proposed.
  • Very old (1.8) bulk read methods: bulk_get, bulk_has. (Take iterables of IDs, return Datasets (or bools for has).)
  • New bulk read methods used by clone: get_all_docs_for_product (get_all_docs calls get_all_docs_for_product, returns tuples of: Product, document, uris - but does not assemble them into Datasets)
  • Old bulk read method used to "archive all (active datasets)" and "restore all (archived datasets)" and "purge all (archived datasets)": get_all_dataset_ids() (Returns IDs only)

Propose:

  1. Deprecate "archive/restore/purge all" functionality in CLI from 1.9 and remove in 2.0
  2. Deprecate get_all_dataset_ids from 1.9 and remove in 2.0

7.3 Legacy Lineage methods

  • get_derived(id_): Deprecate in 1.9, remove in 2.0 (superseded by EP-08 Lineage API).

7.4 Location/URI related methods

  • get_locations(), get_archived_locations(), get_archived_location_times()
  • add_location()
  • get_datasets_for_location()
  • remove_location(), archive_location(), restore_location()

These methods are an obvious symptom of the complexity introduced by supporting multiple locations. I'm not aware of anyone actually using multiple locations (and I'm not 100% sure it would work correctly if you tried).

Propose deprecating all these methods in 1.9 and removing in 2.0, dropping support for multiple locations altogether - from 2.0 support a single location only.

7.5 Spatio-temporal extent methods

  • spatial_extent(ids: Iterable[UUID | str], crs: CRS | None = None) -> Geometry: Only supported by a driver that supports_spatial_indexes (i.e. not supported by legacy driver)
  • get_product_time_bounds(product: Product) -> Tuple[datetime, datetime]

Propose:

  • making both these methods accept either an iterable of IDs OR a product and;
  • renaming get_product_time_bounds() to temporal_extent(). Deprecate the old method name in 1.9 and remove in 2.0.

I.e. end result is:

  • XXXX_extent(ids: Iterable[UUID | str] | None = None, product: Product | None = None, **kwargs) (one and only one of ids and product must be supplied.)
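The "one and only one of ids and product" rule and the deprecated alias could look something like the sketch below. `DemoDatasetResource` is a hypothetical in-memory stand-in, products are identified by plain name strings for brevity, and raising KeyError when nothing matches is an assumption rather than part of this proposal.

```python
import warnings
from datetime import datetime


class DemoDatasetResource:
    """Hypothetical in-memory stand-in for a dataset resource."""

    def __init__(self, records):
        # Each record: (dataset_id, product_name, start_time, end_time)
        self._records = list(records)

    def temporal_extent(self, ids=None, product=None):
        # Exactly one of ids and product must be supplied.
        if (ids is None) == (product is None):
            raise ValueError("One and only one of ids and product must be supplied")
        if ids is not None:
            wanted = {str(i) for i in ids}
            rows = [r for r in self._records if r[0] in wanted]
        else:
            rows = [r for r in self._records if r[1] == product]
        if not rows:
            raise KeyError("No matching datasets")  # assumed behaviour
        return min(r[2] for r in rows), max(r[3] for r in rows)

    def get_product_time_bounds(self, product):
        # Old name kept through 1.9 as a thin deprecated wrapper.
        warnings.warn(
            "get_product_time_bounds() is deprecated; use temporal_extent()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.temporal_extent(product=product)


resource = DemoDatasetResource([
    ("a1", "ls8", datetime(2020, 1, 1), datetime(2020, 1, 2)),
    ("a2", "ls8", datetime(2020, 3, 1), datetime(2020, 3, 2)),
    ("b1", "s2", datetime(2021, 1, 1), datetime(2021, 1, 2)),
])
by_product = resource.temporal_extent(product="ls8")
by_ids = resource.temporal_extent(ids=["a2", "b1"])
```

A spatial_extent() following the same pattern would share the argument check and differ only in how the matching rows are aggregated.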

7.6 Search methods

This is where things get messy. I'll try to keep it as clear as possible.

Issues with the current API:

  • ALL search methods only return active (non-archived) datasets - no way to include archived datasets.

  • search_by_metadata(): Current typehint signature is incomplete - does not allow for nested metadata chunks to be passed in.

  • search_eager(): Misleadingly named and useless. Simply calls search() and returns the result as a list - so actually the exact opposite of eager.

  • search_returning_datasets_light(): Has some cool and interesting features but is poorly documented, has a design that is tightly coupled to the postgres index driver, and a complex implementation that violates the modularity established by the rest of the API. Furthermore I can't find any code anywhere that uses it. Propose deprecating in 1.9 and removing in 2.0.

  • search, search_by_product, search_returning, search_summaries:

    • In both the postgres and postgis drivers, these are all implemented as wrappers around a common private method _do_search_by_product(). This performs a product search first, then separate dataset searches for each matching or partially matching product. This makes some sense in the context of the postgres driver, but is less useful for the postgis driver. It makes "eager" searching impossible - there will always be a significant delay before returning the first matching dataset.

    • search_returning() and search_summaries() are functionally very closely related - search_summaries() is basically a special case of search_returning() with a different return format.

    • Despite all these methods being wrappers around the same function, special arguments are exposed inconsistently, being offered arbitrarily by some methods but not others.

    • search() nominally supports "source filters" (i.e. "find datasets derived from datasets that match these filters"). This is not supported by a driver that supports_external_lineage (like postgis), as per EP-08.

Proposed cleanup

  • Update typehints of search_by_metadata() method to reflect actual behaviour.
  • Update documentation of search() method to say that results are not guaranteed to be sorted/grouped by product. This frees up the postgis driver to perform a more efficient direct (and eager) search in future.
  • Make field_names argument to search_returning() optional - default is all search fields.
  • Deprecate search_eager() in 1.9 and remove in 2.0
  • Deprecate search_summaries() in 1.9 and remove in 2.0 - suggest migration to search_returning().
  • Add archived: bool | None = False argument to all search methods. False = return active datasets only (default), True = return archived datasets only, None = return both active and archived datasets.
  • Add custom_offsets argument (as per search_returning_datasets_light()) to search_returning().
  • Add order_by: str | Field | None = None argument to search_returning(). None will mean unsorted. Postgres driver will leave unsupported. Postgis driver should be able to bypass the partial product search and start returning results immediately if order_by and custom_offsets are both None.
  • Deprecate search_returning_datasets_light() in 1.9 and remove in 2.0 - suggest migration to search_returning()
  • Note that most other search methods can be trivially reimplemented as wrappers around the new expanded search_returning() method - the abstract base class will offer this as the default implementation (and the postgis driver will take advantage of it).
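The three-valued archived argument proposed above can be illustrated with a small filter. This is only a sketch over in-memory dictionaries; `matches_archived()` and `search()` here are hypothetical helpers, and a real driver would translate the same rule into its query language.

```python
from datetime import datetime


def matches_archived(archived_time, archived=False):
    """Apply the proposed rule: False = active only, True = archived only, None = both."""
    if archived is None:
        return True
    return (archived_time is not None) == archived


def search(datasets, archived=False):
    """Hypothetical search over dicts with an 'archived' timestamp (None if active)."""
    return [ds["id"] for ds in datasets if matches_archived(ds["archived"], archived)]


datasets = [
    {"id": "d1", "archived": None},
    {"id": "d2", "archived": datetime(2023, 5, 1)},
    {"id": "d3", "archived": None},
]
active_only = search(datasets)
archived_only = search(datasets, archived=True)
everything = search(datasets, archived=None)
```

Because the default is False, existing callers keep seeing active datasets only.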

7.7 Count methods

No changes proposed for count() or count_by_product().

count_product_through_time() and count_by_product_through_time() are closely related (as their confusingly similar names suggest). The latter returns counts by time-range per product (Iterable[Tuple[Product, Tuple[Range, int]]]). The former dispenses with the product grouping (Iterable[Tuple[Range, int]]) AND enforces that the query only includes datasets for one product. Propose deprecating count_product_through_time() in 1.9 (and recommending migrating to count_by_product_through_time()) and removing in 2.0.

New method: count_by(fields: Iterable[str|Field], custom_offsets: Mapping[str, Offset] | None = None, **query: QueryField) -> Iterable[Tuple[Tuple, int]]. Each Tuple[Tuple, int] pairs a named tuple containing the requested fields and/or custom-offset values with the relevant count. count and count_by_product can then be reimplemented as wrappers around count_by in the base class.
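A rough sketch of count_by() and the two wrappers, over in-memory rows, might look like this. `DemoDatasetResource` is hypothetical, custom_offsets is omitted, queries are reduced to exact-match keyword filters, and products are represented by name strings rather than Product objects.

```python
from collections import Counter, namedtuple


class DemoDatasetResource:
    """Hypothetical in-memory stand-in; each row is a dict of field values."""

    def __init__(self, rows):
        self._rows = list(rows)

    def _matching(self, **query):
        # Simplified query: every keyword must match the row exactly.
        return [r for r in self._rows if all(r.get(k) == v for k, v in query.items())]

    def count_by(self, fields, **query):
        # Group matching rows by a named tuple of the requested fields.
        fields = list(fields)
        Key = namedtuple("Key", fields)
        counts = Counter(Key(*(row[f] for f in fields)) for row in self._matching(**query))
        return list(counts.items())

    def count(self, **query):
        # Wrapper: grouping by no fields gives at most one group.
        return sum(n for _, n in self.count_by([], **query))

    def count_by_product(self, **query):
        # Wrapper: group by the product field only.
        return [(key.product, n) for key, n in self.count_by(["product"], **query)]


resource = DemoDatasetResource([
    {"product": "ls8", "platform": "landsat-8"},
    {"product": "ls8", "platform": "landsat-8"},
    {"product": "s2", "platform": "sentinel-2"},
])
total = resource.count()
per_product = resource.count_by_product()
landsat_only = resource.count(platform="landsat-8")
```

Drivers that can push the grouping down to the database would override count_by() alone and inherit the wrappers.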

7.8 Other methods

No changes are proposed to the following classes of methods:

  • atomic write (add, update, archive, restore, purge);
  • update support (can_update)

The following method will be deprecated in 1.9 and removed in 2.0 as it is replaced by a method of the same name on the product resource (see above):

  • get_field_names()