Proposal: unifying `datacube-core` and `odc-stac` data loading APIs in Datacube 2.0 #1533

robbibt · 2024-01-15T04:18:13Z

robbibt
Jan 15, 2024
Maintainer

Issue

Recently, odc.stac.load() has diverged from datacube.load() in several small but important ways. These include updates to several frequently used parameters:

Renamed dask_chunks -> chunks (consistent with rioxarray)
Renamed group_by -> groupby (consistent with pandas, xarray)
Renamed output_crs -> crs (as STAC loading has no need for separate query and load CRSs)
Replaced align with anchor (with slightly different functionality; further discussion in Datacube 1.9 and 2.0 discussion #1529)

And added and improved functionality:

Added geobox as a new alias for like
Added new bbox param
Updated geopolygon to be more flexible and work on geopandas and shapely objects

These changes are great, but they do increase friction when switching between datacube and odc-stac, as users need to remember different params to achieve the same end goal (loading a time series of data into xarray format). This pain point is particularly significant as datacube.load and odc.stac.load are two of the primary entry points to the ODC ecosystem for many of our downstream users.
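To make the friction concrete, here is a minimal sketch of the renaming a user currently has to do by hand when moving load parameters from datacube to odc-stac. The helper `to_stac_kwargs` and the `DC_TO_STAC` table are hypothetical names invented for illustration and exist in neither library. Only the three straight renames listed above are covered; `align` and the datacube query `crs` are rejected because they have no one-to-one equivalent.

```python
# Straight renames listed in this proposal (datacube name -> odc-stac name).
# This table is an illustration only, not part of either library.
DC_TO_STAC = {
    "dask_chunks": "chunks",
    "group_by": "groupby",
    "output_crs": "crs",
}


def to_stac_kwargs(dc_kwargs):
    """Hypothetical helper: rename datacube-style load kwargs to odc-stac style."""
    if "align" in dc_kwargs:
        # anchor replaces align with slightly different functionality,
        # so a blind rename would be wrong.
        raise ValueError("align has no direct equivalent; see anchor")
    if "crs" in dc_kwargs:
        # In datacube, crs describes the query coordinates, not the output grid.
        raise ValueError("datacube crs is the query CRS, not the output CRS")
    return {DC_TO_STAC.get(name, name): value for name, value in dc_kwargs.items()}


stac_kwargs = to_stac_kwargs(
    {"dask_chunks": {"x": 2048, "y": 2048}, "group_by": "solar_day", "output_crs": "EPSG:3577"}
)
print(stac_kwargs)
```

Parameters that are not in the table (for example `resolution`) pass through unchanged.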

Proposal

I would like to propose that in Datacube 2.0 we aim for full consistency (or as close as we can get to it) between data loading parameters shared between datacube-core and odc-stac. This would have the following benefits:

Improved user experience by allowing users to seamlessly load and combine data from Datacube and STAC
Less development overhead through re-using code (e.g. by moving shared functionality to odc.geo where possible)
Future proofing downstream code and documentation for a future where we may increasingly load data using STAC rather than an existing Datacube database.

The changes made in odc-stac listed above all make sense to me, so I would suggest that we adopt them directly in datacube-core in 2.0. This would mean adding deprecation warnings to current parameters (group_by, dask_chunks etc) in 1.9 so that users have enough time to transition to the new approaches for loading data.
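As a rough sketch of what the 1.9 deprecation step could look like, the decorator below accepts an old keyword name, emits a `DeprecationWarning`, and forwards the value under the new name. The decorator `deprecated_aliases` and the stand-in `load` function are assumptions made for illustration; this is not how datacube-core implements or plans to implement it.

```python
import functools
import warnings


def deprecated_aliases(**renames):
    """Map old keyword names to new ones, warning when an old name is used."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for old, new in renames.items():
                if old in kwargs:
                    if new in kwargs:
                        raise TypeError(f"pass either {old!r} or {new!r}, not both")
                    warnings.warn(
                        f"{old!r} is deprecated, use {new!r} instead",
                        DeprecationWarning,
                        stacklevel=2,
                    )
                    kwargs[new] = kwargs.pop(old)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated_aliases(dask_chunks="chunks", group_by="groupby")
def load(chunks=None, groupby=None):
    # Stand-in for a real loader: just echo the resolved parameters.
    return {"chunks": chunks, "groupby": groupby}
```

With this in place, `load(dask_chunks=..., group_by=...)` keeps working during the transition while warning users, and removing the old names in 2.0 amounts to dropping the decorator.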

Challenges

Some of these params (e.g. group_by, dask_chunks) are used extensively throughout downstream code, so this would require users to make widespread changes to their applications when they upgrade to Datacube 2.0. However, my feeling is that a 2.0 upgrade is probably the best opportunity for a change like this, as breaking changes are likely to be a given.

Would love to hear what people think about this proposal - I really think if we could achieve something like this it could make a big difference to usability across the whole ODC ecosystem. 🙂

robbibt · 2024-01-15T04:33:49Z

robbibt
Jan 15, 2024
Maintainer Author

output_crs -> crs is a tricky one, as Datacube already uses crs to define the CRS of the coordinates used to query data. Two possible options could be:

Accept an inconsistency and leave output_crs and crs as they are (downside: crs would do different things in datacube-core vs. odc-stac)
Rename output_crs -> crs (matching odc-stac), and rename the current crs to query_crs to better explain its purpose.

An advantage of the second option would be addressing a common pain point for beginner users who regularly assume that crs is used to define the output CRS (not just used for querying...)

3 replies

SpacemanPaul Jan 15, 2024
Maintainer

Agree we should introduce "anchor" and deprecate "align" in favour of it in 1.9 (and remove "align" in 2.0)
Moving dask_chunks to chunks and group_by to groupby is relatively straightforward. (Introduce the new form and deprecate the old in 1.9, remove the old form in 2.0)
Agree geobox as an alias (or eventual replacement) for like
As you point out changing output_crs is problematic because ODC DOES have both a query and output CRS.

I think I would like to see a pathway like:

1.8: crs (query) and output_crs
1.9: Introduce query_crs so we have query_crs and output_crs. Put a deprecation warning message on crs pushing users to the fully qualified names, and noting the inconsistency with odc-stac.
2.0: crs becomes a synonym for output_crs. (Warning message advising of change in behaviour, to be removed at 2.0.x?)

Another option would be to have e.g. crs=foo become (in 2.0) a shortcut for query_crs=foo, output_crs=foo. (i.e. raise an error if user sets both crs and either query_crs or output_crs.) I think this is clearer API design, but perpetuates the inconsistency with odc-stac.
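The shortcut option could be sketched roughly as follows. The function name `resolve_crs` is hypothetical, and the sketch only covers the argument handling described above, not any actual reprojection.

```python
def resolve_crs(crs=None, query_crs=None, output_crs=None):
    """Return (query_crs, output_crs) under the proposed shortcut semantics."""
    if crs is not None:
        if query_crs is not None or output_crs is not None:
            # Mixing the shortcut with the fully qualified names is ambiguous.
            raise ValueError("crs cannot be combined with query_crs or output_crs")
        # crs=foo is shorthand for query_crs=foo, output_crs=foo.
        return crs, crs
    return query_crs, output_crs
```

For example, `resolve_crs(crs="EPSG:3577")` yields the same CRS for both query and output, while `resolve_crs(query_crs="EPSG:4326", output_crs="EPSG:3577")` keeps them separate.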

robbibt Jan 22, 2024
Maintainer Author

I like both of those options. RE "I think this is clearer API design, but perpetuates the inconsistency with odc-stac": I'm not sure it's actually inconsistent at all. On odc-stac, setting crs="EPSG:3577" gives you back data on an Australian Albers grid, and in your proposal crs="EPSG:3577" would do the same in datacube-core too. The only difference is that the user would also have to provide their query in Australian Albers coords (which is a step that's not relevant to odc-stac at all).

Kirill888 Jan 24, 2024
Maintainer

I think output_crs= instead of crs= is a bit of a gotcha for most users currently, so forcing a backward-incompatible change like that is more gain than loss anyway. Having said that, odc-stac does check for an output_crs= argument if crs= is not supplied:

https://github.com/opendatacube/odc-stac/blob/0ac758e74b232a18365fd560d91acb3be279a1ee/odc/stac/_load.py#L479-L481

but it's not in any docs for obvious reasons.

alexgleith · 2024-01-24T04:08:06Z

alexgleith
Jan 24, 2024
Maintainer

I think all of these are good ideas.

Something I really like coming back to dc.load from odc.stac.load is bbox, which was mentioned above.

I also think that auto-guessing the output CRS is a fantastic usability win.

I think it's not a terrible idea to drive the whole query API off WGS84... though I am probably not thinking of a bunch of use cases.

Regardless, having output crs as crs is +1 from me.

10 replies

woodcockr Feb 5, 2024
Maintainer

Also not all measurements are bands, nor are the measurements :-) - it's a bit awkward

Kirill888 Feb 6, 2024
Maintainer

In the STAC universe, the EO extension calls them bands, hence the name. It's relatively easy to introduce a new name for the parameter and still accept the old name as an alias with a deprecation warning. odc.stac.load does accept measurements= as an alias for bands=, but this is not documented (this time on purpose) and it does not raise a warning:

    if bands is None:
        # dc.load name for bands is measurements
        bands = kw.pop("measurements", None)

https://github.com/opendatacube/odc-stac/blob/fca4ed0e8b38e3d65ca117716439fa100a97b392/odc/stac/_load.py#L474-L476

but adding a warning is not hard.
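For illustration, the fallback above with a warning added might look something like this. The wrapper function `resolve_bands` is a hypothetical stand-in so the snippet runs on its own; it is not the odc-stac source, and the warning text is an assumption.

```python
import warnings


def resolve_bands(bands=None, **kw):
    """Return (bands, remaining kwargs), accepting measurements= as an alias."""
    if bands is None and "measurements" in kw:
        # dc.load name for bands is measurements
        warnings.warn(
            "'measurements' is accepted as an alias; prefer 'bands'",
            DeprecationWarning,
            stacklevel=2,
        )
        bands = kw.pop("measurements")
    return bands, kw
```

As in the original fallback, an explicit `bands=` takes priority and no warning is raised in that case.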

robbibt Feb 6, 2024
Maintainer Author

I think that could be a good option - it would prevent users' code from breaking and allow for nice datacube-core and odc-stac interoperability, while still giving us the option to deprecate the old param in the future.

(although we would need to work out how to deal with funcs like dc.list_measurements)

SpacemanPaul Feb 9, 2024
Maintainer

The biggest blocker to switching to "bands" in core is that the term "measurements" is pretty baked in to the EO3 metadata format.
Core could expose "bands" as an alias to "measurements" in most of our user-facing APIs, but "measurements" will continue to be used internally and in EO3 metadata for the foreseeable future, I think.

robbibt Feb 9, 2024
Maintainer Author

Makes sense - and I think an alias approach would sufficiently satisfy my main goal: that a user could take load params from odc-stac and apply them seamlessly/interchangeably to load data via datacube-core/vice-versa. 🙂
