Propose resolution _groups_ for xarray support #114

joshmoore
Member

@joshmoore joshmoore commented Apr 21, 2022

In discussing with the xarray community, the one change to the NGFF
specification that needs to occur to prevent errors being raised
when opening a multiscale is for each resolution array to live
in a separate group. This has already been tested by thewtex
in https://github.com/spatial-image/spatial-image-multiscale and
the current spec is permissive enough to allow it. The proposal
here would enforce the subdirectories moving forward.

The conflict in xarray stems from the fact that each of our
subresolutions has the same dimension names ("x", "y", etc.)
but different sizes. This is not allowed in the xarray (or NetCDF)
model. An added benefit of this change is that other arrays with
the same resolution levels and the same dimensions (e.g. labels!)
could be stored together:

    ├── resolution-N/.zgroup
    │   ├── image/.zarray
    │   └── label/.zarray

cc: @thewtex @aurghs @malmans2
see: #48
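
To make the conflict concrete, here is a minimal sketch in plain Python (no xarray or zarr involved). The helper `check_group_dimensions`, the group names, and the array shapes are all hypothetical; the helper only mimics the rule that a dimension name can have a single size within one group.

```python
def check_group_dimensions(arrays):
    """Return the dimension names that appear with more than one size.

    `arrays` maps an array name to a (dims, shape) pair. This mimics the
    rule that, within one group, a dimension name has a single size.
    """
    sizes = {}
    for dims, shape in arrays.values():
        for dim, size in zip(dims, shape):
            sizes.setdefault(dim, set()).add(size)
    return sorted(dim for dim, seen in sizes.items() if len(seen) > 1)


# Flat layout: every resolution level sits in the same group.
flat_group = {
    "0": (("y", "x"), (512, 512)),
    "1": (("y", "x"), (256, 256)),
}

# Proposed layout: one group per resolution level. An image and a label
# sampled on the same grid can share a group without conflict.
resolution_groups = {
    "resolution-0": {
        "image": (("y", "x"), (512, 512)),
        "label": (("y", "x"), (512, 512)),
    },
    "resolution-1": {
        "image": (("y", "x"), (256, 256)),
        "label": (("y", "x"), (256, 256)),
    },
}

flat_conflicts = check_group_dimensions(flat_group)
nested_conflicts = {
    name: check_group_dimensions(group)
    for name, group in resolution_groups.items()
}
```

In the flat layout both "x" and "y" are reported as conflicting, while each per-resolution group is internally consistent.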

@joshmoore joshmoore changed the title from "Propose resolution _groups_ for xarray support (see #48)" to "Propose resolution _groups_ for xarray support" Apr 21, 2022
@joshmoore joshmoore marked this pull request as draft April 21, 2022 09:47
@joshmoore joshmoore added this to the 0.5 milestone Apr 21, 2022
@d-v-b
Contributor

d-v-b commented Apr 21, 2022

> An added benefit of this change is that other arrays with
> the same resolution levels and the same dimensions (e.g. labels!)
> could be stored together

What is the advantage of this? A downside is that it couples the downsampling process for raw data to the downsampling process for labels (or any other image in the collection). Imagine if I want raw data downsampled by 2x2x2, but labels downsampled by 4x4x4, then the proposed layout becomes tricky to parse. I think it's conceptually cleaner to group by data type (raw, labels, etc) than grouping by resolution.

@thewtex
Contributor

thewtex commented Apr 22, 2022

Intensity images, labels, and masks are often sampled on the same voxel grid. It is common to use them together, and it is helpful to identify and use this association. This pattern led to the development of xarray Dataset, which this enables.

There is not the constraint that every intensity image has to have a label image or every label image has to have an intensity image at the downsampled resolutions.

├── 0 # Each multiscale level is stored as a separate Zarr array,
│ ... # which is a folder containing chunk files which compose the array.
├── n # The name of the array is arbitrary with the ordering defined by
├── 0 # Each multiscale level is stored as a separate Zarr group
Contributor

-> separate Zarr group, possibly nested ?

Member Author

Happy to add.

@d-v-b
Contributor

d-v-b commented Apr 22, 2022

> Intensity images, labels, and masks are often sampled on the same voxel grid. It is common to use them together, and it is helpful to identify and use this association. This pattern led to the development of xarray Dataset, which this enables.

I'm all for using xarray.Dataset, but I have some datasets with over 80 different label images. I would not want the default behavior to be treating the set of all label images as a single xarray.Dataset. Instead, I would prefer that each label image is self-contained, and I combine them into xarray.Dataset instances based on the needs of a specific application.

@thewtex
Contributor

thewtex commented Apr 25, 2022

If you prefer that each label image be self-contained, and then combine them into xarray.Dataset instances based on the needs of a specific application, you can do that. I hear you. But that should not block people who want to store a label image together with its intensity image.

@joshmoore
Member Author

Ah, apologies. Late to the conversation. Thanks, both. I'm interpreting @d-v-b's last 👍 to mean that he is not proposing a MAY NOT for re-use of the pyramid group but would like to keep it as a MAY (and not SHOULD). Does that sound right? If so, I'll try to clarify the language.

│ └─ y # provide the "chunk coordinate" (t, c, z, y, x), where the maximum coordinate
│ └─ x # will be `dimension_size / chunk_size`.
│ └── image # Within the group, there will typically be a single array named "image".
│ │ # Other arrays may be added in future versions.
Member

So, the Dataset path would be 0/image, 1/image, etc.?

Member Author

Exactly.

Member

The sentence Each dictionary in "datasets" MUST contain the field "path", whose value contains the path to the array for this resolution relative to the current zarr group. should mention that the path MUST be to an array within a group, typically named "image".
And the sample with "path": "0", etc should be updated to "path": "0/image" etc.
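
As an illustration of that metadata change, the sketch below rewrites the dataset paths of one multiscales entry so that each points at an array inside a per-resolution group. The helper `nest_dataset_paths` is hypothetical, and the metadata is trimmed to the fields discussed here (a real v0.4 entry also carries axes and coordinateTransformations).

```python
import json


def nest_dataset_paths(multiscale, array_name="image"):
    """Return a copy of one multiscales entry whose dataset paths point
    at an array (default name "image") inside a per-resolution group."""
    updated = dict(multiscale)
    updated["datasets"] = [
        {**dataset, "path": f"{dataset['path']}/{array_name}"}
        for dataset in multiscale["datasets"]
    ]
    return updated


# Trimmed-down metadata with the current style of paths.
current = {
    "version": "0.4",
    "datasets": [{"path": "0"}, {"path": "1"}, {"path": "2"}],
}

proposed = nest_dataset_paths(current)
print(json.dumps(proposed["datasets"]))
```

The original entry is left untouched, so both layouts can be compared side by side.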

@d-v-b
Contributor

d-v-b commented Apr 26, 2022

I'm still not sure what to think here... the 👍 was to signify that in some situations someone might want to pack scale levels together. But for a specific format, like OME-NGFF, I don't see the appeal of a) packing scale levels together, and b) supporting multiple ways of representing the same thing.

On the contrary, I think the format should specify just one way to organize images, unless there's a really powerful (e.g., "representation X is impossible on storage backend Y") argument for polymorphism here. And if we are specifying just one way to organize images, I would strongly advocate an organization scheme that keeps separate multiscale images in separate folders / prefixes. This facilitates an access pattern where multiscale images are read / written independently, which I think is pretty common. Grouping by scale levels, on the other hand, facilitates an access pattern where the same scale for all images is read / written at once, which I think is pretty uncommon (and not very scalable).

@joshmoore
Member Author

> b) supporting multiple ways of representing the same thing.

Guess in my mind this isn't really a new way of storing things, since it's already possible at the moment. The metadata of the datasets array is definitive on where to find the related arrays.

> I would strongly advocate an organization scheme that keeps separate multiscale images in separate folders / prefixes.

(At the risk of contradicting myself) this I can see even from the metadata level. I would like to make the necessary changes to the spec so that there would only be one image (i.e. multiscale) in a zgroup.

I see @d-v-b's dilemma if I try to combine those two thoughts, since the only way would be to reference some common space outside of the current group by a "../" style reference.

@d-v-b
Contributor

d-v-b commented Apr 26, 2022

> The metadata of the datasets array is definitive on where to find the related arrays.

IIRC, the purpose of including the path metadata in datasets was to allow a single instance of multiscale metadata to describe multiple multiscale collections within the same prefix / folder (e.g., a gaussian pyramid stored alongside a laplacian-of-gaussian pyramid), and to allow some flexibility in the names given to the different scale levels. Technically someone could use this metadata to denote a multiscale pyramid that's stored in a totally different zarr container, or a different format entirely, but this is not the intended purpose of that metadata (as far as I understand it), and so we shouldn't be bound to support that usage.

@thewtex
Contributor

thewtex commented Apr 28, 2022

> which I think is pretty uncommon (and, not very scalable).

In practice, storing related images (different features, different modalities, label images, masks) that are sampled on the same sampling grid is extremely common. This is what motivated xarray Datasets, and it does not make sense to bother with Datasets without this organization. The Dataset organization enables simple and direct identification of volumes where pixels correspond. It has proven to be very scalable in the geospatial community.

@d-v-b
Contributor

d-v-b commented Apr 28, 2022

> The Dataset organization enables simple and direct identification of volumes where pixels correspond.

Yes, this storage layout is surely convenient where sampling grids match, but I can only see this working if you constrain sampling grids to match for all images.

Here's a realistic example of images I work with:

  • raw EM data: 6nm (12nm, 24nm, ...)
  • Semantic predictions from ML network: 4nm (8nm, 16nm, ...)
  • Light microscopy: 50nm (100nm, 200nm, ...)

None of them are on the same sampling grid. How would you store them in the scheme you are proposing?

@thewtex
Contributor

thewtex commented May 2, 2022

For these volumes that are not on the same sampling grid, the layout would not provide the useful indication that they are sampled on the same sampling grid; they would be stored in different groups, just like they are now. There is no additional constraint that prevents them from being stored like they are now.
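
The example can be worked through with a small sketch. The names below (`levels`, `group_by_spacing`, the image keys) are hypothetical, and pixel spacing alone is used as a stand-in for "same sampling grid"; a real check would also compare origin, shape, and orientation.

```python
from collections import defaultdict


def levels(base_nm, n_levels=3, factor=2):
    """Pixel spacing (in nm) of each pyramid level, downsampling by `factor`."""
    return [base_nm * factor**i for i in range(n_levels)]


def group_by_spacing(images):
    """Map each spacing to the (image name, level index) pairs that use it."""
    groups = defaultdict(list)
    for name, spacings in images.items():
        for level, spacing in enumerate(spacings):
            groups[spacing].append((name, level))
    return dict(groups)


def shared_grids(images):
    """Keep only the spacings used by more than one array."""
    return {
        spacing: members
        for spacing, members in group_by_spacing(images).items()
        if len(members) > 1
    }


# The three image types from the example above.
images = {
    "raw_em": levels(6),             # 6, 12, 24 nm
    "predictions": levels(4),        # 4, 8, 16 nm
    "light_microscopy": levels(50),  # 50, 100, 200 nm
}
shared = shared_grids(images)

# A hypothetical label image sampled on the raw EM grid.
shared_with_labels = shared_grids({**images, "em_labels": levels(6)})
```

With the three original image types nothing is shared, so each pyramid stays in its own groups. Only when a label image is added on the raw EM grid do common groups at 6, 12, and 24 nm become possible.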

@jni

jni commented Jul 10, 2022

> it does not [make] sense to bother with Datasets without this organization.

I think this statement is too strong, @thewtex. I fall very much on @d-v-b's side of things here, I think it absolutely makes sense to group together datasets with different pixel spacings or even grid orientations.

Now, whether the engineering constraints on the xarray side are steadfast, or can be remedied upstream, I don't know, but my personal instinct would be to push back on that constraint a bit, rather than harden the spec on the ome-ngff side. The only place where I find this layout compelling is for multiple channels, which is probably where the geosciences applications come from?

@thewtex
Contributor

thewtex commented Jul 10, 2022

> I think it absolutely makes sense to group together datasets with different pixel spacings or even grid orientations.

We are in agreement! It absolutely does make sense to group together datasets with different spacing or grid orientations. However, that is not a reason to push back on the documentation update in this PR. Datasets with different spacings and grid orientations can and should be able to be grouped together. That can be done independently of this documentation update, and this change does not put a constraint on the creation of this type of dataset association.

The proposal actually prevents a potential over-constraint that the pixel arrays live in the same group. This is unlikely to require any changes to existing code because accessing a group or a nested group is done the same way. Currently, the NGFF standard does not explicitly say that the pixel arrays are in the same group or in separate groups. This would explicitly say that they could be in different groups.

In practice, this means that NGFF can be compatible with Xarray and NetCDF, and I think we can all agree that it is in the interest of both standards to make them compatible, if possible. The nested group in Xarray/NetCDF is motivated by a reasonable approach to a need that makes sense (store the image pixel coordinates alongside the pixel data). And the use case is similar in geosciences as in bioimaging, medical imaging, microscopy: work with multi-dimensional images as numpy arrays with the same shape: multiple frequencies, multiple sensors, derived feature images, label images. Even if someone does not use this functionality, I do not think we should be unnecessarily over-constraining NGFF in a way that makes it incompatible with Xarray/NetCDF.

@jni

jni commented Jul 19, 2022

Ah, thanks @thewtex, I should have looked at the spec rather than the short summary and the ensuing discussion. As I see it, the essence of this PR is to put different resolution levels in different groups rather than different arrays within one group. (?) If one wants the groups to be singletons, that's entirely fine. (?)

Also, groups are hierarchical, (?) meaning it's totally fine to have groups of groups, ie groups of multiscale data. (?)

Given all this, I'm ok with this PR. 😅

@thewtex
Contributor

thewtex commented Jul 21, 2022

> Also, groups are hierarchical, (?) meaning it's totally fine to have groups of groups, ie groups of multiscale data. (?)

@jni yes, that's it, sorry if the explanation was not clear. There can be additional associations of data through grouping. As we continue to make progress on the spec, we can add associations to meet needs.

@joshmoore
Member Author

See discussion in Unidata/netcdf-c#2474 which suggests as part of this effort (and ASAP e.g. v0.5 if not retroactively for the previous versions) _ARRAY_DIMENSIONS should be moved into the individual xarray-compatible groups or stripped entirely.
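
For context, `_ARRAY_DIMENSIONS` is the array-level attribute that xarray's Zarr encoding uses to record dimension names. The sketch below shows how such an attribute could be derived from OME-NGFF axes metadata; the helper `array_dimensions_from_axes` and the sample attributes are hypothetical, and where the attribute should live (or whether it should exist at all) is exactly what is under discussion.

```python
def array_dimensions_from_axes(group_attrs, multiscale_index=0):
    """Derive xarray-style dimension names from OME-NGFF axes metadata."""
    axes = group_attrs["multiscales"][multiscale_index]["axes"]
    return [axis["name"] for axis in axes]


# Hypothetical group attributes, trimmed to the fields used here.
group_attrs = {
    "multiscales": [
        {
            "version": "0.4",
            "axes": [
                {"name": "c", "type": "channel"},
                {"name": "y", "type": "space", "unit": "micrometer"},
                {"name": "x", "type": "space", "unit": "micrometer"},
            ],
            "datasets": [{"path": "0"}, {"path": "1"}],
        }
    ]
}

# Attributes that would sit next to a single resolution array.
array_attrs = {"_ARRAY_DIMENSIONS": array_dimensions_from_axes(group_attrs)}
```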

@constantinpape
Contributor

> See discussion in Unidata/netcdf-c#2474 which suggests as part of this effort (and ASAP e.g. v0.5 if not retroactively for the previous versions) _ARRAY_DIMENSIONS should be moved into the individual xarray-compatible groups or stripped entirely.

How about making a patch release (0.4.1) for this?

@sbesson
Member

sbesson commented Sep 9, 2022

> How about making a patch release (0.4.1) for this?

So far, I see this proposal includes breaking changes in terms of the data layout, so I don't think a patch release is a viable option in its current form.

Semi-related: is the proposal to exclusively support the new layout, i.e. have OME-NGFF 0.x fully compatible with the netcdf/xarray model? Or would we have a period of transition where both layouts would be supported? One way or another, this decision will have implications on implementations, both readers and writers.

@constantinpape
Contributor

Thanks for the clarification @sbesson; I went through the whole discussion in more detail now and here are my thoughts:

> _ARRAY_DIMENSIONS should be [...] stripped entirely.

This could be done as a patch release (and is what I was referring to), but it does not help w.r.t. compatibility with xarray.

> _ARRAY_DIMENSIONS should be moved into the individual xarray-compatible groups [...]

Indeed, this is a breaking change and should not be done as a patch release (and for sure not retroactively, this would invalidate the v0.4 data that is out there!).

For the changes here: I guess we have two options:

  • adopt the layout changes proposed here for xarray compatibility and release them as v0.5, as this will be a breaking change (where we still need to figure out what exactly to do about multiple arrays in one resolution group)
  • not going ahead with this, which would mean either not supporting xarray, or trying to ask for changes upstream so that xarray supports the OME format (which is unlikely to happen soon from what I recall from prev. conversations on this)

I am in favor of option 1 since I do believe xarray support is important and this is the only feasible way to get there. (Although it will need a bit of refactoring in readers and writers...)

> Semi-related: is the proposal to exclusively support the new layout, i.e. have OME-NGFF 0.x fully compatible with the netcdf/xarray model? Or would we have a period of transition where both layouts would be supported?

I think having a transition period would make things complicated. If we decide to go with this change, we should stick to the versioning and require 0.5 to have the new format.

@imagesc-bot

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-ngff-community-call-transforms-and-tables/71792/5

@ivirshup
Contributor

ivirshup commented Oct 8, 2022

Hey all, thought I should leave some comments here after the spec call this week.

tldr

  • I don't think this is the right model for xarray compatibility, as this makes interaction with xarray lossy
    • I'm broadly in agreement with @d-v-b on other points here.
  • I think we can get much richer xarray compatibility via xarray BackendEntrypoints (POC at bottom). This does not require _ARRAY_DIMENSIONS or the change proposed here.

This PR is more about xr.open_zarr compatibility than xarray compatibility

I'm completely onboard with xarray compatibility, but I'm not convinced allowing xr.open_zarr instead of xr.open_dataarray(..., engine="ome") or even ome.read_image(...) is valuable. I would strongly prefer the "obvious way" to do IO with OME arrays used OME dimensional information, which xr.open_zarr will not. I would suspect xarray devs would agree, as they're working towards this specific use-case:

I would note that if we define an OME backend, we may not even have to specify the engine due to the guess_can_open API for the backend interface.
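
xarray's `BackendEntrypoint` does let a backend implement an optional `guess_can_open(filename_or_obj)` method. The sketch below shows, using only the standard library, the kind of check such a method might delegate to. The helper `looks_like_ome_zarr` is hypothetical and only handles local directory stores with Zarr v2 style `.zattrs` files.

```python
import json
import tempfile
from pathlib import Path


def looks_like_ome_zarr(path):
    """Heuristic: a directory whose .zattrs holds a non-empty "multiscales" list."""
    attrs_file = Path(path) / ".zattrs"
    if not attrs_file.is_file():
        return False
    try:
        attrs = json.loads(attrs_file.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(attrs, dict) and bool(attrs.get("multiscales"))


with tempfile.TemporaryDirectory() as tmp:
    # A directory carrying OME-style multiscales metadata.
    ome = Path(tmp) / "image.zarr"
    ome.mkdir()
    (ome / ".zattrs").write_text(json.dumps({"multiscales": [{"version": "0.4"}]}))

    # A directory with no attributes file at all.
    plain = Path(tmp) / "plain.zarr"
    plain.mkdir()

    results = (looks_like_ome_zarr(ome), looks_like_ome_zarr(plain))
```

A backend could return the result of such a check from `guess_can_open`, so that callers would not need to pass `engine=` explicitly.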

netcdf compatibility

@joshmoore has brought up the topic of netcdf compatibility.

The netcdf-c library does not seem to be able to open a zarr store which isn't formatted with the netcdf or xarray schemas (Unidata/netcdf-c#2474). I don't think this incompatibility is worth changing the spec for.

  • A reader which can only read netcdf compliant stores won't be able to read the table format
  • I think it's reasonable to say a zarr implementation must (at minimum) be able to pass the test suite at zarr-developers/zarr-implementations to read an OME store.

Splitting the zarr library out of netcdf-c would resolve this.

We should control IO

I think there's a lot of value in maintaining control of IO. I actually think we can do a better job at "xarray compatibility" if we do.

  • Instead of needing explicit arrays for labeling dimensions, we could use implicit ones with the new xarray index types
  • The tables group could (eventually) be loaded as part of an xarray.DataTree (without having to update the schema)
  • We should be doing validation at IO time, especially O

If anything, I'd say the above makes us MORE compatible with xarray, since we'd be able to provide deeper integration with its features.

You could say that we can have our own controlled IO and also be compatible with xarray's zarr schema, but I still think that'd be a bad option. Why go out of our way to allow reading and writing that uses an orthogonal coordinate system? Why make it easy to read the file wrong?

Quick demo

As a quick demo of what this could look like, with some very hacky code:

OMEBackend class definition

Workaround from: https://github.com/aurghs/ome-datatree/blob/a8cb7729156b0ec7b09e909cb0d4e43ddfc200f3/ome_datatree/ome_datatree.py#L23-L36

import zarr, xarray as xr
from xarray.backends import ZarrStore, BackendEntrypoint

from collections import namedtuple


DummyStore = namedtuple("DummyStore", ("zarr_group",))

def open_ome_array(zarr_array: zarr.Array):
    from xarray.core import indexing
    from xarray import Variable
    from xarray.backends.zarr import ZarrArrayWrapper
    
    parent_pth, name = zarr_array.path.rsplit(sep="/", maxsplit=1)
    parent = zarr.Group(zarr_array.store)[parent_pth]
    store = DummyStore(parent)

    data = indexing.LazilyIndexedArray(ZarrArrayWrapper(name, store))

    # TODO: do a better job of grabbing metadata
    dimensions = [dim["name"] for dim in parent.attrs["multiscales"][0]["axes"]]
    attributes = dict(zarr_array.attrs)
    attributes.update(dict(parent.attrs))

    encoding = {
        "chunks": zarr_array.chunks,
        "preferred_chunks": dict(zip(dimensions, zarr_array.chunks)),
        "compressor": zarr_array.compressor,
        "filters": zarr_array.filters,
    }
    # _FillValue needs to be in attributes, not encoding, so it will get
    # picked up by decode_cf
    if getattr(zarr_array, "fill_value") is not None:
        attributes["_FillValue"] = zarr_array.fill_value

    return xr.DataArray(Variable(dimensions, data, attributes, encoding), name=name)


class OMEBackend(BackendEntrypoint):
    def open_dataset(
        self,
        filename_or_obj,
        *,
        drop_variables=None,
    ):
        assert isinstance(filename_or_obj, zarr.Array)
        assert drop_variables is None

        data_array = open_ome_array(filename_or_obj)

        return xr.Dataset(
            {data_array.name: data_array}
        )


z_remote = zarr.open(
    "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0076A/10501752.zarr",
    mode="r"
)
da = xr.open_dataarray(z_remote["labels/0/0"], engine=OMEBackend)
da

This opens a backed DataArray that uses the OME metadata for its dimensions and attrs.

I think this can be made quite powerful with the new coordinate systems, and will look into extending it once there are examples.

@thewtex
Contributor

thewtex commented Oct 20, 2022

Hey @ivirshup,

Thanks for sharing your thoughts and code.

> This PR is more about xr.open_zarr compatibility than xarray compatibility

This PR is about compatibility with netCDF. xarray and xr.open_zarr compatibility come for free.

xarray Datasets are based on netCDF groups. And the OME multiscale images can be mapped to the proposed higher order xarray.DataTree in a natural way.

An OME xarray.backend is a good idea; great job on a draft implementation. And, this standard clarification avoids unnecessary complexity in that implementation.

Yes, OME-NGFF and netCDF are different standards that do not overlap 100%. However, we should strive for compatibility when possible. We will not have 100% NGFF functionality. That does not mean that the functionality that results is not valuable. Few, if any, single pieces of software implement 100% of the functionality of even the current relatively minimal OME-NGFF standard: high-content screening, axes, bioformats2raw.layout, coordinateTransformations, multiscales, omero, labels, image-label, plate, well. That does not mean the current ecosystem of software striving for OME-NGFF support does not have value.

The value of standards means that we do not need to control all the related software. Indeed, this is an extremely important quality because it allows the ecosystem to flourish. And everyone benefits as a result.

Beyond xarray, a sampling of other software tools supporting the NetCDF standard:

ANDX (ARM NetCDF Data eXtract) and ANAX (ARM NetCDF ASCII eXtract)
ANTS (ARM NetCDF Tool Suite)
ARGOS (interActive thRee-dimensional Graphics ObServatory)
CDAT (Climate Data Analysis Tool)
CDFconvert (Convert netCDF to RPN and GEMPAK Grids)
cdfsync (network synchronization of netCDF files)
CDO (Climate Data Operators)
CIDS Tools
CSIRO MATLAB/netCDF interface
EPIC
Excel Use
EzGet
FAN (File Array Notation)
FERRET
FIMEX (File Interpolation, Manipulation, and EXtraction)
FWTools (GIS Binary Kit for Windows and Linux)
GDAL (Geospatial Data Abstraction Library)
GDL (GNU Data Language)
Gfdnavi (Geophysical fluid data navigator)
Gliderscope
GMT (Generic Mapping Tools)
Grace
GrADS (Grid Analysis and Display System)
Gri
GXSM - Gnome X Scanning Microscopy project
HDF (Hierarchical Data Format) interface
HDF-EOS to netCDF converter
HIPHOP (Handy IDL-Program for HDF-Output Plotting)
HOPS (Hyperslab OPerator Suite)
iCDF (imports chromatographic netCDF data into MATLAB)
IDV (Integrated Data Viewer)
Ingrid
Intel Array Visualizer
IVE (Interactive Visualization Environment)
JSON format with the ncdump-json utility
Java interface
Kst (2D plotting tool)
Labview interface
MBDyn (MultiBody Dynamics)
Max_diff_nc
MeteoExplorer
MeteoInfo
MexEPS (MATLAB interface)
MEXNC and SNCTOOLS (a MATLAB interface)
Mirone (Windows MATLAB-based display)
ncBrowse (netCDF File Browser)
nccmp (netCDF compare)
ncdx (netCDF for OpenDX)
ncensemble (command line utility to do ensemble statistics)
NCL (NCAR Command Language)
NcML-Java Binding
NCO (NetCDF Operators)
ncregrid
nctoolbox (a MATLAB common data model interface)
NCSTAT
ncview
ncvtk
NetCDF Ninja
netcdf tools
netcdf4excel (add-in for MS Excel)
NetCDF95 alternative Fortran API
Objective-C interface
Octave interface
Octave interface (Barth)
OPeNDAP (formerly DODS)
OpenDX (formerly IBM Data Explorer)
Panoply
PnetCDF
Paraview and vtkCSCSNetCDF
Perl interfaces
PolyPaint+
Pomegranate
Pupynere (PUre PYthon NEtcdf REader)
PyNGL and PyNIO
Python interfaces
QGIS (Quantum GIS)
R interface
Ruby interface
Scientific DataSet (SDS) Library
Apache Spatial Information System (SIS)
Tcl/Tk interfaces
Tcl-nap (N-dimensional array processor)
Visual Basic and VB.net
VisAD
Weather and Climate Toolkit (WCT)
WebWinds
xdfv (A slick NetCDF/HDF4/HDF5 contents viewer with developers in mind)
xray (Python N-D labelled arrays)
Zebra

Wouldn't it be cool if we could open OME-NGFF images, even in a basic way, without having to fork and hack and control each one? And wouldn't it be cool if the community of researchers using these tools could use software tools from the OME-NGFF community, even in just a basic way?

I strongly think we should avoid unnecessarily resisting compatibility with other standards, software tools and research communities.

@ivirshup
Contributor

ivirshup commented Oct 22, 2022

> This PR is about compatibility with netCDF.

I disagree with this. The title, branch name, and description are pretty specific to xarray – as is most of the discussion. The referenced issue is titled: "Compatibility with xarray".


> Wouldn't it be cool if we could open OME-NGFF images,

I agree compatibility with existing tools is useful, however:

  1. I think how useful is a fair question. Like, what specific features are we getting here? xarray (xray on the list provided) is a great example here. Compatibility with xarray would be extremely useful, and worth considering, but we don't need netCDF compat for that.

  2. I also have strong suspicions about whether many of the tools listed actually work with a zarr netcdf store.

  3. My current understanding of plans for the ome-ngff spec:

  • You can't store different images in the same multiscales group
  • Label masks won't be stored alongside images

My understanding is that netcdf uses arrays being stored in the same groups to indicate that they should be used together (e.g. in an xarray.Dataset). I think this limits how useful having a netcdf compatible tool read directly from an ome-zarr store can be.


> The value of standards means that we do not need to control all the related software. Indeed, this is an extremely important quality because it allows the ecosystem to flourish. And everyone benefits as a result.

I agree, but I think a lot of the potential here is actually realized from building on a standard like zarr, rather than going to a standard on top of zarr.


Alternative vision for netcdf compat

I think it would be quite easy to create a view of an ome-zarr store that was compatible with netcdf usage. This could be done with references (e.g. symlinks) and metadata transformations. Not so different from:

 > [I] prefer that each label image is self-contained, then combine them into xarray.Dataset instances based on the needs of a specific application

But with the added benefit that the format itself can keep the "one way to store an image", while having broader compatibility with netcdf.
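
A rough sketch of the symlink part of that idea, using only the standard library: it builds a per-resolution view on top of a store that keeps each image in its own folder. The function `build_resolution_view` and the directory names are hypothetical, and the sketch leaves out the metadata transformations (for example writing `.zgroup` files or rewriting attributes) that a usable view would also need.

```python
import os
import tempfile
from pathlib import Path


def build_resolution_view(store, view, images, n_levels):
    """Create view/resolution-<i>/<image> symlinks pointing at store/<image>/<i>."""
    for level in range(n_levels):
        group = Path(view) / f"resolution-{level}"
        group.mkdir(parents=True, exist_ok=True)
        for image in images:
            target = Path(store) / image / str(level)
            if target.is_dir():  # an image may not have every level
                os.symlink(target, group / image, target_is_directory=True)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in store: "raw" has two levels, "labels" only one.
    store = Path(tmp) / "store"
    for image, n_levels in (("raw", 2), ("labels", 1)):
        for level in range(n_levels):
            (store / image / str(level)).mkdir(parents=True)

    view = Path(tmp) / "view"
    build_resolution_view(store, view, ["raw", "labels"], n_levels=2)

    listing = {
        group.name: sorted(entry.name for entry in group.iterdir())
        for group in sorted(view.iterdir())
    }
```

The store itself keeps one folder per image, and the view only adds references, so the "one way to store an image" principle is left intact.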

Alternative alternative vision for netcdf compat

Another vision would be to go full on netCDF, and layer all of OME-NGFF on top of it. I would assume this conversation has happened before here.

@d-v-b
Contributor

d-v-b commented Oct 23, 2022

The changes proposed in this PR (storing scale levels in separate groups) open up two possibilities:

  • We can put multiple images that happen to have the same coordinates in one directory.

As I noted earlier in this issue, I don't love this idea. I stand by the principle that we should have just 1 way of organizing multiscale images, and it should be a way that isolates different multiscale images from each other.

I actually quite like this idea, but it would be a radical departure from the ongoing conversations about transformation metadata (where the assumption is that transformations, and thus coordinates, are defined in JSON metadata). It seems premature to change the spec to support coordinate arrays before there's concrete proposal to actually use coordinate arrays for OME-NGFF. I should note that we only get xarray compatibility "for free" if we use exactly their zarr encoding, and the real blocker for that is the absence of coordinate arrays in OME-NGFF. I would love free compatibility with xarray but this PR doesn't actually bring us closer to it unless we have xarray-compatible coordinates.

@thewtex is there anything I'm missing here? Specifically, is there any way to get xarray compatibility without using coordinate arrays in OME-NGFF?

@ivirshup
Contributor

> As I noted earlier in this issue, I don't love this idea. I stand by the principle that we should have just 1 way of organizing multiscale images, and it should be a way that isolates different multiscale images from each other.

I would strongly agree with this point.

> I actually quite like this idea, but it would be a radical departure from the ongoing conversations about transformation metadata (where the assumption is that transformations, and thus coordinates, are defined in JSON metadata)... and the real blocker for that is the absence of coordinate arrays in OME-NGFF. I would love free compatibility with xarray but this PR doesn't actually bring us closer to it unless we have xarray-compatible coordinates.

I believe the coordinates and displacements transformations kinda allow this, but it isn't strictly compatible with netcdf. The coordinates can live anywhere (the metadata just needs a path), and you don't need to use coordinates.

@thewtex
Contributor

thewtex commented Oct 27, 2022

> The title, branch name, and description are pretty specific to xarray

> I think how useful is a fair question.

There are many mentions of xarray. It is worth inspecting why:

  1. The proposal is to place images in groups?
  2. What is the motivation?

Regarding 1), by placing the image pixel array and metadata in a common group, we gain compatibility with the netCDF groups. And xarray is based on the netCDF data model:

Xarray provides two primary data structures, the xarray.DataArray and the xarray.Dataset

Xarray’s highest-level object is currently an xarray.Dataset, whose data model echoes that of a single netCDF group.

Extensions to Xarray's data model (labeled arrays without coordinates, and hierarchical data via the xarray DataTree) are intentionally compatible with the netCDF data model:

WIP implementation of a tree-like hierarchical data structure for xarray.

This aims to create the data structure discussed in xarray issue #4118, and therefore extend xarray's data model to be able to handle arbitrarily nested netCDF4 groups.

The approach used here is based on benbovy's DatasetNode example - the basic idea is that each tree node wraps a up to a single xarray.Dataset.

Note that nodes of the tree are xarray.Dataset's and not xarray.DataArray's.

Deviation from the netCDF data model means deviation from the standard model used by the geospatial research community. This model has been around for decades, is used by other software, and does its job well. It is a standard backed by a community of existing data and software. There are other Python libraries supporting netCDF, and other software built on the netCDF C library and the netCDF Java library.

Regarding 2), maybe some folks are only interested in possible use of xarray as another Python library. Speaking for myself at least, I am interested in broader compatibility between open data and open source software, xarray and beyond, between the geospatial research community and the scientific software research communities. I would like the ability to use the same software developed for climate research in cancer research, and vice versa. Many of the algorithms developed do not care whether the pixels come from clouds or cells.

This means compatibility between the OME-NGFF and netCDF data models. We get a lot of value by being able to load pixel data and dimension names. This is why the labelled array without coordinates is mentioned and is considered. Many times, just a pixel data array goes a long way.

By placing the image in a group, we gain compatibility of pixel data array between OME-NGFF and netCDF.
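For readers who want to see the underlying conflict, here is a minimal sketch, assuming `xarray` and `numpy` are installed. The names `s0`, `s1`, `resolution-0`, and `resolution-1` are made up for illustration and are not taken from the spec. Two resolution levels that share dimension names but differ in size cannot sit in a single `xarray.Dataset`, whereas one Dataset per group works:

```python
import numpy as np
import xarray as xr

# Two hypothetical resolution levels: same dimension names, different sizes.
full = np.zeros((4, 4))
half = np.zeros((2, 2))


def try_single_dataset():
    """Put both levels in one Dataset; xarray rejects the size conflict."""
    try:
        xr.Dataset({"s0": (("y", "x"), full), "s1": (("y", "x"), half)})
    except ValueError as err:
        return err
    return None


# One Dataset per resolution level (mirroring one group per level) is fine.
per_group = {
    "resolution-0": xr.Dataset({"image": (("y", "x"), full)}),
    "resolution-1": xr.Dataset({"image": (("y", "x"), half)}),
}

conflict = try_single_dataset()
print(type(conflict).__name__)
```

This only illustrates the in-memory data model; it does not read or write an OME-NGFF store.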

create a view of an ome-zarr store that was compatible with netcdf usage. This could be done with references (e.g. symlinks) and metadata transformations.

Yes, this alternative is worth considering, but it adds unnecessary complexity for just accessing pixel data. And implementing it across all the software that supports the netCDF data model is not scalable or sustainable.

Another vision would be to go full on netCDF, and layer all of OME-NGFF on top of it. I would assume this conversation has happened before here.

We do not want to shoehorn all of OME-NGFF into netCDF, but that is not what is proposed.


I also have strong suspicions about whether many of the tools listed actually work with a zarr netcdf store.

Many tools are based on the Unidata netCDF C library, which is getting zarr support, as previously mentioned, and that will trickle down.


I agree, but I think a lot of the potential here is actually realized from building on a standard like zarr, rather than going to a standard on top of zarr.

This approach was taken many times in the TIFF ecosystem -- parties came along, built on the TIFF format, and created their own data models that did not share common tags. Sure, transformations could be implemented. But this causes unnecessary pain when common information is desired. Wikipedia's characterization of TIFF:

TIFF is a complex format, defining many tags of which typically only a few are used in each file. This led to implementations supporting many varying subsets of the format, a situation that gave rise to the joke that TIFF stands for Thousands of Incompatible File Formats.

We should seek compatibility when possible and appropriate. Loading pixel arrays is important.


it would be a radical departure from the ongoing conversations about transformation metadata (where the assumption is that transformations, and thus coordinates, are defined in JSON metadata). It seems premature to change the spec to support coordinate arrays before there's a concrete proposal to actually use coordinate arrays for OME-NGFF.

@d-v-b I agree with you. We should not bring in xarray's support for netCDF data coordinates into OME-NGFF unless it is appropriate.

And this proposal does not add coordinates to OME-NGFF.

We can store coordinate arrays alongside image data, like xarray's zarr encoding.

Note that xarray can save and load Dataset's with and without coordinate arrays. They are not required.
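A minimal sketch of what "without coordinate arrays" means in practice, assuming `xarray` and `numpy` are installed (the variable name `image` is arbitrary): the Dataset carries dimension names only, and positional indexing by dimension name still works.

```python
import numpy as np
import xarray as xr

# A labeled array with dimension names only; no coordinate arrays attached.
ds = xr.Dataset({"image": (("y", "x"), np.arange(6).reshape(2, 3))})

coord_names = list(ds.coords)  # no coordinate variables are present
dims = ds["image"].dims

# Positional (integer) indexing by dimension name does not need coordinates.
row = ds["image"].isel(y=1).values
print(coord_names, dims, row)
```

Label-based selection in physical units is what would need coordinates; that part is not shown here.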


Specifically, is there any way to get xarray compatibility without using coordinate arrays in OME-NGFF?

Yes!

  1. Merge this PR. :-). For pixel array data and dimension labels.

  2. @ivirshup has the excellent idea to create an xarray OME backend, which could populate coordinate arrays based on the OME spatial transformation metadata.
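As a rough sketch of the core step such a backend might perform (this is not an existing xarray or OME API; the function name `coords_from_transforms` and the example metadata are hypothetical), per-axis coordinates can be derived by applying `scale` and `translation` entries, in order, to the integer pixel indices:

```python
def coords_from_transforms(shape, axes, transforms):
    """Return {axis name: list of physical coordinates} for one array.

    Hypothetical helper: applies OME-NGFF-style `scale` and `translation`
    coordinateTransformations, in order, to the integer pixel indices.
    """
    coords = {name: [float(i) for i in range(n)] for name, n in zip(axes, shape)}
    for t in transforms:
        if t["type"] == "scale":
            for name, factor in zip(axes, t["scale"]):
                coords[name] = [c * factor for c in coords[name]]
        elif t["type"] == "translation":
            for name, offset in zip(axes, t["translation"]):
                coords[name] = [c + offset for c in coords[name]]
        else:
            raise ValueError(f"unsupported transformation: {t['type']}")
    return coords


# Made-up metadata for a 2x4 array.
example = coords_from_transforms(
    shape=(2, 4),
    axes=["y", "x"],
    transforms=[
        {"type": "scale", "scale": [2.0, 0.5]},
        {"type": "translation", "translation": [10.0, 0.0]},
    ],
)
print(example)
```

The resulting lists could then be passed as `coords` when constructing an `xarray.DataArray`; how a real backend would wire this up is left open.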

@d-v-b
Contributor

d-v-b commented Oct 27, 2022

Note that xarray can save and load Dataset's with and without coordinate arrays. They are not required.

Coordinate arrays are required if you want xarray to know about the coordinates of your data. Loading OME-NGFF data into xarray without the coordinates (specified implicitly via coordinateTransformations) would not be a good use of xarray.

Specifically, is there any way to get xarray compatibility without using coordinate arrays in OME-NGFF?

Yes!

1. Merge this PR. :-). For pixel array data and dimension labels.

2. @ivirshup has the excellent idea to create an xarray OME backend, which could populate coordinate arrays based on the OME spatial transformation metadata.

I believe that only option 2 is needed, because the hypothetical xarray OME backend would be free to create xarray.Dataset instances from collections of zarr arrays, so putting images in separate groups would not be needed.

In fact, I think we should aim for a situation where either OME-NGFF images are read correctly by xarray (e.g., with coordinates), or not at all, with no middle ground. Without including either a) xarray-compatible coordinate arrays, or b) the assurance of an OME-NGFF backend for xarray, this PR enables lossy deserialization of OME-NGFF images by xarray which could lead to massive confusion -- for example, after this PR, someone might load an OME-NGFF collection into xarray, generate some coordinates (because the OME-NGFF transformations were ignored by xarray), and then use Dataset.to_zarr(), without noting that this generates an invalid OME-NGFF container. Until the xarray support is complete and lossless, we shouldn't encourage people to use it with OME-NGFF data.

@thewtex
Contributor

thewtex commented Oct 31, 2022

This proposal supports use of Xarray, including coordinates, correctly. It is not necessary to only support loading into Xarray through overly complex transformations that only work in specific implementations and intentionally diverge from conventions of the geospatial community.

Also, coordinate arrays are not required by all use cases in Xarray. The fact is, you can write and read xarray Datasets without coordinates. This is what motivates a labeled array without coordinates in a simplified version of the Xarray package proposed in the medical imaging community (nibabel). We should look to support this use case.

Without including either a) xarray-compatible coordinate arrays, or b) the assurance of an OME-NGFF backend for xarray, this PR enables lossy deserialization of OME-NGFF images by xarray which could lead to massive confusion

This is not correct. There is not going to be massive confusion if coords are not present. There could be confusion if there were coords loaded and they did not have the right values. But that is not the case.

use Dataset.to_zarr(), without noting that this generates an invalid OME-NGFF container.

Dataset.to_zarr is not going to automatically generate a valid OME-NGFF with or without this proposal.

@d-v-b
Contributor

d-v-b commented Nov 1, 2022

Also, coordinate arrays are not required by all use cases in Xarray.

As a frequent xarray user, I'm a little skeptical of this claim. Yes, technically you can have dimensions without coordinates, but in my experience, coordinates are the key feature of xarray, and I simply wouldn't use the library if I didn't want coordinates. And I find it very hard to imagine doing anything useful with multiscale images in xarray without coordinates, because there will be no way to relate different scale levels to one another. So, speaking for myself, if xarray could natively load OME-NGFF multiscale images but not generate coordinates, this would not be terribly useful unless coordinates were handled correctly.

This proposal supports use of Xarray, including coordinates, correctly.

Can you explain how this proposal handles coordinates correctly? As I understand it, there are only two ways to get xarray-compatible coordinates for OME-NGFF (explicit coordinate arrays or an OME-NGFF backend for xarray).

@d-v-b
Contributor

d-v-b commented Mar 12, 2023

the spirit of this discussion continues in #174

@briannapagan

@d-v-b + others on this thread: would you be able to join one of the GeoZarr Steering Working Group meetings to discuss this a bit further? Info on scheduling/how to join is here: https://hackmd.io/@briannapagan/geozarr-spec-swg

@jni

jni commented Mar 29, 2023

@briannapagan I'd love to attend but the time zone lines up very poorly for me (I'm in Melbourne Australia). Having said that for the one in mid-April maybe I'll stay up. 😃 (We lose DST this weekend so it becomes 1am-2am, slightly closer to a civilised time. 😅)

@jni

jni commented Mar 29, 2023

Wait is it EST or EDT? 🤔 #TimeZonesAreHard

@briannapagan

@jni it's EDT now (thanks for the catch on the doc, and yes #TimeZonesAreHard haha!).

Trying to keep it as async as possible - but also happy to chat one on one in the future at a time better for you. You can also track convos: https://github.com/zarr-developers/geozarr-spec
