Consolidated Zarr support could improve S3 data loading #2987

mannreis · 2024-08-19T13:13:48Z

Hello 👋

We've noticed a large difference in read time between a remote Zarr dataset [https://...#mode=s3,zarr] and a local one [file://....#mode=file,zarr]:

$ time ncdump/ncdump -v tas file://${HOME}/${DATASET}/#mode=zarr | tail -n+2 | md5sum
abd28bc55fb9d0c25a3767a43d27110a -
real 0m0.111s
user 0m0.104s
sys 0m0.017s

$ time ncdump/ncdump -v tas https://${ENDPOINT}/${BUCKET}/${DATASET}/#mode=zarr,s3 | tail -n+2 | md5sum
abd28bc55fb9d0c25a3767a43d27110a -
real 0m9.854s
user 0m4.739s
sys 0m0.162s

Network overhead is expected, especially if the service imposes rate limits. But such a difference motivated me to look at the implementation's behaviour.

It seems that the approach used by netcdf is similar to the one used by Python Zarr: all the metadata is fetched in advance. For this reason, netcdf sends the following requests for my example above:

4 GET requests to list the dataset's metadata files
674 HEAD requests, mainly to fetch the size of the object to be transferred
224 GET requests to actually read the content of both metadata (223) and data/chunk (1) objects.

There are about 3x more HEAD than GET requests, so reducing them could be a tiny improvement, but overall this is not much different from what Python does:

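import os
import s3fs
import zarr

# Anonymous access to an S3-compatible endpoint
s3 = s3fs.S3FileSystem(endpoint_url=f'https://{os.environ["ENDPOINT"]}/', anon=True)
store = s3fs.S3Map(root=f'{os.environ["BUCKET"]}/{os.environ["DATASET"]}', s3=s3)
d = zarr.open(store, mode='r')  # read-only, without consolidated metadata
print(d.info)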

Which produces:

127 GET requests to list metadata files or variable names after the dataset prefix
349 HEAD requests to check for metadata files

Implementing a consolidated access mode could improve the situation. In Python, the example above can be simplified to a single request:

1 GET Request to fetch the content of /.zmetadata (Note that not even a HEAD request is done in advance)

import os
import s3fs
import zarr

# Anonymous access to an S3-compatible endpoint
s3 = s3fs.S3FileSystem(endpoint_url=f'https://{os.environ["ENDPOINT"]}/', anon=True)
store = s3fs.S3Map(root=f'{os.environ["BUCKET"]}/{os.environ["DATASET"]}', s3=s3)
d = zarr.open_consolidated(store)  # metadata comes from /.zmetadata
print(d.info)
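To illustrate why consolidation cuts the request count, here is a minimal sketch that uses a plain in-memory dictionary in place of S3. The `CountingStore` class, the `consolidate` helper, and the sample keys and values are hypothetical and trimmed for brevity; only the top-level `.zmetadata` layout (`zarr_consolidated_format` and `metadata`) follows what zarr-python v2 writes.

```python
import json


class CountingStore:
    """Hypothetical stand-in for an object store that counts reads."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.gets = 0

    def get(self, key):
        self.gets += 1
        return self._objects[key]


def consolidate(objects):
    """Collect every .zgroup/.zarray/.zattrs document under one JSON blob."""
    metadata = {
        key: json.loads(value)
        for key, value in objects.items()
        if key.rsplit("/", 1)[-1] in (".zgroup", ".zarray", ".zattrs")
    }
    doc = {"zarr_consolidated_format": 1, "metadata": metadata}
    return json.dumps(doc).encode()


# Sample dataset: a root group and one array named "tas" (illustrative values).
objects = {
    ".zgroup": b'{"zarr_format": 2}',
    ".zattrs": b"{}",
    "tas/.zarray": b'{"zarr_format": 2, "shape": [4], "chunks": [4], "dtype": "<f4"}',
    "tas/.zattrs": b'{"units": "K"}',
    "tas/0": b"\x00" * 16,
}
objects[".zmetadata"] = consolidate(objects)

# Unconsolidated: one read per metadata object.
plain = CountingStore(objects)
meta_keys = [".zgroup", ".zattrs", "tas/.zarray", "tas/.zattrs"]
per_key = {k: json.loads(plain.get(k)) for k in meta_keys}

# Consolidated: a single read returns the same documents.
cons = CountingStore(objects)
all_meta = json.loads(cons.get(".zmetadata"))["metadata"]

print(plain.gets, cons.gets)
```

The number of reads in the unconsolidated case grows with the number of groups and arrays, while the consolidated case stays at one read regardless of dataset size.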

If this is desired, perhaps it could be supported by other modes as well, like file (or even zip!?). In that case I think it would be part of the zarr API and not a specific zmap S3 implementation.

I will try to come up with a PR for this, but it would be great to have some feedback and, if positive, some pointers/draft on how to support it (via #mode=consolidated controls? Environment? Only when built --with-consolidated-zarr?)
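As a rough illustration of the `#mode=` option, the sketch below pulls the mode flags out of a URL fragment and checks for a `consolidated` entry. It is written in Python for brevity and is not how netcdf-c parses URLs; the `parse_mode_flags` helper and the example URL are hypothetical, the `consolidated` flag is only the proposal from this thread, and the sketch assumes fragment parameters are separated by `&`.

```python
from urllib.parse import urlsplit


def parse_mode_flags(url):
    """Return the set of flags listed in a '#mode=a,b,c' URL fragment."""
    fragment = urlsplit(url).fragment
    for part in fragment.split("&"):
        name, _, value = part.partition("=")
        if name == "mode":
            return {flag for flag in value.split(",") if flag}
    return set()


# Hypothetical URL; "consolidated" is a proposed flag, not an existing one.
flags = parse_mode_flags("https://example.org/bucket/dataset/#mode=zarr,s3,consolidated")
use_consolidated = "consolidated" in flags
print(sorted(flags), use_consolidated)
```

A flag in the fragment keeps the choice per dataset, whereas an environment variable or a build option would apply to every dataset opened by the process or the build.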

Thanks!

joshmoore · 2024-08-26T14:53:03Z

👍 for support of Zarr v2 "consolidated". A discussion on the zarr-python team yesterday touched on how to deal with possible differences in the definition of consolidated between the v2 and v3 formats. The decision was to add arguments to the v3 library that enable the v2 consolidated format, but potentially disallow those arguments when producing the v3 format (since the v3 library will need to support both the v2 and v3 formats). </tongue_twister>

DennisHeimbigner · 2024-08-26T17:27:28Z

Sorry, I apparently missed this Issue when it was first posted.
In any case, we have always planned to support consolidated
metadata for both V2 and V3. The problem was that there appeared
to be no specification for the JSON for consolidated metadata.
Has that changed? Can you point me to that spec?
Josh's note about V3 supporting both V2 and V3 is unclear.
I get that the actual (python) library will need to read files in both V2
and V3 formats. But I do not understand this remark:

> The decision was to add arguments to enable the v2 consolidated format to the v3 library, but potentially disallow those arguments when producing the v3 format

What kind of arguments are being considered?

joshmoore · 2024-08-26T19:08:02Z

> The problem was that there appeared
> to be no specification for the JSON for consolidated metadata.
> Has that changed? Can you point me to that spec?

No, that has not changed, but agreed that that is a difficulty for the v2 format.

> What kind of arguments are being considered?

Correction. For zarr-python library v2, I should have said "methods" or "API" for activating consolidated metadata. (Those don't yet exist for zarr-python library v3.) The method arguments I was thinking of are in xarray: https://docs.xarray.dev/en/stable/user-guide/io.html#consolidated-metadata

DennisHeimbigner · 2024-08-26T19:26:38Z

Ok, I see.
So the big holdup at the moment is a JSON spec for
consolidated metadata for V2 and another for V3.

joshmoore · 2024-08-26T19:29:33Z

The discussion around V3 is currently ongoing. It's unlikely that there will be significant work on a V2 "spec". (I would certainly be for having an "upgrade guide" between the two which may be as close as we can come.)

mannreis · 2024-08-27T07:55:16Z

Thanks for the discussion! In the meantime I've tried just adding a "caching layer" to the metadata functions that would GET the .z* files, to see what the difference would be [1]. I've opened #2992, but it's a draft and perhaps not useful in the long term.

[1]

$ time ncdump/ncdump -v tas https://${ENDPOINT}/${BUCKET}/${DATASET}/#mode=zarr,s3,consolidated | tail -n+2 | md5sum
abd28bc55fb9d0c25a3767a43d27110a -
real	0m0.262s
user	0m0.155s
sys	0m0.022s
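The general idea of a metadata "caching layer" can be sketched as follows. This is only an illustration in Python, not the code in the draft PR: the `CachedMetadataStore` class and the `fake_fetch` function are hypothetical, and the sketch assumes metadata objects are recognised by a `.z` name prefix while chunk reads pass straight through.

```python
class CachedMetadataStore:
    """Hypothetical wrapper that fetches each .z* object at most once."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: key -> bytes (e.g. an S3 GET)
        self._cache = {}

    def get(self, key):
        name = key.rsplit("/", 1)[-1]
        if not name.startswith(".z"):
            return self._fetch(key)  # chunk data bypasses the cache
        if key not in self._cache:
            self._cache[key] = self._fetch(key)
        return self._cache[key]


calls = []


def fake_fetch(key):
    """Stand-in for a remote read; records every key actually fetched."""
    calls.append(key)
    return b"{}"


store = CachedMetadataStore(fake_fetch)
for _ in range(3):
    store.get("tas/.zarray")  # only the first call reaches fake_fetch
store.get("tas/0")
store.get("tas/0")  # chunk reads are not cached in this sketch
print(calls)
```

Pre-filling `_cache` from a single `.zmetadata` read would remove even the first per-object fetch, which is what a consolidated mode aims for.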

mannreis mentioned this issue Aug 27, 2024

Draft: Add mode to read consolidated ZARR datasets #2992
