How to recover previous zarr store if append_dim fails (e.g. during cluster computation) but metadata has already been written? #8956
-
I am using `to_zarr` with `append_dim` to append new data to an existing zarr store. The first thing that happens is that the metadata is eagerly updated - I imagine this is a conscious design choice and has benefits. However, it also means that if something goes wrong during the writing process, we are left in an inconsistent state where the metadata does not match the actual contents of the store. To be specific, I am using dask distributed to compute and write the chunks of data on a cluster, and I want the state of my zarr store to be recoverable if, for example, the cluster fails partway through the computation. I think that to achieve this two things need to happen (a rough sketch follows the list):

1. the metadata needs to be reverted to its pre-append state, and
2. any chunks that were written by the failed append need to be deleted.
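A minimal sketch of (1), assuming a zarr v2 store accessed through an fsspec mapper; the toy dataset, bucket path, and `"time"` dimension below are stand-ins, not tested code:

```python
import numpy as np
import xarray as xr
import fsspec

# Stand-ins: a toy dataset and a hypothetical bucket path; in practice
# `ds` is the real dataset being appended along "time".
ds = xr.Dataset({"x": ("time", np.arange(3))})
store = fsspec.get_mapper("s3://my-bucket/my-store.zarr")

# Snapshot every metadata object before appending. For zarr v2 these are
# the small keys ending in .zarray / .zattrs / .zgroup, plus .zmetadata.
meta_keys = [k for k in store if k.rsplit("/", 1)[-1].startswith(".z")]
snapshot = {k: store[k] for k in meta_keys}

try:
    ds.to_zarr(store, mode="a", append_dim="time")
except Exception:
    # (1) put the pre-append metadata back so readers see the old shape
    for key, value in snapshot.items():
        store[key] = value
    raise
```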
My questions are: is there a recommended way to recover the previous state of the store after a failed append, and is there a way to avoid the metadata being updated before the chunk writes have succeeded?
**Experiments so far**

My first thought for achieving (1) was to call …, and I then tried writing the original metadata directly back to the store.

NB - I'm working with large volumes of data using cloud storage, so I'm trying to avoid solutions that involve creating copies every time / reading everything into memory.
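Along the same lines, step (2) could be a key-listing diff, which reads and copies no array data. Same stand-ins as the sketch above; note this only catches newly created keys, so it assumes the pre-append shape is chunk-aligned (otherwise the boundary chunk is overwritten in place rather than added):

```python
import numpy as np
import xarray as xr
import fsspec

ds = xr.Dataset({"x": ("time", np.arange(3))})        # toy stand-in
store = fsspec.get_mapper("s3://my-bucket/my-store.zarr")

keys_before = set(store)  # listing keys is metadata-only: nothing is copied

try:
    ds.to_zarr(store, mode="a", append_dim="time")
except Exception:
    # (2) every key created by the failed append is an orphaned chunk;
    # deleting them leaves the data region exactly as it was before.
    for orphan in set(store) - keys_before:
        del store[orphan]
    raise
```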
-
Hi @harryC-space-intelligence! I understand exactly what you're facing here. This is a challenge so many of us hit as we scale out our Zarr data processing workflows. In distributed pipelines of a certain size, some failures are inevitable for any number of reasons, and these failures can leave our datasets in a corrupted state.

One solution is to stage all of the changes locally first, either in memory or on disk, and then copy them up in bulk once complete. That is okay for small updates, but, as you noted, it's not efficient or compatible with a big distributed pipeline.

It's important to recognize that what you are asking for is a common feature of actual database systems: what you need is a database transaction. You want the entire operation to either succeed or fail consistently, and not leave your data in a corrupted state. Furthermore, you want to make writes within the transaction from distributed workers, potentially on different hosts. That complicates things!

This is not really a problem that Xarray can solve. It's actually a problem with the underlying database you're talking to: S3 (or whatever object storage you're using). S3 is in fact a database, a simple key-value store. But S3's consistency model only makes guarantees about individual objects in isolation; there are no transactions over multiple objects. Unfortunately, the state of a Zarr dataset is spread over many objects! 😫

We thought long and hard about the right way to overcome this problem. To solve it, our company (Earthmover) created Arraylake, a cloud-native data lake platform built around the Zarr data model. Essentially, we incorporate some simple ideas from database systems in order to enable transactions, rollbacks, and version control for Zarr data. You can read about how this solution works here: https://docs.earthmover.io/concepts/version_control_system. (It also does a lot of other cool stuff, like automatically providing a data catalog!)

With Arraylake, you can keep using Xarray and Zarr as you are today (including reading / writing from distributed clusters), but instead of storing your data directly in S3, you store it in Arraylake (which still ultimately talks to S3, but through an extra layer of indirection). Our customers are mostly teams using Xarray and Zarr in production contexts, where these sorts of errors are unacceptable.

Feel free to reach out ([email protected]) if you want to discuss more!
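A rough sketch of the "stage locally, then copy up in bulk" pattern described above, with hypothetical paths and a toy dataset standing in for a real full-store write:

```python
import numpy as np
import xarray as xr
import fsspec

ds = xr.Dataset({"x": ("time", np.arange(3))})  # toy stand-in

# 1. Write (and fully validate) the store on local disk first.
ds.to_zarr("/tmp/staged.zarr", mode="w")

# 2. Only after the local write succeeds, copy everything up in one pass.
fs = fsspec.filesystem("s3")
fs.put("/tmp/staged.zarr", "s3://my-bucket/my-store.zarr", recursive=True)
```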
-
@rabernat thanks very much for your response! Good to know this is a common problem, and helpful to see the bigger picture.