How to recover previous zarr store if append_dim fails (e.g. during cluster computation) but metadata has already been written? #8956
-
I am using `to_zarr` with `append_dim` to append new data to an existing zarr store. The first thing that happens is that the metadata is eagerly updated - I imagine this is a conscious design choice and has benefits. However, it also means that if something goes wrong during the writing process, we are left in an inconsistent state where the metadata does not match the actual contents of the store. To be specific, I am using dask distributed to compute and write the chunks of data on a cluster, and I want the state of my zarr store to be recoverable if, for example, the cluster fails partway through the computation. I think that to achieve this two things need to happen (a rough sketch follows the list):

1. the metadata needs to be reverted to its pre-append state, and
2. any chunks that were written by the failed append need to be deleted.
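A minimal sketch of (1), assuming a zarr v2 store accessed through an fsspec mapper; the toy dataset, bucket path, and `"time"` dimension below are stand-ins, not tested code:

```python
import numpy as np
import xarray as xr
import fsspec

# Stand-ins: a toy dataset and a hypothetical bucket path; in practice
# `ds` is the real dataset being appended along "time".
ds = xr.Dataset({"x": ("time", np.arange(3))})
store = fsspec.get_mapper("s3://my-bucket/my-store.zarr")

# Snapshot every metadata object before appending. For zarr v2 these are
# the small keys ending in .zarray / .zattrs / .zgroup, plus .zmetadata.
meta_keys = [k for k in store if k.rsplit("/", 1)[-1].startswith(".z")]
snapshot = {k: store[k] for k in meta_keys}

try:
    ds.to_zarr(store, mode="a", append_dim="time")
except Exception:
    # (1) put the pre-append metadata back so readers see the old shape
    for key, value in snapshot.items():
        store[key] = value
    raise
```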
My questions are: is there a recommended way to recover the previous state of the store after a failed append, and is there a way to avoid the metadata being updated before the chunk writes have succeeded?
**Experiments so far**

My first thought for achieving (1) was to call …, and I then tried writing the original metadata directly back to the store.

NB - I'm working with large volumes of data using cloud storage, so I'm trying to avoid solutions that involve creating copies every time / reading everything into memory.
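Along the same lines, step (2) could be a key-listing diff, which reads and copies no array data. Same stand-ins as the sketch above; note this only catches newly created keys, so it assumes the pre-append shape is chunk-aligned (otherwise the boundary chunk is overwritten in place rather than added):

```python
import numpy as np
import xarray as xr
import fsspec

ds = xr.Dataset({"x": ("time", np.arange(3))})        # toy stand-in
store = fsspec.get_mapper("s3://my-bucket/my-store.zarr")

keys_before = set(store)  # listing keys is metadata-only: nothing is copied

try:
    ds.to_zarr(store, mode="a", append_dim="time")
except Exception:
    # (2) every key created by the failed append is an orphaned chunk;
    # deleting them leaves the data region exactly as it was before.
    for orphan in set(store) - keys_before:
        del store[orphan]
    raise
```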
-
Hi @harryC-space-intelligence! I understand exactly what you're facing here. This is a challenge so many of us hit as we scale out our Zarr data processing workflows. In distributed pipelines of a certain size, some failures are inevitable for any number of reasons, and these failures can leave our datasets in a corrupted state.

One solution is to stage all of the changes locally first, either in memory or on disk, and then copy them up in bulk once complete. That is okay for small updates, but, as you noted, it's not efficient or compatible with a big distributed pipeline.

It's important to recognize that what you are asking for is a common feature of actual database systems: what you need is a database transaction. You want the entire operation to either succeed or fail consistently, and not leave your data in a corrupted state. Furthermore, you want to make writes within the transaction from distributed workers, potentially on different hosts. That complicates things!

This is not really a problem that Xarray can solve. It's actually a problem with the underlying database you're talking to: S3 (or whatever object storage you're using). S3 is in fact a database, a simple key-value store. But S3's consistency model only makes guarantees about individual objects in isolation; there are no transactions over multiple objects. Unfortunately, the state of a Zarr dataset is spread over many objects! 😫

We thought long and hard about the right way to overcome this problem. To solve it, our company (Earthmover) created Arraylake, a cloud-native data lake platform built around the Zarr data model. Essentially, we incorporate some simple ideas from database systems in order to enable transactions, rollbacks, and version control for Zarr data. You can read about how this solution works here: https://docs.earthmover.io/concepts/version_control_system. (It also does a lot of other cool stuff, like automatically providing a data catalog!)

With Arraylake, you can keep using Xarray and Zarr as you are today (including reading / writing from distributed clusters), but instead of storing your data directly in S3, you store it in Arraylake (which still ultimately talks to S3, but through an extra layer of indirection). Our customers are mostly teams using Xarray and Zarr in production contexts, where these sorts of errors are unacceptable.

Feel free to reach out ([email protected]) if you want to discuss more!
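A rough sketch of the "stage locally, then copy up in bulk" pattern described above, with hypothetical paths and a toy dataset standing in for a real full-store write:

```python
import numpy as np
import xarray as xr
import fsspec

ds = xr.Dataset({"x": ("time", np.arange(3))})  # toy stand-in

# 1. Write (and fully validate) the store on local disk first.
ds.to_zarr("/tmp/staged.zarr", mode="w")

# 2. Only after the local write succeeds, copy everything up in one pass.
fs = fsspec.filesystem("s3")
fs.put("/tmp/staged.zarr", "s3://my-bucket/my-store.zarr", recursive=True)
```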
-
@rabernat thanks very much for your response! Good to know this is a common problem, and helpful to see the bigger picture.