Problem:
The current send command uses a dask cluster (when specified) to parallelise the writing of each variable stored in a given dataset to an individual Zarr store on the JASMIN object storage. The chunks belonging to a given variable are then written in serial by .to_zarr(), so the time taken to write each variable grows with its number of chunks.
Proposal:
Introduce a new send_with_dask function which uses a combination of workers and threads in a dask cluster to write the chunks belonging to each variable in parallel. The variables contained in the given dataset are looped over in serial, which is the behaviour of .to_zarr() when using an xarray Dataset.
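The proposal could look roughly like the sketch below. It is illustrative only: the signature of send_with_dask, the argument names n_workers and threads_per_worker, and the variable_store_path helper with its naming scheme are all assumptions, not the existing API. It also assumes the dataset is dask-backed (chunked), since only then does .to_zarr() hand the chunk writes to the cluster.

```python
def variable_store_path(root: str, var_name: str) -> str:
    """Hypothetical naming scheme: one Zarr store per variable under `root`."""
    return f"{root.rstrip('/')}/{var_name}.zarr"


def send_with_dask(ds, root, n_workers=4, threads_per_worker=2):
    """Write each variable of a chunked xarray Dataset to its own Zarr store.

    Variables are processed one after another (serial loop); the chunks of
    each variable are written in parallel by the dask workers and threads.
    """
    # Imported lazily so this module can be imported without dask installed.
    from dask.distributed import Client, LocalCluster

    written = []
    # Creating a Client makes the cluster the default scheduler, so the
    # .to_zarr() calls below run their chunk writes on it.
    with LocalCluster(
        n_workers=n_workers, threads_per_worker=threads_per_worker
    ) as cluster, Client(cluster):
        for name in ds.data_vars:
            store = variable_store_path(root, name)
            # ds[[name]] is a single-variable Dataset; mode="w" overwrites
            # any existing store at this location (an assumption here).
            ds[[name]].to_zarr(store, mode="w")
            written.append(store)
    return written
```

When run as a script, the call to send_with_dask would normally sit under an `if __name__ == "__main__":` guard, because LocalCluster starts worker processes. Writing to the JASMIN object storage would also need a suitable store object or storage options in place of the plain path used here.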
Challenge:
Currently, the approach above is easy to implement with a LocalCluster by simply adding two new arguments to the existing send function. However, the cluster will need to be configured differently when using JASMIN (which uses dask-gateway) or Archer2 (where a SLURMCluster is available).
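One possible way to handle this is a small factory that hides the cluster backend behind a single argument. This is a sketch under assumptions: make_cluster and the kind values "local", "gateway" and "slurm" are hypothetical names, and site-specific settings (gateway address and options on JASMIN; memory, queue and account on Archer2) are left to the caller through keyword arguments.

```python
SUPPORTED_KINDS = ("local", "gateway", "slurm")


def make_cluster(kind, n_workers=4, threads_per_worker=2, **kwargs):
    """Return a dask cluster for the requested backend (hypothetical helper).

    Extra keyword arguments are passed to the backend constructor, so
    site-specific settings stay with the caller.
    """
    if kind not in SUPPORTED_KINDS:
        raise ValueError(
            f"unknown cluster kind {kind!r}; expected one of {SUPPORTED_KINDS}"
        )

    # Backends are imported lazily so only the one in use must be installed.
    if kind == "local":
        from dask.distributed import LocalCluster

        return LocalCluster(
            n_workers=n_workers, threads_per_worker=threads_per_worker, **kwargs
        )

    if kind == "gateway":
        from dask_gateway import Gateway

        # Threads per worker are set through deployment-specific gateway
        # options, so threads_per_worker is not applied in this branch.
        cluster = Gateway(**kwargs).new_cluster()
        cluster.scale(n_workers)
        return cluster

    # kind == "slurm"
    from dask_jobqueue import SLURMCluster

    # One worker process per job, with threads_per_worker threads each.
    # The caller supplies e.g. memory=..., queue=..., account=... in kwargs.
    cluster = SLURMCluster(cores=threads_per_worker, processes=1, **kwargs)
    cluster.scale(jobs=n_workers)
    return cluster
```

With something like this in place, send_with_dask could accept a cluster kind rather than constructing a LocalCluster directly, keeping the two new arguments unchanged across platforms.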
oj-tooth changed the title from "Implementing dask parallelism to writing chunks to JASMIN OS." to "Add dask parallelism to writing chunks to JASMIN OS." on Nov 7, 2024
oj-tooth changed the title from "Add dask parallelism to writing chunks to JASMIN OS." to "Add dask parallelisation to writing chunks to JASMIN OS." on Nov 7, 2024