Restart P2P if cluster significantly grew in size #8673

Open
fjetter opened this issue Jun 5, 2024 · 3 comments
Labels: adaptive, enhancement, performance, shuffle

Comments


fjetter commented Jun 5, 2024

A downside of the new P2P algorithm is that it locks in the set of participating workers as soon as the very first P2P task starts executing on a worker. This is a well-known problem for downscaling clusters (and also when a worker dies) and is currently handled by restarting the entire P2P run.

For upscaling clusters, no such logic is currently implemented. New workers are only allowed to participate in new P2P runs or non-P2P tasks.

This is particularly problematic if one starts a cluster with only one worker (or none) and expects adaptive scaling to handle the rest. The most likely failure in this situation is that the entire P2P operation is pinned to a single worker, which eventually dies with an out-of-disk exception (unless the dataset is small, of course).

In the past we discussed some sophisticated implementations involving ring hashing that would let us resume work, but I would like to explicitly declare that out of scope for the moment and instead pursue a simpler approach.

With the tools available to us, I assume the easiest way to do this would be to restart a P2P operation whenever a certain heuristic is true.

For example: if cluster size increased by X% and P2P transfer progress is below Y%, restart the P2P operation.

This heuristic should capture cases where a restart would let us finish at least as quickly as waiting would.
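
For illustration only, a minimal sketch of such a predicate might look like the following; the function name, the X/Y parameters, and the idea of passing worker counts and transfer progress in directly are assumptions for this sketch, not part of the existing P2P extension:

```python
def should_restart_p2p(
    workers_at_start: int,      # workers participating when the P2P run started
    workers_now: int,           # workers currently available
    transfer_progress: float,   # fraction of completed transfer tasks, in [0, 1]
    growth_threshold: float,    # X: minimum relative cluster growth, e.g. 1.0 for +100%
    progress_threshold: float,  # Y: maximum progress at which a restart still pays off
) -> bool:
    """Return True if restarting the P2P run is expected to finish sooner than waiting."""
    if workers_at_start <= 0:
        return False
    growth = (workers_now - workers_at_start) / workers_at_start
    return growth >= growth_threshold and transfer_progress < progress_threshold
```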


fjetter commented Jun 5, 2024

I think a very simple heuristic like

  • Cluster size has doubled
  • Rechunk transfer still below 25% (I actually expect the transfer phase to scale linearly, so we could crank this up to 50% or even further)

would go a long way toward avoiding the very obvious failure case described here: https://discourse.pangeo.io/t/rechunking-large-data-at-constant-memory-in-dask-experimental/3266/8?u=fjetter where the user starts with one worker and runs into an error.

We can still fine-tune later. When implementing this, we should also think about which metrics would be useful for further tuning; a rough sketch of such a check follows below.
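
As a concrete illustration of these thresholds, the sketch below computes transfer progress as the fraction of completed transfer tasks (see the clarification in the next comment) and applies the doubled-cluster / 25% rule. The task counters are assumed inputs for the sketch, not existing attributes of distributed's shuffle extension:

```python
def transfer_progress(completed_transfer_tasks: int, total_transfer_tasks: int) -> float:
    """Fraction of P2P transfer tasks that have already completed."""
    if total_transfer_tasks == 0:
        return 0.0
    return completed_transfer_tasks / total_transfer_tasks


def simple_restart_heuristic(
    workers_at_start: int,
    workers_now: int,
    completed_transfer_tasks: int,
    total_transfer_tasks: int,
) -> bool:
    """Restart if the cluster has at least doubled and less than 25% of transfers are done."""
    doubled = workers_at_start > 0 and workers_now >= 2 * workers_at_start
    early = transfer_progress(completed_transfer_tasks, total_transfer_tasks) < 0.25
    return doubled and early
```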

@hendrikmakait

> Rechunk transfer still below 25% (I actually expect the transfer phase to scale linearly, so we could crank this up to 50% or even further)

"below 25%" as in less than 25% of the transfer tasks have been completed?


fjetter commented Jun 6, 2024

"below 25%" as in less than 25% of the transfer tasks have been completed?

yes
