Restart P2P if cluster significantly grew in size #8673

Open
fjetter opened this issue Jun 5, 2024 · 3 comments
Labels: adaptive, enhancement, performance, shuffle

Comments


fjetter commented Jun 5, 2024

A downside of the new P2P algorithm is that it locks in the set of participating workers as soon as the very first P2P task starts executing on a worker. This is a well-known problem for downscaling clusters (and also when a worker dies) and is currently handled by restarting the entire P2P run.

For upscaling clusters, no such logic is currently implemented. New workers are only allowed to participate in new P2P runs or non-P2P tasks.

This is particularly problematic if one starts a cluster with only one worker (or none) and expects adaptive scaling to handle the rest. The most likely failure in this situation is that the entire P2P operation is pinned to a single worker, which eventually dies with an out-of-disk exception (unless the dataset is small, of course).

In the past we discussed some sophisticated implementations involving ring hashing that would let us resume work, but I would like to explicitly declare that out of scope for the moment and instead pursue a simpler approach.

With the tools available to us, I assume the easiest way to do this would be to restart a P2P operation whenever a certain heuristic is true.

For example: if cluster size increased by X% and P2P transfer progress is below Y%, restart the P2P operation.

This heuristic should capture cases where a restart would let us finish at least as quickly as waiting would.
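
For illustration only, a minimal sketch of such a predicate might look like the following; the function name, the X/Y parameters, and the idea of passing worker counts and transfer progress in directly are assumptions for this sketch, not part of the existing P2P extension:

```python
def should_restart_p2p(
    workers_at_start: int,      # workers participating when the P2P run started
    workers_now: int,           # workers currently available
    transfer_progress: float,   # fraction of completed transfer tasks, in [0, 1]
    growth_threshold: float,    # X: minimum relative cluster growth, e.g. 1.0 for +100%
    progress_threshold: float,  # Y: maximum progress at which a restart still pays off
) -> bool:
    """Return True if restarting the P2P run is expected to finish sooner than waiting."""
    if workers_at_start <= 0:
        return False
    growth = (workers_now - workers_at_start) / workers_at_start
    return growth >= growth_threshold and transfer_progress < progress_threshold
```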


fjetter commented Jun 5, 2024

I think a very simple heuristic like

  • Cluster size has doubled
  • Rechunk transfer still below 25% (I actually expect the transfer phase to scale linearly, so we could crank this up to 50% or even further)

would go a long way toward avoiding the very obvious failure case described here: https://discourse.pangeo.io/t/rechunking-large-data-at-constant-memory-in-dask-experimental/3266/8?u=fjetter where the user starts with one worker and runs into an error.

We can still fine-tune later. When implementing this, we should also think about which metrics would be useful for further tuning; a rough sketch of such a check follows below.
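
As a concrete illustration of these thresholds, the sketch below computes transfer progress as the fraction of completed transfer tasks (see the clarification in the next comment) and applies the doubled-cluster / 25% rule. The task counters are assumed inputs for the sketch, not existing attributes of distributed's shuffle extension:

```python
def transfer_progress(completed_transfer_tasks: int, total_transfer_tasks: int) -> float:
    """Fraction of P2P transfer tasks that have already completed."""
    if total_transfer_tasks == 0:
        return 0.0
    return completed_transfer_tasks / total_transfer_tasks


def simple_restart_heuristic(
    workers_at_start: int,
    workers_now: int,
    completed_transfer_tasks: int,
    total_transfer_tasks: int,
) -> bool:
    """Restart if the cluster has at least doubled and less than 25% of transfers are done."""
    doubled = workers_at_start > 0 and workers_now >= 2 * workers_at_start
    early = transfer_progress(completed_transfer_tasks, total_transfer_tasks) < 0.25
    return doubled and early
```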

@hendrikmakait

> Rechunk transfer still below 25% (I actually expect the transfer phase to scale linearly, so we could crank this up to 50% or even further)

"below 25%" as in less than 25% of the transfer tasks have been completed?


fjetter commented Jun 6, 2024

"below 25%" as in less than 25% of the transfer tasks have been completed?

yes
