Restart P2P if cluster significantly grew in size #8673
Labels: adaptive, enhancement, performance, shuffle
A downside of the new P2P algorithm is that it locks in the set of participating workers as soon as the very first P2P task executes on a worker. This is a well-known problem for downscaling clusters (and also when a worker dies), and it is currently handled by restarting the entire P2P run.
For upscaling clusters there is currently no logic implemented: new workers are only allowed to participate in new P2P runs or in non-P2P tasks.
This is particularly disturbing if one starts a cluster with few workers, or even none or one, and expects adaptivity to handle the scaling. The most likely failure in this situation is that the entire P2P operation is pinned to a single worker, which eventually dies with an out-of-disk exception (unless the dataset is small, of course).
In the past we discussed some sophisticated implementations involving ring hashing that would let us resume work, but I would like to explicitly define this as out of scope for the moment and instead pursue a simpler approach.
With the tools available to us, I would assume that the easiest way to do this would be to restart a P2P operation if a certain heuristic is true.
For example: if the cluster size increased by X% and P2P transfer progress is below Y%, restart the P2P operation (sketched below).
This heuristic should describe cases where we would finish more quickly with a restart than if we waited.
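To make the heuristic concrete, here is a minimal sketch of what such a check could look like. None of these names (`should_restart_p2p`, the threshold constants, the arguments) exist in `distributed` today; they are hypothetical, and the actual thresholds would need benchmarking.

```python
# Hypothetical sketch of the restart heuristic; not part of distributed's API.
# X and Y from the description above are the two threshold constants.

GROWTH_THRESHOLD = 0.25    # X: consider a restart if the cluster grew by >25%
PROGRESS_THRESHOLD = 0.50  # Y: ...but only if <50% of the transfer is done


def should_restart_p2p(
    workers_at_start: int,
    workers_now: int,
    transferred_bytes: int,
    total_bytes: int,
) -> bool:
    """Return True if restarting the P2P run is likely faster than waiting.

    Intuition: a restart throws away the work behind ``transferred_bytes``
    but spreads the remaining work over ``workers_now`` instead of
    ``workers_at_start`` workers. Early in the run the discarded work is
    small, so the restart pays off; late in the run it does not.
    A P2P run always starts with at least one worker, so
    ``workers_at_start`` is assumed to be >= 1.
    """
    growth = (workers_now - workers_at_start) / workers_at_start
    progress = transferred_bytes / total_bytes if total_bytes else 0.0
    return growth > GROWTH_THRESHOLD and progress < PROGRESS_THRESHOLD
```

Such a predicate could, for instance, be evaluated whenever the scheduler observes a new worker joining while a P2P run is in flight, reusing the same restart machinery that already handles the downscaling case.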