Gracefully shutdown by ... #8050
@fjetter from my perspective this is only a nice-to-have. No particular pressure from me. It also only works if we think we can reliably start workers in around 90s (which I think we usually can, but it's worth verifying).
Not entirely opposed to this, but there are a couple of challenges to overcome with the implementation that make this a non-trivial change. Right now, the … IIRC this lock was primarily introduced to guard against races with the old, deprecated … So, either way, we should ensure that retire_workers can be called concurrently... There are a couple of options.
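To make the concurrency requirement concrete, here is a minimal sketch (assuming an asynchronous `Client` connected to the cluster; the worker addresses are placeholders) of the kind of overlapping graceful retirements that would need to be safe:

```python
import asyncio

from distributed import Client


async def retire_two_hosts(client: Client) -> None:
    # Two graceful retirements racing each other; per the comment above,
    # such calls currently contend on a scheduler-side lock.
    await asyncio.gather(
        client.retire_workers(workers=["tcp://10.0.0.1:38000"]),
        client.retire_workers(workers=["tcp://10.0.0.2:38000"]),
    )
```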
I'm mentioning the full replicate on this list because this may actually be the objectively better variant anyhow when we're in a scale-down + scale-up situation, since I suspect that we'll end up with a skewed data distribution regardless of this timeout.
Ah, that explains why I was seeing only singletons downgrade at once. This seems to be a biggish deal with the Spot behavior we're observing. cc'ing @ntabris and @dchudz so that they have context on this limitation. Probably a separate issue, but how hard would it be to batch some of these?
Oh, I'd very happily nuke client-side …
+1
This should not be necessary.
What does the memory look like just before a host dies? Is there any managed memory still there, or has it all been spilled away and it all looks unmanaged?
+1. When and if we reimplement them on top of AMM, retaining compatibility with the current API would be a colossal pain anyway.
I'm not following. The aim of the proposal is (1) to get more useful work out of the instance before it does graceful shutdown and (2) to have a better chance to have replacement instances running before graceful shutdown. Are you saying there's already a way to accomplish that? Or that there's a better way to change things in order to accomplish them? Or that you disagree with those aims?
The aim of this ticket is that healthy workers should not die due to OOM when spot instances are being retired. Squeezing 2 extra minutes of work out of instances that will shut down soon is pointless IMHO. And this is the person who always obsesses about performance talking. What I'm saying is that graceful worker retirement on A must not cause death-by-OOM on B, and that's the problem that must be solved.
That certainly sounds like one problem to be solved. Wouldn't it be extra nice if graceful shutdown could transfer data to fresh workers, though? Or are you thinking that doesn't matter in most cases?
This is basically what happens if the new worker is already up. We're moving the data to the workers with the least data in memory.
The "possibly death by OOM due to graceful downscaling" situation is something unrelated to this specific ticket, isn't it?
Yes, and part of the point (I thought) of delayed worker shutdown is to give the new worker a chance to show up.
Is it? I thought it was the central point of the opening post. Or did I misread this?
I think you misread. It's not about not having sufficient memory but rather about not moving the data to a full/busy worker if there is a fresh one starting in a minute. This is about utilization, avoiding data skew, etc.
I was interested in the termination grace durations that various cloud providers give spot/pre-emptible instances, and concluded that they are 120 seconds for AWS and 30 seconds for GCP and Azure. AWS: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
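For reference, the AWS document linked above describes how a pending interruption is exposed through the instance metadata service. A rough sketch of polling it (assuming IMDSv1 is enabled; with IMDSv2 a session token header would also be required):

```python
import json
import urllib.error
import urllib.request

SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def spot_interruption_notice() -> dict | None:
    """Return the pending spot interruption notice, e.g.
    {"action": "terminate", "time": "..."}, or None if nothing is scheduled."""
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=1) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError:
        return None  # 404: no interruption is currently scheduled
    except urllib.error.URLError:
        return None  # metadata service unreachable (e.g. not on EC2)
```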
I wonder if data could be spilled somewhere in a way that the worker can go away before a new one arrives. Maybe to a peer or to the scheduler, but not put back into active processing until there is a worker that can take it.
Then IMHO this is a Coiled issue, not a distributed one. Coiled should just delay the call to retire_workers.

Pseudocode in Coiled now:

```python
async def aws_said_worker_will_die_in_3_minutes(addr):
    await asyncio.gather(
        scheduler.retire_worker(addr),
        spawn_new_worker(),
    )
```

New pseudocode in Coiled:

```python
async def aws_said_worker_will_die_in_3_minutes(addr):
    try:
        await asyncio.wait_for(spawn_new_worker(), timeout=90)
    finally:
        await scheduler.retire_worker(addr)
```
The current behaviour is to distribute all unique data to peer workers, preferring those with the lowest memory pressure.
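For context, this is the behaviour triggered by graceful retirement from the client side; a minimal sketch (the scheduler and worker addresses are placeholders):

```python
from distributed import Client

client = Client("tcp://scheduler-host:8786")  # placeholder scheduler address

# Gracefully retire one worker: its unique data is first copied to peer
# workers (preferring those with the lowest memory pressure), then it closes.
client.retire_workers(
    workers=["tcp://10.0.0.1:38000"],  # placeholder worker address
    close_workers=True,
)
```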
So, running on Coiled with Spot instances I'm seeing the following behavior:

1. Get lots of spot instances
2. Do lots of work
3. In the middle of that work, see machines start to go away
4. Dask handles this nicely, draining the machines onto their peers and then clearing them swiftly, well within the two-minute window
5. More machines (pretty much all of them) go away, to the point where we lose data
6. New on-demand machines come up thirty seconds later and restart the work
In this situation I wish that the spot machines had actually decided to hang around for as long as they thought was possible before offloading their data. We have two minutes. New machines take about a minute to show up. In many cases data transfer happens in 10s or so.
So, a proposed policy on graceful worker shutdown…
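To make the timing argument concrete, here is a hypothetical sketch of such a policy; the helper, parameter names, and numbers are illustrative placeholders rather than an existing API, and it assumes direct access to the scheduler object:

```python
import asyncio


async def wait_for_new_worker(scheduler) -> None:
    """Placeholder: return once a worker that was not known at call time joins."""
    known = set(scheduler.workers)
    while not set(scheduler.workers) - known:
        await asyncio.sleep(1)


async def retire_when_replacement_arrives(
    scheduler, addr: str, *, grace: int = 120, transfer_budget: int = 30
) -> None:
    """On a termination notice for `addr`, wait for a replacement worker to
    join, but keep `transfer_budget` seconds of the `grace` window free for
    the data transfer itself; if none shows up in time, retire anyway."""
    try:
        await asyncio.wait_for(
            wait_for_new_worker(scheduler), timeout=grace - transfer_budget
        )
    except asyncio.TimeoutError:
        pass  # no replacement arrived in time
    await scheduler.retire_workers(workers=[addr])
```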