-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement improved draining #1191
Comments
Options for graceful shutdown: (1) Add a new ShutdownStarting message to ZDS. Upon this, send GOAWAYs to clients. Mixed: draining happens immediately instead of at the end of the pod. This probably doesn't matter; this type of backpressure is more useful when the client's are not mesh-aware. HBONE clients are, generally. This does mean we close pooled connections a bit faster though, which is maybe a small win... unless we legitimately would use the pooled connections. (2) Move DelWorkload hook to a CNI DEL cmd. This ensures we run after the application shuts down, but before the network is destroyed (I guess DEL runs in reverse order? and we are always the last plugin). We can then drain and ACK once we are done, then pod shutdown can complete. One nice thing is this shutdown happens out of band to the pod -- the pod is complete removed while this is taking place. So users won't see "pod is taking a while to shutdown and causing issues"; its invisible. Obviously this shouldn't be slow, though. Pro: "perfect" timing to do last-mile shutdown sequences (3) Workaround on client side: when we see a pod is deleted (via WDS), remove any pooled connections to it Pros: effectively allows us to drop pooled connections. (4) Do nothing. Client's will eventually have keepalives timeout and drop the HBONE connections. Pros: simple |
Yeah - in general I think the following are true:
I think the latter two scenarios are ~roughly the same thing, and probably require some kind of ACK to the node agent, so it can know when the entire "remove pod" flow is actually done. Otherwise I think the state might get a bit ambiguous? Probably most of this will be answered in impl tho. |
I have a slight pref for (1) I think. I am wary of getting too hooked into the CNI plugin lifecycle, and have some preference for being opportunistic with pod cleanup/shutdown rather than trying to manage it or stall it, since at the end of the day we don't control that, and it's not our domain.
Yeah, I think this is the kind of thing that gives me pause. It's harder to explain to operators, it's not terribly intuitive, and corner cases are hard to reason about.
Note that for (2) we could also maybe use GC: https://github.com/containernetworking/cni/blob/main/SPEC.md#gc-clean-up-any-stale-resources |
I am not sure GC works since we need to do our cleanup before the rest of the CNI does (else we cannot send GOAWAY since the veth is gone).
Its not about the client apps. If we don't goaway, we have a few issues:
|
Can we safely factor out the use of keepalives for HBONE connections entirely if we do this, or would we still need them as a backup/failsafe to avoid keeping around stale conns if there are unexpected disruptions (I think this is the current state)? If we can safely in all cases drop keepalives if we do this, it's worth doing. If we still need keepalives as a Plan B no matter what, then to me it feels less so. |
I think keepalives are a good general practice. There is no guarantee a TCP connection stays live -- there can ALWAYS be something that non-gracefully terminates. But they also shouldn't be used as the primary way to close a connection |
Yeah that's part of the spec so we can rely on that. |
Conceptually and hygienically I agree with that, but in this specific case where the thing we care about protecting (the workload/client app) is already isolated from this, I don't know if practically speaking it makes a difference unless there are things we (that is, ztunnel) need to clean up that we can't clean up in other ways. |
I don't think we should let these decisions tie to whether we use keepalives. They are either useful or not to detect non-graceful broken connections, and that doesn't really change just because we remove one way a connection can be broken non-graceful broken connections; there are inherently always going to be these. I am not an expert here but it seems like its pretty low-risk, medium-reward to have so worth keeping |
One thing I am concerned with (1) is, consider a case where I have With (1), once pod-b starts to terminate, we are going to open up new HBONE connections for each new app connection between pod-a and pod-b. The end result, from users POV, is increased CPU/latency when the pod is shutting down. We do similar in sidecars, but it has a purpose: it tells the application to retry to another backend that is not shutting down. There is really no backpressure mechanism in hbone. The only thing we need to is to tell the client to drop their pooled connections gracefully. That is better done at the last minute, which (2) can provide |
Right - in both cases, all the connections the user app ( Do we know that that's significant? It's probably worse for same-node. This is also something we could likely entirely mitigate practically with a relatively simple clientside backoff in the HBONE pool, which wouldn't be the worst idea to have regardless (I guess that's option 3)
In theory, this could still be the case tho here? Even in (1) we are backpressuring to the app, ultimately, which may have its own retry logic, which may connect to The only benefit of (2) that I can see is we bubble up the backpressure to the app a bit faster, but again an HBONE pool clientside backoff would probably solve this effectively as well, and would be a good idea generally. |
No, that is not the case. Maybe I am not explaining the scenario properly. Pod B starts terminating, but is not terminated yet. I.e. delationTimestamp is non-null, but pod phase is running. During this time, per our rules, Pod B MUST continue to accept traffic. In both cases, the traffic will succeed. In (1), each connection will open a new HBONE connection. In (2), we will pool like normal. Maybe I have recency bias after exploring istio/istio#51855, where similar logic caused the app to disable pooling which caused effectively a DOS. (note: issue is not about ambient, just same idea). |
Ah ok. It's a bit of a chicken-or-egg problem. And yeah - we probably should just take no action at all on "TERM_START". We have to proxy traffic until the last possible second, or we will inevitably negatively interact with $WHATEVER_POOLING_OR_SHUTDOWN_LOGIC a given client/server want to perform. A combination of 3 and 4 is probably still the simplest and least likely to interfere there - keepalives (which we already have) + clientside pool backoff (which we arguably want anyway, since we could still have the same kind of backpressure problem if the dest ztunnel is down but the dest apps aren't) on consecutive failures. We proxy everything we can until the last possible moment (when the dest process terminates), and the instant we can't, we bubble the backpressure up to the client app (which does whatever it wants). |
One thing to clarify is GOAWAY doesn't close the inner connections. It just means once there no more active inner connections, the out connection will be dropped instead of saved for pooling. So it is viable to send GOAWAY before the last second, it just may not be ideal. |
Related:
There are a few shutdown sequence we need to handle
We already get the
delworkload
message when the pod is done, not when it starts exiting. We need to shutdown immediately, not do a long drain.Ideally, we also have a graceful shutdown. That is, when the pod is gone, clients know to close their connections. For TCP, this is no problem - when the connection is dropped they should get a RST (TODO: verify this happens, since the veth is ripped out from us!).
For HBONE, we want to send a GOAWAY and tls close_alert to gracefully shutdown. We cannot do this after the pod shuts down, since the CNI will remove the veth!
This is on a per-pod basis
In this case, we want to have a long drain period to allow new connections to gracefully connect to the new ztunnel, and let existing connections naturally die out.
The text was updated successfully, but these errors were encountered: