Restoring etcd without breaking API guarantees, aka etcd restore for Kubernetes #16028
Offline compaction

Offline compaction makes sense to me. It's useful to perform compaction + defragmentation before we restore a snapshot. We may need to handle it from both the etcd and bbolt perspectives. Let me take care of it, but it may take some time, because I will be away on personal matters starting next Tuesday for 1–2 weeks. I propose to add a separate command for it.

Offline bumping revision

I am not against offline revision bumping, but please see my concern/comment in #15381 (comment). I would like to get more feedback before we move on.
I think we already have
Maybe before we go into the debate on what we need to implement for compaction, do we even need to compact?
As @tjungblu said, nowhere did I mention that I want to run offline compaction. I just want to mark revisions as compacted so they are inaccessible by clients. For large db files it's preferable that we can do that without doing any actual work. Please don't agree to work on the task and then say that you don't agree with the design and change it. I don't think you understood the design: we don't want to perform offline compaction, and we don't care about offline defragmentation; we just want to make restore work. What I mean by "work" is for etcd to uphold https://etcd.io/docs/v3.6/learning/api_guarantees/, meaning revisions are unique and ordered.
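To make the client-visible effect of "marking revisions as compacted" concrete, here is a minimal sketch (not part of the proposal itself; the endpoint, key, and revision are hypothetical): a ranged read at a revision at or below the compaction mark should be rejected rather than served from a rewound history.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Hypothetical endpoint of the restored cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Hypothetical revision that a client observed before the restore.
	// If the restore marked everything up to the bumped revision as compacted,
	// this historical read should fail with a "required revision has been
	// compacted" style error instead of returning data from an alternate history.
	const oldRev = 100
	if _, err := cli.Get(ctx, "foo", clientv3.WithRev(oldRev)); err != nil {
		fmt.Println("historical read rejected:", err)
	}
}
```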
(moving your arguments over from #15381 (comment) so we don't split the discussion into too many places)
Agreed, what @dusk125 wrote so far just adds a single bump transaction on top of the snapshot data. So we're neither modifying the snapshot you restore from (important for integrity checks), nor are we surgically hacking around the binary files. I think Marek's approach goes even a step further, by just adding "+x" to the next transaction that's executed; I'm not sure this is enough. Maybe @dusk125 has some more input on what worked and what didn't; I believe we tried that before.
We need to differentiate on what's actually down: quorum loss definitely means the k8s API is not writable and no linearizable reads are possible. Workloads on the worker nodes can still continue happily on their own. Operators/controllers (also consider third-party components you installed with a Helm chart) might be impacted, depending on what they're doing with the API. The restart definitely works functionally, but I think there is a window of time between you restoring to
Whatever we bring up from a restored snapshot must be guaranteed far enough into the future that we don't create that alternate history. There must be data loss by definition, but that should be reconciled correctly by the operators (:crossed_fingers:).
Could you please elaborate on this?
In the work I had been doing, I did not modify the snapshot, nor do I think we should modify it in whatever path we take. If, for whatever reason, the restore fails, the user could remove the proposed flags and go through the normal restore and restart their workload; we wouldn't be able to do this if the snapshot file had been changed. My testing was focused on getting a watcher to resync automatically as if the key it's watching changed, without it actually needing to change. The watcher would reconnect to the restored etcd, get a notification that the value was modified, and thus get its "new" value and hopefully continue operating as normal. If the revision you're watching is shown as compacted, then the watch fails. If the watcher contract is to invalidate and re-watch when a watch fails due to a compacted revision, then I think @serathius's suggestion of marking everything as compacted after restore (not touching the snapshot file) and adding a single dummy revision on top of everything could work; I can start testing this out.
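A rough sketch of the client-side flow being described, assuming a hypothetical prefix, endpoint, and last-seen revision: when the restored server reports the watch's start revision as compacted, the watcher is expected to drop its cache, re-list, and re-watch.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Hypothetical revision the watcher had seen before the restore.
	lastSeenRev := int64(12345)

	wch := cli.Watch(context.Background(), "registry/",
		clientv3.WithPrefix(), clientv3.WithRev(lastSeenRev+1))
	for resp := range wch {
		if resp.CompactRevision != 0 {
			// The start revision was compacted away (which is what marking the
			// restored range as compacted produces). A well-behaved watcher
			// invalidates its cache, re-lists, and re-watches from the
			// revision returned by the fresh list.
			fmt.Printf("watch canceled, compacted at %d: %v\n", resp.CompactRevision, resp.Err())
			return
		}
		for _, ev := range resp.Events {
			fmt.Printf("%s %q -> %q (mod rev %d)\n",
				ev.Type, ev.Kv.Key, ev.Kv.Value, ev.Kv.ModRevision)
		}
	}
}
```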
Hi @serathius @dusk125 @tjungblu, this is a great proposal. Could you please share how invalidating the client cache (operator/controller) when a watch fails on a compacted higher revision will work, or how it works conceptually as of today? I think that's the fundamental assumption (theory) that drives this design, and clarifying it should clear up some confusion in this thread. (Or should we move this conversation to the k/k issue?)
Thanks for all the discussion in the meantime; I think I'm also a little further along in my understanding and will try to summarize in more detailed pieces.

Let's look at this from a K8s API point of view first. We want to ensure all operators/controllers consuming the watchlist API will receive a signal to invalidate their caches. Working backwards, from the official docs we can find that a 410 should be returned to signal a fresh listing from the start, which in turn is always triggered by our own etcd return of

From the etcd point of view, we want to ensure that we have a high enough revision to start from after restoring from a snapshot, before we serve any data to the outside world. Yet we also need to ensure that we don't serve any history between the backup revision (BR) and the restored revision (NR, basically BR + a big enough integer). Conveniently, this is exactly what compaction does. For that, do we really need to run a full compaction, or is it enough to mark a revision range as already compacted? I think there are multiple approaches here; one that's hacky would be to use

Rewinding a little, how do we establish the revision in the snapshot? Similar to Allen's approach, we discussed O(n) looping through each key in the snapshot to establish which revision is the highest. If you look into kvstore.go, it already does that. I think there were some reservations around the performance impact of its linear runtime. Since creating a snapshot already requires iteration, we would also propose caching this information in the snapshot proto as additional/trailing metadata. When contained in a snapshot, it would allow us to skip iterating the whole key space when restoring.
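For the "establish the highest revision in the snapshot" step, here is a rough standalone sketch of the O(n) scan being discussed, assuming the bbolt layout used by etcd's mvcc backend (a "key" bucket whose keys begin with the 8-byte big-endian main revision); the file path is hypothetical and this is not the actual kvstore.go code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Hypothetical path to the snapshot's bbolt file.
	db, err := bolt.Open("snapshot.db", 0400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var maxRev int64
	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("key"))
		if b == nil {
			return fmt.Errorf(`bucket "key" not found`)
		}
		return b.ForEach(func(k, _ []byte) error {
			if len(k) < 8 {
				return nil
			}
			// mvcc encodes the main revision in the first 8 bytes of the key
			// (big endian), followed by "_" and the sub revision.
			if rev := int64(binary.BigEndian.Uint64(k[:8])); rev > maxRev {
				maxRev = rev
			}
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("highest main revision found:", maxRev)
}
```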
I would be happy to take this issue on; could you please assign it to me?
Thanks all for the explanation, but it seems that we are still playing with a not-well-explained concept "
Yes, indeed there might be a gap in between. If users request a revision in the gap, then they get nothing. Users may be confused because etcd's revisions are supposed to be continuous. So far I do not see any feasible & good approach other than a complete offline compaction just as I mentioned in #16028 (comment) |
Fair point, I think this is just not practically possible because we don't know what the very latest revision was in a disaster recovery scenario. If all you have is a backup snapshot from two days ago and everything else is somehow lost, how do you find out the most recently served revision?
We should settle on some non-functional requirements for a restoration. Marek made a good point about the process being as fast and reliable as possible, to avoid spooking admins/SREs during an already stressful task. Waiting for a full offline compaction, which also might fail, on big snapshots could take some time. From my experience, starting a big etcd from disk (say, close to our 8 GiB limit) until it is ready to serve the first byte of client requests (TTFB, if you want to call it that) is on the order of several minutes. Does anyone have some reliable numbers from their cloud fleet they can share? Sadly, I don't have this metric in our telemetry.
Just had a quick discussion with @serathius. There are two things.
I've been able to verify that marking a revision as compacted: The following is from running the server after restoring with my most recent changes in my draft PR with
From an etcd cluster administrator's perspective, I would like the etcdctl snapshot restore flag to take the desired target revision directly: I don't want to calculate how much I should bump the revision since the last etcd snapshot; I just need to supply the desired resource version to be compacted, based on observations in the apiserver logs. As a mitigation, suffixing the snapshot db file name with the revision and doing a calculation afterwards also works; it's just not as convenient as supplying the desired value directly.
That is a good point. The concern is that you might not always be able to know the desired target revision, but you can always provide a rough, big-enough bump amount. As long as we enhance the etcdutl tool to report the current highest revision, we can easily calculate the bump amount (desired revision minus current highest revision). It doesn't seem like a big problem.
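As a worked example of that calculation (all numbers hypothetical, purely for illustration):

```go
package main

import "fmt"

func main() {
	// Hypothetical values: the highest revision found in the snapshot, and the
	// target resource version the administrator observed in apiserver logs.
	currentHighestRev := int64(5_000_000)
	desiredRev := int64(6_000_000_000)

	// The amount to bump by so the restored cluster resumes above the desired
	// revision; everything at or below it would then be marked as compacted.
	bumpAmount := desiredRev - currentHighestRev
	fmt.Println("bump amount:", bumpAmount) // 5995000000
}
```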
Has this task been completed?
Yes, the remaining work should be on the Kubernetes side. @dusk125 @tjungblu, would you be interested in finishing kubernetes/kubernetes#118501?
What would you like to be added?
This is the etcd-side issue for kubernetes/kubernetes#118501.
tl;dr: The etcd restore operation breaks the existing etcd guarantee that revisions never decrease.
Proposal
Tracking work
- Add a --bump-revision flag to the etcdutl snapshot restore operation.
- Add a --mark-compacted flag to the etcdutl snapshot restore operation.
- Verify that the value passed to --bump-revision is high enough.

Why is this needed?
With zero official guidance, each vendor/administrator had a different plan for etcd restore, many of which were broken or incorrect. Having an official, guided restore operation allows the whole community to work together and share their experience, leading to improvements in Kubernetes disaster recovery handling.