Inconsistent revision and data occurs #13766
Comments
cc @ptabor |
Was this cluster originally created as 3.5.1, or was it migrated from 3.4.x or 3.3.x? |
Hey @liuycsd, can you provide the etcd configuration you are running? Specifically, it would be useful to know if you use |
IIRC, it was originally created as 3.5.1. |
I didn't set |
Checking the running server, current config is: |
Hi @liuycsd, thanks for reporting this issue.
Looks like the data inconsistency triggers are
Can you help us better understand how your cluster is set up?
Kindly suggest:
|
I suggest you try 3.5.2 and see if this problem can still happen. There was a data corruption issue fixed recently. |
Thanks. That's just a simple environment for testing, with limited resources. No monitoring or backup setup. I tried to reproduce it in another 3.5.1 cluster, but it didn't occur. I set a low CPU and disk quota to simulate an overloaded system so that some leader elections were triggered randomly, while keeping several etcdctl compare/put txns running, but the revision inconsistency didn't occur. I may try v3.5.2 and set up some basic monitoring and backups later to see whether or not it occurs again.
Are there any suggestions about what duration I should use for the corrupt-check-time, and how can I know when it detects any corruption? I didn't find detailed documentation about this. |
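(As an aside, one way to approximate that kind of resource pressure on a single member is with systemd resource controls. The sketch below is purely illustrative: the unit name, quotas, binary path, and data dir are assumptions, and the IO properties require a cgroup-v2 setup.)

```bash
# Run one member under tight CPU/memory/IO limits so it lags behind under load
# and triggers elections. All values below are made-up examples.
sudo systemd-run --unit=etcd-constrained \
  -p CPUQuota=20% \
  -p MemoryMax=512M \
  -p IOWriteBandwidthMax="/var/lib/etcd 5M" \
  /usr/local/bin/etcd --name infra1 --data-dir /var/lib/etcd
```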
Sorry for the accidental close |
Thanks @chaochn47 @wilsonwang371 and everyone else. After checking the log again, I found a panic did occur on node |
Can you share the stacktraces? |
I don't have access to the cluster right now. I may paste it when it's available later. |
can you check if #13505 is the root cause? |
How could I verify that? The stacktraces are similar.
On node
|
Do you have any update on this? |
There is some performance overhead expected from running it; I would recommend running it every couple of hours. |
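(For reference, a minimal sketch of what that looks like; the interval and endpoint are illustrative. A detected hash mismatch raises a CORRUPT alarm, which is visible via `etcdctl alarm list` and in the server logs.)

```bash
# Enable the periodic corruption check on each member (interval is illustrative).
etcd --name infra1 --data-dir /var/lib/etcd \
  --experimental-corrupt-check-time=4h

# Check whether the corruption check has raised an alarm on any member.
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 alarm list
```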
I'm still investigating the data corruption issues that were reported for v3.5 and were not reproduced. Based on this issue I have been looking into whether #13505 could be the cause of them. I would like to confirm that we can reproduce data corruption on v3.5.0 by triggering the case from #13505, to then verify that v3.5.2 has fixed the data corruption. However, based on my discussion with @ptabor, we don't see how a mis-referenced bbolt page could cause the corruption. It could still be caused by a generic segfault, however that makes it even harder to reproduce. As discussed in the last community meeting this problem was found by ByteDance, @wilsonwang371 could you shed some light on how this issue was detected and whether it was accompanied by data corruption? |
We found data inconsistency in our internal etcd cluster. This was found in 3.4, but it has been happening more often since we started using 3.5, which has better performance. When this happens, a solution is to take down the bad node for some time and bring it back again later. This forces a data resync inside the cluster and the problem goes away. However, when this happens, no alert or special message is given, which makes it hard for the SRE team to detect. Luckily, they were able to dump the log and I found there was a panic within the time range where we found the issue. Other than that, no other error messages were discovered. The content of the panic is very similar to the one described in this thread: basically, somewhere in kvsToEvents(), a panic was triggered. With some investigation, we believed the panic came from an error either in reading boltdb or in reading memory. From my point of view, it should be a data corruption inside boltdb. Later in #13505, we found that boltdb was accessed without lock protection. Without a proper lock held, we can corrupt boltdb data, and this can definitely trigger data corruption in etcd. That's why I believe this patch should be able to fix this data corruption issue. |
We encountered this corruption in one kube etcd cluster running etcd 3.5.1 yesterday morning and then upgraded all of our clusters to 3.5.2 today because we were going to be doing a rolling-restart of the kube api servers, which causes dangerously high memory spikes on etcd. We faced etcd OOMs in two of our clusters when performing this rolling update. Both of those etcd clusters were running 3.5.2 and both experienced this corruption. The kube symptom of this is that querying various objects returned different results, probably depending on which etcd server was hit. This resulted in errors when listing objects like:
and cases where querying the same object multiple times in a row alternated between a recent state and one from 10+ minutes before. On the etcd side, the most obvious symptom was the wildly different database sizes between members.
So, I can confirm that corruption problems definitely still exist in 3.5.2, and that the experimental corruption check flag may need some more work to detect it. The servers in the first two cases were
Please let me know if any additional information would be valuable for you here. |
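(For anyone else hitting this, the revision/db-size divergence described above can be seen from the client side; a sketch with an illustrative endpoint:)

```bash
# Compare revision, raft index and db size of every member in one table.
# Endpoint and any TLS flags are illustrative; adjust for your cluster.
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-1.example.com:2379 \
  endpoint status --cluster -w table
```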
Thanks @PaulFurtado for sharing, this is hugely informative. There are 3 important pieces of information:
I will look into reproducing this myself. A couple of questions about the corruption check: it is possible that the corruption check ran but didn't report an issue due to high WAL latency of the followers. To get more details about why it's not working, can you:
It would also be good to know if you have seen the same problem with v3.4.X. If you have an environment for reproducing the issue, it would be good to confirm that the problem only affects v3.5. It would also be useful to rule out that the problem is with any optional v3.5 feature; can you provide the etcd configuration you use? |
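(One way to check the hashes by hand, independent of the periodic check, is `etcdctl endpoint hashkv`; the endpoint below is illustrative, and note that members at different revisions will legitimately report different hashes, so the comparison is only meaningful when members are caught up.)

```bash
# Report the KV-store hash of every member; diverging hashes while revisions
# match indicate real data inconsistency. Endpoint is illustrative.
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-1.example.com:2379 \
  endpoint hashkv --cluster -w table
```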
@serathius sorry, I had mentioned the
We have never experienced this issue with 3.4.x (or if we did, it must have been extremely subtle and never caused us pain). Under 3.4.x we experienced OOM kills constantly and faced so many outages due to OOMs that I would expect we definitely would have noticed it on 3.4.x. We have been running on 3.5.x for a bit over a month now, and this is the first time we hit an OOM condition and we immediately hit this corruption, so it definitely seems like a 3.5 issue. These are the exact flags from one of the
I also doubt that it's relevant, but we run with these env vars as well:
I will also mention that we are running kubernetes 1.21 and we are aware that it is intended to talk to an etcd 3.4.x cluster. However, we upgraded to etcd 3.5 because it is impossible to run etcd with stable latency and memory usage on 3.4.x with kube clusters of our size. Is it possible that something about the 3.4.x client could be corrupting the 3.5.x server? |
I am not able to reproduce it with a 3-node distributed etcd cluster backing a kubernetes
The reproduction setup generates load from a standby node using benchmark put, range and txn-mixed. The 3 nodes keep getting oom_reaped and recovering from the local disk, and the revision is monotonically increasing across the cluster. The revision divergence is minimal and slow nodes catch up eventually.
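(For context, the load-generation side of a setup like this can be driven with etcd's benchmark tool from tools/benchmark; the endpoint and sizes below are illustrative.)

```bash
# Sustained write load from a standby node; tune clients/conns/total to the hardware.
benchmark put --endpoints=http://etcd-1.example.com:2379 \
  --clients=100 --conns=100 --total=1000000 \
  --key-size=32 --val-size=256 --key-space-size=10000

# Read load in a second shell (the key is arbitrary; ranges on missing keys
# still exercise the read path).
benchmark range foo --endpoints=http://etcd-1.example.com:2379 \
  --clients=100 --conns=100 --total=1000000
```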
Thanks! On my end, I will try to rerun the reproduction process with etcd cluster upgrading/downgrading between 3.5 and 3.4. |
I think I reproduced the issue; it was surprisingly easy once I knew where to look. Based on @PaulFurtado's comment I looked into simulating a highly stressed etcd and sending a SIGKILL signal to members one by one. When looking into our functional tests I found it very strange that we don't already have such a test. We have tests with SIGTERM_* and SIGQUIT_AND_REMOVE_FOLLOWER, however we just don't test whether the database is correctly restored after an unrecoverable error. I added new tests (SIGKILL_FOLLOWER, SIGKILL_LEADER) and increased the stress-qps. This was enough to cause data inconsistency. As functional tests run with
To make results repeatable I modified the functional tests to inject the failure repeatedly for some time. I managed to get a 100% chance of reproduction for both test scenarios with 8000 qps within 1 minute of running. The issue only seems to happen in higher-qps scenarios, with a lower chance of reproduction at 4000 qps and no reproductions at 2000 qps. With that, I used the same method to test the v3.4.18 release. I didn't manage to get any corruption even when running for 10 minutes with 8000 qps. I didn't test with higher qps as this is the limit of my workstation, however this should be enough to confirm that the issue only affects v3.5.X. I'm sharing the code I used for reproduction here: #13838. I will look into root-causing the data inconsistency first, and later redesign our functional tests as they don't seem to be fulfilling their function. |
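(Outside the functional test framework, the crash-under-load pattern can be approximated by hand; a rough sketch, assuming the member runs under a supervisor such as systemd with automatic restart, with the process match string made up for illustration.)

```bash
# While write load is running, repeatedly hard-kill one member and let the
# supervisor bring it back from its data dir; then compare revisions/hashes
# across members with `etcdctl endpoint status` / `endpoint hashkv`.
while true; do
  pkill -9 -f 'etcd --name infra2' || true
  sleep 30
done
```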
NOTE: This new version contains the fix for etcd-io/etcd#13766 so we no longer need the `--experimental-initial-corrupt-check=true` flag
Does this bug impact v3.4.x? |
I'm running a 3-member etcd cluster in a testing environment,
and a k8s cluster with each kube-apiserver connecting to one etcd server via localhost.
etcd cluster: created and running with v3.5.1
k8s cluster: v1.22
I found a data inconsistency several days ago:

- Some keys exist on nodes `x` and `y` but not on node `z`, and some on `z` but not on the other two, e.g. different pod lists would be returned from different servers.
- Some keys have different values between `z` and the others, and can be updated to another different value via the corresponding etcd endpoint, e.g. the kube-system/kube-controller-manager lease points to different pods on different servers and both pods can successfully update the lease via their corresponding local api-server and etcd.
- Other keys, including some newly created ones, are consistent.

Checking with `etcdctl endpoint status`, `raft_term`, `leader`, `raftIndex` and `raftAppliedIndex` are the same, but `revision` and `dbSize` are not: `revision` on node `z` is smaller than on the others (~700000 diff) and both revisions are increasing.

Checking with `etcd-dump-logs`, it seems all 3 nodes are receiving the same raft logs.

I'm not sure how and when this happened, but the nodes might have been under load and may have run out of memory or disk space at times.

Checking and comparing db keys with `bbolt keys SNAPSHOT key`, and searching the node operating system logs for the revisions near where the difference starts, I found some slow-read logs mentioning those revisions and dozens of leader elections and switches during those several hours. Besides, the disk space (including the etcd data dir) of node `z` may have been full and the operating system logs were cleared, so I don't know exactly what happened and I'm not sure whether this relates to the issue or not.
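(For anyone wanting to run the same offline comparison, a rough sketch of the commands involved; the paths are illustrative, `key` is the bucket holding the keyspace in an etcd backend db, and the db should only be inspected from a copy or while the member is stopped.)

```bash
# Dump the keys from each member's backend db and diff them between members.
bbolt keys /var/lib/etcd/member/snap/db key > /tmp/keys-x.txt
diff /tmp/keys-x.txt /tmp/keys-z.txt

# Dump the raft log entries to confirm all members received the same proposals.
etcd-dump-logs /var/lib/etcd > /tmp/raft-x.txt
```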