Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft/rafttest: TestPause failed #132992

Closed
cockroach-teamcity opened this issue Oct 19, 2024 · 6 comments · Fixed by #133270
Closed

raft/rafttest: TestPause failed #132992

cockroach-teamcity opened this issue Oct 19, 2024 · 6 comments · Fixed by #133270
Assignees
Labels
branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Oct 19, 2024

raft/rafttest.TestPause failed on release-24.3 @ 4cbedefd790c75cb0f21f77ed8d917c8528a7d15:

raft2024/10/19 04:19:49 INFO: 3 received MsgVoteResp from 2 at term 2
raft2024/10/19 04:19:49 INFO: 3 has received 3 MsgVoteResp votes and 2 vote rejections
raft2024/10/19 04:19:49 INFO: 3 became leader at term 2
raft2024/10/19 04:19:49 INFO: 3 [logterm: 2, index: 6, vote: 3] ignored MsgVote from 4 [logterm: 1, index: 5] at term 2: supporting fortified leader 3 at epoch 1
raft2024/10/19 04:19:49 INFO: 2 [term: 2] received a MsgVote message with higher term from 4 [term: 3]
raft2024/10/19 04:19:49 INFO: 2 became follower at term 3
raft2024/10/19 04:19:49 INFO: 2 [logterm: 1, index: 5, vote: 0] cast MsgVote for 4 [logterm: 1, index: 5] at term 3
raft2024/10/19 04:19:49 INFO: 1 [term: 2] received a MsgVote message with higher term from 4 [term: 3]
raft2024/10/19 04:19:49 INFO: 1 became follower at term 3
raft2024/10/19 04:19:49 INFO: 1 [logterm: 1, index: 5, vote: 0] cast MsgVote for 4 [logterm: 1, index: 5] at term 3
raft2024/10/19 04:19:49 INFO: 1 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 2 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 1 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 4 received MsgVoteResp from 5 at term 3
raft2024/10/19 04:19:49 INFO: 4 has received 2 MsgVoteResp votes and 0 vote rejections
raft2024/10/19 04:19:49 INFO: 4 received MsgVoteResp from 1 at term 3
raft2024/10/19 04:19:49 INFO: 4 has received 3 MsgVoteResp votes and 0 vote rejections
raft2024/10/19 04:19:49 INFO: 4 became leader at term 3
raft2024/10/19 04:19:49 INFO: 4 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 5 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 5 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 1 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 4 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 1 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 3 [term: 2] received a MsgFortifyLeader message with higher term from 4 [term: 3]
raft2024/10/19 04:19:49 INFO: 3 became follower at term 3
raft2024/10/19 04:19:49 INFO: 2 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 5 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 2 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 2 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 4 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 5 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 4 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 2 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 4 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 2 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 1 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 5 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: found conflict at index 6 [existing term: 2, conflicting term: 3]
raft2024/10/19 04:19:49 INFO: replace the unstable entries from index 6
raft2024/10/19 04:19:49 INFO: 5 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 4 [term: 3] ignored a MsgFortifyLeader message with lower term from 3 [term: 2]
raft2024/10/19 04:19:49 INFO: 1 [term: 3] ignored a MsgApp message with lower term from 3 [term: 2]
    node_test.go:126: commits failed to converge!
I241019 04:19:54.853300 347 (gostd) node.go:162  [-] 13  raft.1: stop
I241019 04:19:54.853411 349 (gostd) node.go:162  [-] 14  raft.2: stop
I241019 04:19:54.853481 351 (gostd) node.go:162  [-] 15  raft.3: stop
I241019 04:19:54.853522 353 (gostd) node.go:162  [-] 16  raft.4: stop
I241019 04:19:54.853565 355 (gostd) node.go:162  [-] 17  raft.5: stop
--- FAIL: TestPause (5.38s)

Parameters:

  • attempt=1
  • run=25
  • shard=1
Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-43390

@cockroach-teamcity cockroach-teamcity added branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Oct 19, 2024
@miraradeva
Copy link
Contributor

Duplicate of #132205.

This test expects a stable leader, and we see from the logs that the loader changed:

raft2024/10/19 04:19:49 INFO: 3 became leader at term 2
...
raft2024/10/19 04:19:49 INFO: 4 became leader at term 3

It's similar in flavor to #121745 (comment), which is fixed. There might be a slightly different reason for the leadership change here, but it's most likely another variant of the waitLeader flakeyness described in etcd-io/raft#195.

Assigning P3 and removing the release blocker label, similar to #121745.

@miraradeva miraradeva added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. P-3 Issues/test failures with no fix SLA and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 21, 2024
@iskettaneh
Copy link
Contributor

iskettaneh commented Oct 23, 2024

+1 to what @miraradeva said. The test basically proposes many entries to different nodes, and assumes that there was a stable leader, so therefore all entries that were proposed got committed, which is not true if there are leadership changes.

Some suggestions to make the test more stable:

  1. Follow-up on rafttest: reduce waitLeader flakiness etcd-io/raft#195 where we consider a leader in the test to be stable if it has the highest term in the cluster. This should work for tests that don't have partitions, and should reduce the chances of unstable leader.

  2. Increase the electionTimeout: Right now, the nodes in the tests call elections if they don't hear from a leader for [10,20)ms. This is a short duration where multiple nodes could end up calling for elections at the same time. Maybe we can increase it to [100,200)ms?

@iskettaneh
Copy link
Contributor

I am able to reproduce the failure by running:

./dev test pkg/raft/rafttest -f TestPause --stress --count=10000 --deadlock

iskettaneh added a commit to iskettaneh/cockroach that referenced this issue Oct 23, 2024
We sporadically see that some raft node_test tests fail due to the
leader not being stable. This commit should reduce the chances of that
happening by increasing the election timeout to 50ms (instead of 10ms).

I couldn't reproduce the bug locally with this change.

If the bug still happens, we can try to force leadership to make it
more deterministic.

Fixes: cockroachdb#132992

Release note: None
@iskettaneh
Copy link
Contributor

I took a step back when I noticed that we have multiple of similar problems that started showing up in the last 2 weeks. I was able to bisect the issue to this: d3f4d01#diff-2a745fc78de353dd2e58c332ca186b00e887630d719ba3831fd6282633306acb

@iskettaneh
Copy link
Contributor

Talked offline with @arulajmani, the culprit commit didn't do anything weird. However, it introduced a mutex that might have helped show this bug more often due to timing issues

craig bot pushed a commit that referenced this issue Oct 23, 2024
119035: sql, sem, opt, explain, memo, kv: audit bit-flag-checking helpers r=DrewKimball,mgartner a=michae2

`@mgartner's` [comment](#118931 (review)) on #118931 inspired me to audit all the other helper functions that check whether a flag is set in a bitfield. I might have found a couple bugs.

See individual commits for details.

Epic: None
Release note: None

133270: raft: deflake non-determinisctic raft node tests r=iskettaneh a=iskettaneh

We sporadically see that some raft node_test tests fail due to the leader not being stable. This commit should reduce the chances of that happening by increasing the election timeout to 250ms (instead of 50ms).

I couldn't reproduce the bug locally with this change.

If the bug still happens, we can try to force leadership to make it more deterministic.

Fixes: #132992, #131676, #132205, #133048

Release note: None

Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Ibrahim Kettaneh <[email protected]>
@craig craig bot closed this as completed in f42e023 Oct 23, 2024
Copy link

blathers-crl bot commented Oct 23, 2024

Based on the specified backports for linked PR #133270, I applied the following new label(s) to this issue: branch-release-24.1, branch-release-24.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 labels Oct 23, 2024
blathers-crl bot pushed a commit that referenced this issue Oct 23, 2024
We sporadically see that some raft node_test tests fail due to the
leader not being stable. This commit should reduce the chances of that
happening by increasing the election timeout to 250ms (instead of 50ms).

I couldn't reproduce the bug locally with this change.

If the bug still happens, we can try to force leadership to make it
more deterministic.

Fixes: #132992

Release note: None
blathers-crl bot pushed a commit that referenced this issue Oct 23, 2024
We sporadically see that some raft node_test tests fail due to the
leader not being stable. This commit should reduce the chances of that
happening by increasing the election timeout to 250ms (instead of 50ms).

I couldn't reproduce the bug locally with this change.

If the bug still happens, we can try to force leadership to make it
more deterministic.

Fixes: #132992

Release note: None
blathers-crl bot pushed a commit that referenced this issue Oct 23, 2024
We sporadically see that some raft node_test tests fail due to the
leader not being stable. This commit should reduce the chances of that
happening by increasing the election timeout to 250ms (instead of 50ms).

I couldn't reproduce the bug locally with this change.

If the bug still happens, we can try to force leadership to make it
more deterministic.

Fixes: #132992

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants