You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Isolated sub set of a cluster(e.g., {A, B}) keeps increasing term and retrying election.
When communication is restored, one of the isolated node(e.g., A) will take the leadership
from the running sub cluster(e.g., {C, D, E}), with a higher term.
vote
A ---------> B
|
| unreachable
---------------------- isolation
|
+------------.
v v
vote
C ---------> D E
| ^
` ------------------------'
vote
Thus in an unstable network environment, the leader changes very frequently.
Raft solution: pre-vote
Raft solves this problem by introducing a pre-vote RPC:
A candidate tries to communicate with a quorum of replicas to see if there is a living
leader, without modifying the state of other replicas, before voting itself.
Our solution: without pre-vote.
Our solution would be simpler:
Just do not increase the term of a candidate unless it has to.
Without pre-vote, handle_vote_request() behaves as below:
granting a vote
According to raft spec, a node reject a vote if any one of the following
conditions is met:
reject if req.term < local.term.
reject if req.term == local.term && req.leader != local.leader
i.e., a node has voted for other leader.
reject if req.effective_leader != local.effective_leader
i.e., there is still a valid leader that has not yet timed out:
in other word, an heartbeat has received in the past interval.
reject if rea.log_id < local.log_id.
These conditions implies an partially ordered data structure:
To simplify these conditions
We defines an order for effective_leader and leader
Partial order Leader
The field leader and effective_leader can be considered as a set, define:
type Leader = BtreeSet<NodeId> // It is empty or has exactly one element
The order of two Leader is defined by set containment: a > b ↔ a ⊃ b
Partial order Vote
If we capsulate fields into a struct Vote, it is obvious that it is also partial order and the order is a vector-order.
typeVote = {term_and_leader:(u64,Leader),// `effective_leader` becomes {} if a node timed out receiving a heartbeat.effective_leader:Leader,log_id:LogId,}
let a: Vote
let b: Vote
a >= b iff (
(a.term, a.leader) >= (b.term, b.leader)
&& a.effective_leader >= b.effective_leader
&& a.log_id >= b.log_id)
∴ A node grants a vote iff request.vote >= local.vote and we compare Vote in a vector order,
i.e., given X = [x₁, x₂ ... ] and Y = [y₁, y₂ ... ], X >= Y ↔ ∀i xᵢ >= yᵢ
Pseudo code of voting
fncandidate_loop(){let vote = build_vote();send_vote_requests(vote);letmut next_term = 0;letmut granted = 1;// I vote for myself of course.letmut received = 1;// I respond to myself of course.looplet remote_vote = select!{
x = recv() => x,
_ = timeout() => break,
}
recevied += 1;if vote >= remote_vote {
granted += 1;}else{let v = vote.clone();
v.term = remote_vote.term + 1;if v >= remote_vote {// incr only when necessary
next_term = max(next_term, v.term);}}}if granted >= n/2+1{// become leaderreturn;}if received >= n/2+1 && next_term > 0{self.term = next_term;}}fnhandle_vote_request(vote:Vote){if vote >= local.vote{save(vote)}send_response(local.vote)}
Example
With the above isolated network: {A, B} | {C, D, E}, while C is the leader,
what will happen is described as below:
The log_id is ignored in comparison; [] is a vector, compare in vector order; () is a tuple, compare in dictionary order
A becomes a Candidate, start election with term=10.
At that time, A did not increase term yet.
A send vote request to B, while B.effective_leader is not timed out, A.vote = [(10, {A}), {A}] is not greater than B.vote = [(10, {C}), {C}].
A.vote > B.vote does not hold, thus A.vote is rejected.
A will revert to a Follower.
When B.effective_leader timed out, B.effective_leader becomes an empty set {}.
In the next round of voting, A will find that A.vote = [(11, {A}), {A}] is greater than B.vote = [(10, {C}), {}].
But A did not receive enough responses(at least 2), A does not know if there is another valid leader or not.
Thus A does not increase its term.
When the network isolation is restored,
A send A.vote = [(10, {A}), {A}] to C, C will reject A.vote if is still the leader, i.e., C.effective_leader = {C}.
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
-
Problem
Isolated sub set of a cluster(e.g.,
{A, B}
) keeps increasing term and retrying election.When communication is restored, one of the isolated node(e.g., A) will take the leadership
from the running sub cluster(e.g.,
{C, D, E}
), with a higher term.Thus in an unstable network environment, the leader changes very frequently.
Raft solution: pre-vote
Raft solves this problem by introducing a
pre-vote
RPC:A candidate tries to communicate with a quorum of replicas to see if there is a living
leader, without modifying the state of other replicas, before voting itself.
Our solution: without pre-vote.
Our solution would be simpler:
Just do not increase the term of a candidate unless it has to.
Without pre-vote,
handle_vote_request()
behaves as below:granting a vote
According to raft spec, a node reject a vote if any one of the following
conditions is met:
reject if
req.term < local.term
.reject if
req.term == local.term && req.leader != local.leader
i.e., a node has voted for other leader.
reject if
req.effective_leader != local.effective_leader
i.e., there is still a valid leader that has not yet timed out:
in other word, an heartbeat has received in the past interval.
reject if
rea.log_id < local.log_id
.These conditions implies an partially ordered data structure:
To simplify these conditions
We defines an
order
foreffective_leader
andleader
Partial order
Leader
The field
leader
andeffective_leader
can be considered as aset
, define:The
order
of twoLeader
is defined by set containment:a > b ↔ a ⊃ b
Partial order
Vote
If we capsulate fields into a struct
Vote
, it is obvious that it is also partial order and the order is avector-order
.∴ A node grants a vote iff
request.vote >= local.vote
and we compareVote
in a vector order,i.e., given X = [x₁, x₂ ... ] and Y = [y₁, y₂ ... ],
X >= Y ↔ ∀i xᵢ >= yᵢ
Pseudo code of voting
Example
With the above isolated network:
{A, B} | {C, D, E}
, whileC
is the leader,what will happen is described as below:
A
becomes aCandidate
, start election with term=10.At that time,
A
did not increase term yet.A
send vote request toB
, whileB.effective_leader
is not timed out,A.vote = [(10, {A}), {A}]
is not greater thanB.vote = [(10, {C}), {C}]
.A.vote > B.vote
does not hold, thusA.vote
is rejected.A
will revert to aFollower
.When
B.effective_leader
timed out,B.effective_leader
becomes an empty set{}
.In the next round of voting,
A
will find thatA.vote = [(11, {A}), {A}]
is greater thanB.vote = [(10, {C}), {}]
.But
A
did not receive enough responses(at least 2),A
does not know if there is another valid leader or not.Thus
A
does not increase its term.When the network isolation is restored,
A send
A.vote = [(10, {A}), {A}]
toC
,C
will rejectA.vote
if is still the leader, i.e.,C.effective_leader = {C}
.A
will revert to aFollower
.Beta Was this translation helpful? Give feedback.
All reactions