rac2,kvserver,raft: use raft LogSnapshot directly #132600
Force-pushed from c358f0d to ede46c7.
The real implementation computes a full entry.Size() for the proto, whereas the fake was accounting only the entry.Data payload size. This difference is not crucial/significant, but maybe we should revisit which one should be used.
I haven't looked at the PR yet, but regarding this, RACv1 also used len(entry.Data). This is because entry.Size() is not super-cheap -- it needs to compute the size of three varints. And I don't expect any write-heavy workload will have those sizes be significant compared to len(entry.Data). So we do the same in RACv2, and I don't think we should change that code.
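For context on the cost difference: a proto Size() has to compute varint widths for each field, while len(entry.Data) is a single constant-time read. A hedged sketch of the arithmetic (varintLen and entrySize are made-up names that approximate, not reproduce, raftpb's sizing):

```go
package main

import "fmt"

// varintLen returns the number of bytes protobuf uses to encode v as a
// varint: one byte per 7 bits of payload.
func varintLen(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// entrySize approximates what a proto Entry.Size() must do: compute
// varint lengths for the Term, Index, and Type fields plus the Data
// payload, whereas len(entry.Data) reads a single field. The field
// layout here is illustrative, not the real raftpb encoding.
func entrySize(term, index, entryType uint64, data []byte) int {
	return varintLen(term) + varintLen(index) + varintLen(entryType) +
		varintLen(uint64(len(data))) + len(data)
}

func main() {
	data := make([]byte, 1<<20) // 1 MiB payload
	full := entrySize(100, 12345, 0, data)
	// The overhead over len(data) is only a handful of bytes, but
	// computing it takes several varint-size loops per entry.
	fmt.Println(full - len(data))
}
```

This illustrates the point above: for write-heavy workloads the varint overhead is negligible relative to len(entry.Data), but computing it is not free.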
Regarding what Raft does in LogSnapshot.LogSlice, it doesn't need to be the same, since RACv2 already needs to compensate for the fact that Raft does not know which entries are subject to AC. But if it is trivial to change Raft to use len(entry.Data), perhaps we should.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
Yeah, I agree we should just use len(entry.Data) consistently, for the cost & simplicity reasons. It wouldn't be hard to change in raft, though it should probably be done in sync with other log storage things in the stack (the CRDB raft.Storage implementation and the raft log cache). I'll do it separately.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
Still haven't looked at the code, but I am nervous about the datadriven test changes.
Reviewed 1 of 8 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush
line 235 at r5 (raw file):
# Scheduler event. Only two entries are dequeued from the send-queue since 3+3
# = 6MiB exceeds the 4MiB threshold for sending in one scheduler event.
This seems stale. Also, given we passed 4MB to LogSnapshot.LogSlice and the entry length was ~3MB, I don't understand why we did not get back two entries.
pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush
line 267 at r5 (raw file):
scheduled-replicas: 2
# Another scheduler event. The send-queue is empty and force flushing stops.
I think we need to adjust the test so it does demonstrate the state transitions it was trying to show before.
Same comment applies to send_q_watcher.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)
pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush
line 235 at r5 (raw file):
Previously, sumeerbhola wrote…
This seems stale. Also, given we passed 4MB to LogSnapshot.LogSlice and the entry length was ~3MB, I don't understand why we did not get back two entries.
The reason we get only one entry is in the semantics of the LogSlice method (which were inherited from how things are in the entire raft log storage stack): we return a prefix of the requested range with size <= maxSize at all times. The only exception when we can return more is when the **first** entry is large.
So, when we read the second entry and spot that adding it to the returned slice would bump the total size above maxSize, we drop it on the floor.
This isn't ideal: I think we should return this entry since we've already paid for fetching it. But it's the status quo.
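The prefix-with-maxSize semantics described here can be sketched roughly as follows (limitSize is a hypothetical simplification; the real raft code operates on pb.Entry protos and a byte budget, not raw byte slices):

```go
package main

import "fmt"

// limitSize returns a prefix of entries whose cumulative payload size
// does not exceed maxSize, except that the first entry is always
// returned even if it alone is larger. An entry that would push the
// total over the limit is dropped, even though it was already fetched.
func limitSize(entries [][]byte, maxSize int) [][]byte {
	if len(entries) == 0 {
		return entries
	}
	size := len(entries[0]) // the first entry is always included
	n := 1
	for ; n < len(entries); n++ {
		size += len(entries[n])
		if size > maxSize {
			break // drop the entry that exceeds the budget
		}
	}
	return entries[:n]
}

func main() {
	threeMB := make([]byte, 3<<20)
	// Two ~3MiB entries with a 4MiB budget: only the first entry is
	// returned, since 3+3 = 6MiB exceeds the limit.
	got := limitSize([][]byte{threeMB, threeMB}, 4<<20)
	fmt.Println(len(got)) // 1
}
```

This matches the behavior debated above: the second ~3MiB entry is fetched, found to exceed the 4MiB budget, and dropped.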
Is there an existing issue for changing this, and if not, can you create one?
Force-pushed from 92bb058 to b600c1b.
Filed #132789 and referenced it from code comments. Also, I changed this PR's structure a bit: now, the first commit modifies the LogSlice method in the rac2 test to mirror the current semantics of the LogStorage methods. All the datadriven test changes are in the first commit. All other commits cause no change to the datadriven test files, which demonstrates that this change is a no-op.
I am revising the datadriven test changes in the first commit, to make it demonstrate what it wants to demonstrate. Other than that, there should be no blockers to review the rest.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)
@sumeerbhola I factored out the semantics change into a separate PR #132791, please review that one first. Then I'll rebase this PR on top, which should be a no-op w.r.t. the datadriven tests.
Force-pushed from b600c1b to 1ab91eb.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)
pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush
line 267 at r5 (raw file):
Previously, sumeerbhola wrote…
I think we need to adjust the test so it does demonstrate the state transitions it was trying to show before. Same comment applies to send_q_watcher.
Done in #132791
Force-pushed from 1ab91eb to 172d1af.
Still reading; pushing some initial questions/comments.
Reviewed 2 of 12 files at r11, 3 of 11 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
pkg/raft/storage.go
line 199 at r12 (raw file):
defer ms.Unlock()
ms.callStats.term++
if i < ms.ls.prev.index {
Should this be i <= ms.ls.prev.index, since ms.ls.prev.index is not accessible anymore?
Would the code benefit from defining a ms.ls.firstIndex() method?
pkg/raft/storage.go
line 243 at r12 (raw file):
defer ms.Unlock()
prev := entryID{index: snap.Metadata.Index, term: snap.Metadata.Term}
Nit: why is this called prev, given this is likely going to be the next state?
pkg/raft/storage.go
line 245 at r12 (raw file):
prev := entryID{index: snap.Metadata.Index, term: snap.Metadata.Term}
// Check whether the snapshot is outdated.
if prev.index <= ms.snapshot.Metadata.Index {
Do we not need to compare the terms because snapshots always represent committed data (since they represent the state machine)? Should we have an assertion that the term is not regressing?
pkg/raft/storage.go
line 251 at r12 (raw file):
ms.snapshot = snap
// TODO(pav-kv): the term must be the last accepted term passed in.
term := max(ms.ls.term, prev.term)
Can't the term regress in the LogSlice since it could represent uncommitted data? That is, the snapshot's term should always be used?
pkg/raft/storage.go
line 255 at r12 (raw file):
return nil
}
(note to self: stopped here)
Force-pushed from 172d1af to 1b7304b.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)
pkg/raft/storage.go
line 199 at r12 (raw file):
Previously, sumeerbhola wrote…
Should this be i <= ms.ls.prev.index, since ms.ls.prev.index is not accessible anymore? Would the code benefit from defining a ms.ls.firstIndex() method?
This should be i < ms.ls.prev.index as is. For the prev entry, we should be able to return its term for log matching purposes. E.g. if we construct a MsgApp starting at the first log entry, we also need to set MsgApp.LogTerm to the term of the preceding entry.
Lines 75 to 79 in 0866efe:
// Term returns the term of the entry at the given index, which must be in the
// valid range: [FirstIndex()-1, LastIndex()]. The term of the entry before
// FirstIndex is retained for matching purposes even though the rest of that
// entry may not be available.
Term(index uint64) (uint64, error)
I could reinstate the firstIndex() method, but I didn't find it very useful myself because prev is descriptive enough (the entry before the first), and all the checks can be done against prev directly. This follows the convention established by the LogSlice struct and many uses of it in this package.
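The role of prev in answering Term(FirstIndex()-1) can be illustrated with a toy version of the slice layout (this logSlice is a simplified stand-in for the real package type, not a copy of it):

```go
package main

import (
	"errors"
	"fmt"
)

type entryID struct{ index, term uint64 }

// logSlice is a hypothetical miniature of raft's LogSlice: prev is the
// entry immediately preceding entries[0], retained so its term can be
// returned for log-matching (e.g. MsgApp.LogTerm) even though the
// entry itself has been compacted away.
type logSlice struct {
	prev    entryID
	entries []entryID
}

var errCompacted = errors.New("requested index is compacted")

// term returns the term of the entry at index i. The valid range is
// [prev.index, lastIndex()]: i == prev.index is still answerable,
// anything below it is compacted. Hence the strict i < prev.index
// check discussed above.
func (s logSlice) term(i uint64) (uint64, error) {
	if i < s.prev.index {
		return 0, errCompacted
	}
	if i == s.prev.index {
		return s.prev.term, nil
	}
	off := i - s.prev.index - 1
	if off >= uint64(len(s.entries)) {
		return 0, errors.New("index out of range")
	}
	return s.entries[off].term, nil
}

func main() {
	s := logSlice{
		prev:    entryID{index: 10, term: 3},
		entries: []entryID{{11, 4}, {12, 4}},
	}
	t, _ := s.term(10) // term of the "prev" entry, kept for log matching
	fmt.Println(t)     // 3
}
```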
pkg/raft/storage.go
line 243 at r12 (raw file):
Previously, sumeerbhola wrote…
Nit: why is this called prev, given this is likely going to be the next state?
This name inherited the LogSlice terminology (see the LogSlice initialization below). The LogSlice.prev field is the entry preceding the first entry of the slice.
So this is sort of a "next prev", if you will. I renamed it to id to avoid the terminology clash. This ID is sort of self-explanatory in the code - the entryID at which the snapshot ends.
pkg/raft/storage.go
line 245 at r12 (raw file):
Previously, sumeerbhola wrote…
Do we not need to compare the terms because snapshots always represent committed data (since they represent the state machine)? Should we have an assertion that the term is not regressing?
Added an assertion that the Metadata.Term does not regress.
To help the speed of review: this commit is a no-op. It basically replaces the entries slice (in which entries[0] is fake and denotes the "prev" entry) with a more straightforward LogSlice with the same meaning (LogSlice.prev == entries[0], and LogSlice.entries == entries[1:]).
Today's code does not try to enforce correctness much, and neither does this PR. There are checks at other levels: the raftLog and unstable structs, and the raft/RawNode struct itself; so by the time things are written to Storage, they have been verified. We can look at hardening later, hence the various TODOs that I've added.
pkg/raft/storage.go
line 251 at r12 (raw file):
Previously, sumeerbhola wrote…
Can't the term regress in the LogSlice since it could represent uncommitted data? That is, the snapshot's term should always be used?
That depends on which term we consider:
- If it's the last entry term, it can regress.
- If it's the "leader term", it can not.
Here, LogSlice.term is supposed to be the leader term, which can not regress. But it's not passed in from the caller yet, and also not used for anything internally in this MemoryStorage (though it should be in the future). So I just update it to something so that the LogSlice is self-consistent. The TODO captures the need to update the leader term here to the one known to the upper layer, instead of making it up.
I could also leave LogSlice.term at zero because it's not used here. But it needs to be set for the rac2 tests that call LogSnapshot.LogSlice(), in one of the next commits.
Made this code simpler for now by setting LogSlice.term to the last entry term.
Reviewed 1 of 6 files at r10, 3 of 11 files at r12, 2 of 7 files at r15, 8 of 9 files at r16, 2 of 2 files at r17, 5 of 6 files at r18, 5 of 5 files at r19.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
pkg/raft/storage.go
line 199 at r12 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
This should be i < ms.ls.prev.index as is. For the prev entry, we should be able to return its term for log matching purposes. E.g. if we construct a MsgApp starting at the first log entry, we also need to set MsgApp.LogTerm to the term of the preceding entry.
Lines 75 to 79 in 0866efe:
// Term returns the term of the entry at the given index, which must be in the
// valid range: [FirstIndex()-1, LastIndex()]. The term of the entry before
// FirstIndex is retained for matching purposes even though the rest of that
// entry may not be available.
Term(index uint64) (uint64, error)
I could reinstate the firstIndex() method, but I didn't find it very useful myself because prev is descriptive enough (the entry before the first), and all the checks can be done against prev directly. This follows the convention established by the LogSlice struct and many uses of it in this package.
Thanks for the clarification.
pkg/raft/storage.go
line 251 at r12 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
That depends on which term we consider:
- If it's the last entry term, it can regress.
- If it's the "leader term", it can not.
Here, LogSlice.term is supposed to be the leader term, which can not regress. But it's not passed in from the caller yet, and also not used for anything internally in this MemoryStorage (though it should be in the future). So I just update it to something so that the LogSlice is self-consistent. The TODO captures the need to update the leader term here to the one known to the upper layer, instead of making it up.
I could also leave LogSlice.term at zero because it's not used here. But it needs to be set for the rac2 tests that call LogSnapshot.LogSlice(), in one of the next commits.
Made this code simpler for now by setting LogSlice.term to the last entry term.
Ack
pkg/raft/storage.go
line 150 at r16 (raw file):
//
// TODO(pav-kv): the term field of the logSlice is populated with best effort
// to keep the logSlice valid, but it must be sourced from the upper layer.
What is the actual todo here? Is it to make it less best effort?
pkg/raft/storage.go
line 278 at r16 (raw file):
ms.snapshot.Metadata.ConfState = *cs
}
ms.snapshot.Data = data
This method is not part of raft.Storage, so I had to guess based on the code that CreateSnapshot means that this is installing a snapshot at index i.
It's odd that ms.ls is not being modified to truncate the log, to start from i+1. I suppose this is because CreateSnapshot is being used for installing new versions of the state machine, and will happen with each state machine application, and has nothing to do with truncation/compaction. Is my understanding correct?
pkg/raft/storage.go
line 293 at r16 (raw file):
raftlogger.GetLogger().Panicf("compact %d is out of bound lastindex(%d)", index, last)
}
ms.ls = ms.ls.forward(index)
Shouldn't this also check that there is a snapshot at index >= index, i.e., that we are not creating a gap between the snapshot and the LogSlice? I suppose the method comment about raftLog.applied means that a higher layer is guaranteeing that, though I don't see raftLog as a member of MemoryStorage, so I'm unsure how to interpret it.
pkg/raft/storage.go
line 324 at r16 (raw file):
prefix := ms.ls.sub(ms.ls.prev.index, first-1)
// NB: protect the suffix of the old slice from rewrites.
ms.ls.entries = append(prefix[:len(prefix):len(prefix)], entries...)
Is this code using a cap of len(prefix) to make a shallow copy of LogSlice immutable?
Is that a general property of all LogSlices, and if yes, should it be documented where LogSlice is declared?
pkg/raft/storage.go
line 226 at r17 (raw file):
func (ms *MemoryStorage) LogSnapshot() LogStorageSnapshot {
	// Copy the log slice, and protect it from appends. MemoryStorage never
	// overwrites entries in the log slice, so this is sufficient to get an
That partially answers my previous question.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)
pkg/raft/storage.go
line 150 at r16 (raw file):
Previously, sumeerbhola wrote…
What is the actual todo here? Is it to make it less best effort?
Yes, make it reflect the actual "last accepted term" / "leader term" of the log. Right now it's conservatively set to the last entry term.
pkg/raft/storage.go
line 278 at r16 (raw file):
Previously, sumeerbhola wrote…
This method is not part of raft.Storage, so I had to guess based on the code that CreateSnapshot means that this is installing a snapshot at index i. It's odd that ms.ls is not being modified to truncate the log, to start from i+1. I suppose this is because CreateSnapshot is being used for installing new versions of the state machine, and will happen with each state machine application, and has nothing to do with truncation/compaction. Is my understanding correct?
Yeah, the log is not necessarily in sync with the state machine / snapshot. Also, this method usually comes together with a subsequent Compact.
Right now it's only used in tests (as is the entire MemoryStorage struct).
pkg/raft/storage.go
line 293 at r16 (raw file):
Previously, sumeerbhola wrote…
Shouldn't this also check that there is a snapshot at index >= index, i.e., we are not creating a gap between the snapshot and the LogSlice? I suppose the method comment about raftLog.applied means that a higher layer is guaranteeing that, though I don't see raftLog as a member of MemoryStorage, so unsure how to interpret it.
Yeah, it's ensured at the higher level: applied >= snapshot.Index. If applied > snapshot.Index, it should in theory be possible to compact at index > snapshot.Index, esp. when we separate the raft log storage from state machine storage. The only requirement is that the compaction is <= the durable applied index.
pkg/raft/storage.go
line 324 at r16 (raw file):
Previously, sumeerbhola wrote…
Is this code using a cap of len(prefix) to make a shallow copy of LogSlice immutable? Is that a general property of all LogSlices, and if yes, should it be documented where LogSlice is declared?
Yes, LogSlice is immutable. The only allowed operation is appending to it. I'll drop a note about this in the LogSlice definition.
This capping is a self-defence mechanism of MemoryStorage. If the caller appends to this slice (and it may happen inside the LogSnapshot.LogSlice() call, when we stitch together the storage part of the slice with the in-memory unstable slice), the original slice in MemoryStorage does not get corrupted.
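The three-index ("full") slice expression discussed here is what makes the protection work: capping cap to len forces any subsequent append to reallocate instead of writing into the shared backing array. A standalone demonstration of the mechanism:

```go
package main

import "fmt"

func main() {
	// Backing array shared between the "storage" and the handed-out slice.
	stored := []int{1, 2, 3, 4, 5}

	// Hand out the first three entries with len == cap == 3. Any append
	// by the caller must reallocate rather than write into stored[3].
	capped := stored[:3:3]
	capped = append(capped, 99)
	_ = capped                 // the appended copy lives in a new array
	fmt.Println(stored[3])     // 4: the original was not overwritten

	// Without the cap, the append clobbers the shared backing array.
	uncapped := stored[:3]     // len 3, cap 5
	uncapped = append(uncapped, 99)
	_ = uncapped
	fmt.Println(stored[3])     // 99: the original was overwritten
}
```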
pkg/raft/storage.go
line 226 at r17 (raw file):
Previously, sumeerbhola wrote…
That partially answers my previous question.
I'll clarify the comment: it should say that there is both self-defence by MemoryStorage (because the caller can append to the slice), and that the caller is also defended from appends that MemoryStorage does to this slice.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
Epic: none Release note: none
Force-pushed from 1b7304b to 3a2b50a.
Improved the commenting a bit in response to the review comments.
bors r=sumeerbhola
This PR:
- raft.MemoryStorage to support log snapshots
- rac2 tests to use the real LogSnapshot (backed by MemoryStorage) instead of a fake one
- rac2 code to pass raft.LogSnapshot by value instead of hiding it behind an interface that might allocate
Related to #128779