
rac2,kvserver,raft: use raft LogSnapshot directly #132600

Merged

5 commits merged into cockroachdb:master on Oct 18, 2024

Conversation

@pav-kv (Collaborator) commented Oct 14, 2024

This PR:

  • refactors raft.MemoryStorage to support log snapshots
  • converts rac2 tests to use the real LogSnapshot (backed by MemoryStorage) instead of a fake one
  • converts rac2 code to pass raft.LogSnapshot by value instead of hiding it behind an interface that might allocate (see the sketch below)

Related to #128779
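For illustration, a minimal sketch of the by-value vs. behind-an-interface distinction (hypothetical types, not the actual raft/rac2 signatures): storing a struct in an interface can heap-allocate it at the call site, while passing it by value does not.

```go
package main

// LogSnapshot stands in for raft.LogSnapshot: a small value describing a
// consistent point-in-time view of the log.
type LogSnapshot struct {
	first, last uint64
}

func (s LogSnapshot) LastIndex() uint64 { return s.last }

// logReader is the kind of interface the value replaces.
type logReader interface {
	LastIndex() uint64
}

// useByValue takes the snapshot by value: no boxing, no indirection.
func useByValue(s LogSnapshot) uint64 { return s.LastIndex() }

// useByInterface may force the caller to box s, which can allocate.
func useByInterface(r logReader) uint64 { return r.LastIndex() }

func main() {
	s := LogSnapshot{first: 1, last: 10}
	_ = useByValue(s)     // s can stay on the stack
	_ = useByInterface(s) // s may escape to the heap
}
```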

@cockroach-teamcity (Member)

This change is Reviewable

@pav-kv pav-kv force-pushed the raft-mem-storage branch 2 times, most recently from c358f0d to ede46c7, on October 14, 2024 19:33
@pav-kv pav-kv marked this pull request as ready for review October 14, 2024 19:35
@pav-kv pav-kv requested review from a team as code owners October 14, 2024 19:35
@sumeerbhola (Collaborator) left a comment

The real implementation computes the full entry.Size() of the proto, whereas the fake accounted only for the entry.Data payload size. This difference is not crucial, but maybe we should revisit which one should be used.

I haven't looked at the PR yet, but regarding this: RACv1 also used len(entry.Data). This is because entry.Size() is not super cheap; it needs to compute the size of three varints. And I don't expect any write-heavy workload to have those varint sizes be significant compared to len(entry.Data). So we do the same in RACv2, and I don't think we should change that code.
Regarding what Raft does in LogSnapshot.LogSlice, it doesn't need to be the same, since RACv2 already needs to compensate for the fact that Raft does not know which entries are subject to AC. But if it is trivial to change Raft to use len(entry.Data), perhaps we should.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
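For context, a rough sketch of the two accounting choices (a simplified Entry; the real gogoproto-generated Size() also counts field tags and the Type field, so treat this as an approximation):

```go
package main

import "fmt"

// Entry mirrors only the raftpb.Entry fields relevant to this discussion.
type Entry struct {
	Term, Index uint64
	Data        []byte
}

// varintLen is the number of bytes needed to varint-encode v.
func varintLen(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// protoSize approximates entry.Size(): it recomputes varint lengths for
// Term, Index, and the Data length prefix on every call.
func protoSize(e Entry) int {
	return varintLen(e.Term) + varintLen(e.Index) +
		varintLen(uint64(len(e.Data))) + len(e.Data)
}

func main() {
	e := Entry{Term: 5, Index: 12345, Data: make([]byte, 1<<20)}
	fmt.Println(protoSize(e)) // payload plus a handful of varint bytes
	fmt.Println(len(e.Data))  // RACv1/RACv2 accounting: the payload only
}
```

For multi-megabyte payloads the two numbers differ by only a few bytes, which is why len(entry.Data) is the cheaper and adequate choice.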

@pav-kv (Collaborator, Author) left a comment

Yeah, I agree we should just use len(entry.Data) consistently, for cost and simplicity reasons. It wouldn't be hard to change in raft, though it should probably be done in sync with the other log storage pieces in the stack (the CRDB raft.Storage implementation and the raft log cache). I'll do it separately.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)

@sumeerbhola (Collaborator) left a comment

Still haven't looked at the code, but I am nervous about the datadriven test changes.

Reviewed 1 of 8 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush line 235 at r5 (raw file):


# Scheduler event. Only two entries are dequeued from the send-queue since 3+3
# = 6MiB exceeds the 4MiB threshold for sending in one scheduler event.

This seems stale. Also, given we passed 4MB to LogSnapshot.LogSlice and the entry length was ~3MB, I don't understand why we did not get back two entries.


pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush line 267 at r5 (raw file):

scheduled-replicas: 2

# Another scheduler event. The send-queue is empty and force flushing stops.

I think we need to adjust the test so it does demonstrate the state transitions it was trying to show before.
Same comment applies to send_q_watcher.

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush line 235 at r5 (raw file):

Previously, sumeerbhola wrote…

This seems stale. Also, given we passed 4MB to LogSnapshot.LogSlice and the entry length was ~3MB, I don't understand why we did not get back two entries.

The reason we get only one entry lies in the semantics of the LogSlice method (inherited from how things work across the entire raft log storage stack): we return a prefix of the requested range whose size is <= maxSize at all times. The only exception, when we can return more, is when the first entry alone is larger than maxSize.

So, when we read the second entry and spot that adding it to the returned slice would bump the total size above maxSize, we drop it on the floor.

This isn't ideal: I think we should return this entry since we've already paid for fetching it. But it's the status quo.
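A sketch of those semantics (a hypothetical helper, not the raft package code; Entry as in the earlier sketch, with sizes abbreviated to payload lengths):

```go
// limitedPrefix returns the longest prefix of entries whose cumulative
// payload size stays within maxSize, except that the first entry is always
// returned even if it alone exceeds maxSize.
func limitedPrefix(entries []Entry, maxSize uint64) []Entry {
	var size uint64
	for i, e := range entries {
		size += uint64(len(e.Data))
		if size > maxSize && i > 0 {
			// The overflowing entry was already fetched but is dropped
			// on the floor: the status quo described above.
			return entries[:i]
		}
	}
	return entries
}
```

With maxSize = 4MiB and two ~3MiB entries, the second entry bumps the total to ~6MiB > 4MiB, so only the first is returned.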

@sumeerbhola (Collaborator)

Previously, pav-kv (Pavel Kalinnikov) wrote…

This isn't ideal: I think we should return this entry since we've already paid for fetching it. But it's the status quo.

Is there an existing issue for changing this, and if not, can you create one?

@pav-kv pav-kv force-pushed the raft-mem-storage branch 2 times, most recently from 92bb058 to b600c1b, on October 16, 2024 20:44
@pav-kv (Collaborator, Author) left a comment

Filed #132789 and referenced it from code comments. I also changed the PR structure a bit: now the first commit modifies the LogSlice method in the rac2 test to mirror the current semantics of the LogStorage methods. All the datadriven test changes are in the first commit; all other commits cause no changes to the datadriven test files, which demonstrates that this change is a no-op.

I am reviewing the datadriven test changes in the first commit, to make sure they demonstrate what they are meant to demonstrate. Other than that, there should be no blockers to reviewing the rest.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)

@pav-kv (Collaborator, Author) commented Oct 16, 2024

@sumeerbhola I factored out the semantics change into a separate PR #132791, please review that one first. Then I'll rebase this PR on top, which should be a no-op w.r.t. the datadriven tests.

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush line 267 at r5 (raw file):

Previously, sumeerbhola wrote…

I think we need to adjust the test so it does demonstrate the state transitions it was trying to show before.
Same comment applies to send_q_watcher.

Done in #132791

@sumeerbhola (Collaborator) left a comment

Still reading; pushing some initial questions/comments.

Reviewed 2 of 12 files at r11, 3 of 11 files at r12.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/raft/storage.go line 199 at r12 (raw file):

	defer ms.Unlock()
	ms.callStats.term++
	if i < ms.ls.prev.index {

Should this be i <= ms.ls.prev.index, since ms.ls.prev.index is not accessible anymore?

Would the code benefit from defining a ms.ls.firstIndex() method?


pkg/raft/storage.go line 243 at r12 (raw file):

	defer ms.Unlock()

	prev := entryID{index: snap.Metadata.Index, term: snap.Metadata.Term}

Nit: why is this called prev, given that this is likely going to be the next state?


pkg/raft/storage.go line 245 at r12 (raw file):

	prev := entryID{index: snap.Metadata.Index, term: snap.Metadata.Term}
	// Check whether the snapshot is outdated.
	if prev.index <= ms.snapshot.Metadata.Index {

Do we not need to compare the terms because snapshots always represent committed data (since they represent the state machine)? Should we have an assertion that the term is not regressing?


pkg/raft/storage.go line 251 at r12 (raw file):

	ms.snapshot = snap
	// TODO(pav-kv): the term must be the last accepted term passed in.
	term := max(ms.ls.term, prev.term)

Can't the term regress in the LogSlice, since it could represent uncommitted data? That is, shouldn't the snapshot's term always be used?


pkg/raft/storage.go line 255 at r12 (raw file):

	return nil
}

(note to self: stopped here)

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)


pkg/raft/storage.go line 199 at r12 (raw file):

Previously, sumeerbhola wrote…

should this be i <= ms.ls.prev.index since ms.ls.prev.index is not accessible anymore?

Would the code benefit by defining a ms.ls.firstIndex() method.

This should be i < ms.ls.prev.index as is. For the prev entry, we should be able to return its term for log matching purposes. E.g. if we construct a MsgApp starting at the first log entry, we also need to set MsgApp.LogTerm to the term of the preceding entry.

// Term returns the term of the entry at the given index, which must be in the
// valid range: [FirstIndex()-1, LastIndex()]. The term of the entry before
// FirstIndex is retained for matching purposes even though the rest of that
// entry may not be available.
Term(index uint64) (uint64, error)

I could reinstate the firstIndex() method, but I didn't find it very useful myself because prev is descriptive enough (the entry before the first), and all the checks can be done against prev directly. This follows the convention established by the LogSlice struct and its many uses in this package.
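A sketch of the check under discussion (field and method names assumed from the quoted snippets; not the exact storage.go code):

```go
func (ms *MemoryStorage) Term(i uint64) (uint64, error) {
	ms.Lock()
	defer ms.Unlock()
	if i < ms.ls.prev.index {
		// Strictly below prev: compacted away, term unknown.
		return 0, ErrCompacted
	}
	if i == ms.ls.prev.index {
		// The prev entry's term is retained so that a MsgApp starting at
		// the first entry can still set MsgApp.LogTerm.
		return ms.ls.prev.term, nil
	}
	if last := ms.ls.prev.index + uint64(len(ms.ls.entries)); i > last {
		return 0, ErrUnavailable
	}
	return ms.ls.entries[i-ms.ls.prev.index-1].Term, nil
}
```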


pkg/raft/storage.go line 243 at r12 (raw file):

Previously, sumeerbhola wrote…

nit: why is this called prev given this is likely going to be the next state?

This name inherited the LogSlice terminology (see the LogSlice initialization below). The LogSlice.prev field is the entry preceding the first entry of the slice.

So this is sort of a "next prev", if you will. I renamed it to id to avoid the terminology clash. This ID is sort of self-explanatory in the code: the entryID at which the snapshot ends.


pkg/raft/storage.go line 245 at r12 (raw file):

Previously, sumeerbhola wrote…

Do we not need to compare the terms because snapshots always represent committed data (since they represent the state machine)? Should we have an assertion that the term is not regressing.

Added an assertion that the Metadata.Term does not regress.


To help speed up the review: this commit is a no-op. It basically replaces the entries slice (in which entries[0] is fake and denotes the "prev" entry) with a more straightforward LogSlice with the same meaning (LogSlice.prev == entries[0], and LogSlice.entries == entries[1:]).

Today's code does not try hard to enforce correctness, and neither does this PR. There are checks at other levels: the raftLog and unstable structs, and the raft/RawNode struct itself; so by the time things are written to Storage, they have been verified. We can look at hardening later, hence the various TODOs that I've added.
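For readers following along, the shapes being swapped look roughly like this (field layout pieced together from the snippets quoted in this review; treat it as an assumption):

```go
// entryID identifies a log entry by its index and term.
type entryID struct {
	index, term uint64
}

// LogSlice replaces the old representation in which entries[0] was a fake
// sentinel carrying the "prev" ID: now prev corresponds to the old
// entries[0], and entries to the old entries[1:].
type LogSlice struct {
	term    uint64  // leader term under which this slice is consistent
	prev    entryID // ID of the entry immediately preceding entries[0]
	entries []Entry // Entry as in the earlier sketch
}
```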


pkg/raft/storage.go line 251 at r12 (raw file):

Previously, sumeerbhola wrote…

Can't the term regress in the LogSlice since it could represent uncommitted data? That is, the snapshot's term should always be used?

That depends on which term we consider:

  • If it's the last entry term, it can regress.
  • If it's the "leader term", it cannot.

Here, LogSlice.term is supposed to be the leader term, which cannot regress. But it's not passed in from the caller yet, and it's also not used for anything internally in this MemoryStorage (though it should be in the future). So I just update it to something that keeps the LogSlice self-consistent. The TODO captures the need to update the leader term here to the one known to the upper layer, instead of making it up.

I could also leave LogSlice.term at zero because it's not used here. But it needs to be set for the rac2 tests that call LogSnapshot.LogSlice(), in one of the next commits.

Made this code simpler for now by setting LogSlice.term to the last entry term.

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 1 of 6 files at r10, 3 of 11 files at r12, 2 of 7 files at r15, 8 of 9 files at r16, 2 of 2 files at r17, 5 of 6 files at r18, 5 of 5 files at r19.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/raft/storage.go line 199 at r12 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

This should be i < ms.ls.prev.index as is. For the prev entry, we should be able to return its term for log matching purposes. E.g. if we construct a MsgApp starting at the first log entry, we also need to set MsgApp.LogTerm to the term of the preceding entry.

// Term returns the term of the entry at the given index, which must be in the
// valid range: [FirstIndex()-1, LastIndex()]. The term of the entry before
// FirstIndex is retained for matching purposes even though the rest of that
// entry may not be available.
Term(index uint64) (uint64, error)

I could reinstate the firstIndex() method, but I didn't find it much useful myself because prev is descriptive enough (the entry before the first), and all the checks can be done against prev directly. This follows the convention instated by the LogSlice struct and many uses of it in this package.

Thanks for the clarification.


pkg/raft/storage.go line 251 at r12 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

That depends on which term we consider:

  • If it's the last entry term, it can regress.
  • If it's the "leader term", it can not.

Here, LogSlice.term is supposed to be the leader term that can not regress. But it's not passed in from the caller yet, and also not used for anything internally in this MemoryStorage (though it should in the future). So I just update it to something so that the LogSlice is self-consistent. The TODO captures the need to update the leader term here to the one known to the upper layer, instead of making it up.

I could also leave LogSlice.term at zero because it's not used here. But it needs to be set for the rac2 tests that call LogSnapshot.LogSlice(), in one of the next commits.

Made this code simpler for now by setting LogSlice.term to the last entry term.

Ack


pkg/raft/storage.go line 150 at r16 (raw file):

	//
	// TODO(pav-kv): the term field of the logSlice is populated with best effort
	// to keep the logSlice valid, but it must be sourced from the upper layer.

What is the actual todo here? Is it to make it less best effort?


pkg/raft/storage.go line 278 at r16 (raw file):

		ms.snapshot.Metadata.ConfState = *cs
	}
	ms.snapshot.Data = data

This method is not part of raft.Storage so I had to guess based on the code that CreateSnapshot means that this is installing a snapshot at index i.
It's odd that ms.ls is not being modified to truncate the log, to start from i+1. I suppose this is because CreateSnapshot is being used for installing new versions of the state machine, and will happen with each state machine application, and has nothing to do with truncation/compaction. Is my understanding correct?


pkg/raft/storage.go line 293 at r16 (raw file):

		raftlogger.GetLogger().Panicf("compact %d is out of bound lastindex(%d)", index, last)
	}
	ms.ls = ms.ls.forward(index)

Shouldn't this also check that there is a snapshot at index >= index, i.e., that we are not creating a gap between the snapshot and the LogSlice? I suppose the method comment about raftLog.applied means that a higher layer is guaranteeing that, though I don't see raftLog as a member of MemoryStorage, so I'm unsure how to interpret it.


pkg/raft/storage.go line 324 at r16 (raw file):

		prefix := ms.ls.sub(ms.ls.prev.index, first-1)
		// NB: protect the suffix of the old slice from rewrites.
		ms.ls.entries = append(prefix[:len(prefix):len(prefix)], entries...)

Is this code using a cap of len(prefix) to make a shallow copy of LogSlice immutable?
Is that a general property of all LogSlices, and if yes, should it be documented where LogSlice is declared?


pkg/raft/storage.go line 226 at r17 (raw file):

func (ms *MemoryStorage) LogSnapshot() LogStorageSnapshot {
	// Copy the log slice, and protect it from appends. MemoryStorage never
	// overwrites entries in the log slice, so this is sufficient to get an

That partially answers my previous question.

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)


pkg/raft/storage.go line 150 at r16 (raw file):

Previously, sumeerbhola wrote…

What is the actual todo here? Is it to make it less best effort?

Yes, make it reflect the actual "last accepted term" / "leader term" of the log. Right now it's conservatively set to the last entry term.


pkg/raft/storage.go line 278 at r16 (raw file):

Previously, sumeerbhola wrote…

This method is not part of raft.Storage so I had to guess based on the code that CreateSnapshot means that this is installing a snapshot at index i.
It's odd that ms.ls is not being modified to truncate the log, to start from i+1. I suppose this is because CreateSnapshot is being used for installing new versions of the state machine, and will happen with each state machine application, and has nothing to do with truncation/compaction. Is my understanding correct?

Yeah, the log is not necessarily in sync with the state machine / snapshot. Also, this method usually comes together with a subsequent Compact.
Right now it's only used in tests (as well as the entire MemoryStorage struct).


pkg/raft/storage.go line 293 at r16 (raw file):

Previously, sumeerbhola wrote…

Shouldn't this also check that there is a snapshot at index >= index, i.e., we are not creating a gap between the snapshot and the LogSlice? I suppose the method comment about raftLog.applied means that a higher layer is guaranteeing that, though I don't see raftLog as a member of MemoryStorage so unsure how to interpret it.

Yeah, it's ensured at the higher level: applied >= snapshot.Index. If applied > snapshot.Index, it should in theory be possible to compact at an index > snapshot.Index, especially once we separate the raft log storage from the state machine storage. The only requirement is that the compaction index is <= the durable applied index.


pkg/raft/storage.go line 324 at r16 (raw file):

Previously, sumeerbhola wrote…

Is this code using a cap of len(prefix) to make a shallow copy of LogSlice immutable?
Is that a general property of all LogSlices, and if yes, should it be documented where LogSlice is declared?

Yes, LogSlice is immutable. The only allowed operation is appending to it. Will add a note about this in the LogSlice definition.

This capping is a self-defence mechanism of MemoryStorage. If the caller appends to this slice (which may happen inside the LogSnapshot.LogSlice() call, when we stitch together the storage part of the slice with the in-memory unstable slice), the original slice in MemoryStorage does not get corrupted.
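A standalone demonstration of the mechanism, with plain ints in place of log entries:

```go
package main

import "fmt"

func main() {
	a := []int{1, 2, 3, 4}
	_ = append(a[:2], 99) // cap is still 4: append reuses the backing array
	fmt.Println(a)        // [1 2 99 4]: a[2] was overwritten

	b := []int{1, 2, 3, 4}
	_ = append(b[:2:2], 99) // cap equals len: append must reallocate
	fmt.Println(b)          // [1 2 3 4]: untouched
}
```

Capping prefix at len(prefix) forces any later append to copy into fresh memory instead of scribbling over the entries that MemoryStorage still owns.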


pkg/raft/storage.go line 226 at r17 (raw file):

Previously, sumeerbhola wrote…

That partially answers my previous question.

I'll clarify the comment: it should say that MemoryStorage defends itself (because the caller can append to the slice), and that the caller is likewise defended from appends that MemoryStorage makes to this slice.

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)

Epic: none
Release note: none
@pav-kv (Collaborator, Author) commented Oct 18, 2024

Improved the commenting a bit in response to the review comments.

@pav-kv (Collaborator, Author) commented Oct 18, 2024

bors r=sumeerbhola

@craig craig bot merged commit c874696 into cockroachdb:master Oct 18, 2024
22 of 23 checks passed
@pav-kv pav-kv deleted the raft-mem-storage branch October 18, 2024 19:16