
rac2,kvserver,raft: use raft LogSnapshot directly #132600

Merged

5 commits merged into cockroachdb:master on Oct 18, 2024

Conversation

@pav-kv (Collaborator) commented Oct 14, 2024

This PR:

  • refactors raft.MemoryStorage to support log snapshots
  • converts rac2 tests to use the real LogSnapshot (backed by MemoryStorage) instead of a fake one
  • converts rac2 code to pass raft.LogSnapshot by value instead of hiding it behind an interface that might allocate (see the sketch below)

Related to #128779
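For illustration, a minimal sketch of the by-value vs. behind-an-interface distinction (hypothetical types, not the actual raft/rac2 signatures): storing a struct in an interface can heap-allocate it at the call site, while passing it by value does not.

```go
package main

// LogSnapshot stands in for raft.LogSnapshot: a small value describing a
// consistent point-in-time view of the log.
type LogSnapshot struct {
	first, last uint64
}

func (s LogSnapshot) LastIndex() uint64 { return s.last }

// logReader is the kind of interface the value replaces.
type logReader interface {
	LastIndex() uint64
}

// useByValue takes the snapshot by value: no boxing, no indirection.
func useByValue(s LogSnapshot) uint64 { return s.LastIndex() }

// useByInterface may force the caller to box s, which can allocate.
func useByInterface(r logReader) uint64 { return r.LastIndex() }

func main() {
	s := LogSnapshot{first: 1, last: 10}
	_ = useByValue(s)     // s can stay on the stack
	_ = useByInterface(s) // s may escape to the heap
}
```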

@cockroach-teamcity (Member)

This change is Reviewable

@pav-kv pav-kv force-pushed the raft-mem-storage branch 2 times, most recently from c358f0d to ede46c7, on October 14, 2024 19:33
@pav-kv pav-kv marked this pull request as ready for review October 14, 2024 19:35
@pav-kv pav-kv requested review from a team as code owners October 14, 2024 19:35
@sumeerbhola (Collaborator) left a comment

The real implementation computes the full entry.Size() of the proto, whereas the fake accounted only for the entry.Data payload size. This difference is not crucial, but maybe we should revisit which one should be used.

I haven't looked at the PR yet, but regarding this: RACv1 also used len(entry.Data). This is because entry.Size() is not super cheap; it needs to compute the size of three varints. And I don't expect any write-heavy workload to have those varint sizes be significant compared to len(entry.Data). So we do the same in RACv2, and I don't think we should change that code.
Regarding what Raft does in LogSnapshot.LogSlice, it doesn't need to be the same, since RACv2 already needs to compensate for the fact that Raft does not know which entries are subject to AC. But if it is trivial to change Raft to use len(entry.Data), perhaps we should.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
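For context, a rough sketch of the two accounting choices (a simplified Entry; the real gogoproto-generated Size() also counts field tags and the Type field, so treat this as an approximation):

```go
package main

import "fmt"

// Entry mirrors only the raftpb.Entry fields relevant to this discussion.
type Entry struct {
	Term, Index uint64
	Data        []byte
}

// varintLen is the number of bytes needed to varint-encode v.
func varintLen(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// protoSize approximates entry.Size(): it recomputes varint lengths for
// Term, Index, and the Data length prefix on every call.
func protoSize(e Entry) int {
	return varintLen(e.Term) + varintLen(e.Index) +
		varintLen(uint64(len(e.Data))) + len(e.Data)
}

func main() {
	e := Entry{Term: 5, Index: 12345, Data: make([]byte, 1<<20)}
	fmt.Println(protoSize(e)) // payload plus a handful of varint bytes
	fmt.Println(len(e.Data))  // RACv1/RACv2 accounting: the payload only
}
```

For multi-megabyte payloads the two numbers differ by only a few bytes, which is why len(entry.Data) is the cheaper and adequate choice.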

@pav-kv (Collaborator, Author) left a comment

Yeah, I agree we should just use len(entry.Data) consistently, for cost and simplicity reasons. It wouldn't be hard to change in raft, though it should probably be done in sync with the other log storage pieces in the stack (the CRDB raft.Storage implementation and the raft log cache). I'll do it separately.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)

@sumeerbhola (Collaborator) left a comment

Still haven't looked at the code, but I am nervous about the datadriven test changes.

Reviewed 1 of 8 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush line 235 at r5 (raw file):


# Scheduler event. Only two entries are dequeued from the send-queue since 3+3
# = 6MiB exceeds the 4MiB threshold for sending in one scheduler event.

This seems stale. Also, given we passed 4MB to LogSnapshot.LogSlice and the entry length was ~3MB, I don't understand why we did not get back two entries.


pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush line 267 at r5 (raw file):

scheduled-replicas: 2

# Another scheduler event. The send-queue is empty and force flushing stops.

I think we need to adjust the test so it does demonstrate the state transitions it was trying to show before.
Same comment applies to send_q_watcher.

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush line 235 at r5 (raw file):

Previously, sumeerbhola wrote…

This seems stale. Also, given we passed 4MB to LogSnapshot.LogSlice and the entry length was ~3MB, I don't understand why we did not get back two entries.

The reason we get only one entry lies in the semantics of the LogSlice method (inherited from how things work across the entire raft log storage stack): we return a prefix of the requested range whose size is <= maxSize at all times. The only exception, when we can return more, is when the first entry alone is larger than maxSize.

So, when we read the second entry and spot that adding it to the returned slice would bump the total size above maxSize, we drop it on the floor.

This isn't ideal: I think we should return this entry since we've already paid for fetching it. But it's the status quo.
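A sketch of those semantics (a hypothetical helper, not the raft package code; Entry as in the earlier sketch, with sizes abbreviated to payload lengths):

```go
// limitedPrefix returns the longest prefix of entries whose cumulative
// payload size stays within maxSize, except that the first entry is always
// returned even if it alone exceeds maxSize.
func limitedPrefix(entries []Entry, maxSize uint64) []Entry {
	var size uint64
	for i, e := range entries {
		size += uint64(len(e.Data))
		if size > maxSize && i > 0 {
			// The overflowing entry was already fetched but is dropped
			// on the floor: the status quo described above.
			return entries[:i]
		}
	}
	return entries
}
```

With maxSize = 4MiB and two ~3MiB entries, the second entry bumps the total to ~6MiB > 4MiB, so only the first is returned.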

@sumeerbhola (Collaborator)

Previously, pav-kv (Pavel Kalinnikov) wrote…

This isn't ideal: I think we should return this entry since we've already paid for fetching it. But it's the status quo.

Is there an existing issue for changing this, and if not, can you create one?

@pav-kv pav-kv force-pushed the raft-mem-storage branch 2 times, most recently from 92bb058 to b600c1b, on October 16, 2024 20:44
@pav-kv (Collaborator, Author) left a comment

Filed #132789 and referenced it from code comments. I also changed the PR structure a bit: now the first commit modifies the LogSlice method in the rac2 test to mirror the current semantics of the LogStorage methods. All the datadriven test changes are in the first commit; all other commits cause no changes to the datadriven test files, which demonstrates that this change is a no-op.

I am reviewing the datadriven test changes in the first commit, to make sure they demonstrate what they are meant to demonstrate. Other than that, there should be no blockers to reviewing the rest.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)

@pav-kv (Collaborator, Author) commented Oct 16, 2024

@sumeerbhola I factored out the semantics change into a separate PR #132791, please review that one first. Then I'll rebase this PR on top, which should be a no-op w.r.t. the datadriven tests.

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/testdata/range_controller/do_force_flush line 267 at r5 (raw file):

Previously, sumeerbhola wrote…

I think we need to adjust the test so it does demonstrate the state transitions it was trying to show before.
Same comment applies to send_q_watcher.

Done in #132791

@sumeerbhola (Collaborator) left a comment

Still reading; pushing some initial questions/comments.

Reviewed 2 of 12 files at r11, 3 of 11 files at r12.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/raft/storage.go line 199 at r12 (raw file):

	defer ms.Unlock()
	ms.callStats.term++
	if i < ms.ls.prev.index {

Should this be i <= ms.ls.prev.index, since ms.ls.prev.index is not accessible anymore?

Would the code benefit from defining a ms.ls.firstIndex() method?


pkg/raft/storage.go line 243 at r12 (raw file):

	defer ms.Unlock()

	prev := entryID{index: snap.Metadata.Index, term: snap.Metadata.Term}

Nit: why is this called prev, given that this is likely going to be the next state?


pkg/raft/storage.go line 245 at r12 (raw file):

	prev := entryID{index: snap.Metadata.Index, term: snap.Metadata.Term}
	// Check whether the snapshot is outdated.
	if prev.index <= ms.snapshot.Metadata.Index {

Do we not need to compare the terms because snapshots always represent committed data (since they represent the state machine)? Should we have an assertion that the term is not regressing?


pkg/raft/storage.go line 251 at r12 (raw file):

	ms.snapshot = snap
	// TODO(pav-kv): the term must be the last accepted term passed in.
	term := max(ms.ls.term, prev.term)

Can't the term regress in the LogSlice, since it could represent uncommitted data? That is, shouldn't the snapshot's term always be used?


pkg/raft/storage.go line 255 at r12 (raw file):

	return nil
}

(note to self: stopped here)

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)


pkg/raft/storage.go line 199 at r12 (raw file):

Previously, sumeerbhola wrote…

should this be i <= ms.ls.prev.index since ms.ls.prev.index is not accessible anymore?

Would the code benefit by defining a ms.ls.firstIndex() method.

This should be i < ms.ls.prev.index as is. For the prev entry, we should be able to return its term for log matching purposes. E.g. if we construct a MsgApp starting at the first log entry, we also need to set MsgApp.LogTerm to the term of the preceding entry.

// Term returns the term of the entry at the given index, which must be in the
// valid range: [FirstIndex()-1, LastIndex()]. The term of the entry before
// FirstIndex is retained for matching purposes even though the rest of that
// entry may not be available.
Term(index uint64) (uint64, error)

I could reinstate the firstIndex() method, but I didn't find it very useful myself because prev is descriptive enough (the entry before the first), and all the checks can be done against prev directly. This follows the convention established by the LogSlice struct and its many uses in this package.
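A sketch of the check under discussion (field and method names assumed from the quoted snippets; not the exact storage.go code):

```go
func (ms *MemoryStorage) Term(i uint64) (uint64, error) {
	ms.Lock()
	defer ms.Unlock()
	if i < ms.ls.prev.index {
		// Strictly below prev: compacted away, term unknown.
		return 0, ErrCompacted
	}
	if i == ms.ls.prev.index {
		// The prev entry's term is retained so that a MsgApp starting at
		// the first entry can still set MsgApp.LogTerm.
		return ms.ls.prev.term, nil
	}
	if last := ms.ls.prev.index + uint64(len(ms.ls.entries)); i > last {
		return 0, ErrUnavailable
	}
	return ms.ls.entries[i-ms.ls.prev.index-1].Term, nil
}
```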


pkg/raft/storage.go line 243 at r12 (raw file):

Previously, sumeerbhola wrote…

nit: why is this called prev given this is likely going to be the next state?

This name inherited the LogSlice terminology (see the LogSlice initialization below). The LogSlice.prev field is the entry preceding the first entry of the slice.

So this is sort of a "next prev", if you will. I renamed it to id to avoid the terminology clash. This ID is sort of self-explanatory in the code: the entryID at which the snapshot ends.


pkg/raft/storage.go line 245 at r12 (raw file):

Previously, sumeerbhola wrote…

Do we not need to compare the terms because snapshots always represent committed data (since they represent the state machine)? Should we have an assertion that the term is not regressing.

Added an assertion that the Metadata.Term does not regress.


To help speed up the review: this commit is a no-op. It basically replaces the entries slice (in which entries[0] is fake and denotes the "prev" entry) with a more straightforward LogSlice with the same meaning (LogSlice.prev == entries[0], and LogSlice.entries == entries[1:]).

Today's code does not try hard to enforce correctness, and neither does this PR. There are checks at other levels: the raftLog and unstable structs, and the raft/RawNode struct itself; so by the time things are written to Storage, they have been verified. We can look at hardening later, hence the various TODOs that I've added.
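For readers following along, the shapes being swapped look roughly like this (field layout pieced together from the snippets quoted in this review; treat it as an assumption):

```go
// entryID identifies a log entry by its index and term.
type entryID struct {
	index, term uint64
}

// LogSlice replaces the old representation in which entries[0] was a fake
// sentinel carrying the "prev" ID: now prev corresponds to the old
// entries[0], and entries to the old entries[1:].
type LogSlice struct {
	term    uint64  // leader term under which this slice is consistent
	prev    entryID // ID of the entry immediately preceding entries[0]
	entries []Entry // Entry as in the earlier sketch
}
```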


pkg/raft/storage.go line 251 at r12 (raw file):

Previously, sumeerbhola wrote…

Can't the term regress in the LogSlice since it could represent uncommitted data? That is, the snapshot's term should always be used?

That depends on which term we consider:

  • If it's the last entry term, it can regress.
  • If it's the "leader term", it cannot.

Here, LogSlice.term is supposed to be the leader term, which cannot regress. But it's not passed in from the caller yet, and it's also not used for anything internally in this MemoryStorage (though it should be in the future). So I just update it to something that keeps the LogSlice self-consistent. The TODO captures the need to update the leader term here to the one known to the upper layer, instead of making it up.

I could also leave LogSlice.term at zero because it's not used here. But it needs to be set for the rac2 tests that call LogSnapshot.LogSlice(), in one of the next commits.

Made this code simpler for now by setting LogSlice.term to the last entry term.

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 1 of 6 files at r10, 3 of 11 files at r12, 2 of 7 files at r15, 8 of 9 files at r16, 2 of 2 files at r17, 5 of 6 files at r18, 5 of 5 files at r19.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/raft/storage.go line 199 at r12 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

This should be i < ms.ls.prev.index as is. For the prev entry, we should be able to return its term for log matching purposes. E.g. if we construct a MsgApp starting at the first log entry, we also need to set MsgApp.LogTerm to the term of the preceding entry.

// Term returns the term of the entry at the given index, which must be in the
// valid range: [FirstIndex()-1, LastIndex()]. The term of the entry before
// FirstIndex is retained for matching purposes even though the rest of that
// entry may not be available.
Term(index uint64) (uint64, error)

I could reinstate the firstIndex() method, but I didn't find it much useful myself because prev is descriptive enough (the entry before the first), and all the checks can be done against prev directly. This follows the convention instated by the LogSlice struct and many uses of it in this package.

Thanks for the clarification.


pkg/raft/storage.go line 251 at r12 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

That depends on which term we consider:

  • If it's the last entry term, it can regress.
  • If it's the "leader term", it can not.

Here, LogSlice.term is supposed to be the leader term that can not regress. But it's not passed in from the caller yet, and also not used for anything internally in this MemoryStorage (though it should in the future). So I just update it to something so that the LogSlice is self-consistent. The TODO captures the need to update the leader term here to the one known to the upper layer, instead of making it up.

I could also leave LogSlice.term at zero because it's not used here. But it needs to be set for the rac2 tests that call LogSnapshot.LogSlice(), in one of the next commits.

Made this code simpler for now by setting LogSlice.term to the last entry term.

Ack


pkg/raft/storage.go line 150 at r16 (raw file):

	//
	// TODO(pav-kv): the term field of the logSlice is populated with best effort
	// to keep the logSlice valid, but it must be sourced from the upper layer.

What is the actual todo here? Is it to make it less best effort?


pkg/raft/storage.go line 278 at r16 (raw file):

		ms.snapshot.Metadata.ConfState = *cs
	}
	ms.snapshot.Data = data

This method is not part of raft.Storage so I had to guess based on the code that CreateSnapshot means that this is installing a snapshot at index i.
It's odd that ms.ls is not being modified to truncate the log, to start from i+1. I suppose this is because CreateSnapshot is being used for installing new versions of the state machine, and will happen with each state machine application, and has nothing to do with truncation/compaction. Is my understanding correct?


pkg/raft/storage.go line 293 at r16 (raw file):

		raftlogger.GetLogger().Panicf("compact %d is out of bound lastindex(%d)", index, last)
	}
	ms.ls = ms.ls.forward(index)

Shouldn't this also check that there is a snapshot at index >= index, i.e., that we are not creating a gap between the snapshot and the LogSlice? I suppose the method comment about raftLog.applied means that a higher layer is guaranteeing that, though I don't see raftLog as a member of MemoryStorage, so I'm unsure how to interpret it.


pkg/raft/storage.go line 324 at r16 (raw file):

		prefix := ms.ls.sub(ms.ls.prev.index, first-1)
		// NB: protect the suffix of the old slice from rewrites.
		ms.ls.entries = append(prefix[:len(prefix):len(prefix)], entries...)

Is this code using a cap of len(prefix) to make a shallow copy of LogSlice immutable?
Is that a general property of all LogSlices, and if yes, should it be documented where LogSlice is declared?


pkg/raft/storage.go line 226 at r17 (raw file):

func (ms *MemoryStorage) LogSnapshot() LogStorageSnapshot {
	// Copy the log slice, and protect it from appends. MemoryStorage never
	// overwrites entries in the log slice, so this is sufficient to get an

That partially answers my previous question.

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)


pkg/raft/storage.go line 150 at r16 (raw file):

Previously, sumeerbhola wrote…

What is the actual todo here? Is it to make it less best effort?

Yes, make it reflect the actual "last accepted term" / "leader term" of the log. Right now it's conservatively set to the last entry term.


pkg/raft/storage.go line 278 at r16 (raw file):

Previously, sumeerbhola wrote…

This method is not part of raft.Storage so I had to guess based on the code that CreateSnapshot means that this is installing a snapshot at index i.
It's odd that ms.ls is not being modified to truncate the log, to start from i+1. I suppose this is because CreateSnapshot is being used for installing new versions of the state machine, and will happen with each state machine application, and has nothing to do with truncation/compaction. Is my understanding correct?

Yeah, the log is not necessarily in sync with the state machine / snapshot. Also, this method usually comes together with a subsequent Compact.
Right now it's only used in tests (as well as the entire MemoryStorage struct).


pkg/raft/storage.go line 293 at r16 (raw file):

Previously, sumeerbhola wrote…

Shouldn't this also check that there is a snapshot at index >= index, i.e., we are not creating a gap between the snapshot and the LogSlice? I suppose the method comment about raftLog.applied means that a higher layer is guaranteeing that, though I don't see raftLog as a member of MemoryStorage so unsure how to interpret it.

Yeah, it's ensured at the higher level: applied >= snapshot.Index. If applied > snapshot.Index, it should in theory be possible to compact at an index > snapshot.Index, especially once we separate the raft log storage from the state machine storage. The only requirement is that the compaction index is <= the durable applied index.


pkg/raft/storage.go line 324 at r16 (raw file):

Previously, sumeerbhola wrote…

Is this code using a cap of len(prefix) to make a shallow copy of LogSlice immutable?
Is that a general property of all LogSlices, and if yes, should it be documented where LogSlice is declared?

Yes, LogSlice is immutable. The only allowed operation is appending to it. Will add a note about this in the LogSlice definition.

This capping is a self-defence mechanism of MemoryStorage. If the caller appends to this slice (which may happen inside the LogSnapshot.LogSlice() call, when we stitch together the storage part of the slice with the in-memory unstable slice), the original slice in MemoryStorage does not get corrupted.
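A standalone demonstration of the mechanism, with plain ints in place of log entries:

```go
package main

import "fmt"

func main() {
	a := []int{1, 2, 3, 4}
	_ = append(a[:2], 99) // cap is still 4: append reuses the backing array
	fmt.Println(a)        // [1 2 99 4]: a[2] was overwritten

	b := []int{1, 2, 3, 4}
	_ = append(b[:2:2], 99) // cap equals len: append must reallocate
	fmt.Println(b)          // [1 2 3 4]: untouched
}
```

Capping prefix at len(prefix) forces any later append to copy into fresh memory instead of scribbling over the entries that MemoryStorage still owns.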


pkg/raft/storage.go line 226 at r17 (raw file):

Previously, sumeerbhola wrote…

That partially answers my previous question.

I'll clarify the comment: it should say that MemoryStorage defends itself (because the caller can append to the slice), and that the caller is likewise defended from appends that MemoryStorage makes to this slice.

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)

Epic: none
Release note: none
@pav-kv (Collaborator, Author) commented Oct 18, 2024

Improved the commenting a bit in response to the review comments.

@pav-kv (Collaborator, Author) commented Oct 18, 2024

bors r=sumeerbhola

@craig craig bot merged commit c874696 into cockroachdb:master Oct 18, 2024
22 of 23 checks passed
@pav-kv pav-kv deleted the raft-mem-storage branch October 18, 2024 19:16