Block backfilling #12968
Conversation
fafbc6b to 94ff07f Compare
if err != nil {
	if errors.Is(err, db.ErrNotFound) {
		return s, s.recoverLegacy(ctx)
	}
todo: handle errors other than db.ErrNotFound
s.batchSeq = newBatchSequencer(s.nWorkers, s.ms.minimumSlot(), primitives.Slot(status.LowSlot), primitives.Slot(s.batchSize))
// Exit early if there aren't going to be any batches to backfill.
if primitives.Slot(status.LowSlot) < s.ms.minimumSlot() {
	return
It might be nice to log something here to indicate nothing needs to be done. Pretty relevant condition imo
}
func (s *Service) initVerifier(ctx context.Context) (*verifier, error) {
	cps, err := s.su.originState(ctx)
Could there be a scenario where the to-be-validated blocks are newer than the origin state?
No, I don't think so. The origin state is the state used to initiate the checkpoint sync, so it is by definition the state we are backfilling from, and it must be finalized, and the descendant of all the blocks that we backfill.
A future case to consider, which we don't currently support, is triggering a new checkpoint sync on top of an existing db - see #13003
Once we do enable that feature, we should always write the updated origin to the db and use that state for the validator registry. Note that we may need to track additional data in the status db record to support this, because currently we only track the low/high end of one range of blocks to backfill, and stategen uses the low end in this db record to figure out if blocks are available. If we allow multiple checkpoint syncs against the same db, we need to track the target of the backfill separately from the lowest available slot, and we need to account for "holes" in the range.
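A purely illustrative sketch of the kind of record that could support multiple checkpoint syncs against one db, as described above. This is not the current BackfillStatus schema; all names here are hypothetical.

// backfillRange describes one contiguous run of backfilled blocks (hypothetical).
type backfillRange struct {
	Target uint64 // oldest slot this backfill run intends to reach
	Low    uint64 // lowest slot actually imported so far for this run
	High   uint64 // slot of the checkpoint origin this run descends from
}

// backfillCoverage tracks every run separately; gaps between entries are the
// "holes" that block-availability checks would need to respect (hypothetical).
type backfillCoverage struct {
	Ranges []backfillRange
}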
@@ -145,6 +145,22 @@ func (b *BeaconState) PubkeyAtIndex(idx primitives.ValidatorIndex) [fieldparams.
	return bytesutil.ToBytes48(b.validators[idx].PublicKey)
}

// PublicKeys builds a list of all validator public keys, with each key's index aligned to its validator index.
func (b *BeaconState) PublicKeys() [][fieldparams.BLSPubkeyLength]byte {
Don't we have this cached? This is a large slice; it's probably better to grab the cache first, update it here on calls to this function with the new pubkeys from the state, and return the cache.
For the backfill service, this is only called once at start-up; validator public keys are cached in verify.go's verifier's keys. With that said, I think your suggestion is still valid.
We only run this once at startup, for the checkpoint state we're backfilling from. I don't know if it will be in cache at that point.
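For reference, a minimal sketch of what the accessor under discussion might look like, assuming the state keeps its registry in a validators slice as in the PubkeyAtIndex snippet above; locking and caching are deliberately elided.

// PublicKeys builds a list of all validator public keys, with each key's index
// aligned to its validator index. Sketch only: it copies every key on each call
// instead of consulting a cache, which is tolerable if it runs once at startup.
func (b *BeaconState) PublicKeys() [][fieldparams.BLSPubkeyLength]byte {
	keys := make([][fieldparams.BLSPubkeyLength]byte, len(b.validators))
	for i, v := range b.validators {
		keys[i] = bytesutil.ToBytes48(v.PublicKey)
	}
	return keys
}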
b.scheduled = time.Now()
switch b.state {
case batchErrRetryable:
	b.retries += 1
Do we want to log here?
	seq []batch
}
var errCannotDecreaseMinimum = errors.New("The minimum backfill slot can only be increased, not decreased")
Probably move this up with the other errors.
if b.end != r.end {
	return false
}
return b.seq >= r.seq
Should we worry about misuse where we replace a batch that has batchImportComplete with something that doesn't? Specifically, sending this on the update below could cause trouble.
I can add a check that always returns false if r has state batchImportComplete.
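A sketch of what that guard might look like in replaces, assuming the fields shown in the snippet above; any comparisons on other fields (such as the batch start slot) are elided.

// replaces reports whether b should overwrite r in the sequencer. The new
// guard: a batch that has already been imported is never replaced, regardless
// of its sequence number.
func (b batch) replaces(r batch) bool {
	if r.state == batchImportComplete {
		return false
	}
	if b.end != r.end {
		return false
	}
	return b.seq >= r.seq
}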
func (b batch) withRetryableError(err error) batch {
	b.err = err
	return b.withState(batchErrRetryable)
Side note: we might want to handle the error differently down the line. For example, a validation error is different from a p2p error. But I guess it's all the same now; if there's an error we just retry it.
beacon-chain/sync/backfill/status.go
Outdated
if err := s.store.SaveBlock(ctx, b); err != nil {
	return nil, errors.Wrapf(err, "error saving backfill block with root=%#x, slot=%d", b.Root(), b.Block().Slot())
Nit: I found that StatusUpdater not only updates statuses but also handles DB block saving and finalized root updates. This seems to diverge from the single-responsibility principle implied by the name. Maybe renaming a few things would fix this?
I renamed the type to Store to reflect the fact that it exposes a broader swath of the db interface. What do you think?
beacon-chain/p2p/peers/assigner.go
Outdated
if len(peers) < required {
	log.WithFields(logrus.Fields{
		"suitable": len(peers),
		"required": required}).Info("Unable to assign peer while suitable peers < required ")
This to me smells more like a WARN than INFO
beacon-chain/p2p/peers/assigner.go
Outdated
// the number of outbound requests to each peer from a given component.
func (a *Assigner) Assign(busy map[peer.ID]bool, n int) ([]peer.ID, error) {
	best, err := a.freshPeers()
	ps := make([]peer.ID, 0, n)
Shouldn't this line be after the if err != nil { check?
Do we want to handle the case of n=0 for the input?
Hmm, should that be an empty list or an error? Maybe an error, since that would be a strange bug on the caller side.
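A sketch combining both suggestions, assuming freshPeers returns the candidate list and that filtering against the busy map happens inside Assign; the loop body is illustrative rather than the PR's actual implementation.

func (a *Assigner) Assign(busy map[peer.ID]bool, n int) ([]peer.ID, error) {
	if n == 0 {
		// Treat a zero-peer request as a caller bug rather than silently returning nothing.
		return nil, errors.New("assigner: n must be greater than zero")
	}
	best, err := a.freshPeers()
	if err != nil {
		return nil, err
	}
	// Only allocate once we know the call can proceed.
	ps := make([]peer.ID, 0, n)
	for _, p := range best {
		if busy[p] {
			continue
		}
		ps = append(ps, p)
		if len(ps) == n {
			break
		}
	}
	return ps, nil
}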
86c815c to 8b5400e Compare
// Assumes invariant that batches complete and update is called in order.
// This should be true because the code using the sequencer doesn't know the expected parent
// for a batch until it imports the previous batch.
if c.seq[i].state == batchImportComplete {
What happens if b.state already has batchImportComplete? Should we even allow that?
Did you mean should we allow c.seq[i] = b if c.seq[i].state == batchImportComplete? In that instance replaces will return false, so we won't overwrite it.
	continue
}
// Move the unfinished batches to overwrite the finished ones.
c.seq[i-done] = c.seq[i]
Something smells here. If i=0 and it is batchImportComplete, wouldn't we get c.seq[0-1] and underflow?
There's a continue in the batchImportComplete conditional where done is incremented, so i can't ever be less than done.
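Putting the two snippets together shows why the index arithmetic is safe; the surrounding loop structure below is assumed from the quoted lines.

done := 0
for i := 0; i < len(c.seq); i++ {
	// Assumes invariant that batches complete and update is called in order.
	if c.seq[i].state == batchImportComplete {
		done++
		continue
	}
	// Move the unfinished batches to overwrite the finished ones.
	// done is only incremented on iterations that continue above, so done <= i
	// here and i-done can never underflow.
	c.seq[i-done] = c.seq[i]
}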
@@ -141,7 +141,7 @@ func ValidateBLSToExecutionChange(st state.ReadOnlyBeaconState, signed *ethpb.Si
//	next_validator_index = ValidatorIndex((expected_withdrawals[-1].validator_index + 1) % len(state.validators))
//	state.next_withdrawal_validator_index = next_validator_index
// else:
-//	# Advance sweep by the max length of the sweep if there was not a full set of withdrawals
+//	# FillFwd sweep by the max length of the sweep if there was not a full set of withdrawals
bad refactor
@@ -202,3 +202,14 @@ func ParseWeakSubjectivityInputString(wsCheckpointString string) (*v1alpha1.Chec
		Root: bRoot,
	}, nil
}

// MinEpochsForBlockRequests computes the number of epochs of block history that we need to maintain,
// relative to the current epoch, per the p2p specs. This is used to compute the slot where backfill is complete.
"backfill is complete" - Does this mean I cannot backfill any further? I was imagining one could backfill all the way back to genesis, thus having the best of both worlds: an archival node that follows the head as soon as possible.
As mentioned in the PR description, we'll do a follow-up feature to add a flag like --backfill-to-epoch that will allow the user to specify an earlier backfill target: --backfill-to-epoch=0 to go all the way to genesis. Actually, this reminds me that I need to file a separate issue for this; currently we only have #13003, which is related but doesn't cover it. Just added this one: #13031
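For context, the p2p spec defines MIN_EPOCHS_FOR_BLOCK_REQUESTS = MIN_VALIDATOR_WITHDRAWABILITY_DELAY + CHURN_LIMIT_QUOTIENT // 2, which is 33024 epochs (roughly five months) with mainnet presets. A sketch of the helper under discussion, assuming Prysm's usual config accessors:

// MinEpochsForBlockRequests computes the number of epochs of block history
// that we need to maintain, relative to the current epoch, per the p2p specs.
func MinEpochsForBlockRequests() primitives.Epoch {
	return params.BeaconConfig().MinValidatorWithdrawabilityDelay +
		primitives.Epoch(params.BeaconConfig().ChurnLimitQuotient/2)
}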
While this is the minimum required by the spec, shouldn't backfill-to-epoch=0 be the default? The reasoning is that by default a node would always maintain all of the beacon history. If a user does not care about it, they can then just give a recent epoch.
My stance is that the main purpose of beacon nodes is to participate in the consensus protocol. Following from the principles of finalization and weak subjectivity, most nodes are really only interested in blocks forward from the beginning of the weak subjectivity period. Older history is not useful for participating in consensus; it is only interesting for archival purposes. So our default behavior should not be to download older history that is not useful to most of the network.
I am open to having my mind changed on this issue :)
Rather than participating in consensus, having the ability to download and serve historical blocks should be an expected duty of a normal node. There is no alternate protocol to serve historical blocks from the consensus layer network. If all nodes simply stop saving historical blocks, you would end up with very few peers who would have full chain histories. It is the same reason execution layer nodes also save all historical blocks even when snap syncing.
I agree with @kasey here and would leave this default. Blocks are kept in the EL and that allows state recovery. We do not need the CL blocks more than for accounting purposes on archive nodes, and it's their business to have these data, instead of the default user providing them.
Strongly disagree; it's the same reason execution clients have not relinquished support for serving/persisting historical blocks. Even though those blocks are purely used to access historical state, which is of no use now, users are still able to do so. We can continue the conversation offline.
beacon-chain/db/kv/backfill.go
Outdated
// BackfillStatus retrieves the most recently saved version of the BackfillStatus protobuf struct.
// This is used to persist information about backfill status across restarts.
func (s *Store) BackfillStatus(ctx context.Context) (*dbval.BackfillStatus, error) {
	ctx, span := trace.StartSpan(ctx, "BeaconDB.SaveBackfillStatus")
- ctx, span := trace.StartSpan(ctx, "BeaconDB.SaveBackfillStatus")
+ ctx, span := trace.StartSpan(ctx, "BeaconDB.BackfillStatus")
beacon-chain/db/kv/error.go
Outdated
var errIncorrectBlockParent = errors.New("Unexpected missing or forked blocks in a []ROBlock")
var errFinalizedChildNotFound = errors.New("Unable to find finalized root descending from backfill batch")
var errNotConnectedToFinalized = errors.New("Unable to finalize backfill blocks, finalized parent_root does not match")
error should be lowercase
	}
}

return s.db.Update(func(tx *bolt.Tx) error {
We can split this code into two parts and perform everything up to and including if !bytes.Equal(fcc.ParentRoot, blocks[len(blocks)-1].RootSlice()) at the beginning of the function. This will allow an early exit without having to encode the input blocks.
The way things are currently written, we are fairly confident the tail root will be present, because the backfill service code checks that batches are all linked to the finalized origin. This is a defensive check to make sure that we don't break the correctness of the finalized index.
I'm not sure if you meant to split this by using an additional db tx to check the tail, or if you were thinking that we would encode the checkpoints inside the func? In general my thinking is that the database is a more contentious resource than ssz encoding and compression - these are very small values, and encoding them basically amounts to copying two 32-byte slices. So IMO it's better to minimize the number of db transactions (hence check and set in one), and to minimize the amount of time spent inside the db tx (hence doing any work possible outside of the db tx).
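A rough sketch of the check-and-set shape being described: the encoding work happens before the transaction, while the defensive parent-root check stays inside it. encodeIndexEntries is a hypothetical helper, and the decode/container names are taken from the surrounding discussion, so details may not match the final code.

// Encode outside the transaction so the time spent holding the db lock stays small.
encoded, roots, err := encodeIndexEntries(blocks) // hypothetical helper
if err != nil {
	return err
}
return s.db.Update(func(tx *bolt.Tx) error {
	bkt := tx.Bucket(finalizedBlockRootsIndexBucket)
	enc := bkt.Get(finalizedChildRoot[:])
	if enc == nil {
		return errFinalizedChildNotFound
	}
	fcc := &ethpb.FinalizedBlockRootContainer{}
	if err := decode(ctx, enc, fcc); err != nil {
		return err
	}
	// Defensive check: the finalized child must descend from the last block in
	// the batch, otherwise writing would corrupt the finalized index.
	if !bytes.Equal(fcc.ParentRoot, blocks[len(blocks)-1].RootSlice()) {
		return errNotConnectedToFinalized
	}
	for i := range encoded {
		if err := bkt.Put(roots[i], encoded[i]); err != nil {
			return err
		}
	}
	return nil
})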
beacon-chain/db/kv/wss.go
Outdated
}

if err = s.SaveBackfillStatus(ctx, bf); err != nil {
	return errors.Wrap(err, "unable to save backfill status data to db for checkpoint sync.")
- return errors.Wrap(err, "unable to save backfill status data to db for checkpoint sync.")
+ return errors.Wrap(err, "unable to save backfill status data to db for checkpoint sync")
option go_package = "github.com/prysmaticlabs/prysm/v4/proto/dbval;dbval";

message BackfillStatus {
Can you add comments explaining the meaning of these fields? I would expect both origin and low to hold low-end values, but this can't be true.
beacon-chain/p2p/peers/assigner.go
Outdated
// NewAssigner assists in the correct construction of an Assigner by code in other packages,
// assuring all the important private member fields are given values.
// The finalized parameter is used to specify a minimum finalized epoch that the peers must agree on.
- // The finalized parameter is used to specify a minimum finalized epoch that the peers must agree on.
+ // The finalized checkpointer parameter is used to specify a minimum finalized epoch that the peers must agree on.
beacon-chain/sync/backfill/batch.go
Outdated
log "github.com/sirupsen/logrus" | ||
) | ||
|
||
// ErrChainBroken is usedbatch with no results, skipping importer when a backfill batch can't be imported to the db because it is not known to be the ancestor |
- // ErrChainBroken is usedbatch with no results, skipping importer when a backfill batch can't be imported to the db because it is not known to be the ancestor
+ // ErrChainBroken is used for batch with no results, skipping importer when a backfill batch can't be imported to the db because it is not known to be the ancestor
var errMaxBatches = errors.New("backfill batch requested in excess of max outstanding batches")
var errEndSequence = errors.New("sequence has terminated, no more backfill batches will be produced")
var errCannotDecreaseMinimum = errors.New("The minimum backfill slot can only be increased, not decreased")
- var errCannotDecreaseMinimum = errors.New("The minimum backfill slot can only be increased, not decreased")
+ var errCannotDecreaseMinimum = errors.New("the minimum backfill slot can only be increased, not decreased")
6f5e36e to d2dbaa5 Compare
batchesWaiting = promauto.NewGauge(
	prometheus.GaugeOpts{
		Name: "backfill_importable_batches_waiting",
		Help: "Number of batches that are ready to be imported once the they can be connected to the existing chain.",
Extra "the":
- Help: "Number of batches that are ready to be imported once the they can be connected to the existing chain.",
+ Help: "Number of batches that are ready to be imported once they can be connected to the existing chain.",
proto/dbval/dbval.proto
Outdated
// low_slot is the slot of the last block that backfill will attempt to download and import.
// This is determined by MIN_EPOCHS_FOR_BLOCK_REQUESTS, or by a user-specified override.
uint64 low_slot = 1;
// low_slot is the root of the last block that backfill will attempt to download and import.
- // low_slot is the root of the last block that backfill will attempt to download and import.
+ // low_root is the root of the last block that backfill will attempt to download and import.
// BackfillFinalizedIndex updates the finalized index for a contiguous chain of blocks that are the ancestors of the
// given finalized child root. This is needed to update the finalized index during backfill, because the usual
// updateFinalizedBlockRoots has assumptions that are incompatible with backfill processing.
func (s *Store) BackfillFinalizedIndex(ctx context.Context, blocks []blocks.ROBlock, finalizedChildRoot [32]byte) error {
What is the meaning of the word Index here?
Index is being used as a noun here. The "finalized index" comprises the values in the db bucket finalizedBlockRootsIndexBucket. Each value is keyed by the root of a finalized block and contains the roots of the parent and child of the given root. The finalized index is Prysm's view on which block roots are canonical <= the most recently finalized root (which is the only one stored in fork choice).
Usually we update this index from the blockchain package when a new finalized checkpoint is determined. But in the case of backfill, we are adding blocks that bypass all those mechanisms, so we want to update the index to be aware of the earlier blocks.
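To make the layout concrete, a sketch of what each entry in the finalized index holds, per the description above; the struct is illustrative, not Prysm's actual container type.

// finalizedBlockRootsIndexBucket layout (illustrative):
//   key:   32-byte root of a finalized, canonical block
//   value: encoded container holding the roots of that block's parent and child
// Together the entries form a linked chain of canonical roots at or below the
// most recently finalized root, which is the only root fork choice retains.
type finalizedIndexEntry struct {
	ParentRoot []byte // root of the canonical parent block
	ChildRoot  []byte // root of the canonical child block
}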
beacon-chain/sync/backfill/verify.go
Outdated
type VerifiedROBlocks []blocks.ROBlock

type verifier struct {
	// chkptVals is the set of validators from the state used to initialize the node via checkpoint sync.
There is no chkptVals field.
beacon-chain/sync/backfill/verify.go
Outdated
	forkDomains map[[4]byte][]byte
}

func (bs verifier) verify(blks []interfaces.ReadOnlySignedBeaconBlock) (VerifiedROBlocks, error) {
Receiver name should probably be v
Good catch! This had some other bad name before where bs made sense. Fixed the bad name, kept the bad receiver.
beacon-chain/sync/backfill/verify.go
Outdated
"github.com/prysmaticlabs/prysm/v4/time/slots" | ||
) | ||
|
||
var errInvalidBatchChain = errors.New("parent_root of block does not match root of previous") |
- var errInvalidBatchChain = errors.New("parent_root of block does not match root of previous")
+ var errInvalidBatchChain = errors.New("parent_root of block does not match root of previous block")
beacon-chain/sync/backfill/verify.go
Outdated
if i > 0 && result[i-1].Root() != result[i].Block().ParentRoot() {
	p, b := result[i-1], result[i]
	return nil, errors.Wrapf(errInvalidBatchChain,
		"slot %d parent_root=%#x, slot %d root = %#x",
"slot %d parent_root=%#x, slot %d root = %#x", | |
"slot %d parent_root=%#x, slot %d root=%#x", |
beacon-chain/sync/backfill/verify.go
Outdated
// chkptVals is the set of validators from the state used to initialize the node via checkpoint sync.
keys [][fieldparams.BLSPubkeyLength]byte
maxVal primitives.ValidatorIndex
vr []byte
Can you make this variable name a bit more clear? I suggest valRoot
beacon-chain/sync/backfill/verify.go
Outdated
maxVal primitives.ValidatorIndex
vr []byte
fsched forks.OrderedSchedule
dt [bls.DomainByteLength]byte
It was hard for me to figure out what dt means. Can you rename it to domain?
Short for DomainType. domain would be a confusing name, because the domain is derived jointly from the DomainType, Fork Version, and Genesis Validator Root. Maybe dtype?
In general I think that the meaning of a value can be made clear by context. This value is initialized like dt: params.BeaconConfig().DomainBeaconProposer (where the value assigned is a DomainType) and passed to ComputeDomain (where it's known to be the DomainType).
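A sketch of how dt and vr might be combined per fork when the verifier precomputes forkDomains, assuming Prysm's signing.ComputeDomain helper and an OrderedSchedule entry exposing its fork version; exact field names and the enclosing constructor may differ.

// Precompute one signing domain per fork version so per-block signature
// verification only needs a map lookup.
for _, entry := range v.fsched {
	d, err := signing.ComputeDomain(v.dt, entry.Version[:], v.vr)
	if err != nil {
		return nil, err
	}
	v.forkDomains[entry.Version] = d
}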
// VerifiedROBlocks represents a slice of blocks that have passed signature verification.
type VerifiedROBlocks []blocks.ROBlock

type verifier struct {
The verifier code has 0 tests
I'm ambivalent about adding another sort of protobuf dependency, maybe json would be better?
My opinion is that we should avoid protos unless we need SSZ code for the new type.
// - They are in state batchImportable, which means their data has been downloaded and proposer signatures have been verified.
// - There are no batches that are not in state batchImportable between them and the start of the slice. This ensures that they
//   can be connected to the canonical chain, either because the root of the last block in the batch matches the parent_root of
//   the oldest block in the canonical chain, or because the root of the last block in the batch mathces the parent_root of the
- // the oldest block in the canonical chain, or because the root of the last block in the batch mathces the parent_root of the
+ // the oldest block in the canonical chain, or because the root of the last block in the batch matches the parent_root of the
beacon-chain/sync/backfill/pool.go
Outdated
func (p *p2pBatchWorkerPool) todo(b batch) {
	// Intercept batchEndSequence batches so workers can remain unaware of this state.
	// Workers don't know what to do with batchEndSequence batches. They are a signal to the pool that the batcher
	// has stopped producing things for the workers to do and the pool is close to winding down. See Complete()
- // has stopped producing things for the workers to do and the pool is close to winding down. See Complete()
+ // has stopped producing things for the workers to do and the pool is close to winding down. See complete()
case err := <-p.shutdownErr:
	return batch{}, errors.Wrap(err, "fatal error from backfill worker pool")
case <-p.ctx.Done():
	return batch{}, p.ctx.Err()
Adding the same log as in batchRouter might be useful.
// Start begins the runloop of backfill.Service in the current goroutine.
func (s *Service) Start() {
	if !s.enabled {
		log.Info("exiting backfill service; not enabled")
log.Info("exiting backfill service; not enabled") | |
log.Info("Exiting backfill service; not enabled") |
}()
clock, err := s.cw.WaitForClock(ctx)
if err != nil {
	log.WithError(err).Fatal("backfill service failed to start while waiting for genesis data")
log.WithError(err).Fatal("backfill service failed to start while waiting for genesis data") | |
log.WithError(err).Fatal("Backfill service failed to start while waiting for genesis data") |
type Service struct {
	ctx context.Context
	su  *Store
Why is this called su? Not sure where the u comes from. Can you change it to store?
It was called StatusUpdater before and then renamed. I will fix!
status := s.su.status()
s.batchSeq = newBatchSequencer(s.nWorkers, s.ms.minimumSlot(), primitives.Slot(status.LowSlot), primitives.Slot(s.batchSize))
// Exit early if there aren't going to be any batches to backfill.
if primitives.Slot(status.LowSlot) < s.ms.minimumSlot() {
- Can this conditional be moved up to avoid initializing things that are not needed anyway, e.g. s.batchSeq? (See the sketch below.)
- Shouldn't this be a <=?
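A sketch of the reordering, with the log requested earlier in the review added for the no-op case; whether < or <= is the right boundary depends on whether minimumSlot itself still needs to be imported, so the comparison below is only the reviewer's suggestion.

status := s.su.status()
// Exit early if there aren't going to be any batches to backfill.
if primitives.Slot(status.LowSlot) <= s.ms.minimumSlot() {
	log.WithFields(logrus.Fields{
		"lowSlot":     status.LowSlot,
		"minimumSlot": s.ms.minimumSlot(),
	}).Info("Backfill not needed; node already has the required block history")
	return
}
s.batchSeq = newBatchSequencer(s.nWorkers, s.ms.minimumSlot(), primitives.Slot(status.LowSlot), primitives.Slot(s.batchSize))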
log.WithError(err).Fatal("Non-recoverable error in backfill service, quitting.") | ||
return false |
Fatal will kill the app, so the return false will never be reached.
for i := range importable {
	ib := importable[i]
	if len(ib.results) == 0 {
		log.WithFields(ib.logFields()).Error("batch with no results, skipping importer")
log.WithFields(ib.logFields()).Error("batch with no results, skipping importer") | |
log.WithFields(ib.logFields()).Error("Batch with no results, skipping importer") |
// network utilization at the cost of higher memory.
BackfillWorkerCount = &cli.IntFlag{
	Name: backfillWorkerCountName,
	Usage: "Number of concurrent backfill batch requests." +
Usage: "Number of concurrent backfill batch requests." + | |
Usage: "Number of concurrent backfill batch requests. " + |
// AvailableBlocker can be used to check whether there is a finalized block in the db for the given slot.
// This interface is typically fulfilled by backfill.Store.
type AvailableBlocker interface {
Why is this interface defined in a file called coverage.go?
My rationale was that coverage describes which slice of the overall block history is covered. "Is this block available?" is analogous to "is this range in block history covered?".
// for the configured finalized epoch. At most `n` peers will be returned. The `busy` param can be used
// to filter out peers that we know we don't want to connect to, for instance if we are trying to limit
// the number of outbound requests to each peer from a given component.
func (a *Assigner) Assign(busy map[peer.ID]bool, n int) ([]peer.ID, error) {
Some unit tests would be nice. The fact that this function calls freshPeers might make it hard, though.
Refactored to be easier to test and added unit tests here: e055269
beacon-chain/sync/backfill/status.go
Outdated
	s.genesisSync = true
	return nil
for _, b := range blocks {
	if err := s.store.SaveBlock(ctx, b); err != nil {
@terencechain pointed out that the db/kv block methods use an LRU under the hood. Note to self to look into SaveBlock and determine whether backfilling will invalidate this cache.
This has been addressed; @terencechain please take a look.
8b8d486 to 9eb0f87 Compare
@@ -138,6 +139,9 @@ var appFlags = []cli.Flag{
	flags.JwtId,
	storage.BlobStoragePathFlag,
	storage.BlobRetentionEpochFlag,
	backfill.EnableExperimentalBackfill,
Can you add this as a feature flag instead? Since we will eventually deprecate it once the feature is the default.
554234c to 11c9ed0 Compare
c18ffa4 to eb3cdf3 Compare
As long as we're sure the AvailableBlocker is initialized correctly during node startup, defaulting to assuming we aren't in a checkpoint sync simplifies things greatly for tests.
7e90f31 to a76ec16 Compare
What type of PR is this?
Feature

What does this PR do? Why is it needed?
This PR implements backfilling for blocks. When a node starts from checkpoint sync, a database record is created to track the slot and root of the origin block. If MIN_EPOCHS_FOR_BLOCK_REQUESTS is an earlier epoch than the lowest slot recorded in this db entry, the backfill service will initialize and start back-filling blocks until we have coverage over the MIN_EPOCHS_FOR_BLOCK_REQUESTS span. Note that this PR does not include deneb blob syncing, which will be implemented in a follow-up PR.

Other notes for review
This is a draft PR. I will update this section with notes on issues that need to be addressed before this can be merged:
- clean up crufty things that are obvious for a full PR, like exported defs without comments, typos, TODOs in comments
- peer downscoring for bad batches
- backfill should be disabled by default while this feature is in testing. Add a flag like --enable-experimental-backfill so that we can merge this branch sooner without turning it on by default.
- There are a few parameters that we want to be able to tune with cli flags. The number of p2p workers and the size of the batches are top of mind.

Some commentary on the design to help with review:
- backfill.Store exposes db methods needed to manage the backfill. A new db method has been added to update the finalized index based on imported blocks. I added the proto/dbval package for values like BackfillStatus that represent some structured data we want to store in the db. I'm ambivalent about adding another sort of protobuf dependency, maybe json would be better?
- verifier exploits the fact that all blocks to be backfilled were signed by validators that are present in the checkpoint sync origin state. At startup the public keys for the validator registry are copied from the origin state.
- batch is the unit of work for backfill. It encapsulates everything a worker needs to process it, plus state information that other components can use to act on the results (retry, import blocks, etc.).
- peers.Assigner is an attempt to extricate the initial-sync peer selection logic to be used by other components. It is used by the backfill pool to pick the next best peer when routing a batch to be worked on.
- pool manages peer selection for batches and plumbing batches between the Service and the workers. It handles issues like persistently trying to find peers when one can't immediately be found, ensuring that the same peer doesn't receive multiple requests concurrently, and handling the special case where all batches have completed, shutting down the worker pool and telling the Service that everything is done.
- worker is the simplest component. It represents a worker loop that runs in a separate goroutine, reading from a shared fan-out channel until it receives a batch, then does the work described by that batch.
- backfill.Service: reading the Start method of Service is probably the best way to get an overview of how the pieces come together. The run loop there orchestrates the overall backfill process. It uses batchSequencer.sequence to populate the worker pool's todo queue, receives completed batches from the pool, updates the batcher to notify it of state changes, requests importable batches from the batcher, imports them, calls update with the results, rinse and repeat (sketched below).
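A condensed sketch of that run loop, using the component names from the list above; importBatch is a placeholder name, and error handling and shutdown paths are elided.

for {
	// Ask the sequencer for the next batches and queue them for the workers.
	for _, b := range s.batchSeq.sequence() {
		s.pool.todo(b)
	}
	// Block until a worker hands back a completed (or failed) batch.
	b, err := s.pool.complete()
	if err != nil {
		return // pool shut down or hit a fatal error
	}
	s.batchSeq.update(b)
	// Import any contiguous run of verified batches, oldest first.
	for _, ib := range s.batchSeq.importable() {
		s.batchSeq.update(s.importBatch(ctx, ib))
	}
}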