
[EFM] Recoverable Random Beacon State Machine #6771

Conversation

@durkmurder (Member) commented Dec 2, 2024

#6725

Context

The goal of this PR is to implement a state machine that covers all possible cases which can occur when performing DKG. Previously, our implementation enforced some of the rules but not all of them; what made things worse is that the logic and interfaces were not explicit enough for an engineer to easily understand them. In this PR we attempt a robust approach to handling the possible states and state transitions for both the happy path and the recovery path.

Some key points:

  • badger.RecoverablePrivateBeaconKeyStateMachine implements the state machine itself.
  • State machine transitions are initiated from dkg.ReactorEngine and dkg.BeaconKeyRecovery.
  • The state machine does not care about the origin of the key. It could be obtained from different sources (a successful DKG, or manual injection by the operator); all that matters is that the key is committed. The caller needs to ensure that the respective public key is part of the EpochCommit for the respective epoch.

Link to the diagram which describes the state machine:
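The transition rules described above can be sketched as a small Go transition table. This is a minimal sketch, not the actual flow-go implementation: the state names DKGStateCompleted, DKGStateFailure, and RandomBeaconKeyCommitted appear in this PR, while the uninitialized/started state names and the exact transition set are assumptions.

```go
package main

import "fmt"

// DKGState is a simplified stand-in for flow.DKGState.
type DKGState int

const (
	DKGStateUninitialized DKGState = iota // assumed initial state name
	DKGStateStarted                       // assumed name
	DKGStateCompleted
	DKGStateFailure
	RandomBeaconKeyCommitted
)

// allowedTransitions encodes which transitions the state machine accepts.
// Per the review thread, self-transitions are rejected except for
// DKGStateFailure -> DKGStateFailure and the terminal
// RandomBeaconKeyCommitted -> RandomBeaconKeyCommitted.
var allowedTransitions = map[DKGState][]DKGState{
	DKGStateUninitialized:    {DKGStateStarted, DKGStateFailure, RandomBeaconKeyCommitted},
	DKGStateStarted:          {DKGStateCompleted, DKGStateFailure, RandomBeaconKeyCommitted},
	DKGStateCompleted:        {DKGStateFailure, RandomBeaconKeyCommitted},
	DKGStateFailure:          {DKGStateFailure, RandomBeaconKeyCommitted},
	RandomBeaconKeyCommitted: {RandomBeaconKeyCommitted},
}

// CanTransition reports whether moving from one state to another is allowed.
func CanTransition(from, to DKGState) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(DKGStateFailure, DKGStateFailure))          // self-transition allowed for failure
	fmt.Println(CanTransition(RandomBeaconKeyCommitted, DKGStateFailure)) // committed is terminal
}
```

Note that every non-terminal state can reach RandomBeaconKeyCommitted, reflecting the point above that the committed key may come from a successful DKG or from operator injection.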

@codecov-commenter commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 20.96774% with 294 lines in your changes missing coverage. Please review.

Project coverage is 41.68%. Comparing base (af44135) to head (aa377c2).

Files with missing lines Patch % Lines
storage/mock/dkg_state_reader.go 0.00% 98 Missing ⚠️
storage/mock/epoch_recovery_my_beacon_key.go 0.00% 64 Missing ⚠️
storage/mock/dkg_state.go 0.00% 54 Missing ⚠️
cmd/consensus/main.go 0.00% 38 Missing ⚠️
engine/consensus/dkg/reactor_engine.go 37.50% 19 Missing and 1 partial ⚠️
model/flow/dkg.go 0.00% 10 Missing ⚠️
cmd/util/cmd/common/node_info.go 0.00% 3 Missing ⚠️
storage/badger/dkg_state.go 94.11% 2 Missing and 1 partial ⚠️
storage/errors.go 81.81% 2 Missing ⚠️
engine/common/grpc/forwarder/forwarder.go 0.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@                   Coverage Diff                    @@
##           feature/efm-recovery    #6771      +/-   ##
========================================================
- Coverage                 41.72%   41.68%   -0.05%     
========================================================
  Files                      2031     2033       +2     
  Lines                    180552   180748     +196     
========================================================
  Hits                      75341    75341              
- Misses                    99017    99202     +185     
- Partials                   6194     6205      +11     
Flag Coverage Δ
unittests 41.68% <20.96%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Resolved review thread (outdated): engine/consensus/dkg/reactor_engine.go
@@ -427,10 +431,10 @@ func (e *ReactorEngine) end(nextEpochCounter uint64) func() error {
 	// has already abandoned the happy path, because on the happy path the ReactorEngine is the only writer.
 	// Then this function just stops and returns without error.
 	e.log.Warn().Err(err).Msgf("node %s with index %d failed DKG locally", e.me.NodeID(), e.controller.GetIndex())
-	err := e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateDKGFailure)
+	err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
 	if err != nil {
 		if errors.Is(err, storage.ErrAlreadyExists) {
A Member commented:

SetDKGState doesn't return this error type, it returns InvalidTransitionRandomBeaconStateMachineErr.

The comment above is outdated now, but it implies that we might already have persisted a failure state:

By convention, if we are leaving the happy path, we want to persist the first failure symptom

The state machine does not allow transitions from one state to itself (except for RandomBeaconKeyCommitted). If, as the current comment suggests, a failure state is already set at this point, we will throw InvalidTransitionRandomBeaconStateMachineErr, as DKGStateFailure -> DKGStateFailure is not a valid transition. I don't think this is the case, but I'll return to this after reading further through the PR.

Suggestions:

  • update comment on lines 428-432
  • remove ErrAlreadyExists check
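The reviewer's point, that the caller should expect the state machine's dedicated sentinel rather than storage.ErrAlreadyExists, can be sketched as follows. The error type and the setDKGState function here are hypothetical stand-ins modeled on the names used in this thread, not the actual flow-go definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// InvalidTransitionRandomBeaconStateMachineErr is a stand-in for the sentinel
// error named in this thread; the real flow-go definition may differ.
type InvalidTransitionRandomBeaconStateMachineErr struct {
	From, To string
}

func (e InvalidTransitionRandomBeaconStateMachineErr) Error() string {
	return fmt.Sprintf("invalid random beacon state transition: %s -> %s", e.From, e.To)
}

// setDKGState is a toy stand-in: it rejects self-transitions except for the
// failure state (per the author's follow-up) and the terminal committed state.
func setDKGState(from, to string) error {
	if from == to && from != "RandomBeaconKeyCommitted" && from != "DKGStateFailure" {
		return InvalidTransitionRandomBeaconStateMachineErr{From: from, To: to}
	}
	return nil
}

func main() {
	err := setDKGState("DKGStateCompleted", "DKGStateCompleted")
	var invalidTransition InvalidTransitionRandomBeaconStateMachineErr
	if errors.As(err, &invalidTransition) {
		fmt.Println("unexpected transition:", invalidTransition)
	}
}
```

The caller would branch on this sentinel with errors.As (or errors.Is for a sentinel value) instead of checking storage.ErrAlreadyExists.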

The Member Author replied:

I have allowed the transition failure -> failure to handle such situations, since the timing might be tricky, and I have also updated the comment regarding the possible error return. Let me know what you think: https://github.com/onflow/flow-go/pull/6771/files/43d6a63349788084c057a42134326dcb4e721ad5..550fd3f739237b9fb241f13a69b19de5b7ce56b5

Resolved review threads: model/flow/dkg.go, storage/badger/dkg_state.go (×2), storage/dkg.go (×2), storage/badger/dkg_state_test.go (×3)
@AlexHentschel (Member) commented Dec 10, 2024

I really appreciated this summary, which explains why the change is important 🙇 . I find that particularly helpful to clarify the focus of a PR.

The goal of this PR is to implement a state machine that covers all possible cases which can occur when performing DKG. Previously, our implementation enforced some of the rules but not all of them; what made things worse is that the logic and interfaces were not explicit enough for an engineer to easily understand them. In this PR we attempt a robust approach to handling the possible states and state transitions for both the happy path and the recovery path.

@AlexHentschel (Member) left a comment:
Very nice. The code is a lot more expressive and rigorous about what is allowed. While I looked at the core logic, I didn't quite get through the entire PR, but wanted to provide my batch of comments so far.

Resolved review threads: storage/badger/dkg_state.go, storage/dkg.go (×4), engine/consensus/dkg/reactor_engine.go (×4)
 	// public key - therefore it is unsafe for use
 	if !nextDKGPubKey.Equals(localPubKey) {
 		log.Warn().
 			Str("computed_beacon_pub_key", localPubKey.String()).
 			Str("canonical_beacon_pub_key", nextDKGPubKey.String()).
 			Msg("checking beacon key consistency: locally computed beacon public key does not match beacon public key for next epoch")
-		err := e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateInconsistentKey)
+		err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
A Member commented:

⚠️
I am worried that this code might run concurrently with, for example, the recovery (if the node rebooted at a really unfortunate time). I think we should, at each step, assume that some other process might have concurrently advanced the state, i.e. the flow.DKGState could have changed compared to what we just read from the database.

The Member Author replied:

Intuitively that seems too relaxed. For failure states we allow self-transitions; for everything else, where we deviate from the happy path and get an unexpected error, I would be inclined to return an error so we can figure out what went wrong. For your particular scenario I am not very worried: it just means the operator has to try again.

durkmurder and others added 2 commits December 10, 2024 18:14
Co-authored-by: Alexander Hentschel <[email protected]>
Co-authored-by: Alexander Hentschel <[email protected]>
@AlexHentschel (Member) left a comment:

The following batch of comments is on the tests for RecoverableRandomBeaconStateMachine.

Resolved review threads: storage/badger/dkg_state_test.go (×9)
epochCounter := setupState()
err = store.SetDKGState(epochCounter, flow.RandomBeaconKeyCommitted)
require.NoError(t, err, "should be possible since we have a stored private key")
err = store.UpsertMyBeaconPrivateKey(epochCounter, unittest.RandomBeaconPriv())
A Member commented:

⚠️

In my opinion this test should fail, unless we are using the same key that was initially committed in line 320:

err = store.UpsertMyBeaconPrivateKey(epochCounter, unittest.RandomBeaconPriv())

I am not sure whether unittest.RandomBeaconPriv() returns different keys on each call (?) but if it does, this test should fail.

Can we please test both cases? Thanks

The Member Author replied:

Will take care of this in follow up PR.

Comment on lines +346 to +352
// perform this only if state machine is in initial state
if !started {
// store my beacon key for the first epoch post-spork
err = myBeaconKeyStateMachine.UpsertMyBeaconPrivateKey(epochCounter, beaconPrivateKey.PrivateKey)
if err != nil {
return fmt.Errorf("could not upsert my beacon private key for root epoch %d: %w", epochCounter, err)
}
A Member commented:

If that is easy to do, I'd prefer if we can check that the random beacon key matches the information in the Epoch Commit event ... just to be safe against human configuration errors
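The requested safety check could be sketched roughly as below. The types here are simplified stand-ins (in flow-go the keys would be crypto private/public key types and the committed public key would come from the EpochCommit event); this is an illustration of the check, not the actual implementation.

```go
package main

import (
	"bytes"
	"fmt"
)

// PrivateKey is a simplified stand-in for a crypto private key; its public
// key is represented here as raw bytes.
type PrivateKey struct{ pub []byte }

func (k PrivateKey) PublicKey() []byte { return k.pub }

// checkKeyMatchesEpochCommit verifies that the operator-supplied beacon
// private key corresponds to the public key registered in the EpochCommit
// event, guarding against human configuration errors.
func checkKeyMatchesEpochCommit(key PrivateKey, committedPub []byte) error {
	if !bytes.Equal(key.PublicKey(), committedPub) {
		return fmt.Errorf("beacon private key does not match public key in EpochCommit")
	}
	return nil
}

func main() {
	key := PrivateKey{pub: []byte{0x01}}
	fmt.Println(checkKeyMatchesEpochCommit(key, []byte{0x01}) == nil) // matching key passes
	fmt.Println(checkKeyMatchesEpochCommit(key, []byte{0x02}) == nil) // mismatch is rejected
}
```

Running this check before UpsertMyBeaconPrivateKey would catch a mis-injected key at configuration time rather than at signing time.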

The Member Author replied:

Will take care of this in follow up PR.

@AlexHentschel (Member) left a comment:

Thank you for the great work. The state transitions for the DKG are so much clearer now (at least for me). Only remaining request would be to include your state machine diagram in the code base:

  • you could add a short Readme in the folder docs ... maybe call it RecoverableRandomBeaconStateMachine or something similar.

To simplify your work, I have drafted a brief explanation, which you could include in the readme (if you like it):

The RecoverableRandomBeaconStateMachine formalizes the life-cycle of the Random Beacon keys for each epoch. On the happy path, each consensus participant for the next epoch takes part in a DKG to obtain a threshold key for participating in Flow's Random Beacon. After successfully finishing the DKG protocol, the node obtains a random beacon private key, which is stored in the database along with the current DKG state flow.DKGStateCompleted. If for any reason the DKG fails, the private key will be nil and the current DKG state is set to flow.DKGStateFailure.
If the epoch switchover fails, the network goes into Epoch Fallback Mode [EFM]. The governance committee can recover the network via a special EpochRecover transaction; in this case, the set of threshold keys is specified by the governance committee.
The current implementation focuses on the scenario where the governance committee re-uses the threshold key set from the last successful epoch transition. While injecting other threshold keys into the nodes is conceptually possible and supported, the utilities for this recovery path are not yet implemented.

[state machine diagram]

@durkmurder durkmurder merged commit 6c251a3 into feature/efm-recovery Dec 12, 2024
55 checks passed
@durkmurder durkmurder deleted the yurii/6725-recoverable-random-beacon-state-machine branch December 12, 2024 13:48