[EFM] Recoverable Random Beacon State Machine #6771
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@                 Coverage Diff                  @@
##          feature/efm-recovery    #6771   +/-  ##
========================================================
- Coverage              41.72%    41.68%   -0.05%
========================================================
  Files                   2031      2033       +2
  Lines                 180552    180748     +196
========================================================
  Hits                   75341     75341
- Misses                 99017     99202     +185
- Partials                6194      6205      +11
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
@@ -427,10 +431,10 @@ func (e *ReactorEngine) end(nextEpochCounter uint64) func() error {
 	// has already abandoned the happy path, because on the happy path the ReactorEngine is the only writer.
 	// Then this function just stops and returns without error.
 	e.log.Warn().Err(err).Msgf("node %s with index %d failed DKG locally", e.me.NodeID(), e.controller.GetIndex())
-	err := e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateDKGFailure)
+	err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
 	if err != nil {
 		if errors.Is(err, storage.ErrAlreadyExists) {
SetDKGState doesn't return this error type; it returns InvalidTransitionRandomBeaconStateMachineErr.
The comment above is outdated now, but it implies that we might already have persisted a failure state:
"By convention, if we are leaving the happy path, we want to persist the first failure symptom"
The state machine does not allow transitions from one state to itself (except for RandomBeaconKeyCommitted). If, as the current comment suggests, a failure state is already set at this point, we will throw InvalidTransitionRandomBeaconStateMachineErr, because DKGStateFailure -> DKGStateFailure is not a valid transition. I don't think this is the case, but I'll return to this after reading further through the PR.
Suggestions:
- update the comment on lines 428-432
- remove the ErrAlreadyExists check (see the sketch below)
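For illustration, here is a minimal sketch of what the call site could look like after dropping the ErrAlreadyExists check, assuming the state machine reports rejected transitions via the InvalidTransitionRandomBeaconStateMachineErr mentioned above. The dkgStateWriter interface, the isInvalidTransition predicate, and the markDKGFailure helper are hypothetical stand-ins, not the actual code in this PR.

```go
package dkg // hypothetical package, for the sketch only

import (
	"fmt"

	"github.com/onflow/flow-go/model/flow"
)

// dkgStateWriter mirrors the single store method used here; the real interface is broader.
type dkgStateWriter interface {
	SetDKGState(epochCounter uint64, state flow.DKGState) error
}

// markDKGFailure persists the failure state for the given epoch. isInvalidTransition
// stands in for whichever sentinel/typed-error check the state machine exposes for
// InvalidTransitionRandomBeaconStateMachineErr.
func markDKGFailure(store dkgStateWriter, isInvalidTransition func(error) bool, nextEpochCounter uint64) error {
	err := store.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
	if err == nil {
		return nil
	}
	// A rejected transition indicates that another component already drove the state
	// machine past this point (or recorded the failure first), which is benign here.
	if isInvalidTransition(err) {
		return nil
	}
	// Anything else is an unexpected exception and must be escalated.
	return fmt.Errorf("could not set DKG state to failure for epoch %d: %w", nextEpochCounter, err)
}
```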
I have allowed the transition failure -> failure to handle such situations, since the timing might be tricky, and I have also added an updated comment regarding the possible error return. Let me know what you think: https://github.com/onflow/flow-go/pull/6771/files/43d6a63349788084c057a42134326dcb4e721ad5..550fd3f739237b9fb241f13a69b19de5b7ce56b5
Co-authored-by: Jordan Schalm <[email protected]>
I really appreciated this summary, which explains why the change is important 🙇. I find that particularly helpful to clarify the focus of a PR.
Very nice. The code is a lot more expressive and rigorous about what is allowed. While I looked at the core logic, I didn't quite get through the entire PR, but I wanted to already provide my batch of comments so far.
 	// public key - therefore it is unsafe for use
 	if !nextDKGPubKey.Equals(localPubKey) {
 		log.Warn().
 			Str("computed_beacon_pub_key", localPubKey.String()).
 			Str("canonical_beacon_pub_key", nextDKGPubKey.String()).
 			Msg("checking beacon key consistency: locally computed beacon public key does not match beacon public key for next epoch")
-		err := e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateInconsistentKey)
+		err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
I am worried that this code might be running concurrently with, for example, the recovery (if the node rebooted at a really unfortunate time). I think we should, in each step, assume that some other process might have concurrently advanced the state. So in each step, it might be possible that the flow.DKGState has changed compared to what we just read from the database.
Intuitively that seems like too relaxed a behavior. For failure states we allow self-transitions; for everything else, where we deviate from the happy path and get an unexpected error, I would be inclined to return an error so we can figure out what went wrong. For your particular scenario I am not very worried; it means that the operator has to try again.
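To illustrate the concern about concurrent writers, here is a minimal sketch of the write-path pattern it suggests: validate the transition against the state that is persisted at write time, inside a single atomic update, rather than against an earlier read. The atomically callback and the validTransition relation are abstractions assumed for this sketch; they do not mirror the actual storage code.

```go
package dkg // hypothetical package, for the sketch only

import (
	"fmt"

	"github.com/onflow/flow-go/model/flow"
)

// processStateTransition re-validates the requested transition against the currently
// persisted state. atomically abstracts one atomic read-modify-write on the database
// (e.g. a single transaction); validTransition is the state machine's transition relation.
func processStateTransition(
	atomically func(update func(current flow.DKGState) (flow.DKGState, error)) error,
	validTransition func(from, to flow.DKGState) bool,
	newState flow.DKGState,
) error {
	return atomically(func(current flow.DKGState) (flow.DKGState, error) {
		// The state may have been advanced by another component (e.g. recovery) since
		// the caller last read it, so the check runs on the value stored right now.
		if !validTransition(current, newState) {
			return current, fmt.Errorf("invalid DKG state transition from %v to %v", current, newState)
		}
		return newState, nil
	})
}
```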
Co-authored-by: Alexander Hentschel <[email protected]>
The following batch of comments is on the tests for RecoverableRandomBeaconStateMachine.
 	epochCounter := setupState()
 	err = store.SetDKGState(epochCounter, flow.RandomBeaconKeyCommitted)
 	require.NoError(t, err, "should be possible since we have a stored private key")
 	err = store.UpsertMyBeaconPrivateKey(epochCounter, unittest.RandomBeaconPriv())
⚠️ In my opinion this test should fail, unless we are using the same key that was initially committed in line 320:

flow-go/storage/badger/dkg_state_test.go, line 320 in 0687e30:
err = store.UpsertMyBeaconPrivateKey(epochCounter, unittest.RandomBeaconPriv())

I am not sure whether unittest.RandomBeaconPriv() returns different keys on each call (?), but if it does, this test should fail.
Can we please test both cases? Thanks
Will take care of this in a follow-up PR.
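For reference, a sketch of what testing both cases could look like. The beaconKeyStore interface and the generic key type are assumptions that merely mirror the two store methods used in the test above; the helper would be called with the key that was committed earlier plus a freshly generated one (e.g. unittest.RandomBeaconPriv()).

```go
package dkgstate_test // hypothetical package, for the sketch only

import (
	"testing"

	"github.com/stretchr/testify/require"

	"github.com/onflow/flow-go/model/flow"
)

// beaconKeyStore mirrors the two store methods exercised above; K stands in for the
// concrete private-key type, which is left generic in this sketch.
type beaconKeyStore[K any] interface {
	SetDKGState(epochCounter uint64, state flow.DKGState) error
	UpsertMyBeaconPrivateKey(epochCounter uint64, key K) error
}

// assertKeyImmutableOnceCommitted covers both cases: once the state machine reaches
// RandomBeaconKeyCommitted, re-upserting the already-committed key is expected to be
// accepted, while upserting a different key is expected to be rejected.
func assertKeyImmutableOnceCommitted[K any](
	t *testing.T,
	store beaconKeyStore[K],
	epochCounter uint64,
	committedKey K, // the key that was upserted before committing
	differentKey K, // a freshly generated key, e.g. unittest.RandomBeaconPriv()
) {
	err := store.SetDKGState(epochCounter, flow.RandomBeaconKeyCommitted)
	require.NoError(t, err, "should be possible since we have a stored private key")

	// Case 1: same key as initially committed.
	require.NoError(t, store.UpsertMyBeaconPrivateKey(epochCounter, committedKey))

	// Case 2: a different key must not overwrite a committed beacon key.
	require.Error(t, store.UpsertMyBeaconPrivateKey(epochCounter, differentKey))
}
```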
 	// perform this only if state machine is in initial state
 	if !started {
 		// store my beacon key for the first epoch post-spork
 		err = myBeaconKeyStateMachine.UpsertMyBeaconPrivateKey(epochCounter, beaconPrivateKey.PrivateKey)
 		if err != nil {
 			return fmt.Errorf("could not upsert my beacon private key for root epoch %d: %w", epochCounter, err)
 		}
If that is easy to do, I'd prefer if we could check that the random beacon key matches the information in the EpochCommit event ... just to be safe against human configuration errors.
Will take care of this in a follow-up PR.
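A minimal sketch of the requested safety check, under the assumption that the configured private key's public counterpart can be compared against the canonical beacon public key listed for this node in the EpochCommit event. The lookup of that canonical key and the crypto import path are assumptions, and ensureKeyMatchesEpochCommit is a hypothetical helper.

```go
package dkg // hypothetical package, for the sketch only

import (
	"fmt"

	"github.com/onflow/crypto" // assumed import path of the crypto module used by flow-go
)

// ensureKeyMatchesEpochCommit compares the configured/injected beacon private key with
// the canonical beacon public key that the EpochCommit event lists for this node.
// How the canonical key is obtained is left to the caller.
func ensureKeyMatchesEpochCommit(localPrivKey crypto.PrivateKey, canonicalPubKey crypto.PublicKey, epochCounter uint64) error {
	if !localPrivKey.PublicKey().Equals(canonicalPubKey) {
		return fmt.Errorf("configured random beacon key does not match the EpochCommit event for epoch %d", epochCounter)
	}
	return nil
}
```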
Thank you for the great work. The state transitions for the DKG are so much clearer now (at least for me). My only remaining request would be to include your state machine diagram in the code base:
- you could add a short Readme in the docs folder ... maybe call it RecoverableRandomBeaconStateMachine or something similar.
To simplify your work, I have drafted a brief explanation, which you could include in the readme (if you like it):
The RecoverableRandomBeaconStateMachine formalizes the life-cycle of the Random Beacon keys for each epoch. On the happy path, each consensus participant for the next epoch takes part in a DKG to obtain a threshold key to participate in Flow's Random Beacon. After successfully finishing the DKG protocol, the node obtains a random beacon private key, which is stored in the database along with the current DKG state flow.DKGStateCompleted. If for any reason the DKG fails, then the private key will be nil and the current DKG state is set to flow.DKGStateFailure.
In case of a failing epoch switchover, the network goes into Epoch Fallback Mode [EFM]. The governance committee can recover the network via a special EpochRecover transaction. In this case, the set of threshold keys is specified by the governance committee.
The current implementation focuses on the scenario where the governance committee re-uses the threshold key set from the last successful epoch transition. While injecting other threshold keys into the nodes is conceptually possible and supported, the utilities for this recovery path are not yet implemented.
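To complement the drafted explanation, here is a partial sketch of the transition relation discussed in this thread, using only the states named above. The real state machine in the code base also covers the initial/started states and may differ in detail; allowedTransitions and canTransition are illustrative names rather than the actual API.

```go
package dkg // hypothetical package, for the sketch only

import "github.com/onflow/flow-go/model/flow"

// allowedTransitions lists, for the states named in this discussion, which next states
// are permitted. Self-transitions are disallowed except for RandomBeaconKeyCommitted
// and, per the discussion above, DKGStateFailure.
var allowedTransitions = map[flow.DKGState][]flow.DKGState{
	flow.DKGStateCompleted: {
		flow.RandomBeaconKeyCommitted, // happy path: locally computed key is confirmed as canonical
		flow.DKGStateFailure,          // e.g. locally computed key is inconsistent with the canonical one
	},
	flow.DKGStateFailure: {
		flow.DKGStateFailure,          // tolerated self-transition (concurrent failure reports)
		flow.RandomBeaconKeyCommitted, // recovery path: a known-good key is injected and committed
	},
	flow.RandomBeaconKeyCommitted: {
		flow.RandomBeaconKeyCommitted, // terminal: only the self-transition is allowed
	},
}

// canTransition reports whether this sketch permits the transition from -> to.
func canTransition(from, to flow.DKGState) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```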
Co-authored-by: Alexander Hentschel <[email protected]>
#6725
Context
The goal of this PR is to implement a state machine that covers all possible cases that can happen when performing the DKG. Previously, our implementation enforced some of the rules but not all of them; what made things worse is that the logic and interfaces weren't explicit enough for an engineer to easily understand them. In this PR we have made an attempt to implement a robust approach to handling the possible states and state transitions for both the happy path and the recovery path.
Some key points:
- badger.RecoverablePrivateBeaconKeyStateMachine implements the state machine itself.
- The state machine is used by dkg.ReactorEngine and dkg.BeaconKeyRecovery.
- A random beacon key is committed only if it is consistent with the EpochCommit for the respective epoch.

Link to the diagram which describes the state machine: