DKG test loop #3034
Conversation
The existing group selection code (ticket submission etc.) was relevant for beacon v1 but no longer makes sense for v2. The new group creation process does not use ticket submission and relies on the sortition pool group selection. This changeset is the first step towards supporting the new group creation flow.
# Conflicts: # pkg/chain/local/local.go
Contract bindings are not ready yet, so we mitigate that problem by triggering the DKG loop every 100th block.
submission logic
`GroupRegistrationInterface`: So far, this function was defined in the `DistributedKeyGenerationInterface`, but it looks like `GroupRegistrationInterface` is a better place.
That logic is now obsolete because v2 beacon group creation works differently and uses the result challenge and approval mechanisms. Because of that, we remove it from the deduplicator and leave a TODO to implement the new logic once the full new group creation flow is implemented.
threshold: This way, we can use values from v2 and test how DKG behaves with them.
In v1, the result of a successful DKG was always a fully-operable on-chain group whose key material share was registered on-disk by each member. This is no longer the case in v2, where a DKG result is not approved immediately and is subject to a challenge period. Because of that, we modify the group registry to allow storing non-operable candidate groups first and registering those groups as approved once the on-chain approval takes place.
Here we detach the DKG result submission logic from the v1 contract in favor of temporary code that allows testing the DKG loop. Once the interim period ends, the real v2 contract binding should be integrated here.
submission
The real group selection is not available yet, so we just return a group whose members all correspond to the operator itself. In effect, the DKG test loop will be performed between members belonging to the same client instance.
If we trigger DKG every 100th block, subsequent DKG instances overlap and cause very high CPU usage.
Random beacon DKG will still use 64 members. We reflect that change in the temporary hardcoded variable.
The approach used so far was problematic because the maximum directory name length is 128 characters, which doesn't leave space for the `_candidate` suffix.
    return false
}
Quoting from the PR description:
> We removed the event deduplicator for the group selection events as it is no longer relevant for v2. We will need a new logic to handle v2 caveats. This was covered by a TODO in beacon.go
I am not sure if this is the case. As far as I remember, the deduplicator was mostly handling soft forks. We were receiving the same event twice but with a different block number: `n` and then `n+1`. The implementation is quite naive: it checks whether the group selection had a chance to finish since the last emitted group-selection-started event and, if not, it considers the event a duplicate of the last one.
Although this is naive and we could improve it by, for example, waiting for N blocks and then checking the last N blocks to see whether the event is included, I think the current implementation should do the work, even for v2.
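For context, a minimal sketch of that naive check (illustrative names, not the actual deduplicator code):

```go
// Minimal sketch of the naive check described above; the names and the
// duration field are illustrative, not the actual deduplicator code.
type groupSelectionDeduplicator struct {
	// Block at which the last handled group selection started; zero means
	// no group selection has been seen yet.
	lastStartBlock uint64
	// Number of blocks a full group selection needs to complete.
	groupSelectionDurationBlocks uint64
}

// notifyGroupSelectionStarted returns true if the event should be handled and
// false if it looks like a duplicate of the previously seen event.
func (d *groupSelectionDeduplicator) notifyGroupSelectionStarted(
	eventBlock uint64,
) bool {
	// If the previous group selection had no chance to finish yet, treat the
	// event as a duplicate (e.g. the same event re-emitted at block n+1).
	if d.lastStartBlock != 0 &&
		eventBlock < d.lastStartBlock+d.groupSelectionDurationBlocks {
		return false
	}

	d.lastStartBlock = eventBlock
	return true
}
```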
> As far as I remember, the deduplicator was mostly handling soft forks

Not only soft forks. The event subscription mechanism from contract bindings can replay the event when fetching past events. This happens more often than soft forks and results in an event having the same block number.
> I think the current implementation should do the work, even for v2.

This is something I'm not sure about. In v2, the minimum time between two `DkgStarted` events is DKG protocol time + publication time + challenge period time. In theory, we can expose a `NotifyGroupCreationStarted` function from the event deduplicator to decide about the group creation start. However, at the same time, we must be prepared for `DkgResultChallenged` events that actually restart the process and make the time between two `DkgStarted` events longer. The deduplicator must handle that. This is why I think we should approach the deduplication logic once we have challenges and approvals implemented together.
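To illustrate the timing argument with rough numbers (the 72-block DKG protocol duration is mentioned below; the other values are purely hypothetical):

```go
// Illustration of the timing relation only; apart from the 72-block DKG
// protocol duration, these are hypothetical numbers, not chain parameters.
const (
	dkgProtocolDurationBlocks       = 72
	resultPublicationDurationBlocks = 20  // hypothetical
	challengePeriodDurationBlocks   = 100 // hypothetical
)

// Without any DkgResultChallenged event, two DkgStarted events are at least
// this many blocks apart; every challenge restarts result publication and
// the challenge period, stretching the gap further.
const minBlocksBetweenDkgStarted = dkgProtocolDurationBlocks +
	resultPublicationDurationBlocks +
	challengePeriodDurationBlocks
```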
> Not only soft forks. The event subscription mechanism from contract bindings can replay the event when fetching past events. This happens more often than soft forks and results in an event having the same block number.
Yeah, that's a good point!
The time it takes the DKG protocol to produce the result is 72 blocks ([1], [2]). The subscription monitoring mechanism pulls 100 past blocks by default, and although that could make sense for the relay request subscription, it does not make sense for DKG because after the first 6 blocks of inactivity, the game for the given member is over.
This solution is not ideal but at the same time, it's the simplest one. I am afraid that dealing with challenges will make this code pretty complicated.
I think that this should work:
- Leave the deduplicator for DKG as-is.
- Make it clear that `NotifyGroupSelectionStarted` filters out duplicates only up to `minGroupCreationDurationBlocks` into the past.
- When setting up a subscription for DKG, disable the subscription monitoring loop. For 6 blocks, we would need to monitor pretty aggressively (every 10 seconds or so).
Does it make sense?
We synced with @lukasz-zimnoch and came up with a better solution:
- When setting up a subscription for DKG, disable the subscription monitoring loop.
- Use a time-cache-based deduplicator and look at the DKG seed from the event.
Here it is fbe3062
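For reference, a minimal sketch of what such a seed-based time cache could look like (illustrative names; uses `sync`, `time`, and `math/big`); the actual implementation is in the commit above:

```go
// Sketch only: names are illustrative, not the actual keep-core types.
type dkgSeedCache struct {
	mutex       sync.Mutex
	cachePeriod time.Duration
	seenSeeds   map[string]time.Time // DKG seed (hex) -> time it was first seen
}

func newDKGSeedCache(cachePeriod time.Duration) *dkgSeedCache {
	return &dkgSeedCache{
		cachePeriod: cachePeriod,
		seenSeeds:   make(map[string]time.Time),
	}
}

// notifyDKGStarted returns true if a DkgStarted event with the given seed has
// not been seen within the cache period and should be handled; it returns
// false for replayed duplicates of the same event. A real time cache would
// also evict expired entries; that is omitted here for brevity.
func (c *dkgSeedCache) notifyDKGStarted(seed *big.Int) bool {
	c.mutex.Lock()
	defer c.mutex.Unlock()

	key := seed.Text(16)
	now := time.Now()

	if seenAt, ok := c.seenSeeds[key]; ok && now.Sub(seenAt) < c.cachePeriod {
		return false
	}

	c.seenSeeds[key] = now
	return true
}
```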
And added a TODO about disabling the monitoring loop 7c122d5
// SelectGroup returns the group members for the group generated by
// the given seed. This function can return an error if the relay chain's
// state does not allow for group selection at the moment.
SelectGroup(seed *big.Int) ([]StakerAddress, error)
#3056 introduces `chain.Address`; I think we could use it here.
Just a note for the future implementation with a real `SortitionPool`: fortunately, there is a function that allows fetching the addresses in one shot.
Yep, there is a comment about the deprecation of `StakerAddress`. I can replace it with `chain.Address` during the cleanup I plan to do in the `pkg/beacon` package.
When we call `SelectGroup` and assign the return value in `node.go`, we call that var `groupMembers`, and IMO that name is clearer because we return an array of members. WDYT about naming it `SelectGroupMembers` instead?
There will be a follow-up PR that will add a real-world implementation of that function. I'll consider the rename then.
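For reference, the temporary implementation described in this PR could look roughly like the sketch below; `localChain`, `operatorAddress`, and the hardcoded group size are illustrative assumptions, not the final code:

```go
// Sketch only: the sortition pool is not integrated yet, so every selected
// member is the calling operator itself. localChain and operatorAddress are
// illustrative names.
func (lc *localChain) SelectGroup(seed *big.Int) ([]StakerAddress, error) {
	groupSize := 64 // temporary hardcoded value, see the discussion below

	members := make([]StakerAddress, groupSize)
	for i := range members {
		// Every slot points back to this client's own operator.
		members[i] = lc.operatorAddress
	}

	// The seed is ignored in this interim version; the real implementation
	// will pass it to the sortition pool's group selection.
	return members, nil
}
```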
@@ -197,6 +221,12 @@ func (mgri *mockGroupRegistrationInterface) OnGroupRegistered(
    panic("not implemented")
}

func (mgri *mockGroupRegistrationInterface) IsGroupRegistered(
Is it used anywhere?
Yep, in `TestUnregisterStaleGroups` and `TestUnregisterStaleGroupsSkipLastGroupCheck`.
pkg/beacon/relay/registry/storage.go
candidateMembershipPrefix = "candidate_membership"
approvedMembershipPrefix  = "approved_membership"
I am wondering if we shouldn't use the snapshot mechanism from the ECDSA node. This is something our operators are familiar with and I think it makes sense here and in ECDSA:

- any key material produced by the node lands in `snapshot`
- when the group is approved, we copy from `snapshot` to `current` (this would be new)

We would need to "snapshot" the state anyway, and I think adding `candidate`/`approved` does not make sense if the snapshot is already there (the snapshot is essentially the candidate).

Instead of passing the `GroupState` as a parameter, I would "snapshot" the signer as soon as the key material is produced. Then, once the group is approved on-chain, I would copy from the `snapshot` to the `current`. This way, we do not have to worry whether the node has been restarted during the challenge period. Also, the "snapshot" would not store the group in the in-memory registry. This is exactly how it works in ECDSA.
Thinking more about it, we could have something like:
- We would `snapshot(groupID string, membership *Membership) error` as soon as the key material is produced,
- We would `promote(groupID string)` when we receive the signal from the chain that the result is approved.

We don't care if and how many times the node was restarted between `snapshot` and `promote`. Promote would just copy `snapshot/{groupId}` to `current/{groupId}`.
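A rough sketch of what `promote` could look like at the storage level, assuming hypothetical `snapshotDir` and `currentDir` fields with one directory per group (copying rather than renaming, so the snapshot stays untouched; uses `fmt`, `os`, and `path/filepath`):

```go
// Hypothetical storage layout: one directory per group under each root
// directory; not the actual persistentStorage fields from this PR.
type persistentSnapshotStorage struct {
	snapshotDir string
	currentDir  string
}

// promote copies the snapshotted key material of the given group into the
// current storage once the group is approved on-chain. The snapshot itself
// is never modified.
func (ps *persistentSnapshotStorage) promote(groupID string) error {
	src := filepath.Join(ps.snapshotDir, groupID)
	dst := filepath.Join(ps.currentDir, groupID)

	entries, err := os.ReadDir(src)
	if err != nil {
		return fmt.Errorf("could not read snapshot directory: [%v]", err)
	}

	if err := os.MkdirAll(dst, 0700); err != nil {
		return fmt.Errorf("could not create current directory: [%v]", err)
	}

	// Copy every membership file of the group.
	for _, entry := range entries {
		data, err := os.ReadFile(filepath.Join(src, entry.Name()))
		if err != nil {
			return fmt.Errorf("could not read snapshot file: [%v]", err)
		}
		if err := os.WriteFile(filepath.Join(dst, entry.Name()), data, 0600); err != nil {
			return fmt.Errorf("could not write current file: [%v]", err)
		}
	}

	return nil
}
```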
> This is something our operators are familiar with and I think it makes sense here and in ECDSA:
>
> - any key material produced by the node lands in snapshot
> - when the group is approved, we copy from snapshot to current (this would be new)
Well, this is not that simple. The snapshot mechanism is write-only. There is no logic in the code to read from it. This is a last-resort mechanism for manual restoration in case something bad happens. What ECDSA does is rather:
- any key material produced by the node lands in snapshot
- when the public key is published on-chain it is just normally stored in the regular storage (snapshot is untouched)
If the client shuts down between those steps, the key would not be published at all. That works well in ECDSA because the time between a snapshot and a regular save is quite short. In beacon v2, the time between result publication and approval (the challenge period) will be much longer, so the client restart risk is much higher. If we use a snapshot on result publication plus a store on approval, we don't get automatic restore after a client restart out of the box. This is a problem because the client must be aware of which candidate groups it is a member of.
In the current approach, we would address this by adding an in-memory cache for candidate groups once we add the result approvals and challenges feature. If we want to go with snapshots, we'll need a huge refactor of the entire storage code. This can be done, but then I propose to do it in a separate PR.
The solution I proposed requires refactoring the storage. The snapshot would still be write-only, but we would need to be able to copy from the snapshot.
At a high level, the idea is not to load candidate members into memory, not even into a separate cache, and instead to copy the directory when a group is approved. I think this is simpler and has the advantage that the copy is a single disk operation; we do not need to write in a loop, which I think is less risky.
What I do not like in the current implementation is that, looking at just the `current` directory, it is not possible to say which groups were approved and which were not. We are also mixing candidate key shares with approved key shares, which is not the end of the world, but I think we could make it cleaner.
If we want to treat snapshots as something special that we write only once and never touch again, I think it's even better, because by obeying this rule in the code we avoid corrupting snapshot shares in case of a programming error. At the storage level, we would have:
snapshot/
  0xAAA
  0xBBB
  0xCCC
current/
  candidate_0xAAA/
    membership_1
    membership_21
  0xBBB
    membership_33
    membership_55
  0xCCC
    membership_64
Then, when `0xAAA` gets approved, we `mv current/candidate_0xAAA current/0xAAA` and we finally land with:
snapshot/
  0xAAA
  0xBBB
  0xCCC
current/
  0xAAA/
    membership_1
    membership_21
  0xBBB
    membership_33
    membership_55
  0xCCC
    membership_64
Once the directory is renamed, we load all the members from `0xAAA` into memory.
If we are fine with this layout, I would propose something like:
groups.go
// RegisterCandidateGroup registers that a candidate group with the given
// public key was successfully created.
func (g *Groups) RegisterCandidateGroup(
    signer *dkg.ThresholdSigner,
    groupID string,
) error {
    g.mutex.Lock()
    defer g.mutex.Unlock()

    membership := &Membership{
        Signer:      signer,
        ChannelName: groupID,
    }

    err := g.storage.saveCandidate(membership, groupID)
    if err != nil {
        return fmt.Errorf("could not persist membership to the storage: [%v]", err)
    }

    // Candidate groups are not added to the in-memory registry here;
    // that happens only once the group is approved (see ApproveGroup).
    return nil
}
// ApproveGroup registers that a candidate group with the given
// public key was successfully approved.
func (g *Groups) ApproveGroup(
    groupID string,
) error {
    err := g.storage.approveCandidate(groupID)
    if err != nil {
        return fmt.Errorf("could not promote group to approved in the storage: [%v]", err)
    }

    // Load the now-approved group into the in-memory registry.
    return g.loadGroup(groupID)
}
storage.go
func (ps *persistentStorage) saveCandidate(
    membership *Membership,
    groupID string, // use this as a directory name
) error {
    // (...)
}

func (ps *persistentStorage) approveCandidate(
    groupID string, // use this as a directory name
) error {
    // (...)
}

func (ps *persistentStorage) read(groupID string) (<-chan *Membership, <-chan error) {
    // (...) load a single group
}

// Alternatively, we can combine approveCandidate(groupID string) with read(groupID string):
func (ps *persistentStorage) approveCandidate(
    groupID string, // use this as a directory name
) (<-chan *Membership, <-chan error) {
    // (...)
}
However, this is a bit of work and, as you say, it's not the best idea to do it in this PR. If we agree with this approach, I'd clean up the code in this PR and prepare for the new version:

- Get rid of `GroupState` and the candidate/approved prefixes.
- Rename `save` in `storage.go` to `saveCandidate`, accept `groupID string`, and prepend `candidate_` to the group identifier when saving.
- In `groups.go`, have only `RegisterCandidateGroup`.
- Modify `storage.go` `readAll` to read candidate groups from disk. Make it a temporary change, until we refactor the storage. In the separate PR refactoring the storage, we should clean this up to read only approved groups and, temporarily, approve the group right away after it's saved as a candidate.
WDYT?
We synced with @lukasz-zimnoch and agreed to revert the changes that write candidate groups to `current` in this PR, and to revisit the approach to storage separately. Given there is a limit on the directory name length, we will probably end up with:
snapshot/
candidate/
approved/
See 641466e
Keeping this thread unresolved so that we can refer back to it when reworking the storage.
// TODO: Fetch from RandomBeacon v2 contract.
groupSize := 64
honestThreshold := 33
Or hardcode in constants. 🤷🏻
... Or read from the constants of `BeaconDkgValidator` bindings.
Co-authored-by: Piotr Dyraga <[email protected]>
Co-authored-by: Piotr Dyraga <[email protected]>
There are some problems with running the client without v1 contracts deployed and initialized. In #3026 we reworked the v1 scripts into the v2 approach, so it is no longer possible to set up v1 contracts locally in the dev environment. Also, the work we are doing in #3012 will set up a testnet based on v2 contracts. We need to find a way to run a client now. I tried locally, and I think we can do some workarounds for M1:
loop for `DkgStarted` event.
One more note: I think we shouldn't close issue #3013 when merging this PR. We still need to integrate with the Sortition Pool group selection and ensure we can run this loop over the network with multiple clients.
There is one more problem with v1 contracts not deployed. The DKG errors out with:
I would consider disabling
Here it is 10e12b9
Refs: #3013
This pull request detaches the existing DKG mechanism from Random Beacon v1 contracts and introduces a DKG test loop that is triggered every 500th block and stores the result on disk. That test loop is an interim mechanism that will allow us to integrate with Random Beacon v2 contracts soon. All details about v1 and v2 DKG along with the test loop description are captured below.
DKG in Random Beacon V1
The DKG v1 flow is as follows:
- The `KeepRandomBeaconOperator` contract emits the `GroupSelectionStarted` event that contains a seed value.
- Clients react to the `GroupSelectionStarted` event by starting the ticket submission process using the seed value held by the event.
- Group members are then determined using the `selectedParticipants` on-chain function.
- The DKG result is submitted to the `KeepRandomBeaconOperator` contract and the submitter monitors for the `DkgResultSubmittedEvent` event to confirm the result publication.

DKG in Random Beacon V2
In Random Beacon V2, things happen differently:
- The `RandomBeacon` contract emits the `DkgStarted` event that contains a seed value.
- Group members are selected using the `selectGroup` on-chain function exposed by the `RandomBeacon` or directly by the `SortitionPool` contract.
- The DKG result is submitted to the `RandomBeacon` contract and the submitter monitors for the `DkgResultSubmitted` event to confirm the result publication.

Interim DKG test loop introduced by this PR
We decided to move from v1 DKG to v2 DKG incrementally. To do so, we took the following steps:

- We removed the group selection logic (`pkg/beacon/relay/groupselection`) and the ticket submission logic, as it is no longer relevant for v2.
- We detached the DKG logic from the `KeepRandomBeaconOperator` contract by mocking the contact points between the DKG logic and the contract binding. This resulted in several modifications in the `pkg/beacon/relay/chain/chain.go` interfaces:
  - The `OnGroupSelectionStarted` event listener was removed and replaced by the `OnDKGStarted` event listener with a temporary implementation that emits the `DkgStarted` event every 500th block (see the sketch below).
  - The `SubmitDKGResult` implementation was removed and replaced with a local implementation that pipes the result directly to the handlers registered by the `OnDKGResultSubmitted` event listener.
  - The `OnDKGResultSubmitted` implementation was removed and replaced with a local implementation that registers a handler for results coming from `SubmitDKGResult`.
- We added a new `SelectGroup` function that replaces the removed group selection logic. It has a temporary implementation that returns a group consisting of `N` group members controlled by the calling operator.
- We reworked the deduplication mechanism for `DkgStarted` events. It's now a seed-based time cache.
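To make the interim trigger concrete, here is a minimal sketch of what the temporary `OnDKGStarted` implementation could look like; `localChain`, `watchBlocks`, and `DKGStartedEvent` are illustrative stand-ins rather than the actual keep-core types:

```go
// Sketch of the interim trigger only; DKGStartedEvent, localChain, and
// watchBlocks are illustrative stand-ins, not the actual keep-core types.
type DKGStartedEvent struct {
	Seed        *big.Int
	BlockNumber uint64
}

// OnDKGStarted emits a DkgStarted-like event locally every 500th block
// instead of listening to a contract binding.
func (lc *localChain) OnDKGStarted(
	handler func(event *DKGStartedEvent),
) (unsubscribe func()) {
	ctx, cancel := context.WithCancel(context.Background())

	go func() {
		// watchBlocks is assumed to stream new block numbers and close the
		// channel once the context is cancelled.
		for block := range lc.watchBlocks(ctx) {
			if block%500 == 0 {
				// The block number doubles as the DKG seed for the test loop.
				handler(&DKGStartedEvent{
					Seed:        new(big.Int).SetUint64(block),
					BlockNumber: block,
				})
			}
		}
	}()

	return cancel
}
```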