
DKG test loop #3034

Merged: 29 commits merged into main from dkg-loop on Jul 6, 2022

Conversation

@lukasz-zimnoch (Member) commented Jun 24, 2022

Refs: #3013

This pull request detaches the existing DKG mechanism from Random Beacon v1 contracts and introduces a DKG test loop that is triggered every 500th block and stores the result on disk. That test loop is an interim mechanism that will allow us to integrate with Random Beacon v2 contracts soon. All details about v1 and v2 DKG along with the test loop description are captured below.

DKG in Random Beacon V1

The DKG v1 flow is as follows:

  1. The KeepRandomBeaconOperator contract emits the GroupSelectionStarted event that contains a seed value.
  2. The client reacts to the GroupSelectionStarted event by starting the ticket submission process using the seed value held by the event.
  3. Once the ticket submission ends, the client checks its eligibility by calling the selectedParticipants on-chain function.
  4. If the client is eligible for the given DKG instance, i.e., controls some of the group members, it starts the DKG execution.
  5. Once the group public key is generated, the client attempts to publish it to the KeepRandomBeaconOperator contract and monitors for the DkgResultSubmittedEvent event to confirm the result publication.
  6. Once the key is published, the client stores the key material on disk and holds the group participation info in an in-memory group cache to be ready for incoming relay entry requests.

DKG in Random Beacon V2

In Random Beacon V2, things happen differently:

  1. The RandomBeacon contract emits the DkgStarted event that contains a seed value.
  2. The client checks its eligibility by calling the selectGroup on-chain function exposed by the RandomBeacon contract or directly by the SortitionPool contract.
  3. If the client is eligible for the given DKG instance, i.e., controls some of the group members, it starts the DKG execution.
  4. Once the group public key is generated, the client attempts to publish it to the RandomBeacon contract and monitors for the DkgResultSubmitted event to confirm the result publication.
  5. Once the key is published, the client stores the key material on disk. However, since v2 uses an optimistic DKG result validation with a challenge mechanism, the client cannot consider the group fully operable immediately. It must wait for the result's approval. To handle this, the client must store the key material aside and avoid adding the group participation info to the in-memory group cache right after result publication. A rough sketch of this flow is shown below.
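
For orientation, here is a minimal Go sketch of steps 2 to 4 of this flow, with the candidate-group caveat from step 5 noted in a comment. Only the names taken from this PR (SelectGroup with its seed *big.Int argument and []StakerAddress result, SubmitDKGResult, the DkgStarted/DkgResultSubmitted events) are grounded; the BeaconChain interface shape, DKGResult struct, and the handleDkgStarted/runDKG helpers are hypothetical and not the actual bindings.

package beacon

import (
	"bytes"
	"fmt"
	"math/big"
)

// StakerAddress mirrors the type used by SelectGroup in this PR; the rest of
// this sketch is assumed for illustration only.
type StakerAddress []byte

type DKGResult struct {
	GroupPublicKey []byte
}

type BeaconChain interface {
	SelectGroup(seed *big.Int) ([]StakerAddress, error)
	SubmitDKGResult(result *DKGResult) error
}

// handleDkgStarted sketches the client-side reaction to DkgStarted: check
// eligibility, run DKG off-chain, and publish the result.
func handleDkgStarted(
	chain BeaconChain,
	seed *big.Int,
	operator StakerAddress,
	runDKG func(seed *big.Int, members []StakerAddress) (*DKGResult, error),
) error {
	members, err := chain.SelectGroup(seed)
	if err != nil {
		return fmt.Errorf("could not select group: [%v]", err)
	}

	eligible := false
	for _, member := range members {
		if bytes.Equal(member, operator) {
			eligible = true
			break
		}
	}
	if !eligible {
		// The operator controls no members of this group; nothing to do.
		return nil
	}

	result, err := runDKG(seed, members)
	if err != nil {
		return fmt.Errorf("DKG execution failed: [%v]", err)
	}

	// Publish the result and keep the key material aside as a candidate;
	// the group becomes operable only once the result is approved on-chain
	// after the challenge period.
	return chain.SubmitDKGResult(result)
}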

Interim DKG test loop introduced by this PR

We decided to move from v1 DKG to v2 DKG incrementally. To do so, we took the following steps:

  • We removed everything related to the current group selection (pkg/beacon/relay/groupselection) and ticket submission logic, as it is no longer relevant for v2.
  • We detached the DKG mechanism from the v1 KeepRandomBeaconOperator contract by mocking the contact points between the DKG logic and the contract binding. This resulted in several modifications to the pkg/beacon/relay/chain/chain.go interfaces:
    • The OnGroupSelectionStarted event listener was removed and replaced by the OnDKGStarted event listener with a temporary implementation that emits the DkgStarted event every 500th block (see the sketch after this list).
    • The SubmitDKGResult implementation was removed and replaced with a local implementation that pipes the result directly to the handlers registered by the OnDKGResultSubmitted event listener.
    • The OnDKGResultSubmitted implementation was removed and replaced with a local implementation that registers a handler for results coming from SubmitDKGResult.
    • There is a new function, SelectGroup, that replaces the removed group selection logic. It has a temporary implementation that returns a group consisting of N group members controlled by the calling operator.
  • We changed the event deduplicator logic for DkgStarted events. It's now a seed-based time cache.
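
As an illustration of the interim trigger, here is a rough sketch of an every-500th-block loop. The block-number channel and the handler signature are assumptions made for this sketch; the actual temporary implementation added by the PR may look different.

package beacon

import "math/big"

// dkgTestLoopInterval reflects the interim trigger described above: the
// temporary OnDKGStarted implementation emits DkgStarted every 500th block.
const dkgTestLoopInterval = 500

// triggerDKGTestLoop watches block numbers and invokes the DkgStarted handler
// on every 500th block, deriving the seed from the block number.
func triggerDKGTestLoop(blockNumbers <-chan uint64, onDKGStarted func(seed *big.Int)) {
	for blockNumber := range blockNumbers {
		if blockNumber%dkgTestLoopInterval != 0 {
			continue
		}
		// The real DkgStarted event carries a chain-provided seed; here the
		// block number stands in for it.
		go onDKGStarted(new(big.Int).SetUint64(blockNumber))
	}
}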

The existing group selection code (ticket submission etc.) was relevant for beacon v1 but no longer makes sense for v2. The new group creation process does not use ticket submission and relies on the sortition pool group selection. This changeset is the first step towards supporting the new group creation flow.
@lukasz-zimnoch lukasz-zimnoch self-assigned this Jun 24, 2022
@lukasz-zimnoch lukasz-zimnoch added this to the v2.0.0-m1 milestone Jun 24, 2022
# Conflicts:
#	pkg/chain/local/local.go
Contract bindings are not ready yet, so we mitigate that problem by triggering the DKG loop every 100th block.
`GroupRegistrationInterface`

So far, this function was defined in the `DistributedKeyGenerationInterface`, but it looks like `GroupRegistrationInterface` is a better place.
That logic is now obsolete because the v2 beacon group creation works differently and uses the result challenge and approval mechanisms. Because of that, we remove it from the deduplicator and leave a TODO to implement the new logic once the full new group creation flow is implemented.
threshold

This way we can use values from v2, and we can test how DKG behaves with them.
In v1, the result of a successful DKG was always a fully-operable on-chain group whose key material share was registered on disk by each member. This is no longer the case in v2, where a DKG result is not approved immediately and is subject to a challenge period. Because of that, we modify the group registry to allow storing non-operable candidate groups first and then register those groups as approved once on-chain approval takes place.
Here we detach the DKG result submission logic from the v1 contract in favor of temporary code that allows testing the DKG loop. Once the interim ends, the real v2 contract binding should be integrated here.
The real group selection is not available yet, so we just return a group whose members all correspond to the operator itself. In effect, the DKG test loop will be performed between members belonging to the same client instance.
If we trigger DKG every 100th block, subsequent DKG instances overlap and cause very high CPU usage.
Random beacon DKG will still use 64 members. We reflect that change in the temporary hardcoded variable.
The approach used so far was problematic because the maximum directory name length is 128 characters, which doesn't leave space for the `_candidate` suffix.
@lukasz-zimnoch lukasz-zimnoch marked this pull request as ready for review July 5, 2022 08:54

return false
}

Member:

Quoting from the PR description:

We removed the event deduplicator for the group selection events as it is no longer relevant for v2. We will need a new logic to handle v2 caveats. This was covered by a TODO in beacon.go

I am not sure if this is the case. As far as I remember, the deduplicator was mostly handling soft forks. We were receiving the same event twice but with a different block number: n and then n+1. The implementation is quite naive: it checks whether the group selection had a chance to finish since the last emitted group selection started event and, if not, it considers the event a duplicate of the last one.

Although this is naive, and we could improve it by, for example, waiting for N blocks and then checking whether the event is included in the last N blocks, I think the current implementation should do the job, even for v2.

@lukasz-zimnoch (Member, Author):

As far as I remember, the deduplicator was mostly handling soft forks

Not only soft forks. The event subscription mechanism from contract bindings can replay the event when fetching past events. This happens more often than soft forks and results in the event having the same block number.

I think the current implementation should do the job, even for v2.

This is something I'm not sure about. In v2, the minimum time between two DkgStarted events is DKG protocol time + publication time + challenge period time. In theory, we can expose a NotifyGroupCreationStarted function from the event deduplicator to decide about the group creation start. However, at the same time, we must be prepared for DkgResultChallenged events that actually restart the process and make the time between two DkgStarted events longer. The deduplicator must handle that. This is why I think we should approach the deduplication logic once challenges and approvals are implemented.

@pdyraga (Member) commented Jul 6, 2022:

Not only soft forks. The event subscription mechanism from contract bindings can replay the event when fetching past events. This happens more often than soft forks and results in the event having the same block number.

Yeah, that's a good point!

The time it takes the DKG protocol to produce the result is 72 blocks ([1], [2]). The subscription monitoring mechanism pulls 100 past blocks by default, and although that could make sense for the relay request subscription, it does not make sense for DKG because after the first 6 blocks of inactivity, the game is over for the given member.

This solution is not ideal but, at the same time, it's the simplest one. I am afraid that dealing with challenges will make this code pretty complicated.

I think that this should work:

  1. Leave the deduplicator for DKG as-is.
  2. Make it clear that NotifyGroupSelectionStarted filters out events only up to minGroupCreationDurationBlocks in the past.
  3. When setting up a subscription for DKG, disable the subscription monitoring loop. For 6 blocks, we would need to monitor pretty aggressively (every 10 seconds or so).

Does it make sense?

Member:

We synced with @lukasz-zimnoch and came up with a better solution:

  1. When setting up a subscription for DKG, disable the subscription monitoring loop.
  2. Use a time-cache-based deduplicator and look at the DKG seed from the event, as sketched below.
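
For reference, a minimal sketch of what a seed-based time cache deduplicator could look like. The type and method names here are assumptions made for illustration; the real client may reuse an existing time cache utility instead.

package beacon

import (
	"math/big"
	"sync"
	"time"
)

// dkgSeedCache deduplicates DkgStarted notifications by the DKG seed carried
// in the event, expiring entries after a configurable TTL.
type dkgSeedCache struct {
	mutex sync.Mutex
	ttl   time.Duration
	seen  map[string]time.Time
}

func newDKGSeedCache(ttl time.Duration) *dkgSeedCache {
	return &dkgSeedCache{ttl: ttl, seen: make(map[string]time.Time)}
}

// notifyDkgStarted returns true if a DkgStarted event with the given seed has
// not been seen within the TTL window and should be processed; replays of the
// same event (same seed) are reported as duplicates.
func (c *dkgSeedCache) notifyDkgStarted(seed *big.Int) bool {
	c.mutex.Lock()
	defer c.mutex.Unlock()

	key := seed.Text(16)
	if seenAt, ok := c.seen[key]; ok && time.Since(seenAt) < c.ttl {
		return false
	}

	c.seen[key] = time.Now()
	return true
}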

@lukasz-zimnoch (Member, Author):

Here it is fbe3062

@lukasz-zimnoch (Member, Author):

And added a TODO about disabling the monitoring loop 7c122d5

// SelectGroup returns the group members for the group generated by
// the given seed. This function can return an error if the relay chain's
// state does not allow for group selection at the moment.
SelectGroup(seed *big.Int) ([]StakerAddress, error)
Member:

#3056 introduces chain.Address; I think we could use it here.

Member:

Just a note for the future implementation with a real SortitionPool: fortunately, there is a function that allows fetching the addresses in one shot.

@lukasz-zimnoch (Member, Author):

Yep, there is a comment about the deprecation of StakerAddress. I can replace it with chain.Address during the cleanup I plan to do in the pkg/beacon package.

Contributor:

When we call SelectGroup and assign its return value in node.go, we call that variable groupMembers, and IMO that name is clearer because we return an array of members. WDYT about naming it SelectGroupMembers instead?

@lukasz-zimnoch (Member, Author):

There will be a follow-up PR that will add a real-world implementation of that function. I'll consider the rename then.

pkg/beacon/relay/node.go (review thread outdated, resolved)
pkg/beacon/relay/node.go (review thread outdated, resolved)
@@ -197,6 +221,12 @@ func (mgri *mockGroupRegistrationInterface) OnGroupRegistered(
panic("not implemented")
}

func (mgri *mockGroupRegistrationInterface) IsGroupRegistered(
Member:

Is it used anywhere?

@lukasz-zimnoch (Member, Author):

Yep, in TestUnregisterStaleGroups and TestUnregisterStaleGroupsSkipLastGroupCheck

Comment on lines 14 to 15
candidateMembershipPrefix = "candidate_membership"
approvedMembershipPrefix = "approved_membership"
Member:

I am wondering whether we shouldn't use the snapshot mechanism from the ECDSA node. This is something our operators are familiar with and I think it makes sense here and in ECDSA:

  • any key material produced by the node lands in snapshot
  • when the group is approved, we copy from snapshot to current (this would be new)

We would need to "snapshot" the state anyway, and I think adding candidate/approved does not make sense if the snapshot is already there (this is essentially the candidate).

Instead of passing the GroupState as a parameter, I would "snapshot" the signer as soon as the key material is produced. Then, once the group is approved on-chain, I would copy from the snapshot to the current storage. This way, we do not have to worry whether the node has been restarted during the challenge period. Also, the "snapshot" would not store the group in the in-memory registry. This is exactly how it works in ECDSA.

@pdyraga (Member) commented Jul 5, 2022:

Thinking more about it, we could have something like:

  • We would snapshot(groupID string, membership *Membership) error as soon as the key material is produced,
  • We would promote(groupID string) when we receive the signal from the chain that the result is approved.

We don't care if and how many times the node was restarted between snapshot and promote. Promote would just copy snapshot/{groupID} to current/{groupID}, as in the sketch below.
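
A sketch of what promote could do at the filesystem level, assuming plain directory paths under a storage root; the names and layout here are illustrative only, and the real implementation would go through the storage abstraction while leaving the snapshot untouched.

package registry

import (
	"fmt"
	"os"
	"path/filepath"
)

// promote copies the key material of an approved group from
// snapshot/{groupID} to current/{groupID}, leaving the snapshot intact.
func promote(storageRoot, groupID string) error {
	snapshotDir := filepath.Join(storageRoot, "snapshot", groupID)
	currentDir := filepath.Join(storageRoot, "current", groupID)

	entries, err := os.ReadDir(snapshotDir)
	if err != nil {
		return fmt.Errorf("could not read snapshot directory: [%v]", err)
	}

	if err := os.MkdirAll(currentDir, 0o700); err != nil {
		return fmt.Errorf("could not create current directory: [%v]", err)
	}

	// Copy each membership file from the snapshot to the current directory.
	for _, entry := range entries {
		data, err := os.ReadFile(filepath.Join(snapshotDir, entry.Name()))
		if err != nil {
			return fmt.Errorf("could not read membership file: [%v]", err)
		}
		err = os.WriteFile(filepath.Join(currentDir, entry.Name()), data, 0o600)
		if err != nil {
			return fmt.Errorf("could not write membership file: [%v]", err)
		}
	}

	return nil
}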

@lukasz-zimnoch (Member, Author):

This is something our operators are familiar with and I think it makes sense here and in ECDSA:
- any key material produced by the node lands in snapshot
- when the group is approved, we copy from snapshot to current (this would be new)

Well, this is not that simple. The snapshot mechanism is write-only. There is no logic in the code to read from it. This is a last-resort mechanism for manual restoration in case something bad happens. What ECDSA does, rather, is:

  • any key material produced by the node lands in snapshot
  • when the public key is published on-chain, it is just normally stored in the regular storage (snapshot is untouched)

If the client shuts down between those steps, the key would not be published at all. That works well in ECDSA because the time between a snapshot and a regular save is quite short. In beacon v2, the time between result publication and approval (the challenge period) will be much longer, so the risk of a client restart is much higher. If we use the snapshot on result publication + store on approval, we don't have automatic restore out of the box after the client restarts. This is a problem because the client must be aware of which candidate groups it is a member of.

In the current approach, we would address this by adding an in-memory cache for candidate groups once we add the result approvals and challenges feature. If we want to go with snapshots, we'll need to make a huge refactor of the entire storage code. This can be done, but then I propose to do it in a separate PR.

Member:

The solution I proposed requires refactoring the storage. The snapshot would still be read-only, but we would need to be able to copy from the snapshot.

On a high level, the idea is not to load candidate members into memory - even into a separate cache - and instead copy the directory when a group is approved. I think this is simpler and has the advantage that the copy is a single disk operation; we do not need to write in a loop, which I think is less risky.

What I do not like in the current implementation is that, looking at just the current directory, it is not possible to tell which groups were approved and which were not. We are also mixing candidate key shares with approved key shares, which is not the end of the world, but I think we could make it cleaner.

If we want to treat snapshots as something special that we write only once and never touch again, that's even better: by obeying this rule in the code, we avoid corrupting snapshot shares in case of a programming error. Looking at the storage level, we would have:

snapshot/
  0xAAA
  0xBBB
  0xCCC
current/
  candidate_0xAAA/
    membership_1
    membership_21
  0xBBB/
    membership_33
    membership_55
  0xCCC/
    membership_64

Then, when 0xAAA gets approved, we mv current/candidate_0xAAA current/0xAAA and we finally land with:

snapshot/
  0xAAA
  0xBBB
  0xCCC
current/
  0xAAA/
    membership_1
    membership_21
  0xBBB/
    membership_33
    membership_55
  0xCCC/
    membership_64

Once the directory is renamed, we load all the members from 0xAAA into the memory.

If we are fine with this layout, I would propose something like:

groups.go

// RegisterCandidateGroup registers that a candidate group with the given
// public key was successfully created.
func (g *Groups) RegisterCandidateGroup(
	signer *dkg.ThresholdSigner,
	groupID string,
) error {
	g.mutex.Lock()
	defer g.mutex.Unlock()

	membership := &Membership{
		Signer:      signer,
		ChannelName: groupID,
	}

	// Candidate groups are only persisted on disk; they are not added to
	// the in-memory registry until the group gets approved on-chain.
	err := g.storage.saveCandidate(membership, groupID)
	if err != nil {
		return fmt.Errorf("could not persist membership to the storage: [%v]", err)
	}

	return nil
}

// ApproveGroup registers that a candidate group with the given
// public key was successfully approved.
func (g *Groups) ApproveGroup(groupID string) error {
	g.mutex.Lock()
	defer g.mutex.Unlock()

	err := g.storage.approveCandidate(groupID)
	if err != nil {
		return fmt.Errorf("could not promote group to approved in the storage: [%v]", err)
	}

	// Only now load the approved group members into the in-memory registry.
	return g.loadGroup(groupID)
}

storage.go

func (ps *persistentStorage) saveCandidate(
	membership *Membership,
	groupID string, // use this as a directory name
) error {
	// (...)
}

func (ps *persistentStorage) approveCandidate(
	groupID string, // use this as a directory name
) error {
	// (...)
}

func (ps *persistentStorage) read(groupID string) (<-chan *Membership, <-chan error) {
 // (...) load a single group
}

// Alternatively, we can combine approveCandidate(groupID string) with read(groupID string):
func (ps *persistentStorage) approveCandidate(
	groupID string, // use this as a directory name
) (<-chan *Membership, <-chan error) {
	// (...)
}

However, this is a bit of work and as you say, it's not the best idea to do it in this PR. If we agree with this approach, I'd clean up the code in this PR and prepare for the new version:

  1. Get rid of GroupState and the candidate/approved prefixes.
  2. Rename save in storage.go to saveCandidate, accept groupID string, and prepend candidate_ to the group identifier when saving.
  3. In groups.go, have only RegisterCandidateGroup.
  4. Modify storage.go readAll to read candidate groups from disk. Make it a temporary change, until we refactor the storage. In the separate PR refactoring the storage, we should clean this up to read only approved groups and, temporarily, approve the group right away after it's saved as a candidate.

WDYT?

Member:

We synced with @lukasz-zimnoch and agreed to revert the changes to write candidate groups to current in this PR, and revisit the approach to storage separately. Given there is a limit on the directory name length, we will probably end up with:

snapshot/
candidate/
approved/

@lukasz-zimnoch (Member, Author):

See 641466e

Member:

Keeping this thread unresolved so that we can refer back to it when reworking the storage.

Comment on lines +308 to +310
// TODO: Fetch from RandomBeacon v2 contract.
groupSize := 64
honestThreshold := 33
Member:

Or hardcode in constants. 🤷🏻

... Or read from the constants of BeaconDkgValidator bindings.
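
If the hardcoded-constants option were chosen, it could be as simple as the sketch below; the constant names are illustrative, not taken from the codebase.

// Beacon DKG parameters matching the values hardcoded above; in the target
// state these would come from the RandomBeacon v2 contract or the
// BeaconDkgValidator bindings instead.
const (
	beaconGroupSize       = 64
	beaconHonestThreshold = 33
)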

pkg/chain/ethereum/ethereum.go (review thread outdated, resolved)
lukasz-zimnoch and others added 2 commits July 6, 2022 08:59
Co-authored-by: Piotr Dyraga <[email protected]>
@pdyraga (Member) commented Jul 6, 2022

There are some problems with running the client without v1 contracts deployed and initialized.

In #3026 we reworked the v1 scripts into the v2 approach, so it is no longer possible to set up v1 contracts locally in the dev environment. Also, the work we are doing in #3012 will set up a testnet based on v2 contracts.

We need to find a way to run a client now. I tried locally and I think we can do some workarounds for M1:

  • Disable ec.initializeBalanceMonitoring(ctx) in connect.go.
  • Rework fetchChainConfig in connect.go to return hardcoded values.
  • Disable stakeMonitor.HasMinimumStake check in start.go.

@pdyraga (Member) commented Jul 6, 2022

One more note: I think we shouldn't close issue #3013 by merging this PR. We still need to integrate with the Sortition Pool group selection and ensure we can run this loop over the network with multiple clients.

@pdyraga (Member) commented Jul 6, 2022

There is one more problem with v1 contracts not deployed. The DKG errors out with:

2022-07-06T13:54:39.750+0200	WARN	keep-dkg	[email protected]/log.go:180	[member:22] DKG result publication process failed [failed to initiate new state [could not check if the result is already submitted: [response [[]] was not long enough to interpret while resolving original error [no contract code at given address]]]]
2022-07-06T13:54:39.892+0200	ERROR	keep-relay	relay/node.go:136	failed to execute dkg: [DKG result publication timed out]

I would consider disabling dkgResult.Publish and decideMemberFate in dkg.go. We could theoretically short-circuit them, but for v2 we need more sophisticated behavior when it comes to result publication.

@lukasz-zimnoch (Member, Author):

I would consider disabling dkgResult.Publish and decideMemberFate in dkg.go. We could theoretically short-circuit them, but for v2 we need more sophisticated behavior when it comes to result publication.

Here it is 10e12b9

@pdyraga pdyraga merged commit f86eab5 into main Jul 6, 2022
@pdyraga pdyraga deleted the dkg-loop branch July 6, 2022 13:13
@pdyraga pdyraga modified the milestones: v2.0.0-m0, v2.0.0-m4 Sep 30, 2022