
DKG test loop #3034

Merged: 29 commits merged into main from dkg-loop on Jul 6, 2022

Conversation

@lukasz-zimnoch (Member) commented Jun 24, 2022

Refs: #3013

This pull request detaches the existing DKG mechanism from Random Beacon v1 contracts and introduces a DKG test loop that is triggered every 500th block and stores the result on disk. That test loop is an interim mechanism that will allow us to integrate with Random Beacon v2 contracts soon. All details about v1 and v2 DKG along with the test loop description are captured below.

DKG in Random Beacon V1

The DKG v1 flow is as follows:

  1. The KeepRandomBeaconOperator contract emits the GroupSelectionStarted event that contains a seed value.
  2. The client reacts to the GroupSelectionStarted event by starting the ticket submission process using the seed value held by the event.
  3. Once the ticket submission ends, the client checks its eligibility by calling the selectedParticipants on-chain function.
  4. If the client is eligible for the given DKG instance, i.e., controls some of the group members, it starts the DKG execution.
  5. Once the group public key is generated, the client attempts to publish it to the KeepRandomBeaconOperator contract and monitors for the DkgResultSubmittedEvent event to confirm the result publication.
  6. Once the key is published, the client stores the key material on disk and holds the group participation info in an in-memory group cache to be ready for incoming relay entry requests.

DKG in Random Beacon V2

In Random Beacon V2, things happen differently:

  1. The RandomBeacon contract emits the DkgStarted event that contains a seed value.
  2. The client checks its eligibility by calling the selectGroup on-chain function exposed by the RandomBeacon contract or directly by the SortitionPool contract.
  3. If the client is eligible for the given DKG instance, i.e., controls some of the group members, it starts the DKG execution.
  4. Once the group public key is generated, the client attempts to publish it to the RandomBeacon contract and monitors for the DkgResultSubmitted event to confirm the result publication.
  5. Once the key is published, the client stores the key material on disk. However, since v2 uses an optimistic DKG result validation with a challenge mechanism, the client cannot consider the group fully operable immediately. It must wait for the result's approval. To handle this, the client must store the key material aside and avoid adding the group participation info to the in-memory group cache right after result publication. A rough sketch of this flow is shown below.
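
For orientation, here is a minimal Go sketch of steps 2 to 4 of this flow, with the candidate-group caveat from step 5 noted in a comment. Only the names taken from this PR (SelectGroup with its seed *big.Int argument and []StakerAddress result, SubmitDKGResult, the DkgStarted/DkgResultSubmitted events) are grounded; the BeaconChain interface shape, DKGResult struct, and the handleDkgStarted/runDKG helpers are hypothetical and not the actual bindings.

package beacon

import (
	"bytes"
	"fmt"
	"math/big"
)

// StakerAddress mirrors the type used by SelectGroup in this PR; the rest of
// this sketch is assumed for illustration only.
type StakerAddress []byte

type DKGResult struct {
	GroupPublicKey []byte
}

type BeaconChain interface {
	SelectGroup(seed *big.Int) ([]StakerAddress, error)
	SubmitDKGResult(result *DKGResult) error
}

// handleDkgStarted sketches the client-side reaction to DkgStarted: check
// eligibility, run DKG off-chain, and publish the result.
func handleDkgStarted(
	chain BeaconChain,
	seed *big.Int,
	operator StakerAddress,
	runDKG func(seed *big.Int, members []StakerAddress) (*DKGResult, error),
) error {
	members, err := chain.SelectGroup(seed)
	if err != nil {
		return fmt.Errorf("could not select group: [%v]", err)
	}

	eligible := false
	for _, member := range members {
		if bytes.Equal(member, operator) {
			eligible = true
			break
		}
	}
	if !eligible {
		// The operator controls no members of this group; nothing to do.
		return nil
	}

	result, err := runDKG(seed, members)
	if err != nil {
		return fmt.Errorf("DKG execution failed: [%v]", err)
	}

	// Publish the result and keep the key material aside as a candidate;
	// the group becomes operable only once the result is approved on-chain
	// after the challenge period.
	return chain.SubmitDKGResult(result)
}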

Interim DKG test loop introduced by this PR

We decided to move from v1 DKG to v2 DKG incrementally. To do so, we took the following steps:

  • We removed everything related to the current group selection (pkg/beacon/relay/groupselection) and ticket submission logic, as it is no longer relevant for v2.
  • We detached the DKG mechanism from the v1 KeepRandomBeaconOperator contract by mocking the contact points between the DKG logic and the contract binding. This resulted in several modifications to the pkg/beacon/relay/chain/chain.go interfaces:
    • The OnGroupSelectionStarted event listener was removed and replaced by the OnDKGStarted event listener with a temporary implementation that emits the DkgStarted event every 500th block (see the sketch after this list).
    • The SubmitDKGResult implementation was removed and replaced with a local implementation that pipes the result directly to the handlers registered by the OnDKGResultSubmitted event listener.
    • The OnDKGResultSubmitted implementation was removed and replaced with a local implementation that registers a handler for results coming from SubmitDKGResult.
    • There is a new function, SelectGroup, that replaces the removed group selection logic. It has a temporary implementation that returns a group consisting of N group members controlled by the calling operator.
  • We changed the event deduplicator logic for DkgStarted events. It's now a seed-based time cache.
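
As an illustration of the interim trigger, here is a rough sketch of an every-500th-block loop. The block-number channel and the handler signature are assumptions made for this sketch; the actual temporary implementation added by the PR may look different.

package beacon

import "math/big"

// dkgTestLoopInterval reflects the interim trigger described above: the
// temporary OnDKGStarted implementation emits DkgStarted every 500th block.
const dkgTestLoopInterval = 500

// triggerDKGTestLoop watches block numbers and invokes the DkgStarted handler
// on every 500th block, deriving the seed from the block number.
func triggerDKGTestLoop(blockNumbers <-chan uint64, onDKGStarted func(seed *big.Int)) {
	for blockNumber := range blockNumbers {
		if blockNumber%dkgTestLoopInterval != 0 {
			continue
		}
		// The real DkgStarted event carries a chain-provided seed; here the
		// block number stands in for it.
		go onDKGStarted(new(big.Int).SetUint64(blockNumber))
	}
}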

The existing group selection code (ticket submission etc.) was relevant for beacon v1 but no longer makes sense for v2. The new group creation process does not use ticket submission and relies on the sortition pool group selection. This changeset is the first step towards supporting the new group creation flow.
@lukasz-zimnoch lukasz-zimnoch self-assigned this Jun 24, 2022
@lukasz-zimnoch lukasz-zimnoch added this to the v2.0.0-m1 milestone Jun 24, 2022
# Conflicts:
#	pkg/chain/local/local.go
Contract bindings are not ready yet, so we mitigate that problem by triggering the DKG loop every 100th block.
`GroupRegistrationInterface`

So far, this function was defined in the `DistributedKeyGenerationInterface`, but it looks like `GroupRegistrationInterface` is a better place.
That logic is now obsolete because the v2 beacon group creation works differently and uses the result challenge and approval mechanisms. Because of that, we remove it from the deduplicator and leave a TODO to implement the new logic once the full new group creation flow is implemented.
threshold

This way we can use values from v2, and we can test how DKG behaves with them.
In v1, the result of a successful DKG was always a fully-operable on-chain group whose key material share was registered on disk by each member. This is no longer the case in v2, where a DKG result is not approved immediately and is subject to a challenge period. Because of that, we modify the group registry to allow storing non-operable candidate groups first and then register those groups as approved once on-chain approval takes place.
Here we detach the DKG result submission logic from the v1 contract in favor of temporary code that allows testing the DKG loop. Once the interim ends, the real v2 contract binding should be integrated here.
The real group selection is not available yet, so we just return a group whose members all correspond to the operator itself. In effect, the DKG test loop will be performed between members belonging to the same client instance.
If we trigger DKG every 100th block, subsequent DKG instances overlap and cause very high CPU usage.
Random beacon DKG will still use 64 members. We reflect that change in the temporary hardcoded variable.
The approach used so far was problematic because the maximum directory name length is 128 characters, which doesn't leave space for the `_candidate` suffix.
@lukasz-zimnoch lukasz-zimnoch marked this pull request as ready for review July 5, 2022 08:54

return false
}

Member:

Quoting from the PR description:

We removed the event deduplicator for the group selection events as it is no longer relevant for v2. We will need a new logic to handle v2 caveats. This was covered by a TODO in beacon.go

I am not sure if this is the case. As far as I remember, the deduplicator was mostly handling soft forks. We were receiving the same event twice but with a different block number: n and then n+1. The implementation is quite naive: it checks whether the group selection had a chance to finish since the last emitted group selection started event and, if not, it considers the event a duplicate of the last one.

Although this is naive, and we could improve it by, for example, waiting for N blocks and then checking whether the event is included in the last N blocks, I think the current implementation should do the job, even for v2.

@lukasz-zimnoch (Member, Author):

As far as I remember, the deduplicator was mostly handling soft forks

Not only soft forks. The event subscription mechanism from contract bindings can replay the event when fetching past events. This happens more often than soft forks and results in the event having the same block number.

I think the current implementation should do the job, even for v2.

This is something I'm not sure about. In v2, the minimum time between two DkgStarted events is DKG protocol time + publication time + challenge period time. In theory, we can expose a NotifyGroupCreationStarted function from the event deduplicator to decide about the group creation start. However, at the same time, we must be prepared for DkgResultChallenged events that actually restart the process and make the time between two DkgStarted events longer. The deduplicator must handle that. This is why I think we should approach the deduplication logic once challenges and approvals are implemented.

@pdyraga (Member) commented Jul 6, 2022:

Not only soft forks. The event subscription mechanism from contract bindings can replay the event when fetching past events. This happens more often than soft forks and results in the event having the same block number.

Yeah, that's a good point!

The time it takes the DKG protocol to produce the result is 72 blocks ([1], [2]). The subscription monitoring mechanism pulls 100 past blocks by default, and although that could make sense for the relay request subscription, it does not make sense for DKG because after the first 6 blocks of inactivity, the game is over for the given member.

This solution is not ideal but, at the same time, it's the simplest one. I am afraid that dealing with challenges will make this code pretty complicated.

I think that this should work:

  1. Leave the deduplicator for DKG as-is.
  2. Make it clear that NotifyGroupSelectionStarted filters out events only up to minGroupCreationDurationBlocks in the past.
  3. When setting up a subscription for DKG, disable the subscription monitoring loop. For 6 blocks, we would need to monitor pretty aggressively (every 10 seconds or so).

Does it make sense?

Member:

We synced with @lukasz-zimnoch and came up with a better solution:

  1. When setting up a subscription for DKG, disable the subscription monitoring loop.
  2. Use a time-cache-based deduplicator and look at the DKG seed from the event, as sketched below.
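
For reference, a minimal sketch of what a seed-based time cache deduplicator could look like. The type and method names here are assumptions made for illustration; the real client may reuse an existing time cache utility instead.

package beacon

import (
	"math/big"
	"sync"
	"time"
)

// dkgSeedCache deduplicates DkgStarted notifications by the DKG seed carried
// in the event, expiring entries after a configurable TTL.
type dkgSeedCache struct {
	mutex sync.Mutex
	ttl   time.Duration
	seen  map[string]time.Time
}

func newDKGSeedCache(ttl time.Duration) *dkgSeedCache {
	return &dkgSeedCache{ttl: ttl, seen: make(map[string]time.Time)}
}

// notifyDkgStarted returns true if a DkgStarted event with the given seed has
// not been seen within the TTL window and should be processed; replays of the
// same event (same seed) are reported as duplicates.
func (c *dkgSeedCache) notifyDkgStarted(seed *big.Int) bool {
	c.mutex.Lock()
	defer c.mutex.Unlock()

	key := seed.Text(16)
	if seenAt, ok := c.seen[key]; ok && time.Since(seenAt) < c.ttl {
		return false
	}

	c.seen[key] = time.Now()
	return true
}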

@lukasz-zimnoch (Member, Author):

Here it is fbe3062

@lukasz-zimnoch (Member, Author):

And added a TODO about disabling the monitoring loop 7c122d5

// SelectGroup returns the group members for the group generated by
// the given seed. This function can return an error if the relay chain's
// state does not allow for group selection at the moment.
SelectGroup(seed *big.Int) ([]StakerAddress, error)
Member:

#3056 introduces chain.Address; I think we could use it here.

Member:

Just a note for the future implementation with a real SortitionPool: fortunately, there is a function that allows fetching the addresses in one shot.

@lukasz-zimnoch (Member, Author):

Yep, there is a comment about the deprecation of StakerAddress. I can replace it with chain.Address during the cleanup I plan to do in the pkg/beacon package.

Contributor:

When we call SelectGroup and assign its return value in node.go, we call that variable groupMembers, and IMO that name is clearer because we return an array of members. WDYT about naming it SelectGroupMembers instead?

@lukasz-zimnoch (Member, Author):

There will be a follow-up PR that will add a real-world implementation of that function. I'll consider the rename then.

pkg/beacon/relay/node.go (review thread outdated, resolved)
pkg/beacon/relay/node.go (review thread outdated, resolved)
@@ -197,6 +221,12 @@ func (mgri *mockGroupRegistrationInterface) OnGroupRegistered(
panic("not implemented")
}

func (mgri *mockGroupRegistrationInterface) IsGroupRegistered(
Member:

Is it used anywhere?

@lukasz-zimnoch (Member, Author):

Yep, in TestUnregisterStaleGroups and TestUnregisterStaleGroupsSkipLastGroupCheck

Comment on lines 14 to 15
candidateMembershipPrefix = "candidate_membership"
approvedMembershipPrefix = "approved_membership"
Member:

I am wondering whether we shouldn't use the snapshot mechanism from the ECDSA node. This is something our operators are familiar with and I think it makes sense here and in ECDSA:

  • any key material produced by the node lands in snapshot
  • when the group is approved, we copy from snapshot to current (this would be new)

We would need to "snapshot" the state anyway, and I think adding candidate/approved does not make sense if the snapshot is already there (this is essentially the candidate).

Instead of passing the GroupState as a parameter, I would "snapshot" the signer as soon as the key material is produced. Then, once the group is approved on-chain, I would copy from the snapshot to the current storage. This way, we do not have to worry whether the node has been restarted during the challenge period. Also, the "snapshot" would not store the group in the in-memory registry. This is exactly how it works in ECDSA.

@pdyraga (Member) commented Jul 5, 2022:

Thinking more about it, we could have something like:

  • We would snapshot(groupID string, membership *Membership) error as soon as the key material is produced,
  • We would promote(groupID string) when we receive the signal from the chain that the result is approved.

We don't care if and how many times the node was restarted between snapshot and promote. Promote would just copy snapshot/{groupID} to current/{groupID}, as in the sketch below.
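
A sketch of what promote could do at the filesystem level, assuming plain directory paths under a storage root; the names and layout here are illustrative only, and the real implementation would go through the storage abstraction while leaving the snapshot untouched.

package registry

import (
	"fmt"
	"os"
	"path/filepath"
)

// promote copies the key material of an approved group from
// snapshot/{groupID} to current/{groupID}, leaving the snapshot intact.
func promote(storageRoot, groupID string) error {
	snapshotDir := filepath.Join(storageRoot, "snapshot", groupID)
	currentDir := filepath.Join(storageRoot, "current", groupID)

	entries, err := os.ReadDir(snapshotDir)
	if err != nil {
		return fmt.Errorf("could not read snapshot directory: [%v]", err)
	}

	if err := os.MkdirAll(currentDir, 0o700); err != nil {
		return fmt.Errorf("could not create current directory: [%v]", err)
	}

	// Copy each membership file from the snapshot to the current directory.
	for _, entry := range entries {
		data, err := os.ReadFile(filepath.Join(snapshotDir, entry.Name()))
		if err != nil {
			return fmt.Errorf("could not read membership file: [%v]", err)
		}
		err = os.WriteFile(filepath.Join(currentDir, entry.Name()), data, 0o600)
		if err != nil {
			return fmt.Errorf("could not write membership file: [%v]", err)
		}
	}

	return nil
}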

@lukasz-zimnoch (Member, Author):

This is something our operators are familiar with and I think it makes sense here and in ECDSA:
- any key material produced by the node lands in snapshot
- when the group is approved, we copy from snapshot to current (this would be new)

Well, this is not that simple. The snapshot mechanism is write-only. There is no logic in the code to read from it. This is a last-resort mechanism for manual restoration in case something bad happens. What ECDSA does, rather, is:

  • any key material produced by the node lands in snapshot
  • when the public key is published on-chain, it is just normally stored in the regular storage (snapshot is untouched)

If the client shuts down between those steps, the key would not be published at all. That works well in ECDSA because the time between a snapshot and a regular save is quite short. In beacon v2, the time between result publication and approval (the challenge period) will be much longer, so the risk of a client restart is much higher. If we use the snapshot on result publication + store on approval, we don't have automatic restore out of the box after the client restarts. This is a problem because the client must be aware of which candidate groups it is a member of.

In the current approach, we would address this by adding an in-memory cache for candidate groups once we add the result approvals and challenges feature. If we want to go with snapshots, we'll need to make a huge refactor of the entire storage code. This can be done, but then I propose to do it in a separate PR.

Member:

The solution I proposed requires refactoring the storage. The snapshot would still be read-only, but we would need to be able to copy from the snapshot.

On a high level, the idea is not to load candidate members into memory - even into a separate cache - and instead copy the directory when a group is approved. I think this is simpler and has the advantage that the copy is a single disk operation; we do not need to write in a loop, which I think is less risky.

What I do not like in the current implementation is that, looking at just the current directory, it is not possible to tell which groups were approved and which were not. We are also mixing candidate key shares with approved key shares, which is not the end of the world, but I think we could make it cleaner.

If we want to treat snapshots as something special that we write only once and never touch again, that's even better: by obeying this rule in the code, we avoid corrupting snapshot shares in case of a programming error. Looking at the storage level, we would have:

snapshot/
  0xAAA
  0xBBB
  0xCCC
current/
  candidate_0xAAA/
    membership_1
    membership_21
  0xBBB/
    membership_33
    membership_55
  0xCCC/
    membership_64

Then, when 0xAAA gets approved, we mv current/candidate_0xAAA current/0xAAA and we finally land with:

snapshot/
  0xAAA
  0xBBB
  0xCCC
current/
  0xAAA/
    membership_1
    membership_21
  0xBBB/
    membership_33
    membership_55
  0xCCC/
    membership_64

Once the directory is renamed, we load all the members from 0xAAA into the memory.

If we are fine with this layout, I would propose something like:

groups.go

// RegisterCandidateGroup registers that a candidate group with the given
// public key was successfully created.
func (g *Groups) RegisterCandidateGroup(
	signer *dkg.ThresholdSigner,
	groupID string,
) error {
	g.mutex.Lock()
	defer g.mutex.Unlock()

	membership := &Membership{
		Signer:      signer,
		ChannelName: groupID,
	}

	// Candidate groups are only persisted on disk; they are not added to
	// the in-memory registry until the group gets approved on-chain.
	err := g.storage.saveCandidate(membership, groupID)
	if err != nil {
		return fmt.Errorf("could not persist membership to the storage: [%v]", err)
	}

	return nil
}

// ApproveGroup registers that a candidate group with the given
// public key was successfully approved.
func (g *Groups) ApproveGroup(groupID string) error {
	g.mutex.Lock()
	defer g.mutex.Unlock()

	err := g.storage.approveCandidate(groupID)
	if err != nil {
		return fmt.Errorf("could not promote group to approved in the storage: [%v]", err)
	}

	// Only now load the approved group members into the in-memory registry.
	return g.loadGroup(groupID)
}

storage.go

func (ps *persistentStorage) saveCandidate(
	membership *Membership,
	groupID string, // use this as a directory name
) error {
	// (...)
}

func (ps *persistentStorage) approveCandidate(
	groupID string, // use this as a directory name
) error {
	// (...)
}

func (ps *persistentStorage) read(groupID string) (<-chan *Membership, <-chan error) {
 // (...) load a single group
}

// Alternatively, we can combine approveCandidate(groupID string) with read(groupID string):
func (ps *persistentStorage) approveCandidate(
	groupID string, // use this as a directory name
) (<-chan *Membership, <-chan error) {
	// (...)
}

However, this is a bit of work and as you say, it's not the best idea to do it in this PR. If we agree with this approach, I'd clean up the code in this PR and prepare for the new version:

  1. Get rid of GroupState and the candidate/approved prefixes.
  2. Rename save in storage.go to saveCandidate, accept groupID string, and prepend candidate_ to the group identifier when saving.
  3. In groups.go, have only RegisterCandidateGroup.
  4. Modify storage.go readAll to read candidate groups from disk. Make it a temporary change, until we refactor the storage. In the separate PR refactoring the storage, we should clean this up to read only approved groups and, temporarily, approve the group right away after it's saved as a candidate.

WDYT?

Member:

We synced with @lukasz-zimnoch and agreed to revert the changes to write candidate groups to current in this PR, and revisit the approach to storage separately. Given there is a limit on the directory name length, we will probably end up with:

snapshot/
candidate/
approved/

@lukasz-zimnoch (Member, Author):

See 641466e

Member:

Keeping this thread unresolved so that we can refer back to it when reworking the storage.

Comment on lines +308 to +310
// TODO: Fetch from RandomBeacon v2 contract.
groupSize := 64
honestThreshold := 33
Member:

Or hardcode in constants. 🤷🏻

... Or read from the constants of BeaconDkgValidator bindings.
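
If the hardcoded-constants option were chosen, it could be as simple as the sketch below; the constant names are illustrative, not taken from the codebase.

// Beacon DKG parameters matching the values hardcoded above; in the target
// state these would come from the RandomBeacon v2 contract or the
// BeaconDkgValidator bindings instead.
const (
	beaconGroupSize       = 64
	beaconHonestThreshold = 33
)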

pkg/chain/ethereum/ethereum.go (review thread outdated, resolved)
lukasz-zimnoch and others added 2 commits July 6, 2022 08:59
Co-authored-by: Piotr Dyraga <[email protected]>
@pdyraga (Member) commented Jul 6, 2022

There are some problems with running the client without v1 contracts deployed and initialized.

In #3026 we reworked the v1 scripts into the v2 approach, so it is no longer possible to set up v1 contracts locally in the dev environment. Also, the work we are doing in #3012 will set up a testnet based on v2 contracts.

We need to find a way to run a client now. I tried locally and I think we can do some workarounds for M1:

  • Disable ec.initializeBalanceMonitoring(ctx) in connect.go.
  • Rework fetchChainConfig in connect.go to return hardcoded values.
  • Disable stakeMonitor.HasMinimumStake check in start.go.

@pdyraga (Member) commented Jul 6, 2022

One more note: I think we shouldn't close issue #3013 by merging this PR. We still need to integrate with the Sortition Pool group selection and ensure we can run this loop over the network with multiple clients.

@pdyraga (Member) commented Jul 6, 2022

There is one more problem with v1 contracts not deployed. The DKG errors out with:

2022-07-06T13:54:39.750+0200	WARN	keep-dkg	[email protected]/log.go:180	[member:22] DKG result publication process failed [failed to initiate new state [could not check if the result is already submitted: [response [[]] was not long enough to interpret while resolving original error [no contract code at given address]]]]
2022-07-06T13:54:39.892+0200	ERROR	keep-relay	relay/node.go:136	failed to execute dkg: [DKG result publication timed out]

I would consider disabling dkgResult.Publish and decideMemberFate in dkg.go. We could theoretically short-circuit them, but for v2 we need more sophisticated behavior when it comes to result publication.

@lukasz-zimnoch (Member, Author):

I would consider disabling dkgResult.Publish and decideMemberFate in dkg.go. We could theoretically short-circuit them, but for v2 we need more sophisticated behavior when it comes to result publication.

Here it is 10e12b9

@pdyraga pdyraga merged commit f86eab5 into main Jul 6, 2022
@pdyraga pdyraga deleted the dkg-loop branch July 6, 2022 13:13
@pdyraga pdyraga modified the milestones: v2.0.0-m0, v2.0.0-m4 Sep 30, 2022