Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD 0072: Feature Gate Threshold Automation #72

Open
wants to merge 30 commits into
base: main
Choose a base branch
from
Open
Changes from 22 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
013d0e7
feature-gate-threshold-automation init
buffalojoec Oct 12, 2023
9b601c1
bump to 0072
buffalojoec Oct 19, 2023
24bc2c5
address minor corrections
buffalojoec Oct 19, 2023
fa1f218
revise based on related SIMDs
buffalojoec Dec 15, 2023
4a65303
add garbage collection process
buffalojoec Dec 15, 2023
a7d6a49
add PDA details
buffalojoec Dec 15, 2023
843c83e
revise signal cadence
buffalojoec Dec 15, 2023
e513e82
add instruction layout example
buffalojoec Dec 15, 2023
d705e54
add state layout examples
buffalojoec Dec 15, 2023
a63abbb
add multi-sig activation gate
buffalojoec Dec 15, 2023
7a6c3fc
add additional elaboration
buffalojoec Jan 25, 2024
23f1b24
init new version of 0072
buffalojoec Feb 5, 2024
f830338
add reusable support signal PDA
buffalojoec Mar 14, 2024
2fb901b
update to use validator epoch stake syscall
buffalojoec Mar 14, 2024
ab95740
add updates from SIMD 0133
buffalojoec May 9, 2024
1a8f21a
add some context and wording
buffalojoec May 13, 2024
ea129ba
add c struct for PDA layout
buffalojoec May 13, 2024
bb1c797
hard-coded threshold
buffalojoec May 16, 2024
b85755d
change deadline to 4500 slots
buffalojoec May 16, 2024
66e6371
clarity suggestions
buffalojoec May 22, 2024
58cb436
add more program specification
buffalojoec May 22, 2024
da132f7
one PDA per epoch
buffalojoec Jun 3, 2024
4de5abd
remove slots remaining constraint
buffalojoec Jun 3, 2024
072344f
Update proposals/0072-feature-gate-threshold-automation.md
buffalojoec Jun 3, 2024
9296061
instruction clarity
buffalojoec Jun 3, 2024
458b3ac
update `StageFeatureForActivation` instruction
buffalojoec Aug 14, 2024
d9f399d
update `SignalSupportForStagedFeatures` instruction
buffalojoec Aug 14, 2024
b87cadb
update runtime step
buffalojoec Aug 14, 2024
6e74a91
update state init requirement
buffalojoec Dec 9, 2024
4591f55
mark as idea
buffalojoec Dec 9, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
265 changes: 265 additions & 0 deletions proposals/0072-feature-gate-threshold-automation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,265 @@
---
simd: '0072'
title: Feature Gate Threshold Automation
authors:
- Tyera Eulberg
- Joe Caulfield
category: Standard
type: Core
status: Draft
created: 2024-01-25
feature: (fill in with feature tracking issues once accepted)
---

## Summary

This SIMD outlines a proposal for automating the feature activation process
based on a stake-weighted support threshold, rather than manual human action.

With this new process, contributors no longer have to assess stake support for a
feature before activation. Instead, the assessment is done by the runtime.

## Motivation

Feature gates wrap new cluster functionality, and typically change the rules of
consensus. As such, a feature gate needs to be supported by a strong majority
of cluster stake when it is activated, or else it risks partitioning the
network. The current feature-gate process involves two steps:

1. An individual key-holder stages a feature gate for activation
2. The runtime automatically activates the feature on the next epoch boundary

The key-holder is the one who *manually* (with the help of the solana-cli)
assesses the amount of stake that recognizes the feature and decides whether
it is safe to activate. This is obviously brittle and subject to human error.

If instead the runtime was to assess this stake support for activating a
feature, this would eliminate the key-holder's responsibility to asses stake
support, reducing the risk of human error.

In a world where multiple clients will aim to push features and seek to agree on
their activation, a more automated and secure process will be extremely
beneficial.

## New Terminology

- **Feature Gate program:** The Core BPF program introduced in
[SIMD 0089](./0089-programify-feature-gate-program.md)
that will own all feature accounts.
- **Staged Features PDA:** The PDA under the Feature Gate program used to track
features submitted for activation.
- **Get Epoch Stake Syscall:** The new syscall introduced in
[SIMD 0133](./0133-syscall-get-epoch-stake.md)
that returns the current epoch stake for a given vote account address.

## Detailed Design

This proposal outlines a new feature activation process. The new process
includes changes to the runtime's feature activation process as well as the
Core BPF Feature Gate program proposed in
[SIMD 0089](./0089-programify-feature-gate-program.md).

The new process will utilize the Feature Gate program to enable the runtime to
activate staged features that meet the necessary stake support while preventing
the activation of those that do not.

Two new instructions, as well as two types of PDAs, will be added to the
Feature Gate program. They are detailed in this proposal.

The new process is comprised of the following steps:

1. **Feature Creation:** Contributors create feature accounts as they do now.
2. **Staging Features for Activation:** In some epoch `N-1`, a multi-signature
authority stages for activation some or all of the features created in step
1, to be activated at the end of the *next epoch* (epoch `N`).
3. **Signaling Support for Staged Features:** During the next epoch (epoch `N`),
validators signal which of the staged feature-gates they support in their
software.
4. **Feature Activation:** At the end of epoch `N`, the runtime activates the
feature-gates that have the required stake support.

### Step 1: Feature Creation

The first step is creation of a feature account, done by submitting a
transaction containing System instructions to fund, allocate, and assign the
feature account to `Feature111111111111111111111111111111111111`.

This step is unchanged from its original procedure.

### Step 2: Staging Features for Activation

A multi-signature authority, comprised of key-holders from Anza and
possibly other validator client teams in the future, will have the authority to
stage created features for activation.
In the future, this authority could be replaced by validator governance.

The multi-signature authority stages a feature for activation by invoking a new
Feature Gate program instruction: `StageFeatureForActivation`. This instruction
expects:

buffalojoec marked this conversation as resolved.
Show resolved Hide resolved
- Data: The feature ID
- Accounts:
- Staged Features PDA: writable
- Multi-signature authority: signer

A PDA will be created for each epoch in which features are staged to be
activated. If no features are staged for a given epoch, that epoch's
corresponding PDA is not created.

These PDAs will not be garbage collected and can be referenced for historical
purposes.

When the first feature for an epoch is staged, the PDA is created. The
`StageFeatureForActivation` processor will debit from the multisig enough
buffalojoec marked this conversation as resolved.
Show resolved Hide resolved
lamports to allocate the new Staged Features PDA.

The address of the Staged Features PDA for a given epoch is derived as follows:

```
"staged_features" + < epoch number >
buffalojoec marked this conversation as resolved.
Show resolved Hide resolved
```

buffalojoec marked this conversation as resolved.
Show resolved Hide resolved
The data for the Staged Features PDA will be structured as follows:

```c
#define FEATURE_ID_SIZE 32
#define MAX_FEATURES 8

/**
* A Feature ID and its corresponding stake support, as signalled by validators.
*/
typedef struct {
/** Feature identifier (32 bytes). */
uint8_t feature_id[FEATURE_ID_SIZE];
/** Stake support (little-endian u64). */
uint8_t stake[8];
} FeatureStake;

/**
* Staged features for activation.
*/
typedef struct {
/**
* Features staged for activation at the end of the current epoch, with
* their corresponding signalled stake support.
*/
FeatureStake current_feature_stakes[MAX_FEATURES];
} StagedFeatures;
```

`StageFeatureForActivation` may only be invoked during epoch `N-1`, where `N` is
the epoch number used to derive the Staged Features program-derived address.
This is checked by the Feature Gate program using the Clock sysvar.

Features staged during epoch `N-1` are staged to be activated at the end of
epoch `N`.

`StageFeatureForActivation` will add the provided feature ID to the list with
stake support initialized to `0`.

As depicted in the above layout, a maximum of 8 features can be staged for a
given epoch.

### Step 3: Signaling Support for Staged Features

With an on-chain reference point to determine the features staged for activation
for a particular epoch, nodes will signal their support for the staged features
supported by their software.

A node signals its support for staged features by invoking another new Feature
Gate program instruction: `SignalSupportForStagedFeatures`. This instruction
expects:

- Data: A `u8` bit mask of the staged features.
buffalojoec marked this conversation as resolved.
Show resolved Hide resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also need to include the epoch? As the list of staged features changes epoch-to-epoch, we need a way of ensuring that the bit mask in each validator support signal transaction is interpreted for the correct epoch. If the transaction is delayed or is executed in a different epoch than the one it was sent, for whatever reason, then the support signal will be mis-interpreted.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would guess the Staged Features PDA supplied should indicate the intended epoch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in the PDA for the validator's support signal. When they send a bitmask, it gets recorded in their signal PDA, and if they send again, it's overwritten. The current epoch is checked via Clock sysvar when the program is invoked with this instruction.

- Accounts:
- Staged Features PDA: writable
- Vote account: signer

A bit mask is used as a compressed ordered list of indices. This has two main
benefits:

- Minimizes transaction size for nodes, requiring only one byte to describe
256 bytes (8 * 32) of data. The alternative would be sending `n` number of
32-byte addresses in each instruction.
- Reduce compute required to search the list of staged features for a matching
address. The bitmask's format provides the Feature Gate program with the
indices it needs to store supported stake without searching through the staged
features for a match.

A `1` bit represents support for a feature. For example, for staged features
`[A, B, C, D, E, F, G, H]`, if a node wishes to signal support for all features
except `E` and `H`, their `u8` value would be 246, or `11110110`.

The `SignalSupportForStagedFeatures` instruction processor will provide the
vote account's address to the `GetEpochStake` syscall to retrieve the stake
delegated to that vote account for the epoch. Then, using the `1` values
provided in the bitmask, the processor will add this stake value to each
corresponding feature ID's stake support in the Staged Features PDA.

Just like the `StageFeatureForActivation` instruction, the Clock sysvar will be
used to ensure the Staged Features PDA corresponding to the *current* epoch was
provided.
buffalojoec marked this conversation as resolved.
Show resolved Hide resolved

Nodes should submit a transaction containing this instruction:

- Any time at least 4500 slots before the end of the epoch.
- During startup after any reboot.

Transactions sent too late (< 4500 slots before the epoch end) will be rejected
by the Feature Gate program.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need any deadline/restriction? I see the program cache being alluded to, but I don't see any specification that changes to the Staged Features PDA must begin or conclude at some point earlier than the epoch boundary. It's not obvious why the Feature Gate program can't continue to modify the Staged Features PDA up to the epoch boundary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting to explicitly explain the reasoning for this constraint in the SIMD, or are you asking why we actually need such a constraint? In either case, I agree it should be more explicit in the proposal.

It seems to me that @Lichtso intends to increase the cache preparation phase's initiation to be sooner in the epoch (comment), so we probably need a more concrete explanation as to why the Feature Gate program should have this constraint.

In this proposal, it was initially set to 128 slots remaining in the epoch because the cache preparation stage begins at a maximum of 128 slots before the end of the epoch, as computed here.

It's worth calling out that if we configure this constraint in the Feature Gate program and then later need to adjust it for an earlier cache preparation start, we'll be bound to using a feature gate for that change to the program.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asking why we need a constraint at all. The cache being referenced is supposedly an agave-specific implementation, not part of the runtime protocol. If feature activation needs to work around any aspect of cache behavior, that seems like (a) a problem; and (b) a thing that definitely needs to be specified for all clients.

Copy link
Contributor Author

@buffalojoec buffalojoec May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's really nothing stopping the cache from being able to anticipate the upcoming feature environment by querying the PDA under the Feature Gate program that would tell it what the current stake support is for each feature. It's actually arguably equally or more deterministic than the way it's done now.

The cache could run the calculation at its current 128 slots before epoch boundary and probably get a pretty good understanding of which features have enough stake support. Increasing that slot buffer beyond 128 could just mean increasing the likelihood that not enough stake has been signaled for a feature. Perhaps this is a compromise the cache implementation should consider? After all, I believe recompilation is an optimization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a commit removing the "slots remaining" constraint, since I think the above is a sensible direction to head in. If there's explicit opposition, it can be added back in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my imprecise formulation, I meant scheduled for activation.
If a feature is scheduled for activation in some slot close to the boundary, then it is possible that the TX is not yet rooted. Thus, some validator nodes will not have it scheduled, and will not activate it on the boundary.

Copy link
Contributor

@CriesofCarrots CriesofCarrots Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain how this is different from the cluster coming to consensus on any other tx/block on which other logic depends.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For features which only affect their own fork this is irrelevant. But for features which change validator global behavior this would break. And I am not sure if there is much to be gained from making exceptions or harving different feature types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just adding an additional point of clarity here, if it helps.

Right now, in Agave, the cache is calling compute_active_feature_set, which will compute the feature set based on the on-chain accounts that have been created under Feature11111... (ie. "activated"). The cache does this ~128 slots before the epoch boundary to prepare the cache entries for the anticipated upcoming environment.

With the proposed system in place, the cache can still call compute_active_feature_set the same way it does now. The function will just compute the anticipated feature set from the proposed PDA (with stake support) rather than just the presence of on-chain accounts.

In almost every way it's the same as what we have now. The subtle difference is that someone can no longer activate a feature after the cache has done its feature-set computation, but nodes could still cast votes (signals) on a feature, bringing it to "eligible" status. This edge case exists today and just causes cache preparation/recompilation to be moot.

What @CriesofCarrots is suggesting is that we don't explicitly architect this mechanism around the cache, which is sensible, especially considering the cache preparation may change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lichtso do you still have concerns here? I'd like to move this SIMD along.


If a node does not send this transaction or it is rejected, their stake is not
tallied. This is analogous to a node sending this transaction at a valid slot
in the epoch signalling support for zero features.

If a feature is revoked, the list of staged features will not change, and nodes
may still signal support for this feature. However, the runtime will not
activate this feature if its corresponding feature account no longer exists
on-chain.

### Step 4: Feature Activation

During the epoch rollover, the runtime must load the Staged Features PDA
and calculate the stake - as a percentage of the total epoch stake - in support
for each feature ID to determine which staged features to activate.
buffalojoec marked this conversation as resolved.
Show resolved Hide resolved

Every feature whose stake support meets the required threshold must be
activated. This threshold will be hard-coded in the runtime to 95% initially,
but future iterations on the process could allow feature key-holders to set a
custom threshold per-feature.

As mentioned previously, if a feature was revoked, it will no longer exist
on-chain, and therefore will be not activated by the runtime, regardless of
calculated stake support.

If a feature is not activated, either because it has been revoked or it did not
meet the required stake support, it must be resubmitted according to Step 2.

## Alternatives Considered

## Impact

This new process for activating features directly impacts core contributors and
validators.

Core contributors will no longer bear the responsibility of ensuring the proper
stake supports their feature activation. However, this proposal does not include
a mechanism for overriding or customizing the stake requirement. This capability
should be proposed separately.

Validators will be responsible for signaling their vote using a transaction
which they've previously not included in their process. They also will have a
more significant impact on feature activations if they neglect to upgrade their
software version.

## Security Considerations

This proposal increases security for feature activations by removing the human
element from ensuring the proper stake supports a feature.

This proposal could also potentially extend the length of time required for
integrating feature-gated changes, which may include security fixes. However,
the feature-gate process is relatively slow in its current state, and neither
the current process or this proposed process would have any implications for
critical, time-sensitive issues.

Loading