-
Notifications
You must be signed in to change notification settings - Fork 789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disabling Strategy Implementers Guide #2955
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great! Approving modulo nits + CI checks on md are failing
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amazing work @Overkillus ! Thank you!
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
This pull request has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/finality-stall-on-kusama-15-02-2024-post-mortem/6398/9 |
|
||
The biggest issue was that chilling in case of honest node slashes could lead to honest validators being somewhat quickly (next era) pushed out of the next validator set. This retains the validator set size but gives an edge to attackers as they can more easily win slots in the NPoS election. | ||
|
||
Disabling generally makes automatic-chilling after slash events redundant and disabled nodes can be considered for re-election which ensures that we do not push honest validators out of the validator set. ([**Point 6.**](#system-overview)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chilling was meant to protect nominators from being slashed further if there is a bug or issue with the validator node setup. An honest validator would fix the issue and unchill itself (of course this does not protect against a malicious validator).
Disabling nodes would not protect nominators as the validator would get re-elected in the next era and may get slashed for the same offence again. Chilling works similarly but its up to validator to signal that they have fixed the issue that got them slashed and ready to be considered for re-election. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does make sense. I updated the doc to mention what automatic chilling was achieving and what was its goal. Altough despite achieving its goal in ideal scenarios (no attackers, no lazy nominators), it opened new vulnerabilities for attackers. The biggest issue was that chilling in case of honest node slashes (potentially by abusing PVF nondeterminism) could lead to honest validators being somewhat quickly (next era) pushed out of the next validator set. This retains the validator set size but gives an edge to attackers as they can more easily win slots in the NPoS election.
With gas metering this would be a good feature, otherwise it is risky.
|
||
Chilling had a myriad of problems. It assumes that validators and nominators remain very active and monitor everything. If a validator got slashed he was getting automatically chilled and his nominators were getting unsubscribed. This was an issue because of minor non-malicious slashes due to node operator mistakes or small bugs. Validators got those bugs fixed quickly and were reimbursed but nominator had to manually re-subscribe to the validator, which they often postponed for very lengthy amounts of time most likely due to simply not checking their stake. This forced unsubscribing of nominators was later removed but it leads back to the original quoted issue of offending validators simply re-registering their interest and continuing to attack the network. | ||
|
||
The biggest issue was that chilling in case of honest node slashes could lead to honest validators being somewhat quickly (next era) pushed out of the next validator set. This retains the validator set size but gives an edge to attackers as they can more easily win slots in the NPoS election. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An honest node slash may happen again if the validator is re-elected in the next era and the underlying issue isn't fixed? If the validator fixes the issue and unchills itself quickly, they will still be considered for the next election (if they make it before the snapshot which is taken 1 session before the election).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but due to disabling you will only be slashed once per era. Fixing determinism issues is not something operators can easily do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 2 cents here is that the ultimate implementation in substrate should be not a static ChillingStrategy
/DisablingStrategt
baked into the code but an open function like fn should_chill(active: u32, inactive:u32) -> bool
whereby this it can inspect the number of active validators and the number of chilled validators and make a decision based on that.
Then, the exact value and parameter, I would be happy to delegate to research and experiment to figure out.
My opinion, in an ideal world, is:
- We should go back to un-nominating nominators. Through talks with Al, Jonas and Jeff I have heard multiple times that the incentives in NPoS are all at risk because we have lazy nominators. This is a problem to be solved, not the protocol adapting to it. If you are a nominator, you MUST be an active network participant. And luckily, if you don't, we have the primitive for you now: pools. So I would strive towards re-introducing the nominator auto-removal upon slash, but in a lazy and scalable way.
- Validators themselves should chilled upon slash as well.
- Note that if we chill too many validators and nominators, it is not am existential risk because in
pallet-election-provider-multi-phase
we already have a notion of "minimum stake" for any validator set that wants to be enacted. A super weak validator set, because most have been chilled, will not pass this gate. In this case, we will stick to the previous validator set. - Then, for the matter of disabling validators, I think we should do it with a function implemented based on the number that have already been disabled. If none are disabled (a solo-slash), 100% disable it. Same up to 1/3. We should gradually tighten the threshold and stop disabling at around 2/3. This is not sound to me, I am sure it can be attacked such that the last 1/3 are malicious, but it is simple enough to at least prevent the common scenario of everyone getting slashed because of a bug and the network being left without any validators, which is why I am proposing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should go back to un-nominating nominators.
I'd be very happy to explore this ONCE we (hopefully) get PolkaVM deterministic gas metering. Otherwise, especially when we enable minor slashes for approval voters on the wrong side of the dispute PVF nondeterminism can cause honest nodes to be slashed which in combination with chilling could be fatally abused to push out honest nodes out of the consenus.
Note that if we chill too many validators and nominators, it is not am existential risk because in pallet-election-provider-multi-phase we already have a notion of "minimum stake"
That unfortunately doesn't fix the issue. We have 900+ validators in Polkadot and active set is roughly 300. You can force chill a bunch of honest nodes and still not trigger the minimum stake route and every chilled honest validator makes it so malicious validators have a better chance of getting elected.
We should gradually tighten the threshold and stop disabling at around 2/3.
parachain consenus assumes 2/3 nodes are responsive and we could not operate with 2/3 disabled. 1/3 is the limit for disabling or it would require significant changes to the protocol AFAIK
All and all... The current disabling/chilling/punishing strategy is not perfect but it aims to reduce existential threats. Once we get time disputes or deterministic gas metering we can tighten the rules again and not fear about punishing honest nodes. For now we have to be lenient and focus on protecting the vitals and pushing out validators is too risky as of now. In the end they will be slashed if malicious. Costs will be paid in full. But it gives the network enough time to react.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should go back to un-nominating nominators. Through talks with Al, Jonas and Jeff I have heard multiple times that the incentives in NPoS are all at risk because we have lazy nominators. This is a problem to be solved, not the protocol adapting to it. If you are a nominator, you MUST be an active network participant. And luckily, if you don't, we have the primitive for you now: pools. So I would strive towards re-introducing the nominator auto-removal upon slash, but in a lazy and scalable way.
I don't get what you are saying here. By not removing the nominators, they will get slashed over and over again. while by removing them, they are safe after just one slash. The former incentivizes to pay attention, that latter does not.
Honestly, I think the system we came up with is pretty good. I don't see any real downsides at all.
polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md
Outdated
Show resolved
Hide resolved
|
||
Chilling is process of a validator dropping theirs intent to validate. This removes them from the upcoming NPoS solutions and effectively pushes them out of the validator set as quickly as of the next era (or 2 era in case of late offenses). All nominators of that validator were also getting unsubscribed from that validator. Validator could re-register their intent to validate at any time. | ||
|
||
Chilling had a myriad of problems. It assumes that validators and nominators remain very active and monitor everything. If a validator got slashed he was getting automatically chilled and his nominators were getting unsubscribed. This was an issue because of minor non-malicious slashes due to node operator mistakes or small bugs. Validators got those bugs fixed quickly and were reimbursed but nominator had to manually re-subscribe to the validator, which they often postponed for very lengthy amounts of time most likely due to simply not checking their stake. This forced unsubscribing of nominators was later removed but it leads back to the original quoted issue of offending validators simply re-registering their interest and continuing to attack the network. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and his nominators were getting unsubscribed
as noted, this is not correct anymore.
An underlying flaw that is being shown here is that, do we have any tests rn that forks polkadot and simulates a slash in it?
I think both for staking devs (@Ank4n and @gpestana) and the parachain runtime team this is super important to have.
We have built pallet-root-offence
exactly with this idea in mind, but never exercised it AFAIK. It should be possible with chopsticks to run an altered Polkadot runtime that is the same + this pallet, then trigger a slash in the UI.
Again, I think this is a super important scenario to have ready both as a manual test for monkey testing, and automated for integration testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as noted, this is not correct anymore.
Which is good. We assume this is not the case. Otherwise we would still be having the problem that determinism issues could be used to kick out honest nodes.
I agree about more tests. But I think what @Overkillus is saying is not conflicting with what you are saying:
He is saying that it used to be like that (not any more), and you are saying the same - right?
Closes #1966, #1963 and #1962. Disabling strategy specification [here](#2955). (Updated 13/02/2024) Implements: * validator disabling for a whole era instead of just a session * no more than 1/3 of the validators in the active set are disabled Removes: * `DisableStrategy` enum - now each validator committing an offence is disabled. * New era is not forced if too many validators are disabled. Before this PR not all offenders were disabled. A decision was made based on [`enum DisableStrategy`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/primitives/staking/src/offence.rs#L54). Some offenders were disabled for a whole era, some just for a session, some were not disabled at all. This PR changes the disabling behaviour. Now a validator committing an offense is disabled immediately till the end of the current era. Some implementation notes: * `OffendingValidators` in pallet session keeps all offenders (this is not changed). However its type is changed from `Vec<(u32, bool)>` to `Vec<u32>`. The reason is simple - each offender is getting disabled so the bool doesn't make sense anymore. * When a validator is disabled it is first added to `OffendingValidators` and then to `DisabledValidators`. This is done in [`add_offending_validator`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/slashing.rs#L325) from staking pallet. * In [`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623) the `end_session` also calls [`end_era`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L490) when an era ends. In this case `OffendingValidators` are cleared **(1)**. * Then in [`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623) `DisabledValidators` are cleared **(2)** * And finally (still in `rotate_session`) a call to [`start_session`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L430) repopulates the disabled validators **(3)**. * The reason for this complication is that session pallet knows nothing abut eras. To overcome this on each new session the disabled list is repopulated (points 2 and 3). Staking pallet knows when a new era starts so with point 1 it ensures that the offenders list is cleared. --------- Co-authored-by: ordian <[email protected]> Co-authored-by: ordian <[email protected]> Co-authored-by: Maciej <[email protected]> Co-authored-by: Gonçalo Pestana <[email protected]> Co-authored-by: Kian Paimani <[email protected]> Co-authored-by: command-bot <> Co-authored-by: Ankan <[email protected]>
Closes #1966, #1963 and #1962. Disabling strategy specification [here](#2955). (Updated 13/02/2024) Implements: * validator disabling for a whole era instead of just a session * no more than 1/3 of the validators in the active set are disabled Removes: * `DisableStrategy` enum - now each validator committing an offence is disabled. * New era is not forced if too many validators are disabled. Before this PR not all offenders were disabled. A decision was made based on [`enum DisableStrategy`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/primitives/staking/src/offence.rs#L54). Some offenders were disabled for a whole era, some just for a session, some were not disabled at all. This PR changes the disabling behaviour. Now a validator committing an offense is disabled immediately till the end of the current era. Some implementation notes: * `OffendingValidators` in pallet session keeps all offenders (this is not changed). However its type is changed from `Vec<(u32, bool)>` to `Vec<u32>`. The reason is simple - each offender is getting disabled so the bool doesn't make sense anymore. * When a validator is disabled it is first added to `OffendingValidators` and then to `DisabledValidators`. This is done in [`add_offending_validator`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/slashing.rs#L325) from staking pallet. * In [`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623) the `end_session` also calls [`end_era`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L490) when an era ends. In this case `OffendingValidators` are cleared **(1)**. * Then in [`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623) `DisabledValidators` are cleared **(2)** * And finally (still in `rotate_session`) a call to [`start_session`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L430) repopulates the disabled validators **(3)**. * The reason for this complication is that session pallet knows nothing abut eras. To overcome this on each new session the disabled list is repopulated (points 2 and 3). Staking pallet knows when a new era starts so with point 1 it ensures that the offenders list is cleared. --------- Co-authored-by: ordian <[email protected]> Co-authored-by: ordian <[email protected]> Co-authored-by: Maciej <[email protected]> Co-authored-by: Gonçalo Pestana <[email protected]> Co-authored-by: Kian Paimani <[email protected]> Co-authored-by: command-bot <> Co-authored-by: Ankan <[email protected]>
The CI pipeline was cancelled due to failure one of the required jobs. |
This reverts commit 82b48c4.
Closes paritytech#1966, paritytech#1963 and paritytech#1962. Disabling strategy specification [here](paritytech#2955). (Updated 13/02/2024) Implements: * validator disabling for a whole era instead of just a session * no more than 1/3 of the validators in the active set are disabled Removes: * `DisableStrategy` enum - now each validator committing an offence is disabled. * New era is not forced if too many validators are disabled. Before this PR not all offenders were disabled. A decision was made based on [`enum DisableStrategy`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/primitives/staking/src/offence.rs#L54). Some offenders were disabled for a whole era, some just for a session, some were not disabled at all. This PR changes the disabling behaviour. Now a validator committing an offense is disabled immediately till the end of the current era. Some implementation notes: * `OffendingValidators` in pallet session keeps all offenders (this is not changed). However its type is changed from `Vec<(u32, bool)>` to `Vec<u32>`. The reason is simple - each offender is getting disabled so the bool doesn't make sense anymore. * When a validator is disabled it is first added to `OffendingValidators` and then to `DisabledValidators`. This is done in [`add_offending_validator`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/slashing.rs#L325) from staking pallet. * In [`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623) the `end_session` also calls [`end_era`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L490) when an era ends. In this case `OffendingValidators` are cleared **(1)**. * Then in [`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623) `DisabledValidators` are cleared **(2)** * And finally (still in `rotate_session`) a call to [`start_session`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L430) repopulates the disabled validators **(3)**. * The reason for this complication is that session pallet knows nothing abut eras. To overcome this on each new session the disabled list is repopulated (points 2 and 3). Staking pallet knows when a new era starts so with point 1 it ensures that the offenders list is cleared. --------- Co-authored-by: ordian <[email protected]> Co-authored-by: ordian <[email protected]> Co-authored-by: Maciej <[email protected]> Co-authored-by: Gonçalo Pestana <[email protected]> Co-authored-by: Kian Paimani <[email protected]> Co-authored-by: command-bot <> Co-authored-by: Ankan <[email protected]>
Closes #1961