[DNM] approval-voting: approvals signature checks optimizations #7482
Conversation
Signed-off-by: Alexandru Gheorghe <[email protected]>
... to ease up quick testing in versi Signed-off-by: Alexandru Gheorghe <[email protected]>
I have a couple of concerns regarding the proposed change in the issue #7458.
The first one is minor. The current state machine in approval-distribution was carefully designed by @rphmeier with spam/memory growth (DoS) protection in mind with no malicious/wasteful sequence of messages leading to a non-negative reputation change/unbounded growth. I'm not sure this is the case now, especially when accepting approvals before assignments.
Now the second one. If we're not checking signatures of approvals when they are sent directly, does that mean an attacker gets free tries avoiding slashing of approval-voters? I guess that's fine as long as the backers are getting slashed. But that also means we're loosing incentives for approval-voting as outlined in #7229.
So I'd like to hear more from @burdges and @eskimor how these features would interact.
gum::trace!(
	target: LOG_TARGET,
	"Approval checking signature was invalid, so just ignore it"
);
This should never happen in practice though, unless the approval-voter was malicious, no?
If so, I would suggest elevating this to a warn.
@@ -48,6 +48,20 @@ use std::{
	time::Duration,
};

// TODO: Disable will be removed in the final version and will be replaced with a runtime configuration
// const ACTIVATION_BLOCK_NUMBER: u32 = 10;
Take a look at polkadot/node/core/provisioner/src/lib.rs, line 396 at 5bd7292:
let disputes = match has_required_runtime(
Yeah, will do, that part is unfinished for now.
@@ -210,34 +224,30 @@ struct Knowledge {
	// When there is no entry, this means the message is unknown.
	// When there is an entry with `MessageKind::Assignment`, the assignment is known.
	// When there is an entry with `MessageKind::Approval`, the assignment and approval are known.
-	known_messages: HashMap<MessageSubject, MessageKind>,
+	known_messages: HashMap<MessageSubject, Vec<MessageKind>>,
Is this a vec of at most 2 entries? Could you elaborate on why we need a vec here? Is it related to accepting approvals before assignments?
Yes. In the current implementation it is not needed, because it works with the (there correct) assumption that approvals won't arrive before the assignments, so the approval will always override the assignment, and functions like `contains` assume that if we have an approval stored here, then we have definitely seen an assignment for it.
However, with the changes in this PR that is not the case anymore: since we are sending the approvals directly to all nodes, an approval might arrive before an assignment, so we can't assume anymore that a subsequent assignment is a duplicate just because we have already received an approval. Hence, we have to store both of them.
My thinking here is that we should have a limit on how many of these orphan (potentially malicious) approvals we accept and keep in memory, which could be up to the first level of the distribution aggression threshold (which is 12 blocks IIRC).
That's a good point; later, if the sig is found to be invalid, we would just have to drop the vote, as the authenticity of the message cannot be proved on chain.
Could we still batch-check all of these approvals before considering the candidate approved, and discard invalid signatures at that point? Doing all of them in a loop could be much faster, and we could do it in a separate task to not block the hot loop of processing messages.
Looking at versi right now with 250 validators, it seems there are 15-20 approvals per second that arrive early, so I tend to think we can find a reasonable number to limit the growth of this queue.
Yeah, that was one of my concerns as well, so I think there are two worlds here:
Are we sure gossip can be disabled here? I'm unsure if finality could safely block on non-gossiped messages, aka messages which malicious actors could send to only some peers. It's likely fine, since nodes would simply see different no-shows, which likely causes only liveness problems, but this depends upon doing some clearer analysis. cc @AlistairStewart
We expect skipping checks saves more than merging multiple approval votes, which itself saves more signature checks than core groups, although core groups produce savings elsewhere too, but do we have any idea how big are these gaps? If say, core groups saves 3x, merging saves another 2x beyond core groups, and skipping saves another 2x beyond merging, then actually skipping saves relatively little beyond merging, when compared with our current usage.
Again, fast vs slow paths risk an adversary pushing us onto the slow path (and this one sounds particularly irksome).
Skipping signature check + no-gossip is proposed as an alternative to merging, the possible ball-park gains are:
And that seems to also be reflected in some CPU reduction for the network-worker (disclaimer: these are preliminary tests). The merits of this approach (if it can safely be implemented) against coalescing of approvals:
Could you elaborate a bit more on what you would want to see here? Is it something I can address by myself?
It's not performance related. It's a conversation between @chenda-w3f and @AlistairStewart and me, probably.
Yes, this should be easy. So easy, it might happen accidentally.
Basically, if approvals from some people only get through to 99% of recipients, then without gossip we'll end up in the situation where, for each candidate, when most validators see it as having enough approvals, 1% of validators will not. They will see a no-show or two, but it's unlikely that any of the 1% will assign themselves to replace them. Then we are left waiting for 100 or so tranches until some of the 1% assign themselves. If this happens with a different 1% of validators for 40 different candidates in 1 block, then this will be over the 34% of validators required to halt finality. So we'd fall back if that kicks in before 100 tranches.
In fact, the eclipsed 1% would eventually assign themselves. I'm unsure if we treat our own approval vote as overriding what others do, but all those empty tranches count as overcoming the no-shows they observe. If I recall, an empty tranche cannot override being below needed_approvals though. We could change this, but doing so risks eclipse attacks.
Took me a while to understand why we need the 1% to assign themselves. So, because the 99% think the candidate is approved, they won't trigger their tranche anymore; hence we would have to wait till the 1% self-assign. In that case, does it make sense for us to build some resilience in there, by just making the nodes that are missing the approvals ask for them from other topology peers or random nodes? In that case, I think we would not need the
N. Ask for the approval from all topology peers. The benefit here is that we don't increase the number of messages unnecessarily on the fast path, and on the slow path we just give honest nodes the opportunity to obtain the needed approval, without unnecessarily increasing the number of messages we process. What do you think?
We should discuss if our own vote overrides what we see from others: Alice approved X in R, nobody else approved X, but every other Y in R is approved by others. Can Alice vote for R in grandpa? I suspect currently no, but maybe the answer should be yes. This is really a corner case, but also wholly orthogonal.

I think fast vs slow path optimizations create compounding risks: An adversary could typically push nodes onto the slow path, so this becomes fast vs secure path. We ourselves never optimize the slow path enough. We'll never review or audit the slow path as carefully, so we'll miss more serious vulnerabilities there. Imagine the odds of a security-critical bug increase like paths^2 or something. We risk fast path optimizations that're really pessimisations for the slow secure path.

I've therefore always held not having fast vs slow paths as an overarching design principle. I always expected we'd add some later of course, but I wanted to delay that until we felt the system was basically sound. I do not mind the availability fast vs slow paths, because we kinda understand everything which could happen there, and other optimizations really should happen anyways. I'm more worried about lazy signature checks though. As concrete objections:
Anyways, I think this sits in a problematic place that's kinda the worst of both worlds, gossiped vs ungossiped approvals. I'm fine to think about each world separately, but right now I do not feel like we can realistically mix these two worlds.
... add option to malus nodes that withholds messages sent by approval-distribution, to simulate approvals that couldn't be sent to some of the peers, and test that finality is not lagging Signed-off-by: Alexandru Gheorghe <[email protected]>
The CI pipeline was cancelled due to the failure of one of the required jobs.
@burdges ... still processing the rest of the message :)
I do agree having two paths increases the security risk.
If we are lazily checking the signature before counting an approval, and discarding it if it is not signed correctly, it would force everyone to sign, so we can reconcile it with paritytech/polkadot-sdk#635. Am I missing something?
Agreed, it would be more complicated; if I understand it correctly, your suggestion is to go down the path of paritytech/polkadot-sdk#701 rather than the paths from this PR.
What about the above idea where we wouldn't gossip the approval vote at all (#7482 (comment)) and just allow nodes to fetch it from peers if they didn't get it directly? That would be clean enough, and it has the advantage of better utilizing the network and CPU, rather than gossiping approval votes blindly.
Putting this on hold in favor of paritytech/polkadot-sdk#701
What kind of scenario is that? How would that even be possible? Connections that work for one candidate, but not for the other, and this on a significant number of nodes?

Also, in principle we do have a reliable transport mechanism: TCP. This means we know which peers we have a working open connection with and which not, and we should also be able to know what messages got through. We can either have application-level confirmations or just track TCP status: as long as the connection was never closed, we can assume messages got through. If we detect a re-connect, we can resend our approvals.

If we want to be more robust against direct connections not being possible, we could also think of techniques like tunneling via some random node. This would require some work as well, but the nice thing is that it could be part of the networking stack: we just tell it we want a connection, then it tries a direct connection; if that fails, it tries tunneling via a random peer, and so on. Not a short-term solution, but longer term, relying on direct connections should give us way better performance than always sending the same message multiple times.
Assignment VRFs must not be sent like this, so there is a politeness failure which you've never acknowledged, aka we must retain all votes related to all imported blocks, without VRFs limiting their accumulation. If we've many relay chain forks, then adversaries voting for all candidates adds up fast.
If we do not gossip votes then adversaries could choose which nodes see their signatures, so only a small portion of nodes see them no-show. As nobody replaces them quickly, I'd expect those nodes cover the adversarial no-shows by empty tranches, assuming rob copied that "odd" feature of my prototype. It's possible rob changed this empty no-show tranche rule. It's also possible someone argues successfully against the empty no-show tranche rule, like maybe @AlistairStewart or @chenda-w3f. If so, then we'd need the self override rule I mentioned above, but this adds another sort of fragility, and anyways risks the long delays which worried @AlistairStewart there. This proposal is not orthogonal to how no-shows work, hence fragile. In this, I've only discussed the simplest most direct attack, but we've no idea what's really possible with say BGP attacks, etc. or what happens by accident.
I'd typically assume TCP makes everything worse, in that an ideal adversary can halt the connection exactly when they desire, and anyways adversaries know TCP far better than we do. In principle, an "unreliable" UDP transport could have reliability designed around our specific problems, but doing so requires mountains of work.
I'd expect this incurs considerable latency. I doubt TCP exposes what succeeded or failed, so we'd need our own ack-of-ack layer on top.
Again, more special-case complexity which we'll never know if we've properly implemented or debugged; worse, routing is rarely a simple task. In this proposal, we'd need a new politeness design which becomes problematic, special TCP handling which breaks other transports, and some special extra routing layer. We'd need some really heroic testing effort, since none of these trigger during normal execution.
I do not rule this out entirely, and we'll happily discuss it more with @chenda-w3f etc, but afaik it requires re-inventing too much of how distributed systems work both in practice & theory. This proposal is hard! It demands trashing paritytech/polkadot-sdk#635 near-term for these reasons. We need topology fixes anyways so assuming those work, then I'd wildly guesstimate:
We cannot have 1+3, but 1+2 and 2+3 work, though doing two is maybe already excessive.
Overview
This implements the lazy signature checks solution proposed in paritytech/polkadot-sdk#604 (approval-voting: approvals gossip and lazy signature checks optimizations).
Problem
While profiling versi for paritytech/polkadot-sdk#732, the conclusion is that during peak time with 250 validators, approval-distribution and approval-voting need to process between 10k-15k messages per second. A rough breakdown of where the time is spent:
- approval-voting checking the assignment certificate (this will be dramatically improved by "approval-voting improvement: include all tranche0 assignments in one certificate" #6782)
- approval-voting checking the signature of an approval vote
Proposed Solution
The amount of work for assignments is already being addressed by #6782, so this PR tries to address and reduce the amount of work we have to do when processing an approval. This is achieved by implementing a fast path for processing approval votes, consisting of the following steps:
FAQ
Why is it OK to not check approval vote signatures?
If we received the message directly from the originator, we can trust that the vote is legit, and since we aren't enforcing any accountability on approval votes, checking the signature before taking the vote into account is not really necessary and is not security-critical. For the future, we might envision a system where we lazily check the vote if accountability is required.
Note! The signature will later be checked just in the cases when we are dealing with a dispute and we have to import the votes on chain. So, validators are incentivised to correctly sign their votes, because otherwise their votes won't be imported on chain and they will miss out on dispute rewards.
Wouldn't this make it risk-free for attackers to try to approve invalid blocks?
This is already possible in our current implementation, because we do not punish in any way validators that try to approve invalid blocks, and we allow them to change their votes during disputes. However, that is already included in our threat model; see paritytech/polkadot-sdk#635 on why it is alright (we do slash the backers).
What happens if the network is segmented and peers cannot reach each other directly?
Finality will be lagging, and once the configured threshold is passed we will enable aggression, which falls back to gossiping approvals.
Does this increase the risk of malicious nodes attacking honest nodes without the network observing?
No. We are still gossiping assignments, so they will spread out much faster and more randomly than the approvals, so if honest nodes get DoS-ed before sending all of their approvals, they will be marked as no-shows and a new tranche will kick in.
Can approvals arrive before their corresponding assignments?
Yes, and to deal with this case we will have to store the approval and process it later, when an assignment is received. The buffer for storing these approvals has to be reasonably bounded, but given that we already store all the assignments and approvals we receive in approval-distribution, this shouldn't affect our memory profile too much. All validators get just one slot per candidate per unfinalized block.
Alternative solution paths
There are two other paths that I explored for optimization:
Coalescing approvals paritytech/polkadot-sdk#701
This is an alternative optimization that we can pursue instead of this MR, which should give us major gains, though I would estimate lower ones than this MR, and it will also increase the size of the votes we have to import on chain for disputes/slashing; see paritytech/polkadot-sdk#701.
Parallelization of processing
approval-distribution/approval-voting
The measurements I did here (#7393 (comment)) suggest that this should give us some gains, but they won't be enough for our goal of 1k validators, so we would still need either this MR or the coalescing of approvals described above. However, the good part is that the parallelization could be stacked on top of optimisations that reduce the amount of work. So, once we reach consensus on which of the two options (lazy checking / approval coalescing) we want to implement, I think we should see some benefit from parallelising the work done in
approval-distribution and approval-voting
TODOs