IF: propagate hs_*_messages #1548

Closed
Tracked by #1508
heifner opened this issue Aug 23, 2023 · 25 comments · Fixed by #1676

@heifner
Member

heifner commented Aug 23, 2023

Need to support configuration where BPs are not directly connected. For example: BP <-> relay-node <-> relay-node <-> BP.
Also need to prevent infinite message loops. This is accomplished for trxs and blocks by keeping track of which messages have been received on each connection and not echoing a message back to the connection it came from.
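
A minimal sketch of that per-connection bookkeeping, using illustrative names only (seen_hs_messages, relay_hs_message and send_to_peer are not the actual net_plugin code):

   #include <functional>
   #include <map>
   #include <set>
   #include <string>
   #include <vector>

   // message digest -> connection ids that have already sent us (or been sent) this message
   std::map<std::string, std::set<uint32_t>> seen_hs_messages;

   // Relay a message to every peer except those that already have it, including its sender.
   void relay_hs_message(const std::string& digest, uint32_t received_from,
                         const std::vector<uint32_t>& all_peers,
                         const std::function<void(uint32_t)>& send_to_peer /*hypothetical*/) {
      auto& have_it = seen_hs_messages[digest];
      have_it.insert(received_from);                 // never echo back to the sender
      for (uint32_t peer : all_peers) {
         if (have_it.insert(peer).second)            // first time this peer sees the digest
            send_to_peer(peer);
      }
   }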

@arhag
Member

arhag commented Aug 23, 2023

Changes may instead need to be outside of the net plugin, e.g. changes to qc_chain to support a qc tree analogous to the fork database for reversible blocks. This would be needed to detect duplicates, drop them on the floor, and avoid the infinite-message-loop problem.

@fcecin

fcecin commented Aug 28, 2023

I'm thinking that we want the rebroadcasting to be encapsulated inside the IF engine, since it is already tasked with calling "broadcast hs message (x)" as a side effect of message processing. So the IF engine would also make the decision to disseminate (route) the message to the network during the processing of each incoming message.

Here's my current thinking about what these decisions would look like, for each message type:

  • New View: rebroadcast if a message triggers a local view change (is this right?);
  • New Block: rebroadcast if local qc_chain didn't know about that block ID (is this right?);
  • New Proposal: rebroadcast unless qc_chain considers the incoming proposal a repetition, irrelevant, or obsolete (how exactly?)
  • New Vote: rebroadcast if... (??)

This is in addition to the broadcast-this-new-message requests that are already inside qc_chain. That would mean qc_chain is also a gossip router now.

A free optimization can be achieved by never rebroadcasting a message to its original sender. This can be done by passing an identifier of the sender to the IF engine together with the incoming message notification; the IF engine can then pass that identifier as an optional argument to the send/broadcast requests (i.e. do broadcast, but exclude this given destination/connection/node).

A further optimization would be the creation of a "hotstuff" connection/channel/subscription type, similar to what is being done for block connections, transaction connections, and the new peer address-book/discovery feature (last time I checked, a month+ ago). All these broadcasts would be limited to those connections only, which would allow e.g. hotstuff messages to be disseminated inside the network core, eventually reaching all infrastructure nodes (and any external/downstream nodes interested in collecting this information that have a "hotstuff" connection to a core network node).

I am assuming that all of the hotstuff messages are going to be what we are seeing in libraries/chain/include/eosio/chain/hotstuff.hpp, i.e. small messages.

I don't think the net_plugin should be making these decisions, as it would then have to understand what the messages mean to make optimal routing decisions. For a gossip network, pushing the routing into the IF engine does not needlessly complicate the interface between the two modules. All the IF engine needs, in addition, is the original sender (the net relayer, not the HS message creator), which is ultimately meaningless to the IF engine but meaningful to the net_plugin, which simply gets it back on a rebroadcast without ever having to store it in the meantime.

Depending on what information is required to implement all of the routing decisions, we will end up needing some additional data structures or a change to the existing one(s), as @arhag mentioned above.

This spec can be slightly or completely off target, so any discussion, additions, or corrections here are of course welcome (and needed).

@heifner
Member Author

heifner commented Aug 28, 2023

I agree with the approach of the IF engine handling this functionality.

It is not clear to me what the complete logic for rebroadcasting a message should be. However, the following must hold: every hs message must either generate a new/modified hs message OR be forwarded to all peers (except the peer it came from). This ensures all messages propagate throughout the entire network.
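
A sketch of that invariant with assumed names (on_hs_message, qc_chain_process and forward_to_all_except are illustrative, not the real qc_chain / net_plugin interfaces):

   // Every handled hs message either causes the IF engine to emit a new/modified hs message,
   // or is forwarded as-is to all peers except the one it came from.
   template<typename HsMsg>
   void on_hs_message(uint32_t from_connection, const HsMsg& msg) {
      bool produced_new_message = qc_chain_process(msg);    // hypothetical: true if a vote,
                                                            // proposal or new-view was emitted
      if (!produced_new_message)
         forward_to_all_except(from_connection, msg);       // hypothetical relay helper
   }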

I don't think limiting transmission of hs messages to hs-configured peers is needed. These messages are not large, and I don't think the network overhead will warrant the added complexity. If we find it is useful, it can be added post-5.0.

See #1559 for latest hs_message formats.

@fcecin

fcecin commented Aug 28, 2023

EDIT: NOTE: This comment is obsolete.

I believe this may be a good direction for vote rebroadcasting. I would use a second (local) bitset inside quorum_certificate that is omitted from the serialization macro (like bool quorum_met is now), and would simply drop votes if I don't have that proposal ID in my proposal store.

   void qc_chain::process_vote(const hs_vote_message & vote){

      //auto start = fc::time_point::now();
#warning check for duplicate or invalid vote. We will return in either case, but keep proposals for evidence of double signing
      //TODO: check for duplicate or invalid vote. We will return in either case, but keep proposals for evidence of double signing

      bool am_leader = am_i_leader();

      if (!am_leader) {
         // TODO(1548): Rebroadcast the incoming message as-is, unless we have already seen it or it is obsolete.
         //             Here we have the finalizer public key and the proposal ID to make that decision.
         //             If the proposal ID is unknown, the vote is dropped (not gossiped/rebroadcast).
         //             If proposal ID is known, we use the bitset on the proposal's QC to see if we already have the sig from that
         //                finalizer in the proposal:
         //                - if it is already there, we drop the message
         //                - if it is not there, we aggregate the signature and set the bit
         //                  OR, we set *another*, local-only bitfield of voters whose votes we have already seen,
         //                  if the quorum is already met, to avoid needlessly enlarging an aggsig that is already enough.
         //
         //             A proposal that is GC'd is no longer relevant and will cause votes for it to die on receipt here.
         //
         //             If we *are* the leader, then we do NOT rebroadcast the message, since it is intended for the leader only;
         //               the other nodes just relay votes so it can ultimately arrive at us, and that means this is the only point
         //               in this method (vote processing) where we gossip/disseminate votes.
         //
         //             This strategy *may* cause less broadcasts than would otherwise happen, if a proposal can reach a node
         //                that then issues its vote before some or all of the would-be-relayer-of-this-vote nodes get to see
         //                the proposal in time. However, we do not *need* all votes to get through, just a quorum of them
         //                (whose minimum value is 50%+1), so the protocol should already be naturally resistant to this at
         //                that level (and at a higher level, there is no safety breakage if NO messages at all get through).
         //
         //             An alternative is to retain votes for unknown proposal IDs and keep them until something happens
         //                (e.g. ~3 seconds pass, or something like that) and if a proposal ID matching them finally arrives,
         //                we then broadcast them, although they could be obsolete at that point. Since votes have a signature
         //                of a whitelisted finalizer public key in them, these votes cannot be spam (at least not from non
         //                finalizer nodes). If we do this, we can add the cleanup of those to the existing garbage collect
         //                method (or we write another gc method). We can also add a limit to how many of those we buffer,
         //                like 4K votes or something like that. It's an optimization; not a safety requirement.
         return;
      }

EDIT: If the finalizer's bit is not set in either of the bitsets on the proposal, the vote is new and is thus rebroadcast.

EDIT: I think this is wrong: only the leader should be aggregating signatures. A non-leader just reads the active_finalizers bitset, and sets the secondary bitset only.

EDIT: I think that, aside from any missing signature checking / authentication, this pseudocode does it:

   void qc_chain::process_vote(const hs_vote_message & vote){
(...)
      bool am_leader = am_i_leader();

      if (!am_leader) {

         hs_proposal_message* vp = get_proposal(vote.proposal_id);
         if (vp == nullptr)
            return;

         if (bit_is_set(vp->active_finalizers, vote.finalizer))
            return;

         // extra_finalizers is an exact copy of the active_finalizers field, including "=0" init,
         //   but is not sent over the wire (no FC_REFLECT)
         if (bit_is_set(vp->extra_finalizers, vote.finalizer))
            return;

         set_bit(vp->extra_finalizers, vote.finalizer);

         // rebroadcast the vote message here

         return;
      }
(...)

EDIT: The above is still wrong; I confused Proposal with QC; Proposal doesn't have a sig bitset.

@fcecin

fcecin commented Aug 28, 2023

EDIT: NOTE: This comment is obsolete.

Tentative spec for the New Block message gossip.

   void qc_chain::process_new_block(const hs_new_block_message & msg){

      // If I'm not a leader, I probably don't care about hs-new-block messages.
#warning check for a need to gossip/rebroadcast even if it's not for us (maybe here, maybe somewhere else).
      // TODO: check for a need to gossip/rebroadcast even if it's not for us (maybe here, maybe somewhere else).
      if (! am_i_leader()) {

         // TODO(1548): This message connects the proposer to the leader; the proposer is asking the leader to create a proposal
         //                 for a given block whose finalization is being proposed by the finalization proposer.
         //
         //             If we *are* the leader, then we do NOT rebroadcast the message, since it is intended for the leader only.
         //                 (see comment on votes).
         //
         //             If this block ID is already mentioned anywhere in my proposal store, drop it. That means the leader has
         //                already received and told everyone else about it.
         //
         //             If the block ID is in an "already advertised block IDs" list (let's say, a circular buffer, or a store
         //                 that will gc these block IDs on a time basis) then drop it.
         //
         //             Otherwise broadcast it and add it to the "already advertised block IDs" structure.
         //
         //             I imagine that for Leap 5 this message follows in the coattails of advertising a newly produced block,
         //                 since producer(block assembler and dispatcher)==proposer(finalization process starter) for now. (?)
         //                 And if so, then we cannot avoid rebroadcasting it just because it is a block ID the local node
         //                 never heard of, since the message could conceivably jump ahead of block propagation (?). So this one
         //                 is going to get taken care of by the "already advertised block IDs" store.

         fc_tlog(_logger, " === ${id} process_new_block === discarding because I'm not the leader; block_id : ${bid}, justify : ${just}", ("bid", msg.block_id)("just", msg.justify)("id", _id));
         return;
      }

EDIT: the quorum_certificate justify field in the New Block message can be null (at hotstuff activation), which I think is a hint that we shouldn't demand it be anything in particular before we deem the message "valid" for a rebroadcast. Not sure yet if it has any use here, like helping to GC seen/rebroadcast block IDs faster.

@fcecin

fcecin commented Aug 28, 2023

Tentative spec for the New View message gossip.

This has to be confirmed/reviewed by someone else, since I don't know enough about Hotstuff to know whether this is correct; in any case, it cannot break liveness as long as we still have leader timeouts and/or leader rotation.

   void qc_chain::process_new_view(const hs_new_view_message & msg){
      fc_tlog(_logger, " === ${id} process_new_view === ${qc}", ("qc", msg.high_qc)("id", _id));
      auto increment_version = fc::make_scoped_exit([this]() { ++_state_version; });
      if (!update_high_qc(msg.high_qc)) {
         increment_version.cancel();
      } else {
         // TODO(1548): Rebroadcast New View message.
      }
   }

@fcecin

fcecin commented Aug 28, 2023

Tentative spec for the Proposal message gossip.

Again, I don't know enough to be sure that this is correct, so I need a confirmation here.

If you receive a proposal that you do insert into your store, then it is new and valid, and thus you rebroadcast it. If you are discarding that proposal for whatever reason, you do not rebroadcast it. I don't think there's a case where you insert it into your store and don't rebroadcast, or where you discard it and do rebroadcast. You also rebroadcast it before you broadcast votes on it (i.e. stuff it into the socket send buffers for each peer connection before the votes).

   void qc_chain::process_proposal(const hs_proposal_message & proposal){

            (...)

      //update internal state
      update(proposal);

      //TODO(1548): Rebroadcast Proposal message here. 

      for (auto &msg : msgs) {
         send_hs_vote_msg(msg);
      }

      //check for leader change
      leader_rotation_check();

      //auto total_time = fc::time_point::now() - start;
      //fc_dlog(_logger, " ... process_proposal() total time : ${total_time}", ("total_time", total_time));
   }

@fcecin

fcecin commented Aug 28, 2023

@heifner suggests the GCs will happen at LIB, meaning they are neither time-based nor bounded by a maximum buffer size; I like/agree with that.

@fcecin

fcecin commented Aug 30, 2023

EDIT: NOTE: I am not sure this is relevant, but I'm leaving it here for now.

If we decide(d) to merge the leader role with the proposer role for Leap 5, then we can perhaps defer the actual use of the New Block message, since its role as an actual network message seems to be to enable a separation between the proposer and the leader roles (and to enable unit tests that don't have another block propagation mechanism). If this separation does not happen in Leap 5, then we don't need that message, and thus we don't need to deal with it in the context of this ticket (i.e. coming up with a block ID cache to implement its dissemination). The message type would still exist (for tests, for future use, and maybe as a synthetic message generated locally for the proposer role to communicate with the leader role inside the same node), but in the Leap 5 protocol it wouldn't actually be sent over the network.

Since IF removes the need for quick rotation in the producer schedule, the Leap protocol could also get rid of the 6-second/12-block producer window and enlarge it to match what a healthy window for leader alternation would look like. In other words, if it would make more sense to rotate the leader (say) once every 12 or 24 seconds, then you just make that the new production window. That gives you a single producer = (proposer + leader) window rotation, which also eschews any need to design an explicit, event-driven, network-messaging-based, negotiated, active leader timeout-and-replacement solution, just as the producer window eschews the need to design such a thing for failing producers.

In my view, the 6-second production window we are accustomed to serves two purposes. The first is to be fast enough that we can accumulate implicit confirmations of block finality quickly, by spreading signature collection over the production of blocks (each of which collects one signature; the "quorum certificate", as it were, is implicit and spread over time). The second is to be small enough that when an elected and scheduled producer is completely disrupted, the disruption of service to the network is "not noticeable" compared with the typical expected QoS of blockchain networks -- for example, a 6-second outage due to a completely botched production window is roughly half the block time of normal Ethereum operation (something like 12 seconds per block), so even if one producer is completely down, you still get latency at least equivalent to e.g. Ethereum.

But a block producer being down is a rare occurrence, especially with the redundant production infrastructure that BPs should have. With IF, we would no longer need to worry about the total latency of a full production round. And if, in addition, we discard the preoccupation with "decreased network QoS due to producers that are completely down" (which is a problem addressable at the IT & governance level), then we are free to increase the production window to something that makes sense as a forced leader rotation window. As discussed above, that frees us from having to worry about leader-failure substitution: a failed leader role is equivalent to a failed proposer role, and both are equivalent to a failed producer entity, so we can assume the entire problem solves itself every X seconds, where X can be something other than 6 seconds. X can be anything that mitigates whatever costs (if any!) an automatic leader rotation imposes.

Maybe I'm off here, but anyway, if that's true, then omitting the New Block message from the Leap 5 release would simplify this ticket.

EDIT: Actually, having message dissemination for New Block solved (even if not optimally) is good in any case, since it keeps the engine ready for future use of the message in a Leap protocol that does decide to separate proposer and leader.

EDIT: Even if you have redundant producer infrastructure (e.g. two block producers with the same block signing key), only one of them can be set up as the leader, with its unique BLS key, right? So even in that scenario, the proposer and leader roles are on the same node (same BLS key).

@fcecin

fcecin commented Aug 31, 2023

Current design direction, summed up:

  • Decide on one of the solutions to the "don't broadcast to original sender" optimization;
  • Do not make any application-aware routing decisions at the qc_chain (e.g. "this view number is irrelevant") for now, but leave the option open for later; all messages are routed based on whether they have been "seen" (sent, rebroadcast) recently or not, which is maximally robust against dropping messages that should have been forwarded;
  • GC the data structures that accumulate "seen messages" for all message types based on the cutoff parameter from gc_proposals (seen-message buckets will be attached to the highest proposal height at the time of their recording, or some higher height derived from that present height); see the sketch after this list.
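
A sketch of what those height-bucketed seen-message stores could look like (names and the exact cutoff semantics are assumptions; the real cutoff would come from the same parameter gc_proposals uses):

   #include <cstdint>
   #include <map>
   #include <set>
   #include <string>

   // recording height -> digests of messages seen while that was the highest proposal height
   std::map<uint64_t, std::set<std::string>> seen_messages_by_height;

   void record_seen(uint64_t current_height, const std::string& digest) {
      seen_messages_by_height[current_height].insert(digest);
   }

   bool already_seen(const std::string& digest) {
      for (const auto& [height, digests] : seen_messages_by_height)
         if (digests.count(digest)) return true;
      return false;
   }

   // Called with the same cutoff gc_proposals uses; drops every bucket below that height.
   void gc_seen_messages(uint64_t cutoff_height) {
      seen_messages_by_height.erase(seen_messages_by_height.begin(),
                                    seen_messages_by_height.lower_bound(cutoff_height));
   }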

EDIT: For the first item, going with this, which will be cleaner than passing connection_id as an additional argument:

   using hs_message = std::variant<hs_vote_message, hs_proposal_message, hs_new_block_message, hs_new_view_message>;

   struct hs_message_send {
      std::optional<uint32_t>     connection_id;
      hs_message                  message;
   };
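
For illustration, a hedged sketch of how the sending side could honor that field; for_each_connection, connection_id() and enqueue_hs_message() are stand-ins for whatever the net_plugin actually exposes, not its real API:

   // Broadcast an hs message, skipping the connection it originally came from (if any).
   void broadcast(const hs_message_send& s) {
      for_each_connection([&](const connection_ptr& c) {      // hypothetical iteration helper
         if (s.connection_id && c->connection_id() == *s.connection_id)
            return;                                           // exclude the original sender
         c->enqueue_hs_message(s.message);                    // hypothetical send call
      });
   }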

@BenjaminGormanPMP BenjaminGormanPMP moved this from Todo to In Progress in Team Backlog Aug 31, 2023
@fcecin

fcecin commented Sep 1, 2023

For this to be secure, to avoid attacks where malicious peers are trying to censor message propagation, we need to compute a secure hash on top of the full byte serialization of every incoming and outgoing HS message. Other secure alternatives would involve signing and signature verification, which we don't want. Which means we probably want to do that at the net_plugin after all. So I will try to move the duplicate message filtering logic to the net_plugin.

@heifner
Member Author

heifner commented Sep 1, 2023

For this to be secure, to avoid attacks where malicious peers are trying to censor message propagation, we need to compute a secure hash on top of the full byte serialization of every incoming and outgoing HS message.

I don't follow, how does a hash help here? The malicious peer can just create a hash of its malicious message. The bls signatures in the messages are over the data that must be secure I thought.

@fcecin

fcecin commented Sep 1, 2023

For this to be secure, to avoid attacks where malicious peers are trying to censor message propagation, we need to compute a secure hash on top of the full byte serialization of every incoming and outgoing HS message.

I don't follow, how does a hash help here? The malicious peer can just create a hash of its malicious message. The bls signatures in the messages are over the data that must be secure I thought.

If you can guarantee that you are already and always going to pay the cost to verify a signature (of whatever type) over the message that you are receiving (which is to be determined by the application) then you can use that entire signature (say, 384 bits) as a key to a "don't rebroadcast this message in the future". In any other case, an attacker can just replay that signature (which is not verified and re-verified every time you receive it) over garbled content in the rest of the message, preventing, in theory, the propagation of the "good" version of the message. At that point the message could be irrelevant to node A receiving it (i.e. in application logic it is not verifying the signature, aggregating it, etc.) but relevant to node B, which is counting on node A to rebroadcast it.

Each HS message has its own structure. Some have a global signature object, others have a block ID and a justify QC, like the New Block message. So you have to think about which piece of each message you store in an index of "seen messages" for that specific message type. Each HS message type is getting its own protocol, in a sense.

Maybe there's a way to leverage whatever bits of intelligence the qc_chain has about the meaning of the messages, but it is safer, at least at first, to treat messages as opaque bags of bits and have the net_plugin just rebroadcast any hotstuff message that it has "never seen before" (never seen recently, actually). It is better to deal with the spam problem later (i.e. an attacker manufacturing fake hotstuff messages to spam the network) than to have to deal with a message-censorship attack. In a spam attack you are just wasting bandwidth, but in a censorship attack honest messages are being dropped.

So my concern is that I don't want an honest message to be wrongly recognized as a message that has already been "seen". The easiest way to avoid that is to just hash the byte serialization of messages. This way, the only thing an attacker can "pre-seed" into the seen cache is that exact message, byte for byte, which gets broadcast anyway, so it cannot be censored.
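
A minimal sketch of that duplicate check, assuming the usual fc helpers (fc::raw::pack over the whole message, fc::sha256::hash over the resulting bytes); the container and its GC policy are placeholders:

   #include <fc/crypto/sha256.hpp>
   #include <fc/io/raw.hpp>
   #include <set>
   #include <vector>

   std::set<fc::sha256> recently_seen_hs_digests;   // GC'd elsewhere (e.g. at LIB)

   // Returns true if this exact byte serialization has been seen before.
   template<typename HsMsg>
   bool is_duplicate_hs_message(const HsMsg& msg) {
      std::vector<char> bytes  = fc::raw::pack(msg);                        // full serialization
      fc::sha256        digest = fc::sha256::hash(bytes.data(), bytes.size());
      return !recently_seen_hs_digests.insert(digest).second;
   }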

The spam problem can be countered with an additional, Hotstuff-specific decision layer: the net_plugin asks the qc_chain, after the message passes the duplicate check, whether the message is "irrelevant". The node has to know that its qc_chain state is sufficient to make that call, and then know how to evaluate what the message means to decide whether it gets dropped, even if it is apparently not a byte-by-byte duplicate of something previously seen (which is the case with spam).

@fcecin

fcecin commented Sep 1, 2023

The root of the problem is the Multicast Problem, which is the reason the open internet doesn't route multicast but will happily route any unicast packet, no matter whether it is meaningless and part of a DDoS attack (which it has no automated way of knowing).

If we are introducing message broadcasting to the protocol, and we don't have simple messages like "transactions" and "blocks" (which have strict ordering, are always verified, etc.) but instead have these control messages that together implement a complex, critical, and latency-sensitive protocol, then it becomes tricky that nodes all just trust each other to blindly multicast each other's messages to everyone.

Maybe we should think about introducing the concept of a trusted mesh among the elected producer infrastructure. That would be analogous to a permissioned IP multicast subnet, for example.

@fcecin

fcecin commented Sep 2, 2023

Byte-level duplicate message filtering will be done at net_plugin. If a message is a duplicate, it isn't even sent to the qc_chain.

The decision to rebroadcast will be made at qc_chain. This can be a decision to rebroadcast always (always excluding the original sender), or a decision to rebroadcast conditionally.

@heifner
Member Author

heifner commented Sep 5, 2023

For this to be secure, to avoid attacks where malicious peers are trying to censor message propagation, we need to compute a secure hash on top of the full byte serialization of every incoming and outgoing HS message.

I don't follow, how does a hash help here? The malicious peer can just create a hash of its malicious message. The bls signatures in the messages are over the data that must be secure I thought.

If you can guarantee that you are already and always going to pay the cost to verify a signature (of whatever type) over the message that you are receiving (which is to be determined by the application) then you can use that entire signature (say, 384 bits) as a key to a "don't rebroadcast this message in the future".

It was my impression that we were going to verify the signature on each and every hs message, and that could be used as the key. I thought each node would be maintaining its own view of finality and verifying all messages.

@fcecin

fcecin commented Sep 5, 2023

We have decided today that we will be doing signature checks on all messages in one way or another, which pretty much gives us a good baseline solution or limiter for network-wide spam.

We have also decided today that we will be doing duplicate checks by detecting the more general condition of an irrelevant IF message (of which duplicates are a subset) -- a message that is discarded upon receipt by the qc_chain because it decides, based on the contents of the message's fields and its own IF state, that the message serves no purpose. (EDIT: that means removing the SHA256 check that is there now, which may or may not be repurposed later -- see below).

We have also decided that we want to get rid of the connection_id being passed into qc_chain. There is a way to do that: add a return value to the entire chain of calls, all the way from net_plugin connection::handle_message( const hs_message& msg ) to the qc_chain::process_proposal, process_vote, etc. methods that currently return void. They would return a status that would, for example, tell the caller connection (or the caller connection's my_impl-> (net_plugin_impl)) to rebroadcast the same message (but excluding that connection), or notify it of an invalid signature check or of an irrelevant message (to increase a spam/grief counter). On top of that, if we really need to, we can count peer spam/grief specifically on replay attacks by adding a specialized message sha256 check on each connection as well. Do we want a return value?

EDIT: We have also decided to remove New Block as an actual HS message that is sent and seen on the wire (for starters).

This is pretty much what I got for now.

@heifner
Member Author

heifner commented Sep 5, 2023

We have also decided that we want to get rid of the connection_id being passed into qc_chain. There is a way to do that: add a return value to the entire chain of calls, all the way from net_plugin connection::handle_message( const hs_message& msg ) to the qc_chain::process_proposal, process_vote, etc. methods that currently return void. They would return a status that would, for example, tell the caller connection (or the caller connection's my_impl-> (net_plugin_impl)) to rebroadcast the same message (but excluding that connection), or notify it of an invalid signature check or of an irrelevant message (to increase a spam/grief counter). On top of that, if we really need to, we can count peer spam/grief specifically on replay attacks by adding a specialized message sha256 check on each connection as well. Do we want a return value?

I think we want to keep the connection_id passed into qc_chain. A return value requires the implementation to remain synchronous, and I find it very likely we will want to move to an async model in the future. For example, we will likely want to do some of the aggregate signature validation in a thread pool.

@fcecin

fcecin commented Sep 5, 2023

(...) On top of that, if we really need to, we can count peer spam/grief specifically on replay attacks by adding a specialized message sha256 check on each connection as well. (...)

That part of my comment is nonsense, since we can (if we want to) detect replay spam at qc_chain. The signature that is going to be verified for every hs message is the unique key we can hash and test for duplicates (or signature-replay spam) at qc_chain itself; we don't need to serialize the entire message, just hash the signature that is verified first for every message, before we get to the duplicate check. EDIT: The parameters that are passed in/out (net_plugin <--> qc_chain) via callbacks can include a notification of any spam detected at qc_chain.
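
A hedged sketch of that variant: hash only the already-verified BLS signature and use the digest as the duplicate/replay key. The include path and whether bls_signature packs cleanly through fc::raw::pack are assumptions here:

   #include <fc/crypto/bls_signature.hpp>   // assumed location of bls_signature
   #include <fc/crypto/sha256.hpp>
   #include <fc/io/raw.hpp>
   #include <vector>

   // Key derived from a signature that has already been verified for this hs message.
   fc::sha256 hs_dedup_key(const fc::crypto::blslib::bls_signature& verified_sig) {
      std::vector<char> bytes = fc::raw::pack(verified_sig);   // assumption: raw-packable
      return fc::sha256::hash(bytes.data(), bytes.size());
   }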

@arhag arhag moved this from In Progress to Blocked in Team Backlog Sep 6, 2023
@bhazzard

bhazzard commented Sep 7, 2023

Removed the lgtm label because the issue doesn't address how to do the work.

@fcecin

fcecin commented Sep 9, 2023

PR #1613 will add the infrastructure that will allow the qc_chain to make Hotstuff message propagation decisions, and the Net plugin to make decisions on disconnecting peers that are sending (too frequent) invalid, irrelevant or duplicate Hotstuff messages.

@fcecin

fcecin commented Sep 12, 2023

After today's discussion, I think the way forward to close this ticket is to:

1 - Let all the new finalizer_set logic make its way inside of qc_chain;

2 - Settle on a final spec for all the hotstuff message fields;

Then (when everything that's required to make message filter-or-propagate decisions settles),

3 - Use all the information that is inside of qc_chain to implement all the filter-or-propagate-this-message decisions (probably inside of the process_... methods for each hotstuff message);

4 - If duplicate checks for message propagation are not addressed organically at that point (as part of naturally filtering messages that are irrelevant to the protocol, using the knowledge that any and every qc_chain can have at any point), then, as part of this ticket, add specific code for that (using, for example, a cache of SHA256 "seen signatures" or, the nuclear option, of "seen message content" [at the net_plugin level, which had been done before]);

5 - Write a specific multi-hop test for that; run tests; see that it works (e.g. no infinite propagation flooding; protocol still works), close this;

Then, in the future, if needed (tracked in other, new issues -- other things that were discussed in the context of this ticket):

6 - Add logic at the net_plugin to react to notifications of lots of duplicate, invalid, etc. hotstuff messages and disconnect peer;

7 - Start pushing costly operations outside of qc_chain (or whatever other solution(s) to reduce qc_chain contention);

8 - Possibly refactor the IF engine (rename/reshape pacemaker etc.).

fcecin added a commit that referenced this issue Sep 19, 2023
- test_pacemaker allows each test to define a specific connection mesh between nodes
- added a simple unit test (#7) that can be used to test a solution for #1548 (just needs to introduce a disconnect between nodes)
- added support to write multi-hop unit tests (with selective message propagation by type)

The hotstuff protocol has "phases", and between two adjacent phases we typically use different message types. This allows us to write tests where we loop, propagating a specific message type across the network for as many hops as necessary, until the `dispatch` method on the `test_pacemaker` returns an empty vector, signaling that no more message propagation happened. Then we can start propagating the message type expected for the next phase of the protocol (those messages were generated during propagation of the previous type but were retained until now).

All unit tests still use a full connection mesh, which is easily created by calling `test_pacemaker::connect(vector-with-all-node-names)`. The added test is meant to be changed by the inclusion of an explicit disconnect between the two rotating leaders; this can be used as the first test of a solution to message propagation.

Closes #1658
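
A hedged sketch of the loop described in the commit message above; the test_pacemaker method signatures and the message-type tags are assumptions about the test helper, not its confirmed API:

   test_pacemaker tpm;
   tpm.connect({ "bpa"_n, "bpb"_n, "bpc"_n });   // full mesh over these (hypothetical) node names

   // Relay one message type until dispatch() returns an empty vector, i.e. no node generated
   // or forwarded any more messages of that type, then move on to the next phase's type.
   while (!tpm.dispatch("propagate proposals", test_pacemaker::hs_proposal).empty()) {}
   while (!tpm.dispatch("propagate votes",     test_pacemaker::hs_vote).empty()) {}
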
@fcecin

fcecin commented Sep 21, 2023

(EDIT) As discussed today, I'm going to write a propagation solution for the hotstuff messages as they are now, test it with unit tests, and have that up in a branch.

fcecin added a commit that referenced this issue Sep 23, 2023
- qc_chain does not aggregate a signature from a finalizer that has already voted on a proposal
- added test_pacemaker::duplicate()
- added hotstuff_8 unit test to test for duplicate votes

This fix is related to #1548 (lack of duplicate filtering only manifests with message propagation)
@arhag arhag linked a pull request Oct 2, 2023 that will close this issue
@bhazzard bhazzard added this to the Leap v6.0.0-rc1 milestone Nov 1, 2023
@BenjaminGormanPMP BenjaminGormanPMP moved this from Blocked to In Progress in Team Backlog Nov 7, 2023
@arhag
Member

arhag commented Nov 8, 2023

This issue is focused on the improvements to handling of network messages prior to unification of proposals with blocks. Unification will require further changes to this code, but that will be handled in another issue.

@arhag arhag removed the discussion label Nov 8, 2023
@fcecin

fcecin commented Nov 16, 2023

Implemented by #1676

@fcecin fcecin closed this as completed Nov 16, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Team Backlog Nov 16, 2023