-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IF: Hotstuff message propagation/gossip #1583
Conversation
- connection_id of the sender of a Hotstuff message is passed to IF engine, and IF engine optionally passes it back to net_plugin as peer to exclude from a broadcast of the same message - IF engine filters some messages it deems irrelevant for a rebroadcast, otherwise it does gossip incoming messages to all its neighbors (except original sender) - REVIEW: IF engine clears New Block and Vote message filters on proposal GC
For the broadcast back to sender protection, I agree that should be in Currently this could be achieved by just saving the last Therefore, I propose the following:
|
I'm looking at the complexity of testing for message equality just for this, and I'm thinking that since those messages are all tiny anyways, we could simply ignore this optimization and let messages be sent to everyone, including the original sender. They will be filtered at the original sender anyways, since they will be considered irrelevant. This is already a broadcast protocol, so one more tiny message per hop will not make any difference. If not, passing around the connection_id (or storing the connection ID in the hs_message as metadata) seems cleaner. |
Hum, I don't think it is a good idea. Even if the message is small, I think network i/o can be expensive. Also wouldn't that create the possibility of infinite loops between peers connected in a triangle config for example? |
The message has to be discarded at the original sender when it gets it echoed back to itself. In fact, that's a bug of the current implementation that you just spotted: you need to (EDIT: also) add the identifier of every message that you originally send (not [EDIT: just] relay) to the "do not relay this" lists. |
How does it prevent infinite message loops if the original sender is not in the loop? It is an uneven fight between N nodes echoing all the messages they receive, and a single node discarding them. |
When the sender sends any message (original or relay), it will add a description/identifier of the message to a list that forbids it from relaying it going forward. All nodes have to discard all messages they have seen if they ever reach them. |
I understand that the sender will not resent the same message. How does that prevent loops between 3 other nodes which do not include the sender. |
Each node (e.g. middle-of-the-road, relay nodes that aren't the original senders or even the intended recipients) has a list of everything that has passed through it. Once you see a message for the first time, you add "it" (an identifier) to a local data structure in qc_chain, then you relay it to everyone you know. If it returns to you for whatever reason, you drop it on the floor. |
Yes, this is what we need to do (maybe in |
Yeah, I was not very clear, and this whole thing is a bit confusing. I know what to do here now. I am positive that we do not want this decision to be in the net_plugin, because leaving it on qc_chain gives us the option, later, of complementing "message byte comparsion" routing decisions with application-specific routing decisions. If we ever put it on the net plugin, it has to be a general mechanism that would support any broadcast/gossip protocol extension, not just hotstuff. EDIT: What I am saying is that it is very possible that at some point in the future, this gets refactored into a general routing decision filter that is "bitwise" at the net plugin, plus a complementary filter that each protocol extension can define and implement. |
Unless we have a good reason to make something specific for hotstuff, I think a general mechanism is preferable. I don't know what is used now in |
On one hand, we have the serialized message as a byte buffer that we can hash before we send out at the net_plugin; that would work. On the other hand, we will need some general criteria that is known at the net_plugin to flush those hashes. In qc_chain, we might be able to get away with the GC that we already use for proposals. But then again that might be a bad idea. |
We could use a time based approach, such as removing the hashes we stored over 2 hours ago. |
That, or, for example, a memory-size based approach with two buckets. You have a current bucket where you insert hashes, and once it "fills up" (to an arbitrary amount of memory that we want to allow the duplicate checker to consume) you switch to the other bucket, clear it, and start filling it up. Insertions of seen-messages are done at the active bucket (O(1)), and lookups are done against both buckets (O(2)). The rationale is that the rate of operation of the protocol, plus the span (distance) of the network, and the expected number of (honest, non-spam) messages can be estimated into a bucket size. If someone is managing to spam the protocol, all that's happening is that there's potential recirculation of messages that you have already rebroadcasted moments ago, since the bucket size would not be able to record all of the seen messages as they propagate. But it's better to have a size limit for the buckets to avoid spam in the network also causing potentially unbounded memory consumption at the nodes (in addition to consuming bandwitdh). And even if spam causes messages to "not stop" for a while (while the spam is going on), you still have the problem that you do have broadcast spam, which you have to address anyway. These buckets can also be bloom filters but that's probably too fancy and we don't want the filter to filter messages that are not duplicates. |
- optional connection_id of incoming hs message now relayed via wrapper class hs_message (hs_message_inner is the actual hs message variant), simplifying callbacks; - incoming duplicate hs messages filtered at net_plugin by sha256 hash of full serialized hs_message_inner - hs message propagation decision made at qc_chain (simplified)
I think this is along the lines of what we want, hopefully. Basic testing done to ensure it makes sense and basically works.
|
Non-parallelizable unit tests sometimes fail at random on my machine. It was looking like it was caused by this patch (say, excessive messaging, since it seems many/all tests enable IF), but as I investigate it and run it more times, I can no longer reproduce it. Might just be a coincidence. Hashing on broadcast and on receipt seems to yield the same result as expected. Saving message hashes in the hash table and matching hashes seems to work. The seen_msgs... mutex looks correct. Message "which" peek at net_plugin message read (prior to handling the object) seems correct. Will test more, but I think this is worthy of reviews. |
I'm still not 100% sure that propagating only new proposals at a relay node is the correct behavior. However, this can be changed easily now, or later. The main decision in this patch is that the qc_chain is able to do further filtering of message propagation, after the hash-duplicate test passes in the net_plugin. We can tweak that second filter as much as we want. |
Tested with a 6-node setup:
All BPs get IF messages and I can see proposals being committed on all of them. This is validating that messages are indeed being propagated across the network. There is another issue that I have seen in earlier tests and that I was going to investigate next, which is probably not caused by message propagation (this kind of error is printed at every node):
I don't know what that's about yet, but I don't think mere message propagation can cause this. At most message propagation would be exercising an existing bug. ("default" is the name/ID of the IF engine in chain_pacemaker mode; it's not part of the message). |
} | ||
|
||
// called from net threads | ||
void chain_pacemaker::on_hs_msg(const eosio::chain::hs_message &msg) { | ||
uint32_t connection_id = msg.connection_id.value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would throw an exception if the connection_id
optional does not contain a value. It is not clear to me when we expect the connection_id
to contain a value and when not, and what is the benefit of including it in the hs_message
rather than passing it as a separate parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change/fix.
@@ -483,6 +484,11 @@ namespace eosio { | |||
mutable fc::mutex chain_info_mtx; // protects chain_info_t | |||
chain_info_t chain_info GUARDED_BY(chain_info_mtx); | |||
|
|||
mutable fc::mutex seen_hs_msgs_mtx; | |||
std::vector<std::unordered_set<fc::sha256>> seen_hs_msgs_buckets GUARDED_BY(seen_hs_msgs_mtx); | |||
uint64_t seen_hs_msgs_current_bucket = 0 GUARDED_BY(seen_hs_msgs_mtx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't build with clang-16.
uint64_t seen_hs_msgs_current_bucket = 0 GUARDED_BY(seen_hs_msgs_mtx); | |
uint64_t seen_hs_msgs_current_bucket GUARDED_BY(seen_hs_msgs_mtx) = 0; |
seen_hs_msgs_buckets.clear(); | ||
seen_hs_msgs_buckets.resize(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might not be strictly needed here, but maybe still add a lock_guard
so we don't get these warnings.
[72/164] Building CXX object plugins/net_plugin/CMakeFiles/net_plugin.dir/net_plugin.cpp.o
/home/greg/github/enf/leap/plugins/net_plugin/net_plugin.cpp:4200:7: warning: reading variable 'seen_hs_msgs_buckets' requires holding mutex 'seen_hs_msgs_mtx' [-Wthread-safety-analysis]
seen_hs_msgs_buckets.clear();
^
/home/greg/github/enf/leap/plugins/net_plugin/net_plugin.cpp:4201:7: warning: reading variable 'seen_hs_msgs_buckets' requires holding mutex 'seen_hs_msgs_mtx' [-Wthread-safety-analysis]
seen_hs_msgs_buckets.resize(2);
^
2 warnings generated.
[this, connection_id](const hs_vote_message& m) { on_hs_vote_msg(connection_id, m); }, | ||
[this, connection_id](const hs_proposal_message& m) { on_hs_proposal_msg(connection_id, m); }, | ||
[this, connection_id](const hs_new_block_message& m) { on_hs_new_block_msg(connection_id, m); }, | ||
[this, connection_id](const hs_new_view_message& m) { on_hs_new_view_msg(connection_id, m); }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we just call _qc_chain
directly here, instead of going through these 4 almost identical functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to refactor this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed when we decide to delete the "core profiling" code.
using hs_message_inner = std::variant<hs_vote_message, hs_proposal_message, hs_new_block_message, hs_new_view_message>; | ||
|
||
struct hs_message { | ||
std::optional<uint32_t> connection_id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be added to the message. This requires the hs_message_inner
to be copied. Better to pass it along as extra params.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will do that.
This PR is obsolete:
|
Should address #1548
NOT TESTED YET on a networked setup (with chain_pacemaker).
This patch needs some discussion (maybe in the August 31st meeting):
The connection_id wiring serves to exclude the original sender during a rebroadcast/gossip of an incoming message. The connection_id std::optional is also used in some places to differentiate the sending of an original message from a rebroadcast (if it is std::nullopt, it is an original message, otherwise it is a rebroadcast whose original sender is the given connection_id).
Unless I am missing something, the uint32_t connection_id is how you identify the original sender connection, and how you exclude it in a rebroadcast of the same message.