Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make z2 work again - and incidentally, make networks more robust to stalls. #1947

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 5 additions & 5 deletions z2/src/node_spec.rs
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
use std::{
collections::{HashMap, HashSet},
collections::{BTreeMap, HashSet},
default::Default,
fmt,
};
Expand All @@ -15,7 +15,7 @@ pub struct NodeDesc {

#[derive(Clone, Debug, Serialize, Deserialize, Default)]
pub struct Composition {
pub nodes: HashMap<u64, NodeDesc>,
pub nodes: BTreeMap<u64, NodeDesc>,
}

#[derive(Clone, Debug, Serialize, Deserialize, Default)]
Expand Down Expand Up @@ -63,7 +63,7 @@ fn indices_from_string(input: &str) -> Result<HashSet<u64>> {
impl Composition {
pub fn parse(from: &str) -> Result<Self> {
let mut components = from.split('/');
let mut nodes = HashMap::new();
let mut nodes = BTreeMap::new();
if let Some(val) = components.next() {
for v in indices_from_string(val)? {
nodes.insert(v, NodeDesc { is_validator: true });
Expand All @@ -73,13 +73,13 @@ impl Composition {
}

pub fn single_node(is_validator: bool) -> Self {
let mut nodes = HashMap::new();
let mut nodes = BTreeMap::new();
nodes.insert(0, NodeDesc { is_validator });
Self { nodes }
}

pub fn small_network() -> Self {
let mut nodes = HashMap::new();
let mut nodes = BTreeMap::new();
for i in 0..4 {
nodes.insert(i, NodeDesc { is_validator: true });
}
Expand Down
58 changes: 54 additions & 4 deletions z2/src/setup.rs
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ use alloy::{
};
use anyhow::{anyhow, Context, Result};
use k256::ecdsa::SigningKey;
use libp2p::{Multiaddr, PeerId};
use serde::{Deserialize, Serialize};
use serde_yaml;
use tera::Tera;
Expand Down Expand Up @@ -295,6 +296,10 @@ impl Setup {
self.config.base_port + 2002
}

pub fn get_p2p_port(&self, index: u16) -> u16 {
index + 401 + self.config.base_port
}

pub fn get_explorer_url(&self) -> String {
format!("http://localhost:{0}", self.get_otterscan_port())
}
Expand All @@ -307,8 +312,31 @@ impl Setup {
format!("0.0.0.0:{0}", self.get_docs_port())
}

pub fn get_p2p_multiaddr(&self, index: u16) -> Multiaddr {
// unwrap() is safe because this is a constant string - it's a bug in the program if
// it fails to parse.
format!("/ip4/127.0.0.1/tcp/{0}", self.get_p2p_port(index))
.parse()
.unwrap()
}

pub fn get_peer_id(&self, index: u16) -> Result<PeerId> {
let node_data = self
.config
.node_data
.get(&u64::from(index))
.ok_or(anyhow!("Cannot find node data for node {index}"))?;
let node_key = SecretKey::from_hex(&node_data.secret_key)?;
let libp2p_keypair = node_key.to_libp2p_keypair();
Ok(PeerId::from_public_key(&libp2p_keypair.public()))
}

pub fn get_port_map(&self) -> String {
let mut result = String::new();
result.push_str(&format!(
"🦏 p2p ports are at {0}+<node_index>\n",
self.get_p2p_port(0)
));
result.push_str(&format!(
"🦏 JSON-RPC ports are at {0}+<node_index>\n",
self.get_json_rpc_port(0, false)
Expand Down Expand Up @@ -494,14 +522,36 @@ impl Setup {
self.config.shape.nodes.len(),
&self.config_dir
);

let bootstrap_address =
if let Some((first_index, _)) = self.config.shape.nodes.iter().next() {
let idx = u16::try_from(*first_index)?;
println!(
"Bootstrap_address has idx {idx} with node_data {0:?}",
self.config.node_data.get(&u64::from(idx))
);

Some((self.get_peer_id(idx)?, self.get_p2p_multiaddr(idx)))
} else {
None
};

for (node_index, _node_desc) in self.config.shape.nodes.iter() {
println!("🎱 Generating configuration for node {node_index}...");
let node_index_u16 = u16::try_from(*node_index)?;
let mut cfg = zilliqa::cfg::Config {
otlp_collector_endpoint: Some("http://localhost:4317".to_string()),
bootstrap_address: None,
bootstrap_address: bootstrap_address.clone(),
nodes: Vec::new(),
p2p_port: 0,
external_address: None,
p2p_port: self.get_p2p_port(node_index_u16),
// libp2p's autonat will attempt to infer an external address by having
// the called peer call back. The caller attempts to facilitate this by
// careful choice of outgoing port.
// Sometimes this isn't possible, external address discovery fails, and in
// z2's case, the network cannot form. Specify the external address so that
// we never need to ask (autonat will still fail, but kademlia will be happy
// and the network will operate)
external_address: Some(self.get_p2p_multiaddr(node_index_u16)),
};
// @todo should pass this in!
let port = self.get_json_rpc_port(*node_index as u16, false);
Expand Down Expand Up @@ -564,6 +614,7 @@ impl Setup {
tokio::fs::create_dir(&data_dir_path).await?;
node_config.eth_chain_id = CHAIN_ID | 0x8000;
node_config.data_dir = Some(utils::string_from_path(&data_dir_path)?);
node_config.enable_ots_indices = *node_index == 0;
node_config
.consensus
.genesis_deposits
Expand All @@ -582,7 +633,6 @@ impl Setup {

cfg.nodes = Vec::new();
cfg.nodes.push(node_config);
cfg.p2p_port = 0;
// Now write the config.
let config_path = self.get_config_path(*node_index)?;
println!("🪅 Writing configuration file for node {0} .. ", node_index);
Expand Down
27 changes: 26 additions & 1 deletion zilliqa/src/p2p_node.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ use tokio::{
signal::{self, unix::SignalKind},
sync::mpsc::{self, error::SendError, UnboundedSender},
task::JoinSet,
time::{self, Instant},
};
use tokio_stream::wrappers::UnboundedReceiverStream;
use tracing::*;
Expand Down Expand Up @@ -239,6 +240,9 @@ impl P2pNode {
}
}

let sleep = time::sleep(Duration::from_millis(5));
tokio::pin!(sleep);

let mut terminate = signal::unix::signal(SignalKind::terminate())?;

loop {
Expand Down Expand Up @@ -294,7 +298,6 @@ impl P2pNode {
}
}
}

SwarmEvent::Behaviour(BehaviourEvent::RequestResponse(request_response::Event::Message { message, peer: _source })) => {
match message {
request_response::Message::Request { request, channel: _channel, request_id: _request_id, .. } => {
Expand Down Expand Up @@ -434,6 +437,28 @@ impl P2pNode {
break;
}
}
() = &mut sleep => {
let net_info = self.swarm.network_info();
trace!("p2p_node tick {0} / {1} ", net_info.num_peers(), net_info.connection_counters().num_connections() );
if net_info.num_peers() == 0 && net_info.connection_counters().num_connections() == 0 {
// We have no peers and no connections. Try bootstrapping..
if let Some((peer, address)) = &self.config.bootstrap_address {
if self.swarm.local_peer_id() == peer {
debug!("p2p_node: can't bootstrap against myself");
} else {
debug!("p2p_node: no peers and no connections - bootstrapping!");
self.swarm
.behaviour_mut()
.kademlia
.add_address(peer, address.clone());
self.swarm.behaviour_mut().kademlia.bootstrap()?;
Comment on lines +450 to +454
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kademlia already does this periodically by itself:
https://github.com/libp2p/rust-libp2p/blob/master/protocols/kad/src/behaviour.rs#L990-L996

It also has logic to do this more sanely by triggering a bootstrap when new peers are added and avoid making more than one bootstrap at once.

Does z2 work without this change? Can you see if the automatic bootstrap is working? We can also adjust the automatic_bootstrap_throttle configuration.

Copy link
Contributor Author

@rrw-zilliqa rrw-zilliqa Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

z2 doesn't work without this change, sadly...

The logic here is that, at least at startup, a node starts before the bootstrap; it fails to add the peer (because it's not up yet), and drops the bootstrap. Now it won't add the peer again, even if it comes up, and z2 then deadlocks essentially forever.

I'll see if automatic_bootstrap_throttle will work, but if it's currently set to less than a few hours, I suspect it is already triggering. We could perhaps omit the .kademlia_bootstrap(), but the .add_address() definitely isn't optional.

This also relates to one of the crashing bugs which I have utterly failed (sorry!) to tag as such and now can't find, where all peers get dropped and the network then dissociates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JamesHinshelwood - OK. I've checked. You can't get away without re-adding the address, but (as expected) kademlia will eventually bootstrap itself automatically.

}
} else {
debug!("p2p_node: no peers and no connections, but no bootstrap either! We may be stuck");
}
}
sleep.as_mut().reset(Instant::now() + Duration::from_millis(10000));
},
_ = terminate.recv() => {
self.shard_threads.shutdown().await;
break;
Expand Down
Loading