-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[POC] Add support for automatically joining new OmniPaxos nodes #2478
base: main
Are you sure you want to change the base?
[POC] Add support for automatically joining new OmniPaxos nodes #2478
Conversation
crates/bifrost/src/service.rs
Outdated
|
||
// todo we seem to have a race condition between this call and the provision step which might | ||
// write a different logs configuration | ||
// self.bifrost.admin().init_metadata().await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This problem should already exist in the current main. The problem is that we first provision the NodesConfiguration
and then try to write the Logs
with the configured LogsConfiguration
. If now the node succeeds at joining the cluster and reaches first the point where it starts the BifrostService
, then it can happen that this line writes the initial Logs
which does not respect the LogsConfiguration
that is specified in the provision command.
rpc ConnectTo(stream NetworkMessage) returns (stream NetworkMessage); | ||
|
||
// Try to join an existing metadata store cluster | ||
rpc JoinCluster(JoinClusterRequest) returns (JoinClusterResponse); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to properly document the contract and which status codes are returned in which cases.
if response.as_ref().is_err_and(|err| err.is_network_error()) { | ||
self.choose_different_endpoint(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poor man's solution to handle multiple endpoints. Requests that are rejected by non-leaders return an Unavailable
status. That's why this + the retry policy will help us to find the actual leader by retrying. This also helps with dealing with metadata store nodes that are down. Very simplistic solution.
// This is potentially dangerous because we might have provisioned before on a | ||
// different node but delivering of the response failed. | ||
// todo harden by choosing oneself if one runs the metadata store. If not, then pick | ||
// a single node to reach out to. | ||
self.choose_different_endpoint(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dangerous because we might end up provisioning different metadata store nodes.
// Not really happy about making the `KvMemoryStorage` aware of the NodesConfiguration. I | ||
// couldn't find a better way to let a restarting metadata store know about the latest | ||
// addresses of its peers which it reads from the NodesConfiguration. An alternative could | ||
// be to not support changing addresses. Changing addresses will also only be possible as | ||
// long as we maintain a quorum of running nodes. Otherwise, the nodes might not find each | ||
// other to form quorum. | ||
if let Some(metadata_writer) = self.metadata_writer.as_mut() { | ||
if key == NODES_CONFIG_KEY { | ||
let mut data = self | ||
.kv_entries | ||
.get(&key) | ||
.expect("to be present") | ||
.value | ||
.as_ref(); | ||
match StorageCodec::decode::<NodesConfiguration, _>(&mut data) { | ||
Ok(nodes_configuration) => { | ||
metadata_writer.submit(Arc::new(nodes_configuration)); | ||
} | ||
Err(err) => { | ||
debug!("Failed deserializing NodesConfiguration: {err}"); | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is better to not rely on the NodesConfiguration
to learn about the other nodes' addresses and instead require that their addresses don't change because in the general case this cannot work (stopping all nodes and restarting them with different addresses).
332157b
to
9364667
Compare
// make sure that the storage is initialized with a storage id to be able to detect disk losses | ||
let storage_id = if let Some(storage_id) = rocksdb_storage | ||
.get_storage_id() | ||
.map_err(|err| BuildError::InitStorage(err.to_string()))? | ||
{ | ||
storage_id | ||
} else { | ||
let storage_id = random(); | ||
rocksdb_storage | ||
.set_storage_id(storage_id) | ||
.map_err(|err| BuildError::InitStorage(err.to_string()))?; | ||
storage_id | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The durable storage_id together with the node's PlainNodeId
constitute the MemberId
which is used to check whether a node is part of a configuration or not. In case of a disk loss we will generate a new StorageId
which requires a reconfiguration for a node to rejoin an omni paxos cluster.
// Try to read a persisted nodes configuration in order to learn about the addresses of our | ||
// potential peers and the metadata store states. | ||
if let Some(nodes_configuration) = self | ||
.rocksdb_storage | ||
.get_nodes_configuration() | ||
.map_err(|err| BuildError::InitStorage(err.to_string()))? | ||
{ | ||
metadata_writer | ||
.update(Arc::new(nodes_configuration)) | ||
.await? | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the information like what are the node's addresses or what is my MetadataStoreState
are stored in the NodesConfiguration
. Since we cannot guarantee that we can join a cluster w/o it (e.g. when restarting all nodes at the same time or when failing over during a reconfiguration where we haven't received the omnipaxos configuration yet), we persist it ourselves.
TaskCenter::spawn_unmanaged(TaskKind::Background, "provision-responder", async move { | ||
while let Some(request) = provision_rx.recv().await { | ||
let _ = request.result_tx.send(Ok(false)); | ||
} | ||
})?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reject all future provision requests.
} | ||
|
||
node_config.current_generation.as_plain() | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is needed if another node received the provision signal (e.g. a non metadata store node) and tries to provision the metadata store with a NodesConfiguration
that does not contain the metadata store node.
// todo remove additional indirection from Arc | ||
connection_manager.store(Some(Arc::new(new_connection_manager))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ConnectionManager
needs to know our own metadata peer id to validate that connections are set up correctly. Since we only know what our id will be once we have joined an OmniPaxos cluster, we need to instantiate it lazily. This is not super beautiful :-(
Ok(()) = nodes_config_watch.changed() => { | ||
self.update_node_addresses(&Metadata::with_current(|m| m.nodes_config_ref())); | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to support changing node/peer addresses that are communicated via the NodesConfiguration
.
// we lost leadership :-( notify callers that their requests might not get committed | ||
// because we don't know whether the leader will start with the same log as we have. | ||
self.kv_storage | ||
.fail_callbacks(|| RequestError::Unavailable("lost leadership".into())); | ||
self.fail_join_callbacks(|| JoinClusterError::NotLeader); | ||
} else if !previous_is_leader && self.is_leader { | ||
debug!(configuration_id = %self.cluster_config.configuration_id, "Won leadership"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem I had was that OmniPaxos does not tell you what is the lsn for a given proposal. Therefore, it is hard to track whether a proposal made it or was replaced by a different proposal (e.g. because of a leadership change and a different log suffix). That's why I am quite defensive here to avoid hanging requests.
The correlation between request and log entries is currently done by having an explicit ulid that is part of a request entry.
Self::reset_storage_for_new_configuration( | ||
&mut rocksdb_storage, | ||
&omni_paxos_configuration, | ||
last_decided_index, | ||
)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stop signs are not part of the actual log. Therefore, we need to manually reset the decided index and clear some fields in the RocksDbStorage
in order to reuse the same storage for the next configuration.
// remember the latest configuration we have seen for future checks | ||
Self::reset_storage_for_new_configuration( | ||
&mut rocksdb_storage, | ||
&OmniPaxosConfiguration { | ||
own_member_id: self.own_member_id, | ||
cluster_config: stop_sign.next_config, | ||
members: metadata.members, | ||
}, | ||
last_decided_index, | ||
)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be helpful if this node is removed from the next configuration but some other nodes ask this node whether it knows which nodes are part of the new configuration and what the log is.
// todo only try joining if MetadataStoreState::Candidate | ||
let mut join_cluster = std::pin::pin!(Self::join_cluster(None, storage_id)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential follow-up is to use the MetadataStoreState
stored in the NodesConfiguration
to control which nodes should join a metadata store cluster or not. That could give us a declarative way to define the nodes that should run an active metadata store if they succeed at joining the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relying on the NodesConfiguration
to store metadata store relevant information (e.g. the MetadataStoreState
and addresses) makes things considerably harder because of the circular dependency (we need the metadata store to write/read the NodesConfiguration
but it also controls the behavior of the metadata store). This becomes especially problematic in the case where we fail during a reconfiguration w/o having received the log yet. Especially, if this node is now needed for the metadata store to become available there is no easy way to obtain the NodesConfiguration
by following the normal join flow.
// We cannot assume that we have already joined the cluster and obtained our generational node id. | ||
// That's why we need to retrieve the plain node id based on our name from the latest known | ||
// NodesConfiguration. If the node has no assigned plain node id, then it first needs to obtain | ||
// it by following the regular join path before it can join the metadata store cluster. | ||
let my_node_id = if let Some(node_config) = | ||
nodes_config.find_node_by_name(Configuration::pinned().common.node_name()) | ||
{ | ||
node_config.current_generation.as_plain() | ||
} else { | ||
bail!("The node with name '{}' has not obtained a node id yet. W/o the node id, it cannot join the metadata store cluster.", Configuration::pinned().common.node_name()); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example where we need a persisted NodesConfiguration
to remember our PlainNodeId
if we fail during reconfiguration but then are needed for the availability of the metadata store.
let active_metadata_store_node = nodes_config.iter().filter_map(|(node, config)| { | ||
if config.has_role(Role::MetadataStore) && node != my_node_id { | ||
Some(node) | ||
} else { | ||
None | ||
} | ||
}).choose(&mut thread_rng()).ok_or(anyhow::anyhow!("No other metadata store present in the cluster. This indicates a misconfiguration."))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should consider the MetadataStoreState::Active
in the future.
// once the log grows beyond the configured grpc max message size (by default 4 MB) this | ||
// will no longer work :-( If we shared the snapshot we still have the same problem once the | ||
// snapshot grows beyond 4 MB. Then we need a separate channel (e.g. object store) or | ||
// support for chunked transfer. | ||
let log_prefix = flexbuffers::from_slice(response.log_prefix.as_ref())?; | ||
let omni_paxos_configuration = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bound to break eventually.
fyi @pcholakov for the future testing efforts. |
let entry_bytes = self.serialize_entry(&entry)?; | ||
self.db.put_cf_opt( | ||
&self.get_log_cf_handle(), | ||
self.next_log_key.to_be_bytes(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible for those keys to overlap/overwrite with other keys? probably best is to also prefix those records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment, the log entries and the other variables (promise, decided index, stop sign, etc.) are stored in separate CFs to avoid conflicting keys.
Instead of escalating errors from the ClusterController service which ultimately leads to a panic, this commit catches those errors and logs them on warning in the hope that a retry will fix the problem. The down- side is that a buggy cluster controller will never step down so that another cluster controller can take over as the leader. This fixes restatedev#2276.
This commit makes it configurable which metadata will be run by the Node when starting the Restate server.
This commit adds the skeleton of the Raft metadata store. At the moment only a single node with memory storage is supported. This fixes restatedev#1785.
The raft metadata store does not accept new proposals if there is no known leader. In this situation, request failed with an internal ProposalDropped error. This commit changes the behavior so that a ProposalDropped error will be translated into an unavailable Tonic status. That way, the request will get automatically retried.
This commit adds RocksDbStorage which implements raft::Storage. The RocksDbStorage is a durable storage implementation which is used by the RaftMetadataStore to store the raft state durably. This fixes restatedev#1791.
The OmniPaxos metadata store stores its state in memory.
This commit introduces the ability to specify multiple addresses for the metadata store endpoint. On error, the GrpcMetadataStoreClient randomly switches to another endpoint. Moreover, this commit makes the OmniPaxosMetadataStore only accept requests if it is the leader. Additionally, it fails all pending callbacks if it loses leadership to avoid hanging requests if the request was not decided.
The Restate version enables OmniPaxos to run with a single peer.
… on NodesConfiguration With this change, nodes can change their addresses as long as a quorum of running metadata store nodes keeps running so that the NodesConfiguration can be updated. It is a bit questionable how much worth this feature is. However, w/o it, metadata store nodes behave slightly differently than normal nodes (worker, log store, etc.) as they cannot change their address.
Using the generational node id as the OmniPaxos NodeId is a bit problematic because it will get bumped on every restart. If we fail while joining an existing OmniPaxos cluster and before persisting the configuration, we will try joining with a different id. If we allow higher generations to be accepted, then this is problematic because we can't distinguish between this situation and the loss of the disk. If we are strict about the id equality, then we might get the system stuck because the restarted node cannot join the OmniPaxos cluster which might expect its previous generation to be present. Therefore, this commit introduces the StorageId which is a random number stored in the RocksDbStorage. Whenever we try to join an OmniPaxos cluster we send this StorageId along with the PlainNodeId. Based on the StorageId we can detect if a node lost its disk because it would generate a new one.
The provision grpc call translates into a put command because the local metadata store does not need to be provisioned.
7a5d200
to
6b97053
Compare
This PR contains a first variant of an embedded metadata store based on OmniPaxos. The metadata store has a durable log and supports reconfiguration. The metadata store awaits provisioning. Once provisioned a single Restate node that runs the metadata store role acts as the metadata store cluster. As more nodes join the cluster, those that run the metadata store role will try to join the metadata store cluster by requesting a reconfiguration. Once enough metadata store nodes have joined one should be able to kill floor((n - 1)/2) nodes and things should continue working.
If you want to try things out, you can spawn three Restate server's with the configuration files you find here. Then you need to provision the cluster via
After a short while you should see all nodes running the partition processors. If you see the log line
Run as active metadata store node
on every node, then the metadata store cluster should contain all nodes and you should be able to kill a single random node at your will.Internally, the
OmniPaxosMetadataStore
works the following way:OmniPaxosConfiguration
persisted to disk. If yes, then start as an active metadata storeNodesConfiguration
is known. If yes, then this indicates a prior provisioning. Start as passive metadata storeNodesConfiguration
and start as active metadata storeWhen being a passive metadata store, then try to join an existing cluster by randomly asking any node that runs the metadata store node role. Joining a cluster entails a reconfiguration and sending of the whole log to the new joiner.
An active metadata store runs the
OmniPaxos
library as well as processing metadata store requests and join requests. A metadata store only reacts to requests if it is the leader. This is a simplification to avoid hanging requests because they are never committed.The draft PR has still many rough edges. The main goal is to verify the overall direction and discuss the design decisions (some of them being quite questionable). I've tried to highlight the parts that reviewers should take a closer look at. The best way to look at the PR is probably to take a look at the full result of the
OmniPaxosMetadataStore
struct in thecrates/metadata-store/src/omnipaxos/store.rs
.Some of the missing features are (non-exhaustive):
OmniPaxosMetadataStore
into theMetadataStoreState
lifecycleMetadataStoreState::Active
.This PR is based on #2467.