-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Locality] Location-aware NodeSetChecker #2500
base: main
Are you sure you want to change the base?
Conversation
crates/bifrost/src/providers/replicated_loglet/replication/checker.rs
Outdated
Show resolved
Hide resolved
crates/bifrost/src/providers/replicated_loglet/replication/checker.rs
Outdated
Show resolved
Hide resolved
43b2609
to
2c47453
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I left 2 really minor comments (mainly questions)
I would love to see explanation of how f-majority algorithm works. I still can't fully comprehend how it works.
This introduces a new sophisticated and performance-optimized nodeset checker for bifrost flexible quorums. It pre-computes aggregates to reduce the cost of repeatative quorum checks. The new nodeset checker has extensive test coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive work @AhmedSoliman! It is quite a bit of brain jogging to follow you along here. As far as I can tell, the code looks good to me. I left my usual nitpicks about duplicate words in the comments ;-) +1 for merging.
Nice trick with the raw pointers!
} | ||
|
||
pub fn is_empty(&self) -> bool { | ||
self.node_attribute.is_empty() | ||
/// The number of nodes in that nodeset that has matches the storage state predicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// The number of nodes in that nodeset that has matches the storage state predicate | |
/// The number of nodes in that nodeset that matches the storage state predicate |
for (_, v) in self.node_attribute.iter_mut() { | ||
*v = Default::default(); | ||
} | ||
/// How many nodes have attributes set. Note that this doesn't count nodes with has not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// How many nodes have attributes set. Note that this doesn't count nodes with has not | |
/// How many nodes have attributes set. Note that this doesn't count nodes which have not |
.expect("total domains must fit into u32"); | ||
debug_assert!(scope_state.replication_factor > 0); | ||
let f_majority_needed = | ||
if candidate_domains >= u32::from(scope_state.replication_factor - 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why - 1
is needed here. Wouldn't this mean that we need to find f_majority
if the replication factor is 1
and candidate_domains == 0
? However, if candidate_domains == 0
, then there is no way to form a valid write quorum given a replication of 1 and, thus, f_majority is already reached?
// strings at this scope (the node scope has the most number of domains) | ||
node_id.to_string() | ||
} else if location.is_scope_defined(*scope) { | ||
location.domain_string(*scope, Some(node_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we providing the node_id
as an argument if we exclude the Node
scope from this branch? Wouldn't it be ignored in all other NodeLocationScope
cases?
if authoritative >= fmajority_requires { | ||
return FMajorityResult::SuccessWithRisk; | ||
fn rm_attr_internal(&mut self, node_id: PlainNodeId, attr: &Attr, storage_state: StorageState) { | ||
// Apply this attribute on every scope |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Apply this attribute on every scope | |
// Remove this attribute on every scope |
num_complete_domains: HashMap<Attr, u32>, | ||
/// For each Attribute, set of the domains at that scope that have at least one node with the | ||
/// attribute. Those are the failure domains considered for | ||
/// replication/write-quorum/write-quorum. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// replication/write-quorum/write-quorum. | |
/// replication/write-quorum. |
if fd.as_ref().is_domain_complete(&attr) { | ||
*scope_state | ||
.num_complete_domains | ||
.entry(attr.clone()) | ||
.or_default() += 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the implementation of is_domain_complete
which requires num authoritative nodes > 0, won't we double count a failure domain if it contains more than one authoritative node?
{ | ||
// For each attribute, merge the set of FailureDomainState that have at least one node | ||
// with the attribute here. We stop merging immediately after the set's size reaches | ||
// the replication target. This makes `check_write_quorum` O(replication * num_domain_matches * num_scopes) in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This O notation does not refer to this check_write_quorum
but the one belonging to the NodeSetChecker
, does it? If so, maybe move it there. The num_scopes is a bit confusing in this context.
Probably it's already to late for my brain. I don't see how replication * num_domain_matches * num_scopes
is the worst case. Isn't it a bit better because of the early break condition: O(num_scopes * (#attributes + replication))
(worst case for every scope we need to go over all attributes and in the last attribute our predicate matches and we find replication
domains)?
replication_factor: u8, | ||
/// For each attribute, the number of domains at that scope that have all their nodes with the | ||
/// attribute. | ||
num_complete_domains: HashMap<Attr, u32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what is this field used? I can only see that we update it but we seem to never really read it.
checker.set_attribute(10, false); | ||
// trying 3 instead | ||
checker.set_attribute(3, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have set [1, 2, 3] to true right away instead of going through 10?
This introduces a new sophisticated and performance-optimized nodeset checker for bifrost flexible quorums. It pre-computes aggregates to reduce the cost of repeatative quorum checks.
The new nodeset checker has extensive test coverage