
[Locality] Location-aware NodeSetChecker #2500

Open

wants to merge 1 commit into base: main
Conversation

Contributor

@AhmedSoliman AhmedSoliman commented Jan 16, 2025

This introduces a new, sophisticated, performance-optimized nodeset checker for bifrost flexible quorums. It pre-computes aggregates to reduce the cost of repetitive quorum checks.

The new nodeset checker has extensive test coverage.
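As a rough illustration of the pre-computed-aggregates idea (not the actual API in this PR; `TinyNodeSetChecker`, `Attr`, and the methods below are hypothetical), keeping a per-attribute count in sync on every update turns each subsequent quorum check into a cheap lookup instead of a scan over the whole nodeset:

```rust
use std::collections::HashMap;

/// Illustrative sketch only (not this PR's implementation): a checker that
/// keeps a running count per attribute so repeated quorum checks become map
/// lookups instead of scans over the nodeset.
struct TinyNodeSetChecker<Attr: std::hash::Hash + Eq + Clone> {
    node_attr: HashMap<u32, Attr>, // plain node id -> current attribute
    counts: HashMap<Attr, usize>,  // pre-computed aggregate per attribute
}

impl<Attr: std::hash::Hash + Eq + Clone> TinyNodeSetChecker<Attr> {
    fn new() -> Self {
        Self { node_attr: HashMap::new(), counts: HashMap::new() }
    }

    fn set_attribute(&mut self, node_id: u32, attr: Attr) {
        // Keep the aggregate in sync on every update...
        if let Some(old) = self.node_attr.insert(node_id, attr.clone()) {
            *self.counts.entry(old).or_default() -= 1;
        }
        *self.counts.entry(attr).or_default() += 1;
    }

    /// ...so every check afterwards is a cheap lookup, independent of nodeset size.
    fn count(&self, attr: &Attr) -> usize {
        self.counts.get(attr).copied().unwrap_or(0)
    }
}

fn main() {
    let mut checker = TinyNodeSetChecker::new();
    checker.set_attribute(1, "written");
    checker.set_attribute(2, "written");
    checker.set_attribute(2, "sealed"); // overwrites node 2's previous attribute
    assert_eq!(checker.count(&"written"), 1);
    assert_eq!(checker.count(&"sealed"), 1);
}
```

The trade-off in a sketch like this is a small constant cost per attribute update in exchange for cheap reads, which pays off when quorum checks happen far more often than attribute changes.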


github-actions bot commented Jan 16, 2025

Test Results

  7 files  ±0    7 suites  ±0   4m 13s ⏱️ -1s
 47 tests ±0   46 ✅ ±0  1 💤 ±0  0 ❌ ±0 
182 runs  ±0  179 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 4c47f6c. ± Comparison against base commit 6de7e0f.

♻️ This comment has been updated with latest results.

@AhmedSoliman AhmedSoliman marked this pull request as ready for review January 16, 2025 12:59
@AhmedSoliman AhmedSoliman force-pushed the pr2500 branch 2 times, most recently from 43b2609 to 2c47453 Compare January 16, 2025 15:06
Contributor

@muhamadazmy muhamadazmy left a comment

LGTM! I left 2 really minor comments (mainly questions).

I would love to see an explanation of how the f-majority algorithm works. I still can't fully comprehend it.

Contributor

@tillrohrmann tillrohrmann left a comment

Impressive work @AhmedSoliman! It is quite a bit of brain jogging to follow you along here. As far as I can tell, the code looks good to me. I left my usual nitpicks about duplicate words in the comments ;-) +1 for merging.

Nice trick with the raw pointers!

}

pub fn is_empty(&self) -> bool {
self.node_attribute.is_empty()
/// The number of nodes in that nodeset that has matches the storage state predicate
Contributor

Suggested change
/// The number of nodes in that nodeset that has matches the storage state predicate
/// The number of nodes in that nodeset that matches the storage state predicate

for (_, v) in self.node_attribute.iter_mut() {
*v = Default::default();
}
/// How many nodes have attributes set. Note that this doesn't count nodes with has not
Contributor

Suggested change
/// How many nodes have attributes set. Note that this doesn't count nodes with has not
/// How many nodes have attributes set. Note that this doesn't count nodes which have not

.expect("total domains must fit into u32");
debug_assert!(scope_state.replication_factor > 0);
let f_majority_needed =
if candidate_domains >= u32::from(scope_state.replication_factor - 1) {
Contributor

I don't see why - 1 is needed here. Wouldn't this mean that we need to find f_majority if the replication factor is 1 and candidate_domains == 0? However, if candidate_domains == 0, then there is no way to form a valid write quorum given a replication of 1 and, thus, f_majority is already reached?
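A tiny self-contained sketch of the edge case described above (the values are picked purely to illustrate the question, not taken from the PR):

```rust
fn main() {
    let replication_factor: u8 = 1;
    let candidate_domains: u32 = 0;
    // Condition as written in the diff: 0 >= 0 holds, so an f-majority would
    // still have to be found at this scope...
    assert!(candidate_domains >= u32::from(replication_factor - 1));
    // ...whereas without the `- 1` the condition would not hold, matching the
    // intuition that no write quorum can be formed from zero candidate domains.
    assert!(!(candidate_domains >= u32::from(replication_factor)));
}
```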

// strings at this scope (the node scope has the most number of domains)
node_id.to_string()
} else if location.is_scope_defined(*scope) {
location.domain_string(*scope, Some(node_id))
Contributor

Why are we providing the node_id as an argument if we exclude the Node scope from this branch? Wouldn't it be ignored in all other NodeLocationScope cases?

if authoritative >= fmajority_requires {
return FMajorityResult::SuccessWithRisk;
fn rm_attr_internal(&mut self, node_id: PlainNodeId, attr: &Attr, storage_state: StorageState) {
// Apply this attribute on every scope
Contributor

Suggested change
// Apply this attribute on every scope
// Remove this attribute on every scope

num_complete_domains: HashMap<Attr, u32>,
/// For each Attribute, set of the domains at that scope that have at least one node with the
/// attribute. Those are the failure domains considered for
/// replication/write-quorum/write-quorum.
Contributor

Suggested change
/// replication/write-quorum/write-quorum.
/// replication/write-quorum.

Comment on lines +527 to +532
if fd.as_ref().is_domain_complete(&attr) {
*scope_state
.num_complete_domains
.entry(attr.clone())
.or_default() += 1;
}
Contributor

Given the implementation of is_domain_complete which requires num authoritative nodes > 0, won't we double count a failure domain if it contains more than one authoritative node?
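A small, hypothetical illustration of the scenario being asked about (names and values are assumptions; whether the quoted hunk actually executes once per node is exactly the open question):

```rust
use std::collections::HashMap;

// If the increment in the quoted hunk runs once per node rather than once per
// failure domain, a complete domain containing two authoritative nodes would
// bump the counter to 2 rather than 1.
fn main() {
    let mut num_complete_domains: HashMap<&'static str, u32> = HashMap::new();
    let authoritative_nodes_in_same_domain = ["N1", "N2"];
    for _node in authoritative_nodes_in_same_domain {
        // stand-in for `fd.as_ref().is_domain_complete(&attr)` evaluating to true
        let domain_is_complete = true;
        if domain_is_complete {
            *num_complete_domains.entry("attr-A").or_default() += 1;
        }
    }
    // One domain, but counted twice.
    assert_eq!(num_complete_domains["attr-A"], 2);
}
```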

{
// For each attribute, merge the set of FailureDomainState that have at least one node
// with the attribute here. We stop merging immediately after the set's size reaches
// the replication target. This makes `check_write_quorum` O(replication * num_domain_matches * num_scopes) in the
Contributor

This O notation does not refer to this check_write_quorum but the one belonging to the NodeSetChecker, does it? If so, maybe move it there. The num_scopes is a bit confusing in this context.

Probably it's already too late for my brain. I don't see how replication * num_domain_matches * num_scopes is the worst case. Isn't it a bit better because of the early break condition: O(num_scopes * (#attributes + replication)) (worst case: for every scope we need to go over all attributes, and in the last attribute our predicate matches and we find replication domains)?
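For concreteness, a minimal sketch of the merge-with-early-break shape this estimate reasons about (names, types, and the per-scope framing are assumptions, not the PR's code):

```rust
use std::collections::HashSet;

// For each attribute that matches the predicate, its domains are merged into
// one set, and merging stops as soon as the set reaches the replication
// target. With the early break, a single scope touches at most all attributes
// plus roughly `replication` merged domains, which is the intuition behind the
// O(num_scopes * (#attributes + replication)) estimate above.
fn write_quorum_possible(
    domains_per_attr: &[(&str, Vec<u32>)], // attribute -> ids of domains having it
    predicate: impl Fn(&str) -> bool,
    replication: usize,
) -> bool {
    let mut merged: HashSet<u32> = HashSet::new();
    for (attr, domains) in domains_per_attr {
        if !predicate(*attr) {
            continue;
        }
        for d in domains {
            merged.insert(*d);
            if merged.len() >= replication {
                return true; // early break: the replication target is met
            }
        }
    }
    false
}

fn main() {
    let data = vec![("written", vec![1, 2]), ("sealed", vec![3])];
    assert!(write_quorum_possible(&data, |a| a == "written", 2));
    assert!(!write_quorum_possible(&data, |a| a == "sealed", 2));
}
```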

replication_factor: u8,
/// For each attribute, the number of domains at that scope that have all their nodes with the
/// attribute.
num_complete_domains: HashMap<Attr, u32>,
Contributor

What is this field used for? I can only see that we update it, but we never seem to actually read it.

Comment on lines +1151 to +1153
checker.set_attribute(10, false);
// trying 3 instead
checker.set_attribute(3, true);
Contributor

Could we have set [1, 2, 3] to true right away instead of going through 10?
