Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dag] Integrate dag fetcher with RB node handler #9585

Merged
merged 1 commit into from
Aug 24, 2023
Merged

Conversation

ibalajiarun
Copy link
Contributor

@ibalajiarun ibalajiarun commented Aug 9, 2023

Description

This PR integrates DAG fetcher to the RB node handler. Whenever the node is missing parents, the fetcher is requested to fetch those missing parents and their ancestors. To make sure the GCed DAG is not refetched, it changes the lowest_round in dag_store to provide the information.

Test Plan

Existing tests.

@sasha8 sasha8 self-assigned this Aug 10, 2023
@sasha8 sasha8 self-requested a review August 10, 2023 06:42
error!("request to fetch failed: {}", err);
}
bail!(NodeBroadcastHandleError::MissingParents);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there are missing parents in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged this check with the below one. This one will miss the case when a node has all invalid parents.

let current_round = node.metadata().round();

// round 0 is a special case and does not require any parents
if current_round == 0 {
return Ok(());
bail!(NodeBroadcastHandleError::InvalidRound);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should start to carry more debug information into the errors, like exact round, missing parents id, why it's invalid etc

Copy link
Contributor Author

@ibalajiarun ibalajiarun Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this now. I will do a separate pass for improving the errors overall and start adding metric counters.

NodeBroadcastHandleError::MissingParents
);
if prev_round >= dag_reader.lowest_round() {
if let Err(err) = self.fetch_requester.request_for_node(node) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should at least unify this condition with the one below, it also doesn't look correct to me, maybe what you meant to check is prev_round > highest_round?

previously the check is to filter out node that below lowest round, the error was misleading

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I did not account for check inversion - I was doing ensure! before. But, I removed this condition and just use the one with missing parents check below. Previously I was not checking for signatures if none of the parents existed locally, but I should have.

@ibalajiarun ibalajiarun requested a review from a team as a code owner August 14, 2023 20:27

let dag_reader = self.dag.read();
// check if the parent round is missing in the DAG
ensure!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still need to filter out the lower round right? otherwise we'd fetch all those nodes from gc'ed rounds

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. I can send you a node from a very old round and force you to fetch if it is below the GC level. We should ignore nodes from rounds below the GC level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I needed the DAG store to support this minimum round to fix this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's becoming hard to write unit tests without a verifier trait. I will follow-up to introduce that trait and then we can have more unit tests for this case. Added TODO for now.

@@ -92,18 +88,21 @@ impl NodeBroadcastHandler {
.cloned()
.collect();
if !missing_parents.is_empty() {
// For each missing parent, verify their signatures and voting power
// For each missing parent, verify their signatures and voting power. Otherwise, a malicious node can send
// bad nodes with fake parents and cause this peer to issue unnecessary fetch requests.
ensure!(
missing_parents
.iter()
.all(|parent| { parent.verify(&self.epoch_state.verifier).is_ok() }),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in node.verify()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to only verify the missing parents. Ideally, we won't miss any parents in the normal case and don't have to verify at all.

Base automatically changed from balaji/dag-e2e-6 to main August 22, 2023 21:36
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Contributor

✅ Forge suite compat success on aptos-node-v1.6.2 ==> 72cf56d05ce9326379f9eee77b2a88054f45d704

Compatibility test results for aptos-node-v1.6.2 ==> 72cf56d05ce9326379f9eee77b2a88054f45d704 (PR)
1. Check liveness of validators at old version: aptos-node-v1.6.2
compatibility::simple-validator-upgrade::liveness-check : committed: 4152 txn/s, latency: 7016 ms, (p50: 6600 ms, p90: 9900 ms, p99: 14200 ms), latency samples: 174420
2. Upgrading first Validator to new version: 72cf56d05ce9326379f9eee77b2a88054f45d704
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1775 txn/s, latency: 15952 ms, (p50: 19300 ms, p90: 22000 ms, p99: 22600 ms), latency samples: 92340
3. Upgrading rest of first batch to new version: 72cf56d05ce9326379f9eee77b2a88054f45d704
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1796 txn/s, latency: 16445 ms, (p50: 19200 ms, p90: 22300 ms, p99: 22500 ms), latency samples: 93440
4. upgrading second batch to new version: 72cf56d05ce9326379f9eee77b2a88054f45d704
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3388 txn/s, latency: 9332 ms, (p50: 10200 ms, p90: 12700 ms, p99: 13300 ms), latency samples: 135520
5. check swarm health
Compatibility test for aptos-node-v1.6.2 ==> 72cf56d05ce9326379f9eee77b2a88054f45d704 passed
Test Ok

@github-actions
Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 72cf56d05ce9326379f9eee77b2a88054f45d704

two traffics test: inner traffic : committed: 6027 txn/s, latency: 6495 ms, (p50: 6300 ms, p90: 8100 ms, p99: 16300 ms), latency samples: 2615860
two traffics test : committed: 100 txn/s, latency: 3184 ms, (p50: 2900 ms, p90: 3900 ms, p99: 10000 ms), latency samples: 1740
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.224, avg: 0.213", "QsPosToProposal: max: 0.408, avg: 0.197", "ConsensusProposalToOrdered: max: 0.711, avg: 0.651", "ConsensusOrderedToCommit: max: 0.551, avg: 0.522", "ConsensusProposalToCommit: max: 1.223, avg: 1.173"]
Max round gap was 1 [limit 4] at version 1340981. Max no progress secs was 4.017658 [limit 10] at version 1342809.
Test Ok

@github-actions
Copy link
Contributor

✅ Forge suite framework_upgrade success on aptos-node-v1.5.1 ==> 72cf56d05ce9326379f9eee77b2a88054f45d704

Compatibility test results for aptos-node-v1.5.1 ==> 72cf56d05ce9326379f9eee77b2a88054f45d704 (PR)
Upgrade the nodes to version: 72cf56d05ce9326379f9eee77b2a88054f45d704
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 4568 txn/s, latency: 7152 ms, (p50: 7800 ms, p90: 10200 ms, p99: 10800 ms), latency samples: 169040
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> 72cf56d05ce9326379f9eee77b2a88054f45d704 passed
Test Ok

@zekun000 zekun000 merged commit cfb7c43 into main Aug 24, 2023
73 checks passed
@zekun000 zekun000 deleted the balaji/dag-rb-fetch branch August 24, 2023 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants