This repository has been archived by the owner on Jan 11, 2024. It is now read-only.

Remove look ahead #442

Merged
merged 24 commits into main from remove-look-ahead
Nov 29, 2023

Conversation

@cryptoAtwill (Contributor) commented Nov 22, 2023

This PR simplifies the existing look-ahead and look-back for non-null blocks and adds more robust logic for handling delayed execution.

The key change is in the background cache handling: only when a block is non-null and confirmed to have been executed is it injected into the cache. The topdown proposal is then simply last_committed_block + some height. This also eliminates the need to delay blocks in proposal execution, as one can be sure the proposed height has been executed.

This is done through: https://github.com/consensus-shipyard/fendermint/pull/442/files#diff-7f8fecef12fedb16dfaea5611b9eccaf7be6b3ffa36059933234d77b74310534R78
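For illustration, the proposal rule described above could look roughly like this (a sketch; the names last_committed_height, proposal_interval and latest_executed_in_cache are illustrative, not the PR's exact fields):

// Sketch: propose a fixed offset past the last committed block, but only
// if the cache already knows that height was executed.
fn next_proposal_height(
    last_committed_height: u64,
    proposal_interval: u64,
    latest_executed_in_cache: u64, // the cache holds only executed blocks
) -> Option<u64> {
    let candidate = last_committed_height + proposal_interval;
    // Safe to propose: every height up to latest_executed_in_cache is executed.
    (candidate <= latest_executed_in_cache).then_some(candidate)
}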

@adlrocha (Contributor) left a comment:

Did a first pass over the code. I think we are close, but there may be a bug in poll_next in how we use and set to_confirm. Would you mind adding a test to confirm this? Thanks!

@@ -45,6 +47,30 @@ pub struct Config {
pub exponential_back_off: Duration,
/// The max number of retries for exponential backoff before giving up
pub exponential_retry_limit: usize,
/// The minimal number of blocks one should make the topdown proposal
Contributor:

As discussed in our sync, something I am wondering here is whether this min_proposal_interval needs to be the same among all of the validators. What happens if one decides to set this value to 5 and another to 100? As long as both have the finality information for the proposal it should be fine, but it may defeat the initial purpose of this approach. WDYT?

Suggested change
/// The minimal number of blocks one should make the topdown proposal
/// The minimum number of blocks one should make the topdown proposal

Contributor:

Unless we keep it constant and don't let users freely configure this value, as I see you are doing above.

@@ -12,6 +12,8 @@ use ipc_sdk::staking::StakingChangeRequest;
/// Finality provider that can handle null blocks
#[derive(Clone)]
pub struct FinalityWithNull {
/// The min topdown proposal height interval
Contributor:

Suggested change
/// The min topdown proposal height interval
/// The topdown proposal interval

Contributor:

Why do we need a minimum? We should have a maximum, not a minimum. If you have caught up with the parent Lotus chain, and it proposes a block every 30 seconds, why could we not propose every single one?

Or what is this a minimum of? The number of blocks between proposals on the child subnet?

Contributor:

I've been thinking a lot about this, and I agree with @aakoshh here. In the current model this offers us nothing: you can easily propose every non-null height and the operation would be the same.


Ok(Some(if next_proposal_height < latest_height {
// safe to unwrap as we are sure `latest_height` will not be null block
self.min_nonnull_block(next_proposal_height, latest_height)?
Contributor:

Looking forward here as you do is fine as long as the info in the cache only becomes available once the block at that height has been executed and its receipt is available. If the cache can ensure this, we can do this; if not, we should look backward instead of forward to be safe.
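For reference, the forward scan under discussion could be shaped like this (a hypothetical sketch, assuming the cache maps heights to Option<payload> with None marking a null round; not the PR's actual implementation):

use std::collections::BTreeMap;

// Find the first non-null block at or after `start`, up to `end` inclusive.
fn min_nonnull_block(
    cache: &BTreeMap<u64, Option<Vec<u8>>>,
    start: u64,
    end: u64,
) -> Option<u64> {
    (start..=end).find(|h| matches!(cache.get(h), Some(Some(_))))
}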

fendermint/vm/topdown/src/sync/syncer.rs (outdated comment, resolved)
use crate::BlockHeight;
use std::fmt::{Display, Formatter};

/// There are three pointers, each refers to a block height, when syncing with parent. As Lotus has
Contributor:

This comment is redundant with the one below. Maybe remove this one and keep the one from sync.rs?


/// We only want the non-null parent block's hash
async fn non_null_parent_hash(&self) -> BlockHash {
let parent_height = match (self.sync_pointers.to_confirm(), self.sync_pointers.tail()) {
Contributor:

What about calling it .pending_exec instead of .to_confirm? The latter confused me for a moment.

Suggested change
let parent_height = match (self.sync_pointers.to_confirm(), self.sync_pointers.tail()) {
let parent_height = match (self.sync_pointers.pending_exec(), self.sync_pointers.tail()) {

self.head += 1;
}

pub fn advance_confirm(&mut self, height: BlockHeight) {
Contributor:

This refers to the next one to confirm, right?

Suggested change
pub fn advance_confirm(&mut self, height: BlockHeight) {
pub fn next_confirm(&mut self, height: BlockHeight) {

tracing::debug!(height, "found height to confirm");
height
}
(None, height) => {
Contributor:

This will only happen in the beginning, right? According to https://github.com/consensus-shipyard/fendermint/pull/442/files#diff-7f8fecef12fedb16dfaea5611b9eccaf7be6b3ffa36059933234d77b74310534R172-R183, after the first block this should always be Some.

Also, advance_confirm is only called there, so I don't know to what extent to_confirm will ever be set to a value other than None. I am afraid there may be a bug there.

Comment on lines 159 to 163
let next_proposal_height = last_committed_height + self.min_proposal_interval;

Ok(Some(if next_proposal_height < latest_height {
// safe to unwrap as we are sure `latest_height` will not be null block
self.min_nonnull_block(next_proposal_height, latest_height)?
Contributor:

Let's see if I get this right:

  • last_committed_height is the last finalized block from the ledger
  • latest_height is the highest record inserted into the cache; the fetcher guarantees that there are at least chain_delay blocks between that and the tip of the parent chain
  • hopefully it is also not more than some limited distance away from last_committed_height as discussed in https://github.com/consensus-shipyard/fendermint/issues/437

Why should we impose on ourselves a min_proposal_interval when we can propose anything between last_committed_height and latest_height? Going back from latest_height to find a non-null block should be perfectly fine.

I suspect this exists because latest_height is still not limited in its look-ahead, and so you want to search something going from last_committed_height forward, but don't want to propose 1-by-1.

Once again a max_proposal_interval would make more sense to me: it limits the potential blocks that have to be fetched and executed in one block, and when in sync it allows proposals to go 1-by-1, because the root chain is so much slower.

Contributor (author):

I think "Going back from latest_height to find a non-null block should be perfectly fine" might not work, because other peers might be much slower, so the proposed height would always be rejected by peers. Starting from last_committed_height should be safer: even if someone is way ahead, peers should eventually catch up and the proposal will be accepted.

Contributor:

"might not work, because other peers might be much slower, so the proposed height would always be rejected by peers."

I agree, which is why I keep referring to https://github.com/consensus-shipyard/fendermint/issues/437,
where I said that we should limit latest_height by not fetching more than e.g. 100 blocks ahead of last_committed_height, and why I have multiple times in this thread suggested replacing this new min_topdown_interval with a max_topdown_interval, which achieves almost the same thing; in the linked issue I would also limit the maximum size of the cache that anyone would fetch.

Comment on lines 34 to 35
/// Default topdown proposal height interval
pub(crate) const DEFAULT_PROPOSAL_INTERVAL: BlockHeight = 10;
Contributor:

This would mean that we catch up with the root chain in bursts of 5-minute jumps (10 blocks at roughly 30 seconds each). Why should it work like that?

Comment on lines 57 to 60
if let Some(h) = self.to_confirm {
self.tail = h;
}
self.to_confirm = Some(height);
Contributor:

Can this jump multiple steps? E.g. does it make sense that tail is 4, to_confirm is 5, then to_confirm becomes 10 but tail is just 5?
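Mechanically, the quoted snippet does permit such a jump; a minimal reproduction with assumed starting values:

fn main() {
    let (mut tail, mut to_confirm) = (4u64, Some(5u64));
    // Inline the advance_confirm(10) body from the diff above:
    if let Some(h) = to_confirm {
        tail = h;
    }
    to_confirm = Some(10);
    // to_confirm jumped to 10 while tail only advanced to the old to_confirm.
    assert_eq!((tail, to_confirm), (5, Some(10)));
}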

/// executed (also with some checks to ensure no reorg has occurred). We fetch block 2's data and set
/// `tail = 2`, `to_confirm = Some(4)`. Then height 2 is ready to be proposed.
#[derive(Clone, Debug)]
pub(crate) struct SyncPointers {
Contributor:

Thank you for the detailed example and description 🙏

I don't understand why we need this to_confirm at all; when we have a non-null height we know that everything before it has been executed, and regardless of whether we know that some block has a follow-up that "confirms" the execution results, we should refrain from using it. That way everyone is on equal footing and there is no reason to wait for any new blocks.

Contributor (author):

"when we have a non-null height we know that everything before it has been executed": this is reflected in the tail variable, and tail is actually the latest height in the cache. So when proposing, everything up to tail is available for proposal. But the thing is: if there is a new block that is non-null, how are we going to populate the cache? We don't know which previous block's data we should pull. If we do this, there will be empty data for non-null blocks. If we just wait, this won't happen.

If you are suggesting we can remove to_confirm and replace it with tail, I think that's a good simplification.

Contributor:

To be honest I don't really want to think in terms of tails and heads and waiting for confirms.

What I have been proposing for a week is much simpler: as soon as you see a block, you know that its parent has been executed, so when we finalize a given block, let's limit ourselves to the events of its parent.

If we did not have null blocks, this would be simply the previous height. With null blocks we have to find the parent. But it's already in the cache, or it's already in the parent chain; the important thing is there is no situation where we have to wait for anything to happen.
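A sketch of this look-back rule (illustrative names; is_null stands in for a cache or parent-chain lookup):

// When block `n` is finalized, execute effects only up to its first
// non-null ancestor above what was already committed.
fn first_nonnull_parent(
    n: u64,
    last_committed: u64,
    is_null: impl Fn(u64) -> bool,
) -> Option<u64> {
    (last_committed + 1..n).rev().find(|h| !is_null(*h))
}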


/// Poll the next block height. Returns finalized and executed block data.
async fn poll_next(&mut self) -> Result<Option<(BlockHeight, ParentViewPayload)>, Error> {
let height = self.sync_pointers.head() + 1;
Contributor:

I'm a bit lost. head can be a non-null but not-yet-executed block, right? But here we start at its height + 1, and if we find something there, then the sync will set all intermediary blocks as null. I don't see why they are all necessarily null blocks.

Comment on lines 181 to 186
self.sync_pointers.advance_confirm(height);
return Ok(Some((to_confirm, data)));
}

tracing::warn!(height, "non-null round at height, waiting for confirmation");
self.sync_pointers.advance_head();
@aakoshh (Contributor) commented Nov 22, 2023:

I don't understand why the head doesn't move when there is something to confirm :(

Contributor (author):

It was an error; I fixed it during integration testing.


return Ok(None);
}
return Err(Error::CannotQueryParent(err, height));
Contributor:

Is this the normal error when there is just no block yet at that height?

Contributor (author):

I don't think "no block yet" will happen, as we are delaying our polling by some value. Or are you referring to null blocks?

Contributor:

I just don't understand when it returns None; I see it returns None when it's a null block, but I don't see what happens when there is no block at all, like if you poll too often. Probably I don't understand anything, maybe it's not even polling as you say.

@aakoshh (Contributor) left a comment:

Sorry, this is too complicated for my 🧠

I have a vague understanding that this isn't doing what I suggested in https://github.com/consensus-shipyard/fendermint/issues/436 by looking back, and it's also not limiting the cache size like I thought it would in https://github.com/consensus-shipyard/fendermint/issues/437

Instead it does multiple things: it tries to go ahead by a minimum of 10 blocks at a time (so it jumps 10 blocks once every 5 minutes), and it introduces a complex dance of only proposing when the execution results are available.

I can't say if that code is correct 🤯 but as far as I can see this only applies to the filling of the cache; it does not apply to how you query the parent provider synchronously during block execution. Maybe it's there, but I could not identify it: if a validator sees that a block is finalized by the others, but their own particular parent node has not executed that block yet, it will still get empty results from the API, correct?

I can't help but feel that everything would be so much simpler if we just forgot about waiting for results to appear for the finalized height, and embraced deferred execution by looking backwards where we know we will have data.

@cryptoAtwill (Contributor, author):
@aakoshh The issue I had with looking backwards is that all the backwards cache can be null blocks. I have seen in the logs that in calibration around 7 or 8 blocks can be null. So if we don't find any executed blocks in the cache, then we need to start looking forward. So if I know the current block is a non-null block, why not wait for the next non-null block? It's the same idea, but we don't have to look backwards or forwards.

@aakoshh (Contributor) commented Nov 23, 2023:

"The issue I had with looking backwards is that all the backwards cache can be null blocks."

No, it can't.

  1. We finalize block N. But we don't wait for its effects, we execute the effects of its ancestors.
  2. We poll and wait for a new non-null block to appear.
  3. We see and finalize block N+k. We look for its parent that we can execute.
  4. You say there can be 7-8 null blocks in those k blocks since block N.
  5. But that's perfectly fine, because we never executed block N, so that's at least one block we can execute now.

@cryptoAtwill (Contributor, author):
@aakoshh Sorry, I might have to disagree and challenge this a bit.

"We poll and wait for a new non-null block to appear."

This is what head does.

"We see and finalize block N+k. We look for its parent that we can execute."

So where do we look for the parent that can be executed? If it is a separate variable that tracks the previous non-null block, then it is the tail in the above implementation. If it is stored in the cache, then "the issue I had with looking backwards is that all the backwards cache can be null blocks" will happen. Say we just started the parent syncer and we have the following in the cache:

# cache data
last_committed_height -> Null Block(height 100) -> Null Block(height 101)

The next block we poll is a non-null block, and we store it in the cache:

# cache data
last_committed_height -> Null Block(height 100) -> Null Block(height 101) -> Non Null Block (height 102)

Now we need to make a proposal; the latest_height in the cache is 102. But we cannot propose this block because it might not have been executed yet, and all its parents are null.

What I think is the correct behaviour: before a block is confirmed to have been executed, it should not be stored in the cache at all, and neither should heights 100 and 101; we should use a separate variable to track it. Only when block 102 is confirmed to have been executed should it be allowed into the cache. That way we can make sure that at least the latest block in the cache is an executed block. When making proposals, we are safe to look forward because the upper bound will always be an executed non-null block.
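The insert-on-confirm behaviour described here could be sketched like this (illustrative types and names, not the PR's exact ones):

// A fetched non-null block is parked until a later non-null block proves
// it was executed; only then does it enter the cache.
struct Syncer {
    cache: Vec<(u64, Vec<u8>)>,      // only executed, non-null blocks
    pending: Option<(u64, Vec<u8>)>, // fetched, not yet confirmed executed
}

impl Syncer {
    fn on_new_nonnull_block(&mut self, height: u64, data: Vec<u8>) {
        // Seeing a new non-null block confirms the pending one was executed.
        if let Some(confirmed) = self.pending.take() {
            self.cache.push(confirmed);
        }
        self.pending = Some((height, data));
    }
}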

@aakoshh (Contributor) commented Nov 27, 2023:

@cryptoAtwill thanks for the example in your comment.

I think you still misunderstand what I am trying to propose.

I say that when block N is final, we don't execute it, we only execute its parent (the first non-null parent). That leaves block N unexecuted. Then, you see a non-null descendant block N+k being final. By definition you have at least one block to execute: N. I don't know how else to say this.

"So where do we look for the parent that can be executed? If it is a separate variable that tracks the previous non-null block, then it is the tail in the above implementation. If it is stored in the cache, then 'the issue I had with looking backwards is that all the backwards cache can be null blocks' will happen."

Please don't get caught up in what the cache does. I said this before: forget the cache. It's nice to have a cache, but the ultimate source of truth is the parent blockchain. Maybe all blocks in the cache are null, maybe the cache is empty for all I know, but I would know that the last finalized block was not null and wasn't executed, so there is always something.

To be honest I'm not even sure what we are discussing here any more. I don't want to argue about how the cache works today; it doesn't matter. What I was trying to convince everyone about is that it's much simpler to implement a syncer which does not have to wait for another block to appear so that the finalized block is executed, if all we do is use the effects up to, but not including, the finalized block.

IMO it is conceptually the correct thing to do as well: if we had a more crypto-oriented blockchain where you can request proofs about state, then since we have deferred execution, with block N being final you can only request such proofs about the state hash included in its header, which is for block N-1.

@adlrocha (Contributor) left a comment:

Can you address this one in an additional PR so it is easier to review? Thanks!

@@ -266,7 +266,11 @@ where

// error happens if we cannot get the validator set from ipc agent after retries
let validator_changes = provider
.validator_changes_from(prev_height + 1, finality.height)
.validator_changes_from(
prev_height + 1,
Contributor:

Suggested change
prev_height + 1,
prev_height,

.validator_changes_from(prev_height + 1, finality.height)
.validator_changes_from(
prev_height + 1,
finality.height,
Contributor:

Suggested change
finality.height,
finality.height - 1,

Contributor:

  • We may need to update the cache not to clean finality.height - 1 (even if it's null, we know that the state is final up to here).
  • We need to apply this same approach to top-down messages.

If we receive 100 and we don't yet have that info because we are not up to date, we should either crash or not make progress, but never execute, to avoid consensus failures.

.map(|r| r.value)
// now we query the validator changes at height `to` and we also checks the block hash,
// so that we know the parent node is reliable enough.
let mut r = self.validator_changes_with_check(to, block_hash).await?;
Contributor:

Can we do the block_hash check according to what we have in the cache, if we have something? If not, let's not do this check because we are getting whatever we think is final from the parent (assuming no re-orgs).

}

/// Get top down message in the range `from` to `to`, both inclusive. It also validates the block
Contributor:

Update here to use the same range we have for validator changes: from prev_finality to latest_finality - 1.

Ok(None)
}

fn propose_next_height(&self) -> Stm<Option<BlockHeight>> {
Contributor:

We shouldn't use an interval and always propose a specific height, because this gives us nothing. Ideally we should:

  • Propose whatever I see as the latest height, as long as it is not higher than prev_committed_finality + MAX_DELTA. If this height is null, we always search back.

For now MAX_DELTA should be big enough to ensure at least one non-null block in that window (we can address a small MAX_DELTA with all null rounds in the future). For now, if everything is null, let's crash, because it breaks our assumptions; it would mean there is something really wrong in the parent.
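A sketch of this MAX_DELTA rule (the constant and helper names are hypothetical, not part of the PR):

const MAX_DELTA: u64 = 100; // assumed value, large enough to contain a non-null block

// Propose the latest height capped at prev + MAX_DELTA, searching back
// for a non-null block; panic if the whole window is null rounds.
fn propose(
    prev_committed_finality: u64,
    latest_height: u64,
    is_null: impl Fn(u64) -> bool,
) -> u64 {
    let cap = latest_height.min(prev_committed_finality + MAX_DELTA);
    (prev_committed_finality + 1..=cap)
        .rev()
        .find(|h| !is_null(*h))
        .expect("all heights in the window are null rounds: parent chain is broken")
}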

Contributor:

Merge this with Akosh's proposal of having an additional delay from proposal to acceptance: https://github.com/consensus-shipyard/fendermint/pull/435/files

cryptoAtwill and others added 2 commits November 29, 2023 18:26
* new proposal logic

* fmt

* fix typo

* minor changes

* adding check

* ignore proposal height

* Fix type

* expose config params

---------

Co-authored-by: Alfonso de la Rocha <[email protected]>
@maciejwitowski maciejwitowski merged commit dca666c into main Nov 29, 2023
5 checks passed
@maciejwitowski maciejwitowski deleted the remove-look-ahead branch November 29, 2023 20:26