This repository has been archived by the owner on Jan 11, 2024. It is now read-only.

Remove look ahead #442

Merged
merged 24 commits into main from remove-look-ahead
Nov 29, 2023

Conversation

@cryptoAtwill (Contributor) commented Nov 22, 2023

This PR simplifies the existing look-ahead and look-back for non-null blocks and adds more robust logic for handling delayed execution.

The key change is in the background cache handling: only when a block is non-null and confirmed to have been executed is it injected into the cache. The topdown proposal is then simply last_committed_block + some height. This also eliminates the need to delay blocks in proposal execution, as one can be sure the proposed height has been executed.

This is done through: https://github.com/consensus-shipyard/fendermint/pull/442/files#diff-7f8fecef12fedb16dfaea5611b9eccaf7be6b3ffa36059933234d77b74310534R78
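For illustration, the proposal rule described above could look roughly like this (a sketch; the names last_committed_height, proposal_interval and latest_executed_in_cache are illustrative, not the PR's exact fields):

// Sketch: propose a fixed offset past the last committed block, but only
// if the cache already knows that height was executed.
fn next_proposal_height(
    last_committed_height: u64,
    proposal_interval: u64,
    latest_executed_in_cache: u64, // the cache holds only executed blocks
) -> Option<u64> {
    let candidate = last_committed_height + proposal_interval;
    // Safe to propose: every height up to latest_executed_in_cache is executed.
    (candidate <= latest_executed_in_cache).then_some(candidate)
}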

@adlrocha (Contributor) left a comment:

Did a first pass over the code. I think we are close, but there may be a bug in poll_next in how we use and set to_confirm. Would you mind adding a test to confirm this? Thanks!

@@ -45,6 +47,30 @@ pub struct Config {
pub exponential_back_off: Duration,
/// The max number of retries for exponential backoff before giving up
pub exponential_retry_limit: usize,
/// The minimal number of blocks one should make the topdown proposal
Contributor:

As discussed in our sync, something I am wondering here is whether this min_proposal_interval needs to be the same among all of the validators. What happens if one decides to set this value to 5 and another to 100? As long as both have the finality information for the proposal it should be fine, but it may defeat the initial purpose of this approach. WDYT?

Suggested change
/// The minimal number of blocks one should make the topdown proposal
/// The minimum number of blocks one should make the topdown proposal

Contributor:

Unless we keep it constant and don't let users freely configure this value, as I see you are doing above.

@@ -12,6 +12,8 @@ use ipc_sdk::staking::StakingChangeRequest;
/// Finality provider that can handle null blocks
#[derive(Clone)]
pub struct FinalityWithNull {
/// The min topdown proposal height interval
Contributor:

Suggested change
/// The min topdown proposal height interval
/// The topdown proposal interval

Contributor:

Why do we need a minimum? We should have a maximum, not a minimum. If you have caught up with the parent Lotus chain, and it proposes a block every 30 seconds, why could we not propose every single one?

Or what is this a minimum of? The number of blocks between proposals on the child subnet?

Contributor:

I've been thinking a lot about this, and I agree with @aakoshh here. In the current model this offers us nothing: you can easily propose every non-null height and the operation would be the same.


Ok(Some(if next_proposal_height < latest_height {
// safe to unwrap as we are sure `latest_height` will not be null block
self.min_nonnull_block(next_proposal_height, latest_height)?
Contributor:

Looking forward here as you do is fine as long as the info in the cache only becomes available once the block at that height has been executed and its receipt is available. If the cache can ensure this, we can do this; if not, we should look backward instead of forward to be safe.
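For reference, the forward scan under discussion could be shaped like this (a hypothetical sketch, assuming the cache maps heights to Option<payload> with None marking a null round; not the PR's actual implementation):

use std::collections::BTreeMap;

// Find the first non-null block at or after `start`, up to `end` inclusive.
fn min_nonnull_block(
    cache: &BTreeMap<u64, Option<Vec<u8>>>,
    start: u64,
    end: u64,
) -> Option<u64> {
    (start..=end).find(|h| matches!(cache.get(h), Some(Some(_))))
}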

fendermint/vm/topdown/src/sync/syncer.rs (outdated comment, resolved)
use crate::BlockHeight;
use std::fmt::{Display, Formatter};

/// There are three pointers, each refers to a block height, when syncing with parent. As Lotus has
Contributor:

This comment is redundant with the one below. Maybe remove this one and keep the one from sync.rs?


/// We only want the non-null parent block's hash
async fn non_null_parent_hash(&self) -> BlockHash {
let parent_height = match (self.sync_pointers.to_confirm(), self.sync_pointers.tail()) {
Contributor:

What about calling it .pending_exec instead of .to_confirm? The latter confused me for a moment.

Suggested change
let parent_height = match (self.sync_pointers.to_confirm(), self.sync_pointers.tail()) {
let parent_height = match (self.sync_pointers.pending_exec(), self.sync_pointers.tail()) {

self.head += 1;
}

pub fn advance_confirm(&mut self, height: BlockHeight) {
Contributor:

This refers to the next one to confirm, right?

Suggested change
pub fn advance_confirm(&mut self, height: BlockHeight) {
pub fn next_confirm(&mut self, height: BlockHeight) {

tracing::debug!(height, "found height to confirm");
height
}
(None, height) => {
Contributor:

This will only happen in the beginning, right? According to https://github.com/consensus-shipyard/fendermint/pull/442/files#diff-7f8fecef12fedb16dfaea5611b9eccaf7be6b3ffa36059933234d77b74310534R172-R183, after the first block this should always be Some.

Also, advance_confirm is only called there, so I don't know to what extent to_confirm will ever be set to a value other than None. I am afraid there may be a bug there.

Comment on lines 159 to 163
let next_proposal_height = last_committed_height + self.min_proposal_interval;

Ok(Some(if next_proposal_height < latest_height {
// safe to unwrap as we are sure `latest_height` will not be null block
self.min_nonnull_block(next_proposal_height, latest_height)?
Contributor:

Let's see if I get this right:

  • last_committed_height is the last finalized block from the ledger
  • latest_height is the highest record inserted into the cache; the fetcher guarantees that there are at least chain_delay blocks between that and the tip of the parent chain
  • hopefully it is also not more than some limited distance away from last_committed_height as discussed in https://github.com/consensus-shipyard/fendermint/issues/437

Why should we impose on ourselves a min_proposal_interval when we can propose anything between last_committed_height and latest_height? Going back from latest_height to find a non-null block should be perfectly fine.

I suspect this exists because latest_height is still not limited in its look-ahead, and so you want to search something going from last_committed_height forward, but don't want to propose 1-by-1.

Once again a max_proposal_interval would make more sense to me: it limits the potential blocks that have to be fetched and executed in one block, and when in sync it allows proposals to go 1-by-1, because the root chain is so much slower.

Contributor (author):

I think "Going back from latest_height to find a non-null block should be perfectly fine" might not work, because other peers might be much slower, so the proposed height would always be rejected by peers. Starting from last_committed_height should be safer: even if someone is way ahead, peers should eventually catch up and the proposal will be accepted.

Contributor:

"might not work, because other peers might be much slower, so the proposed height would always be rejected by peers."

I agree, which is why I keep referring to https://github.com/consensus-shipyard/fendermint/issues/437,
where I said that we should limit latest_height by not fetching more than e.g. 100 blocks ahead of last_committed_height, and why I have multiple times in this thread suggested replacing this new min_topdown_interval with a max_topdown_interval, which achieves almost the same thing; in the linked issue I would also limit the maximum size of the cache that anyone would fetch.

Comment on lines 34 to 35
/// Default topdown proposal height interval
pub(crate) const DEFAULT_PROPOSAL_INTERVAL: BlockHeight = 10;
Contributor:

This would mean that we catch up with the root chain in bursts of 5-minute jumps (10 blocks at roughly 30 seconds each). Why should it work like that?

Comment on lines 57 to 60
if let Some(h) = self.to_confirm {
self.tail = h;
}
self.to_confirm = Some(height);
Contributor:

Can this jump multiple steps? E.g. does it make sense that tail is 4, to_confirm is 5, then to_confirm becomes 10 but tail is just 5?
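Mechanically, the quoted snippet does permit such a jump; a minimal reproduction with assumed starting values:

fn main() {
    let (mut tail, mut to_confirm) = (4u64, Some(5u64));
    // Inline the advance_confirm(10) body from the diff above:
    if let Some(h) = to_confirm {
        tail = h;
    }
    to_confirm = Some(10);
    // to_confirm jumped to 10 while tail only advanced to the old to_confirm.
    assert_eq!((tail, to_confirm), (5, Some(10)));
}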

/// executed (also with some checks to ensure no reorg has occurred). We fetch block 2's data and set
/// `tail = 2`, `to_confirm = Some(4)`. Then height 2 is ready to be proposed.
#[derive(Clone, Debug)]
pub(crate) struct SyncPointers {
Contributor:

Thank you for the detailed example and description 🙏

I don't understand why we need this to_confirm at all; when we have a non-null height we know that everything before it has been executed, and regardless of whether we know that some block has a follow-up that "confirms" the execution results, we should refrain from using it. That way everyone is on equal footing and there is no reason to wait for any new blocks.

Contributor (author):

"when we have a non-null height we know that everything before it has been executed": this is reflected in the tail variable, and tail is actually the latest height in the cache. So when proposing, everything up to tail is available for proposal. But the thing is: if there is a new block that is non-null, how are we going to populate the cache? We don't know which previous block's data we should pull. If we do this, there will be empty data for non-null blocks. If we just wait, this won't happen.

If you are suggesting we can remove to_confirm and replace it with tail, I think that's a good simplification.

Contributor:

To be honest I don't really want to think in terms of tails and heads and waiting for confirms.

What I have been proposing for a week is much simpler: as soon as you see a block, you know that its parent has been executed, so when we finalize a given block, let's limit ourselves to the events of its parent.

If we did not have null blocks, this would be simply the previous height. With null blocks we have to find the parent. But it's already in the cache, or it's already in the parent chain; the important thing is there is no situation where we have to wait for anything to happen.
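A sketch of this look-back rule (illustrative names; is_null stands in for a cache or parent-chain lookup):

// When block `n` is finalized, execute effects only up to its first
// non-null ancestor above what was already committed.
fn first_nonnull_parent(
    n: u64,
    last_committed: u64,
    is_null: impl Fn(u64) -> bool,
) -> Option<u64> {
    (last_committed + 1..n).rev().find(|h| !is_null(*h))
}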


/// Poll the next block height. Returns finalized and executed block data.
async fn poll_next(&mut self) -> Result<Option<(BlockHeight, ParentViewPayload)>, Error> {
let height = self.sync_pointers.head() + 1;
Contributor:

I'm a bit lost. head can be a non-null but not-yet-executed block, right? But here we start at its height + 1, and if we find something there, then the sync will set all intermediary blocks as null. I don't see why they are all necessarily null blocks.

Comment on lines 181 to 186
self.sync_pointers.advance_confirm(height);
return Ok(Some((to_confirm, data)));
}

tracing::warn!(height, "non-null round at height, waiting for confirmation");
self.sync_pointers.advance_head();
@aakoshh (Contributor) commented Nov 22, 2023:

I don't understand why the head doesn't move when there is something to confirm :(

Contributor (author):

It was an error; I fixed it during integration testing.


return Ok(None);
}
return Err(Error::CannotQueryParent(err, height));
Contributor:

Is this the normal error when there is just no block yet at that height?

Contributor (author):

I don't think "no block yet" will happen, as we are delaying our polling by some value. Or are you referring to null blocks?

Contributor:

I just don't understand when it returns None; I see it returns None when it's a null block, but I don't see what happens when there is no block at all, like if you poll too often. Probably I don't understand anything, maybe it's not even polling as you say.

@aakoshh (Contributor) left a comment:

Sorry, this is too complicated for my 🧠

I have a vague understanding that this isn't doing what I suggested in https://github.com/consensus-shipyard/fendermint/issues/436 by looking back, and it's also not limiting the cache size like I thought it would in https://github.com/consensus-shipyard/fendermint/issues/437

Instead it does multiple things: it tries to go ahead by a minimum of 10 blocks at a time (so it jumps 10 blocks once every 5 minutes), and it introduces a complex dance of only proposing when the execution results are available.

I can't say if that code is correct 🤯 but as far as I can see this only applies to the filling of the cache; it does not apply to how you query the parent provider synchronously during block execution. Maybe it's there, but I could not identify it: if a validator sees that a block is finalized by the others, but their own particular parent node has not executed that block yet, it will still get empty results from the API, correct?

I can't help but feel that everything would be so much simpler if we just forgot about waiting for results to appear for the finalized height, and embraced deferred execution by looking backwards where we know we will have data.

@cryptoAtwill (Contributor, author):
@aakoshh The issue I had with looking backwards is that all the backwards cache can be null blocks. I have seen in the logs that in calibration around 7 or 8 blocks can be null. So if we don't find any executed blocks in the cache, then we need to start looking forward. So if I know the current block is a non-null block, why not wait for the next non-null block? It's the same idea, but we don't have to look backwards or forwards.

@aakoshh (Contributor) commented Nov 23, 2023:

"The issue I had with looking backwards is that all the backwards cache can be null blocks."

No, it can't.

  1. We finalize block N. But we don't wait for its effects, we execute the effects of its ancestors.
  2. We poll and wait for a new non-null block to appear.
  3. We see and finalize block N+k. We look for its parent that we can execute.
  4. You say there can be 7-8 null blocks in those k blocks since block N.
  5. But that's perfectly fine, because we never executed block N, so that's at least one block we can execute now.

@cryptoAtwill (Contributor, author):
@aakoshh Sorry, I might have to disagree and challenge this a bit.

"We poll and wait for a new non-null block to appear."

This is what head does.

"We see and finalize block N+k. We look for its parent that we can execute."

So where do we look for the parent that can be executed? If it is a separate variable that tracks the previous non-null block, then it is the tail in the above implementation. If it is stored in the cache, then "the issue I had with looking backwards is that all the backwards cache can be null blocks" will happen. Say we just started the parent syncer and we have the following in the cache:

# cache data
last_committed_height -> Null Block(height 100) -> Null Block(height 101)

The next block we poll is a non-null block, and we store it in the cache:

# cache data
last_committed_height -> Null Block(height 100) -> Null Block(height 101) -> Non Null Block (height 102)

Now we need to make a proposal; the latest_height in the cache is 102. But we cannot propose this block because it might not have been executed yet, and all its parents are null.

What I think is the correct behaviour: before a block is confirmed to have been executed, it should not be stored in the cache at all, and neither should heights 100 and 101; we should use a separate variable to track it. Only when block 102 is confirmed to have been executed should it be allowed into the cache. That way we can make sure that at least the latest block in the cache is an executed block. When making proposals, we are safe to look forward because the upper bound will always be an executed non-null block.
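The insert-on-confirm behaviour described here could be sketched like this (illustrative types and names, not the PR's exact ones):

// A fetched non-null block is parked until a later non-null block proves
// it was executed; only then does it enter the cache.
struct Syncer {
    cache: Vec<(u64, Vec<u8>)>,      // only executed, non-null blocks
    pending: Option<(u64, Vec<u8>)>, // fetched, not yet confirmed executed
}

impl Syncer {
    fn on_new_nonnull_block(&mut self, height: u64, data: Vec<u8>) {
        // Seeing a new non-null block confirms the pending one was executed.
        if let Some(confirmed) = self.pending.take() {
            self.cache.push(confirmed);
        }
        self.pending = Some((height, data));
    }
}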

@aakoshh (Contributor) commented Nov 27, 2023:

@cryptoAtwill thanks for the example in your comment.

I think you still misunderstand what I am trying to propose.

I say that when block N is final, we don't execute it, we only execute its parent (the first non-null parent). That leaves block N unexecuted. Then, you see a non-null descendant block N+k being final. By definition you have at least one block to execute: N. I don't know how else to say this.

"So where do we look for the parent that can be executed? If it is a separate variable that tracks the previous non-null block, then it is the tail in the above implementation. If it is stored in the cache, then 'the issue I had with looking backwards is that all the backwards cache can be null blocks' will happen."

Please don't get caught up in what the cache does. I said this before: forget the cache. It's nice to have a cache, but the ultimate source of truth is the parent blockchain. Maybe all blocks in the cache are null, maybe the cache is empty for all I know, but I would know that the last finalized block was not null and wasn't executed, so there is always something.

To be honest I'm not even sure what we are discussing here any more. I don't want to argue about how the cache works today; it doesn't matter. What I was trying to convince everyone about is that it's much simpler to implement a syncer which does not have to wait for another block to appear so that the finalized block is executed, if all we do is use the effects up to, but not including, the finalized block.

IMO it is conceptually the correct thing to do as well: if we had a more crypto-oriented blockchain where you can request proofs about state, then since we have deferred execution, with block N being final you can only request such proofs about the state hash included in its header, which is for block N-1.

@adlrocha (Contributor) left a comment:

Can you address this one in an additional PR so it is easier to review? Thanks!

@@ -266,7 +266,11 @@ where

// error happens if we cannot get the validator set from ipc agent after retries
let validator_changes = provider
.validator_changes_from(prev_height + 1, finality.height)
.validator_changes_from(
prev_height + 1,
Contributor:

Suggested change
prev_height + 1,
prev_height,

.validator_changes_from(prev_height + 1, finality.height)
.validator_changes_from(
prev_height + 1,
finality.height,
Contributor:

Suggested change
finality.height,
finality.height - 1,

Contributor:

  • We may need to update the cache not to clean finality.height - 1 (even if it's null, we know that the state is final up to here).
  • We need to apply this same approach to top-down messages.

If we receive 100 and we don't yet have that info because we are not up to date, we should either crash or not make progress, but never execute, to avoid consensus failures.

.map(|r| r.value)
// now we query the validator changes at height `to` and we also checks the block hash,
// so that we know the parent node is reliable enough.
let mut r = self.validator_changes_with_check(to, block_hash).await?;
Contributor:

Can we do the block_hash check according to what we have in the cache, if we have something? If not, let's not do this check because we are getting whatever we think is final from the parent (assuming no re-orgs).

}

/// Get top down message in the range `from` to `to`, both inclusive. It also validates the block
Contributor:

Update here to use the same range we have for validator changes: from prev_finality to latest_finality - 1.

Ok(None)
}

fn propose_next_height(&self) -> Stm<Option<BlockHeight>> {
Contributor:

We shouldn't use an interval and always propose a specific height, because this gives us nothing. Ideally we should:

  • Propose whatever I see as the latest height, as long as it is not higher than prev_committed_finality + MAX_DELTA. If this height is null, we always search back.

For now MAX_DELTA should be big enough to ensure at least one non-null block in that window (we can address a small MAX_DELTA with all null rounds in the future). For now, if everything is null, let's crash, because it breaks our assumptions; it would mean there is something really wrong in the parent.
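A sketch of this MAX_DELTA rule (the constant and helper names are hypothetical, not part of the PR):

const MAX_DELTA: u64 = 100; // assumed value, large enough to contain a non-null block

// Propose the latest height capped at prev + MAX_DELTA, searching back
// for a non-null block; panic if the whole window is null rounds.
fn propose(
    prev_committed_finality: u64,
    latest_height: u64,
    is_null: impl Fn(u64) -> bool,
) -> u64 {
    let cap = latest_height.min(prev_committed_finality + MAX_DELTA);
    (prev_committed_finality + 1..=cap)
        .rev()
        .find(|h| !is_null(*h))
        .expect("all heights in the window are null rounds: parent chain is broken")
}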

Contributor:

Merge this with Akosh's proposal of having an additional delay from proposal to acceptance: https://github.com/consensus-shipyard/fendermint/pull/435/files

cryptoAtwill and others added 2 commits November 29, 2023 18:26
* new proposal logic

* fmt

* fix typo

* minor changes

* adding check

* ignore proposal height

* Fix type

* expose config params

---------

Co-authored-by: Alfonso de la Rocha <[email protected]>
@maciejwitowski maciejwitowski merged commit dca666c into main Nov 29, 2023
5 checks passed
@maciejwitowski maciejwitowski deleted the remove-look-ahead branch November 29, 2023 20:26