Conversation
Did a first pass over the code. I think we are close, but there may be a bug in `poll_next` in how we use and set `to_confirm`. Would you mind adding a test to confirm this? Thanks!
fendermint/vm/topdown/src/lib.rs
Outdated
@@ -45,6 +47,30 @@ pub struct Config {
    pub exponential_back_off: Duration,
    /// The max number of retries for exponential backoff before giving up
    pub exponential_retry_limit: usize,
    /// The minimal number of blocks one should make the topdown proposal
As we discussed in sync, something that I am wondering here is whether this `min_proposal_interval` needs to be the same among all of the validators. What happens if one decides to set this value to 5 and another to 100? As long as both have the finality information for the proposal it should be fine, but it may defeat the initial purpose of using this approach. WDYT?
Suggested change:
-    /// The minimal number of blocks one should make the topdown proposal
+    /// The minimum number of blocks one should make the topdown proposal
Unless we keep it constant and don't let users freely configure this value, as I see you are doing above.
@@ -12,6 +12,8 @@ use ipc_sdk::staking::StakingChangeRequest;
/// Finality provider that can handle null blocks
#[derive(Clone)]
pub struct FinalityWithNull {
    /// The min topdown proposal height interval
Suggested change:
-    /// The min topdown proposal height interval
+    /// The topdown proposal interval
Why do we need a minimum? We should have a maximum, not a minimum. If you have caught up with the parent Lotus chain, and it proposes a block every 30 seconds, why could we not propose every single one?
Or what is this a minimum of? The number of blocks between proposals on the child subnet?
I've been thinking a lot about this, and I agree with @aakoshh here. In the current model this offers us nothing: you could just as easily propose every non-null height and the operation would be the same.
Ok(Some(if next_proposal_height < latest_height {
    // safe to unwrap as we are sure `latest_height` will not be null block
    self.min_nonnull_block(next_proposal_height, latest_height)?
Looking forward here as you do is fine as long as the info in the cache only becomes available once the receipt for that height has been executed and is available. If the cache can ensure this we can do this; if not, we should look backward instead of forward to be safe.
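For reference, a minimal sketch of the forward search being discussed, assuming the cache maps heights to `Some(payload)` for non-null rounds and `None` for null rounds (the map layout and names are assumptions, not the PR's actual cache):

```rust
use std::collections::BTreeMap;

/// Find the first non-null height in `[from, to]`, scanning forward.
/// Returns `None` if every round in the range is null or missing.
fn min_nonnull_block(
    cache: &BTreeMap<u64, Option<Vec<u8>>>, // height -> payload; None = null round
    from: u64,
    to: u64,
) -> Option<u64> {
    if from > to {
        return None;
    }
    cache
        .range(from..=to)
        .find(|(_, payload)| payload.is_some())
        .map(|(height, _)| *height)
}
```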
use crate::BlockHeight;
use std::fmt::{Display, Formatter};

/// There are three pointers, each refers to a block height, when syncing with parent. As Lotus has
This comment is redundant with the one below. Maybe remove this one and keep the one from `sync.rs`?
/// We only want the non-null parent block's hash
async fn non_null_parent_hash(&self) -> BlockHash {
    let parent_height = match (self.sync_pointers.to_confirm(), self.sync_pointers.tail()) {
What about calling it `.pending_exec` instead of `.to_confirm`? The latter confused me for a moment.
Suggested change:
-    let parent_height = match (self.sync_pointers.to_confirm(), self.sync_pointers.tail()) {
+    let parent_height = match (self.sync_pointers.pending_exec(), self.sync_pointers.tail()) {
    self.head += 1;
}

pub fn advance_confirm(&mut self, height: BlockHeight) {
This refers to the next one to confirm, right?
Suggested change:
-pub fn advance_confirm(&mut self, height: BlockHeight) {
+pub fn next_confirm(&mut self, height: BlockHeight) {
    tracing::debug!(height, "found height to confirm");
    height
}
(None, height) => {
This will only happen in the beginning, right? According to https://github.com/consensus-shipyard/fendermint/pull/442/files#diff-7f8fecef12fedb16dfaea5611b9eccaf7be6b3ffa36059933234d77b74310534R172-R183, after the first block this should always be `Some`.

Also, `advance_confirm` is only called there, so I don't know to what extent `to_confirm` will ever be set to some value other than `None`. I am afraid there may be a bug there.
let next_proposal_height = last_committed_height + self.min_proposal_interval;

Ok(Some(if next_proposal_height < latest_height {
    // safe to unwrap as we are sure `latest_height` will not be null block
    self.min_nonnull_block(next_proposal_height, latest_height)?
Let's see if I get this right:

- `last_committed_height` is the last finalized block from the ledger
- `latest_height` is the highest record inserted into the cache; the fetcher guarantees that there are at least `chain_delay` blocks between that and the tip of the parent chain
- hopefully it is also not more than some limited distance away from `last_committed_height`, as discussed in https://github.com/consensus-shipyard/fendermint/issues/437

Why should we impose on ourselves a `min_proposal_interval` when we can propose anything between `last_committed_height` and `latest_height`? Going back from `latest_height` to find a non-null block should be perfectly fine.

I suspect this exists because `latest_height` is still not limited in its look-ahead, and so you want to search going forward from `last_committed_height`, but don't want to propose 1-by-1.

Once again, a `max_proposal_interval` would make more sense to me (a sketch follows below): it limits the potential blocks that have to be fetched and executed in one block, and when in sync it allows proposals to go 1-by-1, because the root chain is so much slower.
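To make that concrete, here is a rough sketch of the `max_proposal_interval` rule I have in mind, assuming a cache keyed by height with `None` marking null rounds (the cache shape and names are illustrative, not the PR's actual API):

```rust
use std::collections::BTreeMap;

type BlockHeight = u64;

/// Propose the highest non-null height we know of, capped at
/// `last_committed_height + max_proposal_interval`; no minimum jump needed.
fn propose_next_height(
    cache: &BTreeMap<BlockHeight, Option<Vec<u8>>>, // None = null round
    last_committed_height: BlockHeight,
    latest_height: BlockHeight,
    max_proposal_interval: BlockHeight,
) -> Option<BlockHeight> {
    let upper = latest_height.min(last_committed_height + max_proposal_interval);
    if upper <= last_committed_height {
        return None; // nothing new to propose yet
    }
    // Search backwards from the cap for the closest non-null round.
    cache
        .range(last_committed_height + 1..=upper)
        .rev()
        .find(|(_, payload)| payload.is_some())
        .map(|(height, _)| *height)
}
```

When fully in sync, `upper` is simply the latest height, so proposals can go 1-by-1.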
I think "Going back from `latest_height` to find a non-null block should be perfectly fine" might not work, because other peers might be much slower, so the proposed height would always be rejected by those peers. Starting from `last_committed_height` should be safer: even if someone is way ahead, peers should eventually catch up and the proposal will be accepted.
Regarding "might not work, because other peers might be much slower, so the proposed height will always be rejected by peers": I agree, which is why I keep referring to https://github.com/consensus-shipyard/fendermint/issues/437, where I said that we should limit `latest_height` by not fetching more than e.g. 100 blocks ahead of `last_committed_height`, and why I have in this thread multiple times suggested replacing this new `min_topdown_interval` with a `max_topdown_interval`, which achieves almost the same thing, except that in the linked issue I would also limit the maximum size of the cache that anyone would fetch.
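A tiny sketch of the cap I mean, as a hypothetical guard in the syncer's fetch loop (the numbers and names are illustrative):

```rust
/// Only fetch the next parent height while the cache stays within
/// `max_cache_blocks` (e.g. 100) of the last committed height.
fn should_fetch(next_height: u64, last_committed_height: u64, max_cache_blocks: u64) -> bool {
    next_height <= last_committed_height + max_cache_blocks
}
```

With such a guard the cache size is bounded, and a `max_topdown_interval` on the proposal side never has to reach beyond what peers can be expected to have fetched.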
fendermint/vm/topdown/src/lib.rs
Outdated
/// Default topdown proposal height interval
pub(crate) const DEFAULT_PROPOSAL_INTERVAL: BlockHeight = 10;
This would mean that we catch up with the root chain in bursts of 5-minute jumps (10 blocks at roughly 30 seconds each). Why should it work like that?
if let Some(h) = self.to_confirm {
    self.tail = h;
}
self.to_confirm = Some(height);
Can this jump multiple steps? E.g. does it make sense that `tail` is 4 and `to_confirm` is 5, then `to_confirm` becomes 10 but `tail` is just 5?
/// executed (also with some checks to ensure no reorg has occurred). We fetch block 2's data and set
/// `tail = 2`, `to_confirm = Some(4)`. Then height 2 is ready to be proposed.
#[derive(Clone, Debug)]
pub(crate) struct SyncPointers {
Thank you for the detailed example and description 🙏

I don't understand why we need this `to_confirm` at all; when we have a non-null height we know that everything before it has been executed, and regardless of whether we know that some block has a follow-up that "confirms" the execution results, we should refrain from using it. That way everyone is on equal footing and there is no reason to wait for any new blocks.
Regarding "when we have a non-null height we know that everything before has been executed": this is reflected in the `tail` variable. And `tail` is actually the latest height in the cache, so when proposing, everything up till `tail` will be available for proposal. But the thing is: if there is a new non-null block, how are we going to populate the cache? We don't know which previous block's data we should pull. If we do this, there will be empty data for non-null blocks. If we just wait, this won't happen.

If you are suggesting we can remove `to_confirm` and replace it with `tail`, I think that's a good simplification.
To be honest, I don't really want to think in terms of tails and heads and waiting for confirms.

What I have been proposing for a week is much simpler: as soon as you see a block, you know that its parent has been executed, so when we finalize a given block, let's limit ourselves to the events of its parent.

If we did not have null blocks, this would simply be the previous height. With null blocks we have to find the parent. But it's already in the cache, or it's already in the parent chain; the important thing is that there is no situation where we have to wait for anything to happen.
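A sketch of that rule, with the null-round check abstracted behind a closure since in practice it would hit the cache first and fall back to querying the parent chain (names are illustrative):

```rust
/// As soon as a block at `finalized` exists, every height below it has been
/// executed, so its closest non-null ancestor is safe to use without waiting.
fn executed_parent(finalized: u64, is_null_round: impl Fn(u64) -> bool) -> Option<u64> {
    // Walk backwards from `finalized - 1`; no step here ever has to wait
    // for a new block to appear.
    (1..finalized).rev().find(|h| !is_null_round(*h))
}
```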
/// Poll the next block height. Returns finalized and executed block data.
async fn poll_next(&mut self) -> Result<Option<(BlockHeight, ParentViewPayload)>, Error> {
    let height = self.sync_pointers.head() + 1;
I'm a bit lost. `head` can be a non-null but not-yet-executed block, right? But here we start at its height + 1, and if we find something there, then `sync` will set all intermediary blocks as null. I don't see why they are all necessarily null blocks.
    self.sync_pointers.advance_confirm(height);
    return Ok(Some((to_confirm, data)));
}

tracing::warn!(height, "non-null round at height, waiting for confirmation");
self.sync_pointers.advance_head();
I don't understand why the head doesn't move when there is something to confirm :(
It was an error; I fixed it based on integration testing.
    return Ok(None);
}
return Err(Error::CannotQueryParent(err, height));
Is this the normal error when there is just no block yet at that height?
I don't think "no block yet" will happen, since we are delaying our polling by some value. Or are you referring to null blocks?
I just don't understand when it returns `None`; I see it returns `None` when it's a null block, but I don't see what happens when there is no block at all, like if you poll too often. Probably I don't understand anything; maybe it's not even polling as you say.
Sorry, this is too complicated for my 🧠

I have a vague understanding that this isn't doing what I suggested in https://github.com/consensus-shipyard/fendermint/issues/436 by looking back, and it's also not limiting the cache size like I thought it would per https://github.com/consensus-shipyard/fendermint/issues/437.

Instead it does multiple things: it tries to go ahead by a minimum of 10 blocks at a time (so jump 10 blocks once every 5 minutes), and it introduces a complex dance of only proposing when the execution results are available.

I can't say if the code is correct 🤯 but as I see it, this only applies to the filling of the cache; it does not apply to how you query the parent provider synchronously during block execution. Maybe it's there, but I could not identify it: if a validator sees that a block is finalized by the others, but their own particular parent has not executed that block yet, it will still get empty results from the API, correct?

I can't help but feel that everything would be so much simpler if we just forgot about waiting for results to appear for the finalized height, and embraced deferred execution by looking backwards where we know we will have data.
@aakoshh The issue I had with looking backwards is that the entire backwards cache can be null blocks. I have seen in the logs that on calibration around 7 or 8 blocks in a row can be null. So if we don't find any executed blocks in the cache, then we need to start looking forward. And if I know the current block is a non-null block, why not wait for the next non-null block? It's the same idea, but we don't have to look backwards or forwards.
Co-authored-by: Akosh Farkash <[email protected]>
…d/fendermint into remove-look-ahead
No, it can't.
Co-authored-by: adlrocha <[email protected]>
@aakoshh Sorry, I might have to disagree and challenge a bit on this.

This is what

So where do we look for the parent that can be executed? If it is a separate variable that tracks the previous non-null block, then it is the

In the next block we polled, it is a non-null block, and we store it in the cache.

Now we need to make a proposal, which with

What I think is the correct behaviour is: before a block is confirmed to be executed, it should not be stored in the cache at all, along with heights 100 and 101. We should use a separate variable to track it. Only when block 102 is confirmed to be executed should it be allowed into the cache. That way we can make sure that at least the latest block in the cache is an executed block. When making proposals, we are safe to look forward, because the upper bound will always be an executed non-null block.
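For what it's worth, a compact sketch of the tracking described here, with a simplified pointer struct (not the PR's exact types): a non-null block is parked in `to_confirm` and only promoted into the cache once a later non-null block proves it was executed.

```rust
type BlockHeight = u64;

#[derive(Default)]
struct SyncPointers {
    tail: BlockHeight,               // latest height whose data is safely in the cache
    to_confirm: Option<BlockHeight>, // non-null height waiting for its execution proof
    head: BlockHeight,               // highest height polled so far
}

impl SyncPointers {
    /// Record a polled height; returns the height that is now safe to cache,
    /// i.e. the previously parked non-null block that this one confirms.
    fn on_polled(&mut self, height: BlockHeight, is_null: bool) -> Option<BlockHeight> {
        self.head = height;
        if is_null {
            return None; // null rounds neither confirm nor need confirming
        }
        // This non-null block proves the parked one was executed.
        let confirmed = self.to_confirm.replace(height);
        if let Some(h) = confirmed {
            self.tail = h;
        }
        confirmed
    }
}
```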
…d/fendermint into remove-look-ahead
@cryptoAtwill thanks for the example in your comment. I think you still misunderstand what I am trying to propose. I say that when block

Please don't get caught up in what the cache does. I said this before: forget the cache. It's nice to have a cache, but the ultimate source of truth is the parent blockchain. Maybe all blocks in the cache are null, maybe the cache is empty for all I know, but I would know that the last finalized block was not null and wasn't executed, so there is always something.

To be honest, I'm not even sure what we are discussing here any more. I am not trying to convince anyone about how the cache works today; it doesn't matter. What I was trying to convince everyone about is that it's much simpler to implement a syncer which does not have to wait for another block to appear so that the finalized block is executed, if all we do is use the effects up to, but not including, the finalized block. IMO it is conceptually the correct thing to do as well, because if we had a more crypto-oriented blockchain where you can request proofs about state, since we have deferred execution, with block
Can you address this one in an additional PR so it is easier to review? Thanks!
@@ -266,7 +266,11 @@ where

// error happens if we cannot get the validator set from ipc agent after retries
let validator_changes = provider
-    .validator_changes_from(prev_height + 1, finality.height)
+    .validator_changes_from(
+        prev_height + 1,
Suggested change:
-        prev_height + 1,
+        prev_height,
-    .validator_changes_from(prev_height + 1, finality.height)
+    .validator_changes_from(
+        prev_height + 1,
+        finality.height,
Suggested change:
-        finality.height,
+        finality.height - 1,
- We may need to update the cache not to clean `finality.height - 1` (even if it's null, we know that the state is final up till here).
- We need to apply this same approach to top-down messages (see the sketch below).

If we receive height 100 and we don't yet have that info because we are not up to date, we should either crash or not make progress, but never execute, to avoid consensus failures.
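Putting the two suggestions together, the window shifts down by one on both ends, since under deferred execution the effects of a finalized height only become visible at the next non-null height. A hedged sketch, with `Provider` standing in for the real parent-view API:

```rust
use ipc_sdk::staking::StakingChangeRequest;

/// Hypothetical stand-in for the real parent-view API.
trait Provider {
    fn validator_changes_from(&self, from: u64, to: u64) -> Vec<StakingChangeRequest>;
}

/// `finality_height` itself is not executed yet, while `prev_height` was
/// excluded from the previous window for the same reason, so the range
/// becomes `[prev_height, finality_height - 1]`.
fn changes_for_finality<P: Provider>(
    provider: &P,
    prev_height: u64,
    finality_height: u64,
) -> Vec<StakingChangeRequest> {
    provider.validator_changes_from(prev_height, finality_height - 1)
}
```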
.map(|r| r.value)
// now we query the validator changes at height `to` and we also check the block hash,
// so that we know the parent node is reliable enough.
let mut r = self.validator_changes_with_check(to, block_hash).await?;
Can we do the `block_hash` check against what we have in the cache, if we have something? If not, let's not do this check, because we are getting whatever we think is final from the parent (assuming no re-orgs).
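Something like this conditional check is what I have in mind (a sketch; the cache shape and names are assumed):

```rust
use std::collections::HashMap;

type BlockHash = Vec<u8>;

/// Compare against the cached hash when we have one; with nothing cached we
/// accept the parent's answer, since we treat what it returns as final
/// (assuming no re-orgs).
fn block_hash_ok(cache: &HashMap<u64, BlockHash>, height: u64, fetched: &BlockHash) -> bool {
    match cache.get(&height) {
        Some(expected) => expected == fetched,
        None => true, // no local view to check against
    }
}
```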
}

/// Get top down message in the range `from` to `to`, both inclusive. It also validates the block
Update here to use the same range we are using for validator changes: from `prev_finality` to `latest_finality - 1`.
    Ok(None)
}

fn propose_next_height(&self) -> Stm<Option<BlockHeight>> {
We shouldn't use an interval and always propose a specific height, because this gives us nothing. Ideally we should (see the sketch after this list):

- Propose whatever I see as the latest height, as long as it is not higher than `prev_committed_finality + MAX_DELTA`. If this height is null, we always search back.
- `MAX_DELTA` for now should be big enough to ensure at least one non-null block in that window (we can address a small `MAX_DELTA` with all-null rounds in the future). For now, if everything is null, let's crash, because it breaks our assumption; it would mean there is something really wrong in the parent.
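A sketch of that rule, under the stated assumption that the window always contains at least one non-null block (all names are illustrative, and the all-null case crashes on purpose):

```rust
use std::collections::BTreeSet;

/// Propose the latest known non-null height, capped at
/// `prev_committed_finality + max_delta`, searching back when the cap is null.
fn propose_height(
    non_null_heights: &BTreeSet<u64>, // non-null heights we have synced
    prev_committed_finality: u64,
    latest_height: u64,
    max_delta: u64,
) -> u64 {
    let upper = latest_height.min(prev_committed_finality + max_delta);
    assert!(upper > prev_committed_finality, "nothing above the committed finality yet");
    *non_null_heights
        .range(prev_committed_finality + 1..=upper)
        .next_back()
        // An all-null window breaks our assumption and means something is
        // really wrong in the parent, so crash rather than propose garbage.
        .expect("no non-null block within MAX_DELTA of the committed finality")
}
```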
Merge this with Akosh's proposal of having an additional delay from proposal to acceptance: https://github.com/consensus-shipyard/fendermint/pull/435/files
* new proposal logic
* fmt
* fix typo
* minor changes
* adding check
* ignore proposal height
* Fix type
* expose config params
---------
Co-authored-by: Alfonso de la Rocha <[email protected]>
Signed-off-by: Alfonso de la Rocha <[email protected]>
This PR simplifies the existing `look ahead` and `look back` for non-null blocks and adds more secure logic for handling delayed execution.

The key change is in the background cache handling: only when a block is non-null and confirmed to have been executed will it be injected into the cache. The topdown proposal will then just be `last_committed_block` + some height. This also eliminates the need for delaying blocks in proposal execution, as one can be sure the proposed height will be executed.

This is done through: https://github.com/consensus-shipyard/fendermint/pull/442/files#diff-7f8fecef12fedb16dfaea5611b9eccaf7be6b3ffa36059933234d77b74310534R78