Summary
The current implementation does not appear to handle Aptos epochs gracefully during Avalanche bootstrap syncing. The error below, which typically appears when the validator set changes or during periods of asynchrony, indicates as much.
[02-08|09:41:50.142] FATAL <rLgK4miC2cHSdc84z8H9iNBZTkqmyRD7xSPJnkkpHc34yhRLY Chain> handler/handler.go:339 shutting down chain {"reason": "received an unexpected error", "error": "rpc error: code = Unknown desc = failed to build block: Internal error: \"Gap in epoch history. Trying to put in LedgerInfo in epoch: 1, current epoch: 4\" while processing sync message: NodeID-EzN4q9mU6TVFkND6oghbdLAUqDacE9Czp Op: chits Message: chain_id:\"p\\x07\\xe5\\xfeﲼ\\x87v\\xb8\\x18h\\xae\\xd8E\\x80/\\x19^\\x98\\x83\\x7fI\\xaf\\xdb\\xe99\\xac\\xe5\\xd4k\\x8a\" request_id:8941 preferred_id:\"\\x0b\\x9adj\\xa0R\\xb1\\x85\\x18\\x04\\x8d~\\xb5\\xed\\x8dG\\xd8\\xccT\\x07h\\xaf\\xaaO8<\\xd1\\xe2\\xc0\\x18\\x15\\xde\" accepted_id:\"\\x0b\\x9adj\\xa0R\\xb1\\x85\\x18\\x04\\x8d~\\xb5\\xed\\x8dG\\xd8\\xccT\\x07h\\xaf\\xaaO8<\\xd1\\xe2\\xc0\\x18\\x15\\xde\" preferred_id_at_height:\"\\xc1\\xddAX\\x92\\x06\\xe1ĄE\\xbe\\x9ar\\x15l\\xf6YEr\\x9f\\xf2ԟ\\x1bL\\x87߯\\xb1\\x14\\xea\\xed\""}
Steps to Reproduce
This is a somewhat challenging error to reproduce. It emerges more reliably over longer running periods and when adding and removing several validators. The simplest procedure I've determined so far is:
Start an M1 subnet on fuji with one validator.
Submit several transactions to this subnet, for example by calling movement aptos init repeatedly.
Add a second validator.
If you do not see the error above while the validator is bootstrapping, remove the second validator, submit more transactions, and try again.
If you still do not see the error above, add a third validator and attempt once more.
Repeat.
You will need to inspect the logs for your chain to view the error. These are stored at ~/.avalanchego/logs/<chain-id>.log
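To make the log inspection less tedious, something like the following could pull out every occurrence of the error along with the two epoch numbers it reports. This is only a sketch: the regular expression is derived from the single error message quoted in this issue, and `find_epoch_gaps` and `scan_log` are hypothetical helper names, not part of any existing tooling.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# Pattern taken from the error text quoted in this issue (an assumption:
# other node versions may word the message differently).
GAP_PATTERN = re.compile(
    r"Gap in epoch history\. Trying to put in LedgerInfo in epoch: (\d+), "
    r"current epoch: (\d+)"
)


def find_epoch_gaps(lines: Iterable[str]) -> List[Tuple[int, int, int]]:
    """Return (line_number, attempted_epoch, current_epoch) for each match."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        match = GAP_PATTERN.search(line)
        if match:
            hits.append((line_no, int(match.group(1)), int(match.group(2))))
    return hits


def scan_log(path: str) -> List[Tuple[int, int, int]]:
    """Scan a chain log file; `~` in the path is expanded."""
    # errors="replace" because the log contains raw, non-UTF-8 message bytes.
    with Path(path).expanduser().open(errors="replace") as handle:
        return find_epoch_gaps(handle)
```

Calling `scan_log` on the log path above (with the real chain ID substituted) would return one tuple per occurrence, which makes it easier to see whether the attempted epoch is always behind the current one.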
Possible Solutions
I see three plausible explanations:
The block execution ordering is currently faulty. We should order incoming blocks in a queue by epoch before they are sent to AptosBlockExecutor and only pop a block off when the ledger epoch matches.
This occurs owing to periods of asynchrony, in which case the blocks are simply missing or not being disseminated. This would largely be a property of the network as a whole and may not be something we can engineer around without a significant redesign.
This occurs owing to a bad reorg strategy, in which case a fuller redesign is necessary.
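To illustrate the first option, here is a minimal sketch of an epoch-ordered buffer. It is written in Python purely for illustration and does not reflect the actual VM code: `EpochBuffer`, `push`, and `pop_ready` are hypothetical names, and the choice to hand back blocks from already-passed epochs as "stale" rather than executing them is an assumption.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class EpochBuffer:
    """Holds blocks until the ledger reaches their epoch (hypothetical helper)."""

    _heap: List[Tuple[int, int, Any]] = field(default_factory=list)
    _arrival: Any = field(default_factory=itertools.count)

    def push(self, epoch: int, block: Any) -> None:
        # The arrival counter breaks ties, so blocks within one epoch keep
        # their arrival order and block payloads are never compared.
        heapq.heappush(self._heap, (epoch, next(self._arrival), block))

    def pop_ready(self, ledger_epoch: int) -> Tuple[List[Any], List[Any]]:
        """Return (ready, stale): blocks at ledger_epoch and blocks behind it.

        Blocks from future epochs stay buffered.
        """
        ready: List[Any] = []
        stale: List[Any] = []
        while self._heap and self._heap[0][0] <= ledger_epoch:
            epoch, _, block = heapq.heappop(self._heap)
            (ready if epoch == ledger_epoch else stale).append(block)
        return ready, stale

    def __len__(self) -> int:
        return len(self._heap)
```

The caller would forward only the `ready` blocks to the executor and decide separately what to do with `stale` ones. Whether buffering alone is enough depends on whether the missing epochs ever arrive, which is exactly what the second and third explanations call into question.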
Another alternative would be to remove epochs altogether. This is non-trivial in the current implementation, as simply setting the same epoch for every block causes an invalid epoch history error similar to the one above.
l-monninger changed the title from "bug:" to "bug: gap in epoch history with dynamic validator set" on Feb 8, 2024
Generally, it seems our reorg strategy is not very effective. If we're expecting the Aptos db, mempool, and execution layer to handle everything that comes out of the consensus engine, that appears to be problematic, at the very least when dealing with epochs.
I would start by attempting to remove epochs altogether and seeing if this problem persists. It doesn't appear you can do this simply, because of the expectations of the Aptos BlockExecutor. You can't, for example, just set an epoch parameter to a large number. You also can't just reuse the same epoch number over and over again, because of the transaction commitments.
It may be helpful to specify the exact order in which the engine calls into your VM, so that we can trace the execution path that leads to the error. This would also help me put in perspective what may be going wrong within the VM, or how the engine and VM have slightly incorrect expectations of each other.
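One lightweight way to capture that call order is to wrap each VM entry point so that it appends to a sequence log before running. The sketch below shows the idea in Python; `traced`, `CALL_LOG`, `ToyVM`, and its method names are all hypothetical stand-ins and are not taken from the actual VM or engine code.

```python
import functools
from typing import Any, Callable, List, Tuple

# Ordered record of (sequence_number, method_name, positional_args).
CALL_LOG: List[Tuple[int, str, tuple]] = []


def traced(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record each call, in order, before delegating to the wrapped method."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        CALL_LOG.append((len(CALL_LOG), func.__name__, args))
        return func(self, *args, **kwargs)

    return wrapper


class ToyVM:
    """Stand-in for the VM; the entry points here are illustrative only."""

    @traced
    def build_block(self, epoch: int) -> dict:
        return {"epoch": epoch}

    @traced
    def accept(self, epoch: int) -> bool:
        return True
```

Dumping such a log from a node that hits the error would show the sequence of calls and epochs leading up to the failure, which is the trace requested above.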