bug: gap in epoch history with dynamic validator set #104

l-monninger · 2024-02-08T16:18:41Z

Summary

The current implementation does not appear to handle Aptos epochs gracefully during Avalanche bootstrap syncing. The error below, which typically appears when the validator set changes or during periods of asynchrony, indicates as much.

shutting down chain {"reason": "received an unexpected error", "error": "rpc error: code = Unknown desc = failed to build block: Internal error: \"Gap in epoch history. Trying to put in LedgerInfo in epoch: 1, current epoch: 4

[02-08|09:41:50.142] FATAL <rLgK4miC2cHSdc84z8H9iNBZTkqmyRD7xSPJnkkpHc34yhRLY Chain> handler/handler.go:339 shutting down chain {"reason": "received an unexpected error", "error": "rpc error: code = Unknown desc = failed to build block: Internal error: \"Gap in epoch history. Trying to put in LedgerInfo in epoch: 1, current epoch: 4\" while processing sync message: NodeID-EzN4q9mU6TVFkND6oghbdLAUqDacE9Czp Op: chits Message: chain_id:\"p\\x07\\xe5\\xfeﲼ\\x87v\\xb8\\x18h\\xae\\xd8E\\x80/\\x19^\\x98\\x83\\x7fI\\xaf\\xdb\\xe99\\xac\\xe5\\xd4k\\x8a\"  request_id:8941  preferred_id:\"\\x0b\\x9adj\\xa0R\\xb1\\x85\\x18\\x04\\x8d~\\xb5\\xed\\x8dG\\xd8\\xccT\\x07h\\xaf\\xaaO8<\\xd1\\xe2\\xc0\\x18\\x15\\xde\"  accepted_id:\"\\x0b\\x9adj\\xa0R\\xb1\\x85\\x18\\x04\\x8d~\\xb5\\xed\\x8dG\\xd8\\xccT\\x07h\\xaf\\xaaO8<\\xd1\\xe2\\xc0\\x18\\x15\\xde\"  preferred_id_at_height:\"\\xc1\\xddAX\\x92\\x06\\xe1ĄE\\xbe\\x9ar\\x15l\\xf6YEr\\x9f\\xf2ԟ\\x1bL\\x87߯\\xb1\\x14\\xea\\xed\""}

Steps to Reproduce

This error is somewhat challenging to reproduce. It emerges more reliably over longer running periods and when adding and removing several validators. The simplest procedure I've determined so far is:

Start an M1 subnet on Fuji with one validator.
Submit several transactions to this subnet, for example by calling movement aptos init repeatedly.
Add a second validator.
If you do not see the error above while the validator is bootstrapping, remove the second validator, submit more transactions, and try again.
If you still do not see the error above, add a third validator and attempt once more.
Repeat.

You will need to inspect the logs for your chain to view the error. These are stored at ~/.avalanchego/logs/<chain-id>.log
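
Scanning the chain log by hand is tedious, so here is a minimal sketch of a helper that pulls the attempted and current epochs out of each matching line. The function name find_epoch_gaps is hypothetical, and the pattern assumes the error text keeps the exact wording shown in the log above.

```python
import re
from typing import Iterable, Iterator, Tuple

# Pattern taken from the error text quoted in this issue; it is an assumption
# that other occurrences use the same wording.
GAP_RE = re.compile(
    r"Gap in epoch history\. Trying to put in LedgerInfo in epoch: (\d+), "
    r"current epoch: (\d+)"
)


def find_epoch_gaps(lines: Iterable[str]) -> Iterator[Tuple[int, int, int]]:
    """Yield (line_number, attempted_epoch, current_epoch) per matching line."""
    for lineno, line in enumerate(lines, start=1):
        match = GAP_RE.search(line)
        if match:
            yield lineno, int(match.group(1)), int(match.group(2))


# Usage (path placeholder as given above, fill in your own chain id):
#   with open(path_to_chain_log, errors="replace") as fh:
#       for lineno, attempted, current in find_epoch_gaps(fh):
#           print(lineno, attempted, current)
```

Passing errors="replace" when opening the file is a precaution, since the logged chits message contains raw escaped bytes.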

Possible Solutions

I see three plausible explanations:

The block execution ordering is currently faulty. We should order incoming blocks in a queue by epoch before they are sent to AptosBlockExecutor and only pop a block off when its epoch matches the ledger epoch.
This occurs owing to periods of asynchrony, in which case the blocks are simply missing or not being disseminated. This would largely be a property of the network as a whole and may not be something we can engineer around without a significant redesign.
This occurs owing to a bad reorg strategy, in which case a fuller redesign is necessary.
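
To make the first option concrete, here is a rough sketch of an epoch-ordered buffer. It is not the current implementation: the class EpochOrderedQueue and its methods are hypothetical, blocks are opaque values, and the caller is assumed to know the current ledger epoch. Returning blocks from past epochs separately as "stale" is also an assumption; what to do with them is an open design question.

```python
import heapq
from itertools import count
from typing import Any, List, Tuple


class EpochOrderedQueue:
    """Buffers blocks and releases only those matching the ledger epoch."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._seq = count()  # tie-breaker: preserves arrival order per epoch

    def push(self, epoch: int, block: Any) -> None:
        # The sequence number also prevents heapq from comparing payloads.
        heapq.heappush(self._heap, (epoch, next(self._seq), block))

    def pop_ready(self, ledger_epoch: int) -> Tuple[List[Any], List[Any]]:
        """Return (ready, stale).

        ready: blocks whose epoch equals ledger_epoch, in arrival order.
        stale: blocks from earlier epochs (handling is left to the caller).
        Blocks from future epochs stay buffered.
        """
        ready: List[Any] = []
        stale: List[Any] = []
        while self._heap and self._heap[0][0] <= ledger_epoch:
            epoch, _, block = heapq.heappop(self._heap)
            (ready if epoch == ledger_epoch else stale).append(block)
        return ready, stale

    def __len__(self) -> int:
        return len(self._heap)
```

With this shape, a block for epoch 4 that arrives while the ledger is at epoch 2 simply waits in the buffer instead of reaching the executor early.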

Another alternative would be to remove epochs altogether. This is non-trivial in the current implementation, as simply setting the same epoch for every block causes an invalid epoch history error similar to the one above.

l-monninger · 2024-02-13T18:05:43Z

Generally, it seems our reorg strategy is not very effective. If we're expecting the Aptos db, mempool, and execution layer to handle everything that comes out of the consensus engine, that appears to be problematic, at the least when dealing with epochs.

I would start by attempting to remove epochs altogether and seeing if this problem persists. It doesn't appear you can do this simply, because of the expectations of the Aptos BlockExecutor. You can't, for example, just set an epoch parameter to a large number. You also can't just reuse the same epoch number over and over again, because of the transaction commitments.

aaronbuchwald · 2024-02-16T16:04:23Z

For reference, this is the process Avalanche nodes follow when performing bootstrapping: https://github.com/ava-labs/avalanchego/blob/master/snow/engine/snowman/bootstrap/bootstrapper.go#L49

It may be helpful to specify the exact order of the calls the engine makes to your VM, so that we can trace the execution path that leads to the error. This would also help me put in perspective what may be going wrong within the VM, or where the engine and VM have slightly incorrect expectations of each other.
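
One way to capture that call order is to wrap the VM in a thin recording proxy. The sketch below is illustrative only: CallTracer and ToyVM are hypothetical, and the method names build_block, verify, and accept are toy stand-ins, not the actual avalanchego or VM interface.

```python
from typing import Any, List, Tuple


class CallTracer:
    """Wraps an object and records every method call made on it, in order."""

    def __init__(self, target: Any) -> None:
        self._target = target
        self.events: List[Tuple[int, str, tuple]] = []

    def __getattr__(self, name: str) -> Any:
        # Only reached for attributes not found on the tracer itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def traced(*args: Any, **kwargs: Any) -> Any:
            # Record before delegating so failing calls are still captured.
            self.events.append((len(self.events), name, args))
            return attr(*args, **kwargs)

        return traced


class ToyVM:
    """Stand-in VM with made-up methods, used only to exercise the tracer."""

    def build_block(self) -> str:
        return "blk-1"

    def verify(self, block_id: str) -> bool:
        return True

    def accept(self, block_id: str) -> None:
        return None
```

Dumping the events list right before the fatal error would give the sequence of engine-to-VM calls that preceded it.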

l-monninger changed the title ~~bug:~~ bug: gap in epoch history with dynamic validator set Feb 8, 2024

l-monninger mentioned this issue Feb 8, 2024

Experiencing unhealthy snowman consensus with geographically distributed nodes around clusters of addValidator and removeValidator events ava-labs/avalanchego#2713

Closed

l-monninger self-assigned this Feb 10, 2024

l-monninger added the bug Something isn't working label Feb 10, 2024
