bug: gap in epoch history with dynamic validator set #104

Open
l-monninger opened this issue Feb 8, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@l-monninger
Collaborator

l-monninger commented Feb 8, 2024

Summary

The current implementation does not appear to handle Aptos epochs gracefully during Avalanche bootstrap syncing. The error below, which typically appears after changes to the validator set or during periods of asynchrony, indicates as much.

[02-08|09:41:50.142] FATAL <rLgK4miC2cHSdc84z8H9iNBZTkqmyRD7xSPJnkkpHc34yhRLY Chain> handler/handler.go:339 shutting down chain {"reason": "received an unexpected error", "error": "rpc error: code = Unknown desc = failed to build block: Internal error: \"Gap in epoch history. Trying to put in LedgerInfo in epoch: 1, current epoch: 4\" while processing sync message: NodeID-EzN4q9mU6TVFkND6oghbdLAUqDacE9Czp Op: chits Message: chain_id:\"p\\x07\\xe5\\xfeﲼ\\x87v\\xb8\\x18h\\xae\\xd8E\\x80/\\x19^\\x98\\x83\\x7fI\\xaf\\xdb\\xe99\\xac\\xe5\\xd4k\\x8a\"  request_id:8941  preferred_id:\"\\x0b\\x9adj\\xa0R\\xb1\\x85\\x18\\x04\\x8d~\\xb5\\xed\\x8dG\\xd8\\xccT\\x07h\\xaf\\xaaO8<\\xd1\\xe2\\xc0\\x18\\x15\\xde\"  accepted_id:\"\\x0b\\x9adj\\xa0R\\xb1\\x85\\x18\\x04\\x8d~\\xb5\\xed\\x8dG\\xd8\\xccT\\x07h\\xaf\\xaaO8<\\xd1\\xe2\\xc0\\x18\\x15\\xde\"  preferred_id_at_height:\"\\xc1\\xddAX\\x92\\x06\\xe1ĄE\\xbe\\x9ar\\x15l\\xf6YEr\\x9f\\xf2ԟ\\x1bL\\x87߯\\xb1\\x14\\xea\\xed\""}

Steps to Reproduce

This is a somewhat challenging error to reproduce. It emerges more reliably over longer-running periods and when adding and removing several validators. The simplest procedure I've determined so far is:

  1. Start an M1 subnet on fuji with one validator.
  2. Submit several transactions to this subnet, for example by calling movement aptos init repeatedly.
  3. Add a second validator.
  4. If you do not see the error above while the validator is bootstrapping, remove the second validator, submit more transactions, and try again.
  5. If you still do not see the error above, add a third validator and attempt once more.
  6. Repeat.

You will need to inspect the logs for your chain to view the error. These are stored at ~/.avalanchego/logs/<chain-id>.log
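
A minimal sketch for pulling the epoch numbers out of that log, assuming the message keeps the exact wording quoted above. The names `find_epoch_gaps` and `EpochGap` are made up for illustration and are not part of any existing tooling.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern copied from the error text quoted in this issue; if the message
# wording changes upstream, this pattern has to change with it.
GAP_RE = re.compile(
    r"Gap in epoch history\. Trying to put in LedgerInfo in epoch: (\d+), "
    r"current epoch: (\d+)"
)


class EpochGap(NamedTuple):
    """Epoch the node tried to write versus the epoch the ledger was in."""
    attempted_epoch: int
    current_epoch: int


def find_epoch_gaps(lines: Iterable[str]) -> Iterator[EpochGap]:
    """Yield an EpochGap for the first match on each matching log line."""
    for line in lines:
        match = GAP_RE.search(line)
        if match:
            yield EpochGap(int(match.group(1)), int(match.group(2)))
```

To use it, open the chain log at the path above and pass the file object to `find_epoch_gaps`. Comparing `attempted_epoch` with `current_epoch` across runs should show whether the node is always being handed blocks from an older epoch, as in the log above (1 versus 4).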

Possible Solutions

It seems likely to me that there are three possibilities:

  1. The block execution ordering is currently faulty, and we should order incoming blocks in a queue by epoch before they are sent to AptosBlockExecutor, popping a block off only when the ledger epoch matches.
  2. This occurs owing to periods of asynchrony, in which case the blocks are simply missing or not being disseminated. This would largely be a property of the network as a whole and may not be something we can engineer around without significantly more redesign.
  3. This occurs owing to a bad reorg strategy, in which case a fuller redesign is necessary.
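
To make the first possibility concrete, here is a rough sketch of an epoch-ordered buffer. Everything in it is hypothetical: `PendingBlock`, `EpochOrderedQueue`, and the `execute` callback are stand-ins, not the real AptosBlockExecutor interface. It assumes the executor can report the ledger epoch after each block, and it sets aside blocks from past epochs rather than executing them.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class PendingBlock:
    """Hypothetical block record, ordered by (epoch, height)."""
    epoch: int
    height: int
    payload: bytes = field(compare=False, default=b"")


class EpochOrderedQueue:
    """Buffers blocks and releases one only when its epoch is the ledger's."""

    def __init__(self, execute: Callable[[PendingBlock], int], ledger_epoch: int):
        # `execute` stands in for the executor: it runs the block and
        # returns the ledger epoch afterwards (assumed behaviour).
        self._execute = execute
        self._ledger_epoch = ledger_epoch
        self._heap: List[PendingBlock] = []
        self.stale: List[PendingBlock] = []  # blocks from epochs already passed

    def push(self, block: PendingBlock) -> None:
        heapq.heappush(self._heap, block)
        self._drain()

    def _drain(self) -> None:
        while self._heap:
            head = self._heap[0]
            if head.epoch > self._ledger_epoch:
                break  # future epoch: keep waiting
            block = heapq.heappop(self._heap)
            if block.epoch < self._ledger_epoch:
                self.stale.append(block)  # would trigger the gap error
            else:
                self._ledger_epoch = self._execute(block)

    def pending(self) -> int:
        return len(self._heap)
```

Under this sketch, a block for epoch 1 arriving while the ledger is at epoch 4 ends up in `stale` instead of reaching the executor. Whether dropping, re-requesting, or reorging is the right response to a stale block is exactly the open question in possibilities 2 and 3.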

Another alternative would be to remove epochs altogether. This is non-trivial in the current implementation, as simply setting the same epoch for every block causes an invalid epoch history error similar to the one above.

@l-monninger
Collaborator Author

l-monninger commented Feb 13, 2024

Generally, it seems our reorg strategy is not very effective. If we're expecting the Aptos db, mempool, and execution layer to handle everything that comes out of the consensus engine, that appears to be problematic at the very least when dealing with epochs.

I would start by attempting to remove epochs altogether and seeing if this problem persists. It doesn't appear you can do this simply, because of the expectations of the Aptos BlockExecutor. You can't, for example, just set an epoch parameter to a large number. You also can't just reuse the same epoch number over and over again, because of the transaction commitments.

@aaronbuchwald

For reference, this is the process Avalanche nodes follow when performing bootstrapping: https://github.com/ava-labs/avalanchego/blob/master/snow/engine/snowman/bootstrap/bootstrapper.go#L49

It may be helpful to specify the exact order of the calls the engine makes into your VM, so that we can trace the execution path that leads to the error. This would also help me put in perspective what may be going wrong within the VM, or where the engine and the VM have slightly incorrect expectations of each other.
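
One way to capture that order is to wrap the VM in a thin recording proxy. The sketch below is only an illustration of the idea: `TracingVM` and `format_trace` are hypothetical names, and the wrapped object can be any object with callable methods. It is not tied to the actual VM interface.

```python
from typing import Any, List, Tuple


class TracingVM:
    """Proxy that records every method call made on the wrapped VM, in order."""

    def __init__(self, vm: Any):
        self._vm = vm
        # Each entry is (sequence number, method name, positional args).
        self.trace: List[Tuple[int, str, tuple]] = []

    def __getattr__(self, name: str) -> Any:
        # Only reached for attributes not defined on the proxy itself.
        attr = getattr(self._vm, name)
        if not callable(attr):
            return attr

        def traced(*args: Any, **kwargs: Any) -> Any:
            self.trace.append((len(self.trace), name, args))
            return attr(*args, **kwargs)

        return traced


def format_trace(trace: List[Tuple[int, str, tuple]]) -> str:
    """Render the recorded calls one per line, e.g. for pasting into an issue."""
    return "\n".join(f"{seq:04d} {name}{args!r}" for seq, name, args in trace)
```

Dumping `format_trace(vm.trace)` just before the fatal error would give the sequence of calls, with the block IDs involved, that preceded the epoch gap.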
