Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fork in sidechain block production #1524

Open
2 tasks
Kailai-Wang opened this issue Mar 25, 2023 · 5 comments
Open
2 tasks

Fork in sidechain block production #1524

Kailai-Wang opened this issue Mar 25, 2023 · 5 comments
Assignees
Labels
D2-bug Something isn't working I3-high should be completed within 5 working days

Comments

@Kailai-Wang
Copy link
Collaborator

Kailai-Wang commented Mar 25, 2023

Context

It happened once on the staging, two workers diverged in the end.

Many Consensus(BlockAncestryMismatch errors in worker0 logs, meaning that the imported blocks (most likely produced by worker1) don't follow the already known last_block locally. As a result, the blocks will be discarded.

So it seems worker1 is producing blocks at a higher speed than worker0.

Tasks:

  • investigate the reason of divergence
  • have a solution for it

Logs (only workers):
log.tar.gz


✔️ Please set appropriate labels and assignees if applicable.

@Kailai-Wang Kailai-Wang added D2-bug Something isn't working I3-high should be completed within 5 working days labels Mar 25, 2023
@Kailai-Wang Kailai-Wang self-assigned this Mar 25, 2023
@Kailai-Wang
Copy link
Collaborator Author

Another observation is sometimes a sidechain block is broadcasted to the same peer multiple times:

[2023-03-23T20:27:01Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-23T20:27:01Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:3001"
[2023-03-23T20:27:01Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"

As a result, only the first block import was successful, the subsequent import was rejected with Block already imported error.
How did it happen? Why do the peers vector have duplicated entries?

@Kailai-Wang
Copy link
Collaborator Author

The problem started with block #13283, before that both workers had the same state.

worker0 successfully produced #13283based on parentchain block #10024

Syncing Parentchain block number 10024 at Sidechain block number  13283 
[2023-03-24T11:12:35Z DEBUG enclave_runtime::top_pool_execution] Aura executed successfully
[2023-03-24T11:12:35Z INFO  enclave_runtime::top_pool_execution] Remaining time in slot (id: Slot(279942725), stage After AURA): 19 ms, 0.31666666666666665% of slot time
[2023-03-24T11:12:35Z DEBUG enclave_runtime::top_pool_execution] Proposing 1 sidechain block(s) (broadcasting to peers)
[2023-03-24T11:12:35Z INFO  integritee_service::ocall_bridge::sidechain_ocall] Enclave produced sidechain blocks: [13283]
[2023-03-24T11:12:36Z INFO  its_consensus_common::peer_block_sync] Successfully imported broadcast sidechain block (number: 13283), based on parentchain block 10024

worker1 tried to produce #13283 but failed with "production takes too long", which is OK as it didn't import the blocks yet.

[2023-03-24T11:12:44Z WARN  aura] ⌛️ Discarding proposal for slot 279942726, block number 13283; block production took too long
[2023-03-24T11:12:44Z INFO  aura] Did not produce a block for slot 279942726 in shard 0x2558a6e57f2f14fd62b2fc0169b83de6f5aaa20b5738baa75ee6d45c4ac7b838
[2023-03-24T11:12:44Z DEBUG enclave_runtime::top_pool_execution] Aura executed successfully
[2023-03-24T11:12:44Z INFO  enclave_runtime::top_pool_execution] No time remaining in slot (id: Slot(279942726), stage: After AURA)

before that it already synced to parentchain block #10025

[2023-03-24T11:12:43Z DEBUG integritee_service::ocall_bridge::bridge_api] Requesting WorkerOnChain OCall API instance
[2023-03-24T11:12:43Z DEBUG integritee_service::ocall_bridge::worker_on_chain_ocall] Entering ocall_worker_request
[2023-03-24T11:12:43Z DEBUG itc_parentchain_indirect_calls_executor] Scanning block 10025 for relevant xt
[2023-03-24T11:12:43Z INFO  itc_parentchain_block_importer::block_importer] Successfully imported parentchain block (number: 10025, hash: 0x9870…c30a)

Then it tried to import the block #13283 that was broadcasted by worker0 but failed:

[2023-03-24T11:12:44Z DEBUG its_consensus_common::block_import] Attempting to import sidechain block (number: 13283, hash: 0xd208acd8b9a0cfa506bdc9d4b4e1f71823e5d9b6721d5687bae84e043dfa12d3, parentchain hash: 0x5e9c0eed130d83ed2551a314c6ad11602d988898e32e1d9663dcffb449e818f0)
[2023-03-24T11:12:44Z WARN  its_consensus_common::block_import] Could not peek parentchain block, returning latest parentchain block (Other("Failed to find parentchain header in import queue (hash: 0x5e9c…18f0) that is associated with the current sidechain block that is to be imported (number: 13283, hash: 0xd208…12d3)"))
[2023-03-24T11:12:44Z DEBUG integritee_service::ocall_bridge::bridge_api] Requesting WorkerOnChain OCall API instance
[2023-03-24T11:12:44Z DEBUG integritee_service::ocall_bridge::worker_on_chain_ocall] Entering ocall_worker_request
[2023-03-24T11:12:44Z DEBUG integritee_service::ocall_bridge::bridge_api] Requesting WorkerOnChain OCall API instance
[2023-03-24T11:12:44Z DEBUG integritee_service::ocall_bridge::worker_on_chain_ocall] Entering ocall_worker_request
[2023-03-24T11:12:44Z ERROR its_block_verification] Author verification for block (number: 13283) failed, block will be discarded
[2023-03-24T11:12:44Z ERROR enclave_runtime::error] Returning error Consensus(VerificationError(BadParentchainBlock(0x987032cab64faeb945a5a5f7b4a6ff5b8f5a2a690bbb43c781c615709633c30a, "Invalid parentchain head"))) as sgx unexpected.
[2023-03-24T11:12:44Z ERROR integritee_service::sidechain_setup] Sgx(SGX_ERROR_UNEXPECTED)

block #13283 was discarded by worker1, from this point on two workers had started to produce their own blocks, they continued broadcasting the generated blocks, which caused chaos:

  • the worker1 -> worker0 block was always rejected, because worker0 produced the blocks at a slower pace, so the block coming from worker1 was always too early for worker0, and worker0 complained about discontinuity: Invalid block number, x does not succeed x
  • meanwhile, worker0 was producing blocks and broadcasting them to worker1 too, these blocks were always too late for worker1 and got ignored.

As a result, they were not in sync any more and started producing their own versions of blocks.

e.g. for the problematic #13283, worker1 produced its own version of it after 30s and happily imported it:

[2023-03-24T11:13:11Z INFO  integritee_service::ocall_bridge::sidechain_ocall] Enclave produced sidechain blocks: [13283]
[2023-03-24T11:13:11Z DEBUG integritee_service::ocall_bridge::sidechain_ocall] Updating peers..
[2023-03-24T11:13:11Z INFO  integritee_service::ocall_bridge::sidechain_ocall] Successfully updated peers
[2023-03-24T11:13:11Z DEBUG integritee_service::ocall_bridge::sidechain_ocall] Broadcasting sidechain blocks ...
[2023-03-24T11:13:11Z INFO  integritee_service::ocall_bridge::sidechain_ocall] Successfully broadcast blocks
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:3001"
[2023-03-24T11:13:11Z DEBUG enclave_runtime::top_pool_execution] Sending sidechain block(s) confirmation extrinsic.. 
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-24T11:13:11Z DEBUG integritee_service::ocall_bridge::bridge_api] Requesting WorkerOnChain OCall API instance
[2023-03-24T11:13:11Z INFO  enclave_runtime::top_pool_execution] Remaining time in slot (id: Slot(279942731), stage After broadcasting and sending extrinsic): 101 ms, 1.6833333333333331% of slot time
[2023-03-24T11:13:11Z DEBUG enclave_runtime::top_pool_execution] End sidechain block production cycle
[2023-03-24T11:13:12Z DEBUG its_consensus_common::block_import] Attempting to import sidechain block (number: 13283, hash: 0xb13ea2f261a4355dbf5a39c2de0cfa68dc192bf15e8d5f1fabc12d30c7326d3f, parentchain hash: 0x05ee15dc4114b269e5445fbceff779e329568a977d01889d49aac6235b6c3c7f)
[2023-03-24T11:13:12Z DEBUG integritee_service::ocall_bridge::bridge_api] Requesting WorkerOnChain OCall API instance
[2023-03-24T11:13:12Z DEBUG integritee_service::ocall_bridge::worker_on_chain_ocall] Entering ocall_worker_request
[2023-03-24T11:13:12Z DEBUG integritee_service::ocall_bridge::bridge_api] Requesting WorkerOnChain OCall API instance
[2023-03-24T11:13:12Z DEBUG integritee_service::ocall_bridge::worker_on_chain_ocall] Entering ocall_worker_request
[2023-03-24T11:13:12Z DEBUG integritee_service::ocall_bridge::bridge_api] Requesting WorkerOnChain OCall API instance
[2023-03-24T11:13:12Z INFO  its_consensus_common::block_import] Applying state diff for block 13283 of size 219 bytes
[2023-03-24T11:13:12Z INFO  its_consensus_common::block_import] Applying state update from block 13283 took 0 ms
[2023-03-24T11:13:12Z INFO  its_consensus_common::block_import] Importing block 13283 took 429 ms
[2023-03-24T11:13:12Z INFO  its_consensus_common::peer_block_sync] Successfully imported broadcast sidechain block (number: 13283), based on parentchain block 10027

For the following blocks from worker0, worker1 found it was already imported and thus ignored it, e.g. for #13284:

[2023-03-24T11:17:36Z WARN  its_consensus_common::peer_block_sync] Sidechain block from queue (number: 13284) was already imported (current block number: 13286). Block will be ignored.

@github-actions
Copy link
Contributor

❗ This issue is stale because it has been open for 60 days with no activity.
Remove Stale label or update it, otherwise this issue will be closed in 7 days.
@litentry/parachain

@github-actions github-actions bot added the Stale label May 25, 2023
@BillyWooo BillyWooo removed the Stale label Jul 8, 2023
@grumpygreenguy
Copy link
Contributor

Note: when tackling this; let's try reenabling the skipped tests in resuming_worker.test.ts

@kziemianek
Copy link
Member

Redundant broadcasting should be resolved in #1546

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
D2-bug Something isn't working I3-high should be completed within 5 working days
Projects
None yet
Development

No branches or pull requests

4 participants