Fork in sidechain block production #1231

Open
Kailai-Wang opened this issue Mar 27, 2023 · 3 comments
Comments

@Kailai-Wang
Contributor

See litentry/litentry-parachain#1524 for the complete logs.

It doesn't happen often (in fact, this is the first time we've seen it), but the consequence is severe: the two workers diverge and never run in sync again.

I've investigated it (see litentry/litentry-parachain#1524 (comment)) and the root cause seems to be:

  • [2023-03-24T11:12:30Z] worker0 claimed aura slot 279942725 and produced sidechain bn 13283 based on parentchain bn 10024
  • [2023-03-24T11:12:36Z] in the last second of the aura slot duration (6s in our case), it finished composing the block and broadcast it to worker1
  • [2023-03-24T11:12:36Z] the broadcast took time, and worker1 had already started to claim slot 279942726 and build its own version of sidechain bn 13283, but based on parentchain bn 10025 (the latest parentchain header worker1 had synced)
  • [2023-03-24T11:12:44Z] worker1 failed to produce its block because it took too long
  • [2023-03-24T11:12:44Z] worker1 received the sidechain block from worker0 and attempted to import it. However, it couldn't verify the block because it couldn't find the matching parentchain block header in its parentchain import queue
  • [2023-03-24T11:12:44Z] the block was discarded by worker1, and from this point on the two workers could no longer sync with each other

My questions:

  1. There are two points at which blocks in the parentchain import queue get popped:
    • when a worker composes a sidechain block
    • when a worker imports a sidechain block from a peer
      The queue can't be popped twice (it reminds me of a double-free), otherwise the desired parentchain header can no longer be peeked. It seems we can easily run into this situation when a worker fails to produce a block while the queue has already been popped: the worker then can't import a block that was composed on top of an older parentchain bn anymore. Do you already have an idea for a solution?
  2. What if worker1 had succeeded in producing block bn 13283 in its slot? Then we would have two valid sidechain blocks backed by different parentchain bns. Worker0 would probably reject worker1's block as already imported, but what about worker1? Is the block import order guaranteed (i.e. that worker1 first imports the block from worker0, then its own)? And what happens in a very bad network situation where worker1 doesn't receive worker0's block in time?
  3. When broadcasting blocks to peers, I see that a block is broadcast to the same endpoint multiple times:
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:3001"
[2023-03-24T11:13:11Z DEBUG enclave_runtime::top_pool_execution] Sending sidechain block(s) confirmation extrinsic.. 
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"

Why is that? Should we use a set instead of a vector for the peers/urls collections?
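The deduplication suggested in question 3 could look roughly like this. A minimal sketch: the function name and the peer-list shape are illustrative, not the actual integritee-service API.

```rust
use std::collections::HashSet;

// Hypothetical helper: collecting the peer URLs into a HashSet drops
// duplicate endpoints before broadcasting.
fn unique_peers(peer_urls: &[String]) -> HashSet<String> {
    peer_urls.iter().cloned().collect()
}

fn main() {
    let peers = vec![
        "ws://127.0.0.1:2001".to_string(),
        "ws://127.0.0.1:3001".to_string(),
        "ws://127.0.0.1:2001".to_string(), // duplicate entry, as in the logs
    ];
    // Each endpoint now receives the block exactly once.
    for url in unique_peers(&peers) {
        println!("Broadcasting block to peer with address: {url}");
    }
}
```

A `HashSet` loses insertion order; if broadcast order matters, a dedup pass over the `Vec` that keeps the first occurrence would be the alternative.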

@clangenb
Contributor

clangenb commented Mar 28, 2023

So the phenomenon you observed is basically two corner cases combined:

Due to network latency, worker1 started to produce block n in slot x+1 before it received worker0's block n from slot x. So if worker1 had succeeded in producing a block, it would have been a classical fork caused by network latency. Fork mitigation is the last piece of security debt we need to handle; see #685, #687, #690. However, since worker1 failed to produce a block, the fork could have gone unnoticed if worker1 hadn't tried to import the next parentchain block, which uncovered a bug: #1251

So to answer the questions:

  1. The solution to the double pop of the queue: there absolutely mustn't be a state change when a worker fails to produce a block. Hence, the parentchain block must either be put back into the queue or never popped in the first place. I haven't looked at the relevant lines of code yet, so I can't tell which fix is most suitable. Disregarding the code, though, the idiomatic approach would be to leave the queue unchanged if we fail to produce the block.
  2. This would be the classical fork that still needs to be handled.
  3. I am curious about these logs; according to my understanding, we should only see each address once.
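The "never popped on failure" variant of point 1 could be sketched as follows. All type and function names here are illustrative, not the actual integritee-worker code: the queue is only drained after composing succeeds, so a failed slot leaves the parentchain headers in place for a later import.

```rust
use std::collections::VecDeque;

// Illustrative stand-in for the parentchain import queue;
// u32 block numbers stand in for full parentchain blocks.
struct ParentchainImportQueue {
    queue: VecDeque<u32>,
}

impl ParentchainImportQueue {
    /// Inspect the queued blocks without mutating the queue.
    fn peek_all(&self) -> Vec<u32> {
        self.queue.iter().copied().collect()
    }

    /// Remove the first `n` blocks; called only after production succeeded.
    fn commit_pop(&mut self, n: usize) {
        let n = n.min(self.queue.len());
        self.queue.drain(..n);
    }

    fn len(&self) -> usize {
        self.queue.len()
    }
}

/// Produce a sidechain block; the queue is mutated only on success.
fn produce_block(
    queue: &mut ParentchainImportQueue,
    compose: impl Fn(&[u32]) -> Result<(), ()>,
) -> Result<(), ()> {
    let blocks = queue.peek_all();
    compose(&blocks)?; // early return on failure leaves the queue untouched
    queue.commit_pop(blocks.len());
    Ok(())
}

fn main() {
    let mut q = ParentchainImportQueue {
        queue: VecDeque::from(vec![10024, 10025]),
    };
    // Composing fails: both parentchain blocks remain queued.
    assert!(produce_block(&mut q, |_| Err(())).is_err());
    assert_eq!(q.len(), 2);
}
```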

@Kailai-Wang
Contributor Author

Thanks @clangenb

  1. Yes, I agree - I can try to modify the code. The slot proposer does need the parentchain block to compose a sidechain block on top of it, so I see two ways of keeping the queue state unchanged if block production fails:
    • pop -> compose the block -> push the popped blocks back to the front if composing fails (if the queue is a Deque)
    • note which blocks should be popped and only pop them once composing succeeds => I'd have to check whether that would affect any peeked parentchain header
      Do you have a preference?
  2. I see where you are coming from, thanks
  3. It might come from the litentry-specific part; I'll have to check our code. But in general I believe the peer updating mechanism could be improved: Improve the peer updating mechanism litentry/litentry-parachain#1546
    Do you see the same?
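The first option above (pop eagerly, push the blocks back to the front on failure) could be sketched like this, again with illustrative names rather than the actual worker code:

```rust
use std::collections::VecDeque;

/// Pop all queued parentchain blocks, and restore them to the front of
/// the deque if composing fails, so the queue is unchanged on failure.
/// (u32 block numbers stand in for full parentchain blocks.)
fn produce_block(
    queue: &mut VecDeque<u32>,
    compose: impl Fn(&[u32]) -> Result<(), ()>,
) -> Result<(), ()> {
    let popped: Vec<u32> = queue.drain(..).collect();
    if compose(&popped).is_err() {
        // Re-insert in reverse so the original front-to-back order is restored.
        for block in popped.into_iter().rev() {
            queue.push_front(block);
        }
        return Err(());
    }
    Ok(())
}

fn main() {
    let mut q: VecDeque<u32> = VecDeque::from(vec![1, 2, 3]);
    // Failed production restores the queue in its original order.
    assert!(produce_block(&mut q, |_| Err(())).is_err());
    assert_eq!(q.iter().copied().collect::<Vec<_>>(), vec![1, 2, 3]);
}
```

One caveat with this variant: between the drain and the push-back, the queue is empty, so it is only safe if nothing else can observe the queue concurrently (e.g. it is held under a lock for the whole operation).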

@clangenb
Contributor

Sorry for the delayed response here.

  1. I am currently looking at the codebase myself to see what can be done. Then I'll be able to state my preference, or whether I have a solution in mind.
  2. I absolutely agree.

On another note, if I analyse the logs correctly, worker1 failed to produce the block because importing the parentchain block took too long. That must not happen: #1275.
