Fork in sidechain block production #1231

Open
Kailai-Wang opened this issue Mar 27, 2023 · 3 comments
Comments

@Kailai-Wang
Contributor

See litentry/litentry-parachain#1524 for the complete logs.

It doesn't happen often (in fact, this is the first time we've seen it), but the consequence is severe: the two workers diverge and never run in sync again.

I've investigated it (see litentry/litentry-parachain#1524 (comment)) and the root cause seems to be:

  • [2023-03-24T11:12:30Z] worker0 claimed aura slot 279942725 and produced sidechain bn 13283 based on parentchain bn 10024
  • [2023-03-24T11:12:36Z] in the last second of the aura slot duration (6s in our case), it finished composing the block and broadcast it to worker1
  • [2023-03-24T11:12:36Z] the broadcast took time, and worker1 had already started to claim slot 279942726 and build its own version of sidechain bn 13283, but based on parentchain bn 10025 (the latest parentchain header worker1 had synced)
  • [2023-03-24T11:12:44Z] worker1 failed to produce its block because it took too long
  • [2023-03-24T11:12:44Z] worker1 received the sidechain block from worker0 and attempted to import it. However, it couldn't verify the block because it couldn't find the matching parentchain block header in its parentchain import queue
  • [2023-03-24T11:12:44Z] the block was discarded by worker1, and from this point on the two workers could no longer sync with each other

My questions:

  1. There are two points at which blocks in the parentchain import queue get popped:
    • when a worker composes a sidechain block
    • when a worker imports a sidechain block from a peer
      The queue can't be popped twice (it reminds me of a double-free), otherwise the desired parentchain header can no longer be peeked. It seems we can easily run into this situation when a worker fails to produce a block while the queue has already been popped: the worker then can't import a block that was composed on top of an older parentchain bn anymore. Do you already have an idea for a solution?
  2. What if worker1 had succeeded in producing block bn 13283 in its slot? Then we would have two valid sidechain blocks backed by different parentchain bns. Worker0 would probably reject worker1's block as already imported, but what about worker1? Is the block import order guaranteed (i.e. that worker1 first imports the block from worker0, then its own)? And what happens in a very bad network situation where worker1 doesn't receive worker0's block in time?
  3. When broadcasting blocks to peers, I see that a block is broadcast to the same endpoint multiple times:
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:3001"
[2023-03-24T11:13:11Z DEBUG enclave_runtime::top_pool_execution] Sending sidechain block(s) confirmation extrinsic.. 
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"
[2023-03-24T11:13:11Z DEBUG integritee_service::worker] Broadcasting block to peer with address: "ws://127.0.0.1:2001"

Why is that? Should we use a set instead of a vector for the peers/urls collections?
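The deduplication suggested in question 3 could look roughly like this. A minimal sketch: the function name and the peer-list shape are illustrative, not the actual integritee-service API.

```rust
use std::collections::HashSet;

// Hypothetical helper: collecting the peer URLs into a HashSet drops
// duplicate endpoints before broadcasting.
fn unique_peers(peer_urls: &[String]) -> HashSet<String> {
    peer_urls.iter().cloned().collect()
}

fn main() {
    let peers = vec![
        "ws://127.0.0.1:2001".to_string(),
        "ws://127.0.0.1:3001".to_string(),
        "ws://127.0.0.1:2001".to_string(), // duplicate entry, as in the logs
    ];
    // Each endpoint now receives the block exactly once.
    for url in unique_peers(&peers) {
        println!("Broadcasting block to peer with address: {url}");
    }
}
```

A `HashSet` loses insertion order; if broadcast order matters, a dedup pass over the `Vec` that keeps the first occurrence would be the alternative.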

@clangenb
Contributor

clangenb commented Mar 28, 2023

So the phenomenon you observed is basically two corner cases combined:

Due to network latency, worker1 started to produce block n in slot x+1 before it received worker0's block n from slot x. So if worker1 had succeeded in producing a block, it would have been a classical fork caused by network latency. Fork mitigation is the last piece of security debt we need to handle; see #685, #687, #690. However, since worker1 failed to produce a block, the fork could have gone unnoticed if worker1 hadn't tried to import the next parentchain block, which uncovered a bug: #1251

So to answer the questions:

  1. The solution to the double pop of the queue: there absolutely mustn't be a state change when a worker fails to produce a block. Hence, the parentchain block must either be put back into the queue or never popped in the first place. I haven't looked at the relevant lines of code yet, so I can't tell which fix is most suitable. Disregarding the code, though, the idiomatic approach would be to leave the queue unchanged if we fail to produce the block.
  2. This would be the classical fork that still needs to be handled.
  3. I am curious about these logs; according to my understanding, we should only see each address once.
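The "never popped on failure" variant of point 1 could be sketched as follows. All type and function names here are illustrative, not the actual integritee-worker code: the queue is only drained after composing succeeds, so a failed slot leaves the parentchain headers in place for a later import.

```rust
use std::collections::VecDeque;

// Illustrative stand-in for the parentchain import queue;
// u32 block numbers stand in for full parentchain blocks.
struct ParentchainImportQueue {
    queue: VecDeque<u32>,
}

impl ParentchainImportQueue {
    /// Inspect the queued blocks without mutating the queue.
    fn peek_all(&self) -> Vec<u32> {
        self.queue.iter().copied().collect()
    }

    /// Remove the first `n` blocks; called only after production succeeded.
    fn commit_pop(&mut self, n: usize) {
        let n = n.min(self.queue.len());
        self.queue.drain(..n);
    }

    fn len(&self) -> usize {
        self.queue.len()
    }
}

/// Produce a sidechain block; the queue is mutated only on success.
fn produce_block(
    queue: &mut ParentchainImportQueue,
    compose: impl Fn(&[u32]) -> Result<(), ()>,
) -> Result<(), ()> {
    let blocks = queue.peek_all();
    compose(&blocks)?; // early return on failure leaves the queue untouched
    queue.commit_pop(blocks.len());
    Ok(())
}

fn main() {
    let mut q = ParentchainImportQueue {
        queue: VecDeque::from(vec![10024, 10025]),
    };
    // Composing fails: both parentchain blocks remain queued.
    assert!(produce_block(&mut q, |_| Err(())).is_err());
    assert_eq!(q.len(), 2);
}
```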

@Kailai-Wang
Contributor Author

Thanks @clangenb

  1. Yes, I agree - I can try to modify the code. The slot proposer does need the parentchain block to compose a sidechain block on top of it, so I see two ways of keeping the queue state unchanged if block production fails:
    • pop -> compose the block -> push the popped blocks back to the front if composing fails (if the queue is a Deque)
    • note which blocks should be popped and only pop them once composing succeeds => I'd have to check whether that would affect any peeked parentchain header
      Do you have a preference?
  2. I see where you are coming from, thanks
  3. It might come from the litentry-specific part; I'll have to check our code. But in general I believe the peer updating mechanism could be improved: Improve the peer updating mechanism litentry/litentry-parachain#1546
    Do you see the same?
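The first option above (pop eagerly, push the blocks back to the front on failure) could be sketched like this, again with illustrative names rather than the actual worker code:

```rust
use std::collections::VecDeque;

/// Pop all queued parentchain blocks, and restore them to the front of
/// the deque if composing fails, so the queue is unchanged on failure.
/// (u32 block numbers stand in for full parentchain blocks.)
fn produce_block(
    queue: &mut VecDeque<u32>,
    compose: impl Fn(&[u32]) -> Result<(), ()>,
) -> Result<(), ()> {
    let popped: Vec<u32> = queue.drain(..).collect();
    if compose(&popped).is_err() {
        // Re-insert in reverse so the original front-to-back order is restored.
        for block in popped.into_iter().rev() {
            queue.push_front(block);
        }
        return Err(());
    }
    Ok(())
}

fn main() {
    let mut q: VecDeque<u32> = VecDeque::from(vec![1, 2, 3]);
    // Failed production restores the queue in its original order.
    assert!(produce_block(&mut q, |_| Err(())).is_err());
    assert_eq!(q.iter().copied().collect::<Vec<_>>(), vec![1, 2, 3]);
}
```

One caveat with this variant: between the drain and the push-back, the queue is empty, so it is only safe if nothing else can observe the queue concurrently (e.g. it is held under a lock for the whole operation).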

@clangenb
Contributor

Sorry for the delayed response here.

  1. I am currently looking at the codebase myself to see what can be done. Then I'll be able to state my preference, or whether I have a solution in mind.
  2. I absolutely agree.

On another note, if I analyse the logs correctly, worker1 failed to produce the block because importing the parentchain block took too long. That must not happen: #1275.
