Create Savanna unittests modeled after the fast testnet wave tests. #380

Closed
greg7mdp opened this issue Jul 18, 2024 · 1 comment · Fixed by #444, #487, #582, #595 or #597

greg7mdp commented Jul 18, 2024

We will use the savanna_cluster class to implement the following test scenarios, derived from this document.

Test setup

Common prerequisites:

  • 4 simulated nodes (A, B, C, D), each having one finalizer with weight=1.
  • the cluster has a total of 4 finalizers, quorum=3.
  • 4 producers: { PA, PB, PC, PD }
  • node A has just produced the first 2 blocks of PA's first round, and all other nodes have voted strong on every block

savanna_cluster functionality

  • by default, all nodes in the cluster are logically connected to every other node
  • it is possible to simulate network partitions, in which case every node in a partition is connected to every other node in the same partition
  • every block produced by a node is, by default, synchronously pushed to every connected node, which gets the opportunity to vote on the received block; the votes are similarly propagated to all connected nodes. All of this happens within the produce_block(s) call (see the sketch below).
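
For illustration, a minimal test exercising this default behavior might look as follows. This is a hedged sketch: the fixture name savanna_cluster::cluster_t and the members _nodes, lib_block, and produce_blocks are assumptions used for illustration, not necessarily the actual API.

```cpp
#include <boost/test/unit_test.hpp>
#include "savanna_cluster.hpp"   // assumed header providing the cluster fixture

// Hypothetical fixture: four connected nodes A, B, C, D, one finalizer each, quorum = 3.
BOOST_FIXTURE_TEST_CASE(default_full_connectivity, savanna_cluster::cluster_t) {
   auto& A    = _nodes[0];                    // assumed accessor for node A's tester
   auto  lib0 = A.lib_block->block_num();     // assumed accessor for the current lib

   // produce_blocks() pushes each new block to every connected node; the nodes vote
   // synchronously and the votes are propagated back within the same call.
   A.produce_blocks(4);

   // every block gathered a quorum of 3 strong votes, so lib advanced by 4
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0 + 4);
}
```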

Definitions for unit tests

shutdown A: means that close() is called for A's tester, which does control.reset(); chain_transactions.clear();

restart A: means that open() is called for A's tester, which restarts the node using the existing state.

fsi: finalizer safety information (finalizers/safety.dat)

head: the chain head for a node, queried from the controller and retrieved within a test using tester::head()

lib: the last irreversible block id for a node, as reported by the irreversible_block signal, and retrieved within a test using tester::lib_id

state: memory-mapped file holding the chainbase state (the shared_memory.bin file in the state directory)

blocks log: files holding irreversible blocks (files blocks/blocks.log and blocks/blocks.index)

reversible data: files holding reversible blocks data (located in blocks/reversible)

finality violation: defined as the existence of 2 final blocks where neither is an ancestor of the other.

confirm a finality violation: Finality violations can be confirmed by showing that the libs of two nodes are in conflict.

confirm no finality violation: The absence of a finality violation can be established if, on a reconnected network, heads can be propagated without unlinkable blocks (see the sketch below).
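
For illustration, the two confirmations above could be expressed roughly as follows; is_ancestor_of, lib_id, node_t, cluster_t, and propagate_heads are assumed names, not necessarily the actual helpers.

```cpp
// A finality violation exists iff the two libs conflict: they differ and neither
// is an ancestor of the other (is_ancestor_of is an assumed helper that walks
// previous-block links through the fork database / block log).
bool finality_violation(const node_t& x, const node_t& y) {
   auto lx = x.lib_id();                      // assumed accessor for the lib block id
   auto ly = y.lib_id();
   return lx != ly && !is_ancestor_of(lx, ly) && !is_ancestor_of(ly, lx);
}

// Absence of a violation: on a reconnected network, propagating heads succeeds
// without any node reporting an unlinkable block.
void confirm_no_finality_violation(cluster_t& cluster) {
   BOOST_REQUIRE_NO_THROW(cluster.propagate_heads());
}
```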

unit tests: Disaster recovery

Single finalizer goes down

[sd0] Recovery when a node goes down

  • shutdown C
  • A produces 4 more blocks. Verify that lib advances by 4
  • restart C
  • push blocks A -> C
  • verify that C votes again (strong) and that lib continues to advance
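
A hedged sketch of [sd0], using the same illustrative fixture API as the earlier sketch; push_blocks and is_voting_strong are assumed helpers.

```cpp
BOOST_FIXTURE_TEST_CASE(sd0_single_finalizer_down, savanna_cluster::cluster_t) {
   auto& A = _nodes[0];
   auto& C = _nodes[2];

   C.close();                                 // shutdown C: control.reset(), clear transactions
   auto lib0 = A.lib_block->block_num();
   A.produce_blocks(4);
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0 + 4);   // A, B, D still meet quorum = 3

   C.open();                                  // restart C using its existing state
   push_blocks(A, C);                         // assumed helper: sync C up to A's head
   A.produce_blocks(2);
   BOOST_REQUIRE(C.is_voting_strong());       // assumed predicate: C votes strong again
   BOOST_REQUIRE_GT(A.lib_block->block_num(), lib0 + 4);      // lib continues to advance
}
```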

[sd1] Recover a killed node with old finalizer safety info

  • save C's fsi
  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown C
  • A produces 2 blocks, verify lib continues to advance
  • remove C's state, replace C's fsi with previously saved file
  • restart C from previously taken snapshot
  • push blocks A -> C
  • A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance

[sd2] Recover a killed node with deleted finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown C
  • A produces 2 blocks, verify lib continues to advance
  • remove C's state and fsi
  • restart C from previously taken snapshot
  • push blocks A -> C
  • A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance

[sd3] Recover a killed node while retaining up to date finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown C
  • A produces 2 blocks, verify lib continues to advance
  • remove C's state, leave the fsi alone
  • restart C from previously taken snapshot
  • push blocks A -> C
  • A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance
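
The three variants [sd1]–[sd3] share the same scaffolding and differ only in what happens to C's fsi before the restart. A hedged sketch of that shared scaffolding follows; create_snapshot, open_from_snapshot, data_dir, remove_state, and push_blocks are illustrative names.

```cpp
#include <filesystem>
namespace fs = std::filesystem;

enum class fsi_mode { keep, restore_old, remove };   // [sd3], [sd1], [sd2] respectively

void recover_c_from_snapshot(node_t& A, node_t& C, fsi_mode mode) {
   fs::path fsi   = C.data_dir() / "finalizers" / "safety.dat";   // data_dir() is assumed
   fs::path saved = fsi.parent_path() / "safety.dat.saved";
   fs::copy_file(fsi, saved);                 // save C's fsi up front (used by [sd1])

   A.produce_blocks(2);
   auto snapshot = C.create_snapshot();       // assumed snapshot helper
   A.produce_blocks(2);
   C.close();                                 // shutdown C
   A.produce_blocks(2);                       // lib keeps advancing: A, B, D meet quorum

   C.remove_state();                          // delete state/shared_memory.bin
   if (mode == fsi_mode::restore_old)
      fs::copy_file(saved, fsi, fs::copy_options::overwrite_existing);
   else if (mode == fsi_mode::remove)
      fs::remove(fsi);
   // fsi_mode::keep leaves the up-to-date fsi untouched

   C.open_from_snapshot(snapshot);            // restart C from the snapshot
   push_blocks(A, C);
   A.produce_blocks(2);                       // C should vote strong again and lib advance
}
```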

All but one finalizer node go down

Tests are similar to those above, except that C is replaced by the set { B, C, D }, and lib stops advancing while { B, C, D } are shut down

[md0] Recovery when nodes go down

  • shutdown { B, C, D }
  • A produces 4 more blocks. Verify that lib advances by 1
  • restart { B, C, D }
  • push blocks A -> { B, C, D }
  • verify that { B, C, D } vote again (strong) and that lib continues to advance

[md1] Recover killed nodes with old finalizer safety info

  • save { B, C, D }'s fsi
  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown { B, C, D }
  • A produces 2 blocks, verify lib continues to advance
  • remove { B, C, D }'s state, replace { B, C, D }'s fsi with the previously saved files
  • restart { B, C, D } from previously taken snapshot
  • push blocks A -> { B, C, D }
  • A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

[md2] Recover killed nodes with deleted finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown { B, C, D }
  • A produces 2 blocks, verify lib continues to advance
  • remove { B, C, D }'s state and fsi
  • restart { B, C, D } from previously taken snapshot
  • push blocks A -> { B, C, D }
  • A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

[md3] Recover killed nodes while retaining up to date finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown { B, C, D }
  • A produces 2 blocks, verify lib continues to advance
  • remove { B, C, D }'s state, leave the fsi alone
  • restart { B, C, D } from previously taken snapshot
  • push blocks A -> { B, C, D }
  • A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

All nodes are shutdown with reversible blocks lost

[rv0] nodes shutdown with reversible blocks lost

  • A produces 2 blocks
  • take snapshot of C
  • A produces enough blocks so the snapshot block becomes irreversible and the snapshot is created.
  • verify that all nodes have the same last irreversible block ID (lib_id) and head block ID (h_id) - the snapshot block
  • split network { A, B } and { C, D }
  • A produces two more blocks, so A and B will vote strong but finality will not advance
  • remove network split
  • shutdown all four nodes
  • delete the state and the reversible data for all nodes, but do not delete the fsi or blocks log
  • restart all four nodes from the previously saved snapshot. A's and B's finalizers will be locked on lib_id's child, which was lost
  • A produces 4 blocks
  • verify that head is advancing on all nodes
  • verify that lib does not advance and is stuck at lib_id (because the finalizers are locked on a reversible block which has been lost, so they cannot vote: the claim on the lib block is just copied forward and will always be on a block with a timestamp earlier than that of the lock block in the fsi)
  • verify that A and B aren't voting
  • shutdown all four nodes again
  • delete every node's fsi
  • restart all four nodes
  • A produces 4 blocks, verify that every node is voting strong again on each new block and that lib advances
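
A hedged sketch of [rv0] with the same illustrative fixture API; set_partition, remove_reversible_data, and remove_fsi are assumed helper names.

```cpp
BOOST_FIXTURE_TEST_CASE(rv0_reversible_blocks_lost, savanna_cluster::cluster_t) {
   auto& A = _nodes[0];  auto& C = _nodes[2];  auto& D = _nodes[3];

   A.produce_blocks(2);
   auto snapshot = C.create_snapshot();
   A.produce_blocks(4);                       // assumption: enough for the snapshot block to become final

   set_partition({&C, &D});                   // split { A, B } | { C, D }
   A.produce_blocks(2);                       // A and B vote strong, but finality stalls
   set_partition({});                         // remove the split

   for (auto& n : _nodes) n.close();
   for (auto& n : _nodes) { n.remove_state(); n.remove_reversible_data(); }  // keep fsi + blocks log
   for (auto& n : _nodes) n.open_from_snapshot(snapshot);

   auto lib0 = A.lib_block->block_num();
   A.produce_blocks(4);                       // heads advance on every node...
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0);   // ...but lib is stuck: A and B are locked
                                                          // on a reversible block that was deleted
   for (auto& n : _nodes) { n.close(); n.remove_fsi(); n.open(); }           // wipe every fsi, restart
   A.produce_blocks(4);
   BOOST_REQUIRE_GT(A.lib_block->block_num(), lib0);      // everyone votes strong again, lib advances
}
```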

Finality violation

The goal is to identify a finality violation, defined as the existence of 2 final blocks where neither is an ancestor of the other.

[fv1] Validate network can tolerate 1/4 fault

  • shutdown D
  • add B's finalizer key to D (B still keeps its key and votes with it)
  • restart D
  • A produces 4 blocks, verify that every node is voting strong again on each new block and that lib advances
  • verify no finality violations have occurred

[fv2] split network when one node holds two finalizer keys

  • shutdown D
  • add B's finalizer key to D (B still keeps its key and votes with it)
  • restart D
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 4 blocks, verify lib does not advance (because quorum not met)
  • C produces 4 blocks (starting with a delay of 5 * _block_interval_us), verify lib advances (because quorum is met, thanks to D voting with B's key in addition to its own)
  • verify no finality violations have occurred
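
A hedged sketch of [fv2]; add_finalizer_key and finalizer_key are illustrative names for "give D a copy of B's signing key", and finality_violation is the helper sketched in the definitions section.

```cpp
BOOST_FIXTURE_TEST_CASE(fv2_split_with_duplicated_key, savanna_cluster::cluster_t) {
   auto& A = _nodes[0]; auto& B = _nodes[1]; auto& C = _nodes[2]; auto& D = _nodes[3];

   D.close();
   D.add_finalizer_key(B.finalizer_key());    // D now also votes with B's key (B keeps voting too)
   D.open();

   set_partition({&C, &D});                   // { A, B } | { C, D }
   auto lib0 = A.lib_block->block_num();
   A.produce_blocks(4);
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0);    // A + B hold 2 of 4 keys: no quorum

   C.produce_blocks(4);                       // offset by 5 * _block_interval_us in the scenario
   BOOST_REQUIRE_GT(C.lib_block->block_num(), lib0);       // C + D hold 3 of 4 keys: quorum met

   BOOST_REQUIRE(!finality_violation(A, C));  // only one branch was ever finalized
}
```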

[fv3] restore split network when one node holds two finalizer keys

  • execute the steps from test [fv2]
  • remove network partition
  • propagate_heads
  • A produces 4 blocks (starting with a delay of 5 * _block_interval_us)
  • verify that lib is still advancing on C and D
  • verify that lib resumed advancing on A and B
  • verify all nodes have the same head and lib
  • Verify no finality violations have occurred

[fv4] Validate network cannot tolerate 2/4 fault

  • execute the steps from test [fv2]
  • shutdown A
  • add C's finalizer key to A (C still keeps its key and votes with it)
  • restart A
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 4 blocks, verify lib advances (because together A and B hold three keys)
  • C produces 4 blocks (starting with a delay of 5 * _block_interval_us), verify lib advances (because together C and D hold three keys)
  • verify finality violations have occurred (for example by showing that lib blocks on A and C are incompatible)
  • remove network partition
  • C produces 4 blocks. verify that lib advances on C and D, and that unlinkable blocks are received on A and B
  • A produces 4 blocks (starting with a delay of 5 * _block_interval_us). verify that lib advances on A and B, and that unlinkable blocks are received on C and D.
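
The final checks of [fv4] could look roughly like this; finality_violation is the helper sketched earlier, and treating the cross-partition push as throwing unlinkable_block_exception is an assumption about how the fixture reports it.

```cpp
void verify_fv4_outcome(node_t& A, node_t& C) {
   // Each partition reached quorum on its own branch, so the two libs must conflict.
   BOOST_REQUIRE(finality_violation(A, C));

   // After the partition is removed, each side rejects the other's branch: pushing
   // C's blocks to A (and vice versa) is expected to surface unlinkable blocks.
   BOOST_REQUIRE_THROW(push_blocks(C, A), eosio::chain::unlinkable_block_exception);
   BOOST_REQUIRE_THROW(push_blocks(A, C), eosio::chain::unlinkable_block_exception);
}
```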

unit tests: Savanna transition testing

For these tests, the cluster will be started in a pre-savanna configuration, but with finalizer keys initialized and set to one per node as before.

[st0] straightforward transition

  • call the setfinalizer action on node A, with a policy where each node has one vote.
  • produce blocks on A, waiting for transition to Savanna to complete
  • A produces 4 blocks, verify lib advances
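
A hedged sketch of [st0]; the pre-Savanna fixture, set_finalizers, one_key_per_node, and head_is_savanna are assumed names for starting in a legacy configuration, pushing the setfinalizer action, and detecting that the transition has completed at head.

```cpp
BOOST_FIXTURE_TEST_CASE(st0_straightforward_transition, savanna_cluster::pre_savanna_cluster_t) {
   auto& A = _nodes[0];

   A.set_finalizers(one_key_per_node());      // setfinalizer: one vote per node (assumed helper)
   while (!A.head_is_savanna())
      A.produce_block();                      // drive through the genesis and critical blocks

   auto lib0 = A.lib_block->block_num();
   A.produce_blocks(4);
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0 + 4);  // finality is live under Savanna
}
```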

[st1] transition with split network before critical block

  • update schedule to 2 producers, so that we have multiple blocks (17) between the genesis and critical blocks
  • call the setfinalizer action on node A, with a policy where each node has one vote.
  • A produces 1 block (the genesis block)
  • A produces 2 blocks, verify we are before the critical block
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 20 blocks. produce some more blocks and verify lib doesn't advance and transition to Savanna doesn't complete
  • remove network partition
  • propagate_heads
  • A produces 1 block
  • check that A's head is greater than the critical block
  • A produces 4 blocks. verify that lib advances on all nodes, and that transition to Savanna happened
  • verify presence of snapshot_00 and snapshot_01 files, and preserve them

[st2] restart from Snapshot at beginning of transition while preserving fsi

  • update schedule to 2 producers, so that we have multiple blocks (17) between the genesis and critical blocks
  • call the setfinalizer action on node A, with a policy where each node has one vote.
  • A produces one block (the genesis block)
  • A produces 2 blocks, make sure we are before the critical block
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 2 blocks
  • take snapshot of C
  • A produces a few blocks (say 5, must be less than 14)
  • for each of { B, C, D }, shutdown
  • for each of { B, C, D }, remove blocks log, reversible blocks and state, but do not remove finalizer safety information
  • remove network partition
  • for each of { B, C, D }, restart from previously taken snapshot
  • push blocks from A (up to A's head) to each of { B, C, D }
  • A produces blocks until lib reaches the genesis block (so we are at the critical block), check it is not a proper savanna block.
  • A produces 1 block, check it is a proper savanna block
  • A produces 1 block, check that lib is not advancing
  • A produces 1 block, check that lib starts advancing again (it takes a 2-chain after the critical block)
  • A produces 3 blocks. verify that lib advances by 3 on all nodes
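
The tail of [st2] (and of [st4]) could be expressed roughly as follows; head_is_savanna and lib_block are illustrative names, and genesis_block_num is assumed to have been captured when the setfinalizer (genesis) block was produced.

```cpp
void verify_post_transition_lib(node_t& A, uint32_t genesis_block_num) {
   while (A.lib_block->block_num() < genesis_block_num)
      A.produce_block();                      // lib now at the genesis block => head is the critical block
   BOOST_REQUIRE(!A.head_is_savanna());       // the critical block is still not a proper Savanna block

   A.produce_block();
   BOOST_REQUIRE(A.head_is_savanna());        // first proper Savanna block

   auto lib0 = A.lib_block->block_num();
   A.produce_block();                         // lib does not move yet...
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0);
   A.produce_block();                         // ...it takes a 2-chain after the critical block
   BOOST_REQUIRE_GT(A.lib_block->block_num(), lib0);
}
```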

[st3] restart from Snapshot at end of transition while preserving fsi

  • update schedule to 2 producers, so that we have multiple blocks (17) between the genesis and critical blocks
  • call the setfinalizer action on node A, with a policy where each node has one vote.
  • A produces one block (the genesis block)
  • A produces 2 blocks, make sure we are before the critical block
  • A produces blocks until lib reaches the genesis block (so we are at the critical block), check it is not a proper savanna block.
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 1 block
  • take snapshot of C
  • for each of { B, C, D }, shutdown
  • for each of { B, C, D }, remove blocks log, reversible blocks and state, but do not remove finalizer safety information
  • remove network partition
  • for each of { B, C, D }, restart from previously taken snapshot
  • push blocks from A (up to A's head) to each of { B, C, D }
  • A produces 1 block, check that lib is not advancing
  • A produces 1 block, check that lib starts advancing again (it takes a 2-chain after the critical block)
  • A produces 3 blocks. verify that lib advances by 3 on all nodes

[st4] restart from Snapshot at beginning of transition without preserving fsi

Very similar to [st2]; the only difference is that the fsi are removed before restarting the nodes.

  • update schedule to 2 producers, so that we have multiple blocks (17) between the genesis and critical blocks
  • call the setfinalizer action on node A, with a policy where each node has one vote.
  • A produces one block (the genesis block)
  • A produces 2 blocks, make sure we are before the critical block
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 2 blocks
  • take snapshot of C
  • A produces a few blocks (say 5, must be less than 14)
  • for each of { B, C, D }, shutdown
  • for each of { B, C, D }, remove blocks log, reversible blocks, state, and also remove finalizer safety information
  • remove network partition
  • for each of { B, C, D }, restart from previously taken snapshot
  • push blocks from A (up to A's head) to each of { B, C, D }
  • A produces blocks until lib reaches the genesis block (so we are at the critical block), check it is not a proper savanna block.
  • A produces 1 block, check it is a proper savanna block
  • A produces 1 block, check that lib is not advancing
  • A produces 1 block, check that lib starts advancing again (it takes a 2-chain after the critical block)
  • A produces 3 blocks. verify that lib advances by 3 on all nodes

unit tests: Finalizer policy testing

[fp0] policy change

  • precondition: Nodes A, B, C, D are running and contributing to finality, each with weight=1, threshold is 3
  • shutdown C
  • B produces 2 blocks, verify lib advances by 2
  • update finalizer_policy with a new key for B
  • produce blocks on A, waiting for the new policy to become pending
  • when new policy is pending, shutdown B, add the new finalizer key to B, and restart B
  • produce blocks on A, waiting for transition to complete (until the updated policy is active on A's head)
  • produce 3 blocks on A, verify that lib advances by 3
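
A hedged sketch of [fp0]; generate_finalizer_key, set_finalizer_policy, replace_key, add_finalizer_key, head_has_pending_policy, and head_has_active_policy are assumed names.

```cpp
BOOST_FIXTURE_TEST_CASE(fp0_policy_change, savanna_cluster::cluster_t) {
   auto& A = _nodes[0]; auto& B = _nodes[1]; auto& C = _nodes[2];

   C.close();                                 // 3 of 4 finalizers left, quorum = 3 still met
   auto lib0 = B.lib_block->block_num();
   B.produce_blocks(2);
   BOOST_REQUIRE_EQUAL(B.lib_block->block_num(), lib0 + 2);

   auto new_key = generate_finalizer_key();               // assumed helper
   A.set_finalizer_policy(replace_key(B, new_key));       // policy identical except for B's key
   while (!A.head_has_pending_policy())
      A.produce_block();

   B.close();
   B.add_finalizer_key(new_key);              // B can now also sign votes with the new key
   B.open();

   while (!A.head_has_active_policy())        // wait until the updated policy is active at head
      A.produce_block();

   lib0 = A.lib_block->block_num();
   A.produce_blocks(3);
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0 + 3);
}
```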

[fp1] policy change including weight and threshold

  • precondition: Nodes A, B, C, D are running and contributing to finality, each with weight=1, threshold is 3
  • shutdown C, verify that lib still advances since threshold is 3
  • update finalizer_policy so that C's weight is 3, B and D are removed, and the threshold is 4.
  • produce blocks on A, waiting for the new policy to become pending
  • verify that lib stops advancing (because C is down, so we can't get a QC on the pending policy, which cannot reach its threshold of 4 without C's weight-3 vote)
  • restart C
  • produce blocks on A, waiting for transition to complete (until the updated policy is active on A's head)
  • produce 2 blocks on A, verify that lib advances by 2
  • shutdown B and D
  • produce 2 blocks on A, verify that lib advances by 2 (because B and D are not in the updated policy)

[fp2] policy change: reduce threshold, replace all keys

  • precondition: Nodes A, B, C, D are running and contributing to finality, each with weight=1, threshold is 3
  • shutdown D, verify that lib still advances since threshold is 3
  • update signing keys on each of { A, B }, so that each of them has 2 keys, the previous one + a new one
  • produce 2 blocks on A, verify that lib advances by 2
  • update the finalizer_policy to include only { A, B }'s new keys, with a threshold of 2
  • produce blocks on A, waiting for the new policy to become pending
  • produce blocks on A, waiting for the new policy to become active
  • A produces 2 blocks, verify that lib advances by 2
  • shutdown C and D
  • A produces 2 blocks, verify that lib advances by 2

[fp3] policy change: restart from snapshot

  • precondition: Nodes A, B, C, D are running and contributing to finality, each with weight=1, threshold is 3
  • update signing keys on each of { A, B, C }, so that each of them has 2 keys, the previous one + a new one
  • update the finalizer_policy to include only { A, B, C }'s new keys, C's weight is 2, and threshold is 3
  • A produces one block.
  • Take a snapshot of C. Produce 2 blocks on A so the snapshot block is stored in the block log
  • for each of { A, B, C, D }, shutdown and delete the state, but not the blocks log, reversible data, or fsi.
  • for each of { A, B, D }, restart from the snapshot
  • A produces 4 blocks, verify that lib has advanced only by one and that the new policy is only pending (because C is down, so there is no quorum on the new policy)
  • restart C from the snapshot
  • A produces 4 blocks, verify that the new policy is active and lib starts advancing again
  • shutdown B and D
  • A produces 3 blocks, verify that lib advances by 3 (because together A and C meet the 3 votes quorum for the new policy)
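
A hedged sketch of the [fp3] assertions after { A, B, D } have been restarted from the snapshot with C still down; helper names are illustrative, as in the earlier sketches.

```cpp
void verify_fp3_recovery(node_t& A, node_t& C, const snapshot_t& snapshot) {
   auto lib0 = A.lib_block->block_num();
   A.produce_blocks(4);
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0 + 1);  // lib advances by one only
   BOOST_REQUIRE(A.head_has_pending_policy());               // new policy stuck at pending: C is down

   C.open_from_snapshot(snapshot);            // restart C from the snapshot
   A.produce_blocks(4);
   BOOST_REQUIRE(A.head_has_active_policy()); // quorum reached, the new policy activates
   BOOST_REQUIRE_GT(A.lib_block->block_num(), lib0 + 1);     // lib starts advancing again
}
```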

greg7mdp commented Aug 8, 2024

Re-opening issue for the implementation of the last four tests related to Finalizer policy testing
