Create Savanna unittests modeled after the fast testnet wave tests. #380

Closed
greg7mdp opened this issue Jul 18, 2024 · 1 comment · Fixed by #444, #487, #582, #595 or #597

greg7mdp commented Jul 18, 2024

We will use the savanna_cluster class to implement the following test scenarios, derived from this document.

Test setup

Common prerequisites:

  • 4 simulated nodes (A, B, C, D), each having one finalizer with weight=1.
  • the cluster has a total of 4 finalizers, quorum=3.
  • 4 producers: { PA, PB, PC, PD }
  • node A has just produced the first 2 blocks of PA's first round, and all other nodes have voted strong on every block

savanna_cluster functionality

  • by default, all nodes in the cluster are logically connected to every other node
  • it is possible to simulate network partitions, in which case every node in a partition is connected to every other node in the same partition
  • every block produced by a node is, by default, synchronously pushed to every connected node, which gets the opportunity to vote on the received block; the votes are similarly propagated to all connected nodes. All of this happens within the produce_block(s) call (see the sketch below).
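
For illustration, a minimal test exercising this default behavior might look as follows. This is a hedged sketch: the fixture name savanna_cluster::cluster_t and the members _nodes, lib_block, and produce_blocks are assumptions used for illustration, not necessarily the actual API.

```cpp
#include <boost/test/unit_test.hpp>
#include "savanna_cluster.hpp"   // assumed header providing the cluster fixture

// Hypothetical fixture: four connected nodes A, B, C, D, one finalizer each, quorum = 3.
BOOST_FIXTURE_TEST_CASE(default_full_connectivity, savanna_cluster::cluster_t) {
   auto& A    = _nodes[0];                    // assumed accessor for node A's tester
   auto  lib0 = A.lib_block->block_num();     // assumed accessor for the current lib

   // produce_blocks() pushes each new block to every connected node; the nodes vote
   // synchronously and the votes are propagated back within the same call.
   A.produce_blocks(4);

   // every block gathered a quorum of 3 strong votes, so lib advanced by 4
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0 + 4);
}
```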

Definitions for unit tests

shutdown A: means that close() is called for A's tester, which does control.reset(); chain_transactions.clear();

restart A: means that open() is called for A's tester, which restarts the node using the existing state.

fsi: finalizer safety information (finalizers/safety.dat)

head: the chain head for a node, queried from the controller and retrieved within a test using tester::head()

lib: the last irreversible block id for a node, as reported by the irreversible_block signal, and retrieved within a test using tester::lib_id

state: memory-mapped file holding the chainbase state (the shared_memory.bin file in the state directory)

blocks log: files holding irreversible blocks (files blocks/blocks.log and blocks/blocks.index)

reversible data: files holding reversible blocks data (located in blocks/reversible)

finality violation: defined as the existence of 2 final blocks where neither is an ancestor of the other.

confirm a finality violation: Finality violations can be confirmed by showing that the libs of two nodes are in conflict.

confirm no finality violation: The absence of a finality violation can be established if, on a reconnected network, heads can be propagated without unlinkable blocks (see the sketch below).
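
For illustration, the two confirmations above could be expressed roughly as follows; is_ancestor_of, lib_id, node_t, cluster_t, and propagate_heads are assumed names, not necessarily the actual helpers.

```cpp
// A finality violation exists iff the two libs conflict: they differ and neither
// is an ancestor of the other (is_ancestor_of is an assumed helper that walks
// previous-block links through the fork database / block log).
bool finality_violation(const node_t& x, const node_t& y) {
   auto lx = x.lib_id();                      // assumed accessor for the lib block id
   auto ly = y.lib_id();
   return lx != ly && !is_ancestor_of(lx, ly) && !is_ancestor_of(ly, lx);
}

// Absence of a violation: on a reconnected network, propagating heads succeeds
// without any node reporting an unlinkable block.
void confirm_no_finality_violation(cluster_t& cluster) {
   BOOST_REQUIRE_NO_THROW(cluster.propagate_heads());
}
```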

unit tests: Disaster recovery

Single finalizer goes down

[sd0] Recovery when a node goes down

  • shutdown C
  • A produces 4 more blocks. Verify that lib advances by 4
  • restart C
  • push blocks A -> C
  • verify that C votes again (strong) and that lib continues to advance
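
A hedged sketch of [sd0], using the same illustrative fixture API as the earlier sketch; push_blocks and is_voting_strong are assumed helpers.

```cpp
BOOST_FIXTURE_TEST_CASE(sd0_single_finalizer_down, savanna_cluster::cluster_t) {
   auto& A = _nodes[0];
   auto& C = _nodes[2];

   C.close();                                 // shutdown C: control.reset(), clear transactions
   auto lib0 = A.lib_block->block_num();
   A.produce_blocks(4);
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0 + 4);   // A, B, D still meet quorum = 3

   C.open();                                  // restart C using its existing state
   push_blocks(A, C);                         // assumed helper: sync C up to A's head
   A.produce_blocks(2);
   BOOST_REQUIRE(C.is_voting_strong());       // assumed predicate: C votes strong again
   BOOST_REQUIRE_GT(A.lib_block->block_num(), lib0 + 4);      // lib continues to advance
}
```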

[sd1] Recover a killed node with old finalizer safety info

  • save C's fsi
  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown C
  • A produces 2 blocks, verify lib continues to advance
  • remove C's state, replace C's fsi with previously saved file
  • restart C from previously taken snapshot
  • push blocks A -> C
  • A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance

[sd2] Recover a killed node with deleted finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown C
  • A produces 2 blocks, verify lib continues to advance
  • remove C's state and fsi
  • restart C from previously taken snapshot
  • push blocks A -> C
  • A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance

[sd3] Recover a killed node while retaining up to date finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown C
  • A produces 2 blocks, verify lib continues to advance
  • remove C's state, leave the fsi alone
  • restart C from previously taken snapshot
  • push blocks A -> C
  • A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance
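
The three variants [sd1]–[sd3] share the same scaffolding and differ only in what happens to C's fsi before the restart. A hedged sketch of that shared scaffolding follows; create_snapshot, open_from_snapshot, data_dir, remove_state, and push_blocks are illustrative names.

```cpp
#include <filesystem>
namespace fs = std::filesystem;

enum class fsi_mode { keep, restore_old, remove };   // [sd3], [sd1], [sd2] respectively

void recover_c_from_snapshot(node_t& A, node_t& C, fsi_mode mode) {
   fs::path fsi   = C.data_dir() / "finalizers" / "safety.dat";   // data_dir() is assumed
   fs::path saved = fsi.parent_path() / "safety.dat.saved";
   fs::copy_file(fsi, saved);                 // save C's fsi up front (used by [sd1])

   A.produce_blocks(2);
   auto snapshot = C.create_snapshot();       // assumed snapshot helper
   A.produce_blocks(2);
   C.close();                                 // shutdown C
   A.produce_blocks(2);                       // lib keeps advancing: A, B, D meet quorum

   C.remove_state();                          // delete state/shared_memory.bin
   if (mode == fsi_mode::restore_old)
      fs::copy_file(saved, fsi, fs::copy_options::overwrite_existing);
   else if (mode == fsi_mode::remove)
      fs::remove(fsi);
   // fsi_mode::keep leaves the up-to-date fsi untouched

   C.open_from_snapshot(snapshot);            // restart C from the snapshot
   push_blocks(A, C);
   A.produce_blocks(2);                       // C should vote strong again and lib advance
}
```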

All but one finalizer node go down

Tests are similar to those above, except that C is replaced by the set { B, C, D }, and lib stops advancing while { B, C, D } are shut down

[md0] Recovery when nodes go down

  • shutdown { B, C, D }
  • A produces 4 more blocks. Verify that lib advances by 1
  • restart { B, C, D }
  • push blocks A -> { B, C, D }
  • verify that { B, C, D } vote again (strong) and that lib continues to advance

[md1] Recover killed nodes with old finalizer safety info

  • save { B, C, D }'s fsi
  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown { B, C, D }
  • A produces 2 blocks, verify lib continues to advance
  • remove { B, C, D }'s state, replace { B, C, D }'s fsi with the previously saved files
  • restart { B, C, D } from previously taken snapshot
  • push blocks A -> { B, C, D }
  • A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

[md2] Recover killed nodes with deleted finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown { B, C, D }
  • A produces 2 blocks, verify lib continues to advance
  • remove { B, C, D }'s state and fsi
  • restart { B, C, D } from previously taken snapshot
  • push blocks A -> { B, C, D }
  • A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

[md3] Recover killed nodes while retaining up to date finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown { B, C, D }
  • A produces 2 blocks, verify lib continues to advance
  • remove { B, C, D }'s state, leave the fsi alone
  • restart { B, C, D } from previously taken snapshot
  • push blocks A -> { B, C, D }
  • A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

All nodes are shutdown with reversible blocks lost

[rv0] nodes shutdown with reversible blocks lost

  • A produces 2 blocks
  • take snapshot of C
  • A produces enough blocks so the snapshot block becomes irreversible and the snapshot is created.
  • verify that all nodes have the same last irreversible block ID (lib_id) and head block ID (h_id) - the snapshot block
  • split network { A, B } and { C, D }
  • A produces two more blocks, so A and B will vote strong but finality will not advance
  • remove network split
  • shutdown all four nodes
  • delete the state and the reversible data for all nodes, but do not delete the fsi or blocks log
  • restart all four nodes from the previously saved snapshot. A's and B's finalizers will be locked on lib_id's child, which was lost
  • A produces 4 blocks
  • verify that head is advancing on all nodes
  • verify that lib does not advance and is stuck at lib_id (because the finalizers are locked on a reversible block which has been lost, so they cannot vote: the claim on the lib block is just copied forward and will always be on a block with a timestamp earlier than that of the lock block in the fsi)
  • verify that A and B aren't voting
  • shutdown all four nodes again
  • delete every node's fsi
  • restart all four nodes
  • A produces 4 blocks, verify that every node is voting strong again on each new block and that lib advances
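
A hedged sketch of [rv0] with the same illustrative fixture API; set_partition, remove_reversible_data, and remove_fsi are assumed helper names.

```cpp
BOOST_FIXTURE_TEST_CASE(rv0_reversible_blocks_lost, savanna_cluster::cluster_t) {
   auto& A = _nodes[0];  auto& C = _nodes[2];  auto& D = _nodes[3];

   A.produce_blocks(2);
   auto snapshot = C.create_snapshot();
   A.produce_blocks(4);                       // assumption: enough for the snapshot block to become final

   set_partition({&C, &D});                   // split { A, B } | { C, D }
   A.produce_blocks(2);                       // A and B vote strong, but finality stalls
   set_partition({});                         // remove the split

   for (auto& n : _nodes) n.close();
   for (auto& n : _nodes) { n.remove_state(); n.remove_reversible_data(); }  // keep fsi + blocks log
   for (auto& n : _nodes) n.open_from_snapshot(snapshot);

   auto lib0 = A.lib_block->block_num();
   A.produce_blocks(4);                       // heads advance on every node...
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0);   // ...but lib is stuck: A and B are locked
                                                          // on a reversible block that was deleted
   for (auto& n : _nodes) { n.close(); n.remove_fsi(); n.open(); }           // wipe every fsi, restart
   A.produce_blocks(4);
   BOOST_REQUIRE_GT(A.lib_block->block_num(), lib0);      // everyone votes strong again, lib advances
}
```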

Finality violation

The goal is to identify a finality violation, defined as the existence of 2 final blocks where neither is an ancestor of the other.

[fv1] Validate network can tolerate 1/4 fault

  • shutdown D
  • add B's finalizer key to D (B still keeps its key and votes with it)
  • restart D
  • A produces 4 blocks, verify that every node is voting strong again on each new block and that lib advances
  • verify no finality violations have occurred

[fv2] split network when one node holds two finalizer keys

  • shutdown D
  • add B's finalizer key to D (B still keeps its key and votes with it)
  • restart D
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 4 blocks, verify lib does not advance (because quorum not met)
  • C produces 4 blocks (starting with a delay of 5 * _block_interval_us), verify lib advances (because quorum is met, thanks to D voting with B's key in addition to its own)
  • verify no finality violations have occurred
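
A hedged sketch of [fv2]; add_finalizer_key and finalizer_key are illustrative names for "give D a copy of B's signing key", and finality_violation is the helper sketched in the definitions section.

```cpp
BOOST_FIXTURE_TEST_CASE(fv2_split_with_duplicated_key, savanna_cluster::cluster_t) {
   auto& A = _nodes[0]; auto& B = _nodes[1]; auto& C = _nodes[2]; auto& D = _nodes[3];

   D.close();
   D.add_finalizer_key(B.finalizer_key());    // D now also votes with B's key (B keeps voting too)
   D.open();

   set_partition({&C, &D});                   // { A, B } | { C, D }
   auto lib0 = A.lib_block->block_num();
   A.produce_blocks(4);
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0);    // A + B hold 2 of 4 keys: no quorum

   C.produce_blocks(4);                       // offset by 5 * _block_interval_us in the scenario
   BOOST_REQUIRE_GT(C.lib_block->block_num(), lib0);       // C + D hold 3 of 4 keys: quorum met

   BOOST_REQUIRE(!finality_violation(A, C));  // only one branch was ever finalized
}
```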

[fv3] restore split network when one node holds two finalizer keys

  • execute the steps from test [fv2]
  • remove network partition
  • propagate_heads
  • A produces 4 blocks (starting with a delay of 5 * _block_interval_us)
  • verify that lib is still advancing on C and D
  • verify that lib resumed advancing on A and B
  • verify all nodes have the same head and lib
  • Verify no finality violations have occurred

[fv4] Validate network cannot tolerate 2/4 fault

  • execute the steps from test [fv2]
  • shutdown A
  • add C's finalizer key to A (C still keeps its key and votes with it)
  • restart A
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 4 blocks, verify lib advances (because together A and B hold three keys)
  • C produces 4 blocks (starting with a delay of 5 * _block_interval_us), verify lib advances (because together C and D hold three keys)
  • verify finality violations have occurred (for example by showing that lib blocks on A and C are incompatible)
  • remove network partition
  • C produces 4 blocks. verify that lib advances on C and D, and that unlinkable blocks are received on A and B
  • A produces 4 blocks (starting with a delay of 5 * _block_interval_us). verify that lib advances on A and B, and that unlinkable blocks are received on C and D.
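
The final checks of [fv4] could look roughly like this; finality_violation is the helper sketched earlier, and treating the cross-partition push as throwing unlinkable_block_exception is an assumption about how the fixture reports it.

```cpp
void verify_fv4_outcome(node_t& A, node_t& C) {
   // Each partition reached quorum on its own branch, so the two libs must conflict.
   BOOST_REQUIRE(finality_violation(A, C));

   // After the partition is removed, each side rejects the other's branch: pushing
   // C's blocks to A (and vice versa) is expected to surface unlinkable blocks.
   BOOST_REQUIRE_THROW(push_blocks(C, A), eosio::chain::unlinkable_block_exception);
   BOOST_REQUIRE_THROW(push_blocks(A, C), eosio::chain::unlinkable_block_exception);
}
```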

unit tests: Savanna transition testing

For these tests, the cluster will be started in a pre-savanna configuration, but with finalizer keys initialized and set to one per node as before.

[st0] straightforward transition

  • call the setfinalizer action on node A, with a policy where each node has one vote.
  • produce blocks on A, waiting for transition to Savanna to complete
  • A produces 4 blocks, verify lib advances
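
A hedged sketch of [st0]; the pre-Savanna fixture, set_finalizers, one_key_per_node, and head_is_savanna are assumed names for starting in a legacy configuration, pushing the setfinalizer action, and detecting that the transition has completed at head.

```cpp
BOOST_FIXTURE_TEST_CASE(st0_straightforward_transition, savanna_cluster::pre_savanna_cluster_t) {
   auto& A = _nodes[0];

   A.set_finalizers(one_key_per_node());      // setfinalizer: one vote per node (assumed helper)
   while (!A.head_is_savanna())
      A.produce_block();                      // drive through the genesis and critical blocks

   auto lib0 = A.lib_block->block_num();
   A.produce_blocks(4);
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0 + 4);  // finality is live under Savanna
}
```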

[st1] transition with split network before critical block

  • update schedule to 2 producers, so that we have multiple blocks (17) between the genesis and critical blocks
  • call the setfinalizer action on node A, with a policy where each node has one vote.
  • A produces 1 block (the genesis block)
  • A produces 2 blocks, verify we are before the critical block
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 20 blocks. produce some more blocks and verify lib doesn't advance and transition to Savanna doesn't complete
  • remove network partition
  • propagate_heads
  • A produces 1 block
  • check that A's head is greater than the critical block
  • A produces 4 blocks. verify that lib advances on all nodes, and that transition to Savanna happened
  • verify presence of snapshot_00 and snapshot_01 files, and preserve them

[st2] restart from Snapshot at beginning of transition while preserving fsi

  • update schedule to 2 producers, so that we have multiple blocks (17) between the genesis and critical blocks
  • call the setfinalizer action on node A, with a policy where each node has one vote.
  • A produces one block (the genesis block)
  • A produces 2 blocks, make sure we are before the critical block
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 2 blocks
  • take snapshot of C
  • A produces a few blocks (say 5, must be less than 14)
  • for each of { B, C, D }, shutdown
  • for each of { B, C, D }, remove blocks log, reversible blocks and state, but do not remove finalizer safety information
  • remove network partition
  • for each of { B, C, D }, restart from previously taken snapshot
  • push blocks from A (up to A's head) to each of { B, C, D }
  • A produces blocks until lib reaches the genesis block (so we are at the critical block), check it is not a proper savanna block.
  • A produces 1 block, check it is a proper savanna block
  • A produces 1 block, check that lib is not advancing
  • A produces 1 block, check that lib starts advancing again (it takes a 2-chain after the critical block)
  • A produces 3 blocks. verify that lib advances by 3 on all nodes
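
The tail of [st2] (and of [st4]) could be expressed roughly as follows; head_is_savanna and lib_block are illustrative names, and genesis_block_num is assumed to have been captured when the setfinalizer (genesis) block was produced.

```cpp
void verify_post_transition_lib(node_t& A, uint32_t genesis_block_num) {
   while (A.lib_block->block_num() < genesis_block_num)
      A.produce_block();                      // lib now at the genesis block => head is the critical block
   BOOST_REQUIRE(!A.head_is_savanna());       // the critical block is still not a proper Savanna block

   A.produce_block();
   BOOST_REQUIRE(A.head_is_savanna());        // first proper Savanna block

   auto lib0 = A.lib_block->block_num();
   A.produce_block();                         // lib does not move yet...
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0);
   A.produce_block();                         // ...it takes a 2-chain after the critical block
   BOOST_REQUIRE_GT(A.lib_block->block_num(), lib0);
}
```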

[st3] restart from Snapshot at end of transition while preserving fsi

  • update schedule to 2 producers, so that we have multiple blocks (17) between the genesis and critical blocks
  • call the setfinalizer action on node A, with a policy where each node has one vote.
  • A produces one block (the genesis block)
  • A produces 2 blocks, make sure we are before the critical block
  • A produces blocks until lib reaches the genesis block (so we are at the critical block), check it is not a proper savanna block.
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 1 block
  • take snapshot of C
  • for each of { B, C, D }, shutdown
  • for each of { B, C, D }, remove blocks log, reversible blocks and state, but do not remove finalizer safety information
  • remove network partition
  • for each of { B, C, D }, restart from previously taken snapshot
  • push blocks from A (up to A's head) to each of { B, C, D }
  • A produces 1 block, check that lib is not advancing
  • A produces 1 block, check that lib starts advancing again (it takes a 2-chain after the critical block)
  • A produces 3 blocks. verify that lib advances by 3 on all nodes

[st4] restart from Snapshot at beginning of transition without preserving fsi

Very similar to [st2]; the only difference is that the fsi are removed before restarting the nodes.

  • update schedule to 2 producers, so that we have multiple blocks (17) between the genesis and critical blocks
  • call the setfinalizer action on node A, with a policy where each node has one vote.
  • A produces one block (the genesis block)
  • A produces 2 blocks, make sure we are before the critical block
  • partition network by disconnecting { A, B } and { C, D }
  • A produces 2 blocks
  • take snapshot of C
  • A produces a few blocks (say 5, must be less than 14)
  • for each of { B, C, D }, shutdown
  • for each of { B, C, D }, remove blocks log, reversible blocks, state, and also remove finalizer safety information
  • remove network partition
  • for each of { B, C, D }, restart from previously taken snapshot
  • push blocks from A (up to A's head) to each of { B, C, D }
  • A produces blocks until lib reaches the genesis block (so we are at the critical block), check it is not a proper savanna block.
  • A produces 1 block, check it is a proper savanna block
  • A produces 1 block, check that lib is not advancing
  • A produces 1 block, check that lib starts advancing again (it takes a 2-chain after the critical block)
  • A produces 3 blocks. verify that lib advances by 3 on all nodes

unit tests: Finalizer policy testing

[fp0] policy change

  • precondition: Nodes A, B, C, D are running and contributing to finality, each with weight=1, threshold is 3
  • shutdown C
  • B produces 2 blocks, verify lib advances by 2
  • update finalizer_policy with a new key for B
  • produce blocks on A, waiting for the new policy to become pending
  • when new policy is pending, shutdown B, add the new finalizer key to B, and restart B
  • produce blocks on A, waiting for transition to complete (until the updated policy is active on A's head)
  • produce 3 blocks on A, verify that lib advances by 3
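
A hedged sketch of [fp0]; generate_finalizer_key, set_finalizer_policy, replace_key, add_finalizer_key, head_has_pending_policy, and head_has_active_policy are assumed names.

```cpp
BOOST_FIXTURE_TEST_CASE(fp0_policy_change, savanna_cluster::cluster_t) {
   auto& A = _nodes[0]; auto& B = _nodes[1]; auto& C = _nodes[2];

   C.close();                                 // 3 of 4 finalizers left, quorum = 3 still met
   auto lib0 = B.lib_block->block_num();
   B.produce_blocks(2);
   BOOST_REQUIRE_EQUAL(B.lib_block->block_num(), lib0 + 2);

   auto new_key = generate_finalizer_key();               // assumed helper
   A.set_finalizer_policy(replace_key(B, new_key));       // policy identical except for B's key
   while (!A.head_has_pending_policy())
      A.produce_block();

   B.close();
   B.add_finalizer_key(new_key);              // B can now also sign votes with the new key
   B.open();

   while (!A.head_has_active_policy())        // wait until the updated policy is active at head
      A.produce_block();

   lib0 = A.lib_block->block_num();
   A.produce_blocks(3);
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0 + 3);
}
```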

[fp1] policy change including weight and threshold

  • precondition: Nodes A, B, C, D are running and contributing to finality, each with weight=1, threshold is 3
  • shutdown C, verify that lib still advances since threshold is 3
  • update finalizer_policy so that C's weight is 3, B and D are removed, and the threshold is 4.
  • produce blocks on A, waiting for the new policy to become pending
  • verify that lib stops advancing (because C is down, so we can't get a QC on the pending policy, which cannot reach its threshold of 4 without C's weight-3 vote)
  • restart C
  • produce blocks on A, waiting for transition to complete (until the updated policy is active on A's head)
  • produce 2 blocks on A, verify that lib advances by 2
  • shutdown B and D
  • produce 2 blocks on A, verify that lib advances by 2 (because B and D are not in the updated policy)

[fp2] policy change: reduce threshold, replace all keys

  • precondition: Nodes A, B, C, D are running and contributing to finality, each with weight=1, threshold is 3
  • shutdown D, verify that lib still advances since threshold is 3
  • update signing keys on each of { A, B }, so that each of them has 2 keys, the previous one + a new one
  • produce 2 blocks on A, verify that lib advances by 2
  • update the finalizer_policy to include only { A, B }'s new keys, with a threshold of 2
  • produce blocks on A, waiting for the new policy to become pending
  • produce blocks on A, waiting for the new policy to become active
  • A produces 2 blocks, verify that lib advances by 2
  • shutdown C and D
  • A produces 2 blocks, verify that lib advances by 2

[fp3] policy change: restart from snapshot

  • precondition: Nodes A, B, C, D are running and contributing to finality, each with weight=1, threshold is 3
  • update signing keys on each of { A, B, C }, so that each of them has 2 keys, the previous one + a new one
  • update the finalizer_policy to include only { A, B, C }'s new keys, C's weight is 2, and threshold is 3
  • A produces one block.
  • Take a snapshot of C. Produce 2 blocks on A so the snapshot block is stored in the block log
  • for each of { A, B, C, D }, shutdown and delete the state, but not the blocks log, reversible data, or fsi.
  • for each of { A, B, D }, restart from the snapshot
  • A produces 4 blocks, verify that lib has advanced only by one and that the new policy is only pending (because C is down, so there is no quorum on the new policy)
  • restart C from the snapshot
  • A produces 4 blocks, verify that the new policy is active and lib starts advancing again
  • shutdown B and D
  • A produces 3 blocks, verify that lib advances by 3 (because together A and C meet the 3 votes quorum for the new policy)
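
A hedged sketch of the [fp3] assertions after { A, B, D } have been restarted from the snapshot with C still down; helper names are illustrative, as in the earlier sketches.

```cpp
void verify_fp3_recovery(node_t& A, node_t& C, const snapshot_t& snapshot) {
   auto lib0 = A.lib_block->block_num();
   A.produce_blocks(4);
   BOOST_REQUIRE_EQUAL(A.lib_block->block_num(), lib0 + 1);  // lib advances by one only
   BOOST_REQUIRE(A.head_has_pending_policy());               // new policy stuck at pending: C is down

   C.open_from_snapshot(snapshot);            // restart C from the snapshot
   A.produce_blocks(4);
   BOOST_REQUIRE(A.head_has_active_policy()); // quorum reached, the new policy activates
   BOOST_REQUIRE_GT(A.lib_block->block_num(), lib0 + 1);     // lib starts advancing again
}
```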

greg7mdp commented Aug 8, 2024

Re-opening issue for the implementation of the last four tests related to Finalizer policy testing
