Respect the testnets #138

q9f · 2023-04-24T14:08:45Z

q9f
Apr 24, 2023

I am following up on the discussion of the expedited nv19 upgrade timeline, the last-minute changes to the specification, and the subsequent breakage of the Calibration test network.

Putting aside my concerns with how nv19 was expedited and how little time client teams and the community had to prepare for the upgrade (notably, not even the Filecoin ops team managed to upgrade their snapshot servers in time), I would like to take the chance to discuss the Filecoin upgrade and testing strategy and the role of testnets specifically.

A protocol upgrade on a decentralized network has to go through various stages:

Specification

drafting motivation, documentation, and rationale
providing a specification and test cases; ideally executable

Governance

shed light on motivation, merits, and rationale
iron out technical concerns, iterate steps 1 and 2 if required
agree on rollout plan/timeline

Implementation

client teams have the chance to implement the specification and test cases
clients also implement provided test cases or add new cases to the spec
if new insights occur, iterate steps 1, 2, and 3 to be sure everyone is aware

Testing

on top of the spec tests, clients run integration and end-to-end tests as well as simulations
client teams collude on coordinated developer testnets to give the upgrade a dry run
allow all engineers to test the proposed upgrades
if something goes unexpected, iterate steps 1-4

Rollout

at this point, the upgrade has been reviewed, discussed and tested excessively
give the community time to prepare for the upgrade (make releases, announcements, etc.)
roll out the upgrade on a public testnet as a last pre-flight check
give it time to have a buffer between the upgrades; at this point, you don't want to go back to square 1 again
roll out the upgrade on mainnet

Taking nv19 as an example, we failed somewhere between steps 3 and 4. It became apparent that changes were needed. But instead of iterating the process, it was rolled out to the Calibration test network without further testing. The subsequent result was utter chaos and the breakage of the testnet.

A public testnet should always be treated as a mainnet for change management. Breaking a testnet should be avoided at any cost because it's a precious infrastructure for the core developers and the entire community. Furthermore, breaking a public testnet gives away the opportunity to properly test a successful upgrade and gives way to potential issues on the mainnet.

I would strongly encourage you to

spend more effort on building and maintaining specification tests for protocol upgrades that can be shared across all clients and used before rolling out upgrades
treat Calibnet (or any other dedicated testnet for that purpose) as the last bastion for testing mainnet upgrades
do not rush protocol upgrades on the testnet, as the highest goal should always be success and confidence that the mainnet is not in danger
push protocol upgrade dates if there is the slightest chance of issue/failure; moving deadlines is not a weakness here; it's a strength

Losing or bricking a testnet due to a failed upgrade induces a lot of cost on the community. Patching the network, resetting the network, or setting up a new one requires a lot of teams to go to extra lengths to adjust code, infrastructure, and documentation.

Treat the public testnet as if it was a mainnet.

ZenGround0 · 2023-04-24T18:47:16Z

ZenGround0
Apr 24, 2023

It became apparent that changes were needed. But instead of iterating the process, it was rolled out to the Calibration test network without further testing.

So this is actually not true. We iterated again on all testing and still missed the PoSt issue. The bug that made it to calibration was in both the release before descoping and expediation and the release after descoping and expediation. We should have been more thorough and caught it before calibration but this is decoupled from the expediated release.

A public testnet should always be treated as a mainnet for change management.
Breaking a testnet should be avoided at any cost because it's a precious infrastructure for the core developers and the entire community.
treat Calibnet (or any other dedicated testnet for that purpose) as the last bastion for testing mainnet upgrades
Furthermore, breaking a public testnet gives away the opportunity to properly test a successful upgrade and gives way to potential issues on the mainnet.

A number of engineers were paged in the wee hours of the morning on Sunday in order to fix the bug and keep it from bringing down calibration. It is a large mischaracterization to say that there is disrespect or lack of concern for the calibration testnet or that there is not overwhelming agreement on the above points. Everyone involved acknowledged this was a big mistake and we all stepped up to correct it.

push protocol upgrade dates if there is the slightest chance of issue/failure; moving deadlines is not a weakness here; it's a strength

There is a tradeoff and our ecosystem is perhaps less risk averse than your preference. There is a cost to pushing deadlines. As we've discussed a bit on slack there are many possible and previously encountered scenarios where network security makes moving fast necessary.

0 replies

arajasek · 2023-05-03T15:25:39Z

arajasek
May 3, 2023
Maintainer

@q9f Thanks for starting this discussion. I generally agree with all of the sentiments and guiding principles you cover. In particular, the importance of a stable, public testnet that is mainnet-like in every sense cannot be overstated. Calibrationnet is that network for Filecoin.

So given that we're aligned on the importance of #RespectingTheTestnets, I'm interested in discussing how we can achieve that. The fact is that we have deployed bad code onto calibrationnet for 2 of the past 3 network upgrades -- nv17 required a network reset, while nv19 was more easily managed with some patching in the FVM. We need to do better.

From my perspective, I think the biggest problem is that too much runs through the Lotus team, and our close partners who built the FVM. Internally, this feels like we're responsible for too much, externally it perhaps looks like we have too much power. Regardless, I think it's a problem*.

Taking nv19 as an example, we failed somewhere between steps 3 and 4.

So I guess I would like to know what we can do to motivate others to participate in the processes you're describing more. In particular, if we take the above statement as true (I don't want to get especially bogged down by the details of nv19), I'd like to look at steps 3 and 4, and ask what can be done to encourage other teams (and in particular Forest) to participate in them:

Implementation

client teams have the chance to implement the specification and test cases

clients also implement provided test cases or add new cases to the spec

if new insights occur, iterate steps 1, 2, and 3 to be sure everyone is aware

What prevented Forest from adding test cases to FIP-0061? Of course there's no guarantee that you folks would have thought of and implemented the specific case the Lotus team missed, but we're guaranteed not to find bugs if we don't even try.

Testing

on top of the spec tests, clients run integration and end-to-end tests as well as simulations

client teams collude on coordinated developer testnets to give the upgrade a dry run

allow all engineers to test the proposed upgrades

if something goes unexpected, iterate steps 1-4

What is actively preventing Forest from writing and running these end-to-end tests and simulations?
I really like the idea of a coordinated developer testnet -- have you put this idea before, and if so what caused it to fail?

I don't mean to sound argumentative -- I know there are very real obstacles to Forest being able to participate more actively in Filecoin's development, testing, and maintenance. One of those is the pace at which we move, which is somewhat hard to tune to work for everyone. In this case, I think NOT expediting nv19 would have been quite the wrong call given the precarious block validation times, reduced average WinCount, decreased chain throughput, and loss of revenue to SPs (all of which I assume was affecting Forest nodes too, and observed by Forest's users and chain monitoring).

But I'm aware there are other major blockers to Forest being able to operate at the level we'd all like it to -- @lemmih has brought up many in the past that I know I personally have not adequately addressed. I think now would be a great time to start resolving these outstanding issues so that we can collectively move towards achieving the upgrade and testing strategy that you describe.

*I want to say that this is already rapidly improving -- testing and review from the Venus team is increasingly catching issues with FIPs & their implementation in the FVM / builtin-actors, which is FANTASTIC. We're also getting more testing, investigation, and fixes from the broader community, which is another major lift. I greatly look forward to that pool expanding yet further, especially the potential addition of Forest engineers to those testing efforts.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Respect the testnets #138

{{title}}

Replies: 2 comments

{{title}}

{{title}}

Select a reply

Respect the testnets #138

q9f Apr 24, 2023

Replies: 2 comments

ZenGround0 Apr 24, 2023

arajasek May 3, 2023 Maintainer

q9f
Apr 24, 2023

ZenGround0
Apr 24, 2023

arajasek
May 3, 2023
Maintainer