Replies: 2 comments
-
So this is actually not true. We iterated again on all testing and still missed the PoSt issue. The bug that made it to calibration was present both in the release before descoping and expediting and in the release after. We should have been more thorough and caught it before calibration, but this is decoupled from the expedited release.
A number of engineers were paged in the wee hours of Sunday morning to fix the bug and keep it from bringing down calibration. It is a large mischaracterization to say that there is disrespect or a lack of concern for the calibration testnet, or that there is not overwhelming agreement on the above points. Everyone involved acknowledged this was a big mistake, and we all stepped up to correct it.
There is a tradeoff here, and our ecosystem is perhaps less risk averse than you would prefer. There is a cost to pushing deadlines. As we've discussed a bit on Slack, there are many possible and previously encountered scenarios where network security makes moving fast necessary.
-
@q9f Thanks for starting this discussion. I generally agree with all of the sentiments and guiding principles you cover. In particular, the importance of a stable, public testnet that is mainnet-like in every sense cannot be overstated. Calibrationnet is that network for Filecoin.

So, given that we're aligned on the importance of #RespectingTheTestnets, I'm interested in discussing how we can achieve it. The fact is that we have deployed bad code onto calibrationnet for 2 of the past 3 network upgrades -- nv17 required a network reset, while nv19 was more easily managed with some patching in the FVM. We need to do better.

From my perspective, the biggest problem is that too much runs through the Lotus team and our close partners who built the FVM. Internally, this feels like we're responsible for too much; externally, it perhaps looks like we have too much power. Regardless, I think it's a problem*.
So I would like to know what we can do to motivate others to participate more in the processes you're describing. In particular, if we take the above statement as true (I don't want to get especially bogged down in the details of nv19), I'd like to look at steps 3 and 4 and ask what can be done to encourage other teams (and in particular Forest) to participate in them:
What prevented Forest from adding test cases to FIP-0061? Of course there's no guarantee that you folks would have thought of and implemented the specific case the Lotus team missed, but we're guaranteed not to find bugs if we don't even try.
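For illustration, here is a minimal sketch of the kind of table-driven migration test a client team could contribute alongside a FIP. Everything in it (StateTree, Migrate, the power-conservation invariant) is a hypothetical placeholder, not an actual Lotus, Forest, or builtin-actors API; the point is only that each team can encode the edge cases its own implementation surfaces.

```go
package upgrade_test

import "testing"

// StateTree is a stand-in for a client's chain-state representation.
type StateTree struct {
	TotalPower int64
	Sectors    int
}

// Migrate is a stand-in for the state migration under test (e.g. nv19).
func Migrate(pre StateTree) StateTree {
	post := pre
	// ... migration logic being exercised would go here ...
	return post
}

func TestMigrationInvariants(t *testing.T) {
	cases := []struct {
		name string
		pre  StateTree
	}{
		{"empty state", StateTree{}},
		{"single sector", StateTree{TotalPower: 32, Sectors: 1}},
		{"many sectors", StateTree{TotalPower: 32 << 10, Sectors: 1024}},
		// The case a team missed would ideally be added here by whichever
		// implementation first surfaces it.
	}
	for _, tc := range cases {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			post := Migrate(tc.pre)
			// Invariant: a migration must not create or destroy power.
			if post.TotalPower != tc.pre.TotalPower {
				t.Errorf("migration changed total power: %d -> %d",
					tc.pre.TotalPower, post.TotalPower)
			}
		})
	}
}
```

The value of a shared suite like this is that it runs identically against every implementation, so an edge case contributed by one team protects all of them.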
I don't mean to sound argumentative -- I know there are very real obstacles to Forest being able to participate more actively in Filecoin's development, testing, and maintenance. One of those is the pace at which we move, which is somewhat hard to tune to work for everyone. In this case, I think NOT expediting nv19 would have been quite the wrong call given the precarious block validation times, reduced average WinCount, decreased chain throughput, and loss of revenue to SPs (all of which I assume was affecting Forest nodes too, and observed by Forest's users and chain monitoring).

But I'm aware there are other major blockers to Forest being able to operate at the level we'd all like it to -- @lemmih has brought up many in the past that I know I personally have not adequately addressed. I think now would be a great time to start resolving these outstanding issues so that we can collectively move towards achieving the upgrade and testing strategy that you describe.

*I want to say that this is already rapidly improving -- testing and review from the Venus team is increasingly catching issues with FIPs & their implementation in the FVM / builtin-actors, which is FANTASTIC. We're also getting more testing, investigation, and fixes from the broader community, which is another major lift. I greatly look forward to that pool expanding yet further, especially the potential addition of Forest engineers to those testing efforts.
-
I am following up on the discussion of the expedited nv19 upgrade timeline, the last-minute changes to the specification, and the subsequent breakage of the Calibration test network.
Putting aside my concerns about how nv19 was expedited and how little time client teams and the community had to prepare for the upgrade (notably, not even the Filecoin ops team managed to upgrade their snapshot servers in time), I would like to take this opportunity to discuss the Filecoin upgrade and testing strategy, and the role of testnets specifically.
A protocol upgrade on a decentralized network has to go through various stages, roughly:
1. Drafting and discussing the specification (the FIP).
2. Implementing the specification in the client teams' codebases.
3. Testing the implementations against the specification, including test cases contributed by all client teams.
4. Deploying the upgrade to the public test network (Calibration).
5. Deploying the upgrade to the mainnet.
Taking nv19 as an example, we failed somewhere between steps 3 and 4. It became apparent that changes were needed, but instead of iterating on the process, the upgrade was rolled out to the Calibration test network without further testing. The result was utter chaos and the breakage of the testnet.
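To make concrete why skipping that iteration is so costly: once a release ships, clients gate the new rules on a scheduled upgrade epoch, so after that epoch passes on a network the old behaviour is gone for every node at once. A toy sketch of that mechanism follows; the constant and function here are illustrative only, not real Lotus code (Lotus keeps its actual per-upgrade heights in its build parameters):

```go
package main

import "fmt"

// Illustrative upgrade epoch; real values ship in client build parameters.
const UpgradeNV19Height int64 = 1_000_000

// networkVersion is a toy stand-in for how a node picks the rule set
// that applies at a given chain epoch.
func networkVersion(epoch int64) int {
	if epoch >= UpgradeNV19Height {
		return 19
	}
	return 18
}

func main() {
	for _, epoch := range []int64{UpgradeNV19Height - 1, UpgradeNV19Height} {
		fmt.Printf("epoch %d -> nv%d\n", epoch, networkVersion(epoch))
	}
}
```

Because the switch is atomic at the upgrade epoch, any bug that ships has to be handled live on the network; iterating before deployment is always the cheaper path.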
A public testnet should always be treated like a mainnet for the purposes of change management. Breaking a testnet should be avoided at all costs because it is precious infrastructure for the core developers and the entire community. Furthermore, breaking a public testnet forfeits the opportunity to properly rehearse a successful upgrade and opens the door to issues on the mainnet.
Losing or bricking a testnet due to a failed upgrade imposes significant costs on the community: patching the network, resetting it, or setting up a new one forces many teams to go to extra lengths to adjust code, infrastructure, and documentation.

So I would strongly encourage you to treat the public testnet as if it were a mainnet.