Develop a plan for how maintainers can own their release schedule (e.g., not depend on other teams' deployments) #9574
Comments
I want to suggest exploring how we could improve/get more involved in the current process before developing a new release verification solution. During the 0.17 release, @dharmapunk82 started documenting the release process at https://www.notion.so/pl-strflt/Release-Notes-6e0efff28ee540be9ccb8f2b85104c42 🙇. Now the question is: can we automate it?
As @guseggert correctly pointed out, it is also crucial to formally determine what Kubo metrics we care about. The doc mentions:
In my opinion, such a list is a prerequisite for any further work. It might also be a good idea to make a distinction between success and performance metrics. If we want to further develop the current deployment process, that distinction would help when thinking about things like automatic rollbacks or shadow deployments. If we decide to utilise any other solution, it'd be easier to reason about whether the replacement allows us to answer the questions we need answered. Finally, it'd be interesting to see Thunderdome integrated more closely with Kubo verification. I see it as a massive opportunity for shifting issue discovery left. However, I'd be cautious about the extent to which it can "replace" a real deployment.
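As a rough illustration of what a success-vs-performance metrics gate could look like, here is a minimal Go sketch that compares a release candidate against a baseline through the Prometheus HTTP API. The metric names (`gateway_http_errors_total`, `gateway_ttfb_seconds_bucket`), label values, and thresholds are placeholders and not an agreed-upon list.

```go
// metrics_gate.go: a minimal sketch, assuming a Prometheus instance that scrapes both
// a baseline gateway and the release-candidate gateway. All metric names, labels, and
// thresholds below are illustrative placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// query runs an instant query against the Prometheus HTTP API and returns the first
// sample value (sufficient for a single aggregated expression).
func query(promURL, expr string) (float64, error) {
	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(expr))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var out struct {
		Data struct {
			Result []struct {
				Value [2]interface{} `json:"value"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	if len(out.Data.Result) == 0 {
		return 0, fmt.Errorf("no samples for %q", expr)
	}
	s, ok := out.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, fmt.Errorf("unexpected sample type for %q", expr)
	}
	return strconv.ParseFloat(s, 64)
}

func main() {
	const prom = "http://localhost:9090" // assumed Prometheus address

	// One "success" metric (error rate must not regress at all) and one "performance"
	// metric (p99 TTFB may regress by at most 10% relative to the baseline).
	checks := []struct {
		name, rcExpr, baselineExpr string
		maxRatio                   float64
	}{
		{
			"HTTP 5xx rate",
			`sum(rate(gateway_http_errors_total{env="rc"}[5m]))`,
			`sum(rate(gateway_http_errors_total{env="baseline"}[5m]))`,
			1.0,
		},
		{
			"p99 TTFB",
			`histogram_quantile(0.99, sum(rate(gateway_ttfb_seconds_bucket{env="rc"}[5m])) by (le))`,
			`histogram_quantile(0.99, sum(rate(gateway_ttfb_seconds_bucket{env="baseline"}[5m])) by (le))`,
			1.1,
		},
	}

	for _, c := range checks {
		rc, err := query(prom, c.rcExpr)
		if err != nil {
			fmt.Printf("%s: rc query failed: %v\n", c.name, err)
			continue
		}
		base, err := query(prom, c.baselineExpr)
		if err != nil {
			fmt.Printf("%s: baseline query failed: %v\n", c.name, err)
			continue
		}
		if base > 0 && rc/base > c.maxRatio {
			fmt.Printf("FAIL %s: rc=%.4f baseline=%.4f (ratio %.2f > %.2f)\n", c.name, rc, base, rc/base, c.maxRatio)
		} else {
			fmt.Printf("ok   %s: rc=%.4f baseline=%.4f\n", c.name, rc, base)
		}
	}
}
```

The point is less the specific queries than the shape: each metric gets an explicit pass/fail rule, which is what would make discussions about automatic rollbacks or shadow deployments concrete.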
So just to restate the goal and motivation: we want to be in a place where Kubo maintainers are confident in Kubo releases without requiring coordination with specific external groups. We'd like to treat the ipfs.io gateways as just another gateway operator, so that it is the operators of ipfs.io who are responsible for timely testing of RC releases, and if no feedback comes in during the RC cycle, then we move forward with the release. Currently we block releases on ipfs.io testing, which requires a lot of coordination because Kubo maintainers (rightfully) don't have direct access to production systems that they don't own. We do this because it is the best mechanism we have for load testing Kubo on real workloads. I think there are broadly two strategies we could take here:
I think 1) is an ideal long-term direction, because I believe in the benefits of "owning" the code you write/merge. The people pressing the merge button should feel the pain of bad decisions, not just the users... this "skin in the game" not only drives motivation for high quality standards, but gives maintainers much more leverage to push back on bad code/designs. Many long-standing performance issues and features for gateway operators also continue to languish because maintainers are not feeling the constant pain of them, and Kubo maintainers are in the best position to make the significant changes to Kubo to fix them or add the necessary features. The ipfs.io gateway operators have papered over many issues ("bad bits" blocking, figuring out when to manually scale, dashboards, excessive resource usage, etc.), which provides a fix for ipfs.io but for nobody else, which is very unfortunate for the ecosystem.

There's also a product angle to 1) that I've always been interested in. Currently the ipfs.io infrastructure is closed source and private, because the cost of extracting it is too high. There are also numerous design deficiencies with it that make operating it more painful than it needs to be (top of mind: lack of autoscaling). Providing a solid "out-of-the-box" gateway product that fixes these issues would be beneficial for the community IMO. (related: https://github.com/ipfs-cluster/ipfs-operator)

That being said, I think 2) is a good incremental step towards 1) anyway, and will not be throwaway work even if 1) is not pursued. The crux is: can we get 2) to a point where Kubo maintainers are confident cutting a release without feedback from ipfs.io gateways? I think the answer is "yes", and I think we're almost there already. So I propose the following concrete actions:
I'd prefer we start with just running it ad-hoc by the release engineer, see how it goes, and then determine if we want to add it to CI and what that would look like.
@galargh I think we should avoid tying releases to ipfs.io deployments if possible. If something goes wrong it's quite difficult (and dangerous) for us to debug, and it usually involves coordinating with a bifrost engineer, which dramatically increases turnaround time. I really think we'd be in a better world if the infrastructure running the release acceptance tests is owned by us and is not a production system.
Can you elaborate on the specific deficient areas of Thunderdome?
@guseggert : the 0.19 release is going to sneak up on us quick. Can you please engage with @iand to see if there's any way we can get help here on the Thunderdome side so we don't need to rely on Bifrost deployments for the next release?
Just a meta-note/flag: as PL moves gateway traffic towards Saturn, we will lose the ability to dogfood Kubo unless we explicitly own a % of gateway traffic and route it to gateways backed by Kubo instead of Saturn.
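If we do end up needing to own a share of gateway traffic for dogfooding, the mechanics could be as simple as a weighted reverse proxy in front of the two backend pools. The sketch below is only an illustration: the backend addresses and the 5% share are made up.

```go
// split.go: a minimal sketch of routing a configurable share of gateway traffic to a
// Kubo-backed gateway pool and the rest to the default backend. Backend addresses and
// the traffic share are assumptions for illustration.
package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func mustProxy(raw string) *httputil.ReverseProxy {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	kubo := mustProxy("http://kubo-gateway.internal:8080")        // hypothetical Kubo-backed pool
	defaultBackend := mustProxy("http://saturn-gateway.internal") // hypothetical default backend
	const kuboShare = 0.05                                        // keep 5% of traffic on Kubo for dogfooding

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Randomly assign each request to one of the two pools according to the share.
		if rand.Float64() < kuboShare {
			kubo.ServeHTTP(w, r)
			return
		}
		defaultBackend.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```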
For the 0.19 release, the release engineer should work with @iand to run Thunderdome on Kubo to mimic the validation we perform on ipfs.io gateways. @iand has added a lot of documentation around this; it will probably take a release or two to work through the process, and once the manual process is working okay we can think about automating it.
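For reference, the simplest possible form of "mimic the validation we perform on ipfs.io gateways" is just replaying sampled gateway request paths against a release-candidate gateway and recording status codes and latency. The sketch below assumes a local RC gateway and a `sampled-paths.txt` file; it is not Thunderdome, only a baseline for comparison.

```go
// replay.go: a minimal sketch that replays gateway request paths against a
// release-candidate gateway and reports status codes and per-request latency.
// The paths file and gateway address are assumptions.
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	const rcGateway = "http://127.0.0.1:8080" // assumed local RC gateway address

	f, err := os.Open("sampled-paths.txt") // one request path per line, e.g. /ipfs/<cid>
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var requests, failures int
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		path := scanner.Text()
		start := time.Now()
		resp, err := http.Get(rcGateway + path)
		if err != nil {
			failures++
			fmt.Printf("ERR  %s: %v\n", path, err)
			continue
		}
		io.Copy(io.Discard, resp.Body) // drain the body so latency covers the full transfer
		resp.Body.Close()
		requests++
		if resp.StatusCode >= 500 {
			failures++
		}
		fmt.Printf("%d  %s  %s\n", resp.StatusCode, time.Since(start).Round(time.Millisecond), path)
	}
	fmt.Printf("done: %d requests, %d failures\n", requests, failures)
}
```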
Sorry I missed this earlier. I didn't have any specific Thunderdome "deficiencies" in mind. I was only speaking to the fact that it is not a "real" deployment that the end users interact with. So using it can give us more confidence but the final verification will still be happening only when the code reaches ipfs.io. I was really trying to gauge how far off we might be from setting up Continuous Deployment to ipfs.io. Given the answers, it seems to me we're not really there yet.
I strongly believe someone from the core Kubo maintainers team should drive the validation setup work. I'm happy to get involved and help out, but I think it'd be hard for me or Ian to definitively conclude what set of experiments would give Kubo maintainers enough confidence to proceed with a release.
I agree, but I would like to build a list of the kinds of validation we need to see. What are the success criteria? Also, Thunderdome can simulate the Bifrost environment fairly well and it captures the same metrics, but it's missing some diagnostics that the Kubo team might want. What level of logging is needed? Are there additional metrics we should add, or existing ones that should be tracked? I'm also thinking about things like taking profiles at various points, goroutine traces, and OpenTelemetry-style tracing.
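On the diagnostics side, Kubo already exposes pprof handlers on the RPC API port, so a run harness could periodically snapshot CPU profiles and goroutine dumps alongside the metrics. A minimal sketch, with the API address, polling interval, and file naming as assumptions:

```go
// snapshot.go: a minimal sketch that periodically captures a CPU profile and a
// goroutine dump from Kubo's debug endpoints on the RPC API port during a load run.
// The API address, interval, and output file naming are assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// fetch downloads one debug endpoint into a timestamped file in the current directory.
func fetch(api, endpoint, name string) error {
	resp, err := http.Get(api + endpoint)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	out, err := os.Create(fmt.Sprintf("%s-%s", name, time.Now().Format("20060102-150405")))
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	const api = "http://127.0.0.1:5001" // assumed Kubo RPC API address

	for {
		// 30-second CPU profile, followed by a full goroutine dump.
		if err := fetch(api, "/debug/pprof/profile?seconds=30", "cpu.pprof"); err != nil {
			fmt.Println("cpu profile:", err)
		}
		if err := fetch(api, "/debug/pprof/goroutine?debug=2", "goroutines.txt"); err != nil {
			fmt.Println("goroutine dump:", err)
		}
		time.Sleep(5 * time.Minute)
	}
}
```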
@BigLep Do we have a write-up of the testing process we should be following now?
Done Criteria
There is an agreed-upon plan (e.g., document, issue) with the Kubo maintainers for how we can adjust the release process so that our schedule is not dependent on other teams' schedules (in particular, the PL EngRes Bifrost team for ipfs.io gateway deployments). This plan should make clear what the acceptance criteria are and the specific steps we're going to take, so that engineers can pick it up and execute on it.
Why Important
Delays in the ipfs.io gateway deployment have been the main cause of release slips. As discussed during the 0.18 retrospective, the delay starts with the deploy slipping, which opens the door for other bug fixes, improvements, etc. to creep in, which can push the date out even further.
In addition, by not owning the production service that is using the software, maintainers are shielded from seeing firsthand how the software performs in production.
User/Customer
Kubo maintainers
Notes