Figure out noble upgrade cadence plan #7333
Comments
Also, for instances with hands-on administrators, we can give them a heads-up and let them manually run the migration script before our auto/forced migration.
(Early thoughts, not fully formed) I like the idea of admins being in control of the migration, unless there's a situation where there's not a hands-on admin and we run up against the deadline. What about an alternate approach that might look like:
Is the idea behind phasing it to have some level of feedback as to how it's going, and to avoid a massive volume of support requests if things go wrong? If so (building on @nathandyer's proposal), we kind of get that already if we give folks the option to migrate ahead of time and get the feedback of the first ones off the ice.
We don't need temporary apt servers or anything, though - we can ship the changes packaged as normal and just have EOL checks.
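A minimal sketch of what such an EOL check could look like, assuming a hard-coded cutoff date (the date, paths, and logic here are illustrative only, not SecureDrop's actual check):

```python
# Illustrative sketch: gate the noble migration on an end-of-life check for the
# currently installed Ubuntu release. Not SecureDrop's actual implementation.
import datetime
from typing import Optional

FOCAL_EOL = datetime.date(2025, 5, 31)  # assumed cutoff for illustration
OS_RELEASE = "/etc/os-release"


def is_focal() -> bool:
    """Return True if the server is still running Ubuntu 20.04 (focal)."""
    with open(OS_RELEASE) as f:
        fields = dict(line.rstrip("\n").split("=", 1) for line in f if "=" in line)
    return fields.get("VERSION_CODENAME", "").strip('"') == "focal"


def should_trigger_upgrade(today: Optional[datetime.date] = None) -> bool:
    """Trigger the migration once a focal server is past the EOL cutoff."""
    today = today or datetime.date.today()
    return is_focal() and today >= FOCAL_EOL
```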
Yes, and (if things go poorly) we shouldn't take down every single SecureDrop instance all at once.
To merge Nathan's proposal with mine:
It looks like APT has built-in support for phased updates. Could that work for us, so we don't have to implement it ourselves? (This might be worth looking into generally.)
Thanks for flagging that. Unfortunately, focal's apt doesn't support phasing, so it's not an option for us here, but it will become an option once we do upgrade to noble, so let me file a separate task for that.
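As a rough illustration of the concept (a simplified model only, not apt's actual algorithm; the function names are hypothetical): each machine derives a stable per-package, per-version value and compares it against the percentage published by the archive in the `Phased-Update-Percentage` field.

```python
# Simplified model of apt-style phased updates: a deterministic per-machine,
# per-version value in [0, 100) is compared against the archive's published
# phased percentage. This is a sketch of the idea, not apt's real implementation.
import hashlib


def phasing_value(source: str, version: str, machine_id: str) -> int:
    """Deterministically map (package, version, machine) to an integer in [0, 100)."""
    seed = f"{source}-{version}-{machine_id}".encode()
    return int.from_bytes(hashlib.sha256(seed).digest()[:8], "big") % 100


def update_is_phased_in(source: str, version: str, machine_id: str, percentage: int) -> bool:
    """True if this machine falls inside the currently published phased percentage."""
    return phasing_value(source, version, machine_id) < percentage
```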
One point made in today's team meeting is that the admin-instigated upgrade period will give us a good sense of how robust the upgrade process is and inform how important spreading out the upgrades is. Another thing I clarified is that the point of having mon go before app is so that we have a consistent state to test against. I don't want both servers upgrading at the same time, in a weird undefined/hard-to-test state. So one should go first, and then we upgrade the second. No strong opinion on whether it's app or mon, but it should be a defined order we can replicate during testing.
On Wed, Nov 13, 2024 at 12:31:42PM -0800, Kunal Mehta wrote:
> Another thing I clarified is that the point of having mon go before
> app is so that we have a consistent state to test against. I don't
> want both servers upgrading at the same time, in a weird
> undefined/hard to test state. So one should go first, and then we
> upgrade the second. No strong opinion on whether it's app or mon, but
> that it's a defined order we can replicate during testing.
I agree, @legoktm. In any cloud deployment we would be able to stagger these, but our Ansible playbooks effectively run against the Application and Monitor Servers in parallel at each step.

In the automatic scenario, how will we (and an administrator) know that a Monitor Server has been upgraded successfully? There's no "/metadata" endpoint to monitor there.
To clarify, even for the manual administrator-initiated upgrade, I would still want to do them in series (mon first, then app).
We/FPF will have no visibility into mon upgrades (maybe we can peek at apt-prod web request logs, I guess). For admins we'll send some sort of message via OSSEC alerts (i.e.
Doing it for the manual updates is probably easier; you can just modify Ansible's inventory, though there might be some refactoring necessary if roles depend on info shared between app and mon. But it still largely feels over-engineered for the automated case to me:
If we effectively have a single script for admins to run manually, and we push for those we're in contact with to run it, we'll have a lot of data and chances to observe the script's behaviour before the automated run anyway. As an aside, I am very leery of trying to infer stuff from apt repo stats:
That's a really good point and seems like a good rationale to do app before mon. If app fails, we can send OSSEC alerts via mon, and either the app server is down (so we notice) or we can display something in the JI to further flag it for journalists/admins.
Which part do you think is over-engineered? Or: what would you want to do differently? I think we have a different perspective/disagreement on how much we should be leaning into automatic vs. manually driven upgrades. My current perspective is that we should be making the auto upgrade more robust/feasible/safe/etc., even at the risk of overdoing it, because it places the cost on us rather than on administrators.
Description
Instead of upgrading every single instance at the exact same time (once we push a deb), I think it would be better to do some sort of staged rollout.
My proposal would be that on package upgrade, each instance generates a random number (1-5) and stores it somewhere. In theory we've now split all the SecureDrop servers into 5 groups.
Then, in another file we ship with the package (possibly the upgrade script itself), we have a number we control. If we set it to 1, we'll upgrade ~20% of servers. Then we can do another deb package release to bump it to 2 and upgrade ~40% of all servers. And so on.
I also think this mechanism should be split for both app and mon. We should upgrade all mon servers to 100% and then do all the app servers.
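A minimal sketch of what this grouping/threshold mechanism could look like (the file paths and names are hypothetical, purely to illustrate the idea):

```python
# Rough sketch of the proposed staged rollout: each instance picks a persistent
# random group in 1-5, and the package ships a threshold; the migration only
# runs once the instance's group is <= the threshold. Paths are hypothetical.
import random
from pathlib import Path

GROUP_FILE = Path("/var/lib/securedrop/upgrade-group")            # hypothetical path
THRESHOLD_FILE = Path("/usr/share/securedrop/upgrade-threshold")  # shipped with the deb


def instance_group() -> int:
    """Return this instance's persistent group (1-5), generating it on first run."""
    if GROUP_FILE.exists():
        return int(GROUP_FILE.read_text().strip())
    group = random.randint(1, 5)
    GROUP_FILE.write_text(f"{group}\n")
    return group


def migration_enabled() -> bool:
    """Run the migration only when our group is within the shipped threshold."""
    threshold = int(THRESHOLD_FILE.read_text().strip())  # 0 disables, 5 = everyone
    return instance_group() <= threshold
```

The app/mon split could then be handled by shipping separate threshold values per server role, and only raising app's threshold once mon's has reached 5.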