Figure out noble upgrade cadence plan #7333

Open
legoktm opened this issue Nov 8, 2024 · 13 comments
Labels
needs/discussion (queued up for discussion at future team meeting), noble (Ubuntu Noble related work)

Comments

legoktm (Member) commented Nov 8, 2024

Description

Instead of upgrading every single instance at the exact same time (once we push a deb), I think it would be better to do some sort of staged rollout.

My proposal would be that on package upgrade, each instance generates a random number (1-5) and stores it somewhere. In theory, we've now split all the SecureDrop servers into 5 groups.

Then, in another file we ship with the package (possibly the upgrade script itself), we have a number we control. If we set it to 1, we'll upgrade ~20% of servers. Then we can do another deb package release to bump it to 2, upgrading ~40% of all servers, and so on.
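
For illustration, here is a minimal sketch of what that gate could look like. The 1-5 grouping and the shipped threshold come from the proposal above; the file paths, names, and entry point are hypothetical placeholders, not an actual implementation.

    # Sketch only: gate the noble upgrade on a persisted random group (1-5)
    # and a threshold value shipped in the deb. All paths are hypothetical.
    import os
    import random

    GROUP_FILE = "/var/lib/securedrop/upgrade-group"            # hypothetical
    THRESHOLD_FILE = "/usr/share/securedrop/upgrade-threshold"  # hypothetical, updated via deb releases

    def get_group() -> int:
        """Assign this instance to one of five groups, once, and persist it."""
        if os.path.exists(GROUP_FILE):
            with open(GROUP_FILE) as f:
                return int(f.read().strip())
        group = random.randint(1, 5)
        with open(GROUP_FILE, "w") as f:
            f.write(str(group))
        return group

    def upgrade_enabled() -> bool:
        """Run the upgrade only if the shipped threshold covers our group."""
        with open(THRESHOLD_FILE) as f:
            threshold = int(f.read().strip())
        return get_group() <= threshold

Raising the threshold from 1 through 5 across successive deb releases would then move roughly 20% more instances into the upgrade wave each time.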

I also think this mechanism should be tracked separately for app and mon: we should upgrade all mon servers to 100%, and then do all the app servers.

legoktm added the needs/discussion and noble labels Nov 8, 2024
legoktm (Member, Author) commented Nov 8, 2024

Also, for instances with hands-on administrators, we can give them a heads up and let them manually run the migration script before our auto/forced migration.

nathandyer (Contributor) commented Nov 8, 2024

(Early thoughts, not fully formed)

I like the idea of admins being in control of the migration, unless there's a situation where there's not a hands-on admin and we run up against the deadline.

What about an alternate approach that might look like:

  1. We select a hard deadline date for the auto upgrade (for argument's sake, March 1st). Prior to March 1st, Admins can manually kick off the upgrade from an Admin Workstation. The mechanism for which might be:

  2. We publish the package with the noble upgrade script on a new (temporary?) Apt server

  3. We update securedrop-admin on the Admin Workstations with a securedrop-admin noble-upgrade command, which essentially adds the new Apt server to the sources list on both app and mon (see the sketch after this list)

  4. Admins can manually update before the deadline date

  5. When the deadline happens, we promote the packages to the normal apt prod servers and "force" the update

  6. We retire the temporary Apt server
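
To make step 3 concrete, here is a rough sketch of the sources-list change a securedrop-admin noble-upgrade command might push to app and mon. The hostname, suite, and file path are made up for illustration; the real temporary apt server (if we go this route) would differ.

    # Sketch only: add a temporary apt source for the noble upgrade packages.
    # The hostname, suite, and file path below are hypothetical placeholders.
    SOURCES_FILE = "/etc/apt/sources.list.d/securedrop-noble-upgrade.list"
    ENTRY = "deb https://apt-noble-upgrade.example.org noble main\n"

    def add_temporary_source() -> None:
        """Write the temporary apt source; an apt update would follow."""
        with open(SOURCES_FILE, "w") as f:
            f.write(ENTRY)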

zenmonkeykstop (Contributor) commented:

Is the idea behind phasing it to have some level of feedback as to how it's going, and to not have a massive volume of support requests if things go wrong? If so (building on @nathandyer's proposal), we kind of get that already if we give folks the option to migrate ahead of time and get feedback from the first ones off the ice.

zenmonkeykstop (Contributor) commented:

We don't need temporary apt servers or anything though - we can ship the changes packaged as normal and just have EOL checks.

legoktm (Member, Author) commented Nov 8, 2024

> Is the idea behind phasing it to have some level of feedback as to how it's going, and to not have a massive volume of support requests if things go wrong?

Yes, and (if things go poorly) we shouldn't take down every single SecureDrop all at once.

legoktm moved this to In Progress in SecureDrop dev cycle Nov 12, 2024
legoktm (Member, Author) commented Nov 13, 2024

To merge Nathan's proposal with mine:

  1. We ship debs and admin workstation code that installs the upgrade scripts but doesn't do anything automatically.
  2. Admins can do ./securedrop-admin noble-upgrade until some set deadline. We recommend this, but don't require it.
  3. After the deadline passes, we push debs to enable the upgrade process to run automatically on mon servers. Depending on how many instances are left, we split this up into batches.
  4. Once we've finished mon servers, we repeat for app servers.

cfm (Member) commented Nov 13, 2024

It looks like APT has built-in support for phased updates. Could that work for us, so we don't have to implement it ourselves? (This might be worth looking into generally.)
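
For context, APT's phased updates work by publishing a Phased-Update-Percentage for a package version and having each machine make a deterministic, per-machine draw against it. The sketch below only illustrates that general idea; it is not apt's actual code, and the package, version, and machine-id values are placeholders.

    # Conceptual illustration of APT-style phasing: map (package, version,
    # machine-id) to a stable number in [0, 100) and take the update only if
    # it falls under the published Phased-Update-Percentage.
    import hashlib

    def in_phase(package: str, version: str, machine_id: str, percentage: int) -> bool:
        seed = f"{package}-{version}-{machine_id}".encode()
        draw = int.from_bytes(hashlib.sha256(seed).digest()[:8], "big") % 100
        return draw < percentage

    # With a percentage of 20, roughly 20% of machines (stable per machine-id)
    # would install the update.
    print(in_phase("securedrop-app-code", "2.x.y", "0123456789abcdef", 20))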

legoktm (Member, Author) commented Nov 13, 2024

Thanks for flagging that. Unfortunately, focal's apt doesn't support phasing, so it's not an option for us here, but it will become one once we do upgrade to noble, so let me file a separate task for that.

legoktm (Member, Author) commented Nov 13, 2024

One point made in today's team meeting is that the admin-instigated upgrade period will give us a good sense of how robust the upgrade process is and inform how important spreading out the upgrades is.

Another thing I clarified is that the point of having mon go before app is so that we have a consistent state to test against. I don't want both servers upgrading at the same time, in a weird, undefined, hard-to-test state. So one should go first, and then we upgrade the second. No strong opinion on whether it's app or mon, just that it's a defined order we can replicate during testing.

cfm (Member) commented Nov 14, 2024 via email

legoktm (Member, Author) commented Nov 14, 2024

> In any cloud deployment we would be able to stagger these, but our Ansible playbooks effectively run in parallel against the Application and Monitor Servers at each step.

To clarify, even for the manual administrator-initiated upgrade, I would still want to do them in series (mon first, then app).

> In the automatic scenario, how will we (and an administrator) know that a Monitor Server has been upgraded successfully? No "/metadata" endpoint to monitor there.

We/FPF will have no visibility into mon upgrades (maybe we can peek at apt-prod web request logs I guess).

For admins, we'll send some sort of message via OSSEC alerts (e.g. logger.error("mon server has been upgraded")).
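
As a sketch of that last point, the upgrade script could write its status to syslog, which OSSEC already watches and forwards to admins; the logger name and message text here are placeholders.

    # Sketch only: emit an upgrade-status line to syslog so OSSEC surfaces it
    # to admins as an alert. Logger name and wording are placeholders.
    import logging
    import logging.handlers

    log = logging.getLogger("securedrop-noble-upgrade")
    log.setLevel(logging.INFO)
    log.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))

    log.error("mon server has been upgraded to noble")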

zenmonkeykstop (Contributor) commented:

Doing it for the manual updates is probably easier; you can just modify Ansible's inventory. Though there might be some refactoring necessary if roles depend on info shared between app and mon.

But it still largely feels over-engineered for the automated case to me:

  • if the update fails on mon, admins should shut down app anyway, so downtime would be unavoidable.
  • if admins are just letting the updates run automatically, and it fails on mon, they won't get an OSSEC alert (because mon is down), so they won't know to investigate, and app will likely get updated automatically too (and might fail in the same way if, say, there's a hardware compatibility issue).

If we have effectively a single script for admins to run manually, and we push for those we're in contact with to do so, we'll have a lot of data and chances to observe the script behaviour before the automated run anyway.

As an aside, I am very leery of trying to infer stuff from apt repo stats:

  • we should be committing to removing that kind of metadata where we can,
  • it would likely be unreliable, and
  • it would likely not be useful for troubleshooting specific instances unless we get into inferring identities from IPs.

legoktm (Member, Author) commented Nov 15, 2024

> if admins are just letting the updates run automatically, and it fails on mon, they won't get an OSSEC alert (because mon is down), so they won't know to investigate, and app will likely get updated automatically too (and might fail in the same way if, say, there's a hardware compatibility issue).

That's a really good point and seems like a good rationale to do app before mon. If app fails, we can send OSSEC alerts via mon, and either it's down so we notice, or we can display something in the JI to further flag it for journalists/admins.

> If we have effectively a single script for admins to run manually, and we push for those we're in contact with to do so, we'll have a lot of data and chances to observe the script behaviour before the automated run anyway.
> <snip>
> But it still largely feels over-engineered for the automated case to me:

Which part do you think is over-engineered? Or: what would you want to do differently?

I think we have a different perspective/disagreement on how much we should be leaning into automatic vs. manually driven upgrades. My current perspective is that we should be making the auto upgrade more robust/feasible/safe/etc., even at the risk of overdoing it, since that places the cost on us rather than on administrators.
