
Create focal -> noble upgrade script #7332

Open

legoktm opened this issue Nov 8, 2024 · 9 comments

Assignees: legoktm
Labels: noble (Ubuntu Noble related work)

Comments

legoktm (Member) commented Nov 8, 2024

Description

Part of #7211.

The workflow of the script will be as follows (a rough sketch of the apt steps appears after the list):

  • apply any pending updates and reboot
  • run migration check to make sure we're all set to go
  • stop apache2
  • take a backup
  • write /etc/securedrop-upgraded-from-focal marker file
  • check disk space, again??
  • raise OSSEC email_alert_level config to 15
  • disable unattended-upgrades
  • edit sources.list to point to noble
  • apt-get update
  • apt-get upgrade --without-new-pkgs
  • apt-get full-upgrade
  • re-enable unattended-upgrades
  • re-enable OSSEC(?)
  • reboot
  • some basic verification
  • start apache2
  • delete backup
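
Roughly, the apt portion of that list could look something like the sketch below. This is illustration only, not the real script: the helper name, the -y flag, and the non-interactive frontend are my own additions.

use std::process::Command;

/// Illustrative helper (not the real script): run one apt-get step
/// non-interactively and surface a readable error if it fails.
fn apt_step(args: &[&str]) -> Result<(), String> {
    let status = Command::new("apt-get")
        .args(args)
        .env("DEBIAN_FRONTEND", "noninteractive")
        .status()
        .map_err(|e| format!("failed to spawn apt-get: {e}"))?;
    if status.success() {
        Ok(())
    } else {
        Err(format!("apt-get {args:?} exited with {status}"))
    }
}

fn run_apt_phase() -> Result<(), String> {
    // Mirrors the list above: update, upgrade without new packages, full-upgrade.
    apt_step(&["update"])?;
    apt_step(&["upgrade", "--without-new-pkgs", "-y"])?;
    apt_step(&["full-upgrade", "-y"])?;
    Ok(())
}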

Everything the script does should be aggressively logged and each major step should trigger an OSSEC notification.

I think we can trigger the script from a systemd timer/service. Each step should be recorded in some kind of state file so it can recover and resume if interrupted.

We should also leave behind a marker file like /etc/securedrop-upgraded-from-focal so if we detect some bug in the future, we can conditionally apply logic based on whether it's a fresh noble install or an upgrade.
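
To make the resume idea concrete, here's a rough sketch of persisting progress to a state file plus writing the marker file. None of this is the real implementation: the stage names, the state file path, and the use of serde_json are placeholders.

use serde::{Deserialize, Serialize};
use std::fs;

// Placeholder state file path; only the marker file name comes from the text above.
const STATE_PATH: &str = "/etc/securedrop-noble-migration-state.json";
const MARKER_PATH: &str = "/etc/securedrop-upgraded-from-focal";

/// Placeholder stages, one per major step in the list above.
#[derive(Serialize, Deserialize, Debug, Clone, Copy)]
enum Stage {
    PendingUpdates,
    MigrationCheck,
    Backup,
    SuitesSwitched,
    AptUpgrade,
    Rebooted,
    Done,
}

#[derive(Serialize, Deserialize, Debug)]
struct State {
    stage: Stage,
}

/// If the state file is missing or unreadable, start from the beginning.
fn load_state() -> State {
    fs::read_to_string(STATE_PATH)
        .ok()
        .and_then(|s| serde_json::from_str(&s).ok())
        .unwrap_or(State { stage: Stage::PendingUpdates })
}

/// Persist after each completed stage so a reboot or crash resumes here.
fn save_state(state: &State) -> std::io::Result<()> {
    fs::write(STATE_PATH, serde_json::to_string(state).expect("state serializes"))
}

/// Leave a breadcrumb so future code can tell upgrades from fresh installs.
fn write_marker() -> std::io::Result<()> {
    fs::write(MARKER_PATH, "")
}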

legoktm added the noble (Ubuntu Noble related work) label Nov 8, 2024
legoktm self-assigned this Nov 8, 2024
legoktm moved this to In Progress in SecureDrop dev cycle Nov 8, 2024

legoktm (Member, Author) commented Nov 12, 2024

One question I've been debating in my head is whether to write the script in Rust or Python. The pros for Rust are pretty clear: this is a place where we want solid error handling and recovery. The cons are that I think it'll be a little harder to get reviewed and will add a bit more friction for others who want to contribute.

My plan is to start sketching out the script in Python and then see if it can do all the error handling stuff in a not-crazy way. I expect any Python<-->Rust porting to be trivial.

legoktm (Member, Author) commented Nov 12, 2024

> raise OSSEC email_alert_level config to 15

This is doable on the mon server, but it's going to be an issue on the app server.

zenmonkeykstop (Contributor) commented

Doesn't OSSEC have a message threshold anyway? Given this is a one-time event I think you could probably do without the alert level change.

legoktm (Member, Author) commented Nov 12, 2024

Yes, after it hits some limit it stops sending messages and just queues them up...which means there's a backlog of like 1000 emails and any new alerts (e.g. using the send OSSEC test button) don't get sent. When I upgraded the mon server I got ~27 emails before it stopped sending new ones. The queue seems to be stored in memory though, because I restarted the ossec-server service and then it dropped everything and started sending my newly triggered OSSEC test alerts.

I'll put it into the nice-to-have bucket for now; when we do test upgrades, we can see how bad the impact is and decide if we want something else.

legoktm (Member, Author) commented Nov 13, 2024

> One question I've been debating in my head is whether to write the script in Rust or Python. The pros for Rust are pretty clear: this is a place where we want solid error handling and recovery. The cons are that I think it'll be a little harder to get reviewed and will add a bit more friction for others who want to contribute.

Now that I've started writing the code, there's actually a bigger issue: we are uninstalling Python 3.8 and installing Python 3.12 during the migration. If something fails midway through, we could easily have a broken Python installation. A statically compiled Rust binary will avoid all of that.

legoktm (Member, Author) commented Nov 13, 2024

For reference: https://github.com/freedomofpress/securedrop/blob/6f5ef9e69fb1ac87ce4414e92bd1481347f11795/securedrop/debian/config/usr/bin/securedrop-noble-migration.py is what I had sketched out in Python. I'm going to redo that all in Rust now.

legoktm added a commit that referenced this issue Nov 15, 2024
The script is split into various stages where progress is tracked
on-disk. The script is able to resume where it was at any point, and
needs to, given multiple reboots in the middle.

Fixes #7332.

legoktm (Member, Author) commented Nov 15, 2024

And here's the Rust port: https://github.com/freedomofpress/securedrop/blob/94b84b7894d1bb6f21e93cd61fda3f793363dba8/noble-migration/src/bin/upgrade.rs

It's still very basic with lots of FIXMEs inline, but you can see the rough state machine and how it'll handle reboots, etc. Next is:

  • finishing said FIXMEs
  • implementing logging in an OSSEC-friendly way (i.e. error!("foo") should trigger an OSSEC notification); see the sketch after this list
  • error handling and identifying most likely failure points
  • trying it! and integrating it into the staging CI job
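
Rough sketch of the error!() -> OSSEC idea. Everything here is an assumption rather than a decision: it uses the log crate and forwards error-level lines to syslog by shelling out to logger(1), on the theory that OSSEC is already watching syslog.

use log::{Level, LevelFilter, Log, Metadata, Record};
use std::process::Command;

/// Illustrative logger: print everything, and additionally push error-level
/// messages into syslog (which OSSEC monitors) via logger(1).
struct OssecFriendlyLogger;

impl Log for OssecFriendlyLogger {
    fn enabled(&self, _metadata: &Metadata) -> bool {
        true
    }

    fn log(&self, record: &Record) {
        let line = format!("noble-migration [{}] {}", record.level(), record.args());
        println!("{line}");
        if record.level() == Level::Error {
            // Best effort: a failed notification shouldn't abort the upgrade.
            let _ = Command::new("logger").arg(&line).status();
        }
    }

    fn flush(&self) {}
}

static LOGGER: OssecFriendlyLogger = OssecFriendlyLogger;

fn init_logging() {
    log::set_logger(&LOGGER).expect("logger already initialized");
    log::set_max_level(LevelFilter::Info);
}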

legoktm (Member, Author) commented Nov 19, 2024

We should also shut down apache during the upgrade.
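
In the script that step would presumably just shell out to systemctl; a trivial sketch (the helper name is made up):

use std::io;
use std::process::Command;

/// Made-up helper: start/stop a systemd unit and fail if systemctl does.
fn systemctl(action: &str, unit: &str) -> io::Result<()> {
    let status = Command::new("systemctl").args([action, unit]).status()?;
    if status.success() {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::Other,
            format!("systemctl {action} {unit} failed: {status}"),
        ))
    }
}

// e.g. systemctl("stop", "apache2")? before the dist-upgrade,
// and systemctl("start", "apache2")? after verification.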

legoktm added a commit that referenced this issue Nov 20, 2024
The script is split into various stages where progress is tracked
on-disk. The script is able to resume where it was at any point, and
needs to, given multiple reboots in the middle.

The new noble-upgrade.json file shipped in the securedrop-config package
is used to control the upgrade process.

Fixes #7332.

legoktm (Member, Author) commented Nov 20, 2024

In the upgrade-script branch I've created a noble-migration.json file that contains the automated upgrade conditions:

{
    "app": {
        "enabled": false,
        "bucket": 0
    },
    "mon": {
        "enabled": false,
        "bucket": 0
    }
}

You can see how it's interpreted here:

fn should_upgrade(state: &State) -> Result<bool> {
    let config: UpgradeConfig = serde_json::from_str(
        &fs::read_to_string(CONFIG_PATH)
            .context("failed to read CONFIG_PATH")?,
    )
    .context("failed to deserialize CONFIG_PATH")?;
    let for_host = if is_mon_server() {
        &config.mon
    } else {
        &config.app
    };
    if !for_host.enabled {
        info!("Auto-upgrades are disabled");
        return Ok(false);
    }
    if for_host.bucket > state.bucket {
        info!(
            "Auto-upgrades are enabled, but our bucket hasn't been enabled yet"
        );
        return Ok(false);
    }
    Ok(true)
}

One thing I'm not sure of is how the script gets manually started by admins. My current thinking is that we have them edit this JSON file (via an ansible playbook). Then it'll get overridden by the new noble securedrop-config package, so maybe if we've already started the upgrade, we ignore this file and continue upgrading.
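
Sketch of that "ignore the file once we've started" idea, as a tweak to the should_upgrade() shown above; the started field and the config_allows_upgrade helper are made up:

fn should_upgrade(state: &State) -> Result<bool> {
    // Hypothetical: the on-disk state records that the first stage already ran.
    if state.started {
        // A later overwrite of noble-migration.json (e.g. by the new
        // securedrop-config package) shouldn't strand a host mid-upgrade.
        info!("Upgrade already in progress, ignoring noble-migration.json");
        return Ok(true);
    }
    // Otherwise fall through to the enabled/bucket checks shown above,
    // imagined here as a helper wrapping that logic.
    config_allows_upgrade(state)
}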

I have not tested the script yet but I think we're at the point where we're ready to do so. My plan is to add this on as an extra step to the existing focal staging job. This will mean writing an ansible playbook for molecule to execute, and it can be the same thing (hopefully) that ./securedrop-admin uses.

One "gotcha" I hit was that once we start the script, at some point it'll reboot on its own accord, and ansible will lose the connection and fail. I think we can do something like https://www.jeffgeerling.com/blog/2018/reboot-and-wait-reboot-complete-ansible-playbook to handle this.

Projects: SecureDrop dev cycle (Status: In Progress)
Development: No branches or pull requests
2 participants