Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

signpost: Avoid updating into nothing #444

Merged
merged 1 commit into from
Jan 14, 2020
Merged

signpost: Avoid updating into nothing #444

merged 1 commit into from
Jan 14, 2020

Conversation

sam-aws
Copy link
Contributor

@sam-aws sam-aws commented Oct 21, 2019

Issue #, if available:
N/A

Description of changes:
Add two new commands to signpost:

  • mark-inactive-valid, used to mark the inactive partition as having a
    potentially valid image but not for boot.
  • cancel-upgrade, which reverts the changes of upgrade-to-inactive.

In addition upgrade-to-inactive makes two initial checks; whether the
inactive partition has been marked valid with mark-inactive-valid, and
whether the inactive partition has already been marked for boot. The
first is to avoid accidentally booting into a blank partition, while the
second is a warning that makes it more clear that invoking
upgrade-to-inactive twice does not revert the change.

Signed-off-by: Samuel Mendoza-Jonas [email protected]

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sam-aws sam-aws requested a review from iliana October 21, 2019 20:41
Comment on lines +203 to +210
/// Sets 'tries left' to 1 on the inactive partition to represent a
/// potentially valid image, but does not change the priority.
/// **does not write to the disk**.
pub fn mark_inactive_valid(&mut self) {
let mut inactive_flags = self.gptprio(self.inactive());
inactive_flags.set_tries_left(1);
self.set_gptprio(self.inactive(), inactive_flags);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should do nothing if successful=true, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually looked at the definition of set_tries_left() and yes! We could change this to 2 instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-read this in the morning and maybe not; these are exclusive aren't they?

Comment on lines +225 to +232
ensure!(
inactive_flags.tries_left() > 0,
error::InactiveNotValid {
inactive: &self.os_disk
}
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a successful check here, too.

workspaces/updater/signpost/src/main.rs Show resolved Hide resolved
@tjkirch
Copy link
Contributor

tjkirch commented Oct 22, 2019

We're adding complexity here so I'd like to make sure I understand.

In addition upgrade-to-inactive makes two initial checks; whether the
inactive partition has been marked valid with mark-inactive-valid

Wouldn't upgrade-to-inactive imply that inactive should be marked as valid? Or check if looks basically OK, and only error if it not? If we're going to add steps, we should make sure that the distinction is both important and clear.

@sam-aws
Copy link
Contributor Author

sam-aws commented Oct 22, 2019

Wouldn't upgrade-to-inactive imply that inactive should be marked as valid? Or check if looks basically OK, and only error if it not? If we're going to add steps, we should make sure that the distinction is both important and clear.

Right, this adds the latter of those, checking if it's OK. What I'm trying to avoid here is a user running updog update-apply which is basically a call to upgrade-to-inactive and then rebooting into a blank or invalid Thar image. upgrade-to-inactive doesn't currently have a way of checking for that, and neither does Updog.

@sam-aws
Copy link
Contributor Author

sam-aws commented Oct 22, 2019

(Updated to remove a related //TODO)

@sam-aws sam-aws marked this pull request as ready for review November 5, 2019 21:02
@tjkirch tjkirch requested a review from iliana November 7, 2019 17:23
Copy link
Contributor

@tjkirch tjkirch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything wrong, but I also don't feel like I'm smart enough to understand what's going on. I think the interactions between our many commands and the multiple flags on multiple partitions are complex, and therefore likely to have unanticipated interactions, and this is our most critical code path. I think it'd be nice to reflect the state more clearly in fewer places; I'm not sure what to recommend without a better understanding. That said, I'm approving because I don't see any immediate issues and the experts like the flow, but I'd definitely recommend more documentation overall.

workspaces/updater/signpost/src/state.rs Show resolved Hide resolved
workspaces/updater/signpost/src/state.rs Outdated Show resolved Hide resolved
let mut inactive_flags = self.gptprio(self.inactive());
ensure!(
inactive_flags.priority() != 2 && !inactive_flags.successful(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we care whether inactive was successful? It seems like the priority check should be enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be related to my primary confusion about this PR. We can't actually validate the inactive partition, we can only check that it has some markers from the user, so in essence we're funneling them through a different process we believe to be more indicative of their intent. Is that a fair characterization?

Add two new commands to signpost:
- mark-inactive-valid, used to mark the inactive partition as having a
potentially valid image but not for boot.
- cancel-upgrade, which reverts the changes of upgrade-to-inactive.

In addition upgrade-to-inactive makes two initial checks; whether the
inactive partition has been marked valid with mark-inactive-valid, and
whether the inactive partition has already been marked for boot. The
first is to avoid accidentally booting into a blank partition, while the
second is a warning that makes it more clear that invoking
upgrade-to-inactive twice does not revert the change.

Signed-off-by: Samuel Mendoza-Jonas <[email protected]>
@sam-aws
Copy link
Contributor Author

sam-aws commented Nov 13, 2019

Updated with the above suggestions, and changing the first ensure!() in upgrade_to_inactive() to check that the flags have been cleared, not just that they haven't been set to upgrade.

@sam-aws
Copy link
Contributor Author

sam-aws commented Nov 13, 2019

I also did a quick write-up of how these changes fit into signpost/updog - it got a bit long but hopefully it makes it more clear what we're adding here while avoiding changing the update process. If it reads well enough I can look into folding it into some docs as well.

GPT flags

There are currently three attributes associated with a "side"; priority, tries-left, and a "successful" marker. These are distinct and don't overlap in the partition header.
Priority is the relative boot priority of the side.
Tries-left is the number of times a side may attempt to boot. If 0, side will not be chosen.
Successful is set true once a side has been successfully booted on (see issue #86)

Current commands/methods:

mark-successful

  • successful = true

clear-inactive - zeroes the inactive flags prior to writing / upgrading

  • inactive.priority = 0
  • inactive.tries_left = 0
  • inactive.successful = false

upgrade-to-inactive - sets the inactive side to boot next

  • inactive.priority = 2 (highest)
  • inactive.tries_left = 1
  • inactive.successful = false
  • active.priority = 1 (2nd)

rollback-to-inactive - sets a known-good inactive to boot next

  • if inactive.successful == true && priority > 0 && tries_left > 0:
    - inactive.priority = 2 (highest)
    - active.priority = 1 (2nd)

Commands/methods after changes:

mark-successful - no change
clear-inactive - no change
mark-inactive-valid - new, marks inactive as being potentially bootable

  • inactive.tries_left = 1

upgrade-to-inactive

  • as before, but checks for
    - inactive.priority == 0 && inactive.successful == false, ie. cleared flags except for..
    - inactive.tries > 0, ie. marked valid

cancel-upgrade - new, clear inactive flags and set active priority back to highest

  • inactive.priority = 0
  • inactive.tries_left = 0
  • inactive.successful = false
  • active.priority = 2

rollback-to-inactive - no change

How changes add safety without borking current behaviour

Previously upgrade-to-inactive unconditionally set the inactive side to be the next boot target. The new commit introduces mark-inactive-valid which sets inactive.tries_left to 1. This does not by itself mark the side ready for booting as the active partition will still have the higher priority. However upgrade-to-inactive now has a way to check that the inactive side has a good chance of booting into something, before giving it the higher priority.

Cancel-upgrade is a new command which clears the inactive side flags and restores the active side's priority. Previously there wasn't an obvious way to reverse the effect of upgrade-to-inactive.

How this affects updog / manual signpost

This avoids a scenario where the user runs updog update-apply by itself or after a failed updog update-image; the inactive side will have tries_left = 0 (because Updog calls clear-inactive before any changes are made) and signpost will refuse to upgrade to it. The changes in signpost and updog are completely transparent to the user.
Since mark_valid doesn't touch priority or successful, it won't cause rollback_to_inactive to accidentally rollback to this unknown side.

For example, trying to upgrade without having written an update:

bash-5.0# updog update-apply
Could not mark inactive partition for boot: Inactive partition /dev/nvme0n1 has not been marked valid for upgrade

Trying to mark for upgrade twice:

bash-5.0# updog update-image -i 0.1.6
Starting update to 0.1.6
Update applied: aws-k8s-0.1.6
bash-5.0# updog update-apply
bash-5.0# updog update-apply
Could not mark inactive partition for boot: Inactive partition /dev/nvme0n1 is already marked for upgrade

Cancel-upgrade shouldn't be part of any normal usage; it provides a way to back out if the user was using signpost directly and accidentally marked something undesirable for upgrade.

Cancelling the upgrade:

bash-5.0# signpost cancel-upgrade
bash-5.0# updog update-apply
Could not mark inactive partition for boot: Inactive partition /dev/nvme0n1 has not been marked valid for upgrade

@tjkirch
Copy link
Contributor

tjkirch commented Nov 14, 2019

upgrade-to-inactive now has a way to check that the inactive side has a good chance of booting into something

This is the piece I'm not sure I agree with. The only reason we believe it has a better chance of booting is that the user said so, but the user was already saying so by running upgrade-to-inactive. I don't know that there's any improvement there, because the number of commands is confusing, and I suspect not all users will understand the difference or the subtleties. (I don't think I do!)

I'm OK with this if you think it's clearer - I know I don't represent all users - but I'd like to see us consider alternate models that could reduce the number of commands. (Perhaps by hiding some of the partition flags; I'm not sure we need to represent everything so directly to the user.)

@sam-aws
Copy link
Contributor Author

sam-aws commented Nov 15, 2019

I don't know that there's any improvement there, because the number of commands is confusing ...

I agree, it took me a bit of staring at Signpost before I felt like I had a good handle on what's happening. Adding this extra command doesn't help in that regard, but on the other hand my interpretation of a signpost user is somewhere between "no-one, it's just an API Updog uses" to "a developer or advanced user trying to put out a fire".

In normal usage everything is done via updog, and in that regard nothing has visibly changed; Updog uses the new command internally to try and keep track of whether an updog update-image has occurred before a call to updog update-apply. One alternative would be for Updog to keep track of this itself but I'd prefer to avoid keeping state in Updog itself.

@iliana
Copy link
Contributor

iliana commented Nov 15, 2019

Adding this extra command doesn't help in that regard, but on the other hand my interpretation of a signpost user is somewhere between "no-one, it's just an API Updog uses" to "a developer or advanced user trying to put out a fire".

What if signpost is just a library + CLI utility for modifying the three fields on the boot partitions, and the logic for what the bits should actually be moves to updog?

@sam-aws
Copy link
Contributor Author

sam-aws commented Nov 18, 2019

What if signpost is just a library + CLI utility for modifying the three fields on the boot partitions, and the logic for what the bits should actually be moves to updog?

That could work, and give the user a single interface for these kinds of changes.

w.r.t this PR, the question appears to be whether introducing mark-inactive-valid is worth the amount of "safety" it provides, even though it's not a hard guarantee. Maybe moving this logic together into Updog could make the relationship to the fields more clear?

@tjkirch
Copy link
Contributor

tjkirch commented Dec 6, 2019

Has anyone had thoughts about alternative structures/workflows that could make this verification clearer, or tried changing signpost+updog per #444 (comment) ?

My approval stands, in case I'm alone in finding it confusing. Not sure if @iliana's stands after her comment about signpost+updog or if she'd prefer that change.

@iliana
Copy link
Contributor

iliana commented Dec 6, 2019

My approval stands but I'd appreciate an issue being opened to track the proposal of moving signpost-workflows into updog and turning signpost into a bit twiddling library/tool.

@iliana iliana merged commit 3498ca0 into develop Jan 14, 2020
@sam-aws sam-aws deleted the mark-valid branch January 15, 2020 01:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants