Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1907333: daemon: Move rollback removal to update loop #2297

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 0 additions & 4 deletions pkg/daemon/daemon.go
Original file line number Diff line number Diff line change
Expand Up @@ -1029,10 +1029,6 @@ func (dn *Daemon) checkStateOnFirstRun() error {
return fmt.Errorf("error detecting previous SSH accesses: %v", err)
}

if err := dn.removeRollback(); err != nil {
return errors.Wrapf(err, "Failed to remove rollback")
}

// Bootstrapping state is when we have the node annotations file
if state.bootstrapping {
targetOSImageURL := state.currentConfig.Spec.OSImageURL
Expand Down
11 changes: 8 additions & 3 deletions pkg/daemon/update.go
Original file line number Diff line number Diff line change
Expand Up @@ -422,9 +422,14 @@ func (dn *Daemon) applyOSChanges(oldConfig, newConfig *mcfgv1.MachineConfig) (re
}

// Update OS
if err := dn.updateOS(newConfig, osImageContentDir); err != nil {
MCDPivotErr.WithLabelValues(dn.node.Name, newConfig.Spec.OSImageURL, err.Error()).SetToCurrentTime()
return err
if mcDiff.osUpdate {
if err := dn.updateOS(newConfig, osImageContentDir); err != nil {
MCDPivotErr.WithLabelValues(dn.node.Name, newConfig.Spec.OSImageURL, err.Error()).SetToCurrentTime()
return err
}
if err := dn.removeRollback(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be problematic where updated OS doesn't boot fine. Admin will no longer be able to manually rollback which I think can be the case in disconnected install.

I was thinking maybe we can keep the previous code and just ignore the error when deleting rollback failed? Keeping rollback deployment is anyway not a concern other than it provides additional free space.

Copy link
Member Author

@cgwalters cgwalters Dec 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is a concern. That said, OpenShift doesn't really support rollbacks in general, and even specifically for the MCO doing it will trigger issues related to #1190 - basically the MCO will go degraded.

We don't really have a strong story for e.g. "The new kernel in 4.6.X broke booting on this type of hardware" other than "pause the update and wait for a fixed kernel to arrive and then update".

I was thinking maybe we can keep the previous code and just ignore the error when deleting rollback failed?

The problem with that is that if we e.g. just log errors we have no real visibility into failures (though I guess we could add an alert) and:

Keeping rollback deployment is anyway not a concern other than it provides additional free space.

The original motivation for doing this is (very briefly) mentioned in the PR:
#2220

It's basically: compliance tooling that e.g. adjusts kernel parameters for security wants to avoid a system (accidentally) booting into a non-compliant configuration. If the rollback deployment is there it's an easy down-arrow away for an admin at the grub prompt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to clarify, this will be removing the rollback (current) deployment when there is an osupdate, during the update cycle of the MCD. If something fails below (e.g. updating kargs), we also call a removePendingDeployment. Would that remove both deployments from rpm-ostree? i.e. if we somehow fail there, would we be able to recover at all?

We also run an explicit defer of applyOSChanges to roll back to the previous deployment, maybe we should modify that code just in case something was hitting that before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that remove both deployments from rpm-ostree? i.e. if we somehow fail there, would we be able to recover at all?

There's actually three deployments potentially; "pending", "booted", "rollback". rpm-ostree will never let you remove the booted deployment. There's no API to do it in rpm-ostree, and multiple layers of safeguards against it (e.g. this code), in addition to the entire codebase being designed to avoid mutating the running system.

Does that address the concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, sorry I got the terms confused.

In this case wouldn't the rollback deployment being removed here get pushed out anyways once we update and reboot? Assuming nothing went wrong the booted becomes rollback, the pending becomes booted, and the rollback being removed here would be gone anyways?

In a firstboot scenario, this code also runs but fresh nodes would only have a booted deployment, and updates to a pending deployment via the firstboot systemd unit. There wouldn't be a rollback to remove there, so if there are security concerns the firstboot would linger as a rollback and doesn't get removed until a further update.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes you're totally right 😄 This won't have the intended effect indeed.

Blah...the architectural problem we have here is there's no "sync loop" in the MCD outside of trying to perform an update; the closest we have is a few ad-hoc goroutines like the ssh monitor.

return err
}
}

defer func() {
Expand Down