The Elastic Agent upgrade process should not consider component state when deciding to roll back #2300

Closed
cmacknz opened this issue Feb 21, 2023 · 4 comments
Labels
8.8-candidate, Team:Elastic-Agent

Comments

@cmacknz
Member

cmacknz commented Feb 21, 2023

The Elastic Agent upgrade watcher currently considers the state of the agent itself and each started component process when deciding if the upgrade was successful:

if state.State == client.Failed {
	ch.log.Error("error checker notifying failure of agent")
	ch.notifyChan <- ErrAgentStatusFailed
}
for _, comp := range state.Components {
	if comp.State == client.Failed {
		err = multierror.Append(err, errors.New(fmt.Sprintf("component %s[%v] failed: %s", comp.Name, comp.ID, comp.Message)))
	}
}

The agent upgrade watcher should stop considering the state of component processes when deciding whether it should roll back the upgrade. There are multiple reasons for this:

  1. The agent should not trust component processes to be well behaved at all times. Components have had, and will continue to have, unexpected runtime errors and panics that may be transient and do not indicate that the upgrade itself failed.
  2. Any problem causing a component to fail at startup is a bug that we would want to address immediately. By automatically rolling back the upgrade we are creating the need for two investigations instead of one. We need an investigation into why the upgrade failed, followed by an investigation into why the component failed. It is much simpler to present the component failure immediately after upgrade.

It should still be possible to easily roll back an upgrade; it should just not be done automatically based on a brief sampling of the component state.

We need elastic/kibana#172745 to be completed first.

@cmacknz added the Team:Elastic-Agent and 8.8-candidate labels Feb 21, 2023
@cmacknz
Member Author

cmacknz commented Feb 27, 2023

I think we need to properly support downgrades in the Fleet UI before we can implement this.

@cmacknz
Member Author

cmacknz commented Feb 28, 2023

We need elastic/kibana#172745 to be completed first.

@blakerouse
Contributor

> The agent should not trust component processes to be well behaved at all times. Components have had, and will continue to have, unexpected runtime errors and panics that may be transient and do not indicate that the upgrade itself failed.

Do we really want that? That means that if you upgrade the Elastic Agent and, say, Endpoint Security is broken, it will remain broken and not be rolled back automatically. I understand that with this change it's less likely that Elastic Agent will be blamed for a bad upgrade, but I don't know if we necessarily want to make the rollback process a manual process.

@cmacknz
Member Author

cmacknz commented Apr 19, 2023

> Do we really want that?

I'm no longer convinced we want this; a better solution is to make it much more obvious that an upgrade has rolled back, along with the reason why. I'm going to close this.
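As a rough illustration of making a rollback and its reason more visible, here is a sketch of a record the watcher could persist and report after rolling back. `RollbackReport`, its fields, and `Summary` are hypothetical and do not correspond to anything in the agent today; the version strings and component ID in `main` are made-up example values.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// RollbackReport is a hypothetical record describing why an upgrade was
// rolled back, so the reason can be shown to the user afterwards.
type RollbackReport struct {
	FromVersion      string
	ToVersion        string
	RolledBackAt     time.Time
	Reason           string
	FailedComponents []string
}

// Summary renders a single human-readable line that states that a rollback
// happened, why, and which components (if any) were failing at the time.
func (r RollbackReport) Summary() string {
	msg := fmt.Sprintf("upgrade from %s to %s was rolled back: %s",
		r.FromVersion, r.ToVersion, r.Reason)
	if len(r.FailedComponents) > 0 {
		msg += fmt.Sprintf(" (failed components: %s)",
			strings.Join(r.FailedComponents, ", "))
	}
	return msg
}

func main() {
	report := RollbackReport{
		FromVersion:      "8.7.0", // example value
		ToVersion:        "8.8.0", // example value
		RolledBackAt:     time.Now(),
		Reason:           "agent reported failed status",
		FailedComponents: []string{"endpoint-default"},
	}
	fmt.Println(report.Summary())
}
```

Keeping the reason and the failed component list in one record would let the rollback stay automatic while still pointing the investigation at the component that failed.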

@cmacknz cmacknz closed this as completed Apr 19, 2023