The Elastic Agent upgrade process should not consider component state when deciding to roll back #2300

cmacknz · 2023-02-21T20:56:19Z

The Elastic Agent upgrade watcher currently considers the state of the agent itself and each started component process when deciding if the upgrade was successful:

elastic-agent/internal/pkg/agent/application/upgrade/error_checker.go

Lines 81 to 90 in c097697

    
    if state.State == client.Failed {
    	ch.log.Error("error checker notifying failure of agent")
    	ch.notifyChan <- ErrAgentStatusFailed
    }
    for _, comp := range state.Components {
    	if comp.State == client.Failed {
    		err = multierror.Append(err, errors.New(fmt.Sprintf("component %s[%v] failed: %s", comp.Name, comp.ID, comp.Message)))
    	}
    }

The agent upgrade watcher should stop considering the state of component processes when deciding whether it should roll back the upgrade. There are multiple reasons for this:

The agent should not trust component processes to be well behaved at all times. Components have and will continue to have unexpected runtime errors and panics that may be transient and do not indicate that the upgrade itself failed.
Any problem causing a component to fail at startup is a bug that we would want to address immediately. By automatically rolling back the upgrade we are creating the need for two investigations instead of one. We need an investigation into why the upgrade failed, followed by an investigation into why the component failed. It is much simpler to present the component failure immediately after upgrade.

It should still be possible to easily roll back an upgrade; it should just not be done automatically based on a brief sampling of the component state.

We need elastic/kibana#172745 to be completed first.

cmacknz · 2023-02-27T15:23:40Z

I think we need to properly support downgrades in the Fleet UI before we can implement this.

cmacknz · 2023-02-28T13:33:21Z

We need elastic/kibana#172745 to be completed first.

blakerouse · 2023-04-18T14:36:11Z

> The agent should not trust component processes to be well behaved at all times. Components have and will continue to have unexpected runtime errors and panics that may be transient and do not indicate that the upgrade itself failed.

Do we really want that? That means that if you upgrade the Elastic Agent and, say, Endpoint Security is broken, it will remain broken and will not be rolled back automatically. I understand that with this change it's less likely that Elastic Agent will be blamed for a bad upgrade, but I don't know if we necessarily want to make rollback a manual process.

cmacknz · 2023-04-19T01:27:53Z

> Do we really want that?

I'm no longer convinced we want this. A better solution is to make it much more obvious that an upgrade has rolled back, along with the reason why. I'm going to close this.

cmacknz added the Team:Elastic-Agent and 8.8-candidate labels Feb 21, 2023

pierrehilbert assigned pchila Feb 28, 2023

cmacknz unassigned pchila Apr 19, 2023

cmacknz closed this as completed Apr 19, 2023
