[Fleet] Interrupting Agent Updates #178735

Harmlos · 2024-03-14T14:46:35Z

Describe the feature:

It is necessary to add the ability to cancel the update of one or all agents.

Describe a specific use case for the feature:

Sometimes, an issue arises where initiating updates for multiple agents leads to significant network bandwidth consumption. Agents attempt to update at different intervals, and they display an "updating" status until the update is installed.

Being able to cancel updates from the Fleet console will allow for better management of network load caused by agents and the occasional cancellation of erroneous actions.

elasticmachine · 2024-03-14T18:09:51Z

Pinging @elastic/fleet (Team:Fleet)

nimarezainia · 2024-04-03T22:17:16Z

@Harmlos how many agents do you typically upgrade at the same time?
You do have the ability to define a window of time for the upgrade to be scheduled in so that the upgrades are spread out in order to alleviate this network saturation event.
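As a rough illustration of the scheduling option described above, the sketch below builds a request body that spreads a bulk upgrade over a time window. It assumes the Kibana Fleet `POST /api/fleet/agents/bulk_upgrade` endpoint and its `rollout_duration_seconds` and `start_time` fields; verify both against the API reference for your version. The helper name and the agent IDs are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def build_bulk_upgrade_body(
    agent_ids: Iterable[str],
    version: str,
    rollout_hours: float,
    start_time: Optional[datetime] = None,
) -> dict:
    """Build a bulk-upgrade body that spreads upgrades over a window.

    Field names are assumed from the Fleet bulk upgrade API, not taken
    from this thread; check them before use.
    """
    if rollout_hours <= 0:
        raise ValueError("rollout_hours must be positive")
    body = {
        "agents": list(agent_ids),
        "version": version,
        # Upgrades are spread across this many seconds.
        "rollout_duration_seconds": int(rollout_hours * 3600),
    }
    if start_time is not None:
        # Optional future start, e.g. the beginning of a maintenance window.
        utc = start_time.astimezone(timezone.utc)
        body["start_time"] = utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return body


# Hypothetical example: two agents, spread over a 6 hour window.
example = build_bulk_upgrade_body(
    ["agent-1", "agent-2"],
    "8.12.2",
    rollout_hours=6,
    start_time=datetime(2024, 4, 6, 1, 0, tzinfo=timezone.utc),
)
```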

Harmlos · 2024-04-04T13:57:01Z

@nimarezainia
The problem is that I don't know which network each computer belongs to. It could be either from the main office network or from a remote branch network, where 10 computers are connected to the office via a very weak channel. These are the peculiarities of the network architecture.

For normal day-to-day work this is fine for both the users and the company: email works.

Launching an update for even a single agent affects the channel. Moreover, even a single agent cannot complete the download.

After a failed download, the same agent keeps retrying, which fully saturates the already weak channel.

There is no possibility to track such a state automatically, only by reviewing the events of the distribution server.

    192.168.67.149 - - [03/Apr/2024:13:53:29 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 209207041 "-" "Beat elastic-agent v8.9.2"
    192.168.67.149 - - [03/Apr/2024:13:57:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208322305 "-" "Beat elastic-agent v8.9.2"
    192.168.67.149 - - [03/Apr/2024:14:02:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 207945473 "-" "Beat elastic-agent v8.9.2"
    192.168.67.149 - - [03/Apr/2024:14:06:02 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 211287809 "-" "Beat elastic-agent v8.9.2"
    192.168.67.149 - - [03/Apr/2024:14:13:52 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208027393 "-" "Beat elastic-agent v8.9.2"
    192.168.67.149 - - [03/Apr/2024:14:17:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208109313 "-" "Beat elastic-agent v8.9.2"
    192.168.67.149 - - [03/Apr/2024:14:20:32 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 210009857 "-" "Beat elastic-agent v8.9.2"
    192.168.67.149 - - [03/Apr/2024:14:24:16 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 209583873 "-" "Beat elastic-agent v8.9.2"
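In the log above, the same client requests the same archive eight times in roughly half an hour, each time with a different byte count. Since this state currently has to be spotted by hand, here is a minimal sketch of how such repeats could be flagged from the distribution server's access log. It assumes the combined log format shown above; the function name and the `min_attempts` threshold are hypothetical.

```python
import re
from collections import defaultdict

# Matches the leading fields of a combined-format access log line.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def find_repeated_downloads(lines, min_attempts=2):
    """Group requests by (client IP, path) and return groups that were
    requested at least `min_attempts` times, with timestamps and sizes."""
    attempts = defaultdict(list)
    for line in lines:
        # Tolerate stray whitespace or backticks left over from pasting.
        match = LOG_PATTERN.match(line.strip().strip("`"))
        if match is None:
            continue
        size = 0 if match["size"] == "-" else int(match["size"])
        attempts[(match["ip"], match["path"])].append((match["ts"], size))
    return {key: hits for key, hits in attempts.items() if len(hits) >= min_attempts}


sample = [
    '192.168.67.149 - - [03/Apr/2024:13:53:29 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 209207041 "-" "Beat elastic-agent v8.9.2"',
    '192.168.67.149 - - [03/Apr/2024:13:57:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208322305 "-" "Beat elastic-agent v8.9.2"',
]
repeats = find_repeated_downloads(sample)
```

Each key in `repeats` is a client and path pair that was fetched more than once, which is a reasonable signal that an agent is stuck retrying.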

nimarezainia · 2024-04-04T23:59:25Z

@Harmlos We can provide a way to cancel actions (upgrade as an example) that haven't been acted on yet by the agent. This is mainly for cases where the admin realizes that there may be something wrong with the image and wants to stop its spread. However, for your use case I fear that this becomes a case of trial and error. You are planning on issuing an upgrade to a block of agents, monitoring your networks, and then potentially cancelling. Then perhaps repeating the process. Seems disruptive.

Instead, pick a set number of agents across a large enough window, and know that Fleet will take care of upgrading them across that window of time. Much more deterministic. In addition, you have the ability to set a future time for this to happen (during a maintenance window, perhaps).

Harmlos · 2024-04-11T13:59:21Z

We are trying various methods to initiate agent updates. One of the options is to use a standalone script to retrieve a list of agents and sequentially start updates for them.

The problem is that if issues are detected on an agent, there is no way to stop the update. The only option is to restart the agent on the host to stop it from being stuck in the update status.

It would be very convenient to have the ability to cancel agent updates with a button or via an API, for example. In this case, it would be possible to describe the update logic in an external script, and only use the API functions to start or cancel updates in case of issues on the host.
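The external-script idea above could look roughly like the sketch below: upgrade agents one at a time and cancel as soon as one gets stuck. All three callables are hypothetical placeholders for API calls, and `cancel_upgrade` in particular stands for the capability requested in this issue, which may not be available for an upgrade already in progress. The `"updating"` status string is also an assumption.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def rolling_upgrade(
    agent_ids: Iterable[str],
    start_upgrade: Callable[[str], str],   # returns an action ID (hypothetical)
    get_status: Callable[[str], str],      # returns the agent status (hypothetical)
    cancel_upgrade: Callable[[str], None], # the capability requested here
    timeout_s: float = 1800,
    poll_s: float = 30,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], Optional[str]]:
    """Upgrade agents sequentially. If an agent is still updating after
    `timeout_s`, cancel its action and stop the rollout.

    Returns (agents that left the updating state, agent that timed out or None).
    """
    done: List[str] = []
    for agent_id in agent_ids:
        action_id = start_upgrade(agent_id)
        deadline = clock() + timeout_s
        while get_status(agent_id) == "updating":
            if clock() >= deadline:
                cancel_upgrade(action_id)
                return done, agent_id
            sleep(poll_s)
        done.append(agent_id)
    return done, None
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; a real script would also want to check that the agent reports the target version rather than only that it left the updating state.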

juliaElastic · 2025-01-03T13:56:24Z

Currently upgrades can only be cancelled before the agent receives the action (scheduled in the future).

We could send a cancel action to the agent from Fleet, and then the agent would stop the upgrade process and roll back to the initial version before the upgrade and reset the updating state.
I need help from the agent team to define how the cancellation would work internally in agent. cc @cmacknz @jlind23

For example, an existing cancel action looks like this; the `data.target_id` field contains the action ID of the upgrade action.

    {
      "_index": ".fleet-actions-7",
      "_id": "63bbb752-c3ed-420f-977d-234dbf4c37af",
      "_score": 1,
      "_source": {
        "@timestamp": "2025-01-03T14:06:19.995Z",
        "expiration": "2025-02-08T14:06:08.633Z",
        "agents": [
          "9359e9e8-4219-4026-8268-6d12b2d8d416"
        ],
        "namespaces": [
          "default"
        ],
        "action_id": "7fcd8ade-31ad-4672-b0e6-f227152d0bab",
        "data": {
          "target_id": "e90de33b-bddf-42ed-a802-c1aca1ee0d91"
        },
        "type": "CANCEL",
        "traceparent": "00-53b8d9a2830a7109a5d24bbf941269fe-17b74f06a00c2d67-01"
      }
    }
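To make the shape of that document concrete, here is a small sketch that assembles the `_source` part of a cancel action for a given upgrade action ID. It only mirrors the fields visible in the example above. The helper name, the 30-day expiration default, and the omission of `traceparent` are assumptions for illustration, and this is not meant to suggest how Fleet creates these documents internally.

```python
import uuid
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def _iso_millis(dt: datetime) -> str:
    """Format a datetime as UTC ISO 8601 with millisecond precision."""
    dt = dt.astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


def build_cancel_action(
    target_action_id: str,
    agent_ids: Iterable[str],
    namespaces: Iterable[str] = ("default",),
    ttl: timedelta = timedelta(days=30),  # assumed default, not from the source
    now: Optional[datetime] = None,
) -> dict:
    """Build the body of a CANCEL action that targets another action."""
    now = now or datetime.now(timezone.utc)
    return {
        "@timestamp": _iso_millis(now),
        "expiration": _iso_millis(now + ttl),
        "agents": list(agent_ids),
        "namespaces": list(namespaces),
        "action_id": str(uuid.uuid4()),
        # target_id is the action ID of the upgrade being cancelled.
        "data": {"target_id": target_action_id},
        "type": "CANCEL",
    }
```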

cmacknz · 2025-01-03T14:49:39Z

The cancel action would work, but only makes sense for some of the earlier upgrade states.

          type: string
          enum:
            - UPG_REQUESTED
            - UPG_SCHEDULED
            - UPG_DOWNLOADING
            - UPG_EXTRACTING
            - UPG_REPLACING
            - UPG_RESTARTING
            - UPG_WATCHING

The cancel action can abort us from UPG_REQUESTED, UPG_SCHEDULED, and UPG_DOWNLOADING. Anything after that can't be interrupted safely (maybe extracting, but it happens so fast it'd be hard to do). I don't think cancelling an upgrade should roll back the version; we will have a separate mechanism to roll back with #172745.

A cancel should mean "abort an upgrade which has not happened yet". Once the upgrade has happened, which I'll define as the point where the new version is on disk and agent has started the work to switch to it, there is no cancelling. The mechanism to undo that is a rollback from that point onward.
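The rule described above can be written down as a simple check. This is only a sketch of the proposal in this comment, not agent code: the state names come from the enum listed earlier, while the class and function names are hypothetical.

```python
from enum import Enum


class UpgradeState(str, Enum):
    REQUESTED = "UPG_REQUESTED"
    SCHEDULED = "UPG_SCHEDULED"
    DOWNLOADING = "UPG_DOWNLOADING"
    EXTRACTING = "UPG_EXTRACTING"
    REPLACING = "UPG_REPLACING"
    RESTARTING = "UPG_RESTARTING"
    WATCHING = "UPG_WATCHING"


# States from which the comment above says a cancel can safely abort.
CANCELLABLE_STATES = frozenset({
    UpgradeState.REQUESTED,
    UpgradeState.SCHEDULED,
    UpgradeState.DOWNLOADING,
})


def can_cancel(state: str) -> bool:
    """Return True if an upgrade in `state` could still be cancelled.

    Raises ValueError for a state name that is not in the enum.
    """
    return UpgradeState(state) in CANCELLABLE_STATES
```

Under this rule, `can_cancel("UPG_DOWNLOADING")` is true and `can_cancel("UPG_EXTRACTING")` is false, matching the "maybe extracting, but not safely" caveat.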

We do need to be careful about upgrade actions that were forwarded to endpoint-security as they cause it to unprotect itself when tamper protected. We have to investigate what endpoint does internally when we forward an action, and we possibly also need to start forwarding them the cancellation to ensure endpoint knows it won't be upgraded. The forwarding of the upgrade action for tamper protection happens quite early in the upgrade process.

Another quirk of a cancel action is that it is always received after an upgrade action, so there is always a chance the agent does the upgrade anyway. Perhaps it only got the cancel on the next checkin because of timing, or the upgrade completed too quickly.

It would be best if we could avoid dispatching known cancelled actions to agents at all, or guarantee that an upgrade and its cancellation always get delivered in the same checkin so agent can see that this was the case and not start the upgrade.
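If an upgrade and its cancellation do arrive in the same checkin, the filtering step could look like the sketch below. It assumes each action is a dict with `action_id`, `type`, and, for cancels, `data.target_id`, as in the cancel document shown earlier in the thread; the `"UPGRADE"` type string and the function name are assumptions.

```python
from typing import Dict, List


def drop_cancelled_actions(actions: List[Dict]) -> List[Dict]:
    """Remove actions that are targeted by a CANCEL in the same batch.

    CANCEL actions themselves are kept so they can still be processed
    and acknowledged; only their targets are dropped.
    """
    cancelled_ids = {
        action["data"]["target_id"]
        for action in actions
        if action.get("type") == "CANCEL"
        and "target_id" in action.get("data", {})
    }
    return [a for a in actions if a.get("action_id") not in cancelled_ids]


# Hypothetical checkin response: u1 is cancelled by c1, u2 is untouched.
batch = [
    {"action_id": "u1", "type": "UPGRADE"},
    {"action_id": "u2", "type": "UPGRADE"},
    {"action_id": "c1", "type": "CANCEL", "data": {"target_id": "u1"}},
]
remaining = drop_cancelled_actions(batch)
```

The filter does not depend on the order of actions within the batch, which matters because the cancel is always created after the upgrade it targets.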

There is enough nuance here that I think we should write a small RFC before starting implementation; it will help gather opinions.

juliaElastic · 2025-01-07T10:51:44Z

@kpollich Can we ask someone from the agent team to write up the RFC? I can help with the Fleet related bits but it seems like most of the nuance is on agent side.

jlind23 · 2025-01-07T14:58:37Z

@pchila adding this to your plate for one of the upcoming sprints.

botelastic bot added the needs-team Issues missing a team label label Mar 14, 2024

jughosta added the Team:Fleet Team label for Observability Data Collection Fleet team label Mar 14, 2024

botelastic bot removed the needs-team Issues missing a team label label Mar 14, 2024

jlind23 assigned nimarezainia Apr 3, 2024

nimarezainia removed their assignment Jun 24, 2024

nimarezainia mentioned this issue Oct 1, 2024

[Fleet] Allow users to cancel inactive unenrollment actions #189508

Open

2 tasks

kpollich assigned nchaulet and juliaElastic and unassigned nchaulet Nov 21, 2024

jlind23 assigned pchila Jan 7, 2025
