[Fleet] Interrupting Agent Updates #178735

Open
Harmlos opened this issue Mar 14, 2024 · 9 comments
Assignees
Labels
Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@Harmlos

Harmlos commented Mar 14, 2024

Describe the feature:

It is necessary to add the ability to cancel the update of one or all agents.

Describe a specific use case for the feature:

Sometimes, an issue arises where initiating updates for multiple agents leads to significant network bandwidth consumption. Agents attempt to update at different intervals, and they display an "updating" status until the update is installed.

Being able to cancel updates from the Fleet console will allow for better management of network load caused by agents and the occasional cancellation of erroneous actions.

@botelastic botelastic bot added the needs-team Issues missing a team label label Mar 14, 2024
@jughosta jughosta added the Team:Fleet Team label for Observability Data Collection Fleet team label Mar 14, 2024
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Mar 14, 2024
@nimarezainia
Contributor

@Harmlos how many agents do you typically upgrade at the same time?
You can define a time window over which the upgrades are scheduled, so that they are spread out and this network saturation is alleviated.

@Harmlos
Author

Harmlos commented Apr 4, 2024

@nimarezainia
The problem is that I don't know which network each computer belongs to. It could be either from the main office network or from a remote branch network, where 10 computers are connected to the office via a very weak channel. These are the peculiarities of the network architecture.

Considering the specifics of the operation, everything seems fine for the user and the company - email works.

Launching an update for even one agent affects the channel. Moreover, that single agent cannot complete the download.

And it does not fail just once: the same agent keeps retrying the download, which fully saturates the already weak channel.

There is no way to detect this state automatically; it only shows up when reviewing the events of the distribution server.

`192.168.67.149 - - [03/Apr/2024:13:53:29 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 209207041 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:13:57:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208322305 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:02:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 207945473 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:06:02 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 211287809 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:13:52 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208027393 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:17:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208109313 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:20:32 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 210009857 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:24:16 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 209583873 "-" "Beat elastic-agent v8.9.2"`
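
To make the retry pattern in the log above easier to spot without reading it by hand, here is a minimal sketch that counts how often each client fetched the same agent artifact. It assumes the distribution server writes the combined log format shown above; the function name `repeated_downloads` and the threshold of 3 are illustrative choices, not anything Fleet provides.

```python
import re
from collections import Counter

# One line of the combined log format, as in the excerpt above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def repeated_downloads(lines, threshold=3):
    """Return {(client_ip, path): count} for every client that requested the
    same elastic-agent artifact at least `threshold` times."""
    counts = Counter()
    for line in lines:
        # Strip whitespace and any stray backticks left over from pasting.
        match = LOG_RE.match(line.strip().strip("`"))
        if not match:
            continue
        if match["method"] == "GET" and "elastic-agent" in match["path"]:
            counts[(match["ip"], match["path"])] += 1
    return {key: n for key, n in counts.items() if n >= threshold}
```

Feeding it the eight lines above would report a single entry for 192.168.67.149 with a count of 8, which is the kind of signal that currently has to be found manually.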

@nimarezainia
Contributor

@Harmlos We can provide a way to cancel actions (an upgrade, for example) that haven't been acted on yet by the agent. This is mainly for cases where the admin realizes that there may be something wrong with the image and wants to stop its spread. However, for your use case I fear that this becomes a case of trial and error. You are planning on issuing upgrades to a block of agents, monitoring your networks, and then potentially cancelling. Then perhaps repeating the process. Seems disruptive.

Instead, pick a set number of agents and a large enough window, and know that Fleet will take care of upgrading them across that window of time. Much more deterministic. In addition, you have the ability to set a future time for this to happen (during a maintenance window, perhaps).
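
As a rough illustration of the windowed approach, the sketch below builds a request body for a bulk upgrade spread over a rollout window with an optional future start time. The field names (`agents`, `version`, `rollout_duration_seconds`, `start_time`) follow my understanding of the Fleet bulk upgrade API (`POST /api/fleet/agents/bulk_upgrade`) and should be checked against the API docs for your Kibana version; the helper name `bulk_upgrade_body` is made up for this example.

```python
from datetime import datetime, timezone


def bulk_upgrade_body(agent_ids, version, rollout_seconds, start_time=None):
    """Build a bulk-upgrade body that spreads upgrades over `rollout_seconds`.

    Field names are assumed from the Fleet bulk upgrade API; verify them
    for your version. `start_time` must be a timezone-aware datetime.
    """
    body = {
        "agents": list(agent_ids),
        "version": version,
        "rollout_duration_seconds": rollout_seconds,
    }
    if start_time is not None:
        # Serialize as a UTC ISO 8601 timestamp.
        body["start_time"] = start_time.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
    return body
```

For instance, `bulk_upgrade_body(ids, "8.12.2", 6 * 3600, start_time=...)` would describe a six-hour rollout that starts at the given time.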

@Harmlos
Author

Harmlos commented Apr 11, 2024

We are trying various methods to initiate agent updates. One of the options is to use a standalone script to retrieve a list of agents and sequentially start updates for them.

The problem is that if issues are detected on an agent, there is no way to stop the update. The only option is to restart the agent on the host to stop it from being stuck in the update status.

It would be very convenient to have the ability to cancel agent updates with a button or via an API, for example. In this case, it would be possible to describe the update logic in an external script, and only use the API functions to start or cancel updates in case of issues on the host.
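
To make the idea concrete, here is a sketch of what the external script's two calls could look like. It only builds the HTTP requests and does not send them. The upgrade path (`/api/fleet/agents/{agentId}/upgrade`) and cancel path (`/api/fleet/agents/actions/{actionId}/cancel`) are what I believe the Fleet API exposes, so treat them as assumptions to verify; the helper names, the API-key auth scheme, and the Kibana URL are placeholders. Cancelling this way would only help before the agent has picked up the action; stopping an upgrade that is already in progress is what this issue asks for.

```python
import json
import urllib.request


def fleet_request(kibana_url, path, api_key, body=None):
    """Build (but do not send) a POST request against the Fleet API."""
    return urllib.request.Request(
        url=kibana_url.rstrip("/") + path,
        data=json.dumps(body or {}).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "kbn-xsrf": "true",  # Kibana expects this header on write requests
            "Authorization": f"ApiKey {api_key}",
        },
    )


def upgrade_request(kibana_url, api_key, agent_id, version):
    # Assumed endpoint: start an upgrade for a single agent.
    path = f"/api/fleet/agents/{agent_id}/upgrade"
    return fleet_request(kibana_url, path, api_key, {"version": version})


def cancel_request(kibana_url, api_key, action_id):
    # Assumed endpoint: cancel a previously created action by its action ID.
    path = f"/api/fleet/agents/actions/{action_id}/cancel"
    return fleet_request(kibana_url, path, api_key)
```

A script would pass each request to `urllib.request.urlopen(...)`, watch the host, and issue the cancel request if something goes wrong.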

@juliaElastic
Contributor

juliaElastic commented Jan 3, 2025

Currently upgrades can only be cancelled before the agent receives the action (scheduled in the future).

We could send a cancel action to the agent from Fleet, and then the agent would stop the upgrade process and roll back to the initial version before the upgrade and reset the updating state.
I need help from the agent team to define how the cancellation would work internally in agent. cc @cmacknz @jlind23

For example, an existing cancel action looks like this. The data.target_id field contains the action ID of the upgrade action:

{
  "_index": ".fleet-actions-7",
  "_id": "63bbb752-c3ed-420f-977d-234dbf4c37af",
  "_score": 1,
  "_source": {
    "@timestamp": "2025-01-03T14:06:19.995Z",
    "expiration": "2025-02-08T14:06:08.633Z",
    "agents": [
      "9359e9e8-4219-4026-8268-6d12b2d8d416"
    ],
    "namespaces": [
      "default"
    ],
    "action_id": "7fcd8ade-31ad-4672-b0e6-f227152d0bab",
    "data": {
      "target_id": "e90de33b-bddf-42ed-a802-c1aca1ee0d91"
    },
    "type": "CANCEL",
    "traceparent": "00-53b8d9a2830a7109a5d24bbf941269fe-17b74f06a00c2d67-01"
  }
}

@cmacknz
Member

cmacknz commented Jan 3, 2025

The cancel action would work, but only makes sense for some of the earlier upgrade states.

          type: string
          enum:
            - UPG_REQUESTED
            - UPG_SCHEDULED
            - UPG_DOWNLOADING
            - UPG_EXTRACTING
            - UPG_REPLACING
            - UPG_RESTARTING
            - UPG_WATCHING

The cancel action can abort an upgrade that is in UPG_REQUESTED, UPG_SCHEDULED, or UPG_DOWNLOADING. Anything after that can't be interrupted safely (maybe extracting, but it happens so fast it'd be hard to do). I don't think cancelling an upgrade should roll back the version; we will have a separate mechanism to roll back with #172745.

A cancel should mean "abort an upgrade which has not happened yet". Once the upgrade has happened, which I'll define as the point where the new version is on disk and agent has started the work to switch to it, there is no cancelling. The mechanism to undo that is a rollback from that point onward.
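
The rule above can be sketched as a simple membership check over the upgrade states. This is only an illustration of the proposed semantics, not how the agent implements it; the names `UpgradeState`, `CANCELLABLE`, and `can_cancel` are hypothetical.

```python
from enum import Enum


class UpgradeState(Enum):
    # Declared in the order the upgrade passes through them.
    UPG_REQUESTED = "UPG_REQUESTED"
    UPG_SCHEDULED = "UPG_SCHEDULED"
    UPG_DOWNLOADING = "UPG_DOWNLOADING"
    UPG_EXTRACTING = "UPG_EXTRACTING"
    UPG_REPLACING = "UPG_REPLACING"
    UPG_RESTARTING = "UPG_RESTARTING"
    UPG_WATCHING = "UPG_WATCHING"


# States in which the upgrade "has not happened yet" and can still be aborted.
CANCELLABLE = frozenset(
    {
        UpgradeState.UPG_REQUESTED,
        UpgradeState.UPG_SCHEDULED,
        UpgradeState.UPG_DOWNLOADING,
    }
)


def can_cancel(state):
    """True if a cancel action may abort an upgrade in this state."""
    return state in CANCELLABLE
```

Anything for which `can_cancel` returns False would fall to the separate rollback mechanism rather than to cancellation.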

We do need to be careful about upgrade actions that were forwarded to endpoint-security as they cause it to unprotect itself when tamper protected. We have to investigate what endpoint does internally when we forward an action, and we possibly also need to start forwarding them the cancellation to ensure endpoint knows it won't be upgraded. The forwarding of the upgrade action for tamper protection happens quite early in the upgrade process.

Another quirk of a cancel action is that it is always received after an upgrade action, so there is always a chance the agent does the upgrade anyway. Perhaps it only got the cancel on the next checkin because of timing, or the upgrade completed too quickly.

It would be best if we could avoid dispatching known cancelled actions to agents at all, or guarantee that an upgrade and its cancellation always get delivered in the same checkin so agent can see that this was the case and not start the upgrade.
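
One way to picture the "same checkin" case: if an upgrade and its cancellation arrive in one batch, the cancelled action can be dropped before anything is dispatched. The sketch below assumes actions are dicts shaped like the `_source` of the cancel document shown earlier in this thread (`action_id`, `type`, `data.target_id`); the `"UPGRADE"` type string and the function name `drop_cancelled` are assumptions for illustration.

```python
def drop_cancelled(actions):
    """Remove actions that are targeted by a CANCEL in the same batch.

    The CANCEL actions themselves are kept so they can still be
    acknowledged; that is a choice made for this sketch.
    """
    cancelled_ids = {
        action["data"]["target_id"]
        for action in actions
        if action.get("type") == "CANCEL" and "target_id" in action.get("data", {})
    }
    return [
        action
        for action in actions
        if action.get("type") == "CANCEL"
        or action.get("action_id") not in cancelled_ids
    ]
```

This only covers the case where both actions land in the same batch; a cancel that arrives on a later checkin still races with an upgrade that has already started.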

There is enough nuance here that I think we should write a small RFC before starting implementation; it will help gather opinions.

@juliaElastic
Contributor

@kpollich Can we ask someone from the agent team to write up the RFC? I can help with the Fleet-related bits, but it seems like most of the nuance is on the agent side.

@jlind23
Contributor

jlind23 commented Jan 7, 2025

@pchila adding this to your plate for one of the upcoming sprints.


9 participants