[Fleet] Interrupting Agent Updates #178735
Comments
Pinging @elastic/fleet (Team:Fleet) |
@Harmlos how many agents do you typically upgrade at the same time? |
@nimarezainia Given how the site operates, everything normally works fine for the users and the company: email works. Launching an update for even one agent affects the channel. Moreover, even a single agent cannot complete the download. It does not fail just once: the same agent keeps retrying the download, which fully utilizes the already weak channel. There is no way to track this state automatically; it is visible only by reviewing the events of the distribution server:
`192.168.67.149 - - [03/Apr/2024:13:53:29 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 209207041 "-" "Beat elastic-agent v8.9.2"`
`192.168.67.149 - - [03/Apr/2024:13:57:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208322305 "-" "Beat elastic-agent v8.9.2"`
`192.168.67.149 - - [03/Apr/2024:14:02:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 207945473 "-" "Beat elastic-agent v8.9.2"`
`192.168.67.149 - - [03/Apr/2024:14:06:02 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 211287809 "-" "Beat elastic-agent v8.9.2"`
`192.168.67.149 - - [03/Apr/2024:14:13:52 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208027393 "-" "Beat elastic-agent v8.9.2"`
`192.168.67.149 - - [03/Apr/2024:14:17:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208109313 "-" "Beat elastic-agent v8.9.2"`
`192.168.67.149 - - [03/Apr/2024:14:20:32 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 210009857 "-" "Beat elastic-agent v8.9.2"`
`192.168.67.149 - - [03/Apr/2024:14:24:16 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 209583873 "-" "Beat elastic-agent v8.9.2"` |
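Until something like this is tracked by Fleet itself, the repeated downloads in the access log above could be flagged by a small script. The following is only a sketch, assuming the distribution server writes the combined log format shown in the comment; the function names and the threshold are hypothetical and not part of any Elastic tooling.

```python
import re
from collections import Counter

# Matches access-log lines in the format quoted in the comment above.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) (?P<size>\d+)'
)


def count_downloads(lines):
    """Count successful GET requests per (client IP, artifact path)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("status") == "200":
            counts[(match.group("ip"), match.group("path"))] += 1
    return counts


def repeated_downloads(lines, threshold=3):
    """Return the (IP, path) pairs fetched at least `threshold` times.

    The threshold is an arbitrary illustrative value: an agent that fetches
    the same artifact this often is probably stuck in a retry loop.
    """
    return {key: n for key, n in count_downloads(lines).items() if n >= threshold}


sample = [
    '192.168.67.149 - - [03/Apr/2024:13:53:29 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 209207041 "-" "Beat elastic-agent v8.9.2"',
    '192.168.67.149 - - [03/Apr/2024:13:57:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208322305 "-" "Beat elastic-agent v8.9.2"',
]
stuck = repeated_downloads(sample, threshold=2)
```

Run against the full log, this would report the single agent at 192.168.67.149 fetching the same archive eight times in about half an hour.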
@Harmlos We can provide a way to cancel actions (an upgrade, for example) that the agent hasn't acted on yet. This is mainly for cases where the admin realizes that there may be something wrong with the image and wants to stop its spread. However, for your use case I fear that this would become a case of trial and error. You are planning on issuing upgrades to a block of agents, monitoring your networks, and then potentially cancelling, then perhaps repeating the process. That seems disruptive. Instead, pick a set number of agents and a large enough window, and know that Fleet will take care of upgrading them across that window of time. That is much more deterministic. In addition, you have the ability to set a future time for this to happen (during a maintenance window, perhaps). |
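To illustrate why a fixed window is more predictable than issue-and-cancel, here is a minimal sketch of spreading a set of agents evenly across a window. It is not Fleet's actual scheduling algorithm; `spread_upgrades` and the agent IDs are hypothetical and only show the idea.

```python
from datetime import datetime, timedelta, timezone


def spread_upgrades(agent_ids, start, window):
    """Assign each agent an evenly spaced start time in [start, start + window).

    Illustrative only: with N agents, one upgrade starts every window / N,
    so the peak load on the link is known before anything is dispatched.
    """
    if not agent_ids:
        return []
    step = window / len(agent_ids)
    return [(agent_id, start + i * step) for i, agent_id in enumerate(agent_ids)]


# Hypothetical maintenance window: four agents over one hour.
window_start = datetime(2024, 4, 3, 22, 0, tzinfo=timezone.utc)
schedule = spread_upgrades(["agent-a", "agent-b", "agent-c", "agent-d"],
                           window_start, timedelta(hours=1))
```

With four agents and a one-hour window, one download starts every 15 minutes, which can be sized against the available bandwidth in advance.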
We are trying various methods to initiate agent updates. One of the options is to use a standalone script to retrieve a list of agents and sequentially start updates for them. The problem is that if issues are detected on an agent, there is no way to stop the update. The only option is to restart the agent on the host to stop it from being stuck in the update status. It would be very convenient to have the ability to cancel agent updates with a button or via an API, for example. In this case, it would be possible to describe the update logic in an external script, and only use the API functions to start or cancel updates in case of issues on the host. |
Currently upgrades can only be cancelled before the agent receives the action (scheduled in the future). We could send a cancel action to the agent from Fleet, and then the agent would stop the upgrade process and roll back to the initial version before the upgrade and reset the updating state. For example, an existing cancel action looks like this, the
|
The cancel action would work, but it only makes sense for some of the earlier upgrade states.
type: string
enum:
- UPG_REQUESTED
- UPG_SCHEDULED
- UPG_DOWNLOADING
- UPG_EXTRACTING
- UPG_REPLACING
- UPG_RESTARTING
- UPG_WATCHING
The cancel action can abort us from UPG_REQUESTED, UPG_SCHEDULED, and UPG_DOWNLOADING. Anything after that can't be interrupted safely (maybe extracting, but it happens so fast it'd be hard to do).
I don't think cancelling an upgrade should roll back the version; we will have a separate mechanism to roll back with #172745. A cancel should mean "abort an upgrade which has not happened yet". Once the upgrade has happened, which I'll define as the point where the new version is on disk and the agent has started the work to switch to it, there is no cancelling. The mechanism to undo that is a rollback from that point onward.
We do need to be careful about upgrade actions that were forwarded to endpoint-security, as they cause it to unprotect itself when tamper protected. We have to investigate what endpoint does internally when we forward an action, and we possibly also need to start forwarding the cancellation to it to ensure endpoint knows it won't be upgraded. The forwarding of the upgrade action for tamper protection happens quite early in the upgrade process.
Another quirk of a cancel action is that it is always received after an upgrade action, so there is always a chance the agent does the upgrade anyway. Perhaps it only got the cancel on the next checkin because of timing, or the upgrade completed too quickly. It would be best if we could avoid dispatching known cancelled actions to agents at all, or guarantee that an upgrade and its cancellation always get delivered in the same checkin so the agent can see that this was the case and not start the upgrade.
There is enough nuance here that I think we should write a small RFC before starting implementation; it will help gather opinions. |
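The two rules proposed above (cancel only in the early states, and never start an upgrade whose cancellation arrived in the same checkin) can be sketched as follows. This is an illustration of the proposal, not the agent's implementation: the state names come from the enum above, while the action dictionaries and their `id`, `type`, and `target_id` fields are assumptions made for the example.

```python
from enum import Enum


class UpgradeState(Enum):
    UPG_REQUESTED = "UPG_REQUESTED"
    UPG_SCHEDULED = "UPG_SCHEDULED"
    UPG_DOWNLOADING = "UPG_DOWNLOADING"
    UPG_EXTRACTING = "UPG_EXTRACTING"
    UPG_REPLACING = "UPG_REPLACING"
    UPG_RESTARTING = "UPG_RESTARTING"
    UPG_WATCHING = "UPG_WATCHING"


# States the comment above considers safe to abort from.
CANCELLABLE = frozenset({
    UpgradeState.UPG_REQUESTED,
    UpgradeState.UPG_SCHEDULED,
    UpgradeState.UPG_DOWNLOADING,
})


def can_cancel(state):
    """True if an upgrade in `state` could still be aborted without a rollback."""
    return state in CANCELLABLE


def drop_cancelled_upgrades(actions):
    """Remove upgrade actions whose cancellation arrived in the same checkin.

    `actions` is a list of dicts with hypothetical fields: `id`, `type`
    ("UPGRADE" or "CANCEL"), and, for cancels, `target_id`.
    """
    cancelled = {a["target_id"] for a in actions if a["type"] == "CANCEL"}
    return [a for a in actions
            if not (a["type"] == "UPGRADE" and a["id"] in cancelled)]


checkin = [
    {"id": "u1", "type": "UPGRADE"},
    {"id": "u2", "type": "UPGRADE"},
    {"id": "c1", "type": "CANCEL", "target_id": "u1"},
]
remaining = drop_cancelled_upgrades(checkin)
```

In this sketch the cancel action itself stays in the list so it can still be acknowledged; only the upgrade it targets is dropped before the agent starts work.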
@kpollich Can we ask someone from the agent team to write up the RFC? I can help with the Fleet related bits but it seems like most of the nuance is on agent side. |
@pchila adding this to your plate for one of the upcoming sprints. |
Describe the feature:
It is necessary to add the ability to cancel the update of one or all agents.
Describe a specific use case for the feature:
Sometimes, an issue arises where initiating updates for multiple agents leads to significant network bandwidth consumption. Agents attempt to update at different intervals, and they display an "updating" status until the update is installed.
Being able to cancel updates from the Fleet console will allow for better management of network load caused by agents and the occasional cancellation of erroneous actions.