[Fleet] Multi Cluster Fleet support - Phase 1: Global Visibility and Control, Local Data Plane #187129
Pinging @elastic/fleet (Team:Fleet)
@nimarezainia @kpollich I would like to clarify some of these requirements:
I'm wondering why we need this in Phase 1: shouldn't it be enough to take all clusters defined as Remote ES outputs and consider them all part of the multi-site deployment (as data clusters)? Do we see any use case where some clusters wouldn't be included in the data views? Though now that I'm reading the CCS docs, it seems CCS requires remote clusters to be set up, so setting up Remote ES outputs alone is not enough. I think we should somehow unify these two; it doesn't sound right to make users set up remote clusters in 2 places.
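For context, the remote cluster registration mentioned above is a separate, cluster-level step today, distinct from creating a Remote ES output in Fleet. A minimal sketch of that step, assuming proxy mode and a placeholder alias and address:

```
# Register the remote cluster on the cluster that will run CCS
# ("remote1" and the proxy address are placeholders).
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "remote1": {
          "mode": "proxy",
          "proxy_address": "remote-es.example.com:9400"
        }
      }
    }
  }
}

# Confirm the remote is connected before relying on CCS.
GET _remote/info
```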
I need to investigate this; it seems strange to me that CCS doesn't work in dashboards, as was noted here.
I think adding columns to the Agent list UI to show the remote outputs as data/monitoring outputs is doable, as we can take that from the config of the agent policy that the agent is enrolled in. There is more complexity in visualizing outputs per integration, as it means more than one remote ES output can be assigned per agent. We need UX design for this; I'm thinking of displaying multiple outputs in the table similar to the Reusable integration policies UX, with a popover that shows a list of integrations with assigned outputs.
This is doable with a dropdown filter on the Agent list UI, backed by a new API that returns the list of remote ES outputs used by the agents' agent policies, and another API that returns the agents for a selected remote ES output. Apart from this, Phase 1 doesn't seem to require changes from the agent management perspective, as all agents would be enrolled to one management cluster. Even with the additional on-prem Fleet Server model, it sounds like Fleet Server would be enrolled to the same management cluster, so it doesn't make a difference. One thing to confirm about Osquery/Endpoint actions: will we have all the data in the management cluster for these to work properly? Is there any data written by agents that these actions require but that would be sent to the data clusters?
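The proposed endpoints don't exist yet, but the existing Fleet APIs expose the building blocks. A rough sketch from Kibana Dev Tools (which can proxy Kibana APIs with the `kbn:` prefix); the policy id is a placeholder and the exact kuery field syntax may differ:

```
# List all outputs; remote ES outputs have "type": "remote_elasticsearch".
GET kbn:/api/fleet/outputs

# List agents enrolled in a given agent policy.
GET kbn:/api/fleet/agents?kuery=policy_id:"my-policy-id"
```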
+1 on trying to unify the remote ES output and cross-cluster UX. I think there will be a lot of duplication between those flows otherwise.
I originally was thinking of full flexibility for the user when I wrote that requirement, but I'm not sure that use case exists. I was thinking of something like a cluster used for CCR as an example; we may not want agents to have the option to write to that cluster. I'm curious to understand how the two workflows (CCS and Remote ES output) can be unified (or where the opportunity for that is). A Remote ES output is really just a pointer to a cluster that's been set up already. I think the remote cluster is set up once in both cases. Sounds like the efficiency here is to have CCS also enabled when our Fleet user is setting up the Remote ES output to an already created remote cluster.
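For reference, the "pointer" half looks roughly like this through the Fleet API. A sketch assuming a placeholder host and a service token created on the remote cluster:

```
POST kbn:/api/fleet/outputs
{
  "name": "remote1",
  "type": "remote_elasticsearch",
  "hosts": ["https://remote-es.example.com:443"],
  "secrets": {
    "service_token": "<service token from the remote cluster>"
  }
}
```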
These are great observations. Agreed, and clarifying some of them:
@juliaElastic if all agents are writing to a single management cluster, as you mentioned, all these "dot" indices would be built there. But this would be something we need to check with the team for sure.
A lot of these steps are unfortunately the laborious steps needed to set up a remote cluster; I'm not sure what we can do about that.
The user can also get this from the remote cluster's Fleet -> Settings tab, but I think this is a better place.
@juliaElastic Can this step be automated in any way? For example, if there was a user setting that enabled them to opt in to having the indices built on CCS results. We know all the remote cluster names; I don't know if we could do this via aliasing?
I'm not sure how users currently use these 2 features together. I mean, do they always set up remote clusters when they use Remote ES output? My thinking was this:
Now, we start to combine these features to set up sending data from agents to remote clusters, and then view the data collected by agents in a single dashboard using CCS. We could add a step in the Remote ES Output UI to also create the cross-cluster API key (similar to the instructions for the remote service token).
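A sketch of what that added step could generate, using the existing cross-cluster API key endpoint (the key name and index patterns here are illustrative):

```
# Run on the remote (data) cluster. The returned key is then stored under
# cluster.remote.<alias>.credentials in the management cluster's keystore.
POST /_security/cross_cluster/api_key
{
  "name": "fleet-ccs-remote1",
  "access": {
    "search": [
      { "names": ["logs-*", "metrics-*"] }
    ]
  }
}
```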
IMO this would be the preferred option if we do anything in this regard. Also keep in mind that not every user who is setting up a remote ES output is also doing CCS with another cluster. I'd be happy to keep this enhancement as a follow-up based on user feedback. Installation of integration assets on the remote ES is extremely valuable (the last requirement in that list).
I checked, and aliasing is not supported on remote indices: elastic/elasticsearch#43312. I think the CCS in dashboards could be automated like this:
So basically we would have to keep the index patterns in Kibana asset definitions in sync with the current remote clusters (those which opted in to CCS for integrations).
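A sketch of what that sync could do per asset, using the Kibana data views API (the data view id and cluster aliases are illustrative):

```
# Rewrite the data view's index pattern to include the opted-in remotes.
POST kbn:/api/data_views/data_view/<data-view-id>
{
  "data_view": {
    "title": "logs-*,remote1:logs-*,remote2:logs-*"
  }
}
```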
I added an implementation plan to the description based on the discussion so far.
Commented on the linked issue too, I think it can be done earlier/separately from the rest of the enhancements here, if we think it's the most valuable.
Added a comment to confirm here; I think it's likely we don't have to make any changes to these actions for them to work in Phase 1.
Because this is opt-in, it should be OK. Maybe we can consider this as a future enhancement. We could get InfoSec to provide feedback and see if this level of complication is worth it. Or do users prefer to build their dashboards using CCS?
@jkakavas Hey, could we get your feedback on the solution laid out in #187129 (comment)?
@aarju and @WiegerElastic, could we test a proposal with you on the topic of this issue, mainly the comment here? As discussed with you previously, we are looking at streamlining how our users, including Elastic, deploy Fleet in a multi-cluster scenario. As a user, would you be happy to create your own dashboards using CCS, OR do you see a need for Fleet to ensure that all indices in all the remote clusters are available for visualizations? The latter has costs associated with it, and we are reluctant to develop these features if it's not that valuable to our users. Basically, is it OK to expect users to build their own dashboards using CCS?
I don't know the exact numbers, but I think that unless users have the dashboard up all the time with the polling period set to refresh every few seconds, this cost should be pretty low. CCS only returns the results of the query and not all of the data, so the data-transfer cost of the queries is pretty small compared to all of the other costs.
I think that as a customer we would want some built-in dashboards that use CCS, to provide a starting point and an idea of what is possible, but with the ability to customize the dashboards and the data views feeding them. For example, several of our data views always provide the remote cluster name, because that type of data is stored on one or two specific remote clusters, so there is no need to query the other clusters that don't contain that data. In other cases, if we use the data view of …
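The scoping described here falls out of the CCS index pattern syntax: a pattern can target one remote, several, or all of them (the cluster alias below is illustrative):

```
# Query only the cluster that actually holds this data.
GET secdata:logs-endpoint.events.*/_search

# Query the local cluster plus every remote.
GET logs-*,*:logs-*/_search
```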
I'm not sure how the customization would work in combination with trying to keep the remote cluster prefixes in sync automatically.
Thanks @aarju. How do you do this today? Is it extremely arduous? More importantly, from Fleet's perspective, how do we go about choosing which data view needs to get this treatment?
At this time all of the monitoring dashboards are custom built. If you want to see some examples, there are dashboards built by the team that manages our giant o11y cluster.
I think this would be any data view that we are centrally managing via the CCS fleet. However, for us the only data views that are an issue right now are those where we can take actions with the integration from inside another app, such as Defend and OSQuery. Health monitoring of all of our fleets in a single central location would be nice, but the big challenge we are having right now is that we can't use several of the capabilities of our SIEM, because the Defend and OSQuery integrations are managed from a different cluster. If I have an alert in my SIEM, I can't use the 'Isolate Host' or OSQuery action from that cluster; I have to log into our Endpoint Fleet cluster and isolate it from there.
Could you give an example of a dashboard which already uses CCS in the overview cluster? While I see that actions across clusters would be higher priority, those are not in scope in the current Phase 1 issue.
@aarju For Defend and OSQuery, the actions will be taken care of by what Julia is proposing here in phase 1, where there exists a single management cluster. The actions rely on the "dot" indices, which will all be in one place in the Management Cluster. For agent health monitoring: we already have the option of sending monitoring data to a common cluster, which can be nominated by the user. What we want to figure out here is what (if anything) we should do about integration-specific data streams. Say you have nginx installed on agents across multiple sites/clusters and the user wants to see these in one consolidated dashboard. Should Fleet be creating this data view based on CCS (there would be many data streams, some of which the user probably wouldn't care about)? Or do we leave this part to the user and have them utilize CCS to build the dashboards they need, which probably scales better? Fleet will ensure that the integration is installed on all the remote clusters (including all its assets). Does this make sense?
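For the monitoring half, the nomination mentioned above can already be expressed on an agent policy. A sketch via the Fleet API (the policy and output ids are placeholders):

```
# Point a policy's agent monitoring data at the nominated remote output.
PUT kbn:/api/fleet/agent_policies/<policy-id>
{
  "name": "site-a-policy",
  "namespace": "default",
  "monitoring_enabled": ["logs", "metrics"],
  "monitoring_output_id": "<remote-output-id>"
}
```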
We'll definitely want to dedicate a lot of time to building out end-to-end tests with cross-cluster search for this work, which will be a substantial investment just to spin up appropriate environments in CI against which to run tests.
A globally distributed enterprise that operates in many regions will by definition have many data sources spread across those regions. Naturally it will be collecting and storing the data in clusters local to those regions. However, when it comes to analyzing that data for Security and Observability, it would rely heavily on cross-cluster technologies so that the collected data can be seen and operated on singularly (as though it were in a local cluster).
Fleet users with Elastic Agents deployed in many such regions currently don't have the ability to easily manage their deployment at a global level yet reap the benefits of having their data stored and handled locally. This issue is to track all the requirements for enabling Fleet in a multi-cluster deployment. The goal is to facilitate the deployment of Fleet in the manner described below.
In this deployment model, agents across all sites enroll to a single Management Cluster, where the `.fleet*` system indices are built. This will provide Global Control via Fleet in the Management Cluster.

In this model how do we perform:
(1) Agent Upgrade
(2) Adding Integrations to the Agent Policy
(3) Build user dashboards
(4) OSquery
Requirements
Implementation plan
Extend the index patterns in Kibana integration assets with remote cluster prefixes, e.g. `logs-*,remote1:logs-*`.
cc: @kpollich @cmacknz