
[Fleet] Multi Cluster Fleet support - Phase 1: Global Visibility and Control, Local Data Plane #187129

nimarezainia opened this issue Jun 28, 2024 · 24 comments

@nimarezainia (Contributor) commented Jun 28, 2024

A globally distributed enterprise that operates in many regions will, by definition, have many data sources spread across those regions. Naturally, it will collect and store the data in clusters local to those regions. However, when it comes to analyzing that data for Security and Observability, it relies heavily on cross-cluster technologies so that the collected data can be viewed and operated on as a whole (as though it were in a single local cluster).

Fleet users with Elastic Agents deployed across many such regions currently have no easy way to manage their deployment at a global level while still reaping the benefits of having their data stored and handled locally. This issue tracks the requirements for enabling Fleet in a multi-cluster deployment. The goal is to facilitate deploying Fleet in the manner shown below:

[diagram: proposed multi-cluster Fleet deployment]

In this deployment model:

  1. Elastic Agent check-ins are sent to the Management Cluster, where the .fleet* system indices are built. This provides global control via Fleet in the Management Cluster.
  2. By utilizing Cross-Cluster Search (CCS), dashboards can be built over the data streams from all the remote clusters, thereby providing global visibility.
  3. With a local data plane, integrations data ingested by the Elastic Agents is stored in the local cluster, avoiding extra cross-regional egress charges and, more importantly, abiding by local data sovereignty rules.

In this model, how do we perform the following operations?

(1) Agent Upgrade

  • The global Fleet UI enables the user to issue the upgrade command.
  • Actions are curated by the local Fleet Server and sent to individual agents.
  • Agents then reach out directly to the configured artifact repository to download their upgrade artifacts (a sketch of issuing the upgrade via the Fleet API follows this list).
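As an illustration of this flow, the upgrade command could be issued against the management cluster's Kibana using the Fleet bulk-upgrade endpoint. This is a minimal sketch; the Kibana URL and API key are placeholders, and the exact request fields should be checked against the Fleet API docs for your stack version.

```ts
// Minimal sketch: issue an upgrade action from the global Fleet API.
// KIBANA_URL and API_KEY are placeholders, not values from this issue.
const KIBANA_URL = 'https://management-cluster.example.com:5601';
const API_KEY = '<management-cluster-api-key>';

async function bulkUpgradeAgents(agentIds: string[], version: string): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/fleet/agents/bulk_upgrade`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'kbn-xsrf': 'true',
      Authorization: `ApiKey ${API_KEY}`,
    },
    // The resulting action is written to .fleet-actions in the management cluster;
    // the local Fleet Server delivers it, and agents fetch artifacts themselves.
    body: JSON.stringify({ agents: agentIds, version }),
  });
  if (!res.ok) throw new Error(`Upgrade request failed: ${res.status}`);
}

// Example: upgrade two agents to 8.15.0
// await bulkUpgradeAgents(['<agent-id-1>', '<agent-id-2>'], '8.15.0');
```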

(2) Adding Integrations to the Agent Policy

  • Integrations are added to the policy at the global Fleet level (a hedged API sketch follows this list).
  • The policy is then curated and, via the Fleet Server, distributed to all agents.
  • Agents enable the inputs defined by that integration.
  • NOTE: the local cluster will not have the integration assets and ingest pipelines installed; that will be a follow-up enhancement.
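For illustration, adding an integration to an agent policy maps to creating a package policy through the Fleet API. The sketch below uses the simplified package policy format with placeholder names and an assumed package version; verify the exact fields against the Fleet API docs.

```ts
// Minimal sketch: add the nginx integration to an existing agent policy.
// KIBANA_URL, API_KEY, the policy name, and the package version are placeholders.
const KIBANA_URL = 'https://management-cluster.example.com:5601';
const API_KEY = '<management-cluster-api-key>';

async function addNginxIntegration(agentPolicyId: string): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/fleet/package_policies`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'kbn-xsrf': 'true',
      Authorization: `ApiKey ${API_KEY}`,
    },
    body: JSON.stringify({
      name: 'nginx-multi-site',                      // hypothetical package policy name
      namespace: 'default',
      policy_id: agentPolicyId,                      // the globally managed agent policy
      package: { name: 'nginx', version: '1.19.1' }, // version is an assumption
      // Input defaults apply when inputs are omitted in the simplified format.
    }),
  });
  if (!res.ok) throw new Error(`Package policy creation failed: ${res.status}`);
}
```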

(3) Build user dashboards

  • Utilize CCS to query the data streams of interest and build user dashboards (a sketch of such a query follows this list).
  • Users can, for the most part, perform this step directly. Can it be optimized?
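As an example of the kind of query a CCS-backed dashboard would run, the sketch below searches the local and a remote cluster's log data streams from the management cluster. The node URL, the API key, and the remote alias remote_ecs (used later in this issue) are placeholders.

```ts
import { Client } from '@elastic/elasticsearch';

// Minimal sketch: cross-cluster search from the management cluster over local
// and remote data streams. Connection details are placeholders.
const es = new Client({
  node: 'https://management-cluster.example.com:9200',
  auth: { apiKey: '<management-cluster-api-key>' },
});

const result = await es.search({
  index: 'logs-*,remote_ecs:logs-*', // local data streams plus the remote cluster's
  size: 0,
  query: { range: { '@timestamp': { gte: 'now-15m' } } },
  aggs: { docs_per_index: { terms: { field: '_index' } } },
});
console.log(result.aggregations);
```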

(4) OSquery

  • The query is issued via .fleet-actions, and the response to the query is read from .fleet-actions-results (see the sketch after this list).
  • If these indices are available in the management cluster, the operator should be able to run an Osquery query.
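Assuming both system indices live in the management cluster, reading back the results of a given action could look like the sketch below. The field name action_id and the example ID are assumptions based on the description above, not a confirmed schema.

```ts
import { Client } from '@elastic/elasticsearch';

// Minimal sketch: look up the results of an Osquery action in the management cluster.
// The action_id value and field name are assumptions for illustration.
const es = new Client({
  node: 'https://management-cluster.example.com:9200',
  auth: { apiKey: '<management-cluster-api-key>' },
});

const results = await es.search({
  index: '.fleet-actions-results',
  query: { term: { action_id: '<osquery-action-id>' } },
});
console.log(results.hits.hits);
```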

Requirements


Implementation plan

cc: @kpollich @cmacknz

botelastic bot added the needs-team label on Jun 28, 2024
nimarezainia added the Team:Fleet label on Jun 28, 2024
@elasticmachine (Contributor):

Pinging @elastic/fleet (Team:Fleet)

botelastic bot removed the needs-team label on Jun 28, 2024
nimarezainia added the Feature:Fleet label on Jun 28, 2024
@nimarezainia (Contributor, Author):

Alternate deployment types that can be supported:

Additional clusters locally

[diagram: additional local clusters deployment]

On-prem Fleet Server

[diagram: on-prem Fleet Server deployment]

@juliaElastic (Contributor) commented Aug 8, 2024

@nimarezainia @kpollich I would like to clarify some of these requirements:

  • User should be able to nominate which clusters are members of a multi-site deployment.

I'm wondering why we need this in Phase 1, shouldn't it be enough to take all clusters defined as Remote ES outputs and consider them all part of the multi-site deployment (as data clusters)? Do we see any use case where some clusters wouldn't be included in the data views?

Though now that I am reading the CCS docs, it seems CCS requires remote clusters to be set up, so it is not enough to set up Remote ES outputs. I think we should somehow unify these two, it doesn't sound right to make users set up remote clusters in 2 places.

  • Data views to be dynamically modified based on the set of clusters nominated, to make operating this type of deployment easier.

I need to investigate this; it seems strange to me that CCS doesn't work in dashboards, as was noted here.
I also have to find out how we can modify the data views dynamically to include remote clusters in the search.

  • Fleet UI to show which clusters agents are writing data to. Perhaps as a separate/new column (or customizable columns where the user would add the information they are interested in). Allowing for filtering and better UX for users to quickly identify agents in remote clusters.

I think adding columns to the Agent list UI to show the remote outputs as data/monitoring outputs is doable, as we can take that from the agent policy config that the agent is enrolled to.

There is more complexity in visualizing outputs per integration, as it means more than one remote ES output can be assigned per agent. We need UX design for this; I'm thinking of displaying multiple outputs in the table similar to the Reusable integration policies UX, with a popover that shows a list of integrations with their assigned outputs.
In addition, the Agent details UI could show the output per integration too, in the tree view where we show the input health.

[screenshots: Agent list and Agent details UI references]
  • Fleet UI allows filtering based on the cluster.

This is doable with a dropdown filter on the Agent list UI with a new API that returns a list of remote ES outputs used by the agents' agent policies, and another API that returns agents for a selected remote ES output.

Apart from this, Phase 1 doesn't seem to require changes from the agent management perspective, as all agents would be enrolled to one management cluster. Even with the additional on-prem Fleet server model, it sounds like Fleet server would be enrolled to the same management cluster, so it doesn't make a difference.

One thing to confirm about Osquery/Endpoint actions: will we have all the data in the management cluster for these to work properly? Is any data written by agents required that would instead be sent to the data clusters?
I'm referring to the data in the indices here: https://github.com/elastic/ingest-dev/issues/3093#issuecomment-2102223497

@kpollich (Member) commented Aug 8, 2024

I'm wondering why we need this in Phase 1, shouldn't it be enough to take all clusters defined as Remote ES outputs and consider them all part of the multi-site deployment (as data clusters)? Do we see any use case where some clusters wouldn't be included in the data views?

Though now that I am reading the CCS docs, it seems CCS requires remote clusters to be set up, so it is not enough to set up Remote ES outputs. I think we should somehow unify these two, it doesn't sound right to make users set up remote clusters in 2 places.

+1 on trying to unify the remote ES output and cross-cluster UX. I think there will be a lot of duplication between those flows otherwise.

@juliaElastic (Contributor):

I managed to get CCS on dashboards working with data sent by an agent with a remote ES output, using 2 Elastic Cloud instances. I haven't yet been able to set up the Remote Cluster connection between 2 local instances, or between 1 local and 1 cloud instance, where the remote cluster is also used as a remote ES output.

Steps to set up between 2 Elastic Cloud instances:

  • Create 2 instances in staging, go to the Manage deployment / Security page of the cluster that will be used as the remote, and copy the Proxy address from the bottom of the page.
  • Open Kibana on the remote instance, create a cross-cluster API key, and copy the encoded value.
  • Go to the Manage deployment / Security page of the cluster that will be used as the management cluster, and add the API key under Remote connections / Trust management.
  • Open Kibana on the management cluster, go to Fleet settings, and add a Remote ES output using the Elasticsearch host of the remote cluster and a remote service token.
  • In the management cluster, add a Remote Cluster called remote_ecs using the Proxy address copied earlier from the remote cluster as host:port; the remote cluster should show as connected.
  • In the management cluster, create an agent policy and set the remote ES output as the integration and monitoring output. Enroll an agent to it (I used a docker command on my local machine).
  • In the management cluster, create a data view with index pattern remote_ecs:logs-*, and check in Discover that data shows up from the remote cluster (an API sketch of this and of the remote cluster registration follows these steps).
  • In the management cluster, copy the dashboard [Elastic Agent] Agent Info, then go to Saved Objects and export the copied dashboard.
  • Edit the exported dashboard and replace all logs-* occurrences with remote_ecs:logs-*.
  • Import the modified dashboard back into the management cluster.
  • Navigate to the modified dashboard and check that data is showing up. To make sure the data is coming from the agent sending data to the remote cluster, filter on agent.id:<agent_id>.
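For self-managed clusters, the remote cluster registration and the data view creation from these steps can also be done through APIs; on Elastic Cloud, the console steps above handle the trust setup instead. The sketch below uses placeholder hosts and keys, and the field names should be verified against the Elasticsearch and Kibana data views API docs for your version.

```ts
import { Client } from '@elastic/elasticsearch';

// Placeholders only; not values from this issue.
const KIBANA_URL = 'https://management-cluster.example.com:5601';
const API_KEY = '<management-cluster-api-key>';
const es = new Client({
  node: 'https://management-cluster.example.com:9200',
  auth: { apiKey: API_KEY },
});

// 1) Register the remote cluster (proxy mode) on the management cluster.
//    On Elastic Cloud this is done via the deployment Security page instead.
await es.cluster.putSettings({
  persistent: {
    cluster: {
      remote: {
        remote_ecs: {
          mode: 'proxy',
          proxy_address: 'remote-cluster-proxy.example.com:9400', // placeholder
        },
      },
    },
  },
});

// 2) Create a Kibana data view over the remote data streams.
await fetch(`${KIBANA_URL}/api/data_views/data_view`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'kbn-xsrf': 'true',
    Authorization: `ApiKey ${API_KEY}`,
  },
  body: JSON.stringify({
    data_view: {
      title: 'remote_ecs:logs-*',
      name: 'Remote ECS logs', // hypothetical display name
      timeFieldName: '@timestamp',
    },
  }),
});
```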
[screenshots: remote cluster setup, data view, and modified dashboard showing remote data]

@nimarezainia (Contributor, Author):

I'm wondering why we need this in Phase 1, shouldn't it be enough to take all clusters defined as Remote ES outputs and consider them all part of the multi-site deployment (as data clusters)? Do we see any use case where some clusters wouldn't be included in the data views?
Though now that I am reading the CCS docs, it seems CCS requires remote clusters to be set up, so it is not enough to set up Remote ES outputs. I think we should somehow unify these two, it doesn't sound right to make users set up remote clusters in 2 places.

+1 on trying to unify the remote ES output and cross-cluster UX. I think there will be a lot of duplication between those flows otherwise.

I was originally thinking of full flexibility for the user when I wrote that requirement, but I'm not sure that use case exists. I was thinking of something like a cluster used for CCR as an example; we may not want agents to have the option to write to that cluster.

I'm curious to understand how the two workflows (CCS and Remote ES output) can be unified (or where the opportunity for that is). A Remote ES output is really just a pointer to a cluster that has already been set up. I think the remote cluster is set up once in both cases.

Sounds like the efficiency here is to have CCS enabled as well when our Fleet user is setting up the Remote ES output to an already created remote cluster.

@nimarezainia (Contributor, Author):

  • Fleet UI to show which clusters agents are writing data to. Perhaps as a separate/new column (or customizable columns where the user would add the information they are interested in). Allowing for filtering and better UX for users to quickly identify agents in remote clusters.

I think adding columns to the Agent list UI to show the remote outputs as data/monitoring outputs is doable, as we can take that from the agent policy config that the agent is enrolled to.

There is more complexity to visualize outputs per integration, as it means more than one remote ES output can be assigned per agent. We need UX design for this, I'm thinking displaying multiple outputs in the table similar to the Reusable integration polices UX, with a popover that shows a list of integrations with assigned outputs. In addition, the Agent details UI could show the output per integration too, in the tree view where we should the input health.

These are great observations. Agree and clarifying on some:

  • Following the design pattern we have for Reusable integrations, where multiple outputs are shown, would be ideal; bonus if we can include the integration name in there.
  • I don't think we need the same treatment for the monitoring output. That doesn't have a per-integration concept and is defined at the policy level; an agent can only send monitoring data to a single cluster.
  • I would add that we need to limit the character length of these output columns somehow.

@nimarezainia (Contributor, Author):

One thing to confirm about Osquery/Endpoint actions, will we have all data in the management cluster for these to work properly? Is there any data written by agents required, that would be sent to the data clusters?
I'm referring to the data in the indices here: elastic/ingest-dev#3093 (comment)

@juliaElastic if all agents are writing to a single management cluster, as you mentioned, all these "dot" indices would be built there. But this would be something we need to check with the team for sure.

@nimarezainia (Contributor, Author):

A lot of these steps are unfortunately the laborious steps needed to set up a remote cluster; I'm not sure what we can do about that.

  • Create 2 instances in staging, go to the Manage deployment / Security page of the cluster that will be used as remote, copy the Proxy address from the bottom of the page

The user can get this from Fleet->settings tab of the remote cluster also, but I think this is a better place.

Edit the exported dashboard and replace all logs-* occurrences by remote_ecs:logs-*

@juliaElastic Can this step be automated in any way? For example, if there was a user setting that enabled them to opt in to having the indices built on CCS results. We know all the remote cluster names; I don't know if we could do this via aliasing?
If the user has many clusters this could be an area for optimization, but on the other hand it feels maybe too risky to do under the covers, even if it were doable.

@juliaElastic (Contributor) commented Aug 13, 2024

I was originally thinking of full flexibility for the user when I wrote that requirement, but I'm not sure that use case exists. I was thinking of something like a cluster used for CCR as an example; we may not want agents to have the option to write to that cluster.

I'm curious to understand how the two workflows (CCS and Remote ES output) can be unified (or where the opportunity for that is). A Remote ES output is really just a pointer to a cluster that has already been set up. I think the remote cluster is set up once in both cases.

Sounds like the efficiency here is to have CCS enabled as well when our Fleet user is setting up the Remote ES output to an already created remote cluster.

I'm not sure how users currently use these 2 features together; I mean, do they always set up Remote Clusters when they use a Remote ES output?

My thinking was this:

  • To send agent data to remote ES, users have to create a remote ES output (and create a service token in the remote cluster)
  • To use CCS, users have to add a Remote Cluster in kibana, and create a cross-cluster API key, and establish trust with certs if needed.

Now, we start to combine these features to set up sending data from agents to remote clusters, and then view the data collected by agents in a single dashboard using CCS. We could add a step to the Remote ES Output UI to also create the cross-cluster API key (similar to the instructions for the remote service token); a sketch of that call is below.
I don't think we can or want to automate the full setup of remote clusters from Fleet; that should be a prerequisite that we show on the UI with a link to the docs.
Alternatively, we could move the Remote ES output setup into the Remote Clusters UI (which is owned by another team) and have Fleet read the config from there. Though that might be overkill if not all users need a Remote Cluster setup to use a Remote ES output (which I suppose is the case today).
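For reference, the cross-cluster API key mentioned above can be created on the remote cluster with the dedicated security endpoint (available in recent Elasticsearch versions; check availability and required privileges for your deployment). The hosts, the key name, and the index patterns below are placeholders.

```ts
import { Client } from '@elastic/elasticsearch';

// Minimal sketch: create a cross-cluster API key on the remote cluster.
// Connection details, the key name, and index patterns are placeholders.
const remoteEs = new Client({
  node: 'https://remote-cluster.example.com:9200',
  auth: { apiKey: '<remote-cluster-admin-api-key>' },
});

const ccsKey = await remoteEs.transport.request({
  method: 'POST',
  path: '/_security/cross_cluster/api_key',
  body: {
    name: 'fleet-ccs-remote_ecs',
    access: { search: [{ names: ['logs-*', 'metrics-*', 'traces-*'] }] },
  },
});

// The "encoded" value in the response is what gets added to the management
// deployment under Remote connections / Trust management in the earlier steps.
console.log(ccsKey);
```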

@nimarezainia (Contributor, Author):

Now, we start to combine these features to set up sending data from agents to remote clusters, and then view the data collected by agents in a single dashboard using CCS. We could add a step in the Remote ES Output UI to also create the cross-cluster API key (similar to the instructions for the remote service token).

IMO this would be the preferred option if we do anything in this regard. Also keep in mind that not every user who is setting up a remote ES output is also doing CCS with another cluster. I'd be happy to keep this enhancement as a follow-up, based on user feedback.

Installation of integration assets on the remote ES is extremely valuable (last requirement in that list).

@juliaElastic (Contributor) commented Aug 14, 2024

Can this step be automated in any way? For example, if there was a user setting that enabled them to opt in to having the indices built on CCS results. We know all the remote cluster names; I don't know if we could do this via aliasing?
If the user has many clusters this could be an area for optimization, but on the other hand it feels maybe too risky to do under the covers, even if it were doable.

I checked and aliasing is not supported on remote indices: elastic/elasticsearch#43312

I think the CCS in dashboards could be automated like this:

  • Add a UI setting to the Remote ES output to enable CCS. This could be a checkbox if the corresponding Remote Cluster can be identified (based on host name), or a list of Remote Clusters to select from.
  • When CCS is enabled on a remote output, update all installed Kibana assets to include the remote cluster in their index patterns, e.g. logs-* would be updated to logs-*,remote1:logs-*.
  • When CCS is disabled, update all Kibana assets to remove the remote cluster from the index patterns.
  • When an integration is installed/upgraded, modify all Kibana assets to include all enabled remote clusters in the index patterns.

So basically we would have to keep the index patterns in the Kibana asset definitions in sync with the current remote clusters (those which opted in to CCS for integrations); a rough sketch of that rewrite is below.
One potential concern is the extra cost of using CCS for all searches in dashboards/visualisations once CCS is enabled. We should probably add warnings in Fleet/Analytics to remind users.
Another issue we would have to solve is users updating an existing remote cluster config with another host that no longer matches a remote ES output. This is not likely to happen, but it would probably be better to let users explicitly select which remote cluster they opt in to, rather than trying to match on host.
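To make the proposal concrete, the index pattern rewrite could look roughly like the helper below. The function name, inputs, and behaviour are hypothetical, not existing Fleet code; the real implementation would have to walk every installed Kibana asset (data views, dashboards, lens panels) and apply this to each pattern.

```ts
// Hypothetical helper: rebuild an index pattern so it covers the base pattern
// plus every CCS-enabled remote cluster, regardless of what was there before.
function withRemoteClusters(indexPattern: string, remoteClusters: string[]): string {
  // Keep only the local (non-prefixed) patterns; previously added remote
  // prefixes are dropped and rebuilt from the current cluster list.
  const basePatterns = indexPattern.split(',').filter((p) => !p.includes(':'));
  const remotePatterns = remoteClusters.flatMap((cluster) =>
    basePatterns.map((p) => `${cluster}:${p}`)
  );
  return [...basePatterns, ...remotePatterns].join(',');
}

// Enabling CCS for remote1 and remote2:
//   withRemoteClusters('logs-*', ['remote1', 'remote2'])
//   => 'logs-*,remote1:logs-*,remote2:logs-*'
// Disabling a cluster is the same call with it removed from the list.
```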

@juliaElastic (Contributor):

IMO this would be the preferred option if we do anything in this regard. Also keep in mind that not every user who is setting up remote ES output is doing CCS also with another cluster. I'd be happy to keep this enhancement as a followup and based on user feedback.

I added an implementation plan to the description based on the discussion so far.

Installation of integration assets on the remote ES is extremely valuable (last requirement in that list).

Commented on the linked issue too, I think it can be done earlier/separately from the rest of the enhancements here, if we think it's the most valuable.

@juliaElastic (Contributor):

if all agents are writing to a single management cluster, as you mentioned, all these "dot" indices would be built there. But this would be something we need to check with the team for sure.

I added a comment to confirm here; I think it's likely we don't have to make any changes for these actions to work in Phase 1.

@nimarezainia (Contributor, Author):

So basically we would have to keep the index patterns in kibana asset definitions in sync with the current remote clusters (which opted in for CCS for integrations).
One potential concern can be of the extra costs for using CCS for all searches in dashboards/visualisations once CCS is enabled. We should probably add warnings in Fleet/Analytics to remind users.
Another issue that we would have to

Because this is opt-in, it should be OK. Maybe we can consider this as a future enhancement. We could get InfoSec to provide feedback and see if this level of complication is worth it, or whether users prefer to build their dashboards using CCS themselves.

@juliaElastic (Contributor):

@jkakavas Hey, could we get your feedback on the solution laid out in #187129 (comment)?

@nimarezainia (Contributor, Author):

@aarju and @WiegerElastic, could we test a proposal with you on the topic of this issue? Mainly the comment here.

As discussed with you previously, we are looking at streamlining how our users, including Elastic, deploy Fleet in a multi-cluster scenario. As a user, would you be happy to create your own dashboards using CCS, or do you see a need for Fleet to ensure that all indices in all the remote clusters are available for visualizations? The latter has costs associated with it, and we are reluctant to develop these features if they are not that valuable to our users. Basically, is it OK to expect users to build their own dashboards using CCS?

@aarju commented Sep 2, 2024

One potential concern can be of the extra costs for using CCS for all searches in dashboards/visualisations once CCS is enabled. We should
probably add warnings in Fleet/Analytics to remind users.

I don't know the exact numbers, but I think that unless the users have the dashboard up all the time with the polling period set to refresh every few seconds this cost should be pretty low. CCS only returns the results of the query and not all of the data so the cost of the queries from data transfer is pretty small compared to all of the other costs.

As a user, would you be happy to create your own dashboards using CCS OR do you see a need for Fleet to ensure that all indices in all the remote clusters are available for visualizations

I think that as a customer we would want to have some built in dashboards that use CCS to provide a starting point and an idea of what is possible, but with the ability to customize the dashboards and the dataviews feeding them. For example, several of our data views always provide the remote cluster name because that type of data is stored on one or two specific remote clusters so there is no need to query the other clusters that don't contain that data. In other cases if we use the data view of *:logs-* it can severely impact performance querying remote clusters unnecessarily because we have so much data of varying sources.

@juliaElastic (Contributor):

I think that as a customer we would want to have some built in dashboards that use CCS to provide a starting point and an idea of what is possible, but with the ability to customize the dashboards and the dataviews feeding them.

I'm not sure how the customization would work in combination with trying to keep the remote cluster prefixes in sync automatically.
Users would have to duplicate dashboards to customize the data views, and the duplicates would not be kept in sync (when adding a new remote cluster, for example).
I'm not sure it's worth automating the data view updates; maybe we could start with documenting how to add remote clusters to the data views used in dashboards?

@nimarezainia (Contributor, Author):

As a user, would you be happy to create your own dashboards using CCS OR do you see a need for Fleet to ensure that all indices in all the remote clusters are available for visualizations

I think that as a customer we would want to have some built in dashboards that use CCS to provide a starting point and an idea of what is possible, but with the ability to customize the dashboards and the dataviews feeding them. For example, several of our data views always provide the remote cluster name because that type of data is stored on one or two specific remote clusters so there is no need to query the other clusters that don't contain that data. In other cases if we use the data view of *:logs-* it can severely impact performance querying remote clusters unnecessarily because we have so much data of varying sources.

Thanks @aarju. How do you do this today? Is it extremely arduous?

More importantly, from Fleet's perspective, how do we go about choosing which data views need this treatment?

@aarju commented Sep 3, 2024

thanks @aarju. How do you guys do this today? is it extremely arduous?

At this time all of the monitoring dashboards are custom built. If you want to see some examples there are dashboards built by the team that manages our giant o11y cluster.

More importantly, from Fleet's perspective, how do we go about choosing which dataview needs to get this treatment.

I think this would be any data view that we are centrally managing via the CCS fleet. However, for us the only data views that are an issue right now are those where we can take actions with the integration from inside of another App such as Defend and OSQuery. Health monitoring of all of our fleets in a single central location would be nice, but the big challenge we are having right now is that we can't use several of the capabilities of our SIEM because the Defend and OSQuery integrations are managed from a different cluster. If I have an alert in my SIEM I can't use the 'Isolate Host' or OSQuery action from that cluster, I have to log into our Endpoint Fleet cluster and isolate it from there.

@juliaElastic (Contributor):

Could you give an example of a dashboard which already uses CCS in the overview cluster?

While I see actions across clusters would be higher priority, those are not in scope in the current Phase 1 issue.

@nimarezainia (Contributor, Author):

@aarju For Defend and OSQuery: the actions will be taken care of by what Julia is proposing here in Phase 1, where there is a single management cluster. The actions rely on the "dot" indices, which will all be in one place in the Management Cluster.

For Agent health monitoring: we already have the option of sending monitoring data to a common cluster, which can be nominated by the user.

What we want to figure out here is what we should do (if anything) for integration-specific data streams. Say you have nginx installed on agents across multiple sites/clusters and the user wants to see these in one consolidated dashboard. Should Fleet be creating this data view based on CCS (there would be many data streams, some of which the user probably wouldn't care about)? Or do we leave this part to the user and have them utilize CCS to build the dashboards they need, which probably scales better (an illustration follows below)? Fleet will ensure that the integration is installed on all the remote clusters (including all its assets). Does this make sense?
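Purely as an illustration of the two options, a Fleet-generated consolidated data view and a user-scoped one might differ only in the pattern they query; the cluster aliases here are placeholders.

```ts
// Fleet-generated: every CCS-enabled remote cluster is included automatically.
const fleetManagedPattern =
  'logs-nginx.access-*,emea:logs-nginx.access-*,apac:logs-nginx.access-*';

// User-built: only the clusters known to hold nginx data are queried,
// avoiding unnecessary fan-out to clusters without that data stream.
const userScopedPattern = 'emea:logs-nginx.access-*';
```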

kpollich added the Meta label on Sep 9, 2024
@kpollich (Member):

We'll definitely want to dedicate a lot of time to building out end to end tests with cross-cluster search for this work, which will be a substantial investment just to spin up appropriate environments in CI against which to run tests.
