Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Monitoring] Allow users to configure CCS pattern #120384

Open
simianhacker opened this issue Dec 3, 2021 · 17 comments
Open

[Monitoring] Allow users to configure CCS pattern #120384

simianhacker opened this issue Dec 3, 2021 · 17 comments
Assignees
Labels
enhancement New value added to drive a business result Feature:Stack Monitoring Team:Monitoring Stack Monitoring team

Comments

@simianhacker
Copy link
Member

simianhacker commented Dec 3, 2021

When Stack Monitoring queries indices with monitoring.ui.ccs.enabled: true, which is the default, we use the *: pattern. This means that it will query ALL of their remote CCS clusters, this can be problematic for users with multiple remote clusters.

Currently the only option is for users to hack around this limitation using the following settings in kibana.yml:

monitoring.ui.ccs.enabled: false
monitoring.ui.metricbeat.index: my-custom-remote-pattern-*:.monitoring-*

Ideally we would rather have users use this:

monitoring.ui.ccs.enabled: true
monitoring.ui.ccs.remotePattern: my-custom-remote-pattern-*

AC:

  • Create a new configuration setting monitoring.ui.ccs.remotePattern which defaults to *
  • Leave the default for enabled as true for now
  • Ensure docs/release notes indicate changes
  • Ensure this can be overridden via Cloud Console
@simianhacker simianhacker added enhancement New value added to drive a business result Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services Feature:Stack Monitoring labels Dec 3, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@crisdarocha
Copy link

@simianhacker will this also allow for more than one remote cluster, in case they can't be written with a wildcard pattern?

@simianhacker
Copy link
Member Author

@crisdarocha That's a great question, I think in that case we would need a way to create multiple patterns. Maybe instead of a string it's an array like this:

monitoring.ui.ccs.enabled: true
monitoring.ui.ccs.remotePatterns:
  - my-remote-pattern-*
  - some-other-remote-pattern-*

Which would set the index patterns for queries to this:

POST .monitoring-6-*,my-remote-pattern-*:.monitoring-6-*,some-other-pattern-*:.monitoring-6-*,.monitoring-7-*,my-remote-pattern-*:.monitoring-7-*,some-other-remote-pattern-*:.monitoring-7-*/_search

@matschaffer
Copy link
Contributor

matschaffer commented Dec 8, 2021

We might want to be careful about path length explosion if we enable multiple patterns (rel #118646).

ES has a 4k line length default on http.max_initial_line_length that we've hit due to kibana alerting pre-expanding all remotes.

I doubt we'll hit it if we keep it under 10 patterns or so though.

@crisdarocha
Copy link

The array sounds like a nice solution. Specially in ESS where deployments can have the most unexpected names.

@jasonrhodes
Copy link
Member

Open question: Should the default for CCS enabled be "true"?

@simianhacker
Copy link
Member Author

simianhacker commented Jan 5, 2022

Open question: Should the default for CCS enabled be "true"?

@jasonrhodes I would vote NO. I feel like CCS is sufficiently complicated that enabling it by default just causes more problems than it solves. The worst case is when a someone adds a remote host for a different purpose, they visit Stack Monitoring to look at their cluster and things stop working because the requests fail. It's especially bad when the remote cluster has a HUGE Metricbeat installation which causes Stack Monitoring to timeout.

Another scenario is when they deploy their nodes via some orchestration framework which has an option for the node roles and they omit remote_cluster_client, then visit Stack Monitoring and it's not working because the queries are failing because it's required role for Stack Monitoring. I made this mistake the first time I setup a Docker cluster because "Why would I need a remote_cluster_client role if I'm not setting up remote clusters?"

I would prefer if there was explicit instructions for enabling CCS for Stack Monitoring.

@matschaffer
Copy link
Contributor

matschaffer commented Jan 6, 2022

+1 to @simianhacker 's sentiment. While it was kind of "oh awesome" to see the that CSS "just worked" when connecting our internal monitoring clusters (cloud.elastic.co managed), I don't think that it was enough of a win to make the potential problems worth it.

@crisdarocha
Copy link

+1 on @simianhacker's comment too. I think the risks of having it by default are surely high.

Not sure how it's done in ESS though. I can't see the settings myself, but I believe ESS has this by default and changing it requires an override.

@neptunian
Copy link
Contributor

@ravikesarwani Do you have any opinion on possibly turning CCS off by default? We're going to leave it as true for now and see if a combination of having removed metricbeat-* from the index pattern and letting them override the indices via something like monitoring.ui.ccs.remotePattern resolve the SDH issues for now, but it's something we're considering.

@jasonrhodes
Copy link
Member

Hey all! I just had a good chat with @neptunian about these changes and I asked if we can hold off on merging them for right now. She's going to merge some of her changes (after review) that will make it easier to enable this type of config in the future, but I want to hold off on introducing another configuration flag that we will have to maintain for a long time when the future of index selection across observability is currently under a lot of review (data views, saved views, saved searches, ESQL, etc etc).

So with that in mind, in order to avoid locking us into a config flag that controls just a part of the index selection required for all Stack Monitoring queries, we're going to hold off on introducing the flag. In addition, since we removed the metricbeat indices from our SM queries, there's a good chance users won't run into the problem where this is most dramatic, so we can hold off until we see if that's a true assumption or not as well. Again, if we find a set of customers that have the issue and need this flag, we'll be able to bring it back much more easily now that Sandy has done the investigation and most of the implementation work.

Let me know if you have questions about this!

@neptunian
Copy link
Contributor

neptunian commented Apr 18, 2022

Opened a new PR #130466

@neptunian
Copy link
Contributor

Other issues that need to be addressed before moving forward

  • Should this be in Advanced Settings instead? It would be space sensitive which could be confusing.
  • If the user provides an incorrect remote name or the remote name doesn't exist anymore, all queries fail including all rules for all clusters. The user won't be able to use the SM app anymore. See screenshots in [Stack Monitoring] support custom ccs remote prefixes #129806. This may or may not be acceptable but since it fairly easy break SM with this setting it might be better thinking of it as "Advanced Setting". A way around this is for the user to remember to select "skip if unavailable" when setting up a remote which is off by default. We could update the "Access Denied" error messaging to be more specific that SM could not connect to such and such remote. Rules might be a bit trickier.
  • This should be supported in Cloud as well, after speaking with Ravi. Currently, Cloud does not support monitoring.ui.ccs.enabled (default true) to be set so it cannot be turned off.
  • Instead of specifying remote names only should we allow remote patterns? eg: myremote-*:.monitoring-*
  • As @jasonrhodes do we see possibly ever moving to Data Views or letting the user customize indices in some other way? This would make backwards compatibility difficult in the future if so.

See draft PR #129806 for implementation

@weltenwort
Copy link
Member

weltenwort commented Apr 19, 2022

Maybe it's useful to recapitulate some advantages and disadvantages of Kibana's Advanced Settings:

Advantages

  • Can be discovered via the UI
  • Can be automatically migrated during an upgrade (since it's stored in saved objects)
  • Doesn't need to be allow-listed in cloud
  • Can be overridden in the kibana.yml if necessary
  • Space-specificity might be a feature (e.g. to monitor different remote clusters in different spaces, but might be a disadvantage too, see below)

Disadvantages

  • Registered during setup(), so the value is only available in start()
  • Space-specificity might be surprising (but might be an advantage too, see above)

On the topic of supporting data views, both Advanced settings and the kibana.yml would prohibit that. Neither could store a reference to a saved object safely. In order to support that, a "saved monitoring view" (which would be a first-class saved object itself) might be a way forward. Its implementation would be more costly than riding on top of either the Advanced Settings or yaml config mechanisms, though.

@jasonrhodes
Copy link
Member

I suppose an advantage of the Advanced Setting is that, through migration, we could potentially create a Data View for a customer who had configured this setting previously. With the kibana.yml config value, I suppose we could do something similar, but at start up instead?

I wonder if the deprecation process for Advanced Settings is substantially different than deprecating a kibana.yml config value.

I'm of two minds here because on the one hand it feels like overthinking things to be worried about something that might happen down the road re: switching to data views etc in Stack Monitoring, but on the other hand it really feels like index selection should be handled differently than concatenating strings stored in different places and in different ways, further separating ourselves from the rest of Kibana.

Based on that, I think we should plan a short R&D and investigate what it would look like to use Data Views in Stack Monitoring, so that we can remove some of that existing separation and potentially provide better ways for users to "fix" this remote CCS problem (I'm not sure if Data Views will help a ton here, but I'd like to explore it).

@neptunian
Copy link
Contributor

neptunian commented Apr 22, 2022

Some other thoughts on Advanced Settings...

Security Solution has an option to change the index patterns that the app uses to collect events.
Screen Shot 2022-04-22 at 11 25 22 AM

Doesn't make as much sense in the SM world for the same reason that Data Views might be complicated, but it could be an input of comma separated remote patterns that could be reset to the default *:
Screen Shot 2022-04-22 at 11 25 51 AM.

It would probably also make sense to have the ccs enabled switch here instead of monitoring.ui.ccs.enabled.

They also have this ccs setting which is probably way of ignoring when you dont have access to some remote in a rule.
Screen Shot 2022-04-22 at 11 31 43 AM

A support engineer recently had a client that wanted to see all their remote clusters in SM but did not necessarily want to create Rules for all their clusters. Perhaps a separate input for the remotes on which to run rules would be useful. They wanted to use a remote name pattern such as prod-*:{monitoring indices}.

I'll create an issue for a short R&D to investigate what it would look like to use Data Views in Stack Monitoring

@jasonrhodes
Copy link
Member

Thank you, @neptunian!

@smith smith added Team:Monitoring Stack Monitoring team and removed Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result Feature:Stack Monitoring Team:Monitoring Stack Monitoring team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants