[Metricbeat] Improve the elasticsearch module when used for Stack Monitoring #39058

Open
consulthys opened this issue Apr 18, 2024 · 2 comments · May be fixed by #40731
Labels
Team:Infra Monitoring UI (Infrastructure Monitoring UI team), Team:Monitoring (Stack Monitoring team)

Comments

@consulthys
Contributor

consulthys commented Apr 18, 2024

While investigating the root cause of indexing failures (also reported here in the past), we discovered that when Metricbeat feeds Stack Monitoring, its elasticsearch module ships elasticsearch.shard documents with concrete IDs built from the current cluster state identifier (state_uuid) plus some other constant data. Since the cluster state changes far less often than Metricbeat runs its collection rounds (every 10s by default), the same IDs are sent again round after round and version conflicts happen all the time.

Those version conflicts are probably a side effect of the switch to data streams in 8.0.0 (i.e., put-if-absent semantics with a concrete ID) and weren't apparent earlier, when the data was stored in regular indices. Since each elasticsearch.shard document describes a shard placement in the cluster, the logic makes sense: there's no point re-indexing a document whose content hasn't changed since the last collection round.

However, we could/should go one step further and detect whether the cluster state has changed between two collection rounds. I'm naively thinking about "simply" comparing the old and new state_uuid, but it might be more involved than that. Anyway, if there's no change, there's no point in even rebuilding those documents and sending them again, since we know they'll bounce, generate a version conflict and increase the indexing failure counter for no reason. On top of that, it wastes network bandwidth and CPU/RAM resources on the ES side. For big clusters with many thousands of shards, that can make a big difference.
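A rough sketch of that naive comparison, assuming the collector can keep the previous state_uuid in memory between rounds. The `shardCollector` type and `shouldPublish` method are hypothetical names, not existing Metricbeat APIs, and a real change would likely need to handle more cases (e.g. a Metricbeat restart, or a round whose publish failed).

```go
package main

import "fmt"

// shardCollector remembers the state_uuid seen in the previous collection
// round (hypothetical type, for illustration only).
type shardCollector struct {
	lastStateUUID string
}

// shouldPublish reports whether shard documents need to be built and sent for
// the given state_uuid, and records that state_uuid for the next round.
// An empty state_uuid is never treated as "unchanged", to stay conservative.
func (c *shardCollector) shouldPublish(stateUUID string) bool {
	if stateUUID != "" && stateUUID == c.lastStateUUID {
		return false // same cluster state as last round: skip the work
	}
	c.lastStateUUID = stateUUID
	return true
}

func main() {
	c := &shardCollector{}
	rounds := []string{"state-A", "state-A", "state-B", "state-B", "state-A"}
	for _, uuid := range rounds {
		fmt.Printf("%s publish=%v\n", uuid, c.shouldPublish(uuid))
	}
}
```

Only the previous value is remembered, so a state_uuid that reappears later would be published again and rejected once; that seems acceptable given that the goal is to avoid the repeated conflicts on every round.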

Related issue: #36547 (comment)

@botelastic (bot) added the needs_team label (indicates that the issue/PR needs a Team:* label) Apr 18, 2024
@cmacknz added the Team:Monitoring (Stack Monitoring team) and Team:Infra Monitoring UI (Infrastructure Monitoring UI team) labels Apr 23, 2024
@botelastic (bot) removed the needs_team label Apr 23, 2024
@pickypg
Member

pickypg commented Jul 26, 2024

The UI may need to be updated to understand the lack of a changing timestamp, but comparing the state_uuid should be all that's needed for that suggestion.

@consulthys
Contributor Author

consulthys commented Dec 13, 2024

> The UI may need to be updated to understand the lack of a changing timestamp, but comparing the state_uuid should be all that's needed for that suggestion.

@pickypg Queries on shard data don't use any time range; the state_uuid serves as an implied time range. You can find more about this in elastic/kibana#189728.
