Provide more insights into indexing failures #107601
Pinging @elastic/es-distributed (Team:Distributed)
We discussed this as a team and agreed that we are aligned on two potential improvements.
1. Currently, as outlined in the ticket, we only have a single counter keeping track of the number of failures. We would support adding additional counters for specific exceptions that it makes sense to track (e.g. mapping errors, version conflicts, etc.). Our expectation is that these counters would be node level; it is probably challenging to have shard-level counters without substantially increasing the amount of information collected.
2. Currently there is no dedicated logger that can be enabled to surface indexing errors; it is only possible to get them by enabling more verbose general logging. We would support adding such a dedicated logger.

We expect that as various Elasticsearch teams/engineers encounter scenarios where it would be helpful to track this information, it will be incrementally added.
Wrt metrics, I've found this external blog post, which I think we can use as a baseline. For completeness, here are the indexing failure metrics mentioned in the post:
I hope we can expose all these metrics, although not necessarily in this format (e.g. "red cluster" should probably be a case of "unallocated primary", and, separately, maybe we don't need to expose the "flood stage" cluster block as a particular type of block).
That sounds right to me wrt telemetry metrics. This is what we have today, and changing these to be index/shard based is a separate discussion. However, the stats exposed via the API already support per-index/shard metrics, including the indexing failure metrics, and the new metrics we're planning to add here should work the same way.
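For illustration, a rough sketch of reading the existing counter from both APIs (assumes a local test cluster and an index named `my-index`; only the existing `index_failed` field is implied here):

```python
import requests

ES = "http://localhost:9200"  # assumed local test cluster

# Node-level indexing stats: one "indexing" section per node
node_stats = requests.get(f"{ES}/_nodes/stats/indices/indexing").json()
for node_id, node in node_stats["nodes"].items():
    print(node_id, node["indices"]["indexing"]["index_failed"])

# Index/shard-level indexing stats for a specific index
index_stats = requests.get(f"{ES}/my-index/_stats/indexing", params={"level": "shards"}).json()
print(index_stats["indices"]["my-index"]["total"]["indexing"]["index_failed"])
```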
@albertzaharovits @Tim-Brooks thanks for working on this!! The noisiest indexing failures we've seen so far are the version conflict ones, but if we can count any other failures distinctly, it's even better!!
Pinging @elastic/es-distributed-obsolete (Team:Distributed (Obsolete))
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)
Regarding the "node-level" bit, when indexing failures happen and one needs to debug them, he needs to know what index is causing them. I think that reporting them only at the node-level defeats the purpose of having this indexing failure counter in the first place. When supporting customers with indexing failures, it's much more important to know which index are causing the problem than on which node those failures are happening. For the context, AutoOps is currently gathering indexing failures from |
This exposes new OTel node and index based metrics for indexing failures due to version conflicts. In addition, the /_cat/shards, /_cat/indices and /_cat/nodes APIs also expose the same metric, under the newly added column iifvc. Relates: #107601
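For example, the new column can be requested explicitly alongside the existing `iif` column (a sketch, assuming a local cluster):

```python
import requests

ES = "http://localhost:9200"  # assumed local test cluster

# Indexing failures (iif) and version-conflict failures (iifvc) per index
print(requests.get(f"{ES}/_cat/indices", params={"v": "true", "h": "index,iif,iifvc"}).text)

# The same column is available per shard and per node
print(requests.get(f"{ES}/_cat/shards", params={"v": "true", "h": "index,shard,iifvc"}).text)
print(requests.get(f"{ES}/_cat/nodes", params={"v": "true", "h": "name,iifvc"}).text)
```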
@consulthys We've merged the change to expose a new metric for version conflicts: #119067. This is a subset of the indexing failures (lands in 8.18).
We've also merged a new logger (disabled by default) that specifically logs only indexing failure causes: #117728 (also lands in 8.18).
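For anyone wanting to try it out, loggers can be toggled at runtime through the cluster settings API; the logger name and level below are placeholders only, the actual name is in the PR:

```python
import requests

ES = "http://localhost:9200"  # assumed local test cluster

# Placeholder logger name: check #117728 for the real one; set the value to None/"null" to reset it
requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"logger.org.elasticsearch.index.IndexingFailuresLogger": "DEBUG"}},
)
```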
@albertzaharovits This is HUGE!! Thank you so much 🥳
@albertzaharovits I just made a quick "version conflict" test on 8.18.0-SNAPSHOT and I'm getting the following. The first document creation works:
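Something along these lines (index name and document are illustrative, against a local snapshot cluster):

```python
import requests

ES = "http://localhost:9200"  # assumed local 8.18.0-SNAPSHOT cluster

# First creation succeeds
r = requests.put(f"{ES}/test-index/_create/1", json={"field": "value"})
print(r.status_code, r.json()["result"])  # 201 created
```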
When I try to create a document with the same ID, I get a version conflict exception as expected. However, when looking at the stats, both counters are incremented, which is unexpected and defeats the purpose of having a dedicated counter for version conflicts:
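Roughly, the second call returns a 409 and the stats then look like this (values illustrative; the name of the new counter is assumed here):

```python
import requests

ES = "http://localhost:9200"  # assumed local 8.18.0-SNAPSHOT cluster

# Second creation with the same ID fails with a version conflict
r = requests.put(f"{ES}/test-index/_create/1", json={"field": "value"})
print(r.status_code)  # 409

indexing = requests.get(f"{ES}/test-index/_stats/indexing").json()["indices"]["test-index"]["total"]["indexing"]
# Both counters moved together (new counter name assumed):
print(indexing["index_failed"])                              # 1
print(indexing.get("index_failed_due_to_version_conflict"))  # 1
```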
Please let me know if I'm testing this the wrong way?
Looking at how this was implemented, I understand why I'm seeing the results I shared above. In order to get the count of "true" indexing failures (i.e. not related to version conflicts), I'm supposed to subtract the version-conflict counter from the total. Looking ahead, once we have more counters, we'll need to subtract all of them from the total. Is there any chance to increment each counter independently?
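In other words, something like this on the consumer side (counter name assumed, as above):

```python
import requests

ES = "http://localhost:9200"  # assumed local test cluster

indexing = requests.get(f"{ES}/test-index/_stats/indexing").json()["indices"]["test-index"]["total"]["indexing"]

# "True" (non-version-conflict) failures must be derived by subtraction,
# and every future per-cause counter would have to be subtracted as well
true_failures = indexing["index_failed"] - indexing.get("index_failed_due_to_version_conflict", 0)
print(true_failures)
```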
@consulthys Thanks for checking this! I think this is the safest approach... otherwise people might start seeing anomalies in existing graphs after the upgrade (I believe that most existing indexing failures are actually version conflicts, so it's reasonable to assume customers rely on the existing metric to report version conflicts, even though it covers more). Plus, the namespacing of OTel metrics with the total
If you're after non-version-conflict indexing failures, I personally believe it's clearer to simply do the difference. It might be hard to implement, in which case we can discuss adding another metric that implements the difference, but it's going to be named something like
@albertzaharovits
That works for version conflicts because that's probably the noisiest failure type of all, due to our heavy reliance on optimistic concurrency control. I'm glad we tackled this one first and I'm eager to see how the above charts change once this is released. Let's keep it as designed and implemented, thanks for the great job!
Description
The `indexing.index_failed` counter returned from the Index Stats and Node Stats APIs currently counts how many indexing failures are happening at the index level and the node level, respectively. A big shortcoming of this counter is that it lumps together many different "kinds" of indexing failures, some of which are important and some of which aren't.

Digging through forums and discussion boards, we found only vague indications of what goes into this counter, namely "version conflicts and Lucene analysis chain errors". Lucene analysis chain errors can be very important issues that the user might want to know about, since they relate to document-level problems raised by Lucene. However, version conflict issues might come in at a much higher rate, and it might also be the case that they are not relevant at all, as we'll see in the two cases below; when that's the case, they tend to swamp the Lucene ones.
What is needed?
It would be great if another counter took care of counting version conflict issues and let the `indexing.index_failed` counter really be only about those "Lucene analysis chain errors". Ideally, if the latter can be further broken down into specific error categories, even better, but that's another effort. For now, we'd like to focus just on splitting version conflicts from the rest.

So instead of getting this:
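Sketching it with made-up values, today's shape is roughly:

```python
# Illustrative only: the current indexing stats expose a single failure counter
# that mixes version conflicts with Lucene analysis chain errors (other fields omitted)
indexing_stats = {
    "index_total": 105000,
    "index_failed": 4200,
}
```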
We could get this:
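One possible shape, with hypothetical field names:

```python
# Illustrative only: version conflicts broken out into their own counter (names hypothetical)
indexing_stats = {
    "index_total": 105000,
    "index_failed": 150,                      # only the Lucene analysis chain errors
    "index_failed_version_conflicts": 4050,   # version conflicts counted separately
}
```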
As a bonus, we could even distinguish between general version conflicts and version conflicts on create operations, since they are somewhat different in nature. Adding those two new fields would be backward compatible, as the `index_failed` field would be left unchanged.

Why is it needed?
There are numerous cases where this could be useful, but the two main ones that would greatly benefit from this change are explained below.
Case 1: Put if absent (op_type = create)
One use case where it's perfectly OK to have version conflicts is when indexing data with `op_type = create`, to make sure not to index a document that already exists. When running such a case, it starts looking as shown in the figure below:

Does the user have real Lucene analysis chain errors in there? At that indexing rate, there's no way to know, unfortunately.
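A minimal sketch of that "put if absent" pattern (index name and documents illustrative, against an assumed local cluster):

```python
import requests

ES = "http://localhost:9200"  # assumed local test cluster

events = [("event-1", {"msg": "a"}), ("event-1", {"msg": "a"})]  # second one is a duplicate on purpose

for doc_id, body in events:
    # op_type=create ("put if absent"): a re-sent document fails with a version conflict (409),
    # which is perfectly fine here, yet it still bumps the index_failed counter
    r = requests.put(f"{ES}/events/_create/{doc_id}", json=body)
    print(doc_id, r.status_code)  # 201, then 409
```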
Case 2: Plain usage of Stack Monitoring
Another case where this problem is even more painful is when monitoring the cluster with Metricbeat. When doing so, the monitoring indices constantly suffer from indexing failures (link to issue) at a steady rate.
It took a while to discover what those indexing errors were, but the gist of it is that the `elasticsearch` module in Metricbeat sends `elasticsearch.shard` documents with an ID that is built using the cluster `state_uuid` value, which doesn't change at the same cadence as the Metricbeat collection rounds, and that ultimately causes version conflicts. There's nothing important about this issue, as it makes sense not to index the same data twice (i.e. the cluster state hasn't changed); however, it's still counted as a noisy indexing failure. Metricbeat should probably only send `elasticsearch.shard` documents if the `state_uuid` value has changed, but that's another issue, tackled here.

Some counter arguments
Indexing failures are client-side issues, let the user deal with them (i.e. enable logs and let them figure it out)
Agreed, but the problem here is that "client" encompasses a very wide range of different types of client components. It's not only about client-side applications that are under the control of the user, but also about any of the Beats, Logstash, Elastic Agent, Kibana, etc. For ESS users, those "clients" even run in ES Cloud, where the user has no leverage at all.
Very few users complain about this, so why bother?
Fair enough, though that might also be because the `index_failed` metric is not shown anywhere in Stack Monitoring charts or anywhere else in Kibana. So unless the user looks at `_cat/indices` and specifically requests the `iif` field, they would never be aware of it.