Provide more insights into indexing failures #107601

Open
consulthys opened this issue Apr 18, 2024 · 14 comments
Labels
  • :Distributed Indexing/CRUD (A catch-all label for issues around indexing, updating and getting a doc by id. Not search.)
  • >enhancement
  • Team:Distributed Indexing (Meta label for Distributed Indexing team)
  • Team:Distributed (Obsolete) (Meta label for distributed team, obsolete; replaced by Distributed Indexing/Coordination.)

Comments

@consulthys
Contributor

consulthys commented Apr 18, 2024

Description

The indexing.index_failed counter returned by the Index Stats and Node Stats APIs currently counts how many indexing failures are happening at the index level and the node level, respectively. A big shortcoming of this counter is that it lumps together many different "kinds" of indexing failures, some of which are important and some of which aren't.

Digging through forums and discussion boards, we found only vague indications of what goes into this counter, namely "version conflicts and Lucene analysis chain errors". Lucene analysis chain errors can be very important issues that the user might want to know about, since they relate to document-level problems raised by Lucene. Version conflicts, however, can come in at a much higher rate and may not be relevant at all, as we'll see in the two cases below, and when that's the case they tend to swamp the Lucene ones.

What is needed?

It would be great if a separate counter took care of counting version conflicts and let the indexing.index_failed counter cover only those "Lucene analysis chain errors". Ideally, the latter could be further broken down into specific error categories, but that's a separate effort. For now, we'd just like to focus on splitting version conflicts from the rest.

So instead of getting this:

GET .ds-.monitoring-es-8-mb-2024.04.14-000079/_stats?filter_path=**.index_failed
=>
  "indices": {
    ".ds-.monitoring-es-8-mb-2024.04.14-000079": {
      "primaries": {
        "indexing": {
          "index_failed": 12943888
        }
      }
    }
  }

We could get this:

GET .ds-.monitoring-es-8-mb-2024.04.14-000079/_stats?filter_path=**.index_failed,**.version_conflicts
=>
  "indices": {
    ".ds-.monitoring-es-8-mb-2024.04.14-000079": {
      "primaries": {
        "indexing": {
          "index_failed": 2,
          "version_conflicts": 12943886
        }
      }
    }
  }

As a bonus, we could even distinguish between general version conflicts and version conflicts on create operations, since they are somewhat different in nature. Adding these new fields would be backward compatible, as the index_failed field would be left unchanged.

Why is it needed?

There are numerous cases where this could be useful, but the two main ones that would greatly benefit from this change are explained below.

Case 1: Put if absent (op_type = create)

One use case where it's perfectly fine to have version conflicts is when indexing data with op_type = create to make sure a document that already exists is not indexed again (a minimal sketch of such a request follows below). When running such a workload, the failure counter quickly ends up looking like the figure further down:
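For illustration only, here is what such a "put if absent" bulk request can look like; the index name and document IDs below are made up for this example:

# illustrative index name and IDs; the second "create" action for the same _id
# fails with a version_conflict_engine_exception and increments index_failed
POST logs-demo/_bulk
{ "create": { "_id": "event-0001" } }
{ "message": "first occurrence" }
{ "create": { "_id": "event-0001" } }
{ "message": "duplicate, rejected with a version conflict" }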

Does the user have real Lucene analysis chain errors in there? At that failure rate, there's unfortunately no way to know.

(figure: indexing_failed rate chart)

Case 2: Plain usage of Stack Monitoring

Another case where this problem is even more painful is when monitoring the cluster with Metricbeat. When doing so, the monitoring indices constantly suffer from indexing failures (link to issue) at a steady rate.


It took a while to discover what those indexing errors were, but the gist of it is that the elasticsearch module in Metricbeat sends elasticsearch.shard documents with an ID built from the cluster state_uuid value, which doesn't change at the same cadence as the Metricbeat collection rounds, ultimately causing version conflicts. There's nothing important about this issue, as it makes sense not to index the same data twice (i.e. the cluster state hasn't changed); however, it's still counted as a noisy indexing failure. Metricbeat should probably only send elasticsearch.shard documents when the state_uuid value has changed, but that's another issue tackled here.

Some counter arguments

Indexing failures are client-side issues, let the user deal with them (i.e. enable logs and let them figure it out)

Agreed, but the problem here is that "client" encompasses a very wide range of component types. It's not only about client-side applications that are under the control of the user, but also about the Beats, Logstash, Elastic Agent, Kibana, etc. For ESS users, those "clients" even run in ES Cloud, where the user has no leverage at all.

Very few users complain about this, so why bother?

Fair enough, though that might also be because the index_failed metric is not shown anywhere in Stack Monitoring charts or anywhere else in Kibana. So unless the user looks at _cat/indices and specifically requests the iif column, they would never be aware of it.

@consulthys consulthys added >enhancement needs:triage Requires assignment of a team area label labels Apr 18, 2024
@iverase iverase added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed needs:triage Requires assignment of a team area label labels Apr 22, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Apr 22, 2024
@Tim-Brooks
Contributor

We discussed this as a team and agreed that we are aligned on two potential improvements.

  1. More granular error counters for engine level indexing failures.

Currently, as outlined in the ticket, we only have a single counter keeping track of the number of failures. We would support adding additional counters for specific exceptions that it makes sense to track (e.g. mapping errors, version conflicts, etc.). Our expectation is that these counters would be node level. It is probably challenging to have shard-level counters without substantially increasing the amount of information collected.

  2. A dedicated indexing error logger.

Currently there is no dedicated logger that can be enabled to surface indexing errors. It is possible to get them by enabling trace on certain loggers, but those loggers also surface additional unrelated logs. We think it would be useful to add a specific logger which only logs indexing errors, that users can enable temporarily to diagnose issues; see the sketch below. The logger would include shard-level labels to help diagnose specific shards.
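A rough sketch of how such a logger could presumably be toggled at runtime via the standard dynamic logger settings; the logger name below is hypothetical, purely for illustration:

# hypothetical logger name, shown only to illustrate the mechanism
PUT _cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.index.IndexingFailuresLogger": "DEBUG"
  }
}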

We expect that, as various Elasticsearch teams/engineers encounter scenarios where it would be helpful to track this information, it will be incrementally added.

@albertzaharovits
Contributor

Currently, as outlined in the ticket, we only have a single counter keeping track of the number of failures. We would support adding additional counters for specific exceptions that it makes sense to track (e.g. mapping errors, version conflicts, etc.).

With regard to metrics, I've found this external blog post which I think we can use as a baseline. For completeness, here are the indexing failure metrics mentioned in the post:

  • index-related:
    • version conflict
    • mapping-related errors
    • "too many fields" error
  • cluster-related:
    • rejected indexing – queue full
    • circuit breaker errors
    • red cluster
    • disk flood stage cluster block
    • cluster block

I hope we can expose all these metrics, although not necessarily in this format (e.g. "red cluster" should probably be a case of "unallocated primary", and, separately, maybe we don't need to expose the "flood stage" cluster block as a particular type of block).
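Purely as an illustration of the kind of breakdown this could enable, a stats response might eventually look something like the sketch below; the field names other than index_failed are hypothetical, not an actual or proposed API:

# hypothetical field names, for illustration only
GET my-index/_stats?filter_path=**.index_failed,**.version_conflicts,**.mapping_errors
=>
  "indexing": {
    "index_failed": 3,
    "version_conflicts": 12943886,
    "mapping_errors": 2
  }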

Our expectation is that these counters would be node level. It is probably challenging to have shard-level counters without substantially increasing the amount of information collected.

That sounds right to me with regard to telemetry metrics. This is what we have today, and changing these to be index/shard based is a separate discussion. However, the stats exposed via the API already support per-index/shard metrics, including the indexing failure metrics, and the new metrics we're planning to add here should work the same way.
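For illustration, shard-level indexing failure stats can already be pulled out of the stats API along these lines (the index name is just an example):

# "my-index" is an example index name
GET my-index/_stats?level=shards&filter_path=**.index_failed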

@consulthys
Contributor Author

@albertzaharovits @Tim-Brooks thanks for working on this!!
We (Opster) wrote that blog post to provide some potential explanations for indexing failures, in the hope that one day the API could provide distinct counters for each of them.

The noisiest indexing failures we've seen so far are the version conflict ones, but if we can count any other failures distinctly, even better!!

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-obsolete (Team:Distributed (Obsolete))

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label Dec 13, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@consulthys
Contributor Author

consulthys commented Dec 13, 2024

Our expectation is that these counters would be node level. It is probably challenging to have shard-level counters without substantially increasing the amount of information collected.

That sounds right to me with regard to telemetry metrics. This is what we have today, and changing these to be index/shard based is a separate discussion. However, the stats exposed via the API already support per-index/shard metrics, including the indexing failure metrics, and the new metrics we're planning to add here should work the same way.

Regarding the "node-level" bit: when indexing failures happen and one needs to debug them, one needs to know which index is causing them. I think that reporting them only at the node level defeats the purpose of having this indexing failure counter in the first place. When supporting customers with indexing failures, it's much more important to know which indices are causing the problem than which nodes those failures are happening on.

For context, AutoOps is currently gathering indexing failures from /_cat/shards (the iif column), so if we provide those new counters, they should also be available as new sibling columns in that call; see the example below.
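For illustration, this is roughly the kind of call involved; iif is the existing column alias for indexing.index_failed, and the index name is just an example:

# "my-index" is an example index name; iif aliases indexing.index_failed
GET _cat/shards/my-index?v&h=index,shard,prirep,node,iif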

albertzaharovits added a commit that referenced this issue Jan 8, 2025
This exposes new OTel node and index based metrics for indexing failures due to version conflicts.

In addition, the /_cat/shards, /_cat/indices and /_cat/nodes APIs also expose the same metric, under the newly added column iifvc.

Relates: #107601
albertzaharovits added a commit to albertzaharovits/elasticsearch that referenced this issue Jan 8, 2025
This exposes new OTel node and index based metrics for indexing failures due to version conflicts.

In addition, the /_cat/shards, /_cat/indices and /_cat/nodes APIs also expose the same metric, under the newly added column iifvc.

Relates: elastic#107601
(cherry picked from commit 12eb1cf)

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
elasticsearchmachine added a commit that referenced this issue Jan 9, 2025
Metrics for indexing failures due to version conflicts (#119067) (#119761)

* Metrics for indexing failures due to version conflicts (#119067)

This exposes new OTel node and index based metrics for indexing failures due to version conflicts.

In addition, the /_cat/shards, /_cat/indices and /_cat/nodes APIs also expose the same metric, under the newly added column iifvc.

Relates: #107601
(cherry picked from commit 12eb1cf)

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java

* types

* Fix NodeIndexingMetricsIT

* [CI] Auto commit changes from spotless

* Fix RestShardsActionTests

* Fix test/cat.shards/10_basic.yml for bwc

---------

Co-authored-by: elasticsearchmachine <[email protected]>
@albertzaharovits
Contributor

albertzaharovits commented Jan 9, 2025

@consulthys We've merged the change to expose a new metric for version conflicts: #119067. This is a subset of the indexing failures (it lands in 8.18).

@albertzaharovits
Contributor

albertzaharovits commented Jan 9, 2025

We've also merged a new logger (disabled by default) that specifically logs only indexing failure causes: #117728 (it also lands in 8.18).

@consulthys
Contributor Author

@albertzaharovits This is HUGE !! Thank you so much 🥳

@consulthys
Contributor Author

consulthys commented Jan 10, 2025

@albertzaharovits I just made a quick "version conflict" test on 8.18.0-SNAPSHOT and I'm getting the following.

The first document creation works:

PUT test/_create/1
{
    "field": 1
}
=>
OK

GET test/_stats
=> 
"index_failed": 0,
"index_failed_due_to_version_conflict": 0,

When I try to create a document with the same ID, I get a version conflict exception as expected. However, when looking at the stats, both counters are incremented, which is unexpected and defeats the purpose of having a dedicated counter for version conflicts:

PUT test/_create/1
{
    "field": 1
}
=>
version_conflict_exception

GET test/_stats
=> 
"index_failed": 1,
"index_failed_due_to_version_conflict": 1,

Please let me know if I'm testing this the wrong way?

@consulthys
Contributor Author

Looking at how this was implemented, I understand why I'm seeing the results shared above. To get the count of "true" indexing failures (i.e. those not related to version conflicts), I'm supposed to subtract index_failed_due_to_version_conflict from index_failed, which in my test above yields 0. I would very much prefer, and think it would make more sense, for each counter to increment independently of the others.

Looking ahead, once we have more counters, we'll need to subtract all of them from index_failed in order to get the real count of indexing failures, and we'll need to modify that logic every time a new counter is added, which is cumbersome.

Is there any chance to increment each counter independently?

@albertzaharovits
Contributor

@consulthys Thanks for checking this!
Indeed, the approach is to conservatively not change the existing metrics (either the OTel or API based ones) - it felt safest from my perspective, and it didn't come up otherwise in the review discussions.

I think this is the safest approach... otherwise people might start seeing anomalies in existing graphs after the upgrade (I believe that most existing indexing failures are actually version conflicts, so it's reasonable to assume customers rely on the existing metric to report version conflicts, even though it covers more).

Plus, the namespacing of OTel metrics with the total es.indexing.indexing.failed.total and sub-total es.indexing.indexing.failed.version_conflict.total works out nicely.

Is there any chance to increment each counter independently?

If you're after non-version-conflict indexing failures, I personally believe it's clearer to simply take the difference. It might be hard to implement on your side, in which case we can discuss adding another metric that implements the difference, but it would be named something like es.indexing.indexing.failed.other.total, which is less clear than es.indexing.indexing.failed.total - es.indexing.indexing.failed.version_conflict.total.

@consulthys
Contributor Author

@albertzaharovits
Thanks for the quick feedback and for explaining why we'd prefer not to change existing metrics.

I believe that most existing indexing failures are actually version conflicts

That works for version conflicts because that's probably the noisiest of all due to our heavy reliance on optimistic concurrency control. I'm glad we tackled this one first and I'm eager to see how the above charts change once this is released.

Let's keep it as designed and implemented. Thanks for the great job!
