Improve information about _stats index_failures #80802

GlenRSmith · 2021-11-17T18:44:14Z

As is, when exceptions occur in InternalEngine.index, they bubble up and by way of a postIndex hook, two things happen:

if trace logging is enabled for the right class context, a trace log is written that includes the root cause failure
the relevant InternalIndexingStats have the index_failed counter incremented

This makes it very challenging to investigate the cause of indexing failures. I think even the most surgical setting of trace logging (I think it would be org.elasticsearch.index.shard.IndexShard ?) in a production environment will result in pretty massive amounts of logging.
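For reference, a single logger's level can be changed at runtime through the cluster settings API (PUT /_cluster/settings with a logger.<name> key). Below is a minimal Python sketch that only builds the request body and does not send it. The logger name is the one guessed above and is not confirmed to be the right one, and the helper name logger_level_body is made up for illustration.

```python
import json

# Logger name guessed in the issue text; not confirmed to be the right one.
INDEX_SHARD_LOGGER = "org.elasticsearch.index.shard.IndexShard"


def logger_level_body(logger_name, level):
    """Build a body for PUT /_cluster/settings that sets one logger's level.

    Passing level=None yields a JSON null, which resets the logger to its default.
    """
    return {"persistent": {f"logger.{logger_name}": level}}


if __name__ == "__main__":
    # Body to raise the logger to TRACE, then the body to reset it afterwards.
    print(json.dumps(logger_level_body(INDEX_SHARD_LOGGER, "TRACE"), indent=2))
    print(json.dumps(logger_level_body(INDEX_SHARD_LOGGER, None), indent=2))
```

Resetting with null once the investigation is over keeps the extra log volume short-lived.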

I've been able to figure out that, for example, an org.elasticsearch.index.engine.VersionConflictEngineException will contribute to this count when e.g. trying to update a document with a lagging version number, but only by suspecting that to be the case and testing it in isolation.

I'm not really sure how, exactly, I would prefer to see this improved. One approach would be to add granularity to those stats; that seems like a fairly high bar to clear, as it would be disruptive to the client-facing API. Changes to logging seem more palatable in that regard, and the lowest-hanging approach might be to promote the [1] logger.trace in IndexShard.index to, at a minimum, logger.debug. Another approach would be to add logging at each of the places where a root cause failure occurs. Of course, that would fan out the changes needed, and it would be more difficult to be certain all relevant scenarios are covered.

(I would contend that, regardless of any effort to address the general request I'm making, the point-by-point places where relevant exceptions are raised should generate log entries, arguably as high as warn level.)

elasticmachine · 2021-11-18T18:00:08Z

Pinging @elastic/es-distributed (Team:Distributed)

GlenRSmith · 2021-11-18T19:06:10Z

Another error that does increment the count is

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "DocValuesField \"test_data\" is too large, must be <= 32766"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "DocValuesField \"test_data\" is too large, must be <= 32766"
  },
  "status": 400
}
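A client could catch this particular failure before sending the document. The sketch below is a rough pre-check in Python. It assumes the 32766 limit in the error message is counted in bytes of the UTF-8 encoded value rather than in characters, and it checks every top-level string field because the client does not know which fields are backed by doc values. The function name is hypothetical.

```python
# Limit quoted in the error message above; assumed here to be counted in
# bytes of the UTF-8 encoded value rather than in characters.
MAX_DOC_VALUE_BYTES = 32766


def oversized_fields(doc, limit=MAX_DOC_VALUE_BYTES):
    """Return names of top-level string fields whose UTF-8 size exceeds limit."""
    return [
        name
        for name, value in doc.items()
        if isinstance(value, str) and len(value.encode("utf-8")) > limit
    ]
```

Whether a flagged field would actually be rejected depends on the index mapping, so this is only a hint for the client, not a replacement for handling the 400 response.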

DaveCTurner · 2021-11-22T11:27:49Z

The thinking here is that errors due to client requests don't really have a place in the server log - the folks with access to the server logs are often not the right people to address any problems with requests from clients, and failed indexing is rarely of concern to the server admin. It certainly doesn't warrant a WARN-level log entry for every such exception. Instead, we return the exception in the HTTP response and expect the client to handle it appropriately by logging it or retrying or raising an alert etc.

(edit: I see some value in logging these exceptions at DEBUG rather than TRACE, but no higher)
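To illustrate the client-side handling described above, here is a minimal Python sketch that tallies failed items in a bulk response by error type, so the client can log or alert on them. It assumes the usual bulk response shape (an items list where each entry has one key naming the action and an optional error object). The function name is made up for illustration.

```python
from collections import Counter


def failure_types(bulk_response):
    """Count failed bulk items by error type."""
    counts = Counter()
    for item in bulk_response.get("items", []):
        # Each item has a single key naming the action (index, create, update, delete).
        for result in item.values():
            error = result.get("error")
            if error:
                counts[error.get("type", "unknown")] += 1
    return counts
```

A client might log the resulting counts at its own preferred level, which gives the breakdown by cause that the index_failed counter alone does not.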

consulthys · 2022-10-20T06:59:18Z

I encountered a similar issue where all monitoring indexes actually suffer from indexing failures (see last column in the screenshot below), while no other "business" indexes have this problem. So here the "client" is Metricbeat monitoring ES nodes, but it's not clear from the logs why this is happening.

GET _cat/shards?v&s=iif:desc&h=index,shard,docs,id,iif
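The same ranking can be derived per index from the JSON returned by the index stats API. This is a minimal Python sketch, assuming the response has already been fetched (for example from GET /_stats/indexing) and parsed into a dict. It reads indices.<name>.total.indexing.index_failed, and the function name is hypothetical.

```python
def rank_by_index_failed(stats_response):
    """Return (index_name, index_failed) pairs, highest failure count first."""
    rows = []
    for name, stats in stats_response.get("indices", {}).items():
        # Missing sections are treated as zero failures.
        failed = stats.get("total", {}).get("indexing", {}).get("index_failed", 0)
        rows.append((name, failed))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

This only shows where failures accumulate. It says nothing about their cause, which is the gap this issue is about.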

DaveCTurner · 2024-04-16T06:27:19Z

Sorry for the delay in responding here @consulthys. We wouldn't expect Elasticsearch to report these things in its logs, but they should be reported by Metricbeat (if they're not benign anyway -- which I hope they are given that they seem to happen millions of times per day)

consulthys · 2024-04-20T04:53:08Z

Thanks @DaveCTurner!
Thanks to our investigations, we've been able to file a new enhancement request for Elasticsearch to add an additional counter for version conflicts, as well as another enhancement request for Metricbeat that fixes the way elasticsearch.shard documents are sent to ES.

GlenRSmith added the >enhancement and needs:triage labels Nov 17, 2021

nik9000 added the :Distributed Indexing/CRUD and team-discuss labels Nov 18, 2021

elasticmachine added the Team:Distributed (Obsolete) label Nov 18, 2021

nik9000 removed the needs:triage label Nov 18, 2021

DaveCTurner removed the team-discuss label Dec 9, 2021

consulthys mentioned this issue Apr 20, 2024

Provide more insights into indexing failures #107601

Open

consulthys mentioned this issue Apr 20, 2024

[Metricbeat] Improve the elasticsearch module when used for Stack Monitoring elastic/beats#39058

Open
