Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve information about _stats index_failures #80802

Open
GlenRSmith opened this issue Nov 17, 2021 · 6 comments
Open

Improve information about _stats index_failures #80802

GlenRSmith opened this issue Nov 17, 2021 · 6 comments
Labels
:Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@GlenRSmith
Copy link
Contributor

As is, when exceptions occur in InternalEngine.index, they bubble up and by way of a postIndex hook, two things happen:

  • if trace logging is enable for the right class context, a trace log is written that includes the root cause failure
  • the relevant InternalIndexingStats have the index_failed counter incremented

This makes it very challenging to investigate the cause of indexing failures. I think even the most surgical setting of trace logging (I think it would be org.elasticsearch.index.shard.IndexShard ?) in production environment will result in pretty massive amounts of logging.

I've been able to figure out that, for example, that a org.elasticsearch.index.engine.VersionConflictEngineException will contribute to this count when e.g. trying to update a document with a lagging version number, but only by suspecting that to be the case and testing it in isolation.

I'm not really sure how, exactly, I would prefer to see this improved. One approach would be to add granularity to those stats; that seems like a fairly high bar to clear in justifying as it would be disruptive to the client-facing API. Changes to logging seems more palatable in that regard, and the lowest hanging approach might be to promote the [1] logger.trace in IndexShard.index to, at a minimum, logger.debug. Another approach would be to add logging at each of the places where a root cause failure occurs. Of course that would fan out the changes needed and would be more difficult to be certain all relevant scenarios are covered.

(I would contend that, regardless any effort to address the general request I'm making, the point-by-point places where relevant exceptions are raised should generate log entries, arguably as much as warn level.)

@GlenRSmith GlenRSmith added >enhancement needs:triage Requires assignment of a team area label labels Nov 17, 2021
@nik9000 nik9000 added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. team-discuss labels Nov 18, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Nov 18, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@nik9000 nik9000 removed the needs:triage Requires assignment of a team area label label Nov 18, 2021
@GlenRSmith
Copy link
Contributor Author

Another error that does increment the count is

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "DocValuesField \"test_data\" is too large, must be <= 32766"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "DocValuesField \"test_data\" is too large, must be <= 32766"
  },
  "status": 400
}

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Nov 22, 2021

The thinking here is that errors due to client requests don't really have a place in the server log - the folks with access to the server logs are often not the right people to address any problems with requests from clients, and failed indexing is rarely of concern to the server admin. It certainly doesn't warrant a WARN-level log entry for every such exception. Instead, we return the exception in the HTTP response and expect the client to handle it appropriately by logging it or retrying or raising an alert etc.

(edit: I see some value in logging these exceptions at DEBUG rather than TRACE, but no higher)

@consulthys
Copy link
Contributor

I encountered a similar issue where all monitoring indexes actually suffer from indexing failures (see last column in the screenshot below), no other "business" indexes has this problem. So here the "client" is Metricbeat monitoring ES nodes but it's not clear from the logs why this is happening.

GET _cat/shards?v&s=iif:desc&h=index,shard,docs,id,iif
index_failed

@DaveCTurner
Copy link
Contributor

Sorry for the delay in responding here @consulthys. We wouldn't expect Elasticsearch to report these things in its logs, but they should be reported by Metricbeat (if they're not benign anyway -- which I hope they are given that they seem to happen millions of times per day)

@consulthys
Copy link
Contributor

consulthys commented Apr 20, 2024

Thanks @DaveCTurner!
Thanks to our investigations, we've been able to file a new enhancement request for Elasticsearch to add an additional counter for version conflicts, as well as another enhancement request for Metricbeat that fixes the way elasticsearch.shard documents are sent to ES.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
Projects
None yet
Development

No branches or pull requests

5 participants