Improve information about _stats index_failures #80802
Comments
Pinging @elastic/es-distributed (Team:Distributed)
Another error that does increment the count is
|
The thinking here is that errors due to client requests don't really have a place in the server log - the folks with access to the server logs are often not the right people to address any problems with requests from clients, and failed indexing is rarely of concern to the server admin. It certainly doesn't warrant a WARN-level log entry for every such exception. Instead, we return the exception in the HTTP response and expect the client to handle it appropriately by logging it or retrying or raising an alert etc. (edit: I see some value in logging these exceptions at |
I encountered a similar issue where all monitoring indexes actually suffer from indexing failures (see the last column in the screenshot below); no other "business" index has this problem. So here the "client" is Metricbeat monitoring ES nodes, but it's not clear from the logs why this is happening.
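For anyone wanting to see which indices are affected without a dashboard, here is a minimal sketch that pulls the failure counter out of an already-parsed `GET /_stats/indexing` response. It assumes the usual response layout (`indices.<name>.primaries.indexing.index_failed`); the function name and the sample data are hypothetical, so check the layout against your own cluster's output.

```python
from typing import Dict


def indices_with_index_failures(stats: dict, scope: str = "primaries") -> Dict[str, int]:
    """Return {index_name: index_failed} for indices with a nonzero count.

    `stats` is the parsed JSON body of GET /_stats/indexing (assumed layout).
    `scope` is "primaries" or "total".
    """
    failures = {}
    for name, body in stats.get("indices", {}).items():
        count = body.get(scope, {}).get("indexing", {}).get("index_failed", 0)
        if count > 0:
            failures[name] = count
    # Highest failure count first.
    return dict(sorted(failures.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical sample shaped like a trimmed _stats response.
sample_stats = {
    "indices": {
        ".monitoring-es-example": {
            "primaries": {"indexing": {"index_total": 100, "index_failed": 12}}
        },
        "business-example": {
            "primaries": {"indexing": {"index_total": 500, "index_failed": 0}}
        },
    }
}

failing = indices_with_index_failures(sample_stats)
```

This only tells you where the counter is nonzero; it says nothing about which exception caused the failures.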
Sorry for the delay in responding here @consulthys. We wouldn't expect Elasticsearch to report these things in its logs, but they should be reported by Metricbeat (if they're not benign anyway -- which I hope they are given that they seem to happen millions of times per day).
Thanks @DaveCTurner!
As is, when exceptions occur in InternalEngine.index, they bubble up and by way of a postIndex hook, two things happen:
This makes it very challenging to investigate the cause of indexing failures. I think even the most surgical setting of trace logging (I think it would be `org.elasticsearch.index.shard.IndexShard`?) in a production environment will result in pretty massive amounts of logging.

I've been able to figure out, for example, that an `org.elasticsearch.index.engine.VersionConflictEngineException` will contribute to this count when, e.g., trying to update a document with a lagging version number, but only by suspecting that to be the case and testing it in isolation.

I'm not really sure how, exactly, I would prefer to see this improved. One approach would be to add granularity to those stats; that seems like a fairly high bar to clear in justifying, as it would be disruptive to the client-facing API. Changes to logging seem more palatable in that regard, and the lowest-hanging approach might be to promote the [1] `logger.trace` in `IndexShard.index` to, at a minimum, `logger.debug`. Another approach would be to add logging at each of the places where a root cause failure occurs. Of course, that would fan out the changes needed, and it would be more difficult to be certain all relevant scenarios are covered.

(I would contend that, regardless of any effort to address the general request I'm making, the point-by-point places where relevant exceptions are raised should generate log entries, arguably at as high as `warn` level.)
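For reference, the "surgical" trace logging mentioned above would be enabled through the cluster settings API by setting a `logger.<name>` key. Below is a minimal sketch that only builds the request body for `PUT /_cluster/settings`; the helper name is hypothetical, and the logger name is the guess from this issue, not a confirmed location of the relevant log statement.

```python
import json
from typing import Optional

# Log levels accepted by Elasticsearch logger settings.
VALID_LEVELS = {"OFF", "FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"}


def logger_settings_body(logger_name: str, level: Optional[str],
                         persistent: bool = True) -> dict:
    """Build the body for PUT /_cluster/settings that sets one logger's level.

    Passing level=None produces a body that resets the logger to its default.
    """
    if level is not None:
        level = level.upper()
        if level not in VALID_LEVELS:
            raise ValueError(f"unknown log level: {level}")
    scope = "persistent" if persistent else "transient"
    return {scope: {f"logger.{logger_name}": level}}


# Logger name is the guess made in this issue (an assumption, not verified).
GUESSED_LOGGER = "org.elasticsearch.index.shard.IndexShard"

enable_body = logger_settings_body(GUESSED_LOGGER, "trace")
reset_body = logger_settings_body(GUESSED_LOGGER, None)

print(json.dumps(enable_body, indent=2))
```

Sending `reset_body` afterwards matters in production, since the whole concern here is the volume of output this logger produces at trace level.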