[BUG] Unrecoverable translog / NPE exception on index/bulk/write #11354
Comments
@ankitkala @macohen Could you please update the issue with the RCA and an ETA for the fix? Thanks.
Please let us know the RCA. We want to understand why this issue is happening. Once it happens, the cluster never recovers, and we have to restart the affected ES nodes or sometimes the entire cluster.
@macohen @ankitkala Any help will be appreciated. We would like to know what we can do to avoid this issue.
Any update on this would be appreciated, @macohen @ankitkala.
It looks like we're seeing an NPE while writing to the translog here. The doc ID here seems to be null. Are you using autogenerated IDs? If not, can you check the doc IDs for the docs ingested on your end?
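For anyone who wants to run the doc ID check suggested above on the client side, here is a minimal sketch. It assumes the bulk request is available as an NDJSON string in the standard `_bulk` format. The function name `find_suspect_ids` is hypothetical and not part of any OpenSearch client. Whether a null ID is actually the cause of this NPE is not confirmed in this thread; the sketch only flags action lines whose `_id` is explicitly null or empty before the request is sent.

```python
import json

# Bulk actions that are followed by a document/source line; "delete" is not.
ACTIONS_WITH_BODY = {"index", "create", "update"}


def find_suspect_ids(bulk_body: str):
    """Return (line_number, action, reason) for action lines whose _id is
    explicitly null or an empty string. A missing _id is not flagged,
    since that is the autogenerated-ID case."""
    suspects = []
    lines = bulk_body.splitlines()
    i = 0
    while i < len(lines):
        raw = lines[i].strip()
        if not raw:
            i += 1
            continue
        action_line = json.loads(raw)
        # Each action line is a single-key object: {"index": {...metadata...}}
        action, meta = next(iter(action_line.items()))
        if "_id" in meta and meta["_id"] in (None, ""):
            reason = "null _id" if meta["_id"] is None else "empty _id"
            suspects.append((i + 1, action, reason))
        # Skip the document line that follows index/create/update actions.
        i += 2 if action in ACTIONS_WITH_BODY else 1
    return suspects
```

Running this over the payloads your application builds (before they reach the cluster) should tell you quickly whether any explicit IDs are null or empty.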
I'm wondering where the NPE is actually originating from.
Describe the bug
This behavior is not predictable. At times, while we are doing a series of bulk write operations on an index, we observe the following translog exception with no recovery. Once it is observed for one index, the same error is returned for all API calls. Even GET _cluster/health starts returning null_pointer_exception consistently after that. API queries that had executed successfully earlier also return the same error consistently.
After observing this issue persist with no recovery, we disabled all application requests to OpenSearch and attempted a rollout restart of the StatefulSet, after which many shards went into the unassigned state. GET _cluster/allocation/explain returned a null pointer exception, with all 5 allocation retries exhausted. The cluster did not recover even after we reset the allocation retries.
The final resolution was to delete the StatefulSet completely (without deleting the PVC) and trigger its creation again.
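For reference, the diagnostic calls described above can be scripted roughly as follows. This is a sketch under assumptions: the base URL is a placeholder, authentication and TLS (likely required with opensearch-security) are omitted, and "resetting allocation retries" is assumed to mean `POST _cluster/reroute?retry_failed=true`. The function names are hypothetical.

```python
import urllib.error
import urllib.request


def build_diagnostic_requests(base_url: str):
    """Build the three cluster-level requests mentioned in the report."""
    base = base_url.rstrip("/")
    return [
        urllib.request.Request(f"{base}/_cluster/health", method="GET"),
        urllib.request.Request(f"{base}/_cluster/allocation/explain", method="GET"),
        # Assumed equivalent of "resetting allocation retries".
        urllib.request.Request(f"{base}/_cluster/reroute?retry_failed=true", method="POST"),
    ]


def run_diagnostics(base_url: str) -> None:
    """Send each request and print the status and body, including error bodies."""
    for req in build_diagnostic_requests(base_url):
        label = f"{req.get_method()} {req.full_url}"
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                print(label, resp.status)
                print(resp.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            # The error body is where a null_pointer_exception would be reported.
            print(label, err.code)
            print(err.read().decode("utf-8"))
        except urllib.error.URLError as err:
            print(label, "connection failed:", err.reason)
```

Calling `run_diagnostics("http://localhost:9200")` (placeholder address) and attaching the printed error bodies to the issue would make the failing responses easier to compare.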
To Reproduce
Expected behavior
Bulk writes should go through successfully. Even if the OpenSearch cluster returns a failure because of a faulty query, the cluster should not fail entirely on the translog exception for all API calls with no possibility of recovery.
Plugins
opensearch-security
Screenshots
Host/Environment (please complete the following information):
Additional context