
[BUG] Unrecoverable translog / NPE exception on index/bulk/write #11354

Open

kushb5 opened this issue Nov 27, 2023 · 6 comments

Labels
bug Something isn't working Indexing Indexing, Bulk Indexing and anything related to indexing

Comments

kushb5 commented Nov 27, 2023

Describe the bug
This behavior is not predictable. At times, while we are running a series of bulk write operations on an index, we observe the following translog exception with no recovery. Once it occurs for one index, the cluster begins returning the same error for all API calls. Even GET _cluster/health starts returning null_pointer_exception consistently after that; API queries that had executed successfully earlier also return the same error.

[2023-11-16T21:28:31,752][WARN ][r.suppressed             ] [es-data-1] path: /agent-jobs/_doc/d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50, params: {index=agent-jobs, id=d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50}
org.opensearch.transport.RemoteTransportException: [es-data-1][172.62.0.91:9300][indices:data/write/bulk[s][p]]
Caused by: org.opensearch.index.translog.TranslogException: Failed to write operation [Index{id='d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50', type='_doc', seqNo=2397194, primaryTerm=1, version=1, autoGeneratedIdTimestamp=-1}]
	at org.opensearch.index.translog.Translog.add(Translog.java:628) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1073) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.index(IndexShard.java:970) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:914) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:815) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:351) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:212) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:276) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:165) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:98) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:220) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) [opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
	at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.lang.NullPointerException
	at org.opensearch.index.translog.Translog$Index.write(Translog.java:1315) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog$Index.access$300(Translog.java:1196) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog$Operation.writeOperation(Translog.java:1169) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog.writeOperationNoSize(Translog.java:1706) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog.add(Translog.java:596) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1073) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.index(IndexShard.java:970) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:914) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:815) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:351) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:212) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:276) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:165) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:98) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:220) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]

After observing this issue consistently with no recovery, we disabled all application requests fired into OpenSearch and attempted a rollout restart of the StatefulSet, after which we observed many shards go into an unassigned state. GET _cluster/allocation/explain also returned a null pointer exception, and allocation exhausted all 5 retries. The cluster does not recover even after resetting the allocation retries.
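
For reference, retrying allocations that have exhausted their attempts is typically done through the reroute API's retry_failed flag; a minimal sketch of that call using Java's built-in HTTP client (host, port, and the absence of authentication are assumptions; with the security plugin enabled the request would also need credentials/TLS):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RetryFailedAllocations {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Ask the cluster to retry shards whose allocation attempts were exhausted.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cluster/reroute?retry_failed=true")) // assumed endpoint
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```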

The final resolution is to completely delete the StatefulSet (without deleting the PVCs) and recreate it.

To Reproduce

  1. Initialize OpenSearch as a 3-replica StatefulSet in a Kubernetes environment, with each node acting as master+data+ingest.
  2. Create many indices, each with 3 primary and 2 replica shards (3+2).
  3. Perform bulk writes into one-third of the indices at a rate that exhausts the active bulk thread pool, while applying exponential backoff whenever OpenSearch rejects a write request because a node's bulk queue is full (see the client-side sketch after this list).
  4. Upon observing the translog exception and null pointer exception in all API calls, including bulk/write and _cluster/health, attempt a rollout restart of the StatefulSet; you will observe an NPE in the allocation/explain response.
  5. Delete and recreate the StatefulSet (without deleting the PVCs); the issue is resolved.
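
A minimal sketch of the client-side backoff mentioned in step 3, using Java's built-in HTTP client (the endpoint, retry budget, and delays are illustrative assumptions, not our exact client):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BulkWithBackoff {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    /** Sends an NDJSON _bulk body, retrying with exponential backoff on 429 (bulk queue full). */
    static HttpResponse<String> send(String ndjsonBody) throws Exception {
        long waitMillis = 500;                                        // illustrative starting delay
        for (int attempt = 0; attempt < 6; attempt++) {               // illustrative retry budget
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/_bulk"))   // assumed host/port
                    .header("Content-Type", "application/x-ndjson")
                    .POST(HttpRequest.BodyPublishers.ofString(ndjsonBody))
                    .build();
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 429) {
                // Note: individual items inside a 200 bulk response can still be rejected
                // and would need the same back-off treatment per item.
                return response;
            }
            Thread.sleep(waitMillis);                                 // queue full: back off and retry
            waitMillis *= 2;
        }
        throw new IllegalStateException("bulk request kept being rejected after retries");
    }
}
```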

Expected behavior
Bulk writes should go through successfully. Even if the OpenSearch cluster returns a failure for a faulty request, it should not fail entirely with the translog exception on all API calls with no scope for recovery.

Plugins
opensearch-security

Host/Environment:

  • OS: Ubuntu
  • Version: 20.04

@kushb5 kushb5 added bug Something isn't working untriaged labels Nov 27, 2023
@macohen macohen added the Indexing Indexing, Bulk Indexing and anything related to indexing label Nov 30, 2023

kushb5 commented Jan 16, 2024

@ankitkala @macohen could you please update the issue with the RCA here and an ETA for the fix? Thanks.

@nitish2787

Please let us know the RCA; we want to understand why this issue is happening. Once it happens, it never recovers, and we have to restart the affected ES nodes or sometimes the entire cluster.

@nitish2787

@macohen @ankitkala any help would be appreciated. We would like to know what we can do to avoid this issue.

kushb5 commented Apr 8, 2024

Any update on this would be appreciated, @macohen @ankitkala.

@ankitkala
Member

It looks like we're seeing an NPE while writing to the translog here

The doc ID here seems to be null. Are you using autogenerated IDs? If not, can you check the doc IDs for the docs ingested on your end?
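
For anyone following along, whether the ID is autogenerated is decided by the _bulk action line; a small illustration (the index name matches the log above, the ID and field values are placeholders):

```java
public class BulkIdForms {
    // Explicit _id on the action line: the client supplies the document ID.
    // The IDs in the log above look client-generated, and autoGeneratedIdTimestamp=-1
    // in the failed operation is consistent with that.
    static final String EXPLICIT_ID =
            "{\"index\":{\"_index\":\"agent-jobs\",\"_id\":\"client-generated-id\"}}\n"
          + "{\"field\":\"value\"}\n";

    // No _id on the action line: OpenSearch autogenerates the document ID.
    static final String AUTOGENERATED_ID =
            "{\"index\":{\"_index\":\"agent-jobs\"}}\n"
          + "{\"field\":\"value\"}\n";
}
```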

@ankitkala
Member

I'm wondering where the NPE is actually originating from.

  • I've verified that the stack trace and the code lines map correctly to the code in 1.2, so the code link I shared in the comment above is correct.
  • Looking at the stack trace, I can see that id shouldn't be null in this case, as the trace is logging the index operation, which does have a valid doc ID. Also, if the id were null, the NPE should originate within the out.writeString method.
  • The only remaining candidate is out being null, which also shouldn't be the case given that the out stream is used in the line above as well. (Both candidates are illustrated in the sketch below.)
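
To make those two candidates concrete, here is a generic sketch (plain java.io streams, not the actual OpenSearch Translog code) of where each NPE would surface:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class NpeCandidates {
    // Generic stand-in for the write path: not OpenSearch code, just the two failure modes.
    static void writeId(DataOutputStream out, String id) throws IOException {
        out.writeUTF(id); // candidate 1: id == null -> NPE thrown inside the string write
                          // candidate 2: out == null -> NPE thrown before any bytes are written
    }

    public static void main(String[] args) throws IOException {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        writeId(out, "doc-1");      // works
        // writeId(out, null);      // NullPointerException from writeUTF(null)
        // writeId(null, "doc-1");  // NullPointerException dereferencing a null stream
    }
}
```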

@andrross andrross removed the untriaged label May 8, 2024