
[BUG] Unrecoverable translog / NPE exception on index/bulk/write #11354

Open

kushb5 opened this issue Nov 27, 2023 · 6 comments

Labels
bug Something isn't working Indexing Indexing, Bulk Indexing and anything related to indexing

Comments

kushb5 commented Nov 27, 2023

Describe the bug
This behavior is not predictable. At times, while we are running a series of bulk write operations on an index, we observe the following translog exception with no recovery. Once it occurs for one index, the cluster begins returning the same error for all API calls. Even GET _cluster/health starts returning null_pointer_exception consistently after that; API queries that had executed successfully earlier also return the same error.

[2023-11-16T21:28:31,752][WARN ][r.suppressed             ] [es-data-1] path: /agent-jobs/_doc/d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50, params: {index=agent-jobs, id=d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50}
org.opensearch.transport.RemoteTransportException: [es-data-1][172.62.0.91:9300][indices:data/write/bulk[s][p]]
Caused by: org.opensearch.index.translog.TranslogException: Failed to write operation [Index{id='d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50', type='_doc', seqNo=2397194, primaryTerm=1, version=1, autoGeneratedIdTimestamp=-1}]
	at org.opensearch.index.translog.Translog.add(Translog.java:628) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1073) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.index(IndexShard.java:970) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:914) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:815) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:351) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:212) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:276) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:165) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:98) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:220) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) [opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
	at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.lang.NullPointerException
	at org.opensearch.index.translog.Translog$Index.write(Translog.java:1315) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog$Index.access$300(Translog.java:1196) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog$Operation.writeOperation(Translog.java:1169) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog.writeOperationNoSize(Translog.java:1706) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog.add(Translog.java:596) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1073) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.index(IndexShard.java:970) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:914) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:815) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:351) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:212) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:276) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:165) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:98) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:220) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]

After observing this issue consistently with no recovery, we disabled all application requests fired into OpenSearch and attempted a rollout restart of the StatefulSet, after which we observed many shards go into an unassigned state. GET _cluster/allocation/explain also returned a null pointer exception, and allocation exhausted all 5 retries. The cluster does not recover even after resetting the allocation retries.
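
For reference, retrying allocations that have exhausted their attempts is typically done through the reroute API's retry_failed flag; a minimal sketch of that call using Java's built-in HTTP client (host, port, and the absence of authentication are assumptions; with the security plugin enabled the request would also need credentials/TLS):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RetryFailedAllocations {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Ask the cluster to retry shards whose allocation attempts were exhausted.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cluster/reroute?retry_failed=true")) // assumed endpoint
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```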

The final resolution is to completely delete the StatefulSet (without deleting the PVCs) and recreate it.

To Reproduce

  1. Initialize OpenSearch as a 3-replica StatefulSet in a Kubernetes environment, with each node acting as master+data+ingest.
  2. Create many indices, each with 3 primary and 2 replica shards (3+2).
  3. Perform bulk writes into one-third of the indices at a rate that exhausts the active bulk thread pool, while applying exponential backoff whenever OpenSearch rejects a write request because a node's bulk queue is full (see the client-side sketch after this list).
  4. Upon observing the translog exception and null pointer exception in all API calls, including bulk/write and _cluster/health, attempt a rollout restart of the StatefulSet; you will observe an NPE in the allocation/explain response.
  5. Delete and recreate the StatefulSet (without deleting the PVCs); the issue is resolved.
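
A minimal sketch of the client-side backoff mentioned in step 3, using Java's built-in HTTP client (the endpoint, retry budget, and delays are illustrative assumptions, not our exact client):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BulkWithBackoff {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    /** Sends an NDJSON _bulk body, retrying with exponential backoff on 429 (bulk queue full). */
    static HttpResponse<String> send(String ndjsonBody) throws Exception {
        long waitMillis = 500;                                        // illustrative starting delay
        for (int attempt = 0; attempt < 6; attempt++) {               // illustrative retry budget
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/_bulk"))   // assumed host/port
                    .header("Content-Type", "application/x-ndjson")
                    .POST(HttpRequest.BodyPublishers.ofString(ndjsonBody))
                    .build();
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 429) {
                // Note: individual items inside a 200 bulk response can still be rejected
                // and would need the same back-off treatment per item.
                return response;
            }
            Thread.sleep(waitMillis);                                 // queue full: back off and retry
            waitMillis *= 2;
        }
        throw new IllegalStateException("bulk request kept being rejected after retries");
    }
}
```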

Expected behavior
Bulk writes should go through successfully. Even if the OpenSearch cluster returns a failure for a faulty request, it should not fail entirely with the translog exception on all API calls with no scope for recovery.

Plugins
opensearch-security

Host/Environment:

  • OS: Ubuntu
  • Version: 20.04

@kushb5 kushb5 added bug Something isn't working untriaged labels Nov 27, 2023
@macohen macohen added the Indexing Indexing, Bulk Indexing and anything related to indexing label Nov 30, 2023

kushb5 commented Jan 16, 2024

@ankitkala @macohen could you please update the issue with the RCA here and an ETA for the fix? Thanks.

@nitish2787

Please let us know the RCA; we want to understand why this issue is happening. Once it happens, it never recovers, and we have to restart the affected ES nodes or sometimes the entire cluster.

@nitish2787

@macohen @ankitkala any help would be appreciated. We would like to know what we can do to avoid this issue.

kushb5 commented Apr 8, 2024

Any update on this would be appreciated, @macohen @ankitkala.

@ankitkala
Member

It looks like we're seeing an NPE while writing to the translog here

The doc ID here seems to be null. Are you using autogenerated IDs? If not, can you check the doc IDs for the docs ingested on your end?
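
For anyone following along, whether the ID is autogenerated is decided by the _bulk action line; a small illustration (the index name matches the log above, the ID and field values are placeholders):

```java
public class BulkIdForms {
    // Explicit _id on the action line: the client supplies the document ID.
    // The IDs in the log above look client-generated, and autoGeneratedIdTimestamp=-1
    // in the failed operation is consistent with that.
    static final String EXPLICIT_ID =
            "{\"index\":{\"_index\":\"agent-jobs\",\"_id\":\"client-generated-id\"}}\n"
          + "{\"field\":\"value\"}\n";

    // No _id on the action line: OpenSearch autogenerates the document ID.
    static final String AUTOGENERATED_ID =
            "{\"index\":{\"_index\":\"agent-jobs\"}}\n"
          + "{\"field\":\"value\"}\n";
}
```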

@ankitkala
Member

I'm wondering where the NPE is actually originating from.

  • I've verified that the stack trace and the code lines map correctly to the code in 1.2, so the code link I shared in the comment above is correct.
  • Looking at the stack trace, I can see that id shouldn't be null in this case, as the trace is logging the index operation, which does have a valid doc ID. Also, if the id were null, the NPE should originate within the out.writeString method.
  • The only remaining candidate is out being null, which also shouldn't be the case given that the out stream is used in the line above as well. (Both candidates are illustrated in the sketch below.)
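
To make those two candidates concrete, here is a generic sketch (plain java.io streams, not the actual OpenSearch Translog code) of where each NPE would surface:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class NpeCandidates {
    // Generic stand-in for the write path: not OpenSearch code, just the two failure modes.
    static void writeId(DataOutputStream out, String id) throws IOException {
        out.writeUTF(id); // candidate 1: id == null -> NPE thrown inside the string write
                          // candidate 2: out == null -> NPE thrown before any bytes are written
    }

    public static void main(String[] args) throws IOException {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        writeId(out, "doc-1");      // works
        // writeId(out, null);      // NullPointerException from writeUTF(null)
        // writeId(null, "doc-1");  // NullPointerException dereferencing a null stream
    }
}
```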

@andrross andrross removed the untriaged label May 8, 2024