[BUG] [Remote Store] NullPointerException in translog upload if node drops before uploading the first metadata #12554

Open
sachinpkale opened this issue Mar 7, 2024 · 0 comments
Assignees
Labels
bug Something isn't working Storage:Remote

sachinpkale commented Mar 7, 2024

Describe the bug

We get the following error while uploading translog files.

[2024-02-27T10:53:26,255][ERROR][o.o.i.t.t.BlobStoreTransferService] [8c4f026d5ef9b5702a7d65da4c517316] Failed to upload blob translog-1147.ckp
java.lang.NullPointerException
    at java.base/java.util.Objects.requireNonNull(Objects.java:209)
    at org.opensearch.index.translog.transfer.BlobStoreTransferService.uploadBlob(BlobStoreTransferService.java:130)
    at org.opensearch.index.translog.transfer.BlobStoreTransferService.lambda$uploadBlobs$2(BlobStoreTransferService.java:99)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at org.opensearch.index.translog.transfer.BlobStoreTransferService.uploadBlobs(BlobStoreTransferService.java:94)
    at org.opensearch.index.translog.transfer.TranslogTransferManager.transferSnapshot(TranslogTransferManager.java:154)
    at org.opensearch.index.translog.RemoteFsTranslog.upload(RemoteFsTranslog.java:348)
    at org.opensearch.index.translog.RemoteFsTranslog.prepareAndUpload(RemoteFsTranslog.java:326)
    at org.opensearch.index.translog.RemoteFsTranslog.sync(RemoteFsTranslog.java:375)
    at org.opensearch.index.translog.InternalTranslogManager.syncTranslog(InternalTranslogManager.java:197)
    at org.opensearch.index.engine.InternalEngine.syncTranslog(InternalEngine.java:610)
    at org.opensearch.index.shard.IndexShard.sync(IndexShard.java:4412)
    at org.opensearch.index.IndexService.maybeFSyncTranslogs(IndexService.java:1008)
    at org.opensearch.index.IndexService$AsyncTranslogFSync.runInternal(IndexService.java:1143)
    at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:858)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

Related component

Storage:Remote

To Reproduce

  • Why were translog uploads failing due to NPE?

    • The stack trace showed the NPE at this line: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/translog/transfer/BlobStoreTransferService.java#L130
    • Working back through the trace, we found that the checksum can be null for files that are already present on local when RemoteFsTranslog is initialized.
    • This normally works because the RemoteFsTranslog constructor deletes all local files and downloads them from remote.
    • So, in the ideal flow, after RemoteFsTranslog initialization, the tlog and ckp files have been downloaded from remote to local and the file tracker has been updated accordingly. All good!
    • The issue occurs when there are tlog files on local that are not part of the file tracker. Because they are not tracked, we try to upload them, and the upload fails with an NPE.
  • Why were translog files present on local and not in the file tracker?

    • In the translog download flow, we fetch the translog metadata file, delete the existing local files, and download from remote.
    • If we don't find the translog metadata file, we also skip deleting the existing local files.
    • Because we don't download any tlog files, the file tracker is not updated.
    • This only happens if we have files on local but no translog metadata file in remote.
  • Why was the translog metadata file missing from remote?

    • This can happen if we upload the translog files and the process crashes (for any reason) before the translog metadata is uploaded.
    • In this case, the process is restarted and the same node gets the same primary shard (because there are 0 replicas). The node starts accepting writes, but they are not acknowledged because the translog file upload keeps failing and the metadata is never uploaded.
    • Meanwhile, the translog file on local retains the dirty writes and keeps growing.
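
The chain above can be modeled in a few lines. This is a minimal sketch, not OpenSearch code: `TranslogNpeSketch`, `trackedChecksums`, `localFiles`, `init` and `upload` are hypothetical names, and the assumption that the failing line is a `requireNonNull` on the tracked checksum is inferred from the stack trace and the analysis above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of the flow described in this issue.
class TranslogNpeSketch {
    // Stand-in for the file tracker: file name -> checksum.
    final Map<String, Long> trackedChecksums = new HashMap<>();
    // Stand-in for the local translog directory.
    final Set<String> localFiles = new TreeSet<>();

    // Models initialization. A null argument means "no metadata file in remote".
    void init(Map<String, Long> remoteFilesFromMetadata) {
        if (remoteFilesFromMetadata == null) {
            // No metadata: local files are not deleted and the tracker stays empty.
            return;
        }
        // Metadata found: replace local files with the remote ones and track them.
        localFiles.clear();
        localFiles.addAll(remoteFilesFromMetadata.keySet());
        trackedChecksums.putAll(remoteFilesFromMetadata);
    }

    // Models the upload step: the checksum is required to be non-null.
    long upload(String fileName) {
        return Objects.requireNonNull(trackedChecksums.get(fileName));
    }

    public static void main(String[] args) {
        TranslogNpeSketch shard = new TranslogNpeSketch();
        shard.localFiles.add("translog-1147.ckp"); // left over from before the crash
        shard.init(null);                          // metadata was never uploaded
        try {
            shard.upload("translog-1147.ckp");
        } catch (NullPointerException e) {
            System.out.println("upload failed with NPE for untracked file");
        }
    }
}
```

In this model, calling `init` with a non-null map clears the leftover local files and populates the tracker, so the subsequent `upload` finds a checksum and succeeds.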

Expected behavior

Translog upload should not fail due to NPE.
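
One possible way to meet this expectation, sketched under assumptions: fall back to computing a checksum when the tracker has no entry, instead of requiring a non-null value. `ChecksumFallbackSketch`, `track` and `checksumFor` are hypothetical names, and the use of CRC32 here is an illustrative choice, not necessarily what OpenSearch uses or what the eventual fix looks like. Deleting untracked local files during initialization would be another option.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical sketch; the names and the fallback policy are assumptions.
class ChecksumFallbackSketch {
    // Stand-in for the file tracker: file name -> checksum.
    private final Map<String, Long> trackedChecksums = new HashMap<>();

    void track(String fileName, long checksum) {
        trackedChecksums.put(fileName, checksum);
    }

    // Returns the tracked checksum, or computes a CRC32 over the content
    // when the file is not in the tracker, so that no NPE is thrown.
    long checksumFor(String fileName, byte[] content) {
        Long tracked = trackedChecksums.get(fileName);
        if (tracked != null) {
            return tracked;
        }
        CRC32 crc = new CRC32();
        crc.update(content);
        return crc.getValue();
    }
}
```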

Additional Details

No response

Projects
Status: Next (Next Quarter)