[BUG][remote store] 'java.lang.ArithmeticException: long overflow' causes Segment Replication failed #9690

Closed
tlfeng opened this issue Sep 1, 2023 · 2 comments
Labels
bug Something isn't working feedback needed Issue or PR needs feedback Indexing:Replication Issues and PRs related to core replication framework eg segrep Storage:Remote Storage Issues and PRs relating to data and metadata storage

Comments

@tlfeng
Collaborator

tlfeng commented Sep 1, 2023

Describe the bug
Following up on issue #9556: I hit this problem while running a performance test against OpenSearch built from the 2.x branch (issue #8874).
The java.nio.file.FileAlreadyExistsException is a symptom, not the root cause. In the logs, java.lang.ArithmeticException: long overflow always occurs before the first FileAlreadyExistsException in each test run.

[2023-08-25T04:38:33,032][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-5-197.us-west-2.compute.internal] [shardId 39] [replication id 344583] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=93, FILE_DIFF=3, REPLICATING=0}

org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:511) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-core-2.10.0.jar:2.10.0]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:171) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:494) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.doRun(SegmentReplicationTargetService.java:480) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

Caused by: java.lang.ArithmeticException: long overflow
	at java.lang.Math.addExact(Math.java:903) ~[?:?]
	at org.opensearch.repositories.s3.S3RetryingInputStream.openStream(S3RetryingInputStream.java:122) ~[?:?]
	at org.opensearch.repositories.s3.S3RetryingInputStream.reopenStreamOrFail(S3RetryingInputStream.java:236) ~[?:?]
	at org.opensearch.repositories.s3.S3RetryingInputStream.read(S3RetryingInputStream.java:192) ~[?:?]
	at org.opensearch.index.store.RemoteIndexInput.readBytes(RemoteIndexInput.java:59) ~[opensearch-2.10.0.jar:2.10.0]
	at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:289) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.store.Directory.copyFrom(Directory.java:182) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.index.store.Store$StoreDirectory.copyFrom(Store.java:955) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.RemoteStoreReplicationSource.getSegmentFiles(RemoteStoreReplicationSource.java:119) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:168) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-core-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:164) ~[opensearch-2.10.0.jar:2.10.0]
	... 7 more
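
For readers unfamiliar with this exception: the innermost frame is java.lang.Math.addExact, which throws ArithmeticException with the message "long overflow" when the sum of two long values does not fit in a long. The sketch below only demonstrates that JDK behavior. The class name, the helper names, and the idea that the two operands are a range start and a stream offset are assumptions for illustration; the trace does not show which values S3RetryingInputStream.openStream actually added.

```java
public class LongOverflowSketch {

    // Hypothetical helper (not OpenSearch code): adds a base position and an
    // offset with overflow checking, the way Math.addExact is typically used.
    static long checkedPosition(long start, long currentOffset) {
        // Throws ArithmeticException("long overflow") if the sum exceeds the long range.
        return Math.addExact(start, currentOffset);
    }

    // Returns the exception message if the addition overflows, or null if it succeeds.
    static String overflowMessage(long start, long currentOffset) {
        try {
            checkedPosition(start, currentOffset);
            return null;
        } catch (ArithmeticException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Ordinary values add without error.
        System.out.println(overflowMessage(1024L, 4096L));        // prints "null"
        // A sentinel-sized operand plus any positive offset overflows.
        System.out.println(overflowMessage(Long.MAX_VALUE, 1L));  // prints "long overflow"
    }
}
```

If one of the operands in the real code path can be Long.MAX_VALUE (for example as an "unbounded" marker), any retry with a non-zero offset would fail in this way, but that is a guess that would need to be confirmed against the plugin source.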

@tlfeng tlfeng added bug Something isn't working untriaged labels Sep 1, 2023
@tlfeng tlfeng changed the title from "[BUG][remote store]" to "[BUG][remote store] 'java.lang.ArithmeticException: long overflow' causes Segment Replication failed" Sep 1, 2023
@dreamer-89 dreamer-89 added distributed framework Indexing:Replication Issues and PRs related to core replication framework eg segrep Storage Issues and PRs relating to data and metadata storage and removed untriaged labels Sep 1, 2023
@Bukhtawar Bukhtawar added the feedback needed Issue or PR needs feedback label May 16, 2024
@Bukhtawar
Collaborator

@tianleh Can you confirm whether this is still reproducible on 2.14?

[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 ]

@gbbafna
Collaborator

gbbafna commented Jun 27, 2024

[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 ]

Closing as there has been no response.

@gbbafna gbbafna closed this as completed Jun 27, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Storage Project Board Jun 27, 2024