[BUG] [Potential Issue] Cluster state response handling thread blocked #12820

Open
shwetathareja opened this issue Mar 21, 2024 · 2 comments
Labels
bug Something isn't working Cluster Manager flaky-test Random test failure that succeeds on second run

Comments

@shwetathareja
Member

shwetathareja commented Mar 21, 2024

Describe the bug

Observed this during gradle-check run for #12813 (comment)

https://build.ci.opensearch.org/job/gradle-check/35542/console

Three generic threads were blocked while processing publication responses (the test ran a 3-node cluster).

Thread[id=5851, name=opensearch[node_t0][generic][T#2], state=BLOCKED, group=TGRP-SearchWeightedRoutingIT]
  2>         at org.opensearch.cluster.coordination.Coordinator$5.onResponse(Coordinator.java:1381)
  2>         at org.opensearch.cluster.coordination.PublicationTransportHandler$PublicationContext$3.handleResponse(PublicationTransportHandler.java:442)
  2>         at org.opensearch.cluster.coordination.PublicationTransportHandler$PublicationContext$3.handleResponse(PublicationTransportHandler.java:433)
  2>         at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleResponse(TraceableTransportResponseHandler.java:72)
  2>         at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1501)
  2>         at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:420)
  2>         at org.opensearch.transport.InboundHandler.lambda$handleResponse$3(InboundHandler.java:414)
  2>         at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)

All of these threads were probably waiting on the mutex below:

private <T> ActionListener<T> wrapWithMutex(ActionListener<T> listener) {
    return new ActionListener<T>() {
        @Override
        public void onResponse(T t) {
            synchronized (mutex) {
                listener.onResponse(t);
            }
        }

        @Override
        public void onFailure(Exception e) {
            synchronized (mutex) {
                listener.onFailure(e);
            }
        }
    };
}
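For anyone following along, here is a minimal, self-contained sketch of why a thread stuck at `synchronized (mutex)` is reported as BLOCKED in a thread dump. It is not OpenSearch code: the class `BlockedOnMutexDemo` and the method `observeWaiterState` are hypothetical names, and the empty synchronized block only stands in for `listener.onResponse(t)`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of OpenSearch.
class BlockedOnMutexDemo {
    private static final Object mutex = new Object();

    // Returns the state observed for a thread that tries to enter
    // synchronized (mutex) while another thread is holding it.
    static Thread.State observeWaiterState() {
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch releaseLock = new CountDownLatch(1);

        // Stands in for whichever code path holds the mutex for a long time.
        Thread holder = new Thread(() -> {
            synchronized (mutex) {
                holderHasLock.countDown();
                awaitQuietly(releaseLock);
            }
        }, "holder");

        // Stands in for a response-handling thread going through wrapWithMutex.
        Thread waiter = new Thread(() -> {
            synchronized (mutex) {
                // Placeholder for listener.onResponse(t).
            }
        }, "waiter");

        try {
            holder.start();
            holderHasLock.await();
            waiter.start();

            // Poll until the waiter is parked on the monitor (or give up after 5 s).
            Thread.State state = waiter.getState();
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (state != Thread.State.BLOCKED && System.nanoTime() - deadline < 0) {
                Thread.sleep(1);
                state = waiter.getState();
            }

            releaseLock.countDown();
            holder.join();
            waiter.join();
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            releaseLock.countDown(); // idempotent; makes sure the holder can exit
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state while mutex was held: " + observeWaiterState());
    }
}
```

The thread dump only shows that the waiters were BLOCKED at the moment it was taken. It does not show which thread owned the monitor or how long the wait lasted.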

Related component

Cluster Manager

To Reproduce

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Investigate which code path was holding the mutex and whether the time it holds the lock can be reduced. Right now, it is not clear how long the threads were blocked.
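One possible way to answer the "how long" question is to time both the wait to enter the mutex and the time spent holding it. The sketch below is only an illustration under assumptions: `TimedMutex`, `wrapWithTimedMutex`, `maxWaitNanos`, `maxHoldNanos`, and `sleepQuietly` are hypothetical names, and `java.util.function.Consumer` is used in place of `ActionListener` so the example stays self-contained.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical instrumentation sketch; not part of OpenSearch.
class TimedMutex {
    private final Object mutex = new Object();
    private final AtomicLong maxWaitNanos = new AtomicLong();
    private final AtomicLong maxHoldNanos = new AtomicLong();

    // Wraps a callback so it runs under the mutex, recording the longest
    // wait to acquire the mutex and the longest time it was held.
    <T> Consumer<T> wrapWithTimedMutex(Consumer<T> delegate) {
        return value -> {
            long requested = System.nanoTime();
            synchronized (mutex) {
                long acquired = System.nanoTime();
                maxWaitNanos.accumulateAndGet(acquired - requested, Math::max);
                try {
                    delegate.accept(value);
                } finally {
                    maxHoldNanos.accumulateAndGet(System.nanoTime() - acquired, Math::max);
                }
            }
        };
    }

    long maxWaitNanos() {
        return maxWaitNanos.get();
    }

    long maxHoldNanos() {
        return maxHoldNanos.get();
    }

    // Helper used only by the demo to simulate a slow callback.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TimedMutex timed = new TimedMutex();
        Consumer<String> slow = timed.wrapWithTimedMutex(s -> sleepQuietly(50));

        // Two threads contend for the same mutex; one of them has to wait.
        Thread a = new Thread(() -> slow.accept("a"));
        Thread b = new Thread(() -> slow.accept("b"));
        a.start();
        b.start();
        a.join();
        b.join();

        System.out.println("max wait (ms): " + timed.maxWaitNanos() / 1_000_000);
        System.out.println("max hold (ms): " + timed.maxHoldNanos() / 1_000_000);
    }
}
```

Recording only the maximum keeps the overhead small. Whether the real code path would log these values or export them as metrics is an open design choice.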

Additional Details

No response

@shwetathareja shwetathareja changed the title from "[BUG] [] Cluster state response handling thread blocked" to "[BUG] [Potential Issue] Cluster state response handling thread blocked" Mar 21, 2024
@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7]
@shwetathareja Thanks for creating this issue

@rajiv-kv rajiv-kv moved this from 🆕 New to Next (Next Quarter) in Cluster Manager Project Board Dec 19, 2024
@rajiv-kv
Contributor

[Attendees - 1, 2, 3]
Next Steps

  • Verify whether the test is flaky
  • Understand why state publication is performed on the generic thread pool

@rajiv-kv rajiv-kv added the flaky-test Random test failure that succeeds on second run label Dec 19, 2024