[BUG] [Potential Issue] Cluster state response handling thread blocked #12820

shwetathareja · 2024-03-21T09:09:01Z

Describe the bug

Observed this during a gradle-check run for #12813 (comment):

https://build.ci.opensearch.org/job/gradle-check/35542/console

Three generic threads were blocked while processing publication responses (the test ran a 3-node cluster).

Thread[id=5851, name=opensearch[node_t0][generic][T#2], state=BLOCKED, group=TGRP-SearchWeightedRoutingIT]
  2>         at org.opensearch.cluster.coordination.Coordinator$5.onResponse(Coordinator.java:1381)
  2>         at org.opensearch.cluster.coordination.PublicationTransportHandler$PublicationContext$3.handleResponse(PublicationTransportHandler.java:442)
  2>         at org.opensearch.cluster.coordination.PublicationTransportHandler$PublicationContext$3.handleResponse(PublicationTransportHandler.java:433)
  2>         at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleResponse(TraceableTransportResponseHandler.java:72)
  2>         at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1501)
  2>         at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:420)
  2>         at org.opensearch.transport.InboundHandler.lambda$handleResponse$3(InboundHandler.java:414)
  2>         at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)

All three threads were probably waiting on the mutex below:

OpenSearch/server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java

Lines 1376 to 1392 in f3d2bee

    
    private <T> ActionListener<T> wrapWithMutex(ActionListener<T> listener) {
        return new ActionListener<T>() {
            @Override
            public void onResponse(T t) {
                synchronized (mutex) {
                    listener.onResponse(t);
                }
            }

            @Override
            public void onFailure(Exception e) {
                synchronized (mutex) {
                    listener.onFailure(e);
                }
            }
        };
    }
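
One possible way to find out which thread owns the monitor that the response handlers are blocked on is to ask the JVM directly through `ThreadMXBean`. The sketch below is not OpenSearch code: the class `MutexOwnerProbe`, the method names, and the `demo-holder` / `demo-responder` threads are hypothetical stand-ins that reproduce the same shape of contention (one thread holding a shared `mutex`, another stuck entering `synchronized (mutex)`), and then report the owner of the contended lock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper, not part of OpenSearch.
public class MutexOwnerProbe {

    /** Lists every BLOCKED thread with the lock it waits on and the thread that owns it. */
    static List<String> describeBlockedThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        // The contended lock and its owner are reported even without locked-monitor details.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                report.add(info.getThreadName() + " waits on " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }

    /** Reproduces the contention with two demo threads and returns the probe's report. */
    static List<String> runDemo() {
        final Object mutex = new Object();
        final CountDownLatch held = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (mutex) {
                held.countDown();
                try {
                    release.await(); // stand-in for a long-running critical section
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "demo-holder");

        Thread responder = new Thread(() -> {
            synchronized (mutex) {
                // stand-in for listener.onResponse(t)
            }
        }, "demo-responder");

        try {
            holder.start();
            held.await();
            responder.start();
            // Wait until the responder is actually parked on the monitor.
            while (responder.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);
            }
            List<String> report = describeBlockedThreads();
            release.countDown();
            holder.join();
            responder.join();
            return report;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        runDemo().forEach(System.out::println);
    }
}
```

In a real investigation only `describeBlockedThreads()` would be relevant, called from a test hook or a debugger at the moment the threads are stuck. A plain `jstack` dump gives the same owner information without any code changes.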

Related component

Cluster Manager

To Reproduce

Go to '...'
Click on '....'
Scroll down to '....'
See error

Expected behavior

Investigate which code path was holding the mutex and whether the time it holds the lock can be reduced. Right now it is not clear how long the threads were blocked.
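
To answer the "how long" question, one option is to temporarily time each critical section. The following is a minimal sketch under stated assumptions, not the Coordinator's implementation: `TimedMutex`, `runLocked`, and the labels are hypothetical names, and the clock is injected so the behaviour can be checked deterministically. It records the longest wait for the lock and the longest hold time, along with the label of the code path responsible for the longest hold.

```java
import java.util.function.LongSupplier;

// Hypothetical instrumentation helper, not part of OpenSearch.
public class TimedMutex {
    private final Object mutex = new Object();
    private final LongSupplier nanoClock;

    // Written only while holding the mutex; volatile so readers see fresh values.
    private volatile long maxWaitNanos;
    private volatile long maxHoldNanos;
    private volatile String slowestLabel = "";

    public TimedMutex(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    public TimedMutex() {
        this(System::nanoTime);
    }

    /** Runs body under the mutex and records how long the caller waited and held the lock. */
    public void runLocked(String label, Runnable body) {
        long requested = nanoClock.getAsLong();
        synchronized (mutex) {
            long acquired = nanoClock.getAsLong();
            try {
                body.run();
            } finally {
                long held = nanoClock.getAsLong() - acquired;
                long waited = acquired - requested;
                if (held > maxHoldNanos) {
                    maxHoldNanos = held;
                    slowestLabel = label;
                }
                if (waited > maxWaitNanos) {
                    maxWaitNanos = waited;
                }
            }
        }
    }

    public long maxWaitNanos() {
        return maxWaitNanos;
    }

    public long maxHoldNanos() {
        return maxHoldNanos;
    }

    public String slowestLabel() {
        return slowestLabel;
    }
}
```

A usage sketch: wrapping each `synchronized (mutex)` site in a call such as `timedMutex.runLocked("publication-response", () -> listener.onResponse(t))` would show after a test run which label held the lock longest. The labels here are illustrative only.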

Additional Details

No response


peternied · 2024-03-27T15:29:17Z

[Triage - attendees 1 2 3 4 5 6 7]
@shwetathareja Thanks for creating this issue

rajiv-kv · 2024-12-19T17:22:17Z

[Attendees - 1, 2, 3]
Next Steps

Verify whether the test is flaky
Understand why state publication is performed from the generic thread pool

shwetathareja added bug Something isn't working untriaged Cluster Manager labels Mar 21, 2024

github-project-automation bot added this to Cluster Manager Project Board Mar 21, 2024

github-project-automation bot moved this to 🆕 New in Cluster Manager Project Board Mar 21, 2024

shwetathareja changed the title ~~[BUG] [] Cluster state reponse handling thread blocked~~ [BUG] [Potential Issue] Cluster state reponse handling thread blocked Mar 21, 2024

peternied removed the untriaged label Mar 27, 2024

rajiv-kv moved this from 🆕 New to Next (Next Quarter) in Cluster Manager Project Board Dec 19, 2024

rajiv-kv added the flaky-test Random test failure that succeeds on second run label Dec 19, 2024
