
[BUG] Reduce TaskBatcher excessive logging in DEBUG mode #12249

Closed
amkhar opened this issue Feb 8, 2024 · 4 comments
Assignees
Labels
bug Something isn't working Cluster Manager v2.16.0 Issues and PRs related to version 2.16.0

Comments


amkhar commented Feb 8, 2024

Describe the bug

The following snippet is from TaskBatcher.runIfNotProcessed (see the hot threads below), where a summary of every batched task is built:

if (toExecute.isEmpty() == false) {
    final String tasksSummary = processTasksBySource.entrySet().stream().map(entry -> {
        String tasks = updateTask.describeTasks(entry.getValue());
        return tasks.isEmpty() ? entry.getKey() : entry.getKey() + "[" + tasks + "]";
    }).reduce((s1, s2) -> s1 + ", " + s2).orElse("");

While executing a pending task, we first build the task summary for logging. If the pending task's batchingKey has 200K tasks in its linked list, we end up collecting the summary of all of those tasks. This takes about 10 minutes and blocks the overall execution of all the tasks, even though the summary is only used for logging at the debug level.

Ideally we should not log this much even in debug mode, since computing the log string takes minutes.
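For illustration, here is a minimal sketch (not the actual OpenSearch code) of how the concatenation could be bounded: stop describing individual tasks once the summary reaches a fixed length and fall back to a count. MAX_SUMMARY_LENGTH, buildCappedSummary, and the map of plain strings are hypothetical stand-ins for the real types.

import java.util.List;
import java.util.Map;

// Hypothetical sketch: cap the task summary so the cost of building it no
// longer scales with the number of batched tasks.
class CappedSummarySketch {

    private static final int MAX_SUMMARY_LENGTH = 1024; // illustrative cap

    static String buildCappedSummary(Map<String, List<String>> tasksBySource) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : tasksBySource.entrySet()) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(entry.getKey()).append('[');
            boolean first = true;
            for (String task : entry.getValue()) {
                if (sb.length() >= MAX_SUMMARY_LENGTH) {
                    // Stop describing individual tasks and report a count instead.
                    sb.append(first ? "" : ", ").append("... ").append(entry.getValue().size()).append(" tasks in total");
                    break;
                }
                if (!first) {
                    sb.append(", ");
                }
                sb.append(task);
                first = false;
            }
            sb.append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildCappedSummary(
            Map.of("shard-started", List.of("shard [0]", "shard [1]"))));
        // prints: shard-started[shard [0], shard [1]]
    }
}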

Related component

Cluster Manager

To Reproduce

  1. Create 200K primary shards in a cluster that can take the load of this many shards
  2. Kill the OpenSearch process on all cluster manager nodes so that the reroute flow is triggered
  3. You'll see the shards go into the initializing state quickly
  4. Actually starting these shards, however, takes more than 10 minutes

Expected behavior

Ideally we should not log the same content repeatedly when only the shardId differs. We should short circuit and log a smaller string to avoid this delay.
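As a rough illustration of that short circuit (hypothetical helper, not the actual fix): group identical task descriptions and log each distinct description once with a count, rather than repeating it per shard. The summarize method and the sample inputs below are illustrative only.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: collapse tasks whose description is identical (only the
// shard id differs) into a single "<description> x <count>" entry.
class DeduplicatedSummarySketch {

    static String summarize(List<String> taskDescriptions) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String description : taskDescriptions) {
            counts.merge(description, 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        counts.forEach((description, count) -> {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(description).append(" x ").append(count);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Descriptions with the shard id already stripped (illustrative input).
        List<String> tasks = List.of("shard-started", "shard-started", "shard-failed");
        System.out.println(summarize(tasks)); // shard-started x 2, shard-failed x 1
    }
}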

Additional Details

Hot threads

::: {6bb0987818c65f26bf4a1028fbc6d538}{l9ltzN3dT5GCddpWAkPLYg}{1JNBmxGoSrqq3Jpe4ywfgg}<redacted>
   Hot threads at 2024-02-06T16:08:40.878Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   99.8% (498.8ms out of 500ms) cpu usage by thread 'opensearch[6bb0987818c65f26bf4a1028fbc6d538][clusterManagerService#updateTask][T#1]'
     10/10 snapshots sharing following 15 elements
       [email protected]/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
       [email protected]/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850)
       [email protected]/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
       [email protected]/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
       [email protected]/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
       [email protected]/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
       [email protected]/java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:662)
       app//org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:202)
       app//org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:243)
       app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
       [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       [email protected]/java.lang.Thread.run(Thread.java:833)
@amkhar amkhar added bug Something isn't working untriaged labels Feb 8, 2024

shwetathareja commented Feb 9, 2024

We should look at 2 things:

  1. The debug log should be brief.
  2. The trace log can be more detailed but still needs to be capped (considering there can be 100K or more tasks batched together); it would be too expensive to log all of them in clusterManagerService#updateTask, which is single threaded. (See the sketch after this list.)
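A rough sketch of that split, with plain booleans standing in for logger.isDebugEnabled()/isTraceEnabled(), a hypothetical MAX_TRACE_TASKS cap, and illustrative method names (not the actual implementation):

import java.util.List;
import java.util.Map;

// Hypothetical sketch: debug gets a brief per-source count, trace gets detail
// but is capped at a fixed number of tasks per source.
class LevelAwareSummarySketch {

    private static final int MAX_TRACE_TASKS = 10;

    static String briefSummary(Map<String, List<String>> tasksBySource) {
        StringBuilder sb = new StringBuilder();
        tasksBySource.forEach((source, tasks) -> {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(source).append(" [").append(tasks.size()).append(" tasks]");
        });
        return sb.toString();
    }

    static String detailedSummary(Map<String, List<String>> tasksBySource) {
        StringBuilder sb = new StringBuilder();
        tasksBySource.forEach((source, tasks) -> {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            List<String> shown = tasks.subList(0, Math.min(tasks.size(), MAX_TRACE_TASKS));
            sb.append(source).append(shown);
            if (tasks.size() > MAX_TRACE_TASKS) {
                sb.append(" ... and ").append(tasks.size() - MAX_TRACE_TASKS).append(" more");
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> tasksBySource =
            Map.of("shard-started", List.of("s0", "s1", "s2"));
        boolean traceEnabled = false; // stand-in for logger.isTraceEnabled()
        System.out.println(traceEnabled ? detailedSummary(tasksBySource) : briefSummary(tasksBySource));
    }
}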


sumitasr commented Jul 4, 2024

Looking into it


sumitasr commented Jul 9, 2024

The task summary string is computed and passed to runTasks in MasterService:

private void runTasks(TaskInputs taskInputs) {

Usage:

Logging -

logger.debug("processing [{}]: ignoring, cluster-manager service not started", summary);

logger.debug("executing cluster state update for [{}]", summary);

logger.debug("failing [{}]: local node is no longer cluster-manager", summary);

logExecutionTime(computationTime, "compute cluster state update", summary);

logExecutionTime(executionTime, "notify listeners on unchanged cluster state", summary);

logger.trace("cluster state updated, source [{}]\n{}", summary, newClusterState);

logger.debug("cluster state updated, version [{}], source [{}]", newClusterState.version(), summary);

Passed as the source parameter in the ClusterChangedEvent object:

ClusterChangedEvent clusterChangedEvent = new ClusterChangedEvent(summary, newClusterState, previousClusterState);

Next steps: need to understand whether changing the summary value passed in ClusterChangedEvent will have any effect.

@rwali-aws rwali-aws added v2.16.0 Issues and PRs related to version 2.16.0 and removed v2.16.0 Issues and PRs related to version 2.16.0 labels Jul 11, 2024
@sumitasr

Looks like changing the summary value in ClusterChangedEvent should not have an impact on the flow. For now, I am working on introducing a short summary that will contain the task batching key instead of computing and logging the full task details.
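For illustration only (this is not the actual change), a short summary of that shape could be built from the batching key and a per-key count, which stays cheap even for very large batches because no per-task description is generated. BatchedTask and shortSummary below are hypothetical names.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a short summary derived from the task batching key and
// a count, instead of a full per-task description.
class ShortSummarySketch {

    // Illustrative stand-in for a batched task: only the fields needed here.
    record BatchedTask(String batchingKey, String source) {}

    static String shortSummary(List<BatchedTask> tasks) {
        Map<String, Integer> countsByKey = new LinkedHashMap<>();
        for (BatchedTask task : tasks) {
            countsByKey.merge(task.batchingKey(), 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder("Tasks batched with key: ");
        countsByKey.forEach((key, count) -> sb.append(key).append(" (").append(count).append(" tasks) "));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        List<BatchedTask> tasks = List.of(
            new BatchedTask("shard-started", "shard [0]"),
            new BatchedTask("shard-started", "shard [1]")
        );
        System.out.println(shortSummary(tasks)); // Tasks batched with key: shard-started (2 tasks)
    }
}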
