
[BUG] Reduce TaskBatcher excessive logging in DEBUG mode #12249

Closed
amkhar opened this issue Feb 8, 2024 · 4 comments
Assignees
Labels
bug Something isn't working Cluster Manager v2.16.0 Issues and PRs related to version 2.16.0

Comments


amkhar commented Feb 8, 2024

Describe the bug

The following snippet is from TaskBatcher.runIfNotProcessed (see the hot threads below), where a summary of every batched task is built:

if (toExecute.isEmpty() == false) {
    final String tasksSummary = processTasksBySource.entrySet().stream().map(entry -> {
        String tasks = updateTask.describeTasks(entry.getValue());
        return tasks.isEmpty() ? entry.getKey() : entry.getKey() + "[" + tasks + "]";
    }).reduce((s1, s2) -> s1 + ", " + s2).orElse("");

While executing a pending task, we first build the task summary for logging. If the pending task's batchingKey has 200K tasks in its linked list, we end up collecting the summary of all of those tasks. This takes about 10 minutes and blocks the overall execution of all the tasks, even though the summary is only used for logging at the debug level.

Ideally we should not log this much even in debug mode, since computing the log string takes minutes.
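For illustration, here is a minimal sketch (not the actual OpenSearch code) of how the concatenation could be bounded: stop describing individual tasks once the summary reaches a fixed length and fall back to a count. MAX_SUMMARY_LENGTH, buildCappedSummary, and the map of plain strings are hypothetical stand-ins for the real types.

import java.util.List;
import java.util.Map;

// Hypothetical sketch: cap the task summary so the cost of building it no
// longer scales with the number of batched tasks.
class CappedSummarySketch {

    private static final int MAX_SUMMARY_LENGTH = 1024; // illustrative cap

    static String buildCappedSummary(Map<String, List<String>> tasksBySource) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : tasksBySource.entrySet()) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(entry.getKey()).append('[');
            boolean first = true;
            for (String task : entry.getValue()) {
                if (sb.length() >= MAX_SUMMARY_LENGTH) {
                    // Stop describing individual tasks and report a count instead.
                    sb.append(first ? "" : ", ").append("... ").append(entry.getValue().size()).append(" tasks in total");
                    break;
                }
                if (!first) {
                    sb.append(", ");
                }
                sb.append(task);
                first = false;
            }
            sb.append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildCappedSummary(
            Map.of("shard-started", List.of("shard [0]", "shard [1]"))));
        // prints: shard-started[shard [0], shard [1]]
    }
}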

Related component

Cluster Manager

To Reproduce

  1. Create 200K primary shards in a cluster that can take the load of this many shards
  2. Kill the OpenSearch process on all cluster manager nodes so that the reroute flow is triggered
  3. You'll see the shards go into the initializing state quickly
  4. Actually starting these shards, however, takes more than 10 minutes

Expected behavior

Ideally we should not log the same content repeatedly when only the shardId differs. We should short circuit and log a smaller string to avoid this delay.
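As a rough illustration of that short circuit (hypothetical helper, not the actual fix): group identical task descriptions and log each distinct description once with a count, rather than repeating it per shard. The summarize method and the sample inputs below are illustrative only.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: collapse tasks whose description is identical (only the
// shard id differs) into a single "<description> x <count>" entry.
class DeduplicatedSummarySketch {

    static String summarize(List<String> taskDescriptions) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String description : taskDescriptions) {
            counts.merge(description, 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        counts.forEach((description, count) -> {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(description).append(" x ").append(count);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Descriptions with the shard id already stripped (illustrative input).
        List<String> tasks = List.of("shard-started", "shard-started", "shard-failed");
        System.out.println(summarize(tasks)); // shard-started x 2, shard-failed x 1
    }
}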

Additional Details

Hot threads

::: {6bb0987818c65f26bf4a1028fbc6d538}{l9ltzN3dT5GCddpWAkPLYg}{1JNBmxGoSrqq3Jpe4ywfgg}<redacted>
   Hot threads at 2024-02-06T16:08:40.878Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   99.8% (498.8ms out of 500ms) cpu usage by thread 'opensearch[6bb0987818c65f26bf4a1028fbc6d538][clusterManagerService#updateTask][T#1]'
     10/10 snapshots sharing following 15 elements
       [email protected]/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
       [email protected]/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850)
       [email protected]/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
       [email protected]/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
       [email protected]/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
       [email protected]/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
       [email protected]/java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:662)
       app//org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:202)
       app//org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:243)
       app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
       [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       [email protected]/java.lang.Thread.run(Thread.java:833)
@amkhar amkhar added bug Something isn't working untriaged labels Feb 8, 2024

shwetathareja commented Feb 9, 2024

We should look at 2 things:

  1. The debug log should be brief.
  2. The trace log can be more detailed but still needs to be capped (considering there can be 100K or more tasks batched together); it would be too expensive to log all of them in clusterManagerService#updateTask, which is single threaded. (See the sketch after this list.)
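A rough sketch of that split, with plain booleans standing in for logger.isDebugEnabled()/isTraceEnabled(), a hypothetical MAX_TRACE_TASKS cap, and illustrative method names (not the actual implementation):

import java.util.List;
import java.util.Map;

// Hypothetical sketch: debug gets a brief per-source count, trace gets detail
// but is capped at a fixed number of tasks per source.
class LevelAwareSummarySketch {

    private static final int MAX_TRACE_TASKS = 10;

    static String briefSummary(Map<String, List<String>> tasksBySource) {
        StringBuilder sb = new StringBuilder();
        tasksBySource.forEach((source, tasks) -> {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(source).append(" [").append(tasks.size()).append(" tasks]");
        });
        return sb.toString();
    }

    static String detailedSummary(Map<String, List<String>> tasksBySource) {
        StringBuilder sb = new StringBuilder();
        tasksBySource.forEach((source, tasks) -> {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            List<String> shown = tasks.subList(0, Math.min(tasks.size(), MAX_TRACE_TASKS));
            sb.append(source).append(shown);
            if (tasks.size() > MAX_TRACE_TASKS) {
                sb.append(" ... and ").append(tasks.size() - MAX_TRACE_TASKS).append(" more");
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> tasksBySource =
            Map.of("shard-started", List.of("s0", "s1", "s2"));
        boolean traceEnabled = false; // stand-in for logger.isTraceEnabled()
        System.out.println(traceEnabled ? detailedSummary(tasksBySource) : briefSummary(tasksBySource));
    }
}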


sumitasr commented Jul 4, 2024

Looking into it


sumitasr commented Jul 9, 2024

The task summary string is computed and passed to runTasks in MasterService:

private void runTasks(TaskInputs taskInputs) {

Usage:

Logging -

logger.debug("processing [{}]: ignoring, cluster-manager service not started", summary);

logger.debug("executing cluster state update for [{}]", summary);

logger.debug("failing [{}]: local node is no longer cluster-manager", summary);

logExecutionTime(computationTime, "compute cluster state update", summary);

logExecutionTime(executionTime, "notify listeners on unchanged cluster state", summary);

logger.trace("cluster state updated, source [{}]\n{}", summary, newClusterState);

logger.debug("cluster state updated, version [{}], source [{}]", newClusterState.version(), summary);

Passed as the source parameter in the ClusterChangedEvent object:

ClusterChangedEvent clusterChangedEvent = new ClusterChangedEvent(summary, newClusterState, previousClusterState);

Next steps: need to understand whether changing the summary value passed in ClusterChangedEvent will have any effect.

@rwali-aws rwali-aws added v2.16.0 Issues and PRs related to version 2.16.0 and removed v2.16.0 Issues and PRs related to version 2.16.0 labels Jul 11, 2024
@sumitasr

Looks like changing the summary value in ClusterChangedEvent should not have an impact on the flow. For now, I am working on introducing a short summary that will contain the task batching key instead of computing and logging the full task details.
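For illustration only (this is not the actual change), a short summary of that shape could be built from the batching key and a per-key count, which stays cheap even for very large batches because no per-task description is generated. BatchedTask and shortSummary below are hypothetical names.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a short summary derived from the task batching key and
// a count, instead of a full per-task description.
class ShortSummarySketch {

    // Illustrative stand-in for a batched task: only the fields needed here.
    record BatchedTask(String batchingKey, String source) {}

    static String shortSummary(List<BatchedTask> tasks) {
        Map<String, Integer> countsByKey = new LinkedHashMap<>();
        for (BatchedTask task : tasks) {
            countsByKey.merge(task.batchingKey(), 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder("Tasks batched with key: ");
        countsByKey.forEach((key, count) -> sb.append(key).append(" (").append(count).append(" tasks) "));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        List<BatchedTask> tasks = List.of(
            new BatchedTask("shard-started", "shard [0]"),
            new BatchedTask("shard-started", "shard [1]")
        );
        System.out.println(shortSummary(tasks)); // Tasks batched with key: shard-started (2 tasks)
    }
}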
