
[Bug] Memory leak on the Temporal worker side (OperatorSubscriber class objects are taking much of the memory) #1541

Open
jainshivsagar opened this issue Oct 8, 2024 · 4 comments

@jainshivsagar

jainshivsagar commented Oct 8, 2024

What are you really trying to do?

We are using the TypeScript SDK v1.9.0 for our Temporal worker.
We are load testing on a local system. At worker start, memory (heap) utilization was below 120 MB; after the load test it grew to roughly 300 MB, and 5-10 minutes after the load test ended it had not come down.
Please refer to the screenshots below from the Chrome DevTools Memory profiler:

Memory utilization after starting the worker:
[screenshot]

Memory utilization after load testing:
[screenshot]

Comparison of the first two heap snapshots:
[screenshot]

Environment/Versions

  • OS and processor: 2.6 GHz 6-core Intel Core i7, macOS Sequoia (v15.0.1)
  • Temporal version: Temporal TypeScript SDK v1.9.0
  • Deployment: we run the worker on Kubernetes nodes in the staging/prod environments.

Worker configuration:

```ts
import { Worker } from '@temporalio/worker';

const worker = await Worker.create({
  connection,
  namespace: config.Temporal.Namespace,
  taskQueue: config.Temporal.TaskQueue,
  workflowsPath: require.resolve('./workflows'),
  activities: getActivities(),
  // maxActivitiesPerSecond: 100,
  // maxTaskQueueActivitiesPerSecond: 100,
  // maxConcurrentActivityTaskExecutions: 2000,
  maxConcurrentWorkflowTaskExecutions: 200,
  maxConcurrentWorkflowTaskPolls: 100,
  // maxConcurrentActivityTaskPolls: 1000,
  // maxCachedWorkflows: 3000,
  maxConcurrentLocalActivityExecutions: 200,
  // workflowThreadPoolSize: 100,
  enableNonLocalActivities: false,
});
```
jainshivsagar added the bug (Something isn't working) label on Oct 8, 2024
@mjameswh
Contributor

mjameswh commented Oct 8, 2024

> After 5-10 minutes of load testing the memory utilization didn't come down.

What happens if you continue feeding workflows to the worker? Does memory continue to go up, or does it stay stable around some value, e.g. ~300MB?

The Worker caches Workflows in an LRU; a Workflow stays in the cache until it gets evicted either 1) to make room for another workflow that’s coming in, or 2) because processing of a Workflow Task failed. Completion of a Workflow doesn’t result in eviction.

That means that, assuming there are no Workflow Task failures, the sticky cache size should quickly grow to its maximum value and then stay at that number for a very long period of time (i.e. until the pod gets shut down).
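
For reference, here's a minimal sketch of capping the cache explicitly via the maxCachedWorkflows Worker option; the task queue name and numbers are placeholders, not recommendations:

```ts
import { Worker } from '@temporalio/worker';

// Minimal sketch: cap the sticky Workflow cache explicitly so that heap usage
// plateaus at a predictable level. Values below are placeholders.
async function runWorker(): Promise<void> {
  const worker = await Worker.create({
    // No `connection` given: the Worker connects to a local Temporal Server by default.
    taskQueue: 'load-test', // placeholder task queue
    workflowsPath: require.resolve('./workflows'),
    maxCachedWorkflows: 500, // least recently used Workflows are evicted beyond this
    maxConcurrentWorkflowTaskExecutions: 200,
  });
  await worker.run();
}

runWorker().catch((err) => {
  console.error(err);
  process.exit(1);
});
```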

mjameswh added the support label and removed the bug (Something isn't working) label on Oct 8, 2024
@jainshivsagar
Author

Hi @mjameswh,
I posted the above data after 3-4 rounds of load testing. After each round of testing, I observed that heap memory utilization grew incrementally. After the 4th round of testing and waiting 5-10 minutes, the heap memory utilization still had not come down.

@mjameswh
Contributor

> waiting for 5-10 mins, the heap memory utilization didn't come down.

As I said before, we do not expect a Worker's memory usage to come down once a Workflow has completed. Completed Workflows may still be queried, so caching them may still be beneficial.

What we'd expect is for memory usage to grow until the cache size reaches its maximum capacity (maxCachedWorkflows), after which memory usage should remain relatively stable, as less recently used Workflows will get evicted from cache.
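
One way to verify that is to watch the Worker's cache gauge during your test. A minimal sketch, assuming the Core Prometheus exporter and its usual sticky_cache_size metric name:

```ts
import { Runtime } from '@temporalio/worker';

// Minimal sketch: expose SDK Core metrics over Prometheus so the sticky cache
// size can be graphed next to process heap usage during the load test.
// The bind address is a placeholder; install the Runtime once, before Worker.create().
Runtime.install({
  telemetryOptions: {
    metrics: {
      prometheus: { bindAddress: '0.0.0.0:9464' },
    },
  },
});

// Then scrape http://<worker-host>:9464/metrics and watch the cache gauge
// (e.g. sticky_cache_size) together with the process heap metrics.
```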

> After each round of testing, I observed that heap memory utilization was growing incrementally.

  • How many Workflows get started per round?
  • What is the capacity of your Workflow cache, i.e. maxCachedWorkflows? If unsure, check the Worker options printed to the logs on Worker start.
  • Can you please try this with SDK 1.11.2?
  • In your heap snapshot, how many instances of VMWorkflowThreadProxy do you have? How many instances of Activity?

Your screenshot indicates 43,890 instances of OperatorSubscriber. That's certainly a lot, yet it could still be legitimate, depending on the number of cached Workflows; there are multiple OperatorSubscriber instances per cached Workflow and per pending Activity.

  • At which point in your test sequence was your heap snapshot taken?
  • By any chance, would it be possible for you to share your heap snapshot file? If you are a Temporal Cloud user, you may open a Zendesk ticket and include the file as a secure attachment. (A sketch of one way to capture a snapshot programmatically follows this list.)
  • Otherwise, could you please try to provide reproduction code that demonstrates this issue?
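
For capturing a snapshot from the running worker, a minimal sketch using Node's built-in v8 module (the signal and file name are just examples):

```ts
import { writeHeapSnapshot } from 'node:v8';

// Minimal sketch: write a .heapsnapshot file on demand (here on SIGUSR2) so it
// can be loaded into Chrome DevTools' Memory tab or attached to a ticket.
process.on('SIGUSR2', () => {
  const file = writeHeapSnapshot(`./worker-${Date.now()}.heapsnapshot`);
  console.log(`Heap snapshot written to ${file}`);
});
```

Sending `kill -USR2 <pid>` to the worker process then writes a file that DevTools can open.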

@mjameswh
Contributor

mjameswh commented Nov 5, 2024

@jainshivsagar Are you still observing issues, or may we close this ticket?
