availability-distribution: potential memory leak #5258

Closed
sandreim opened this issue Aug 6, 2024 · 2 comments · Fixed by #5321
Labels
I2-bug The node fails to follow expected behavior.

Comments

@sandreim
Contributor

sandreim commented Aug 6, 2024

This is a nice pattern of increased CPU usage over time. I didn't look much at it, but it might be some iteration of a collection that doesn't get cleaned up.

Screenshot 2024-08-06 at 14 35 31

https://grafana.teleport.parity.io/goto/9zQcAE9IR?orgId=1

CC @alindima

@sandreim sandreim added the I2-bug The node fails to follow expected behavior. label Aug 6, 2024
@alindima
Contributor

I looked into this.

The subsystem itself is not doing any work in the main loop, as it spawns separate tasks for everything.

The CPU usage for the tasks that the subsystem spawns (pov-receiver, chunk-receiver, pov-fetcher, chunk-fetcher) all look normal.

I discovered the culprit: Jaeger spans.
We hold a local hashmap of spans:

let mut spans: HashMap<Hash, jaeger::PerLeafSpan> = HashMap::new();

which adds a new entry on every leaf activation, but only gets cleaned up when a block is finalized:

FromOrchestra::Signal(OverseerSignal::BlockFinalized(hash, _)) => {

However, when blocks get finalized, we are only notified about the block with the largest height (the leaf), so we never clean up the spans for relay parents that were finalized in the same batch (GRANDPA finalizes chains, not individual blocks). As a result, the hashmap grows almost monotonically.
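To make the growth concrete, here is a minimal, self-contained simulation. It is a sketch and not the subsystem's code: the block hash is replaced by a `u64` height, `jaeger::PerLeafSpan` by a placeholder `Span`, and the function `simulate_leak` is hypothetical. It assumes one leaf per height and that a finality notification arrives only for every `batch`-th block.

```rust
use std::collections::HashMap;

/// Placeholder for `jaeger::PerLeafSpan` (assumption: the real type does not matter here).
struct Span;

/// Simulates `blocks` leaf activations where only every `batch`-th block
/// produces a finality notification. Returns the number of spans that
/// are still in the map at the end.
fn simulate_leak(blocks: u64, batch: u64) -> usize {
    let mut spans: HashMap<u64, Span> = HashMap::new();
    for height in 1..=blocks {
        // Leaf activation: one new entry per relay parent.
        // The height stands in for the block hash.
        spans.insert(height, Span);
        if height % batch == 0 {
            // Only the highest finalized block is reported, so only its
            // entry is removed; its finalized ancestors stay behind.
            spans.remove(&height);
        }
    }
    spans.len()
}
```

Under these assumptions, `simulate_leak(100, 5)` leaves 80 of the 100 entries in the map, and the leftover count keeps growing with the number of blocks.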

Now, to fix this there are a couple of options:

  1. Remove Jaeger altogether: Remove jaeger #4995
  2. Fix this somehow. The simplest idea is to also store the block number for each span and, when a block is finalized, trim all spans with lower block heights. Another option is to modify the overseer to also supply the chain of active leaves that were deactivated by this finalization.
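The first idea in option 2 could look roughly like the sketch below. This is not necessarily what the linked PR does. `SpanTracker` and its methods are hypothetical names, `Hash` and `BlockNumber` are simple stand-ins for the real types, and `Span` is a placeholder for `jaeger::PerLeafSpan`.

```rust
use std::collections::HashMap;

type Hash = u64; // stand-in for the real block hash type (assumption)
type BlockNumber = u32; // stand-in for the relay chain block number (assumption)

/// Placeholder for `jaeger::PerLeafSpan`.
struct Span;

/// Hypothetical helper that remembers the block number next to each span.
#[derive(Default)]
struct SpanTracker {
    spans: HashMap<Hash, (BlockNumber, Span)>,
}

impl SpanTracker {
    /// Called on every leaf activation.
    fn on_leaf_activated(&mut self, hash: Hash, number: BlockNumber) {
        self.spans.insert(hash, (number, Span));
    }

    /// Called on a finality notification. Keeps only the spans above the
    /// finalized height, so ancestors finalized in the same batch are
    /// dropped together with the finalized block itself.
    fn on_block_finalized(&mut self, finalized_number: BlockNumber) {
        self.spans.retain(|_, (number, _)| *number > finalized_number);
    }

    /// Number of spans currently held.
    fn len(&self) -> usize {
        self.spans.len()
    }
}
```

Pruning by height rather than by hash also drops spans for abandoned forks at or below the finalized height, which would otherwise never receive a finality notification of their own.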

@alindima
Contributor

Here's a PR that fixes the issue: #5321

Projects
Status: Completed