TBS: apm-server never recovers from storage limit exceeded in rare cases #14923

Closed
Tracked by #14931
carsonip opened this issue Dec 12, 2024 · 3 comments · Fixed by #15106

carsonip commented Dec 12, 2024

Tail-based sampling: there are observations where, after the storage limit is exceeded, the LSM size remains much higher than the vlog size.

The assumption is that the LSM size should usually be smaller than the vlog size. It is unclear whether compactions are run in the background to reclaim expired keys in the LSM tree. If vlog << lsm, any vlog GC would not be effective in reclaiming storage, and apm-server may be stuck indefinitely in a state where storage exceeds the limit.

There are also cases where vlog files on the file system are days old. My hypothesis is that the vlog GC thread is still running, but because it relies on stats from compactions, and compactions are not being run, vlog files are not cleaned up.
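
For context, the TBS processor drives value-log GC with a periodic loop along these lines (a minimal sketch assuming badger v2's RunValueLogGC API; the interval, discard ratio, names, and loop structure are illustrative, not apm-server's actual code):

```go
package sampling

import (
	"errors"
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v2"
)

// runValueLogGCLoop is a hypothetical sketch of a periodic vlog GC loop.
// RunValueLogGC only rewrites a vlog file when badger's discard stats say
// enough of it is stale, and those stats are populated by compactions,
// which is why stalled compactions starve vlog GC.
func runValueLogGCLoop(db *badger.DB, stop <-chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// Keep collecting until badger reports nothing left to rewrite.
			for {
				err := db.RunValueLogGC(0.5)
				if errors.Is(err, badger.ErrNoRewrite) {
					break
				}
				if err != nil {
					log.Printf("value log GC failed: %v", err)
					break
				}
			}
		}
	}
}
```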


carsonip commented Dec 17, 2024

On one occasion all compactors simply stopped logging messages like [Compactor: 1] Compaction for level: 2 DONE, and there are no other interesting logs around that time. This indicates that no compactions are running, and since vlog GC relies on compaction stats, effectively no vlog GC is done either. The question remains: why does compaction stop? It is unclear whether it is a deadlock or a crash. Debug logs from a reproducible setup would be useful.

carsonip commented:

Managed to reproduce the issue by spamming small events. apm-server TBS gets stuck in a state where LSM size >> vlog size, and compaction and GC never run again.

carsonip self-assigned this Dec 18, 2024

carsonip commented Dec 18, 2024

I have found the root cause. The compactors didn't crash or deadlock; it's just that no level meets the compaction criteria.

  • Basically, there are 2 compactors running, each executing runCompactor (code).
  • Each of them looks for levels that are candidates for compaction via s.pickCompactLevels() (code).
  • However, in reality there is no candidate.
  • Zooming into how candidates are evaluated (code), a level is only added if it is "compactable", as decided by l.isCompactable(delSize), where delSize is the size already under compaction (usually 0).
  • isCompactable (code) is simply l.getTotalSize()-delSize >= l.maxTotalSize; see the sketch below.
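
A minimal sketch of that check, with names paraphrased from badger's levels controller (the real check is a method on the level handler; this is just the arithmetic):

```go
// isCompactable paraphrases badger's l.getTotalSize()-delSize >= l.maxTotalSize:
// a level only becomes a compaction candidate once its total table size,
// minus the size already being compacted (delSize), reaches the level's
// size threshold (maxTotalSize).
func isCompactable(totalSize, delSize, maxTotalSize int64) bool {
	return totalSize-delSize >= maxTotalSize
}
```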

Running a debugger on the badger compaction routine of my badger db that's stuck in this buggy state:
L1: totalSize 262717549 maxTotalSize 268435456 delSize 0
L2: totalSize 2438371277 maxTotalSize 2684354560 delSize 0

That explains why compactions will never be done until a level grows to exceed maxTotalSize.
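
Plugging the debugger numbers into the isCompactable sketch above:

```go
isCompactable(262717549, 0, 268435456)   // L1: false, about 5.7 MB below its threshold
isCompactable(2438371277, 0, 2684354560) // L2: false, about 246 MB below its threshold
```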

Since apm-server stops badger writes when the storage limit is reached, badger will be stuck in this state forever.
