TBS: apm-server never recovers from storage limit exceeded in rare cases #14923

Closed
Tracked by #14931
carsonip opened this issue Dec 12, 2024 · 3 comments · Fixed by #15106

carsonip commented Dec 12, 2024

Tail-based sampling: there are observations where, after the storage limit is exceeded, the LSM size remains much higher than the vlog size.

The assumption is that the LSM size should usually be smaller than the vlog size. It is unclear whether compactions are run in the background to reclaim expired keys in the LSM tree. If vlog << lsm, any vlog GC would not be effective in reclaiming storage, and apm-server may be stuck indefinitely in a state where storage exceeds the limit.

There are also cases where vlog files on the file system are days old. My hypothesis is that the vlog GC thread is still running, but because it relies on stats from compactions, and compactions are not being run, vlog files are not cleaned up.
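
For context, the TBS processor drives value-log GC with a periodic loop along these lines (a minimal sketch assuming badger v2's RunValueLogGC API; the interval, discard ratio, names, and loop structure are illustrative, not apm-server's actual code):

```go
package sampling

import (
	"errors"
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v2"
)

// runValueLogGCLoop is a hypothetical sketch of a periodic vlog GC loop.
// RunValueLogGC only rewrites a vlog file when badger's discard stats say
// enough of it is stale, and those stats are populated by compactions,
// which is why stalled compactions starve vlog GC.
func runValueLogGCLoop(db *badger.DB, stop <-chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// Keep collecting until badger reports nothing left to rewrite.
			for {
				err := db.RunValueLogGC(0.5)
				if errors.Is(err, badger.ErrNoRewrite) {
					break
				}
				if err != nil {
					log.Printf("value log GC failed: %v", err)
					break
				}
			}
		}
	}
}
```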


carsonip commented Dec 17, 2024

On one occasion all compactors simply stopped logging messages like [Compactor: 1] Compaction for level: 2 DONE, and there are no other interesting logs around that time. This indicates that no compactions are running, and since vlog GC relies on compaction stats, effectively no vlog GC is done either. The question remains: why does compaction stop? It is unclear whether it is a deadlock or a crash. Debug logs from a reproducible setup would be useful.

carsonip commented:

Managed to reproduce the issue by spamming small events. apm-server TBS gets stuck in a state where LSM size >> vlog size, and compaction and GC never run again.

carsonip self-assigned this Dec 18, 2024

carsonip commented Dec 18, 2024

I have found the root cause. The compactors didn't crash or deadlock; it's just that no level meets the compaction criteria.

  • Basically, there are 2 compactors running, each executing runCompactor (code).
  • Each of them looks for levels that are candidates for compaction via s.pickCompactLevels() (code).
  • However, in reality there is no candidate.
  • Zooming into how candidates are evaluated (code), a level is only added if it is "compactable", as decided by l.isCompactable(delSize), where delSize is the size already under compaction (usually 0).
  • isCompactable (code) is simply l.getTotalSize()-delSize >= l.maxTotalSize; see the sketch below.
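
A minimal sketch of that check, with names paraphrased from badger's levels controller (the real check is a method on the level handler; this is just the arithmetic):

```go
// isCompactable paraphrases badger's l.getTotalSize()-delSize >= l.maxTotalSize:
// a level only becomes a compaction candidate once its total table size,
// minus the size already being compacted (delSize), reaches the level's
// size threshold (maxTotalSize).
func isCompactable(totalSize, delSize, maxTotalSize int64) bool {
	return totalSize-delSize >= maxTotalSize
}
```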

Running a debugger on the badger compaction routine of my badger db that's stuck in this buggy state:
L1: totalSize 262717549 maxTotalSize 268435456 delSize 0
L2: totalSize 2438371277 maxTotalSize 2684354560 delSize 0

That explains why compactions will never be done until a level grows to exceed maxTotalSize.
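
Plugging the debugger numbers into the isCompactable sketch above:

```go
isCompactable(262717549, 0, 268435456)   // L1: false, about 5.7 MB below its threshold
isCompactable(2438371277, 0, 2684354560) // L2: false, about 246 MB below its threshold
```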

Since apm-server stops badger writes when the storage limit is reached, badger will be stuck in this state forever.
