TBS: apm-server never recovers from storage limit exceeded in rare cases #14923
On one occasion, all compactors simply stopped emitting log messages.
I have found the root cause. The compactors didn't crash or deadlock; it's just that no levels meet the compaction criteria.
Running a debugger on badger's compaction routine against my badger db that's stuck in this buggy state confirms it: no level qualifies for compaction. That explains why compactions will never run until a level grows to exceed maxTotalSize. Since apm-server stops badger writes when the storage limit is reached, badger will be stuck in this state forever.
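The deadlock described above can be sketched as follows. This is a minimal, hypothetical model (the names `level`, `maxTotalSize`, and `pickCompactLevel` are illustrative, not badger's actual internals): a compactor that only selects levels exceeding their size threshold, combined with a writer that is halted at the storage limit, makes no progress forever.

```go
package main

import "fmt"

// level is a simplified stand-in for an LSM tree level. The names here are
// illustrative assumptions, not badger's real types or fields.
type level struct {
	size         int64 // current on-disk size of the level
	maxTotalSize int64 // threshold above which the level is compaction-eligible
}

// pickCompactLevel mirrors the "no levels meet the compaction criteria"
// condition: nothing is eligible until some level outgrows its threshold.
func pickCompactLevel(levels []level) (int, bool) {
	for i, l := range levels {
		if l.size > l.maxTotalSize {
			return i, true
		}
	}
	return -1, false // compactor idles; expired keys are never reclaimed
}

func main() {
	// All levels are below their thresholds, and writes are stopped because
	// the storage limit is exceeded, so level sizes can never grow again.
	levels := []level{
		{size: 90, maxTotalSize: 100},
		{size: 500, maxTotalSize: 1000},
	}
	writesStopped := true // apm-server behavior once the storage limit is hit

	if _, ok := pickCompactLevel(levels); !ok && writesStopped {
		fmt.Println("stuck: no compaction eligible, and no writes to change that")
	}
}
```

Under this model the only exit from the stuck state is for a level to grow past its threshold, which can't happen while writes are disabled, matching the observed behavior.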
Tail-based sampling: there are observations where, after the storage limit is exceeded, the LSM size remains much higher than the vlog size.
The assumption is that the LSM size should usually be smaller than the vlog size. It is unclear whether compactions are running in the background to reclaim expired keys in the LSM tree. If vlog << lsm, vlog GC cannot reclaim much storage, and apm-server may be stuck indefinitely in a state where storage exceeds the limit.
There are also cases where vlog files in the file system are days old. My hypothesis is that the vlog GC thread is still running, but because it relies on stats produced by compactions, and compactions are not running, vlog files are never cleaned up.
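The hypothesis above can be sketched with a toy model. The names here (`discardStats`, `pickVlogFileForGC`, `discardRatio`) are assumptions for illustration, not badger's real internals: if discard statistics are only ever populated by compactions, then with compactions stalled the GC loop never finds a candidate file, and old vlog files linger on disk.

```go
package main

import "fmt"

// discardStats maps a vlog file ID to the number of bytes known to be
// discardable in it. In this model, compactions are the only writer of
// this map; no compactions means the map stays empty forever.
type discardStats map[uint32]int64

// pickVlogFileForGC returns a vlog file whose discardable fraction meets the
// discard ratio. With empty stats, no file ever qualifies, so GC runs but
// reclaims nothing.
func pickVlogFileForGC(stats discardStats, fileSizes map[uint32]int64, discardRatio float64) (uint32, bool) {
	for id, size := range fileSizes {
		if size > 0 && float64(stats[id])/float64(size) >= discardRatio {
			return id, true
		}
	}
	return 0, false
}

func main() {
	fileSizes := map[uint32]int64{1: 1 << 20, 2: 1 << 20}
	stats := discardStats{} // compactions stalled: nothing ever recorded

	if _, ok := pickVlogFileForGC(stats, fileSizes, 0.5); !ok {
		fmt.Println("vlog GC idle: no discard stats, no files reclaimed")
	}
}
```

If this model is right, the days-old vlog files are a downstream symptom of the compaction stall rather than a separate bug in the GC thread itself.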