[8.17] TBS: drop and recreate badger db after exceeding storage limit for TTL time (backport #15106) #15169

Merged
merged 2 commits into from
Jan 7, 2025

@mergify mergify bot commented Jan 7, 2025

Motivation/summary

Workaround for cases where apm-server stays stuck in the "storage limit exceeded" state indefinitely because the conditions for badger DB compaction are never satisfied. This PR adds a goroutine that detects this state. If the state persists for at least the TTL, all entries in the badger DB will have expired by then, so the goroutine drops and recreates the DB to get out of this state.

Checklist

- [ ] Update CHANGELOG.asciidoc: this change will be backported, so the changelog entry should be added on the docs release.

  • Documentation has been updated

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

There are two ways to test:

  1. Start apm-server with TBS enabled and use apmsoak to send many small events (<1KB each). Confirm that the server is affected by "TBS: apm-server never recovers from storage limit exceeded in rare cases" (#14923). Then wait for the TTL to elapse and verify that the DB is deleted and recreated and that writes to badger resume.
  2. Manually bloat the badger DB with either APM events or other irrelevant data, then start apm-server. Verify that the DB is deleted and recreated and that writes to badger resume.

Related issues

Alternative to #15081

Fixes #14923


This is an automatic backport of pull request #15106 done by [Mergify](https://mergify.com).

…L time (#15106)

(cherry picked from commit a902d3c)
@mergify mergify bot added the backport label Jan 7, 2025
@mergify mergify bot requested a review from a team as a code owner January 7, 2025 21:56
@mergify mergify bot merged commit ed33af6 into 8.17 Jan 7, 2025
13 checks passed
@mergify mergify bot deleted the mergify/bp/8.17/pr-15106 branch January 7, 2025 23:46