[meta] Tail-based sampling (TBS) improvements #14931

Open
7 tasks
carsonip opened this issue Dec 12, 2024 · 1 comment
@carsonip
Member

carsonip commented Dec 12, 2024

This is a meta-issue on tail-based sampling.

Tail-based sampling (TBS) comes up frequently in bug reports because there is minimal documentation and guidance on configuring it. Users are unclear on how TBS works, which leads to a misconfigured TBS storage size and, consequently, to apm-server and Elasticsearch (ES) issues.

When TBS local storage (badger) fills up, writing traces fails, and apm-server logs "received error writing sampled trace: configured storage limit reached (current: 127210377485, limit: 126000000000)". TBS is then bypassed and the effective sampling rate jumps to 100%. This causes a performance cliff with downstream effects: a surprising, significant increase in writes to ES, which either slows ES down and causes backpressure on apm-server, or leads to unexpectedly high storage usage in ES.
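To make the performance cliff concrete, here is a minimal toy model of the behavior described above. It is not apm-server code: `ToyTBSStorage`, `process`, and the byte sizes are hypothetical names and numbers chosen for illustration, and the model assumes that stored events reach ES at the configured sample rate while events that cannot be stored are all forwarded.

```python
from dataclasses import dataclass


@dataclass
class ToyTBSStorage:
    """Hypothetical stand-in for TBS local storage (not apm-server code)."""
    limit_bytes: int
    used_bytes: int = 0

    def write(self, size_bytes: int) -> bool:
        """Return True if the event fits, False once the limit is reached."""
        if self.used_bytes + size_bytes > self.limit_bytes:
            return False
        self.used_bytes += size_bytes
        return True


def process(event_sizes, storage, sample_rate):
    """Return (stored, bypassed, expected_es_writes) for a stream of events.

    Assumption: stored events are subject to the sampling decision, so on
    average only `sample_rate` of them reach ES. Events that cannot be
    stored bypass TBS entirely (effective sampling rate of 100%).
    """
    stored = bypassed = 0
    for size in event_sizes:
        if storage.write(size):
            stored += 1
        else:
            bypassed += 1
    expected_es_writes = stored * sample_rate + bypassed
    return stored, bypassed, expected_es_writes


# 200 events of 10 bytes against a 1000-byte limit, sampled at 10%.
storage = ToyTBSStorage(limit_bytes=1000)
print(process([10] * 200, storage, sample_rate=0.1))
```

In this toy run the first half of the stream fits in storage and the second half does not, so the bypassed half accounts for roughly ten times as many expected ES writes as the sampled half.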

The task list contains tasks to document TBS properly, investigate and fix bugs, and provide escape hatches for compromises.

Impact: TBS is a popular feature among heavy apm-server users, who rely on it to reduce ES storage requirements while retaining the value of the sampled traces. We need to ensure, and demonstrate, that TBS holds up under high load like the rest of apm-server.

Tasks

  1. docs performance
  2. enhancement
  3. enhancement
  4. bug
    carsonip
  5. enhancement
    kruskall
  6. 9.0-candidate enhancement
  7. enhancement
@lucabelluccini
Contributor

lucabelluccini commented Dec 19, 2024

  • [ESS/ECE only] Ability to see the disk size of Integration servers on the fly (or, even better, the live available disk space) in the Admin Console https://github.com/elastic/cloud/issues/128879
    • Mitigation until then: point users to documentation that explains how to find the disk size
  • [ESS priority] Ability to automatically set the TBS max disk usage in the Integration policy as a percentage of the whole disk, OR set it automatically to a sane maximum value and freeze it (so the customer cannot exceed the maximum)
  • [ESS/ECE and on-premise] Ability to monitor TBS disk-related metrics on self-hosted APM Servers, Integration Servers, and Integration Servers in ESS via at least a Dashboard (it is likely not possible to add new graphs to Stack Monitoring). The dashboards could be shipped with the apm input package or with the Elastic Agent
  • [ALL] Once the metrics are shipped, it would be nice to provide an out-of-the-box alert that fires when the disk is filling up due to TBS or the soft limit is hit, so that users know when APM Server will start letting all transactions through.
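
Until such metrics exist, the log message quoted in the issue description is one signal that the limit has already been hit. Below is a rough sketch of extracting the numbers from that message and applying a threshold check. The function names and the 80% threshold are hypothetical, and the regular expression only assumes the message format shown in the issue description.

```python
import re

# Matches the message format quoted in the issue description.
_LIMIT_RE = re.compile(
    r"configured storage limit reached \(current: (\d+), limit: (\d+)\)"
)


def parse_storage_limit_error(line):
    """Return (current_bytes, limit_bytes), or None if the line does not match."""
    match = _LIMIT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def should_alert(current_bytes, limit_bytes, threshold=0.8):
    """Hypothetical rule: alert once usage reaches `threshold` of the limit."""
    return limit_bytes > 0 and current_bytes / limit_bytes >= threshold


line = (
    "received error writing sampled trace: configured storage limit reached "
    "(current: 127210377485, limit: 126000000000)"
)
parsed = parse_storage_limit_error(line)
if parsed is not None:
    current, limit = parsed
    print(current - limit, should_alert(current, limit))
```

Because the log line only appears after the limit is reached, the threshold check becomes useful mainly once usage is available as a metric; the parser alone can only confirm that TBS is already being bypassed.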
