Potential missing spans and non-root transactions when TBS storage limit is reached #11857
Comments
I looked into the issue. Reproducing it requires neither multiple APM servers nor multiple agents, and this particular scenario does not involve ES pubsub. TLDR: Our existing understanding that TBS indexes every event when the storage limit is reached is wrong. (TODO: docs will need to be updated, as a recent update stated this assumption.) When storage limit is reached, in
I took a brief look at the code. Currently, TBS has multiple (GOMAXPROCS) ReadWriters, each of which flushes only after 200 writes have been batched up in the transaction. We did this for performance. The storage limit check is performed only during flush.
Example 1: if apm-server starts with the TBS storage limit already reached, a ReadWriter will happily write events and return to the caller without indicating any error, only to return an error on the 200th event and discard the other 199 events in the transaction. Since the 200th event write returns an error, the caller knows that it has to write that event directly to ES. The first 199 events, however, are effectively dropped without any error handling.
Proposed solutions
Solution 1
When flush fails, decode all the events in the transaction and index them to ES.
Bad
Solution 2
Instead of checking the storage limit during flush, we check it in every writeEntry call.
Good
Bad
Solution 3
Same as solution 2, but stop writing new events at, say, 90% of the storage limit. Once 90% of the storage limit is reached, flush all the pending events in the transaction and return an error for all new events.
Bad
Solution 4
Solution 3, but add tracking of the in-flight transaction size so that we don't overshoot the storage limit.
Bad
@axw let's go over these and also discuss if there are better alternatives. cc @marclop as you initially wrote the code I believe. The idea is that we may not need a bulletproof solution, but more of a mitigation that is good enough.
I think (4) sounds OK. I vaguely recall considering this option in the first place - performing atomic updates to a counter to track how many bytes have been written, rather than writing them and then calculating after the fact. I don't recall any more than that, but we probably didn't do this because it's more complicated.
Right, that sounds like a proper fix. At first glance there are no public methods to get a size of a transaction / an entry, despite a private |
APM Server version (apm-server version): reproduced in 8.10 but should affect more versions
Description of the problem including expected versus actual behavior:
Steps to reproduce:
Provide logs (if relevant):