
Potential missing spans and non-root transactions when TBS storage limit is reached #11857

Closed
carsonip opened this issue Oct 16, 2023 · 4 comments · Fixed by #12120

@carsonip
Member

carsonip commented Oct 16, 2023

APM Server version (apm-server version): reproduced in 8.10, but likely affects other versions as well

Description of the problem including expected versus actual behavior:

  • Expected: When TBS is enabled and the storage limit is reached, traces should still be complete, since every event is supposed to be indexed by default.
  • Actual: When TBS is enabled and the storage limit is reached, spans that are expected in the trace may be missing.

Steps to reproduce:


  1. Set up one APM server with TBS enabled and a low storage limit.
  2. Configure one APM agent to send transactions containing spans at high throughput.
  3. Observe that traces may look incomplete, with spans missing from the waterfall.

Provide logs (if relevant):

@carsonip carsonip added the bug label Oct 16, 2023
@carsonip carsonip self-assigned this Oct 16, 2023
@carsonip carsonip changed the title Potential missing spans when TBS storage limit is reached in a multi APM server setup Potential missing spans when TBS storage limit is reached Oct 16, 2023
@carsonip
Member Author

carsonip commented Oct 16, 2023

I looked into the issue. It does not require multiple APM servers or multiple agents to reproduce. This particular scenario does not involve ES pubsub.

TL;DR: Our existing understanding that TBS indexes every event once the storage limit is reached is incorrect. (TODO: the docs will need to be updated, as a recent update documented this assumption even though it does not hold.)

When the storage limit is reached, it is highly likely that writeEntry (called from processTransaction or processSpan within ProcessBatch) does not return an error, because we batch writes for performance. As a result, processTransaction or processSpan returns err == nil && report == false && stored == true, while in reality the events are neither stored nor reported. Therefore, when the root transaction ends and a sampling decision is published, the goroutine handling the decision only finds a subset, or an empty set, of the trace's events in ReadTraceEvents.
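
For illustration, here is a minimal, self-contained sketch (not the actual apm-server code; the function bodies and behaviour here are hypothetical) of the caller-side effect described above: because the batched writeEntry returns nil while the event is only buffered, the caller concludes report == false && stored == true even when the event is later discarded at flush time.

```go
package main

import "fmt"

// writeEntry stands in for the batched storage write: it returns nil while the
// event is only buffered in memory, deferring the storage-limit check to a
// later flush.
func writeEntry(event string) error {
	return nil // batched: an error would only surface on the flushing write
}

// processEvent mirrors the shape of processTransaction/processSpan: on a nil
// write error it assumes the event is safely stored awaiting a trace decision.
func processEvent(event string) (report, stored bool, err error) {
	if err := writeEntry(event); err != nil {
		// Storage failed: fall back to reporting (indexing) the event directly.
		return true, false, nil
	}
	// err == nil, so the event is assumed stored and is not reported.
	return false, true, nil
}

func main() {
	report, stored, _ := processEvent("span-1")
	fmt.Printf("report=%v stored=%v\n", report, stored)
	// Prints report=false stored=true, yet if the eventual flush fails the
	// event is neither stored nor reported, and ReadTraceEvents will miss it.
}
```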

[Screenshot: trace waterfall showing only the root transaction, with no spans]

  • [See screenshot] When throughput is extremely high, the trace contains only the root transaction, with no spans.
  • Under low throughput, the chance of reproducing it is low, probably because internal buffers in badger cause ReadTraceEvents to return the full set of events in a trace.
  • Under moderate throughput, it is therefore likely that some spans / child transactions are dropped at random.

@carsonip carsonip changed the title Potential missing spans when TBS storage limit is reached Potential missing spans and non-root transactions when TBS storage limit is reached Oct 16, 2023
@carsonip
Member Author

carsonip commented Oct 18, 2023

I took a brief look at the code. Currently, TBS has multiple (GOMAXPROCS) ReadWriters, and each one only flushes after 200 writes have been batched up in its transaction. We did this for performance. The storage limit check is only performed during flush.

Example: if apm-server starts with the TBS storage limit already reached, a ReadWriter will happily write events and return to the caller without indicating any error, only to return an error on the 200th event and discard the other 199 events in the transaction. Since the 200th write returns an error, the caller knows it has to write that event directly to ES; the first 199 events, however, are effectively dropped without any error handling.
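
A rough, self-contained sketch of that batching behaviour (hypothetical names such as batchedWriter and flushThreshold, not the real ReadWriter implementation): with the storage limit already exceeded, the first 199 writeEntry calls return nil and only the 200th surfaces an error, at which point the whole batch is dropped.

```go
package main

import (
	"errors"
	"fmt"
)

var errStorageLimitReached = errors.New("configured storage limit reached")

type batchedWriter struct {
	pending        int  // entries buffered in the current transaction
	flushThreshold int  // 200 in the scenario described above
	limitReached   bool // pretend the on-disk limit is already exceeded
}

// writeEntry buffers the entry; the storage limit is only consulted at flush.
func (w *batchedWriter) writeEntry() error {
	w.pending++
	if w.pending < w.flushThreshold {
		return nil // acknowledged, but only held in memory
	}
	return w.flush()
}

// flush checks the limit and, on failure, drops the entire batch, including
// every entry previously acknowledged with a nil error.
func (w *batchedWriter) flush() error {
	defer func() { w.pending = 0 }()
	if w.limitReached {
		return errStorageLimitReached
	}
	return nil
}

func main() {
	w := &batchedWriter{flushThreshold: 200, limitReached: true}
	silentlyDropped := 0
	for i := 1; i <= 200; i++ {
		if err := w.writeEntry(); err != nil {
			fmt.Printf("write %d failed: %v (caller falls back to indexing)\n", i, err)
			continue
		}
		silentlyDropped++
	}
	fmt.Printf("%d writes acknowledged but never persisted\n", silentlyDropped)
	// Output: 199 writes acknowledged but never persisted
}
```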

Proposed solutions

Solution 1

When flush fails, decode all the events in the transaction and index them to ES.

Bad

  • May kill performance
  • writeEntry handles both trace events and trace decisions

Solution 2

Instead of checking the storage limit only during flush, check it on every writeEntry call (see the sketch after the list below).

Good

  • storage size call seems to be cached, so performance should be ok

Bad

  • it is possible that the storage limit is not reached when writeEntry is called for the 1st event, but is reached by the time the 2nd event comes in. Do we drop the 1st event "silently"?
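
A rough sketch of what Solution 2 could look like, assuming the storage size is available from a periodically refreshed cache; all names here (limitedWriter, cachedSize, storageLimit) are hypothetical, not the apm-server API:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errStorageLimitReached = errors.New("configured storage limit reached")

type limitedWriter struct {
	cachedSize   atomic.Int64 // refreshed in the background, e.g. once per second
	storageLimit int64
	pending      [][]byte // entries buffered in the current transaction
}

// writeEntry rejects the write up front when the cached size already exceeds
// the limit, so the caller can index the event directly instead of assuming it
// was stored. Note the caveat above: entries buffered before the limit was
// crossed are still only handled at flush time.
func (w *limitedWriter) writeEntry(e []byte) error {
	if w.cachedSize.Load() >= w.storageLimit {
		return errStorageLimitReached
	}
	w.pending = append(w.pending, e)
	return nil
}

func main() {
	w := &limitedWriter{storageLimit: 1 << 30}
	w.cachedSize.Store(2 << 30) // simulate a database already over the limit
	if err := w.writeEntry([]byte("event")); err != nil {
		fmt.Println("caller sees the error immediately and can report the event:", err)
	}
}
```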

Solution 3

Same as Solution 2, but we stop writing new events at, say, 90% of the storage limit: once that threshold is reached, flush all the pending events in the transaction and return an error for any new events (see the sketch after the list below).

Bad

  • May exceed the storage limit, as we have multiple ReadWriters that will flush at around the same time
  • 90% may be a bad heuristic depending on average event size and GOMAXPROCS; we may waste 10% of the storage limit if the number is too conservative.
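
For illustration, a sketch of Solution 3 under hypothetical names (softLimitWriter, usedBytes): once usage crosses the soft threshold, the writer flushes whatever it has already buffered and rejects new entries so the caller can index them directly. The caveats listed above (overshoot with multiple writers, wasted headroom) are not addressed here.

```go
package main

import (
	"errors"
	"fmt"
)

var errSoftLimitReached = errors.New("soft storage limit reached")

type softLimitWriter struct {
	pending      [][]byte
	usedBytes    int64 // would come from the storage engine's size estimate
	storageLimit int64
}

// writeEntry refuses new entries once usage crosses 90% of the limit, first
// flushing whatever was already acknowledged so it is not silently dropped.
func (w *softLimitWriter) writeEntry(e []byte) error {
	if w.usedBytes >= w.storageLimit*9/10 {
		w.flush()
		return errSoftLimitReached
	}
	w.pending = append(w.pending, e)
	return nil
}

// flush stands in for committing the batched transaction to storage.
func (w *softLimitWriter) flush() {
	w.usedBytes += int64(len(w.pending)) * 1024 // crude per-entry size guess
	w.pending = nil
}

func main() {
	w := &softLimitWriter{usedBytes: 950, storageLimit: 1000}
	if err := w.writeEntry([]byte("event")); err != nil {
		fmt.Println("new event rejected, previously buffered events flushed:", err)
	}
}
```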

Solution 4

Same as Solution 3, but also track the size of in-flight transactions so that we don't overshoot the storage limit.

Bad

  • Too complex
  • May need to synchronize between ReadWriters

@axw let's go over these and also discuss whether there are better alternatives. cc @marclop, as I believe you initially wrote the code.

The idea is that we may not need a bulletproof solution, just a mitigation that is good enough.

@axw
Member

axw commented Oct 19, 2023

I think (4) sounds OK. I vaguely recall considering this option in the first place - performing atomic updates to a counter to track how many bytes have been written, rather than writing them and then calculating after the fact. I don't recall any more than that, but we probably didn't do this because it's more complicated.

@carsonip
Member Author

Right, that sounds like a proper fix.

At first glance there are no public methods to get the size of a transaction or an entry, despite a private txn.size field. I guess we could duplicate a little of the size-estimation logic to track the pending write size; this would still be better than writing events and calculating after the fact (what we do now). We can go into the implementation details once this task is scheduled, but the point is that the fix is not going to be a simple mitigation.
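
To make the direction concrete, here is a sketch of the atomic-counter approach discussed above (Solution 4). The size estimation and all names are hypothetical, since badger's txn.size is private and the real fix would need its own accounting: each write reserves its estimated size against a shared counter before buffering, so concurrent ReadWriters cannot collectively overshoot the limit.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errStorageLimitReached = errors.New("configured storage limit reached")

// sharedUsage would be shared by all ReadWriters: bytes already on disk plus
// bytes pending in any open transaction.
type sharedUsage struct {
	bytes atomic.Int64
	limit int64
}

// reserve atomically claims n bytes up front, backing the claim out and
// failing if the limit would be exceeded.
func (u *sharedUsage) reserve(n int64) error {
	if u.bytes.Add(n) > u.limit {
		u.bytes.Add(-n) // undo the reservation
		return errStorageLimitReached
	}
	return nil
}

// estimateEntrySize approximates what the entry will occupy once written; this
// duplicates a little of the storage engine's own accounting, as noted above.
func estimateEntrySize(key, value []byte) int64 {
	const perEntryOverhead = 64 // placeholder for key metadata, versioning, etc.
	return int64(len(key)+len(value)) + perEntryOverhead
}

// writeEntry reserves the estimated size before buffering the write, so the
// caller learns immediately when the event cannot be stored.
func writeEntry(u *sharedUsage, key, value []byte) error {
	if err := u.reserve(estimateEntrySize(key, value)); err != nil {
		return err // caller indexes the event directly instead
	}
	// ... buffer the entry in the batched transaction as before ...
	return nil
}

func main() {
	u := &sharedUsage{limit: 100}
	fmt.Println(writeEntry(u, []byte("trace:1"), []byte("event"))) // <nil>
	fmt.Println(writeEntry(u, []byte("trace:2"), []byte("event"))) // limit error
}
```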

@carsonip carsonip removed their assignment Nov 15, 2023
@carsonip carsonip added this to the 8.11 milestone Nov 15, 2023
@simitt simitt modified the milestones: 8.11, 8.12 Nov 17, 2023
@simitt simitt modified the milestones: 8.12, 8.13 Jan 4, 2024
@simitt simitt modified the milestones: 8.13, 8.12 Jan 18, 2024