
Potential missing spans and non-root transactions when TBS storage limit is reached #11857

Closed
carsonip opened this issue Oct 16, 2023 · 4 comments · Fixed by #12120

@carsonip
Member

carsonip commented Oct 16, 2023

APM Server version (apm-server version): reproduced in 8.10, but likely affects other versions as well

Description of the problem including expected versus actual behavior:

  • Expected: When TBS is enabled and the storage limit is reached, traces should still be complete, since every event is supposed to be indexed by default.
  • Actual: When TBS is enabled and the storage limit is reached, spans that are expected in the trace may be missing.

Steps to reproduce:


  1. Set up one APM server with TBS enabled and a low storage limit.
  2. Configure one APM agent to send transactions containing spans at high throughput.
  3. Observe that traces may look incomplete, with spans missing from the waterfall.

Provide logs (if relevant):

@carsonip carsonip added the bug label Oct 16, 2023
@carsonip carsonip self-assigned this Oct 16, 2023
@carsonip carsonip changed the title Potential missing spans when TBS storage limit is reached in a multi APM server setup Potential missing spans when TBS storage limit is reached Oct 16, 2023
@carsonip
Member Author

carsonip commented Oct 16, 2023

I looked into the issue. It does not require multiple APM servers or multiple agents to reproduce. This particular scenario does not involve ES pubsub.

TL;DR: Our existing understanding that TBS indexes every event once the storage limit is reached is incorrect. (TODO: the docs will need to be updated, as a recent update documented this assumption even though it does not hold.)

When the storage limit is reached, it is highly likely that writeEntry (called from processTransaction or processSpan within ProcessBatch) does not return an error, because we batch writes for performance. As a result, processTransaction or processSpan returns err == nil && report == false && stored == true, while in reality the events are neither stored nor reported. Therefore, when the root transaction ends and a sampling decision is published, the goroutine handling the decision only finds a subset, or an empty set, of the trace's events in ReadTraceEvents.
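
For illustration, here is a minimal, self-contained sketch (not the actual apm-server code; the function bodies and behaviour here are hypothetical) of the caller-side effect described above: because the batched writeEntry returns nil while the event is only buffered, the caller concludes report == false && stored == true even when the event is later discarded at flush time.

```go
package main

import "fmt"

// writeEntry stands in for the batched storage write: it returns nil while the
// event is only buffered in memory, deferring the storage-limit check to a
// later flush.
func writeEntry(event string) error {
	return nil // batched: an error would only surface on the flushing write
}

// processEvent mirrors the shape of processTransaction/processSpan: on a nil
// write error it assumes the event is safely stored awaiting a trace decision.
func processEvent(event string) (report, stored bool, err error) {
	if err := writeEntry(event); err != nil {
		// Storage failed: fall back to reporting (indexing) the event directly.
		return true, false, nil
	}
	// err == nil, so the event is assumed stored and is not reported.
	return false, true, nil
}

func main() {
	report, stored, _ := processEvent("span-1")
	fmt.Printf("report=%v stored=%v\n", report, stored)
	// Prints report=false stored=true, yet if the eventual flush fails the
	// event is neither stored nor reported, and ReadTraceEvents will miss it.
}
```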

[Screenshot: trace waterfall showing only the root transaction, with no spans]

  • [See screenshot] When throughput is extremely high, the trace contains only the root transaction, with no spans.
  • Under low throughput, the chance of reproducing it is low, probably because internal buffers in badger cause ReadTraceEvents to return the full set of events in a trace.
  • Under moderate throughput, it is therefore likely that some spans / child transactions are dropped at random.

@carsonip carsonip changed the title Potential missing spans when TBS storage limit is reached Potential missing spans and non-root transactions when TBS storage limit is reached Oct 16, 2023
@carsonip
Member Author

carsonip commented Oct 18, 2023

I took a brief look at the code. Currently, TBS has multiple (GOMAXPROCS) ReadWriters, and each one only flushes after 200 writes have been batched up in its transaction. We did this for performance. The storage limit check is only performed during flush.

Example: if apm-server starts with the TBS storage limit already reached, a ReadWriter will happily write events and return to the caller without indicating any error, only to return an error on the 200th event and discard the other 199 events in the transaction. Since the 200th write returns an error, the caller knows it has to write that event directly to ES; the first 199 events, however, are effectively dropped without any error handling.
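
A rough, self-contained sketch of that batching behaviour (hypothetical names such as batchedWriter and flushThreshold, not the real ReadWriter implementation): with the storage limit already exceeded, the first 199 writeEntry calls return nil and only the 200th surfaces an error, at which point the whole batch is dropped.

```go
package main

import (
	"errors"
	"fmt"
)

var errStorageLimitReached = errors.New("configured storage limit reached")

type batchedWriter struct {
	pending        int  // entries buffered in the current transaction
	flushThreshold int  // 200 in the scenario described above
	limitReached   bool // pretend the on-disk limit is already exceeded
}

// writeEntry buffers the entry; the storage limit is only consulted at flush.
func (w *batchedWriter) writeEntry() error {
	w.pending++
	if w.pending < w.flushThreshold {
		return nil // acknowledged, but only held in memory
	}
	return w.flush()
}

// flush checks the limit and, on failure, drops the entire batch, including
// every entry previously acknowledged with a nil error.
func (w *batchedWriter) flush() error {
	defer func() { w.pending = 0 }()
	if w.limitReached {
		return errStorageLimitReached
	}
	return nil
}

func main() {
	w := &batchedWriter{flushThreshold: 200, limitReached: true}
	silentlyDropped := 0
	for i := 1; i <= 200; i++ {
		if err := w.writeEntry(); err != nil {
			fmt.Printf("write %d failed: %v (caller falls back to indexing)\n", i, err)
			continue
		}
		silentlyDropped++
	}
	fmt.Printf("%d writes acknowledged but never persisted\n", silentlyDropped)
	// Output: 199 writes acknowledged but never persisted
}
```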

Proposed solutions

Solution 1

When flush fails, decode all the events in the transaction and index them to ES.

Bad

  • May kill performance
  • writeEntry handles both trace events and trace decisions

Solution 2

Instead of checking the storage limit only during flush, check it on every writeEntry call (see the sketch after the list below).

Good

  • storage size call seems to be cached, so performance should be ok

Bad

  • it is possible that the storage limit is not reached when writeEntry is called for the 1st event, but is reached by the time the 2nd event comes in. Do we drop the 1st event "silently"?
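
A rough sketch of what Solution 2 could look like, assuming the storage size is available from a periodically refreshed cache; all names here (limitedWriter, cachedSize, storageLimit) are hypothetical, not the apm-server API:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errStorageLimitReached = errors.New("configured storage limit reached")

type limitedWriter struct {
	cachedSize   atomic.Int64 // refreshed in the background, e.g. once per second
	storageLimit int64
	pending      [][]byte // entries buffered in the current transaction
}

// writeEntry rejects the write up front when the cached size already exceeds
// the limit, so the caller can index the event directly instead of assuming it
// was stored. Note the caveat above: entries buffered before the limit was
// crossed are still only handled at flush time.
func (w *limitedWriter) writeEntry(e []byte) error {
	if w.cachedSize.Load() >= w.storageLimit {
		return errStorageLimitReached
	}
	w.pending = append(w.pending, e)
	return nil
}

func main() {
	w := &limitedWriter{storageLimit: 1 << 30}
	w.cachedSize.Store(2 << 30) // simulate a database already over the limit
	if err := w.writeEntry([]byte("event")); err != nil {
		fmt.Println("caller sees the error immediately and can report the event:", err)
	}
}
```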

Solution 3

Same as Solution 2, but we stop writing new events at, say, 90% of the storage limit: once that threshold is reached, flush all the pending events in the transaction and return an error for any new events (see the sketch after the list below).

Bad

  • May exceed the storage limit, as we have multiple ReadWriters that will flush at around the same time
  • 90% may be a bad heuristic depending on average event size and GOMAXPROCS; we may waste 10% of the storage limit if the number is too conservative.
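
For illustration, a sketch of Solution 3 under hypothetical names (softLimitWriter, usedBytes): once usage crosses the soft threshold, the writer flushes whatever it has already buffered and rejects new entries so the caller can index them directly. The caveats listed above (overshoot with multiple writers, wasted headroom) are not addressed here.

```go
package main

import (
	"errors"
	"fmt"
)

var errSoftLimitReached = errors.New("soft storage limit reached")

type softLimitWriter struct {
	pending      [][]byte
	usedBytes    int64 // would come from the storage engine's size estimate
	storageLimit int64
}

// writeEntry refuses new entries once usage crosses 90% of the limit, first
// flushing whatever was already acknowledged so it is not silently dropped.
func (w *softLimitWriter) writeEntry(e []byte) error {
	if w.usedBytes >= w.storageLimit*9/10 {
		w.flush()
		return errSoftLimitReached
	}
	w.pending = append(w.pending, e)
	return nil
}

// flush stands in for committing the batched transaction to storage.
func (w *softLimitWriter) flush() {
	w.usedBytes += int64(len(w.pending)) * 1024 // crude per-entry size guess
	w.pending = nil
}

func main() {
	w := &softLimitWriter{usedBytes: 950, storageLimit: 1000}
	if err := w.writeEntry([]byte("event")); err != nil {
		fmt.Println("new event rejected, previously buffered events flushed:", err)
	}
}
```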

Solution 4

Same as Solution 3, but also track the size of in-flight transactions so that we don't overshoot the storage limit.

Bad

  • Too complex
  • May need to synchronize between ReadWriters

@axw let's go over these and also discuss whether there are better alternatives. cc @marclop, as I believe you initially wrote the code.

The idea is that we may not need a bulletproof solution, just a mitigation that is good enough.

@axw
Member

axw commented Oct 19, 2023

I think (4) sounds OK. I vaguely recall considering this option in the first place - performing atomic updates to a counter to track how many bytes have been written, rather than writing them and then calculating after the fact. I don't recall any more than that, but we probably didn't do this because it's more complicated.

@carsonip
Member Author

Right, that sounds like a proper fix.

At first glance there are no public methods to get the size of a transaction or an entry, despite a private txn.size field. I guess we could duplicate a little of the size-estimation logic to track the pending write size; this would still be better than writing events and calculating after the fact (what we do now). We can go into the implementation details once this task is scheduled, but the point is that the fix is not going to be a simple mitigation.
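
To make the direction concrete, here is a sketch of the atomic-counter approach discussed above (Solution 4). The size estimation and all names are hypothetical, since badger's txn.size is private and the real fix would need its own accounting: each write reserves its estimated size against a shared counter before buffering, so concurrent ReadWriters cannot collectively overshoot the limit.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errStorageLimitReached = errors.New("configured storage limit reached")

// sharedUsage would be shared by all ReadWriters: bytes already on disk plus
// bytes pending in any open transaction.
type sharedUsage struct {
	bytes atomic.Int64
	limit int64
}

// reserve atomically claims n bytes up front, backing the claim out and
// failing if the limit would be exceeded.
func (u *sharedUsage) reserve(n int64) error {
	if u.bytes.Add(n) > u.limit {
		u.bytes.Add(-n) // undo the reservation
		return errStorageLimitReached
	}
	return nil
}

// estimateEntrySize approximates what the entry will occupy once written; this
// duplicates a little of the storage engine's own accounting, as noted above.
func estimateEntrySize(key, value []byte) int64 {
	const perEntryOverhead = 64 // placeholder for key metadata, versioning, etc.
	return int64(len(key)+len(value)) + perEntryOverhead
}

// writeEntry reserves the estimated size before buffering the write, so the
// caller learns immediately when the event cannot be stored.
func writeEntry(u *sharedUsage, key, value []byte) error {
	if err := u.reserve(estimateEntrySize(key, value)); err != nil {
		return err // caller indexes the event directly instead
	}
	// ... buffer the entry in the batched transaction as before ...
	return nil
}

func main() {
	u := &sharedUsage{limit: 100}
	fmt.Println(writeEntry(u, []byte("trace:1"), []byte("event"))) // <nil>
	fmt.Println(writeEntry(u, []byte("trace:2"), []byte("event"))) // limit error
}
```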

@carsonip carsonip removed their assignment Nov 15, 2023
@carsonip carsonip added this to the 8.11 milestone Nov 15, 2023
@simitt simitt modified the milestones: 8.11, 8.12 Nov 17, 2023
@simitt simitt modified the milestones: 8.12, 8.13 Jan 4, 2024
@simitt simitt modified the milestones: 8.13, 8.12 Jan 18, 2024