fix: do not stop sampling processor when failing to delete trace events #12509

kruskall · 2024-01-29T03:06:59Z

Motivation/summary

The sampling processor should never stop while apm-server is running. Instead, log the error at Warn level and skip the current event.

Handle the ErrTxnTooBig error when deleting trace events: because deleting adds an entry to the transaction, it can push the transaction size above the limit.
In that case, flush the pending events and retry the delete.

Checklist

Update CHANGELOG.asciidoc
Documentation has been updated

For functional changes, consider:

Is it observable through the addition of either logging or metrics?
Is its use being published in telemetry to enable product improvement?
Have system tests been added to avoid regression?

How to test these changes

Related issues

Closes #12053

mergify · 2024-01-29T03:07:33Z

This pull request does not have a backport label. Could you fix it @kruskall? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-7.17 is the label to automatically backport to the 7.17 branch.
backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.

NOTE: backport-skip has been added to this pull request.

simitt · 2024-01-29T06:46:03Z

From the issue description:

To fix this we need to

already reject events that exceed a max size when they are received, as well as ensuring that events of allowed size can be processed by the system

This PR addresses the part of the GitHub issue to not stop the TBS processor, but it does not address the problem that events exceeding the max size fill up storage without ever getting deleted. Events that are too large to be deleted should already be rejected when they are received.

carsonip · 2024-01-29T10:17:39Z

already reject events that exceed a max size when they are received

There is the same ErrTxnTooBig check in writeEntry, so we should already be doing it.

Did the deletion txn get too large because writes are batched? Should we handle the ErrTxnTooBig thrown by flushing and retrying?

simitt · 2024-01-29T11:10:18Z

There is the same ErrTxnTooBig check in writeEntry, so we should already be doing it.

The code ensures that an event is added to a new transaction if the max size of the write transaction would otherwise be exceeded. I overlooked that the event cannot be stored in the first place if the event itself is exceeding the max transaction size.

Did the deletion txn get too large because writes are batched? Should we handle the ErrTxnTooBig thrown by flushing and retrying?

Yes, that sounds like the only reason why this might fail. +1 on the suggested solution.

kruskall · 2024-02-05T21:29:12Z

This PR addresses the part of the GitHub issue to not stop the TBS processor, but it does not address the problem that events exceeding the max size fill up storage without ever getting deleted.

There is the same ErrTxnTooBig check in writeEntry, so we should already be doing it.

Yep, we're keeping track of buffered writes, so, unlike with the previous approach, the storage shouldn't grow too big.

Did the deletion txn get too large because writes are batched? Should we handle the ErrTxnTooBig thrown by flushing and retrying?

I'm not sure what this means, can you elaborate?

carsonip · 2024-02-06T17:12:08Z

I'm not sure what this means, can you elaborate?

DeleteTraceEvent and WriteTraceEvent share the same rw. Imagine the processor calls WriteTraceEvent multiple times, stopping just short of ErrTxnTooBig, and then calls DeleteTraceEvent, which pushes the txn over the size limit. Unlike writeEntry, DeleteTraceEvent has no ErrTxnTooBig handling, so it returns an error. I suspect this is the root cause.

To fix it, handle ErrTxnTooBig gracefully in DeleteTraceEvent, as we already do in writeEntry.
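
A minimal sketch of that scenario and the proposed handling, assuming a toy model rather than the real storage code: `readWriter`, `op`, `addWithRetry`, and `errTxnTooBig` are hypothetical names, and the transaction limit is simplified to a maximum number of pending entries.

```go
package main

import "errors"

// errTxnTooBig is a stand-in for the ErrTxnTooBig error discussed above.
var errTxnTooBig = errors.New("txn too big")

// op is one pending entry in the shared transaction. Deletes are entries too.
type op struct {
	key      string
	isDelete bool
}

// readWriter is a toy model in which writes and deletes share one transaction.
type readWriter struct {
	maxEntries int             // assumed limit, counted in entries for simplicity
	pending    []op            // entries buffered in the current transaction
	store      map[string]bool // committed keys
}

// add appends an entry to the current transaction, failing when it is full.
func (rw *readWriter) add(o op) error {
	if len(rw.pending) >= rw.maxEntries {
		return errTxnTooBig
	}
	rw.pending = append(rw.pending, o)
	return nil
}

// flush commits all pending entries and starts an empty transaction.
func (rw *readWriter) flush() {
	for _, o := range rw.pending {
		if o.isDelete {
			delete(rw.store, o.key)
		} else {
			rw.store[o.key] = true
		}
	}
	rw.pending = rw.pending[:0]
}

// addWithRetry adds an entry; on errTxnTooBig it flushes and retries once.
func (rw *readWriter) addWithRetry(o op) error {
	err := rw.add(o)
	if !errors.Is(err, errTxnTooBig) {
		return err
	}
	rw.flush()
	return rw.add(o)
}

// writeTraceEvent and deleteTraceEvent both go through the same retry path.
func (rw *readWriter) writeTraceEvent(key string) error {
	return rw.addWithRetry(op{key: key})
}

func (rw *readWriter) deleteTraceEvent(key string) error {
	return rw.addWithRetry(op{key: key, isDelete: true})
}
```

With `maxEntries` set to 2, two writes fill the transaction; the following delete would fail without the retry, but here it triggers a flush and is then buffered in the new transaction.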

kruskall · 2024-02-06T17:37:57Z

Ah right, deleting adds a new entry 🤦

carsonip

lgtm

Can you update the PR description?
Do we want to backport this to 8.12, i.e. 8.12.2?

kruskall · 2024-02-06T17:59:27Z

Can you update the PR description?

Done

Do we want to backport this to 8.12, i.e. 8.12.2?

I guess we could do that since we did the same for the other TBS fix. cc @simitt who opened the issue for confirmation

simitt · 2024-02-07T09:51:33Z

It's a small enough fix, so no objections to backport to 8.12.2.

kruskall · 2024-02-15T18:08:58Z

@mergify backport 8.12

mergify · 2024-02-15T18:11:16Z

backport 8.12

✅ Backports have been created

#12664 fix: do not stop sampling processor when failing to delete trace events (backport #12509) has been created for branch 8.12

…ts (#12509) * fix: do not stop sampling processor when failing to delete trace events The sampling processor should never stop when apm-server is running. Instead log an error on Warn level and skip the current event. * fix: handle ErrTxnTooBig when deleting trace events (cherry picked from commit 6f0be72)

…ts (#12509) (#12664) * fix: do not stop sampling processor when failing to delete trace events The sampling processor should never stop when apm-server is running. Instead log an error on Warn level and skip the current event. * fix: handle ErrTxnTooBig when deleting trace events (cherry picked from commit 6f0be72) Co-authored-by: kruskall <[email protected]>

carsonip · 2024-02-20T15:42:57Z

Testing notes

✔️ test-plan-ok

Tested via #12688 as it is almost impossible to manually test this behavior. Test in #12688 fails without the handling in this PR.

fix: do not stop sampling processor when failing to delete trace events

3669a86

The sampling processor should never stop when apm-server is running. Instead log an error on Warn level and skip the current event.

kruskall requested a review from a team as a code owner January 29, 2024 03:07

mergify bot added the backport-skip Skip notification from the automated backport with mergify label Jan 29, 2024

fix: handle ErrTxnTooBig when deleting trace events

f2ab81c

Merge branch 'main' into fix/tbs-fail-large-event

c6f579d

kruskall requested a review from carsonip February 6, 2024 17:38

carsonip approved these changes Feb 6, 2024

kruskall merged commit 6f0be72 into elastic:main Feb 6, 2024
7 of 9 checks passed

kruskall deleted the fix/tbs-fail-large-event branch February 6, 2024 17:54

mergify bot mentioned this pull request Feb 15, 2024

fix: do not stop sampling processor when failing to delete trace events (backport #12509) #12664

Merged

carsonip mentioned this pull request Feb 20, 2024

test: Add TBS whitebox test for ErrTxnTooBig handling #12688

Merged

carsonip added test-plan test-plan-ok v8.12.2 labels Feb 20, 2024
