Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: do not stop sampling processor when failing to delete trace events #12509

Merged
merged 3 commits into from
Feb 6, 2024

Conversation

kruskall
Copy link
Member

@kruskall kruskall commented Jan 29, 2024

Motivation/summary

The sampling processor should never stop when apm-server is running. Instead log an error on Warn level and skip the current event.

Handle ErrTxnTooBig error when deleting trace events: because deleting adds a transaction it can increase the size above the limit.
Flush the events before deleting.

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

Related issues

Closes #12053

The sampling processor should never stop when apm-server is running. Instead log
an error on Warn level and skip the current event.
@kruskall kruskall requested a review from a team as a code owner January 29, 2024 03:07
Copy link
Contributor

mergify bot commented Jan 29, 2024

This pull request does not have a backport label. Could you fix it @kruskall? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Jan 29, 2024
@simitt
Copy link
Contributor

simitt commented Jan 29, 2024

From the issue description:

To fix this we need to

already reject events that exceed a max size when they are received, as well as ensuring that events of allowed size can be processed by the system

This PR addresses the part of the github issue to not stop the TBS processor, but it does not address the problem that events exceeding the max size fill up storage without ever getting deleted. Events that are too large to be deleted, should be rejected already when received.

@carsonip
Copy link
Member

carsonip commented Jan 29, 2024

already reject events that exceed a max size when they are received

There is the same ErrTxnTooBig check in writeEntry, so we should already be doing it.

Did the deletion txn get too large because writes are batched? Should we handle the ErrTxnTooBig thrown by flushing and retrying?

@simitt
Copy link
Contributor

simitt commented Jan 29, 2024

There is the same ErrTxnTooBig check in writeEntry, so we should already be doing it.

The code ensures that an event is added to a new transaction if the max size of the write transaction would otherwise be exceeded. I overlooked that the event cannot be stored in the first place if the event itself is exceeding the max transaction size.

Did the deletion txn get too large because writes are batched? Should we handle the ErrTxnTooBig thrown by flushing and retrying?

Yes, that sounds like the only reason why this might fail. +1 on the suggested solution.

@kruskall
Copy link
Member Author

kruskall commented Feb 5, 2024

This PR addresses the part of the github issue to not stop the TBS processor, but it does not address the problem that events exceeding the max size fill up storage without ever getting deleted.

There is the same ErrTxnTooBig check in writeEntry, so we should already be doing it.

Yep, we're keeping track of buffered writes so the storage shouldn't become too big as opposed to the previous approach.

Did the deletion txn get too large because writes are batched? Should we handle the ErrTxnTooBig thrown by flushing and retrying?

I'm not sure what this means, can you elaborate ?

@carsonip
Copy link
Member

carsonip commented Feb 6, 2024

I'm not sure what this means, can you elaborate ?

DeleteTraceEvent and WriteTraceEvent share the same rw. Imagine processor calls WriteTraceEvent multiple times right before reaching ErrTxnTooBig, then calls DeleteTraceEvent, which takes the txn right over the txn size limit. Since there is no ErrTxnTooBig handling inside DeleteTraceEvent unlike writeEntry, it will throw an error. I suspect this is the root cause.

To fix it, handle ErrTxnTooBig gracefully in DeleteTraceEvent like what we have in writeEntry

@kruskall
Copy link
Member Author

kruskall commented Feb 6, 2024

Ah right, deleting adds a new entry 🤦

@kruskall kruskall requested a review from carsonip February 6, 2024 17:38
Copy link
Member

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

  • Can you update the PR description?
  • Do we want to backport this to 8.12, i.e. 8.12.2?

@kruskall kruskall merged commit 6f0be72 into elastic:main Feb 6, 2024
7 of 9 checks passed
@kruskall kruskall deleted the fix/tbs-fail-large-event branch February 6, 2024 17:54
@kruskall
Copy link
Member Author

kruskall commented Feb 6, 2024

Can you update the PR description?

Done

Do we want to backport this to 8.12, i.e. 8.12.2?

I guess we could do that since we did the same for the other tbs fix. cc @simitt who opened the issue for confirmation

@simitt
Copy link
Contributor

simitt commented Feb 7, 2024

It's a small enough fix, so no objections to backport to 8.12.2.

@kruskall
Copy link
Member Author

@mergify backport 8.12

Copy link
Contributor

mergify bot commented Feb 15, 2024

backport 8.12

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Feb 15, 2024
…ts (#12509)

* fix: do not stop sampling processor when failing to delete trace events

The sampling processor should never stop when apm-server is running. Instead log
an error on Warn level and skip the current event.

* fix: handle ErrTxnTooBig when deleting trace events

(cherry picked from commit 6f0be72)
mergify bot added a commit that referenced this pull request Feb 15, 2024
…ts (#12509) (#12664)

* fix: do not stop sampling processor when failing to delete trace events

The sampling processor should never stop when apm-server is running. Instead log
an error on Warn level and skip the current event.

* fix: handle ErrTxnTooBig when deleting trace events

(cherry picked from commit 6f0be72)

Co-authored-by: kruskall <[email protected]>
@carsonip
Copy link
Member

Testing notes

✔️ test-plan-ok

Tested via #12688 as it is almost impossible to manually test this behavior. Test in #12688 fails without the handling in this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify test-plan test-plan-ok v8.12.2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TBS: processor fails and stops when transaction/span is too large
3 participants