Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(pdk) - respect the trace flags from the parent span (if present) #13015

Closed

Conversation

andyroyle
Copy link

Summary

Kong respects the should_sample decision from a parent span if one was provided. Previously it was not doing this resulting in incomplete traces.

When using kong as an internal api gateway we are seeing incomplete traces. The logs show that the following appears to be happening:

(this is based on a sampling rate of 0.01)

  • Service A makes a request to Service B with no traceparent header specified
  • Kong generates a traceparent header and specifies trace flags of "01" (i.e. the trace should be sampled)
  • Service B handles the request from Service A and ingests the traceparent header
  • Service B then makes a request to Service C, setting the traceparent header appropriately (i.e. using flags "01")
  • Kong then resets the trace flags to "00" (i.e. the trace should not be sampled)
  • Service C handles the request from Service B and, respecting the flags from the traceparent header, does not sample the trace

This results in Kong and Services A and B sampling the trace, but Service C does not, and so we get an incomplete trace.

I identified the get_sampling_decision method as the most likely place where the issue occurs, and since parent_should_sample is a tri-state boolean (i.e. it can be true, false, or nil) it makes most sense (to me) to respect the value of parent_should_sample if it exists.

I tested this locally and it behaves as I would expect; when a traceparent header is included in the request, then the flags are respected, however if not, probabilistic sampling is applied.

However rereading the code, I can't work out why that should be the case, since the probablistic sampler is deterministic based on trace id (i.e. the same trace-id will always return the same outcome). Even stepping through the code, I couldn't work out exactly what was going on. It seems likely to me that it's something to do with how the root span is constructed, but I couldn't tell for sure.

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

@CLAassistant
Copy link

CLAassistant commented May 10, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added core/pdk core/tracing cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee labels May 10, 2024
@team-eng-enablement team-eng-enablement added the author/community PRs from the open-source community (not Kong Inc) label May 10, 2024
Kong respects the `should_sample` decision from a parent span if one was provided. Previously it was not doing this resulting in incomplete traces.

When using kong as an internal api gateway we are seeing incomplete traces. The logs show that the following appears to be happening:

(this is based on a sampling rate of 0.01)

- Service A makes a request to Service B with no traceparent header specified
- Kong generates a traceparent header and specifies trace flags of `"01"` (i.e. the trace should be sampled)
- Service B handles the request from Service A and ingests the traceparent header
- Service B then makes a request to Service C, setting the traceparent header appropriately (i.e. using flags `"01"`)
- Kong then resets the trace flags to `"00"` (i.e. the trace should *not* be sampled)
- Service C handles the request from Service B and, respecting the flags from the traceparent header, does not sample the trace

This results in Kong and Services A and B sampling the trace, but Service C does not, and so we get an incomplete trace.

I identified the `get_sampling_decision` method as the most likely place where the issue occurs, and since `parent_should_sample` is a tri-state boolean (i.e. it can be `true`, `false`, or `nil`) it makes most sense (to me) to respect the value of `parent_should_sample` if it exists.

I tested this locally and it behaves as I would expect; when a traceparent header is included in the request, then the flags are respected, however if not, probabilistic sampling is applied.

However rereading the code, I can't work out *why* that should be the case, since the probablistic sampler is deterministic based on trace id (i.e. the same trace-id will always return the same outcome). Even stepping through the code, I couldn't work out exactly what was going on. It seems likely to me that it's something to do with how the root span is constructed, but I couldn't tell for sure.
@andyroyle andyroyle force-pushed the respect-parent-span-sampledflags branch from 19edd70 to f92ec2e Compare June 19, 2024 13:13
@jschmid1 jschmid1 requested review from samugi and hanshuebner June 21, 2024 08:37
@samugi samugi self-assigned this Jun 21, 2024
@samugi
Copy link
Member

samugi commented Jun 21, 2024

Hi @andyroyle, thank you for your contribution!

Could you please test the branch from this PR and let me know if it soves your problem? The explanation of the fix (which should answer your questions) is in the PR description.

Regarding your changes:

I identified the get_sampling_decision method as the most likely place where the issue occurs, and since parent_should_sample is a tri-state boolean (i.e. it can be true, false, or nil) it makes most sense (to me) to respect the value of parent_should_sample if it exists.

This is not wrong, however, it is also not the way Kong's sampler is currently designed and I'm afraid your logic might turn out to be a breaking change for some. At the moment, as you noticed, the parent's sampling decision is only inherited when false, else the probabilistic logic is applied.

This (if everything worked correctly) should allow configuring the sampler to do either of the following:

  1. Always inherit the parent's decision, when the sampling_rate matches that of the parent
  2. Apply additional sampling on top of the incoming trace, with a different value of sampling_rate

Since some might be relying on (2.), if we wanted to introduce a change and make Kong support a "parent based sampler", it should be configurable. However, I believe that with the fix I linked, this should not be needed any longer, but please do let me know in case I missed anything or you have a different opinion.

I am converting this to draft for the time being.
Once again, thank you for your investigation and for bringing this up!

@samugi samugi marked this pull request as draft June 21, 2024 20:23
@andyroyle
Copy link
Author

Yep, I think you identified the correct way to solve this issue. I think the key piece of context I was missing was that the sampling_rate could be empty (because it's actually the plugin's sampling rate, which can be different from the global sampling rate).

Amazing work, thanks for highlighting this, otherwise I would probably have continued to use my "fixed" version forever, believing it to be correct 😬

@samugi
Copy link
Member

samugi commented Jul 5, 2024

Yep, I think you identified the correct way to solve this issue. I think the key piece of context I was missing was that the sampling_rate could be empty (because it's actually the plugin's sampling rate, which can be different from the global sampling rate).

Amazing work, thanks for highlighting this, otherwise I would probably have continued to use my "fixed" version forever, believing it to be correct 😬

Thanks for getting back to me @andyroyle and for confirming the fix (now merged) works! I'm really glad that helped. I'm going to close this PR, your contribution was very much appreciated!

@samugi samugi closed this Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author/community PRs from the open-source community (not Kong Inc) cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee core/pdk core/tracing size/S
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants