Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support distributed trace sampling #52

Open
wants to merge 44 commits into
base: master
Choose a base branch
from

Conversation

reachfh
Copy link

@reachfh reachfh commented Aug 19, 2022

This PR configures distributed trace sampling according to the new Datadog ingestion controls.

Datadog now uses the x-datadog-sampling-priority header to determine whether a trace should be sampled.
There are four values:

  • MANUAL_KEEP(2) indicates that the application wants to ensure that a trace is sampled, e.g. if there is an error.
  • AUTO_KEEP (1) indicates that a trace has been selected for sampling
  • AUTO_REJECT (0) indicates that the trace has not been selected for sampling
  • MANUAL_REJECT (-1) indicates that the application wants a trace to be dropped

Before this PR, when initializing a distributed trace via headers, if the x-datadog-sampling-priority header was not set, a default value of 1 was used. This effectively forced all traces to be sampled. Now, if the header is not present, the priority is not set (default is nil).

Distributed traces can set the priority of the trace as follows:

  • Set the priority based on the x-datadog-sampling-priority header on the inbound request, propagating it to downstream http requests (handled by this module).
  • If no priority header is present (i.e. this is the first app in the trace), make a sampling decision (e.g. in a Phoenix plug) and set the priority to 1 on the current trace (see Spandex.Tracer.update_priority/2)
  • Force tracking manually by setting priority to 2 on the trace in application code

The new module SpandexDatadog.RateSampler supports sampling a proportion of traces based on the random trace_id.

In addition, this PR changes the sample rate parameters which are set when sending a trace to Datadog.
Most of these are now obsolete with the new Datadog ingest handling, and only the distributed sampling priority should be used. In the past, values were hard coded to 1.0. Now the default is to not send these metrics, but they can be configured via span tags.

The x-datadog-origin tag is set when traces are started by Datadog RUM and synthetic traces. This PR propagates that header via baggage.

Breaking Changes:

  • Change default priority to nil (not set) instead of 1
  • Do not send metric for priority unless set
  • Do not set x-datadog-sampling-priority header unless priority is set
  • Use sample_rate tag to set _dd1.sr.eausr value
  • Use rule_sample_rate tag to set _dd.rule_psr without default (previously hard coded to 1.0)
  • Use rate_limiter_rate tag to set _dd.limit_psr, without default (previously hard coded to 1.0)
  • Update library deps
  • Require a minimum of Elixir 1.11 (was 1.7)

Features:

  • Use agent_sample_rate tag to set _dd.agent_psr without default (was 1.0)
  • Add sampler which samples a proportion of traces, setting priority
  • Handle x-datadog-origin header
  • Add constants for priority in lib/spandex_datadog/constants.ex
  • Add GitHub Actions to run tests
  • Use current Config instead of Mix.Config in config files

@GregMefford
Copy link
Member

Hey! Thanks a lot for all the effort you put into this PR! ❤️
I am planning to spend some time this week and next to give Spandex some more attention, so I will try to get this reviewed and merged soon.

Copy link
Member

@GregMefford GregMefford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to look through this PR before realizing that I should focus on the spandex PR first. Just committing the comments that I had already collected here.

lib/spandex_datadog/adapter.ex Outdated Show resolved Hide resolved
@@ -52,9 +52,11 @@ defmodule SpandexDatadog.MixProject do
defp deps do
[
{:msgpax, "~> 2.2.1 or ~> 2.3"},
{:spandex, "~> 3.0"},
{:spandex, github: "TheRealReal/spandex", branch: "distributed-sampling-priority"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I should focus on getting https://github.com/spandex-project/spandex/pull/142/files merged first, eh? 😁

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the spandex changes are simpler and enable spandex_datadog. All the dirty work is done here.

Comment on lines +9 to +14
* Change default `priority` to `nil` (not set) instead of 1
* Do not send metric for `priority` unless set
* Do not set `x-datadog-sampling-priority` header unless priority is set
* Use `sample_rate` tag to set `_dd1.sr.eausr` value
* Use `rule_sample_rate` tag to set `_dd.rule_psr` without default (previously hard coded to 1.0)
* Use `rate_limiter_rate` tag to set `_dd.limit_psr`, without default (previously hard coded to 1.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if you know whether these breaking changes are necessary or if you're just proposing them. If we could avoid making breaking changes in Spandex, that would obviously be preferred, but if the current behavior is incorrect in terms of what Datadog expects, then we can definitely move toward fixing it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are necessary for us to have control so we can reduce our Datadog bill :-).
Forcing the span analytics on results in broken traces for us, i.e. we are sending (and paying for) spans even when the original trace was sampled out and not retained. The span analytics and rule settings are also obsolete with the new Datadog server-based ingestion controls.

@zachdaniel
Copy link
Member

Trying this out, but it doesn't seem to work. Not sure why, but we're still seeing 100% of spans being ingested for "unknown" reason which I assume is that the agent didn't think it could sample. @reachfh does this work for you?

@kamilkowalski
Copy link
Contributor

@zachdaniel I've verified the changes in this PR to see how it would affect our trace ingestion and created a demo app for testing. Here's how to set it up and documentation on some findings (including a bug): https://github.com/kamilkowalski/spandex_sampling_test#readme

So it seems the sampling works when viewing live traces in Datadog, but it also seems Ingestion Control constantly reports 100% of traces being ingested even when the rate is 0.5.

Maybe we're not sending some necessary statistics or headers? The library is not being recognized by Datadog - could that be a factor?

@kamilkowalski
Copy link
Contributor

Okay, Ingestion Control reporting weird numbers is actually documented behavior1:

Application-side sampling drops traces as early as possible. This causes the Ingestion Controls page to not receive enough information to report accurate sampling rates. Use only when reducing the tracing overhead is paramount.

Footnotes

  1. https://docs.datadoghq.com/tracing/trace_collection/dd_libraries/ruby/#application-side-sampling

Copy link
Contributor

@kamilkowalski kamilkowalski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your awesome work @reachfh! I'm not one of the maintainers but I went through the changes and left some comments. We're interested in getting some form of sampling working in Spandex so if there's any way I can help just let me know.

- name: Run tests
run: |
mix test
mix format --check-formatted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mix format doesn't need deps to be installed and could even run in a separate job.

Comment on lines +44 to +45
mix local.rebar --force
mix local.hex --force
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebar3 and Hex are installed by erlef/setup-beam by default.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

uses: actions/cache@v2
with:
path: deps/
key: deps-${{ runner.os }}-${{ matrix.otp }}-${{ matrix.elixir }}-${{ hashFiles('**/mix.lock') }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deps directory only contains dependency source files, no need to depend on anything else than mix.lock:

Suggested change
key: deps-${{ runner.os }}-${{ matrix.otp }}-${{ matrix.elixir }}-${{ hashFiles('**/mix.lock') }}
key: deps-${{ hashFiles('mix.lock') }}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

@@ -1,7 +1,7 @@
# Elixir CircleCI 2.0 configuration file
#
---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be removed now that there's a GitHub Actions workflow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't use CircleCI, so it's fine with me.

`priority` value `AUTO_REJECT` (0) indicates that the trace has not been
selected for sampling.
"""
def auto_reject, do: 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def auto_reject, do: 1
def auto_reject, do: 0

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@@ -0,0 +1,44 @@
defmodule SpandexDatadog.RateSampler do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this could implement a more generic Sampler behaviour, and it should be possible to set this sampler as the one used for all traces when doing application-side sampling. This is how dd-trace-rb does it. We could then implement more sampling strategies. Otherwise, every time you start a trace, you will need to run this sampler and set the priority. In our services, traces can be started in HTTP endpoints, gRPC endpoints and Kafka consumers - that's a lot of places to update. Even if it's a middleware, I don't see a reason why there shouldn't be a way to configure a global sampler for all traces starting in the service.

@reachfh
Copy link
Author

reachfh commented Apr 27, 2023

I no longer have access to the TheRealReal repo which was used to create this PR. I can make another PR, or feel free to make whatever changes you like to this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants