
Optimize Trace collection #32

Open
wants to merge 4 commits into base: master

Conversation

keathley
Contributor

After #28, we determined that the call to add a new trace was taking several milliseconds in the typical case and spiking up to 20ms under load. This PR rewrites the existing trace collection process to optimize for callers and remove that call as a bottleneck.

New Technique

With this new method, all traces are buffered in a collection of write-optimized ETS tables, one table per scheduler. When a caller writes a trace, it first determines its current scheduler and then inserts the trace into the corresponding ETS table. Spreading writes across multiple ETS tables in this way reduces write contention. Periodically (every second by default), a reporting process flushes the data from these tables and sends it to the Datadog collector.
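Roughly, the write path looks like the sketch below (the module name, table options, and record shape are illustrative rather than the exact ones in this PR):

```elixir
defmodule BufferSketch do
  @moduledoc false

  # One write-optimized ETS table per scheduler, created once at startup.
  def init_tables do
    for id <- 1..:erlang.system_info(:schedulers) do
      :ets.new(:"#{__MODULE__}-#{id}", [
        :named_table,
        :public,
        :duplicate_bag,
        write_concurrency: true
      ])
    end
  end

  # A caller inserts into the table that belongs to its current scheduler,
  # so concurrent writers rarely touch the same table.
  def add_trace(trace) do
    id = :erlang.system_info(:scheduler_id)
    :ets.insert(:"#{__MODULE__}-#{id}", {System.monotonic_time(), trace})
  end
end
```

Because a caller only ever touches its own scheduler's table, the hot path no longer funnels every trace through a single GenServer call.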

Future Improvements

There are a few other improvements to make here down the line. The first is that the buffer is now unbounded, which is always a bad idea. We should enforce a per-table maximum on the number of traces that can be stored and reject new traces (or evict old ones) once that maximum is hit. This keeps the bottleneck away from callers while ensuring we don't have unbounded memory growth.
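As a rough illustration of the bound (max_traces is a hypothetical option, not something implemented in this PR yet):

```elixir
def add_trace(trace, max_traces \\ 5_000) do
  id = :erlang.system_info(:scheduler_id)
  table = :"#{__MODULE__}-#{id}"

  # Reject new traces once this scheduler's table hits the cap, so memory
  # stays bounded even if the reporter falls behind.
  if :ets.info(table, :size) >= max_traces do
    {:error, :buffer_full}
  else
    :ets.insert(table, {System.monotonic_time(), trace})
    :ok
  end
end
```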

It is also probably still worth doing certain operations in a Task or another separate, short-lived process to avoid memory bloat from binaries.
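Something along these lines, assuming a Task.Supervisor is already in the tree (the supervisor name and the send_to_collector/1 helper are hypothetical):

```elixir
def flush_async(table) do
  Task.Supervisor.start_child(SpandexDatadog.ApiServer.TaskSupervisor, fn ->
    traces = :ets.tab2list(table)
    :ets.delete_all_objects(table)

    # Any large binaries built while encoding and sending live only in this
    # short-lived process, so they are reclaimed as soon as the task exits.
    send_to_collector(traces) # hypothetical helper: encode + HTTP send
  end)
end
```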

I'll work on both of these ideas, but I wanted to put this PR up sooner in order to get y'all's input.

keathley changed the title from "Optimize api server collection" to "Optimize Trace collection" on Jan 25, 2021
@keathley
Contributor Author

As an update, I've added buffer sizes, tasks, and re-added support for batch sizes.

@keathley
Contributor Author

I've tested this fix in our environment. We see a marked decrease in overall latency, and latencies stay consistent even under load. Before, we could see pauses of up to 20ms when sending a single trace; after these changes, max latency holds steady at ~600µs.

@GregMefford
Member

Nice, I like where this is heading, but it's obviously a big change so I want to take some time to grok it and play with it before we do a wholesale replacement like this.

@novaugust left a comment
Contributor

Surface-level comments as I went through*. I was interested in seeing what you did. The new buffer is groovy.

Only issue I saw is real minor: the signature of send_trace/{1,2} changed but that's easily fixed.

* I should specify that's because I took my questions about the approach to keathley directly :P Looks good to me. I assumed Client is just the same code that existed before, so I didn't look at it.

trace = %Trace{spans: spans}
- GenServer.call(__MODULE__, {:send_trace, trace}, timeout)
+ send_trace(trace, opts)

could switch to just send_trace/1 here

end

@doc """
Send spans asynchronously to DataDog.
"""
@spec send_trace(Trace.t(), Keyword.t()) :: :ok
- def send_trace(%Trace{} = trace, opts \\ []) do
+ def send_trace(%Trace{} = trace, _opts \\ []) do

should send_trace/2 be deprecated?
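If so, one option (just a sketch) is to keep a two-arity head around behind the @deprecated attribute so callers get a compile-time warning:

```elixir
@deprecated "opts are no longer used; call send_trace/1 instead"
def send_trace(%Trace{} = trace, _opts), do: send_trace(trace)
```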

sync_threshold: opts[:sync_threshold],
agent_pid: agent_pid
}
task_sup = __MODULE__.TaskSupervisor

maybe module attribute this eh
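e.g. something like this (the attribute name is just a suggestion), then @task_sup wherever the supervisor is referenced:

```elixir
@task_sup __MODULE__.TaskSupervisor
```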

:telemetry.span([:spandex_datadog, :send_trace], %{trace: trace}, fn ->
- timeout = Keyword.get(opts, :timeout, 30_000)
- result = GenServer.call(__MODULE__, {:send_trace, trace}, timeout)
+ result = Buffer.add_trace(trace)

looks like Buffer.add_trace returns true | :ok, so the spec for this function needs to be updated... or maybe just change the next line to {:ok, %{trace: trace}} and ignore the result of add_trace?
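i.e. the second option would look roughly like:

```elixir
:telemetry.span([:spandex_datadog, :send_trace], %{trace: trace}, fn ->
  # Ignore add_trace's true | :ok result; return :ok plus the span metadata.
  _ = Buffer.add_trace(trace)
  {:ok, %{trace: trace}}
end)
```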

@@ -0,0 +1,75 @@
defmodule SpandexDatadog.ApiServer.Buffer do

might be good to add a @spec and @doc false to all the public functions in here

def add_trace(trace) do
config = :persistent_term.get(@config_key)
id = :erlang.system_info(:scheduler_id)
buffer = :"#{__MODULE__}-#{id}"

Suggested change
- buffer = :"#{__MODULE__}-#{id}"
+ buffer = tab_name(id)

Comment on lines +40 to +41
|> update_in([:flush_period], & &1 || 1_000)
|> put_in([:collector_url], collector_url)

any reason you're using the *_in functions instead of Map ones?
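For comparison, the plain-Map version of those two calls would be roughly (assuming the piped value is a map, called config here):

```elixir
config
|> Map.update(:flush_period, 1_000, &(&1 || 1_000))
|> Map.put(:collector_url, collector_url)
```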

Comment on lines +98 to +100
# next time = min(max(min_time * 2, 1_000), 1_000)
# If our minimum requests are taking way longer than 1 second than don't try
# schedule another

are these comments still relevant?

Comment on lines +88 to +89
|> Enum.each(fn batch ->
Client.send(state.http, state.collector_url, batch, verbose?: state.verbose?)

any thoughts on fanning this out with Task.async_stream or the like if there are enough batches?
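For example, a sketch of the fan-out (concurrency and timeout values are arbitrary):

```elixir
batches
|> Task.async_stream(
  fn batch ->
    Client.send(state.http, state.collector_url, batch, verbose?: state.verbose?)
  end,
  max_concurrency: System.schedulers_online(),
  ordered: false,
  timeout: 30_000
)
|> Stream.run()
```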
