Probes and Instrumentation
Computation requires a thread of execution. In synchronous code, this thread is explicit. In asynchronous code this thread is implicit, but no less real. Event-driven programming is the process of applying structure to events until they become ordered.
But what about the inverse operation, where structured execution of code is transformed into a collection of loosely-ordered events? We do this all the time: it’s called logging.
Logging is a lossy operation; log statements lose any context that isn’t explicitly captured. While we can sometimes recreate the context by considering all the log statements collectively, this is akin to archaeology; we’re trying to create a detailed picture of what happened using sparse, unconnected details and educated guesswork.
At the same time, logging is a necessary tool. It lets us understand our code without having to fundamentally change its structure, because the stream of events it generates exists outside our thread of execution. Lamina gives us powerful tools for transforming, aggregating, and analyzing streams of events. Therefore, if we can transform these log streams into something Lamina can consume, we have a powerful tool for understanding the inner workings of our software.
The `lamina.trace` namespace provides tools for instrumenting and understanding the flow of code.
A probe-channel is a permanent, grounded channel. This means that it cannot be closed or put into an error state, and that messages will never accumulate in its queue. A probe-channel is also named, which means that when we interact with probe-channel `abc:def`, that will always be the same channel.
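Because probe-channels are registered by name, looking one up twice yields the same underlying channel. A minimal sketch, assuming Lamina is on the classpath (the probe name here is purely illustrative):

```clojure
(use 'lamina.trace)

;; per the naming guarantee above, both lookups refer to the
;; same registered channel, so a consumer attached via one
;; reference sees messages traced via the other
(def a (probe-channel :abc:def))
(def b (probe-channel :abc:def))
```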
Probe-channels represent a stream of events, typically from an instrumented piece of code. The easiest way to enqueue a message into a probe-channel is by using `trace`:
(trace :foo:bar
(calculate-pi-to-1e6-digits))
This will send a million digits of pi into the probe-channel returned by `(probe-channel :foo:bar)`. By convention, probe namespaces are colon-delimited. If the argument given to `probe-channel` is a sequence, its elements will be joined with colon separators. All of these function calls are equivalent:
(probe-channel :foo:bar)
(probe-channel "foo:bar")
(probe-channel [:foo :bar])
(probe-channel ["foo" :bar])
However, the body of the `trace` statement will only be executed if something is consuming messages from the probe-channel. Unless we do something with the output of `(probe-channel :foo:bar)`, the above `trace` statement is effectively a no-op.
We can also consume collections of probe-channels using `select-probes`. This takes either a wildcard string or a regex, and returns a channel representing the combined output of all probe-channels whose names match that pattern.
> (use 'lamina.core 'lamina.trace)
nil
> (receive-all (select-probes "foo*") #(println "traced:" %))
true
> (trace :foo:bar 1)
traced: 1
true
> (trace :foo 2)
traced: 2
true
> (trace :bar 3)
false
Notice that `trace` returns `true` if the message was consumed, and `false` otherwise.
While littering the code with `trace` statements won't affect its semantics, it is a bit of an eyesore. Since most of the time meaningful events correlate with function calls, there's an easy way to automatically instrument functions.

Calling `lamina.trace/instrument` on a function will create probes for every time the function is called, returns, and throws an exception. These are called the `enter`, `return`, and `error` probes, respectively.
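The `error` probe can be consumed like any other probe-channel. A hedged sketch (the function and probe names are illustrative, and the exact keys in the error payload aren't asserted here):

```clojure
(use 'lamina.core 'lamina.trace 'clojure.pprint)

(defn divide [x y] (/ x y))
(def divide* (instrument divide {:name :divide}))

;; print the payload of every error event
(receive-all (probe-channel :divide:error)
             #(do (println "ERROR EVENT:") (pprint %)))

;; a zero divisor throws as usual, and also emits an event
;; on the :divide:error probe-channel
(try
  (divide* 1 0)
  (catch ArithmeticException _ :caught))
```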
> (use 'clojure.pprint)
nil
> (defn f [x y] (+ x y))
#'f
> (def f* (instrument f {:name :instrumented :capture :in-out}))
#'f*
> (receive-all (probe-channel :instrumented:enter) #(do (println "ENTERED") (pprint %)))
true
> (receive-all (probe-channel :instrumented:return) #(do (println "RETURNED") (pprint %)))
true
> (f* 1 2)
ENTERED
{:name "instrumented",
:timestamp 1367021930248,
:args [1 2],
:context {:host "xanadu", :pid 34403}}
RETURNED
{:name "instrumented",
:context {:host "xanadu", :pid 34403},
:timestamp 1367021930256,
:offset 0,
:compute-duration 20000,
:duration 20000,
:sub-tasks nil,
:counts {},
:args [1 2],
:result 3}
3
Notice that we have defined our function with the name `:instrumented`, so the `return` probe can be accessed at `:instrumented:return`.
The fields emitted by the `enter` probe are:

| field | description |
|---|---|
| `:name` | the function's name |
| `:timestamp` | the timestamp, in milliseconds, of the event |
| `:args` | the arguments passed to the function |
| `:context` | a map describing the local context, defaulting to the `:host` and `:pid` of the process |
The additional fields emitted by the `return` probe are:

| field | description |
|---|---|
| `:result` | the value returned by the function |
| `:duration` | the elapsed time, in nanoseconds, since the function was called |
| `:compute-duration` | the elapsed time, in nanoseconds, that was not spent waiting for the return value to resolve; if the return value was not an async-promise, this will always equal the `:duration` |
| `:sub-tasks` | the return data for any instrumented functions called in the scope of the parent function |
| `:counts` | values captured by `increment-trace-counter` or `add-to-trace-counter` (see below for further explanation) |
| `:offset` | the elapsed time, in nanoseconds, from the topmost parent function's invocation to this function's invocation |
Notice that we capture the `return` data for all other instrumented functions automatically. To explore this, let's use `defn-instrumented`, a combination of `defn` and `instrument`.
> (defn-instrumented g [x y] (+ x y))
#'g
> (defn-instrumented f
{:probes {:return (sink->> pprint)}}
[x y]
(+ (g x x) (g y y)))
Here we have two instrumented functions, `f` and `g`. We do not need to explicitly define a `:name` in the function metadata, since it defaults to `:the:function:namespace:the-function-name`.

We have also defined a `:probes` map for `f`, which allows us to consume probes without explicitly calling `probe-channel`. The keys of the map are probe suffixes (in this case, just `:return`), and each value is a channel that will consume messages from that probe-channel. For this, we use the `sink->>` function.
> (f 1 2)
{:name "user:f",
:context {:host "xanadu", :pid 34403},
:timestamp 1367022313773,
:offset 0,
:compute-duration 72000,
:duration 72000,
:sub-tasks
({:name "user:g",
:context {:host "xanadu", :pid 34403},
:timestamp 1367022313773,
:offset 17000,
:compute-duration 22000,
:duration 22000,
:sub-tasks nil,
:counts {},
:args nil,
:result nil}
{:name "user:g",
:context {:host "xanadu", :pid 34403},
:timestamp 1367022313773,
:offset 50000,
:compute-duration 13000,
:duration 13000,
:sub-tasks nil,
:counts {},
:args nil,
:result nil}),
:counts {},
:args nil,
:result nil}
6
Notice that the `:sub-tasks` field now has two entries, one for each time `g` was called. Each of these has its own `:args` and `:result`, and an `:offset` relative to when `f` was initially invoked.

Notice also that neither the `:args` nor the `:result` fields have any entries. This is because we have not specified what should be captured via the `:capture` option, which defaults to capturing neither. There are a variety of options that can be used when instrumenting functions:
| option | description |
|---|---|
| `:capture` | controls whether the input arguments or return value will be captured in the timing; may be `:in`, `:out`, `:in-out`, or `:none`. Defaults to `:none`. |
| `:name` | the name which will be prefixed to the `enter`, `return`, and `error` probes; with `defn-instrumented`, defaults to the function name prefixed by its namespace |
| `:probes` | a map of sub-names (with `:name` implicitly prefixed) onto channels that consume those probes |
| `:executor` | the executor which will invoke the function body; if defined, the function will return an async-promise rather than a realized value |
| `:timeout` | only used when `:executor` is defined; a function which takes the arguments and returns the timeout in milliseconds |
| `:implicit?` | if true, when this function is called within the scope of another instrumented function, its timing will be included in the top-level timing's `:sub-tasks`. Defaults to true. |
| `:with-bindings` | only used when `:executor` is defined; if true, the thread-local bindings are conveyed to the executing thread. Defaults to false. |
When we specify that a function should be executed on another thread, we're inserting a delay between the invocation of the function and its execution. With normal thread pools it can be hard to differentiate between time spent on the pool's queue and time spent actually executing the function. However, Lamina's executors are specially instrumented, so the timing for any instrumented function that's executed on an executor will have an `:enqueued-duration` field, which is the amount of time, in nanoseconds, that the function spent waiting to be executed. The `:duration` field, in this case, is the sum of the `:enqueued-duration` and the time spent actually executing the function.
Using the `with-instrumentation` macro, which captures and returns the timing information for all instrumented functions, we can see this in action:
> (pprint
(with-instrumentation
(task
(+ 1 1)))
{:name "with-instrumentation",
:context {:host "xanadu.local", :pid 34403},
:timestamp 1367022932809,
:offset 0,
:compute-duration 5470000,
:duration 5470000,
:sub-tasks
({:name "user:task",
:context {:host "xanadu.local", :pid 34403},
:offset 1872000,
:timestamp 1367022932811,
:duration 1381000,
:enqueued-duration 1181000,
:compute-duration 1381000,
:sub-tasks nil,
:args nil,
:result nil,
:counts {}}),
:counts {},
:args nil,
:result 1}
These nested timing data structures can be hard to read, so in many cases it's useful to use the `time*` macro, which behaves like Clojure's `time` macro but prints timing information for every instrumented function:
> (time* (task (+ 1 1)))
time - 657.0us
user:task - 229.0us - enqueued for 177.0us
<< ... >>
Notice that in the entry for `task`, it displays both the total time and the time spent waiting for execution. This printed format may be reproduced by passing the timing data into `format-timing`.
`time*` can be a very powerful way to understand the performance characteristics of our code, if important functions are instrumented. In Aleph, the `http-request` function is already instrumented, which lets us examine exactly where time is spent:
> (use 'aleph.http 'lamina.trace)
nil
> (time*
(http-request
{:method :get, :url "http://google.com"}))
time - 62.0ms
aleph:http-request - 61.9ms - computed for 658.0us, waited for 61.3ms
aleph:http-connection - 16.4ms - computed for 620.0us, waited for 15.8ms
Since both the inner `http-request` and `http-connection` functions return async-promises, the printed output is able to differentiate between time spent computing and time spent waiting for the value to resolve. In this case, we can see that we spent ~16ms waiting for the TCP connection to open, and ~62ms total for the connection to open and the HTTP response to arrive.
There is a query language for writing basic transformers on probes (or other channels) using a specialized string-based syntax.
TODO: stomp endpoints