[PROF-10484] refactor: use standard profiler in serverless contexts #4652
Conversation
In a non-serverless environment, the profiler runs for 65 seconds, then we grab the profile data and submit it.

In a serverless context, we do something different, using a different scheduler. Serverless functions may run for a short interval before being suspended (or frozen, or paused, or whatever you want to call it) and then be resumed. The intent is for us to grab profile data in smaller chunks of 1 second each. After 65 chunks, we submit to the server. I'm not sure of exactly _why_ we want to do it like that, because in practice we should end up obtaining the same profile data (the `profiler.profile()` call takes as parameters the start and end date that we care about, along with a `restart` parameter if we want the profiler to continue sampling; the fact that our timers may pause and resume should have no impact on the timestamps that get passed into the profiler).

Comments in Slack suggest that the intent of the original PR (#2495) was:

> if a function is called once and runs for 1s, then frozen for a minute
> until the next call, the profiler would flush a very small profile
>
> so instead the idea was to collect/merge the 65 profiles over the
> course of multiple calls

However:

1. What the current scheduler is actually doing is just setting 65 1-second timers in a row and then grabbing all the profile data and submitting it (a rough sketch of both scheduling strategies follows below). This commit adds a failing test to show that this is the case (the existing test only asserted that we "ticked" 65 times before flushing, not that we collected 65 1-second chunks of profiling data before flushing).
2. It's not clear to me that the scheduler is needed at all. In its current form, it is behaving basically the same as the non-serverless scheduler. But I don't think a scheduler that did what that PR intended to do would stop us from submitting small profiles anyway: if a Lambda is frozen (along with its timers) and resumed, we'd still only submit after 65 seconds of actual profiling. And if the Lambda is stopped (gracefully, as opposed to being unceremoniously terminated), then our `_stop()` logic will run, which cleans up the timers and does not flush. (The only time we eagerly submit before the time is up is if our `_nearOOMExport()` method gets triggered via a callback.)
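To make the comparison concrete, here is a minimal sketch (not the dd-trace-js implementation) of the two scheduling strategies described above; the `profiler.profile(start, end, restart)` parameter order, the `flush` callback, and all function names are assumptions for illustration only.

```js
// Non-serverless strategy: one 65-second window per flush.
function scheduleStandard (profiler, flush, flushIntervalMs = 65_000) {
  let windowStart = new Date()
  setInterval(() => {
    const windowEnd = new Date()
    // Collect the profile for the window we care about and keep sampling.
    flush(profiler.profile(windowStart, windowEnd, true))
    windowStart = windowEnd
  }, flushIntervalMs).unref()
}

// What the current serverless scheduler effectively does: count 65 one-second
// ticks, then collect a single profile covering the whole window and flush it,
// which yields the same data as the strategy above.
function scheduleServerless (profiler, flush, tickMs = 1_000, ticksPerFlush = 65) {
  let windowStart = new Date()
  let ticks = 0
  setInterval(() => {
    if (++ticks < ticksPerFlush) return
    ticks = 0
    const windowEnd = new Date()
    flush(profiler.profile(windowStart, windowEnd, true))
    windowStart = windowEnd
  }, tickMs).unref()
}
```

Note that nothing in the second sketch captures per-tick chunks of profile data; it only counts ticks, which is the behavior the failing test mentioned in item 1 is meant to demonstrate.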
Overall package size

Self size: 7.05 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| @datadog/native-appsec | 8.1.1 | 18.67 MB | 18.68 MB |
| @datadog/native-iast-taint-tracking | 3.1.0 | 12.27 MB | 12.28 MB |
| @datadog/pprof | 5.3.0 | 9.85 MB | 10.22 MB |
| protobufjs | 7.2.5 | 2.77 MB | 5.16 MB |
| @datadog/native-iast-rewriter | 2.4.1 | 2.14 MB | 2.23 MB |
| @opentelemetry/core | 1.14.0 | 872.87 kB | 1.47 MB |
| @datadog/native-metrics | 2.0.0 | 898.77 kB | 1.3 MB |
| @opentelemetry/api | 1.8.0 | 1.21 MB | 1.21 MB |
| jsonpath-plus | 9.0.0 | 580.4 kB | 1.03 MB |
| import-in-the-middle | 1.8.1 | 71.67 kB | 785.15 kB |
| msgpack-lite | 0.1.26 | 201.16 kB | 281.59 kB |
| opentracing | 0.14.7 | 194.81 kB | 194.81 kB |
| pprof-format | 2.1.0 | 111.69 kB | 111.69 kB |
| @datadog/sketches-js | 2.1.0 | 109.9 kB | 109.9 kB |
| semver | 7.6.3 | 95.82 kB | 95.82 kB |
| lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB |
| lru-cache | 7.14.0 | 74.95 kB | 74.95 kB |
| ignore | 5.3.1 | 51.46 kB | 51.46 kB |
| int64-buffer | 0.1.10 | 49.18 kB | 49.18 kB |
| shell-quote | 1.8.1 | 44.96 kB | 44.96 kB |
| istanbul-lib-coverage | 3.2.0 | 29.34 kB | 29.34 kB |
| rfdc | 1.3.1 | 25.21 kB | 25.21 kB |
| tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB |
| limiter | 1.1.5 | 23.17 kB | 23.17 kB |
| dc-polyfill | 0.1.4 | 23.1 kB | 23.1 kB |
| retry | 0.13.1 | 18.85 kB | 18.85 kB |
| jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB |
| crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB |
| koalas | 1.0.2 | 6.47 kB | 6.47 kB |
| path-to-regexp | 0.1.10 | 6.38 kB | 6.38 kB |
| module-details-from-path | 1.0.3 | 4.47 kB | 4.47 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe
Benchmarks

Benchmark execution time: 2024-09-05 09:09:27

Comparing candidate commit e8c5045 in PR branch

Found 0 performance improvements and 0 performance regressions! Performance is the same for 260 metrics, 6 unstable metrics.
As shown/discussed in the parent commit, the ServerlessProfiler is behaving just like the non-serverless one, so remove it. I am keeping the constant around just in case anybody is importing it and using it, but now the constant just refers to the main `Profiler` class. Lots of whitespace changes in the test file when I removed the `describe` block, so be sure to view with `git diff -b` or similar.
Closing this after verifying that there is a difference between the two profilers in the context of short-lived invocations interspersed with frozen pauses. In the following examples, let's assume invocations that take about 100ms, with the Lambda getting frozen immediately after each one:
The net effect is that the Normal profiler might flush a tiny profile if it spends a lot of time frozen, but the Serverless profiler will at least accumulate samples across multiple invocations (i.e. they may still be small, but not as small; the trade-off, however, is that there may be significant latency before flushing). For active/busy Lambdas, there's not really any difference between the two profilers; they'll both end up flushing roughly every 65s.

The reason it took a while to establish this is that: (a) I was going off some old (2018) writing that suggested that AWS may wait a bit before freezing an active Lambda (i.e. allow timers to continue running); and (b) I wanted to verify that the timer implementation used by Node is deadline-based and, furthermore, that the counter it is compared against is not only monotonically increasing but also independent of freeze/thaw events; logging confirms this.
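For reference, here is a minimal sketch, assuming a plain Node process, of the kind of logging check described in (b); the real verification would need to run inside a Lambda across freeze/thaw cycles.

```js
// Record both wall-clock and monotonic timestamps when a long timer is
// scheduled, then again when it fires. Comparing the two deltas across a
// freeze shows whether the timer's deadline is evaluated against a counter
// that kept advancing while the process was suspended.
const scheduled = { wall: Date.now(), mono: process.hrtime.bigint() }

setTimeout(() => {
  const wallDeltaMs = Date.now() - scheduled.wall
  const monoDeltaMs = Number(process.hrtime.bigint() - scheduled.mono) / 1e6
  console.log(`65s timer fired after ${wallDeltaMs} ms wall clock, ${monoDeltaMs} ms monotonic`)
}, 65_000)
```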
What does this PR do?

This PR removes the implementation of the `ServerlessProfiler`, because its behavior was effectively identical to that of the standard `Profiler`. It keeps the old constant around, just in case anybody is importing and using it, but now it's just a reference to the base `Profiler`.

Lots of whitespace differences here in the tests because I removed an outer `describe()`, so view with whitespace off to get a clearer picture of what actually changed.
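As a rough sketch of the backward-compatibility shim described above (the module path and export shape are illustrative, not the exact dd-trace-js layout):

```js
// Keep the old exported name, but make it an alias for the base Profiler so
// any existing imports keep working.
const { Profiler } = require('./profiler')

module.exports = {
  Profiler,
  // Deprecated alias retained for compatibility.
  ServerlessProfiler: Profiler
}
```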
Motivation

See the original description below for my initial investigation, when I added a failing test to reveal the failure of the serverless scheduler to actually do anything different from the non-serverless one. I could have kept the scheduler around and fixed it, but it seems better to remove it. When a Lambda function is frozen, any timers will get frozen too, so the base `Profiler` class will work just fine (instead of waiting 65 ⨉ 1-second chunks, it can just wait a single 65-second chunk with the same effect).
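As an illustration of that reasoning, here is a minimal sketch using sinon's fake timers; the callbacks are stand-ins for the schedulers, not the real ones, and the freeze behavior is only approximated by the fact that the fake clock advances solely when we tick it.

```js
const sinon = require('sinon')

const clock = sinon.useFakeTimers()
const fired = []

// Strategy A: a single 65-second timer.
setTimeout(() => fired.push('single 65s timer'), 65_000)

// Strategy B: 65 chained 1-second timers.
let ticks = 0
const tick = () => {
  if (++ticks === 65) {
    fired.push('65th chained 1s timer')
  } else {
    setTimeout(tick, 1_000)
  }
}
setTimeout(tick, 1_000)

// If the clock only advances while the function is actually running (as with
// a frozen Lambda whose timers freeze too), both strategies reach their flush
// point after the same 65 seconds of accumulated execution time.
clock.tick(65_000)
console.log(fired) // both entries are present once 65 s have elapsed

clock.restore()
```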
Original description

In a non-serverless environment, the profiler runs for 65 seconds, then we grab the profile data and submit it.

In a serverless context, we do something different, using a different scheduler. Serverless functions may run for a short interval before being suspended (or frozen, or paused, or whatever you want to call it) and then be resumed. The intent is for us to grab profile data in smaller chunks of 1 second each. After 65 chunks, we submit to the server. I'm not sure of exactly why we want to do it like that, because in practice we should end up obtaining the same profile data (the `profiler.profile()` call takes as parameters the start and end date that we care about, along with a `restart` parameter if we want the profiler to continue sampling; the fact that our timers may pause and resume should have no impact on the timestamps that get passed into the profiler).

Comments in Slack suggest that the intent of the original PR (#2495) was:

> if a function is called once and runs for 1s, then frozen for a minute
> until the next call, the profiler would flush a very small profile
>
> so instead the idea was to collect/merge the 65 profiles over the
> course of multiple calls

However:

1. What the current scheduler is actually doing is just setting 65 1-second timers in a row and then grabbing all the profile data and submitting it. This commit adds a failing test to show that this is the case (the existing test only asserted that we "ticked" 65 times before flushing, not that we collected 65 1-second chunks of profiling data before flushing).
2. It's not clear to me that the scheduler is needed at all. In its current form, it is behaving basically the same as the non-serverless scheduler. But I don't think a scheduler that did what that PR intended to do would stop us from submitting small profiles anyway: if a Lambda is frozen (along with its timers) and resumed, we'd still only submit after 65 seconds of actual profiling. And if the Lambda is stopped (gracefully, as opposed to being unceremoniously terminated), then our `_stop()` logic will run, which cleans up the timers and does not flush. (The only time we eagerly submit before the time is up is if our `_nearOOMExport()` method gets triggered via a callback.)

Additional Notes
Nope.