[PROF-10484] refactor: use standard profiler in serverless contexts #4652
Conversation
In a non-serverless environment, the profiler runs for 65 seconds, then we grab the profile data and submit it.

In a serverless context, we do something different, using a different scheduler. Serverless functions may run for a short interval before being suspended (or frozen, or paused, or whatever you want to call it) and then be resumed. The intent is for us to grab profile data in smaller chunks of 1 second each. After 65 chunks, we submit to the server. I'm not sure of exactly _why_ we want to do it like that, because in practice we should end up obtaining the same profile data (the `profiler.profile()` call takes as parameters the start and end date that we care about, along with a `restart` parameter if we want the profiler to continue sampling; the fact that our timers may pause and resume should have no impact on the timestamps that get passed into the profiler).

Comments in Slack suggest that the intent of the original PR (#2495) was:

> if a function is called once and runs for 1s, then frozen for a minute
> until the next call, the profiler would flush a very small profile
>
> so instead the idea was to collect/merge the 65 profiles over the
> course of multiple calls

However:

1. What the current scheduler is actually doing is just setting 65 1-second timers in a row and then grabbing all the profile data and submitting it (a rough sketch of both scheduling strategies follows below). This commit adds a failing test to show that this is the case (the existing test only asserted that we "ticked" 65 times before flushing, not that we collected 65 1-second chunks of profiling data before flushing).
2. It's not clear to me that the scheduler is needed at all. In its current form, it is behaving basically the same as the non-serverless scheduler. But I don't think a scheduler that did what that PR intended to do would stop us from submitting small profiles anyway: if a Lambda is frozen (along with its timers) and resumed, we'd still only submit after 65 seconds of actual profiling. And if the Lambda is stopped (gracefully, as opposed to being unceremoniously terminated), then our `_stop()` logic will run, which cleans up the timers and does not flush. (The only time we eagerly submit before the time is up is if our `_nearOOMExport()` method gets triggered via a callback.)
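To make the comparison concrete, here is a minimal sketch (not the dd-trace-js implementation) of the two scheduling strategies described above; the `profiler.profile(start, end, restart)` parameter order, the `flush` callback, and all function names are assumptions for illustration only.

```js
// Non-serverless strategy: one 65-second window per flush.
function scheduleStandard (profiler, flush, flushIntervalMs = 65_000) {
  let windowStart = new Date()
  setInterval(() => {
    const windowEnd = new Date()
    // Collect the profile for the window we care about and keep sampling.
    flush(profiler.profile(windowStart, windowEnd, true))
    windowStart = windowEnd
  }, flushIntervalMs).unref()
}

// What the current serverless scheduler effectively does: count 65 one-second
// ticks, then collect a single profile covering the whole window and flush it,
// which yields the same data as the strategy above.
function scheduleServerless (profiler, flush, tickMs = 1_000, ticksPerFlush = 65) {
  let windowStart = new Date()
  let ticks = 0
  setInterval(() => {
    if (++ticks < ticksPerFlush) return
    ticks = 0
    const windowEnd = new Date()
    flush(profiler.profile(windowStart, windowEnd, true))
    windowStart = windowEnd
  }, tickMs).unref()
}
```

Note that nothing in the second sketch captures per-tick chunks of profile data; it only counts ticks, which is the behavior the failing test mentioned in item 1 is meant to demonstrate.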
Overall package size

Self size: 7.05 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| @datadog/native-appsec | 8.1.1 | 18.67 MB | 18.68 MB |
| @datadog/native-iast-taint-tracking | 3.1.0 | 12.27 MB | 12.28 MB |
| @datadog/pprof | 5.3.0 | 9.85 MB | 10.22 MB |
| protobufjs | 7.2.5 | 2.77 MB | 5.16 MB |
| @datadog/native-iast-rewriter | 2.4.1 | 2.14 MB | 2.23 MB |
| @opentelemetry/core | 1.14.0 | 872.87 kB | 1.47 MB |
| @datadog/native-metrics | 2.0.0 | 898.77 kB | 1.3 MB |
| @opentelemetry/api | 1.8.0 | 1.21 MB | 1.21 MB |
| jsonpath-plus | 9.0.0 | 580.4 kB | 1.03 MB |
| import-in-the-middle | 1.8.1 | 71.67 kB | 785.15 kB |
| msgpack-lite | 0.1.26 | 201.16 kB | 281.59 kB |
| opentracing | 0.14.7 | 194.81 kB | 194.81 kB |
| pprof-format | 2.1.0 | 111.69 kB | 111.69 kB |
| @datadog/sketches-js | 2.1.0 | 109.9 kB | 109.9 kB |
| semver | 7.6.3 | 95.82 kB | 95.82 kB |
| lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB |
| lru-cache | 7.14.0 | 74.95 kB | 74.95 kB |
| ignore | 5.3.1 | 51.46 kB | 51.46 kB |
| int64-buffer | 0.1.10 | 49.18 kB | 49.18 kB |
| shell-quote | 1.8.1 | 44.96 kB | 44.96 kB |
| istanbul-lib-coverage | 3.2.0 | 29.34 kB | 29.34 kB |
| rfdc | 1.3.1 | 25.21 kB | 25.21 kB |
| tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB |
| limiter | 1.1.5 | 23.17 kB | 23.17 kB |
| dc-polyfill | 0.1.4 | 23.1 kB | 23.1 kB |
| retry | 0.13.1 | 18.85 kB | 18.85 kB |
| jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB |
| crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB |
| koalas | 1.0.2 | 6.47 kB | 6.47 kB |
| path-to-regexp | 0.1.10 | 6.38 kB | 6.38 kB |
| module-details-from-path | 1.0.3 | 4.47 kB | 4.47 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe
Benchmarks

Benchmark execution time: 2024-09-05 09:09:27

Comparing candidate commit e8c5045 in PR branch

Found 0 performance improvements and 0 performance regressions! Performance is the same for 260 metrics, 6 unstable metrics.
As shown/discussed in the parent commit, the ServerlessProfiler is behaving just like the non-serverless one, so remove it. I am keeping the constant around just in case anybody is importing it and using it, but now the constant just refers to the main `Profiler` class. Lots of whitespace changes in the test file when I removed the `describe` block, so be sure to view with `git diff -b` or similar.
Closing this after verifying that there is a difference between the two profilers in the context of short-lived invocations interspersed with frozen pauses. In the following examples, let's assume invocations that take about 100ms, with the Lambda getting frozen immediately after each one:
The net effect is that the Normal profiler might flush a tiny profile if it spends a lot of time frozen, but the Serverless profiler will at least accumulate samples across multiple invocations (i.e. they may still be small, but not as small; the trade-off, however, is that there may be significant latency before flushing). For active/busy Lambdas, there's not really any difference between the two profilers; they'll both end up flushing roughly every 65s.

The reason it took a while to establish this is that: (a) I was going off some old (2018) writing that suggested that AWS may wait a bit before freezing an active Lambda (i.e. allow timers to continue running); and (b) I wanted to verify that the timer implementation used by Node is deadline-based and, furthermore, that the counter it is compared against is not only monotonically increasing but also independent of freeze/thaw events; logging confirms this.
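For reference, here is a minimal sketch, assuming a plain Node process, of the kind of logging check described in (b); the real verification would need to run inside a Lambda across freeze/thaw cycles.

```js
// Record both wall-clock and monotonic timestamps when a long timer is
// scheduled, then again when it fires. Comparing the two deltas across a
// freeze shows whether the timer's deadline is evaluated against a counter
// that kept advancing while the process was suspended.
const scheduled = { wall: Date.now(), mono: process.hrtime.bigint() }

setTimeout(() => {
  const wallDeltaMs = Date.now() - scheduled.wall
  const monoDeltaMs = Number(process.hrtime.bigint() - scheduled.mono) / 1e6
  console.log(`65s timer fired after ${wallDeltaMs} ms wall clock, ${monoDeltaMs} ms monotonic`)
}, 65_000)
```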
What does this PR do?

This PR removes the implementation of the `ServerlessProfiler`, because its behavior was effectively identical to that of the standard `Profiler`. It keeps the old constant around, just in case anybody is importing and using it, but now it's just a reference to the base `Profiler`.

Lots of whitespace differences here in the tests because I removed an outer `describe()`, so view with whitespace off to get a clearer picture of what actually changed.
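As a rough sketch of the backward-compatibility shim described above (the module path and export shape are illustrative, not the exact dd-trace-js layout):

```js
// Keep the old exported name, but make it an alias for the base Profiler so
// any existing imports keep working.
const { Profiler } = require('./profiler')

module.exports = {
  Profiler,
  // Deprecated alias retained for compatibility.
  ServerlessProfiler: Profiler
}
```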
Motivation

See the original description below for my initial investigation, when I added a failing test to reveal the failure of the serverless scheduler to actually do anything different from the non-serverless one. I could have kept the scheduler around and fixed it, but it seems better to remove it. When a Lambda function is frozen, any timers will get frozen too, so the base `Profiler` class will work just fine (instead of waiting 65 ⨉ 1-second chunks, it can just wait a single 65-second chunk with the same effect).
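As an illustration of that reasoning, here is a minimal sketch using sinon's fake timers; the callbacks are stand-ins for the schedulers, not the real ones, and the freeze behavior is only approximated by the fact that the fake clock advances solely when we tick it.

```js
const sinon = require('sinon')

const clock = sinon.useFakeTimers()
const fired = []

// Strategy A: a single 65-second timer.
setTimeout(() => fired.push('single 65s timer'), 65_000)

// Strategy B: 65 chained 1-second timers.
let ticks = 0
const tick = () => {
  if (++ticks === 65) {
    fired.push('65th chained 1s timer')
  } else {
    setTimeout(tick, 1_000)
  }
}
setTimeout(tick, 1_000)

// If the clock only advances while the function is actually running (as with
// a frozen Lambda whose timers freeze too), both strategies reach their flush
// point after the same 65 seconds of accumulated execution time.
clock.tick(65_000)
console.log(fired) // both entries are present once 65 s have elapsed

clock.restore()
```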
Original description

In a non-serverless environment, the profiler runs for 65 seconds, then we grab the profile data and submit it.

In a serverless context, we do something different, using a different scheduler. Serverless functions may run for a short interval before being suspended (or frozen, or paused, or whatever you want to call it) and then be resumed. The intent is for us to grab profile data in smaller chunks of 1 second each. After 65 chunks, we submit to the server. I'm not sure of exactly why we want to do it like that, because in practice we should end up obtaining the same profile data (the `profiler.profile()` call takes as parameters the start and end date that we care about, along with a `restart` parameter if we want the profiler to continue sampling; the fact that our timers may pause and resume should have no impact on the timestamps that get passed into the profiler).

Comments in Slack suggest that the intent of the original PR (#2495) was:

> if a function is called once and runs for 1s, then frozen for a minute
> until the next call, the profiler would flush a very small profile
>
> so instead the idea was to collect/merge the 65 profiles over the
> course of multiple calls

However:

1. What the current scheduler is actually doing is just setting 65 1-second timers in a row and then grabbing all the profile data and submitting it. This commit adds a failing test to show that this is the case (the existing test only asserted that we "ticked" 65 times before flushing, not that we collected 65 1-second chunks of profiling data before flushing).
2. It's not clear to me that the scheduler is needed at all. In its current form, it is behaving basically the same as the non-serverless scheduler. But I don't think a scheduler that did what that PR intended to do would stop us from submitting small profiles anyway: if a Lambda is frozen (along with its timers) and resumed, we'd still only submit after 65 seconds of actual profiling. And if the Lambda is stopped (gracefully, as opposed to being unceremoniously terminated), then our `_stop()` logic will run, which cleans up the timers and does not flush. (The only time we eagerly submit before the time is up is if our `_nearOOMExport()` method gets triggered via a callback.)

Additional Notes
Nope.