Fix race condition in custom libbeat instrumentation #8900

lahsivjar · 2022-08-18T13:21:21Z

Motivation/summary

This PR fixes a race condition in the libbeat instrumentation that surfaces during reload (the initial load and every subsequent configuration change). The race is in setting up custom libbeat instrumentation through the beats monitoring#Default registry. It arises because APM-Server receives two reload events for every reload operation, including the initial config load: one loads inputs using the beats reload registry, and the other loads the output using the beats OutputConfigReloader.

Specifically, the race is in serverRunner#newFinalBatchProcessor (called from serverRunner#run), which updates the global monitoring#Default registry once for each of the two reloads described above. Although access to monitoring#Default is protected by a mutex in serverRunner#newFinalBatchProcessor, the call made for the earlier, temporary reload can still run after the call made for the later reload. This leaves monitoring#Default in an incorrect state:

- For the initial load, the incorrect state is no custom instrumentation.
- For subsequent reloads, the incorrect state is instrumentation with an incorrect model indexer.
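
To illustrate the ordering problem described above, here is a minimal, self-contained Go sketch. It is not the APM-Server code: `registry`, `set`, and `simulateReload` are hypothetical names, and a channel is used to force the unlucky scheduling that the real race only hits occasionally. The point is that the mutex makes each write atomic but does not decide which write lands last.

```go
package main

import (
	"fmt"
	"sync"
)

// registry stands in for a process-wide monitoring registry (hypothetical type).
type registry struct {
	mu      sync.Mutex
	indexer string // name of the indexer whose metrics are currently exposed
}

// set replaces the registered indexer. The mutex makes each write atomic,
// but it says nothing about the order in which callers arrive.
func (r *registry) set(indexer string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.indexer = indexer
}

// simulateReload starts one goroutine per reload event and forces the one
// belonging to the earlier (temporary) reload to register last.
func simulateReload() string {
	reg := &registry{}
	finalRegistered := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // runner created by the first reload event (soon discarded)
		defer wg.Done()
		<-finalRegistered // scheduled late
		reg.set("temporary")
	}()
	go func() { // runner created by the second reload event (the one that stays)
		defer wg.Done()
		reg.set("final")
		close(finalRegistered)
	}()

	wg.Wait()
	reg.mu.Lock()
	defer reg.mu.Unlock()
	return reg.indexer
}

func main() {
	fmt.Println(simulateReload()) // prints "temporary"
}
```

Both writes were serialized by the mutex, yet the registry ends up describing the discarded runner, which matches the "incorrect model indexer" case.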

Checklist

- [x] Update CHANGELOG.asciidoc
~~- [ ] Update package changelog.yml (only if changes to apmpackage have been made)~~
~~- [ ] Documentation has been updated~~

How to test these changes

- Run TestFleetIntegrationMonitoring multiple times and confirm that it is not flaky.
- APM-Server metrics for Elasticsearch output events should always be present in stack monitoring.

Related issues

Closes #8383

mergify · 2022-08-18T13:21:31Z

This pull request does not have a backport label. Could you fix it @lahsivjar? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-7.x is the label to automatically backport to the 7.x branch.
backport-7./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

apmmachine · 2022-08-18T13:25:21Z

💚 Build Succeeded

Build stats

Start Time: 2022-08-22T03:07:05.035+0000
Duration: 25 min 0 sec

Test stats 🧪

Test	Results
Failed	0
Passed	128
Skipped	1
Total	129

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate and publish the docker images.
/test windows : Build & tests on Windows.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

apmmachine · 2022-08-18T13:48:42Z

📚 Go benchmark report

Diff with the main branch

name                                                            old time/op    new time/op    delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
CompressedRequestReader/gzip_content_encoding-12                  17.5µs ± 2%    18.2µs ± 1%    +3.60%  (p=0.016 n=5+4)
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor-12                                               0.00ns ±40%    0.00ns ±79%  +182.64%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2-12    0.00ns ±21%    0.00ns ±34%  +155.69%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
WriteTransaction/nop_codec_big_tx-12                               530ns ±29%     783ns ±21%   +47.65%  (p=0.016 n=5+5)
ReadEvents/json_codec_big_tx/1000_events-12                       8.46ms ± 2%    8.82ms ± 3%    +4.26%  (p=0.016 n=5+5)
ReadEvents/nop_codec_big_tx/1000_events-12                         941µs ±13%     872µs ± 1%    -7.29%  (p=0.016 n=5+4)

name                                                            old alloc/op   new alloc/op   delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
ReadEvents/nop_codec/10_events-12                                 25.7kB ± 0%    25.7kB ± 0%    -0.10%  (p=0.016 n=5+4)
ReadEvents/nop_codec_big_tx/1000_events-12                        2.18MB ± 0%    2.18MB ± 0%    -0.31%  (p=0.032 n=5+5)

name                                                            old allocs/op  new allocs/op  delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

lahsivjar · 2022-08-18T13:29:27Z

internal/beater/beater.go

@@ -664,19 +690,12 @@ func (s *serverRunner) waitReady(ctx context.Context, kibanaClient kibana.Client
 	return waitReady(ctx, s.config.WaitReadyInterval, s.tracer, s.logger, check)
 }

-// This mutex must be held when updating the libbeat monitoring registry,
-// as there may be multiple servers running concurrently.
-var monitoringRegistryMu sync.Mutex


[For reviewers] IIUC, this mutex is not required. Earlier, the mutex prevented concurrent access to the default monitoring registry; however, that was not enough to prevent the race, because a temporary reload's run method might run later and leave the default registry in an inconsistent state.

The current PR makes sure that run is called only once during reload, so the mutex is no longer needed. Let me know if my understanding here is incorrect.

axw

Nice one! Thanks for tracking this down.

Maybe in the future we can look at splitting serverRunner.run into two methods: one which initialises things, called with a mutex; and one which actively runs, which may be concurrent.
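
A rough sketch of what such a split could look like, under stated assumptions: `server`, `initialise`, `serve`, `sharedState`, and `applyReloads` are hypothetical names and not part of APM-Server. Initialisation touches shared state under a mutex and is called synchronously in reload order; only the long-running part is concurrent.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// sharedState stands in for process-wide state such as a global registry.
type sharedState struct {
	mu      sync.Mutex
	current string
}

// server is a hypothetical stand-in for serverRunner.
type server struct {
	name  string
	state *sharedState
}

// initialise touches shared state, so it takes the mutex. The reloader calls
// it synchronously, which fixes the order in which updates land.
func (s *server) initialise() {
	s.state.mu.Lock()
	defer s.state.mu.Unlock()
	s.state.current = s.name
}

// serve does the long-running work and touches no shared state, so several
// servers may execute it concurrently.
func (s *server) serve(ctx context.Context) {
	<-ctx.Done()
}

// applyReloads initialises one server per name, in order, and lets each one
// serve in the background. It returns the shared state after shutdown.
func applyReloads(names ...string) string {
	state := &sharedState{}
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	for _, name := range names {
		s := &server{name: name, state: state}
		s.initialise() // synchronous: updates land in reload order
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.serve(ctx)
		}()
	}
	cancel() // stop all servers
	wg.Wait()

	state.mu.Lock()
	defer state.mu.Unlock()
	return state.current
}

func main() {
	fmt.Println(applyReloads("temporary", "final")) // prints "final"
}
```

Because `initialise` runs before the goroutine is spawned, the last reload always wins regardless of how the `serve` goroutines are scheduled.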

marclop

LGTM! Should this PR also be backported to 8.4? Seems like a good candidate to include with the TBS reload fix.

marclop · 2022-08-22T02:49:12Z

internal/beater/beater.go

+	case <-runner.done:
+		return errors.New("runner exited unexpectedly")
+	case <-runner.started:
+		// runner has started


Nice catch and great solution!
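
For readers outside the codebase, the `started`/`done` select in the snippet above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: only the field names `started` and `done` and the error message come from the diff, while `newRunner`, `run`, `waitStarted`, and `startAndWait` are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// runner mirrors the two channels seen in the diff (the rest is hypothetical).
type runner struct {
	started chan struct{} // closed once initialisation has finished
	done    chan struct{} // closed when run returns
}

func newRunner() *runner {
	return &runner{started: make(chan struct{}), done: make(chan struct{})}
}

// run simulates initialisation, signals started, then blocks until ctx is
// cancelled. If failInit is true it exits before ever signalling started.
func (r *runner) run(ctx context.Context, failInit bool) error {
	defer close(r.done)
	if failInit {
		return errors.New("initialisation failed")
	}
	close(r.started)
	<-ctx.Done()
	return nil
}

// waitStarted blocks until the runner has started, has exited, or ctx ends.
func waitStarted(ctx context.Context, r *runner) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-r.done:
		return errors.New("runner exited unexpectedly")
	case <-r.started:
		return nil // runner has started
	}
}

// startAndWait launches a runner and reports whether it came up.
func startAndWait(failInit bool) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	r := newRunner()
	go r.run(ctx, failInit)
	return waitStarted(ctx, r)
}

func main() {
	fmt.Println(startAndWait(false)) // prints "<nil>"
	fmt.Println(startAndWait(true))  // prints "runner exited unexpectedly"
}
```

Waiting on `started` lets the caller return from reload only after the runner has finished its setup, so a later reload cannot overtake an earlier one.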

(cherry picked from commit 7494a95) # Conflicts: # changelogs/head.asciidoc

…8900) (#8916) * Fix race condition in custom libbeat instrumentation (#8900) (cherry picked from commit 7494a95) # Conflicts: # changelogs/head.asciidoc * Fix conflicts Co-authored-by: Vishal Raj <[email protected]>

mergify bot added the backport-skip Skip notification from the automated backport with mergify label Aug 18, 2022

lahsivjar added 3 commits August 18, 2022 21:23

Fix race condition on custom libbeat instrumentation

329a670

Remove unnecessary locking

92312d8

Update changelog

780c15b

lahsivjar force-pushed the 8383_outputevents branch from a13b930 to 780c15b Compare August 18, 2022 13:23

lahsivjar commented Aug 18, 2022

lahsivjar requested a review from a team August 18, 2022 14:44

lahsivjar added 2 commits August 18, 2022 23:00

fix typo

a780cf3

Update changelog entry

19b79fe

lahsivjar changed the title ~~Fix race condition on custom libbeat instrumentation~~ Fix race condition in custom libbeat instrumentation Aug 18, 2022

axw approved these changes Aug 22, 2022

marclop approved these changes Aug 22, 2022

lahsivjar added 2 commits August 22, 2022 11:03

Inline call to reload for standalone mode

160cdad

Merge branch 'main' into 8383_outputevents

d358eab

lahsivjar added backport-8.4 Automated backport with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Aug 22, 2022

lahsivjar merged commit 7494a95 into elastic:main Aug 22, 2022

lahsivjar deleted the 8383_outputevents branch August 22, 2022 03:36

mergify bot mentioned this pull request Aug 22, 2022

[8.4] Fix race condition in custom libbeat instrumentation (backport #8900) #8916

Merged

mergify bot pushed a commit that referenced this pull request Aug 22, 2022

Fix race condition in custom libbeat instrumentation (#8900)

f71b3e5

(cherry picked from commit 7494a95) # Conflicts: # changelogs/head.asciidoc
