Fix race condition in custom libbeat instrumentation #8900

Merged
merged 7 commits into from
Aug 22, 2022

@lahsivjar lahsivjar commented Aug 18, 2022

Motivation/summary

This PR fixes a race condition in the libbeat instrumentation that surfaces during reload (both the initial load and subsequent configuration changes). The race is in setting up custom libbeat instrumentation via the beats monitoring#Default registry. It arises because APM-Server receives two reload events for every reload operation, including the initial config load: one loads inputs using the beats reload registry, and the other loads the output using the beats OutputConfigReloader.

Specifically, the race is in serverRunner#newFinalBatchProcessor (called from serverRunner#run), which updates the global monitoring#Default registry once for each of the two reloads described above. Although access to monitoring#Default is protected by a mutex in serverRunner#newFinalBatchProcessor, the mutex does not enforce ordering: the serverRunner#newFinalBatchProcessor call for the temporary runner created by the earlier reload event can still execute after the call for the later reload event. This leaves monitoring#Default in an incorrect state:

  1. For initial load, the incorrect state would be no custom instrumentation
  2. For subsequent reloads, the incorrect state would be instrumentation with an incorrect model indexer
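
The ordering problem described above can be illustrated with a minimal Go sketch. It is not the APM-Server code: the registry type, the indexer names, and both reload functions are hypothetical stand-ins. The first function forces the interleaving in which the earlier reload's runner registers last; the second shows one way to remove the ambiguity, by waiting for each runner to signal that it has registered before the next event is handled.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for a process-wide monitoring registry.
type registry struct {
	mu      sync.Mutex
	indexer string // name of the indexer whose metrics are exposed
}

// set is mutually exclusive, but the mutex says nothing about call order.
func (r *registry) set(indexer string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.indexer = indexer
}

// reloadUnordered starts one runner per reload event and forces the runner
// for the earlier event to register last, leaving stale state behind.
func reloadUnordered(r *registry) string {
	release := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // runner for the first (temporary) reload
		defer wg.Done()
		<-release // delayed until the later runner has registered
		r.set("indexer-1")
	}()
	go func() { // runner for the second (final) reload
		defer wg.Done()
		r.set("indexer-2")
		close(release)
	}()
	wg.Wait()
	return r.indexer
}

// reloadOrdered waits for each runner to signal that it has registered
// before the next reload event is handled.
func reloadOrdered(r *registry) string {
	for _, name := range []string{"indexer-1", "indexer-2"} {
		started := make(chan struct{})
		go func(name string) {
			r.set(name)
			close(started)
		}(name)
		<-started
	}
	return r.indexer
}

func main() {
	fmt.Println("unordered:", reloadUnordered(&registry{})) // unordered: indexer-1
	fmt.Println("ordered:", reloadOrdered(&registry{}))     // ordered: indexer-2
}
```

In the unordered case every write is protected by the mutex, yet the registry still ends up pointing at the indexer of the earlier reload.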

Checklist

How to test these changes

  1. Run TestFleetIntegrationMonitoring multiple times and confirm that it is not flaky
  2. APM-Server metrics for elasticsearch output events should always be present in stack monitoring

Related issues

Closes #8383

mergify bot commented Aug 18, 2022

This pull request does not have a backport label. Could you fix it @lahsivjar? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.x is the label to automatically backport to the 7.x branch.
  • backport-7./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Aug 18, 2022

apmmachine commented Aug 18, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-08-22T03:07:05.035+0000

  • Duration: 25 min 0 sec

Test stats 🧪

Test Results

  • Failed: 0
  • Passed: 128
  • Skipped: 1
  • Total: 129


apmmachine commented Aug 18, 2022

📚 Go benchmark report

Diff with the main branch

name                                                            old time/op    new time/op    delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
CompressedRequestReader/gzip_content_encoding-12                  17.5µs ± 2%    18.2µs ± 1%    +3.60%  (p=0.016 n=5+4)
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor-12                                               0.00ns ±40%    0.00ns ±79%  +182.64%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2-12    0.00ns ±21%    0.00ns ±34%  +155.69%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
WriteTransaction/nop_codec_big_tx-12                               530ns ±29%     783ns ±21%   +47.65%  (p=0.016 n=5+5)
ReadEvents/json_codec_big_tx/1000_events-12                       8.46ms ± 2%    8.82ms ± 3%    +4.26%  (p=0.016 n=5+5)
ReadEvents/nop_codec_big_tx/1000_events-12                         941µs ±13%     872µs ± 1%    -7.29%  (p=0.016 n=5+4)

name                                                            old alloc/op   new alloc/op   delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
ReadEvents/nop_codec/10_events-12                                 25.7kB ± 0%    25.7kB ± 0%    -0.10%  (p=0.016 n=5+4)
ReadEvents/nop_codec_big_tx/1000_events-12                        2.18MB ± 0%    2.18MB ± 0%    -0.31%  (p=0.032 n=5+5)

name                                                            old allocs/op  new allocs/op  delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

@@ -664,19 +690,12 @@ func (s *serverRunner) waitReady(ctx context.Context, kibanaClient kibana.Client
	return waitReady(ctx, s.config.WaitReadyInterval, s.tracer, s.logger, check)
}

// This mutex must be held when updating the libbeat monitoring registry,
// as there may be multiple servers running concurrently.
var monitoringRegistryMu sync.Mutex

@lahsivjar lahsivjar Aug 18, 2022

[For reviewers] IIUC, this mutex is no longer required. Previously it prevented concurrent access to the default monitoring registry; however, that was not enough to prevent the race, because a temporary reload's run method might run later and leave the default registry in an inconsistent state.

This PR makes sure that run is called only once during a reload, so the mutex is no longer needed. Let me know if my understanding here is incorrect.

@lahsivjar lahsivjar requested a review from a team August 18, 2022 14:44
@lahsivjar lahsivjar changed the title from "Fix race condition on custom libbeat instrumentation" to "Fix race condition in custom libbeat instrumentation" Aug 18, 2022

@axw axw left a comment

Nice one! Thanks for tracking this down.

Maybe in the future we can look at splitting serverRunner.run into two methods: one which initialises things, called with a mutex; and one which actively runs, which may be concurrent.

internal/beater/beater.go (outdated, resolved)

@marclop marclop left a comment

LGTM! Should this PR also be backported to 8.4? Seems like a good candidate to include with the TBS reload fix.

internal/beater/beater.go (outdated, resolved)
Comment on lines +362 to +365
case <-runner.done:
	return errors.New("runner exited unexpectedly")
case <-runner.started:
	// runner has started

Nice catch and great solution!

@lahsivjar lahsivjar added backport-8.4 Automated backport with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Aug 22, 2022
@lahsivjar lahsivjar merged commit 7494a95 into elastic:main Aug 22, 2022
@lahsivjar lahsivjar deleted the 8383_outputevents branch August 22, 2022 03:36
mergify bot pushed a commit that referenced this pull request Aug 22, 2022
(cherry picked from commit 7494a95)

# Conflicts:
#	changelogs/head.asciidoc
lahsivjar added a commit that referenced this pull request Aug 25, 2022
…8900) (#8916)

* Fix race condition in custom libbeat instrumentation (#8900)

(cherry picked from commit 7494a95)

# Conflicts:
#	changelogs/head.asciidoc

* Fix conflicts

Co-authored-by: Vishal Raj <[email protected]>
Labels
backport-8.4 Automated backport with mergify
Development

Successfully merging this pull request may close these issues.

"Output Events Rate" in stack monitoring is always zero