-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition in custom libbeat instrumentation #8900
Conversation
This pull request does not have a backport label. Could you fix it @lahsivjar? 🙏
NOTE: |
a13b930
to
780c15b
Compare
📚 Go benchmark reportDiff with the
report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat |
@@ -664,19 +690,12 @@ func (s *serverRunner) waitReady(ctx context.Context, kibanaClient kibana.Client | |||
return waitReady(ctx, s.config.WaitReadyInterval, s.tracer, s.logger, check) | |||
} | |||
|
|||
// This mutex must be held when updating the libbeat monitoring registry, | |||
// as there may be multiple servers running concurrently. | |||
var monitoringRegistryMu sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[For reviewers] IIUC, this mutex is not required. Earlier the mutex was preventing the default monitoring registry to be accessed concurrently however, that was not enough to prevent race as a temporary reload's run
method might run later which will result in an inconsistent state in the default registry.
The current PR makes sure that run
is called only once during reload thus not requiring this mutex anymore. Let me know if my understanding here is incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one! Thanks for tracking this down.
Maybe in the future we can look at splitting serverRunner.run
into two methods: one which initialises things, called with a mutex; and one which actively runs, which may be concurrent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Should this PR also be back ported to 8.4
? Seems like a good candidate to include with the TBS reload fix.
case <-runner.done: | ||
return errors.New("runner exited unexpectedly") | ||
case <-runner.started: | ||
// runner has started |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch and great solution!
(cherry picked from commit 7494a95) # Conflicts: # changelogs/head.asciidoc
…8900) (#8916) * Fix race condition in custom libbeat instrumentation (#8900) (cherry picked from commit 7494a95) # Conflicts: # changelogs/head.asciidoc * Fix conflicts Co-authored-by: Vishal Raj <[email protected]>
Motivation/summary
The PR fixes race condition in libbeat instrumentation that surfaces during reload (initial and on subsequent configuration change). The race condition is in setting up custom libbeat instrumentation by using beats
monitoring#DefaultRegistry
. The race arises from two reload events captured by APM-Server for every reload operation including the initial config load. The two reloads are used by APM-Server to loadinputs
using the beats reload registry andoutput
using beatsOutputConfigReloader
.Specifically, the race condition is in
serverRunner#newFinalBatchProcessor
(called fromserverRunner#run
) due to updates to a globalmonitoring#Default
registry and the two reloads explained above. Even though the access to themonitoring#Default
is protected by mutex inserverRunner#newFinalBatchProcessor
; it is still possible thatserverRunner#newFinalBatchProcessor
call for a temporary reload happens later for a reload event that happens earlier. This will causemonitoring#Default
to be in an incorrect state:Checklist
- [ ] Update package changelog.yml (only if changes toapmpackage
have been made)- [ ] Documentation has been updatedHow to test these changes
TestFleetIntegrationMonitoring
multiple times and observe they are not flakyRelated issues
Closes #8383