Making Xinfra Monitor highly available #355
Open: RiaPradeep wants to merge 17 commits into linkedin:master from RiaPradeep:HA-monitor
* Add HA monitoring using Abstract Coordinator
* Change some naming
* Update log messages
* Add service factory for HA monitoring
* Update src/main/java/com/linkedin/xinfra/monitor/XinfraMonitor.java

  Check for HAMonitoring class more robustly.

  Co-authored-by: hgeraldino <[email protected]>
* Correct build error
* Cleanup and clarify HA variable

Co-authored-by: Christopher Beard <[email protected]>
Co-authored-by: hgeraldino <[email protected]>
Currently, Xinfra Monitor is designed to run on a single machine. To monitor a Kafka cluster, you might run a single, separate monitor instance. What if the machine running Xinfra Monitor goes down or crashes? No metrics are reported for that cluster until the monitor is restarted.
We (the Kafka team at Bloomberg) propose a modification to Xinfra Monitor to make it highly available, avoiding this problem. A new service, `HAMonitoringService`, uses Kafka’s AbstractCoordinator to manage multiple instances of Xinfra Monitor running simultaneously, which are put into the same group. The group coordinator selects 1 instance to run the monitor (including internal services producing metrics). If that instance goes down, one of the other instances in the group can quickly take over reporting.

We’ve deployed these changes internally over the past two months with promising results. The monitor consistently reports metrics; any gap in metrics has been less than 5 minutes.
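The election behaviour described above (one group member reports, the others stay idle, and another member takes over when the reporter disappears) can be illustrated with a small, dependency-free sketch. It does not use Kafka's AbstractCoordinator and is not the code in this PR; `MonitorInstance`, `SimpleElection`, and the "first live member wins" rule are all hypothetical names and choices made only for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one Xinfra Monitor instance in the HA group. */
class MonitorInstance {
    private final String id;
    private boolean reporting = false;

    MonitorInstance(String id) {
        this.id = id;
    }

    String id() {
        return id;
    }

    boolean isReporting() {
        return reporting;
    }

    /** Called after each simulated rebalance with the id of the chosen member. */
    void onAssignment(String chosenId) {
        boolean shouldReport = id.equals(chosenId);
        if (shouldReport && !reporting) {
            reporting = true;   // a real service would start Xinfra Monitor here
        } else if (!shouldReport && reporting) {
            reporting = false;  // a real service would stop Xinfra Monitor here
        }
    }
}

/** Hypothetical stand-in for the group coordinator: picks exactly one live member. */
class SimpleElection {
    /** Picks the first live member (an arbitrary rule for this sketch) and notifies everyone. */
    static String rebalance(List<MonitorInstance> liveMembers) {
        if (liveMembers.isEmpty()) {
            return null;
        }
        String chosen = liveMembers.get(0).id();
        for (MonitorInstance member : liveMembers) {
            member.onAssignment(chosen);
        }
        return chosen;
    }
}

class HaElectionDemo {
    public static void main(String[] args) {
        MonitorInstance a = new MonitorInstance("a");
        MonitorInstance b = new MonitorInstance("b");

        SimpleElection.rebalance(Arrays.asList(a, b));
        System.out.println(a.isReporting() + " " + b.isReporting()); // prints "true false"

        // Instance "a" goes down, so only "b" remains in the group.
        SimpleElection.rebalance(Arrays.asList(b));
        System.out.println(b.isReporting()); // prints "true"
    }
}
```

In the actual change, group membership and failure detection are handled by the coordinator rather than by an explicit list of live members as in this sketch.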
Specifically, these changes create `HAMonitoringService`, which instantiates and polls the `HAMonitoringCoordinator`. All instances of Xinfra Monitor will join a group this coordinator manages. The coordinator picks one group member to report metrics, and that instance will start Xinfra Monitor (as defined here). All other instances will stop Xinfra Monitor (defined here).

The HA option can be configured in the `.config` file like any other service. If no HA config is specified, Xinfra Monitor will run normally.

For example, including the following in the config file would run Xinfra Monitor with this feature:
With this start/stop approach, an instance may start reporting, stop reporting, and later start again. To support that, some instantiation had to be moved from the constructor to the `start` method in some services.
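A minimal sketch of why that move matters, assuming a service that owns a thread pool. `RestartableService` is a hypothetical class, not one from this PR. A `java.util.concurrent.ExecutorService` rejects new tasks once it has been shut down, so a pool created in the constructor could not survive a stop followed by a second start; creating it in `start()` gives each run a fresh one.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical service that can be started, stopped, and started again. */
class RestartableService {
    private final String name;
    private boolean running = false;
    private int startCount = 0;

    // Deliberately NOT created in the constructor: once shut down,
    // an ExecutorService cannot accept tasks again.
    private ExecutorService executor;

    RestartableService(String name) {
        this.name = name;
    }

    synchronized void start() {
        if (running) {
            return; // already running, nothing to do
        }
        executor = Executors.newSingleThreadExecutor(); // fresh pool for every start
        executor.execute(() -> {
            // a real service would produce or report metrics here
        });
        running = true;
        startCount++;
    }

    synchronized void stop() {
        if (!running) {
            return; // already stopped, nothing to do
        }
        executor.shutdownNow(); // this pool is now unusable and is replaced on the next start
        running = false;
    }

    synchronized boolean isRunning() {
        return running;
    }

    synchronized int startCount() {
        return startCount;
    }

    String name() {
        return name;
    }
}

class RestartDemo {
    public static void main(String[] args) {
        RestartableService service = new RestartableService("example-service");
        service.start();
        service.stop();
        service.start(); // works because start() builds a new executor
        System.out.println(service.startCount()); // prints "2"
        service.stop();
    }
}
```

Making `start()` and `stop()` idempotent, as here, is one way to keep repeated assignment callbacks from starting a service twice; whether the PR's services do the same is not stated in the description.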