Making Xinfra Monitor highly available #355

Open
wants to merge 17 commits into
base: master

RiaPradeep

Currently, Xinfra Monitor is designed to run on a single machine, so a Kafka cluster is typically monitored by one separately running monitor instance. If the machine running Xinfra Monitor goes down or crashes, no metrics are reported for that cluster until the monitor is restarted.

We (the Kafka team at Bloomberg) propose a modification to Xinfra Monitor that makes it highly available and avoids this problem. A new service, HAMonitoringService, uses Kafka’s AbstractCoordinator to manage multiple simultaneously running instances of Xinfra Monitor, all of which join the same group. The group coordinator selects one instance to run the monitor (including the internal services that produce metrics). If that instance goes down, one of the other instances in the group can quickly take over reporting.

We’ve deployed these changes internally over the past two months with promising results. The monitor consistently reports metrics; any gap in metrics has been less than 5 minutes.

Specifically, these changes create HAMonitoringService, which instantiates and polls the HAMonitoringCoordinator. All instances of Xinfra Monitor will join a group this coordinator manages. The coordinator picks one group member to report metrics, and that instance will start Xinfra Monitor (as defined here). All other instances will stop Xinfra Monitor (defined here).
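The start/stop behavior described above can be illustrated with a minimal sketch. It does not use the PR's actual classes or Kafka's AbstractCoordinator: `Monitor`, `LeadershipHandler`, and `onAssignment` are hypothetical names, and the sketch assumes the handler is told after each group assignment whether this instance was chosen to report.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the monitor's start/stop hooks; not the real Xinfra Monitor API. */
interface Monitor {
    void start();
    void stop();
}

/** Hypothetical handler that starts or stops the local monitor after each group assignment. */
class LeadershipHandler {
    private final Monitor monitor;
    private final AtomicBoolean running = new AtomicBoolean(false);

    LeadershipHandler(Monitor monitor) {
        this.monitor = monitor;
    }

    /** Assumed to be called after every rebalance with this instance's role. */
    void onAssignment(boolean chosenToReport) {
        if (chosenToReport) {
            // Start only on a stopped -> running transition, so repeated assignments are harmless.
            if (running.compareAndSet(false, true)) {
                monitor.start();
            }
        } else {
            // Stop only if this instance was actually running.
            if (running.compareAndSet(true, false)) {
                monitor.stop();
            }
        }
    }

    boolean isRunning() {
        return running.get();
    }
}
```

Guarding both transitions with `compareAndSet` is a design choice of this sketch: it keeps repeated assignments with the same role from starting or stopping the monitor twice.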

The HA option can be configured in the .config file like any other service. If no HA config is specified, Xinfra Monitor will run normally.

For example, including the following in the config file would run Xinfra Monitor with this feature:

"HA-monitoring-service": {
  "class.name": "com.linkedin.xinfra.monitor.services.HAMonitoringService",
  "bootstrap.servers": <connection to kafka cluster>,
  "group.id": "HA-monitoring-group"
}

With this start/stop approach, an instance may start reporting, stop, and later start again. To support such restarts, some services had part of their instantiation moved from the constructor to the start method.
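As a rough illustration of why instantiation has to move into the start method, consider a service backed by an executor: once a `java.util.concurrent` executor has been shut down it rejects new tasks, so a service that creates its executor in the constructor cannot be started a second time. The sketch below is not taken from the PR; `RestartableReporter` and its members are hypothetical names.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical restartable service; names are illustrative, not from Xinfra Monitor. */
class RestartableReporter {
    private final AtomicLong reports = new AtomicLong();

    // Deliberately not created in the constructor: a shut-down executor rejects new tasks,
    // so each start() needs a fresh one.
    private ScheduledExecutorService executor;

    synchronized void start() {
        if (executor != null) {
            return; // already running
        }
        executor = Executors.newSingleThreadScheduledExecutor();
        // Stand-in for "report a metric": bump a counter every 10 ms.
        executor.scheduleAtFixedRate(reports::incrementAndGet, 0, 10, TimeUnit.MILLISECONDS);
    }

    synchronized void stop() {
        if (executor == null) {
            return; // already stopped
        }
        executor.shutdownNow();
        executor = null;
    }

    synchronized boolean isRunning() {
        return executor != null;
    }

    long reportCount() {
        return reports.get();
    }
}
```

Calling `start()`, `stop()`, and `start()` again on one `RestartableReporter` resumes reporting without a `RejectedExecutionException`, which is the start, stop, start-again sequence described above.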
