Making Xinfra Monitor highly available #355

RiaPradeep · 2021-12-10T23:09:08Z

Currently, Xinfra Monitor is designed to run on a single machine, so a Kafka cluster is typically monitored by one standalone monitor instance. If the machine running that instance goes down or crashes, no metrics are reported for the cluster until the monitor is restarted.

We (the Kafka team at Bloomberg) propose a modification to Xinfra Monitor that makes it highly available and avoids this problem. A new service, HAMonitoringService, uses Kafka’s AbstractCoordinator to manage multiple instances of Xinfra Monitor that run simultaneously and belong to the same group. The group coordinator selects one instance to run the monitor (including the internal services that produce metrics). If that instance goes down, one of the other instances in the group can quickly take over reporting.

We’ve deployed these changes internally over the past two months with promising results. The monitor consistently reports metrics; any gap in metrics has been less than 5 minutes.

Specifically, these changes create HAMonitoringService, which instantiates and polls the HAMonitoringCoordinator. All instances of Xinfra Monitor will join a group this coordinator manages. The coordinator picks one group member to report metrics, and that instance will start Xinfra Monitor (as defined here). All other instances will stop Xinfra Monitor (defined here).
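The start/stop behaviour described above can be sketched as a small idempotent toggle. This is only an illustration, not the code in this PR: `LeadershipToggle` and `onAssignment` are hypothetical names, and the two `Runnable`s stand in for starting and stopping Xinfra Monitor. The sketch assumes the coordinator reports, after each rebalance, whether this instance was selected.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper (not part of this PR): starts or stops a monitor as group selection changes. */
final class LeadershipToggle {
  private final Runnable startMonitor;
  private final Runnable stopMonitor;
  private final AtomicBoolean running = new AtomicBoolean(false);

  LeadershipToggle(Runnable startMonitor, Runnable stopMonitor) {
    this.startMonitor = startMonitor;
    this.stopMonitor = stopMonitor;
  }

  /** Assumed to be called after each rebalance with whether this instance was selected. */
  void onAssignment(boolean selected) {
    if (selected) {
      // Start only on the transition from stopped to running.
      if (running.compareAndSet(false, true)) {
        startMonitor.run();
      }
    } else {
      // Stop only on the transition from running to stopped.
      if (running.compareAndSet(true, false)) {
        stopMonitor.run();
      }
    }
  }

  boolean isRunning() {
    return running.get();
  }
}
```

Guarding each transition with `compareAndSet` means repeated rebalances that produce the same assignment do not start or stop the monitor a second time.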

The HA option can be configured in the .config file like any other service. If no HA config is specified, Xinfra Monitor will run normally.

For example, including the following in the config file would run Xinfra Monitor with this feature:

"HA-monitoring-service": {
  "class.name": "com.linkedin.xinfra.monitor.services.HAMonitoringService",
  "bootstrap.servers": <connection to kafka cluster>,
  "group.id": "HA-monitoring-group"
}

Because instances are started and stopped this way, a single instance may start reporting, stop, and later start again. To support this, some services now perform instantiation in their start method rather than in the constructor.
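One common reason for such a move, shown here as a minimal sketch rather than the PR's actual code, is that some resources cannot be reused once stopped. A `ScheduledExecutorService`, for example, rejects new tasks after shutdown, so creating it in the constructor would break a second start. `RestartableService` is a hypothetical class; the PR does not state which resources were moved.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical service (not from this PR) that can be started, stopped, and started again. */
final class RestartableService {
  private final AtomicInteger ticks = new AtomicInteger();
  // Created in start() rather than the constructor: a shut-down executor cannot be reused.
  private ScheduledExecutorService executor;

  synchronized void start() {
    if (executor != null) {
      return; // already running
    }
    executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleAtFixedRate(ticks::incrementAndGet, 0, 10, TimeUnit.MILLISECONDS);
  }

  synchronized void stop() {
    if (executor == null) {
      return; // already stopped
    }
    executor.shutdownNow();
    executor = null; // the next start() builds a fresh executor
  }

  synchronized boolean isRunning() {
    return executor != null;
  }

  int ticks() {
    return ticks.get();
  }
}
```

Each call to `start()` after a `stop()` builds a fresh executor, so the instance can rejoin reporting whenever the coordinator selects it again.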


* Add HA monitoring using Abstract Coordinator
* Change some naming
* Update log messages
* Add service factory for HA monitoring
* Update src/main/java/com/linkedin/xinfra/monitor/XinfraMonitor.java
  Check for HAMonitoring class more robustly.
  Co-authored-by: hgeraldino <[email protected]>
* Correct build error
* Cleanup and clarify HA variable

Co-authored-by: Christopher Beard <[email protected]>
Co-authored-by: hgeraldino <[email protected]>

chrisbeard and others added 17 commits September 23, 2021 13:33

Add HA monitoring using Abstract Coordinator

b69abf3

Change some naming

054dc0d

Update log messages

f302b1f

Add service factory for HA monitoring

192bb1c

Update src/main/java/com/linkedin/xinfra/monitor/XinfraMonitor.java

79ff63a

Check for HAMonitoring class more robustly. Co-authored-by: hgeraldino <[email protected]>

Correct build error

e9dbe87

Cleanup and clarify HA variable

2e73cf3

Improve static assignment and fix restart errors

56d0fa7

Merge remote-tracking branch 'oss/master' into HA-abstract

41536c2

Add solution for records lost

ba86de1

Remove unnecessary re-initialization

18a9a66

Cleanup

b6141e6

Remove rethrown exception

4611850

Merge remote-tracking branch 'upstream/master' into HA-upstream

2f2bea9

Cleanup and add comments

976082d

Clean up style

7158089
