Fault Tolerance

In general, Kafunk handles failed operations by retrying using RetryPolicy. If all retry attempts have failed, Kafunk will throw an exception. Users of Kafunk shouldn't have to retry Kafunk operations; instead, tune the retry policies to suit your needs.

Let It Crash™

Note that even though you can tune the retry policy to retry indefinitely consider letting it crash instead to allow other environmental factors to take effect. Take for example the retry policy for a TCP connection to an individual broker. It might make sense to retry an operation on that connection once or twice, but beyond that, you're better off rediscovering the state of the cluster and attempting the operation with the new state taken into account. The broker may no longer be active, in which case retrying the operation is pointless. Likewise, you may try connecting to the configured list of bootstrap hosts several times, but its possible that the host name has changed entirely, in which case its better to restart the entire application and pick up new configuration. As such, tune the retry policies with respect to your particular environment. Retry policies are effective for transient errors, allowing you to continue operation without incurring a costly application restart. However, they aren't a complete fault tolerance solution for your entire service.

The applicable configuration points are:

ChanConfig.requestRetryPolicy controls the retry policy for requests to an individual broker. This is helpful for recovering from transient network failures or timeouts, but it can't recover from cluster topology changes. Most of the time, this retry policy should be short so that cluster topology changes are detected sooner. See KafkaConfig.requestRetryPolicy bellow.
ChanConfig.requestTimeout controls the maximum duration of requests to individual brokers after which the request is cancelled, making it available for retry.
ChanConfig.connectRetryPolicy controls the retry policy for connection attempts to an individual Kafka broker. This is helpful for recovering from transient network failures or timeout, but it can't recover from cluster topology changes. See KafkaConfig.bootstrapConnectRetryPolicy bellow.
ChanConfig.connectTimeout controls the maximum duration of a TCP connection attempt to an individual broker.
KafkaConfig.requestRetryPolicy controls the retry policy for all requests. This setting works in conjunction with ChanConfig.requestRetryPolicy to determine how a request is retried. The latter controls the retry policy with respect to an individual broker and defines the first tier of retries. These are meant to recover from transient network issues or timeouts, but aren't able to recover from cluster topology changes. This is where KafkaConfig.requestRetryPolicy comes into play - it controls retries which consist of re-discovering the state of the cluster and retrying the operation on a potentially new broker.
KafkaConfig.bootstrapConnectRetryPolicy controls the retry policy for connecting to a bootstrap broker. This setting works in conjunction with ChanConfig.connectRetryPolicy to determine how connections are retries. The latter controls the connection retry policy with respect to an individual broker. The former controls the connection retries with respect to the entire bootstrap broker list.

A Kafunk consumer handles consumer group failures by rejoining the group. The applicable configuration points are:

ConsumerConfig.sessionTimeout controls the time window during which a consumer must send heartbeats to the group coordinator, and where failure to do so results in a group rebalance.
ConsumerConfig.heartbeatFrequency controls the number of times heartbeats are sent during sessionTimeout.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fault Tolerance

Let It Crash™

Clone this wiki locally