Freshness tracker should fail a cluster iteration if all partitions for all consumers fail #51

Open
jyates opened this issue Nov 15, 2021 · 0 comments

jyates commented Nov 15, 2021

Currently, we are very generous with the failure constraints for a cluster, from ConsumerFreshness (ln 281-293):

    // if all the consumer measurements succeed, then we return the cluster name
    // otherwise, Future.get will throw an exception representing the failure to measure a consumer (and thus the
    // failure to successfully monitor the cluster).
    return Futures.whenAllSucceed(completedConsumers).call(client::getCluster, this.executor);
  }

  /**
   * Measure the freshness for all the topic/partitions currently consumed by the given consumer group. To maintain
   * the existing contract, a consumer measurement fails ({@link Future#get()} throws an exception) only if:
   *  - burrow group status lookup fails
   *  - execution is interrupted
   * Failure to actually measure the consumer is swallowed into a log message & metric update; obviously, this is less
   * than ideal for many cases, but it will be addressed later.

However, SSL connection issues (i.e. a misconfiguration) only show up when querying the consumers. So you can have a valid burrow lookup for the cluster (because burrow is configured correctly) while freshness fails for every consumer because the tracker is misconfigured. You would never know, though (from the kafka_consumer_freshness_last_success_run_timestamp metric), since that will not get incremented for the failures.

jyates added a commit to jyates/kafka-helmsman that referenced this issue Jun 28, 2022
Across all the partitions for all topics for all consumers for a given
cluster, if we succeed at reading the freshness for at least one of
the consumers, then the cluster freshness succeeds. Without this, if
none of the consumers could be evaluated successfully, the cluster
would still be marked successful, even though that often indicates an
incorrectly configured cluster. However, if we are able to read even
a single partition, then we can reach the cluster. The other failures
may be transient, so we allow the next round to try for more successes.

Addresses teslamotors#51
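
The rule in the commit message above could be sketched roughly as follows. This is a minimal illustration and not the actual kafka-helmsman change: the class `ClusterFreshnessSketch`, the method `measureCluster`, and the idea that each consumer future yields a count of successfully measured partitions are all assumptions made for the example. It also uses the JDK's `CompletableFuture` rather than the Guava `Futures` call quoted in the issue, and it assumes a cluster with no consumers at all counts as a success.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch; names and types are illustrative, not the project's API.
final class ClusterFreshnessSketch {

  /**
   * @param clusterName name returned when the cluster iteration succeeds
   * @param consumers   one future per consumer group, each yielding the number of
   *                    partitions whose freshness was measured successfully
   */
  static CompletableFuture<String> measureCluster(
      String clusterName, List<CompletableFuture<Integer>> consumers) {
    CompletableFuture<?>[] pending = consumers.toArray(new CompletableFuture<?>[0]);
    // As in the existing contract, a consumer future that itself fails
    // (e.g. burrow lookup failure) fails the whole cluster.
    return CompletableFuture.allOf(pending).thenApply(ignored -> {
      int measured = 0;
      for (CompletableFuture<Integer> consumer : consumers) {
        measured += consumer.join(); // already complete here, so this does not block
      }
      // New rule: zero successful partitions across every consumer fails the cluster.
      // Assumption: a cluster with no consumers is still treated as a success.
      if (!consumers.isEmpty() && measured == 0) {
        throw new IllegalStateException(
            "No partition could be measured for any consumer in cluster " + clusterName);
      }
      return clusterName;
    });
  }

  public static void main(String[] args) {
    CompletableFuture<String> partial = measureCluster(
        "cluster-a",
        List.of(CompletableFuture.completedFuture(0), CompletableFuture.completedFuture(2)));
    CompletableFuture<String> none = measureCluster(
        "cluster-b",
        List.of(CompletableFuture.completedFuture(0), CompletableFuture.completedFuture(0)));
    System.out.println("partial failed: " + partial.isCompletedExceptionally());
    System.out.println("none failed: " + none.isCompletedExceptionally());
  }
}
```

With this shape, a single readable partition is enough for the cluster to pass, so transient failures get another round, while a cluster where nothing can be read surfaces as a failed `Future`.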
jyates added a commit to jyates/kafka-helmsman that referenced this issue Jun 29, 2022
jyates added a commit that referenced this issue Jun 29, 2022
Addresses #51