[improve] [pip] PIP-382: Add a label named reason for topic_load_failed_total #23351

48 changes: 48 additions & 0 deletions pip/pip-382.md
@@ -0,0 +1,48 @@
# PIP-382: Add a label named reason for topic_load_failed_total

# Background knowledge

Pulsar exposes a metric, `topic_load_failed_total`, that counts failed topic loads. It is incremented in the following cases:
- The target bundle is unloading.
- Failed to load policies.
- Failed to load the Managed Ledger.
- Failed to read the metadata store.
- Topic initialization failed, for example when rebuilding the deduplication info fails.
- Topic load timed out.
- Others.

# Motivation & Goals

Adding a label to the metric `topic_load_failed_total` lets us see quickly which kind of error occurred, so we can fix the issue faster.

### Metrics

Add a label named `reason` to `topic_load_failed_total`:
- label name: `reason`
- label values:
  - `bundle_unloading`
  - `failed_load_policies`
  - `failed_load_ml`
  - `failed_access_metadata_store`
  - `failed_init`
  - `timeout`
  - `others`
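
To illustrate the proposed label, here is a minimal sketch of a counter keyed by `reason`. It is not Pulsar's actual implementation: the class `TopicLoadFailureCounter`, the enum `Reason`, and the methods `recordFailure`, `get`, and `render` are hypothetical names chosen for this example. Only the metric name and the label values come from this PIP. The output lines follow the Prometheus text exposition style.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch of a per-reason failure counter; not Pulsar's real code. */
public class TopicLoadFailureCounter {

    /** Label values proposed in this PIP. The enum itself is an assumption. */
    public enum Reason {
        BUNDLE_UNLOADING("bundle_unloading"),
        FAILED_LOAD_POLICIES("failed_load_policies"),
        FAILED_LOAD_ML("failed_load_ml"),
        FAILED_ACCESS_METADATA_STORE("failed_access_metadata_store"),
        FAILED_INIT("failed_init"),
        TIMEOUT("timeout"),
        OTHERS("others");

        private final String labelValue;

        Reason(String labelValue) {
            this.labelValue = labelValue;
        }

        public String labelValue() {
            return labelValue;
        }
    }

    // One counter per reason, created up front so the map is never modified later.
    private final Map<Reason, LongAdder> counters = new EnumMap<>(Reason.class);

    public TopicLoadFailureCounter() {
        for (Reason reason : Reason.values()) {
            counters.put(reason, new LongAdder());
        }
    }

    /** Increment the counter for the given failure reason. */
    public void recordFailure(Reason reason) {
        counters.get(reason).increment();
    }

    /** Current value of the counter for the given reason. */
    public long get(Reason reason) {
        return counters.get(reason).sum();
    }

    /** Render one sample line per reason, in enum declaration order. */
    public String render() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Reason, LongAdder> entry : counters.entrySet()) {
            sb.append("topic_load_failed_total{reason=\"")
              .append(entry.getKey().labelValue())
              .append("\"} ")
              .append(entry.getValue().sum())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TopicLoadFailureCounter counter = new TopicLoadFailureCounter();
        counter.recordFailure(Reason.TIMEOUT);
        counter.recordFailure(Reason.TIMEOUT);
        counter.recordFailure(Reason.BUNDLE_UNLOADING);
        System.out.print(counter.render());
    }
}
```

Using a closed enum rather than free-form strings keeps the label cardinality fixed at seven values, which is one reasonable way to avoid unbounded label sets.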


# Monitoring & Alternatives

- If the count for `reason = bundle_unloading` rises briefly and then stops increasing, everything is fine.
  - Otherwise, the load balancer may have encountered an error.
- If the count for `reason = timeout` rises briefly and then stops increasing, too many topics were probably loaded at the same time, which may be acceptable.
  - Otherwise, the broker may have hit a deadlock, or its resources may be insufficient for the current use case.
- Any other label value means something unexpected happened, and the label value tells us which kind of failure it was.
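
The "rises briefly, then stops" rule can be sketched as a small check over scraped counter values. This is an illustrative assumption about how an operator might automate the rule, not part of the proposal: the class `ReasonTrendCheck`, the `Trend` enum, and the `quietSamples` parameter are hypothetical, and the sketch assumes the counter is sampled at a fixed interval and is not reset within the window.

```java
/** Hypothetical helper that classifies the recent trend of one reason's counter. */
public class ReasonTrendCheck {

    public enum Trend {
        QUIET,      // no increase anywhere in the window
        TRANSIENT,  // increased earlier, but flat over the trailing scrapes
        ONGOING     // still increasing within the trailing scrapes
    }

    /**
     * @param samples      cumulative counter values, oldest first, one per scrape
     * @param quietSamples number of trailing scrape intervals that must show
     *                     no increase for the burst to count as finished
     */
    public static Trend classify(long[] samples, int quietSamples) {
        if (samples.length < 2) {
            return Trend.QUIET; // not enough data to see any increase
        }
        long first = samples[0];
        long last = samples[samples.length - 1];
        if (last == first) {
            return Trend.QUIET;
        }
        // Start of the trailing window; clamped when the series is short.
        int tailStart = Math.max(0, samples.length - 1 - quietSamples);
        return samples[tailStart] == last ? Trend.TRANSIENT : Trend.ONGOING;
    }

    public static void main(String[] args) {
        // A burst of failures that then stops, e.g. during a bundle unload.
        System.out.println(classify(new long[] {0, 3, 7, 7, 7, 7}, 3));
        // Failures that keep accumulating.
        System.out.println(classify(new long[] {0, 1, 2, 3, 4, 5}, 3));
    }
}
```

For `bundle_unloading` and `timeout`, a `TRANSIENT` result corresponds to the benign case described above, while `ONGOING` corresponds to the "otherwise" branches that warrant investigation.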

# General Notes

# Links

<!--
Updated afterwards
-->
* Mailing List discussion thread: https://lists.apache.org/thread/f3xhmm342jor042n5ykkxoc32ffcn85s
* Mailing List voting thread: