Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prometheus_remote_write: Fix cutoff logic. #225

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

PromyLOPh
Copy link

@PromyLOPh PromyLOPh commented Oct 1, 2024

We noticed that fluent-bit’s prometheus remote_write output plugin was silently dropping some, but not all, process_exporter metrics after about one hour while the stdout output plugin was still showing metrics being collected. We were also able to reduce the time after which metrics were being dropped by modifying CMT_ENCODE_PROMETHEUS_REMOTE_WRITE_CUTOFF_THRESHOLD, which indicates the problem is the cutoff logic. This merge-request treats CMT_ENCODE_PROMETHEUS_REMOTE_WRITE_CUTOFF_ERROR as success and continues encoding other metrics, so they do not get dropped. It might be worth dropping this “error” code entirely, since it’s not really an error and leads to subtle bugs like this one.

After merging this fix the bundled copy of cmetrics inside fluent-bit should be updated.

A single stale metric can suppress encoding of all following non-stale
metrics if the return code CMT_ENCODE_PROMETHEUS_REMOTE_WRITE_CUTOFF_ERROR
is not treated as success.

Signed-off-by: Lars-Dominik Braun <[email protected]>
Copy link
Contributor

@cosmo0920 cosmo0920 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Good catch!

Copy link
Contributor

@cosmo0920 cosmo0920 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rethink this PR but it's not sufficient to handle whether cutoff-ed or not. Because the cutoff variable is only storing the last one happened cutoff.
Instead, we need to use flag to preserve type type of errors like as: fluent/fluent-bit#9236

@PromyLOPh
Copy link
Author

How is this information (whether some values were not transmitted due to the cutoff) used by fluent-bit? As far as I see cmt_encode_prometheus_remote_write_create always handled CMT_ENCODE_PROMETHEUS_REMOTE_WRITE_CUTOFF_ERROR like a success and never reported anything to the upper layers (like fluent-bit). Is this something that needs to be changed?

@cosmo0920
Copy link
Contributor

How is this information (whether some values were not transmitted due to the cutoff) used by fluent-bit? As far as I see cmt_encode_prometheus_remote_write_create always handled CMT_ENCODE_PROMETHEUS_REMOTE_WRITE_CUTOFF_ERROR like a success and never reported anything to the upper layers (like fluent-bit). Is this something that needs to be changed?

Currently, we don't use this error for reporting to fluent-bit plugins. This is because for code simplicity. And I once rethink this PR again, I realized that this should be enough for handling extra cutting off circumstances.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants