2 sinks with buffer.when_full drop_newest/block doesn't work as expected. #20764
Thanks for the report @nmiculinic. That definitely sounds wrong, since you have...
@jszwedko btw, I can confirm this is still an issue in version 0.39.0 as well.
@jszwedko regarding https://github.com/nmiculinic/vector/blob/6eecda55020214364fda844cf8ed16a9b6cc2a5c/lib/vector-core/src/config/mod.rs#L352: does this mean...? I'm not sure I'm following the code, but in my tests, once I disable the global ack and only configure acks on sink A, things work as expected. Please let me know if I've misunderstood something.
That does seem suspicious, but note the code referenced here (lines 164 to 169 in ef4f175):
That should use the sink-level config if it is set, and otherwise fall back to the global config. It certainly seems like something worth verifying is working correctly, though.
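For what it's worth, the precedence rule being described amounts to roughly the following (a minimal sketch of the intent only, not the code at the linked lines; the function and parameter names are made up for illustration):

```rust
/// Sketch of the intended precedence: a sink-level acknowledgements setting,
/// when present, wins over the global one.
fn acknowledgements_enabled(sink_level: Option<bool>, global: bool) -> bool {
    sink_level.unwrap_or(global)
}

fn main() {
    // Global acks disabled; sink A explicitly enables them, sink B sets nothing.
    assert!(acknowledgements_enabled(Some(true), false)); // sink A: enabled
    assert!(!acknowledgements_enabled(None, false));      // sink B: follows the global setting
}
```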
@jszwedko actually this was a red herring. I tried explicitly disabling acks everywhere, and things kinda worked? But I was still suspicious something was amiss here... (plus overall less throughput in the service for some reason; roughly 30% fewer logs passing through despite sufficient CPU/memory capacity). So for sink B, the failing sink, I added the following:
and this broke the whole thing. So: acks were disabled everywhere, one sink (B) was failing, but it had a buffer with this policy:
and I'd expect its failure not to impact the main sink A, or the whole pipeline... but it did. This is what I see in the logs for the vector-aggregator pods:
Thus I'm confused about what's happening. I'd expected vector-aggregator to start dropping data for sink B and keep chugging along, not apply backpressure to the source/sink A.
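For context, the two `when_full` policies in question differ precisely in whether a full buffer pushes back on whatever is feeding it. A rough conceptual sketch of that distinction (an illustration only, not Vector's actual buffer implementation):

```rust
use std::collections::VecDeque;

/// Illustration of what the two `when_full` policies imply for back-pressure.
enum WhenFull {
    Block,      // a full buffer refuses new events, so the producer must wait
    DropNewest, // a full buffer sheds the incoming event, so the producer never waits
}

enum PushOutcome<T> {
    Stored,
    Dropped,       // event discarded locally; no back-pressure
    WouldBlock(T), // event handed back so the caller can retry; this is back-pressure
}

struct BoundedBuffer<T> {
    items: VecDeque<T>,
    capacity: usize,
    when_full: WhenFull,
}

impl<T> BoundedBuffer<T> {
    fn push(&mut self, item: T) -> PushOutcome<T> {
        if self.items.len() < self.capacity {
            self.items.push_back(item);
            PushOutcome::Stored
        } else {
            match self.when_full {
                WhenFull::DropNewest => PushOutcome::Dropped, // `item` is dropped here
                WhenFull::Block => PushOutcome::WouldBlock(item),
            }
        }
    }
}

fn main() {
    let mut buf = BoundedBuffer { items: VecDeque::new(), capacity: 1, when_full: WhenFull::DropNewest };
    buf.push("first event");                 // stored
    let outcome = buf.push("second event");  // buffer is full: dropped, not blocked
    assert!(matches!(outcome, PushOutcome::Dropped));
}
```

Under `drop_newest`, a full buffer never hands anything back to the producer, so on its own it should not stall the rest of the pipeline, which is why the back-pressure observed here is surprising.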
Thanks for the additional detail @nmiculinic. It seems possible that there is a bug here. I'm wondering if sink B is buffering the data in such a way that it causes the acks to remain open until it has processed the data (by dropping it). Let me ask a teammate more familiar with this area.
@nmiculinic one way to validate my hypothesis would be to check whether the buffer for sink B is actually filling up completely. Could you check the...
I got some more info that clarified my understanding of how sink acknowledgements work. The current implementation has sources wait for all configured sinks, even if only a subset have acknowledgements enabled. We intended to change this behavior so that sources only wait for sinks that actually have acks enabled, but apparently never got to it: #7369. I think what you are experiencing is the lack of #7369 being implemented, so that only sinks with acks enabled actually cause the source to wait. I'll close this one in lieu of that one, but please feel free to leave any additional thoughts over there.
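For readers following along, the behavioral difference being described can be summarized with a small sketch (conceptual only; this is not Vector's code, and the names are invented for illustration):

```rust
/// Conceptual illustration of the acknowledgement behavior discussed above.
struct SinkState {
    acks_enabled: bool,
    acked: bool,
}

/// Current behavior as described: the source waits on every attached sink.
fn source_can_ack_current(sinks: &[SinkState]) -> bool {
    sinks.iter().all(|s| s.acked)
}

/// Behavior proposed in #7369: only sinks that opted into acknowledgements are awaited.
fn source_can_ack_proposed(sinks: &[SinkState]) -> bool {
    sinks.iter().filter(|s| s.acks_enabled).all(|s| s.acked)
}

fn main() {
    let sinks = [
        SinkState { acks_enabled: false, acked: true },  // sink A: healthy
        SinkState { acks_enabled: false, acked: false }, // sink B: failing, never acks
    ];
    assert!(!source_can_ack_current(&sinks));  // today: the source stalls on sink B
    assert!(source_can_ack_proposed(&sinks));  // with #7369: neither sink would be awaited
}
```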
Problem
I am running Vector in the unified architecture.
In the aggregator layer I have 2 sinks, A and B. Sink A is the production sink, while B is an experimental one. Any errors in B should be ignored and should not block data from being sent to A. However, that's not what happened, despite my best configuration efforts (sink B has acks disabled and data-dropping configured on its buffer).
Configuration
There are some transforms in the middle, but they're not really relevant here. When sink B fails or is unreachable at the network layer, I see an increase in discarded events for the vector source (the `vector_component_discarded_events_total{component_id="vector", component_kind="source"}` metric). I would expect any failure on sink B to be ignored and data to keep flowing successfully to sink A. However, it seems to completely stop any data being sent to sink A. The processes have sufficient CPU/memory resources and aren't being throttled.
Version
0.38.0
Debug Output
No response
Example Data
No response
Additional Context
This is running in k8s, with vector-agent as the node-level agent shipping data to an Envoy LB, which ships to the vector-aggregator layer. I also see elevated 504s returned by vector-aggregator at the Envoy LB layer (the `envoy_cluster_upstream_rq` metric).
References
No response