
Message retry performance implications and architectural issues #613

Open
nicklas-dohrn opened this issue Sep 26, 2024 · 5 comments

@nicklas-dohrn (Contributor)

This is an issue to discuss the current state of the retry logic for syslog messages, as it has some problematic implications.
Listed briefly here for an overview:

  • A syslog drain failing under high load will drop messages for other drains. It also pushes the CPU consumption of the syslog agent above 1 CPU; I am not sure why.
  • The syslog batching implementation cannot use the retry mechanism, because the retry writer holds no state about the batching being done (see the sketch at the end of this comment).

I will add details and my test results here later in a better-formatted way.
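
To make the second bullet concrete, here is a minimal Go sketch of the problem as I understand it. `RetryWriter` and `BatchWriter` are purely illustrative names, not the actual loggregator-agent types: because the retry wrapper only sees individual envelope writes, a batch that fails on flush cannot be replayed at that layer.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type Envelope struct{ Payload string }

// BatchWriter buffers envelopes and only touches the network on Flush.
type BatchWriter struct{ buf []Envelope }

func (b *BatchWriter) Write(e Envelope) error {
	b.buf = append(b.buf, e) // never fails; the real error only surfaces on Flush
	return nil
}

func (b *BatchWriter) Flush() error {
	defer func() { b.buf = nil }()
	return errors.New("drain unavailable") // simulate a failing drain
}

// RetryWriter retries individual Write calls, but it has no view of the
// batch buffer, so a failed Flush cannot be retried here.
type RetryWriter struct{ w *BatchWriter }

func (r *RetryWriter) Write(e Envelope) error {
	for attempt := 0; attempt < 3; attempt++ {
		if err := r.w.Write(e); err == nil {
			return nil // always "succeeds", because batching defers the failure
		}
		time.Sleep(10 * time.Millisecond)
	}
	return errors.New("write failed after retries")
}

func main() {
	rw := &RetryWriter{w: &BatchWriter{}}
	_ = rw.Write(Envelope{Payload: "hello"}) // appears successful
	fmt.Println("flush:", rw.w.Flush())      // the batch is lost; nothing retries it
}
```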

@ctlong (Member) commented Sep 26, 2024

I dived a little deeper into the syslog writer code recently and I think that we were incorrect in some of our previous assertions about the synchronized nature of the agent. If you check out the syslog connector, which the manager uses to create new drains, each drain is provided with an egress diode. Since writing to the diode should be non-blocking, I think that the envelope writing loop is in fact asynchronous to some degree.

At least, a problematic syslog drain shouldn't directly prevent other drains from continuing to receive messages.
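
For readers following along, here is a minimal sketch of the per-drain pattern described above, with a buffered channel standing in for the egress diode; all names are illustrative, not the agent's actual code. The point is that the fan-out write is non-blocking, so a slow or failing drain drops its own envelopes instead of stalling the envelope-writing loop.

```go
package main

import (
	"fmt"
	"time"
)

type Envelope struct{ Payload string }

type drain struct {
	name  string
	diode chan Envelope // stand-in for the per-drain egress diode
}

// write is non-blocking: if this drain's buffer is full (e.g. the endpoint
// is slow or failing), the envelope is dropped rather than blocking the loop.
func (d *drain) write(e Envelope) {
	select {
	case d.diode <- e:
	default:
		fmt.Println(d.name, "dropped envelope")
	}
}

// egress is the per-drain goroutine that actually ships envelopes.
func (d *drain) egress(delay time.Duration) {
	for e := range d.diode {
		time.Sleep(delay) // simulate a slow/failing syslog endpoint
		_ = e
	}
}

func main() {
	fast := &drain{name: "fast", diode: make(chan Envelope, 8)}
	slow := &drain{name: "slow", diode: make(chan Envelope, 8)}
	go fast.egress(time.Millisecond)
	go slow.egress(time.Second)

	// The envelope-writing loop fans out to every drain; because the writes
	// are non-blocking, only the slow drain's envelopes are dropped.
	for i := 0; i < 100; i++ {
		e := Envelope{Payload: fmt.Sprintf("log %d", i)}
		fast.write(e)
		slow.write(e)
	}
	time.Sleep(200 * time.Millisecond)
}
```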

@ctlong (Member) commented Sep 26, 2024

High CPU usage of the agent is a known problem. Unfortunately, none of the logging and metrics agents currently have any kind of memory or CPU limitation placed upon them. They will expand as necessary to meet demand.

We took a pprof dump a while ago and saw that marshalling/unmarshalling envelopes was the primary performance issue of most of our agents. Part of what I hope to accomplish by merging every agent into the OTel collector is to reduce the number of marshal/unmarshal steps required to egress an individual envelope from a VM.
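
As an aside for anyone wanting to reproduce that kind of dump: the standard way to capture it from a Go process is via net/http/pprof. The sketch below assumes a debug listener is added on localhost; the port and wiring are illustrative, not necessarily how the agent exposes its profiling endpoint.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	go func() {
		// Debug listener only; keep it bound to localhost.
		log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
	}()

	select {} // ... the agent's real work would run here ...
}

// Then, from the VM:
//   go tool pprof http://127.0.0.1:6060/debug/pprof/profile?seconds=30
// and `top` / `web` inside pprof will show whether envelope marshalling and
// unmarshalling dominates CPU time.
```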

@nicklas-dohrn (Contributor, Author)

I did some testing as well, and your point about every drain getting its own diode matches my understanding of why there is some degree of concurrency.
IMHO, this is also unwanted behaviour, as it makes it impossible to set a desired maximum resource consumption, so the syslog agent is able to overload other components.
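
Purely to illustrate what "setting a desired maximum resource consumption" could look like (this is not something the agent does today), a common Go pattern is to bound worker concurrency with a buffered channel used as a semaphore:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const maxConcurrentDrains = 4 // the kind of cap the comment above is asking for

	sem := make(chan struct{}, maxConcurrentDrains)
	var wg sync.WaitGroup

	for i := 0; i < 20; i++ { // pretend we have 20 configured drains
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // blocks once 4 workers are active
			defer func() { <-sem }() // release the slot when done
			fmt.Println("egressing for drain", id)
		}(i)
	}
	wg.Wait()
}
```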

@nicklas-dohrn (Contributor, Author)

> At least, a problematic syslog drain shouldn't directly prevent other drains from continuing to receive messages.

Yes, this is what I see in testing.
It only allows a "DoS"-style overload, where the messages dropped on the other, non-malicious receiver appear to be random.
[Screenshot 2024-09-26 at 08:38:16: incoming data rate on the receiving side, which should be a steady 50 logs/s]

@ctlong (Member) commented Nov 7, 2024

@nicklas-dohrn to confirm the state of this issue, the current concerns are:

  1. The Syslog Agent has the potential for high CPU usage under load, which could overload other components.
  2. https-batch drains in the Syslog Agent have no retry logic when they fail and do not propagate any error log message (sketched at the end of this comment).

Is that correct? If so, I'd move to ignore the first concern in this issue as I consider it to be a general known issue with CF-D components – what we really want is some BPM-specific way to indicate CPU shares.
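
A rough, hypothetical sketch of what item 2 above asks for: retrying a failed batch flush with backoff and, if it still fails, emitting an error log instead of dropping the batch silently. None of these names come from the agent's code; `sendBatch` just simulates a failing HTTPS POST.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

type batch []string

// sendBatch stands in for the HTTPS POST of a whole batch; it is assumed to
// fail here so the retry path is exercised.
func sendBatch(b batch) error {
	return errors.New("502 from drain")
}

func flushWithRetry(b batch, maxRetries int, baseDelay time.Duration) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = sendBatch(b); err == nil {
			return nil
		}
		// Exponential backoff between attempts.
		time.Sleep(baseDelay << attempt)
	}
	// Propagate an error log so operators can see the drain is failing.
	log.Printf("dropped batch of %d messages after %d retries: %v", len(b), maxRetries, err)
	return fmt.Errorf("flush failed: %w", err)
}

func main() {
	_ = flushWithRetry(batch{"log 1", "log 2"}, 3, 10*time.Millisecond)
}
```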
