
[receiver/journald] Restart journalctl if it exits unexpectedly #35635

Merged
merged 3 commits into open-telemetry:main on Oct 28, 2024

Conversation

@namco1992 (Contributor) commented Oct 6, 2024

Description

According to the community, there are bugs in systemd that could corrupt the journal files or crash the log receiver:
systemd/systemd#24320
systemd/systemd#24150

We've seen similar issues reported to the Elastic/beats project:
elastic/beats#39352
elastic/beats#32782
elastic/beats#34077

Unfortunately, otelcol is not immune to these issues. When the journalctl process exits for any reason, log consumption from journald simply stops. We've experienced this on some machines with high log volume. Currently we monitor the journalctl processes started by otelcol and restart otelcol whenever one of them is missing. IMO, the journald receiver itself should monitor the journalctl process it starts and do its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits unexpectedly. As long as the journalctl command can be started successfully (via `Cmd.Start()`), the `journald_input` will always try to restart the journalctl process if it exits.

The error reporting behaviour changes slightly in this PR. Before the PR, `operator.Start` waited up to 1 sec to capture any immediate error returned from journalctl. After the PR, an error is no longer reported back even if journalctl exits immediately after starting; instead, the error is logged and the process is restarted.

The fix is largely inspired by elastic/beats#40558.
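
To make the mechanism concrete, here is a minimal, self-contained sketch of such a supervision loop. It is illustrative only: the identifiers, flags, and error handling are assumptions, not the receiver's actual code, which lives in `pkg/stanza/operator/input/journald`.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"os/exec"
	"time"
)

// supervise restarts journalctl whenever it exits, waiting 2s between
// attempts, until ctx is cancelled. This is a hypothetical sketch of the
// restart behaviour described above, not the receiver's implementation.
func supervise(ctx context.Context, args ...string) error {
	for ctx.Err() == nil {
		cmd := exec.CommandContext(ctx, "journalctl", args...)
		stdout, err := cmd.StdoutPipe()
		if err != nil {
			return err
		}
		// Mirroring the PR's contract: a failed Start() is reported back,
		// but once the process has started, every exit leads to a retry.
		if err := cmd.Start(); err != nil {
			return err
		}
		consume(stdout) // read entries until the pipe closes
		if err := cmd.Wait(); err != nil {
			fmt.Println("journalctl command exited:", err) // logged, not returned
		}
		select {
		case <-ctx.Done():
		case <-time.After(2 * time.Second): // hardcoded backoff, as in the PR
		}
	}
	return ctx.Err()
}

// consume drains journalctl's stdout line by line; the real receiver
// parses each entry and emits it into the pipeline instead of printing.
func consume(r io.Reader) {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fmt.Println(sc.Text())
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	_ = supervise(ctx, "--follow", "--utc", "--output=json")
}
```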

Testing

Add a simple bash script that prints a line every second, and load it into systemd.

`log_every_second.sh`:

```bash
#!/bin/bash
while true; do
    echo "Log message: $(date)"
    sleep 1
done
```

`log.service`:

```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```
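
For completeness, the script and unit can be installed the usual way (paths assumed to match the unit file above):

```bash
sudo install -m 0755 log_every_second.sh /usr/local/bin/log_every_second.sh
sudo cp log.service /etc/systemd/system/log.service
sudo systemctl daemon-reload
sudo systemctl enable --now log.service
```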

Start the otelcol with the following config:

```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe the otelcol's behaviour.
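
One way to trigger the exit (assuming a single journalctl process on the host) is to send it the termination signal seen in the log below:

```bash
# SIGTERM the journalctl child spawned by otelcol
pkill -TERM -x journalctl
```

The journalctl process will then be restarted after the backoff period (hardcoded to 2 sec):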

```
2024-10-06T14:32:33.755Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z	error	journald/input.go:98	journalctl command exited	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z	debug	journald/input.go:94	Starting the journalctl command	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```
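
To confirm the restart, you can check that a journalctl process is running again under a new PID:

```bash
pgrep -ax journalctl
```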

The journalctl process started by the journald receiver could exit for
various reasons, and log consumption from journald halts until the
otelcol is restarted.

In this PR, we try to restart the journalctl process when it exits
unexpectedly, so that log consumption resumes.

Signed-off-by: Mengnan Gong <[email protected]>
@namco1992 namco1992 requested review from djaglowski and a team as code owners October 6, 2024 15:45

linux-foundation-easycla bot commented Oct 6, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

github-actions bot commented:

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 22, 2024
@namco1992 (Contributor, Author) commented:

@sumo-drosiek @djaglowski Hello, sorry for the tagging, may I get a review on this PR? Thanks!

@github-actions github-actions bot removed the Stale label Oct 23, 2024
@djaglowski djaglowski merged commit 0d9a3f8 into open-telemetry:main Oct 28, 2024
155 checks passed
@github-actions github-actions bot added this to the next release milestone Oct 28, 2024
jpbarto pushed a commit to jpbarto/opentelemetry-collector-contrib that referenced this pull request Oct 29, 2024