
[receiver/journald] Restart journalctl if it exits unexpectedly #35635

Merged
merged 3 commits into open-telemetry:main on Oct 28, 2024

Conversation

@namco1992 (Contributor) commented Oct 6, 2024

Description

According to the community, there are bugs in systemd that could corrupt the journal files or crash the log receiver:
systemd/systemd#24320
systemd/systemd#24150

We've seen similar issues reported to the Elastic/beats project:
elastic/beats#39352
elastic/beats#32782
elastic/beats#34077

Unfortunately, otelcol is not immune to these issues. When the journalctl process exits for any reason, log consumption from journald simply stops. We've experienced this on some machines with high log volume. Currently we monitor the journalctl processes started by otelcol and restart otelcol whenever one of them is missing. IMO, the journald receiver itself should monitor the journalctl process it starts and do its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits unexpectedly. As long as the journalctl command can be started successfully (via `Cmd.Start()`), the `journald_input` will always try to restart the journalctl process if it exits.

The error reporting behaviour changes slightly in this PR. Before the PR, `operator.Start` waited up to 1 sec to capture any immediate error returned from journalctl. After the PR, an error is no longer reported back even if journalctl exits immediately after starting; instead, the error is logged and the process is restarted.

The fix is largely inspired by elastic/beats#40558.
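
To make the mechanism concrete, here is a minimal, self-contained sketch of such a supervision loop. It is illustrative only: the identifiers, flags, and error handling are assumptions, not the receiver's actual code, which lives in `pkg/stanza/operator/input/journald`.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"os/exec"
	"time"
)

// supervise restarts journalctl whenever it exits, waiting 2s between
// attempts, until ctx is cancelled. This is a hypothetical sketch of the
// restart behaviour described above, not the receiver's implementation.
func supervise(ctx context.Context, args ...string) error {
	for ctx.Err() == nil {
		cmd := exec.CommandContext(ctx, "journalctl", args...)
		stdout, err := cmd.StdoutPipe()
		if err != nil {
			return err
		}
		// Mirroring the PR's contract: a failed Start() is reported back,
		// but once the process has started, every exit leads to a retry.
		if err := cmd.Start(); err != nil {
			return err
		}
		consume(stdout) // read entries until the pipe closes
		if err := cmd.Wait(); err != nil {
			fmt.Println("journalctl command exited:", err) // logged, not returned
		}
		select {
		case <-ctx.Done():
		case <-time.After(2 * time.Second): // hardcoded backoff, as in the PR
		}
	}
	return ctx.Err()
}

// consume drains journalctl's stdout line by line; the real receiver
// parses each entry and emits it into the pipeline instead of printing.
func consume(r io.Reader) {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fmt.Println(sc.Text())
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	_ = supervise(ctx, "--follow", "--utc", "--output=json")
}
```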

Testing

Add a simple bash script that prints a line every second, and load it into systemd.

`log_every_second.sh`:

```bash
#!/bin/bash
while true; do
    echo "Log message: $(date)"
    sleep 1
done
```

`log.service`:

```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```
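
For completeness, the script and unit can be installed the usual way (paths assumed to match the unit file above):

```bash
sudo install -m 0755 log_every_second.sh /usr/local/bin/log_every_second.sh
sudo cp log.service /etc/systemd/system/log.service
sudo systemctl daemon-reload
sudo systemctl enable --now log.service
```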

Start the otelcol with the following config:

```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe the otelcol's behaviour.
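
One way to trigger the exit (assuming a single journalctl process on the host) is to send it the termination signal seen in the log below:

```bash
# SIGTERM the journalctl child spawned by otelcol
pkill -TERM -x journalctl
```

The journalctl process will then be restarted after the backoff period (hardcoded to 2 sec):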

```
2024-10-06T14:32:33.755Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z	error	journald/input.go:98	journalctl command exited	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z	debug	journald/input.go:94	Starting the journalctl command	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```
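
To confirm the restart, you can check that a journalctl process is running again under a new PID:

```bash
pgrep -ax journalctl
```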

The journalctl process started by the journald receiver could exit for
various reasons, and log consumption from journald halts until the
otelcol is restarted.

In this PR, we try to restart the journalctl process when it exits
unexpectedly, so that log consumption resumes.

Signed-off-by: Mengnan Gong <[email protected]>
@namco1992 namco1992 requested review from djaglowski and a team as code owners October 6, 2024 15:45

linux-foundation-easycla bot commented Oct 6, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

github-actions bot commented:

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 22, 2024
@namco1992 (Contributor, Author) commented:

@sumo-drosiek @djaglowski Hello, sorry for the tagging, may I get a review on this PR? Thanks!

@github-actions github-actions bot removed the Stale label Oct 23, 2024
@djaglowski djaglowski merged commit 0d9a3f8 into open-telemetry:main Oct 28, 2024
155 checks passed
@github-actions github-actions bot added this to the next release milestone Oct 28, 2024
jpbarto pushed a commit to jpbarto/opentelemetry-collector-contrib that referenced this pull request Oct 29, 2024