error status on successful backups due to certificate retrieval issue #1723

clsv · 2024-11-13T20:09:17Z

Report

During a long backup (~6 hours), the backup resource ends up with an error status, even though the backup-agent logs indicate that the backup completed successfully.

More about the problem

The issue occurs during long backups (approximately 6 hours). Although the logs of the pbm-agent container show that the backup completed successfully, the Kubernetes PerconaServerMongoDBBackup resource is marked with a status of error. In addition, backups marked as failed are not removed during cleanup and have to be deleted manually from the S3 bucket.

The PerconaServerMongoDBBackup resource shows the following error message:

create pbm object: create PBM connection to mongo-rs0-2.mongo-rs0.mongo.svc.cluster.local:27017,mongo-rs0-0.mongo-rs0.mongo.svc.cluster.local:27017,mongo-rs0-1.mongo-rs0.mongo.svc.cluster.local:27017: create mongo connection: invalid connection string option: the specified CA file does not contain any valid certificates

Steps to reproduce

Deploy MongoDB with mode: preferTLS and enable backups using the operator on slow storage (etcd on the same self-hosted cloud storage).
Populate MongoDB with ~400GB of data.
Start a backup job.

Versions

Kubernetes: v1.27.16
Operator: v1.17.0
Database: v6.0.15-12
PBM: v2.6.0

Anything else?

The issue appears to be caused by the operator polling the API every 5 seconds to retrieve the secret and overwriting the certificate file each time (link to source). The file is probably truncated and then rewritten; on slower disks, a connection attempt can read the certificate file while it is still empty, which would explain the "does not contain any valid certificates" error.

I implemented a few checks in a custom-patched version of the operator, which resolved the issue:

Do not request the secret from the Kubernetes API if the certificate file is less than 5 minutes old.
Do not overwrite the certificate file if its content matches the secret; instead, update only the modTime.
Use os.WriteFile instead of os.OpenFile with truncate to avoid issues with empty files during certificate update attempts.

I found the same problem reported on the forum.

clsv added the bug label Nov 13, 2024

clsv linked a pull request Nov 23, 2024 that will close this issue

Optimize TLS and CA certificate file writes in getMongoUri function #1736

Open