error status on successful backups due to certificate retrieval issue #1723

Open
clsv opened this issue Nov 13, 2024 · 0 comments · May be fixed by #1736

clsv commented Nov 13, 2024

Report

During a long backup (~6 hours), the backup resource ends up with an error status, even though the backup agent's logs show that the backup completed successfully.

More about the problem

The issue occurs during long backups (approximately 6 hours). Although the logs of the pbm-agent container show that the backup completed successfully, the Kubernetes PerconaServerMongoDBBackup resource is marked with the status error. In addition, backups marked as error are not removed during cleanup, so they have to be deleted manually from the S3 bucket.

The PerconaServerMongoDBBackup resource has the following error message:

create pbm object: create PBM connection to mongo-rs0-2.mongo-rs0.mongo.svc.cluster.local:27017,mongo-rs0-0.mongo-rs0.mongo.svc.cluster.local:27017,mongo-rs0-1.mongo-rs0.mongo.svc.cluster.local:27017: create mongo connection: invalid connection string option: the specified CA file does not contain any valid certificates
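
To illustrate why an empty CA file would produce this kind of error, here is a minimal sketch using only the Go standard library. It assumes, without reproducing the MongoDB driver's code, that the CA file is parsed as PEM into a certificate pool; `caHasCerts` is a hypothetical helper name.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// caHasCerts reports whether pemData contains at least one parsable
// PEM-encoded certificate. Hypothetical helper, for illustration only.
func caHasCerts(pemData []byte) bool {
	return x509.NewCertPool().AppendCertsFromPEM(pemData)
}

func main() {
	// An empty file (for example, one read right after truncation)
	// yields no certificates.
	fmt.Println("empty file:", caHasCerts(nil))
	// Non-PEM bytes are rejected in the same way.
	fmt.Println("garbage:", caHasCerts([]byte("not a certificate")))
}
```

Both calls report that no certificate was found, which matches the symptom of a CA file read while it is empty.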

Steps to reproduce

  1. Using the operator, deploy MongoDB with mode: preferTLS and backups enabled on slow storage (etcd on the same self-hosted cloud storage).
  2. Populate MongoDB with ~400GB of data.
  3. Start a backup job.

Versions

Kubernetes: v1.27.16
Operator: v1.17.0
Database: v6.0.15-12
PBM: v2.6.0

Anything else?

The issue appears to be caused by the operator polling the API every 5 seconds to retrieve the secret and overwriting the certificate file each time (link to source). The overwrite is probably a truncation followed by a rewrite of the file; on slower disks, a connection attempt can fall between the two steps, so the operator ends up connecting to the pbm-agent with an empty certificate file.

I implemented a few checks in a custom-patched version of the operator, which resolved the issue:

  1. Do not request the secret from the Kubernetes API if the certificate file is less than 5 minutes old.
  2. Do not overwrite the certificate file if its content matches the secret; instead, update only the modTime.
  3. Use os.WriteFile instead of os.OpenFile with truncate to avoid issues with empty files during certificate update attempts.

I found the same problem reported on the forum.
