During a long backup (~6 hours), a status error occurs in the backup resource, even though the logs indicate a successful backup completion by the backup-agent.
More about the problem
The issue occurs during long backups (approximately 6 hours). Although the logs of the pbm-agent container show that the backup completed successfully, the Kubernetes PerconaServerMongoDBBackup resource is marked with a status of error. Additionally, backups marked as failed are not removed during cleanup and have to be deleted manually from the S3 bucket.
The PerconaServerMongoDBBackup resource shows the following error message:
create pbm object: create PBM connection to mongo-rs0-2.mongo-rs0.mongo.svc.cluster.local:27017,mongo-rs0-0.mongo-rs0.mongo.svc.cluster.local:27017,mongo-rs0-1.mongo-rs0.mongo.svc.cluster.local:27017: create mongo connection: invalid connection string option: the specified CA file does not contain any valid certificates
Steps to reproduce
Deploy MongoDB with mode: preferTLS and enable backups using the operator on slow storage (etcd on the same self-hosted cloud storage).
The issue appears to be caused by the operator polling the API every 5 seconds to retrieve the secret and overwriting the certificate file each time (link to source). This may be due to truncation followed by a file rewrite; on slower disks, this can result in connecting to the pbm-agent with an empty certificate file.
I implemented a few checks in a custom-patched version of the operator, which resolved the issue:
Do not request the secret from the Kubernetes API if the certificate is younger than 5 minutes.
Do not overwrite the certificate file if its content matches the secret; instead, update only the modTime.
Use os.WriteFile instead of os.OpenFile with truncate to avoid issues with empty files during certificate update attempts.
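The three checks above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the names certIsFresh, syncCertFile and runDemo are hypothetical, and the 5-minute threshold is taken from the list above. One deliberate substitution: os.WriteFile also truncates an existing file before writing, so the sketch writes to a temporary file in the same directory and renames it over the target, which on POSIX filesystems replaces the file in one step and should leave no window where a reader sees an empty file.

```go
package main

import (
	"bytes"
	"os"
	"path/filepath"
	"time"
)

// certIsFresh reports whether the file at path was modified less than
// maxAge ago. A caller could use it to skip the Kubernetes API request.
// (Hypothetical helper, not part of the operator.)
func certIsFresh(path string, maxAge time.Duration) bool {
	info, err := os.Stat(path)
	if err != nil {
		return false // missing or unreadable: treat as stale
	}
	return time.Since(info.ModTime()) < maxAge
}

// syncCertFile makes path hold data. It returns true if the file content
// was replaced and false if only the modification time was updated.
// (Hypothetical helper, not part of the operator.)
func syncCertFile(path string, data []byte, perm os.FileMode) (bool, error) {
	current, err := os.ReadFile(path)
	switch {
	case err == nil && bytes.Equal(current, data):
		// Content already matches the secret: only bump modTime.
		now := time.Now()
		return false, os.Chtimes(path, now, now)
	case err != nil && !os.IsNotExist(err):
		return false, err
	}

	// Write the new content to a temporary file in the same directory,
	// then rename it over the target so the target is never truncated.
	tmp, err := os.CreateTemp(filepath.Dir(path), ".cert-*")
	if err != nil {
		return false, err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return false, err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return false, err
	}
	if err := tmp.Close(); err != nil {
		return false, err
	}
	if err := os.Chmod(tmp.Name(), perm); err != nil {
		return false, err
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		return false, err
	}
	return true, nil
}

// runDemo exercises both helpers in a throwaway directory with
// placeholder content (not a real certificate).
func runDemo() (first, second, fresh bool, err error) {
	dir, err := os.MkdirTemp("", "certdemo-")
	if err != nil {
		return false, false, false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "ca.crt")
	data := []byte("placeholder CA bundle\n")

	if first, err = syncCertFile(path, data, 0o600); err != nil {
		return first, false, false, err
	}
	if second, err = syncCertFile(path, data, 0o600); err != nil {
		return first, second, false, err
	}
	fresh = certIsFresh(path, 5*time.Minute)
	return first, second, fresh, nil
}
```

Calling runDemo from a main function shows the intended behavior: the first call creates the file, the second call leaves the content alone and only updates the modification time, and the file then counts as fresh for the next 5 minutes.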
Versions
Kubernetes: v1.27.16
Operator: v1.17.0
Database: v6.0.15-12
PBM: v2.6.0
Anything else?
I found the same problem reported on the forum.