If a volume mounted to the backup pod is unreadable, k8up reports an error during the scan phase stating that the volume is unreadable. It then proceeds to the archival step, which also fails because no files can be read. The result is an empty snapshot.
The problem is that this is treated as a successful backup, which IMO is wrong. If I've asked to back up a volume and the entire volume turns out to be unreadable, then that is a failure.
Additional Context
We discovered this when we were wondering why a volume snapshot was empty even though we had received no backup failure alerts: uselagoon/build-deploy-tool#361
The permissions on the volume meant the backup pod's user was unable to access it at all.
I know you've mentioned that k8up_backup_restic_last_errors contains some information on failed files etc., but we're talking about the entire volume in this case.
Looking at the backup pod that is created, the user is 65532, and the permissions on the volume mean it is not accessible to that user, which causes the backup scan to fail.
bash-5.1$ id
uid=65532 gid=0(root) groups=0(root)
bash-5.1$ ls -alh /data/
total 28K
drwxr-xr-x 3 root root 19 Aug 22 21:44 .
drwxr-xr-x 1 root root 54 Aug 22 21:44 ..
drwxrws--- 12 10000 10001 30.0K Aug 21 21:49 nginx
bash-5.1$ ls -alh /data/nginx/
ls: can't open '/data/nginx/': Permission denied
No files get backed up at all in this instance, but the backup is still classed as a "success". Either of the two errors shown in the logs should really result in a backup failure.
Logs
You can see the initial error here, where the scan fails; the subsequent archival step then fails too.
1.7243634935042348e+09 ERROR k8up.restic.restic.backup.progress /data/nginx during scan {"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
/home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:429
io.Copy
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:544
1.724363493509051e+09 INFO k8up.restic.restic.backup.progress progress of backup {"percentage": "0.00%"}
1.7243634938520162e+09 ERROR k8up.restic.restic.backup.progress /data/nginx during archival {"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
/home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:429
io.Copy
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:544
Expected Behavior
If the volume is unreadable, I would expect the backup to fail, even if other parts of the backup succeed.
I realise that adding a podSecurityContext or podConfig that lets us change the user to one that can read the volume will fix this. I still think a volume that fails both scan and archival should be classed as a failure rather than a success.
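For anyone hitting the same thing, the workaround mentioned above might look roughly like this. This is a hedged sketch: the field names follow the k8up v2 Backup CRD as I understand it, and the uid/gid values are just the ones from the directory listing above, not recommendations.

```yaml
apiVersion: k8up.io/v1
kind: Backup
metadata:
  name: backup-with-matching-user
spec:
  # Run the backup pod as a user/group that can actually read the volume.
  # 10000/10001 match the owner/group of /data/nginx in the listing above.
  podSecurityContext:
    runAsUser: 10000
    fsGroup: 10001
```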
Thanks for this new issue. Maybe a bit of background on why it currently happens this way:
Restic (the tool we use underneath K8up) will continue to try to back up even if it runs into a "permission denied" or other read error. Restic tracks these errors internally and reports a count of them at the end of the run. It then exits with code 3, which indicates that the backup may be incomplete because not all files could be read.
The way we currently handle this in K8up is to treat exit code 3 as successful, but we expose the k8up_backup_restic_last_errors metric, so it can be determined via Prometheus whether the backup should be considered successful or not.
Having said that, there's room for improvement:
If Restic exits with code 3, K8up could catch that and set a special condition on the Backup object, something like "PartialBackupCompleted". That would make the situation visible without a whole Prometheus setup.
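The proposal above could be sketched like this. Illustrative only, not k8up's implementation; the condition names, including "PartialBackupCompleted", are hypothetical:

```go
package main

import "fmt"

// backupCondition maps restic's documented exit codes onto a backup
// condition. Restic exits 0 on success and 3 when the snapshot was
// created but some source data could not be read.
func backupCondition(exitCode int) string {
	switch exitCode {
	case 0:
		return "BackupCompleted"
	case 3:
		// Today exit code 3 is treated as plain success; surfacing it
		// as its own condition would make partial backups visible
		// without a Prometheus setup.
		return "PartialBackupCompleted"
	default:
		return "BackupFailed"
	}
}

func main() {
	for _, code := range []int{0, 3, 1} {
		fmt.Printf("exit %d -> %s\n", code, backupCondition(code))
	}
}
```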
I still think there is a case where failing to read the entire directory should be classed as a failed backup.
Relying on the k8up_backup_restic_last_errors metric to catch an unreadable directory would have been useless here, because volumes that otherwise back up successfully often contain a few unreadable files anyway. How can you distinguish between the entire directory failing to be backed up and a few missing files?
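One possible way to tell the two cases apart, sketched under the assumption that restic's --json output ends with a summary message carrying files_new / files_changed / files_unmodified counters: if every counter is zero, nothing was backed up at all, which is the unreadable-directory case rather than a few skipped files.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resticSummary holds the file counters assumed to be present in the
// final "summary" message of restic's --json backup output.
type resticSummary struct {
	FilesNew        int `json:"files_new"`
	FilesChanged    int `json:"files_changed"`
	FilesUnmodified int `json:"files_unmodified"`
}

// snapshotIsEmpty reports whether the run backed up nothing at all,
// which distinguishes a fully unreadable source from a run that only
// skipped a few unreadable files.
func snapshotIsEmpty(summaryJSON []byte) (bool, error) {
	var s resticSummary
	if err := json.Unmarshal(summaryJSON, &s); err != nil {
		return false, err
	}
	return s.FilesNew+s.FilesChanged+s.FilesUnmodified == 0, nil
}

func main() {
	empty, _ := snapshotIsEmpty([]byte(`{"files_new":0,"files_changed":0,"files_unmodified":0}`))
	fmt.Println("empty snapshot:", empty) // prints: empty snapshot: true
}
```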
Steps To Reproduce
uselagoon/build-deploy-tool#361
Version of K8up
v2.5.2
Version of Kubernetes
v1.31.0
Distribution of Kubernetes
EKS, GCP, AKS