If a volume mounted to the backup pod is unreadable, k8up reports an error during the scan phase stating that the volume is unreadable. It then proceeds to the archival step, which also fails because no files can be read. The result is an empty snapshot.
The problem is that this is treated as a successful backup, which IMO is wrong. If I've asked to back up a volume and the entire volume turns out to be unreadable, then that is a failure.
Additional Context
We discovered this when we were wondering why a volume snapshot was empty even though we had received no backup failure alerts: uselagoon/build-deploy-tool#361
The permissions on the volume meant the backup pod's user was unable to access it at all.
I know you've mentioned that k8up_backup_restic_last_errors contains some information on failed files etc., but we're talking about the entire volume in this case.
Looking at the backup pod that is created, the user is 65532, and the permissions on the volume mean it is not accessible to that user, which causes the backup scan to fail.
bash-5.1$ id
uid=65532 gid=0(root) groups=0(root)
bash-5.1$ ls -alh /data/
total 28K
drwxr-xr-x 3 root root 19 Aug 22 21:44 .
drwxr-xr-x 1 root root 54 Aug 22 21:44 ..
drwxrws--- 12 10000 10001 30.0K Aug 21 21:49 nginx
bash-5.1$ ls -alh /data/nginx/
ls: can't open '/data/nginx/': Permission denied
No files get backed up at all in this instance, but the backup is still classed as a "success". Either of the two errors shown in the logs should really result in a backup failure.
Logs
You can see the initial error here, where the scan fails; the subsequent archival step then fails too.
1.7243634935042348e+09 ERROR k8up.restic.restic.backup.progress /data/nginx during scan {"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
/home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:429
io.Copy
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:544
1.724363493509051e+09 INFO k8up.restic.restic.backup.progress progress of backup {"percentage": "0.00%"}
1.7243634938520162e+09 ERROR k8up.restic.restic.backup.progress /data/nginx during archival {"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
/home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:429
io.Copy
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:544
Expected Behavior
If the volume is unreadable, I would expect the backup to fail, even if other parts of the backup succeed.
I realise that adding a podSecurityContext or podConfig that lets us change the user to one that can read the volume will fix this. I still think a volume that fails both scan and archival should be classed as a failure rather than a success.
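For anyone hitting the same thing, the workaround mentioned above might look roughly like this. This is a hedged sketch: the field names follow the k8up v2 Backup CRD as I understand it, and the uid/gid values are just the ones from the directory listing above, not recommendations.

```yaml
apiVersion: k8up.io/v1
kind: Backup
metadata:
  name: backup-with-matching-user
spec:
  # Run the backup pod as a user/group that can actually read the volume.
  # 10000/10001 match the owner/group of /data/nginx in the listing above.
  podSecurityContext:
    runAsUser: 10000
    fsGroup: 10001
```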
Thanks for this new issue. Maybe a bit of background on why it currently happens this way:
Restic (the tool we use underneath K8up) will continue to try to back up even if it runs into a "permission denied" or other read error. Restic tracks these errors internally and reports a count of them at the end of the run. It then exits with code 3, which indicates that the backup may be incomplete because not all files could be read.
The way we currently handle this in K8up is to treat exit code 3 as successful, but we expose the k8up_backup_restic_last_errors metric, so it can be determined via Prometheus whether the backup should be considered successful or not.
Having said that, there's room for improvement:
If Restic exits with code 3, K8up could catch that and set a special condition on the Backup object, something like "PartialBackupCompleted". That would make the situation visible without a whole Prometheus setup.
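The proposal above could be sketched like this. Illustrative only, not k8up's implementation; the condition names, including "PartialBackupCompleted", are hypothetical:

```go
package main

import "fmt"

// backupCondition maps restic's documented exit codes onto a backup
// condition. Restic exits 0 on success and 3 when the snapshot was
// created but some source data could not be read.
func backupCondition(exitCode int) string {
	switch exitCode {
	case 0:
		return "BackupCompleted"
	case 3:
		// Today exit code 3 is treated as plain success; surfacing it
		// as its own condition would make partial backups visible
		// without a Prometheus setup.
		return "PartialBackupCompleted"
	default:
		return "BackupFailed"
	}
}

func main() {
	for _, code := range []int{0, 3, 1} {
		fmt.Printf("exit %d -> %s\n", code, backupCondition(code))
	}
}
```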
I still think there is a case where failing to read the entire directory should be classed as a failed backup.
Relying on the k8up_backup_restic_last_errors metric to catch an unreadable directory would have been useless here, because volumes that otherwise back up successfully often contain a few unreadable files anyway. How can you distinguish between the entire directory failing to be backed up and a few missing files?
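One possible way to tell the two cases apart, sketched under the assumption that restic's --json output ends with a summary message carrying files_new / files_changed / files_unmodified counters: if every counter is zero, nothing was backed up at all, which is the unreadable-directory case rather than a few skipped files.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resticSummary holds the file counters assumed to be present in the
// final "summary" message of restic's --json backup output.
type resticSummary struct {
	FilesNew        int `json:"files_new"`
	FilesChanged    int `json:"files_changed"`
	FilesUnmodified int `json:"files_unmodified"`
}

// snapshotIsEmpty reports whether the run backed up nothing at all,
// which distinguishes a fully unreadable source from a run that only
// skipped a few unreadable files.
func snapshotIsEmpty(summaryJSON []byte) (bool, error) {
	var s resticSummary
	if err := json.Unmarshal(summaryJSON, &s); err != nil {
		return false, err
	}
	return s.FilesNew+s.FilesChanged+s.FilesUnmodified == 0, nil
}

func main() {
	empty, _ := snapshotIsEmpty([]byte(`{"files_new":0,"files_changed":0,"files_unmodified":0}`))
	fmt.Println("empty snapshot:", empty) // prints: empty snapshot: true
}
```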
Steps To Reproduce
uselagoon/build-deploy-tool#361
Version of K8up
v2.5.2
Version of Kubernetes
v1.31.0
Distribution of Kubernetes
EKS, GCP, AKS