Partitions (backed by network storage) disappear if network is unavailable for more than 5 seconds #9706

Open
ErikLundJensen opened this issue Nov 12, 2024 · 6 comments


@ErikLundJensen
Contributor

Bug Report

Given Talos OS disk is provided from network storage
when network storage is unavailable for more than 5 seconds
then partitions disappear.

For example, the /var partition disappeared on the node. It became available again after a reboot.

Description

It could be related to the hardcoded 5-second timeout in mount.go:

func (p *Point) retry(f func() error, isUnmount bool, printerOptions PrinterOptions) error {
	return retry.Constant(5*time.Second, retry.WithUnits(50*time.Millisecond)).Retry(func() error {
		if err := f(); err != nil {
			switch err {
			case unix.EBUSY:
				return retry.ExpectedError(err)
			case unix.ENOENT, unix.ENXIO:
				// if udevd triggers BLKRRPART ioctl, partition device entry might disappear temporarily
				return retry.ExpectedError(err)

It is not clear if other timeouts can cause the partition to disappear as well.
If the mount function runs in a reconciliation loop then it is probably the right place to fix the issue.

An alternative could be to look into the general XFS filesystem configuration for error handling, using the max_retries, retry_timeout_seconds, and action options.

Logs

Disk I/O timeouts are seen in logs.

Environment

  • Talos version: 1.6.7 (expected to be reproducible in 1.8 and 1.9)
  • Kubernetes version: 1.29.3
  • Platform: vSphere with vSAN and an LACP-configured Link Aggregation Group.
@smira
Member

smira commented Nov 12, 2024

Please provide some logs so we can understand how the partition disappears in your case.

@ErikLundJensen
Contributor Author

Here is a screenshot from the console, as these logs never reach our centralized log server. When /var is unavailable, a lot breaks.
[Screenshot: no-var-folder]

We also saw I/O errors (timeouts) in the console but did not capture them.

@smira
Member

smira commented Nov 13, 2024

So this is quite expected; it has nothing to do with mounting (at least until there are enough logs to prove otherwise).

The partition is mounted, but as it's a network disk, any operation would be broken if the network is unreliable. Talos works without issues e.g. on AWS/EBS volumes, so the network volume should be made reliable enough first.

@ErikLundJensen
Contributor Author

But why did the partition not show up again after network connectivity was re-established?

@smira
Member

smira commented Nov 13, 2024

I don't know. There are zero logs showing partitions being unmounted (and they shouldn't be).

You can grab the kernel logs with talosctl dmesg and inspect them yourself to see whether the partition was unmounted in any way.

@ErikLundJensen
Contributor Author

I'll try to see if I can recreate it in a lab environment and then get the logs.
