Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bus path matcher incorrectly assumes file path suffix, causing Talos to fail to match NVMe drives #114

Open
solidDoWant opened this issue Oct 27, 2024 · 10 comments

Comments

@solidDoWant
Copy link

When getting information about available disks, this line trims block/<device name> from the bus path. For example, it changes /pci0000:00/0000:00:03.2/0000:06:00.0/host11/port-11:0/expander-11:0/port-11:0:0/end_device-11:0:0/target11:0:0/11:0:0:0/block/sda into /pci0000:00/0000:00:03.2/0000:06:00.0/host11/port-11:0/expander-11:0/port-11:0:0/end_device-11:0:0/target11:0:0/11:0:0:0/.

The problem is that not all block devices have a block/<device name> path. Specifically, NVMe drives have a suffix of nvme<dev number>/nvme<dev number>n<namespace number>. For example, for a device named nvme0n1, the suffix would be nvme0/nvme0n1.

This can be confirmed with Talos, which is affected by the aforementioned issue. The command talosctl get disks handles this correctly for some reason. Here's an example output for one of my NVMe drives:

---
node: 
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: nvme1n1
    version: 1
    owner: block.DisksController
    phase: running
    created: 2024-10-27T02:28:38Z
    updated: 2024-10-27T02:28:38Z
spec:
    dev_path: /dev/nvme1n1
    size: 2000398934016
    pretty_size: 2.0 TB
    io_size: 512
    sector_size: 512
    readonly: false
    cdrom: false
    model: PCIe SSD
    serial: "23101720000337"
    wwid: eui.6479a7846ac018dd
    bus_path: /pci0000:00/0000:00:1d.0/0000:58:00.0/nvme
    sub_system: /sys/class/block
    transport: nvme

However, if I specify busPath: /pci0000:00/0000:00:1d.0/0000:58:00.0/nvme in my node configuration, talosctl apply-config fails with no disk found matching provided parameters. Setting it to busPath: /pci0000:00/0000:00:1d.0/0000:58:00.0/nvme/nvme1/nvme1n1 works as expected.

It should also be trivial to check the structure of this file tree by running ls -l /sys/block/<some nvme dev> on any Linux system with a NVMe drive installed.

@smira
Copy link
Member

smira commented Oct 28, 2024

This is a legacy interface which we hope to get rid of soon, so probably we won't be fixing it right now, but rather move towards new implementation.

@solidDoWant
Copy link
Author

I can fix this with a PR with the following patch if it would be accepted:

if strings.HasPrefix(dev, "nvme") {
    // Example: when dev = "nvme0n1", busPathSuffix = "nvme/nvme0/nvme0n1"
    busPathSuffix := filepath.Join("nvme", dev[:strings.LastIndex(dev, "n")], dev)
    busPath = strings.TrimSuffix(busPath, filepath.Join("nvme", dev))
} else {
    busPath = strings.TrimSuffix(busPath, filepath.Join("block", dev))
}

Along with an appropriate test (or tests)

@smira
Copy link
Member

smira commented Oct 28, 2024

That might be a breaking change for those who rely on old, broken behavior.

We have disk matchers done right now, I would rather use that.

@solidDoWant
Copy link
Author

Sorry to clarify do you mean that v2 disk matchers are done? This is the code that feeds the available disk information into the bus path disk matcher (in v1)

@smira
Copy link
Member

smira commented Oct 28, 2024

@solidDoWant
Copy link
Author

Right - this is exactly what has the problem, and what the aforementioned patch would fix.

Talos is using the v1 version of this module for the disk selector code path specifically, see here and here. Talos calls disk.Find here, which calls disk.List, which calls disk.Get, which implements the discussed bug here.

@solidDoWant
Copy link
Author

To be clear, this bug affects the latest Talos release (v1.8.2), not just some old unsupported version.

@smira
Copy link
Member

smira commented Oct 29, 2024

Yes, so the plan is to deprecate v1 stuff, that's why I don't want to introduce more changes there.

@solidDoWant
Copy link
Author

Is this a "we're going to replace it next week" plan, or a "someday when there's nothing better to do or higher priority" plan? I'm not the only person encountering this. I ran into two people last night in a community forum that were impacted by this.

I'm not sure if compatibility or concerns about breakage are reasonable here. The logic is objectively incorrect, so fixing it should at worst only impact people who both discovered and fixed the issue on their own. Additionally, Talos v1.8 introduced several breaking changes (like bumping containerd to v2), so there is recent precedence for breaking existing setups when upgrading Talos.

If nothing else, can the docs be updated to mention this bug? IMO Talos administrators shouldn't need to dig through the Talos codebase figure out what's going on and how to mitigate the problem.

@smira
Copy link
Member

smira commented Oct 29, 2024

Please feel free to open a PR to fix the docs, but backporting this to as a patch release in 1.8.x sounds too much risk to me at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants