New system not able to unlock after running role #150

Open
aheath1992 opened this issue Feb 1, 2024 · 10 comments

@aheath1992

A new system is unable to unlock automatically after running the nbde_client role. The role run reports success in Ansible, but on reboot the system stops at the LUKS encryption screen.

    - name: Import nbde_client role
      ansible.builtin.import_role:
        name: linux-system-roles.nbde_client
      vars:
        nbde_client_bindings:
          - device: "{{ root_disk | d('/dev/vda2') }}"
            encryption_password: "{{ current_password }}"
            servers: "{{ tang_servers }}"

Screenshot from 2024-02-01 16-02-08

@richm
Contributor

richm commented Feb 1, 2024

What version of the role are you using?
What version of ansible are you using?
What is the platform/version of your control node?
What is the platform/version of your managed node?
@sergio-correia what other debugging information do we need?

@aheath1992
Author

What version of the role are you using - 1.71.1
What version of ansible are you using - 2.16
What is the platform/version of your control node - fedora 39
What is the platform/version of your managed node - RHEL 8.8

@sergio-correia
Member

Hello. Here is some more information that may help to debug this:

  • what is the version of clevis in the managed node?
  • what are all the encrypted devices in the managed node? is it that /dev/vda2 or do we have others?
  • please check whether the initrd from the managed node includes the clevis machinery to perform the unlocking in early boot (if that is the case); something like lsinitrd | grep clevis can help here
  • also, check whether network is enabled for early boot, which it will need in order to access the tang servers; the role includes "rd.neednet=1" to indicate this. It will likely be included in the initrd in a file named etc/cmdline.d/01-default.conf. Perhaps something like this could help to verify this: lsinitrd /boot/initramfs-$(uname -r).img etc/cmdline.d/01-default.conf
  • if you have access to the tang server, also please check whether there are any requests coming from the client (clevis)
  • check also whether clevis-luks-askpass.path unit is enabled: systemctl status clevis-luks-askpass.path; it will be used if we are going to decrypt a disk in late boot phase
  • check also the clevis configuration for the specified device; e.g.: clevis luks list -d /dev/vda2

@richm: I wonder if it makes sense to have some "action"/"state" to collect some of this information from the managed hosts, to help troubleshoot such issues?
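
The checks listed above could be gathered in one pass with a small shell helper. This is only a sketch: the function names `run_check` and `collect_nbde_diagnostics` are hypothetical and not part of the nbde_client role, the commands are the ones quoted in the list (plus `rpm -q` and `lsblk -f` as assumed ways to answer the first two questions), and it assumes a root shell on the managed node.

```shell
#!/bin/sh
# Sketch: collect the NBDE troubleshooting data requested above.
# Function names are illustrative, not part of the nbde_client role.

# Print a header, then run the command if its binary is available.
run_check() {
    title="$1"; shift
    printf '=== %s ===\n' "$title"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || printf '(command exited with status %s)\n' "$?"
    else
        printf '(skipped: %s not found)\n' "$1"
    fi
}

collect_nbde_diagnostics() {
    device="${1:-/dev/vda2}"   # default taken from the playbook in this issue
    run_check "clevis version" rpm -q clevis
    run_check "block devices" lsblk -f
    run_check "clevis files in initrd" sh -c 'lsinitrd | grep clevis'
    run_check "early-boot network setting" \
        lsinitrd "/boot/initramfs-$(uname -r).img" etc/cmdline.d/01-default.conf
    run_check "askpass path unit" systemctl --no-pager status clevis-luks-askpass.path
    run_check "clevis bindings" clevis luks list -d "$device"
}
```

Calling `collect_nbde_diagnostics /dev/vda2` after sourcing the file prints one labelled section per check; commands that are missing on the host are reported as skipped rather than aborting the run.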

@aheath1992
Author

aheath1992 commented Feb 2, 2024

  • what is the version of clevis in the managed node? - clevis-15-15.el8.x86_64
  • what are all the encrypted devices in the managed node? is it that /dev/vda2 or do we have others? - just the root device in this case /dev/vda2
  • please check whether the initrd from the managed node includes the clevis machinery to perform the unlocking in early boot (if that is the case); something like lsinitrd | grep clevis can help here

lsinitrd | grep clevis
clevis
clevis-pin-null
clevis-pin-sss
clevis-pin-tang
clevis-pin-tpm2
lrwxrwxrwx   1 root     root           48 Jan 20  2023 etc/systemd/system/cryptsetup.target.wants/clevis-luks-askpass.path -> /usr/lib/systemd/system/clevis-luks-askpass.path
-rwxr-xr-x   1 root     root         1679 Jan 20  2023 usr/bin/clevis
-rwxr-xr-x   1 root     root         1654 Oct 28  2020 usr/bin/clevis-decrypt
-rwxr-xr-x   1 root     root         1148 Jan 20  2023 usr/bin/clevis-decrypt-null
-rwxr-xr-x   1 root     root        25296 Jan 20  2023 usr/bin/clevis-decrypt-sss
-rwxr-xr-x   1 root     root         3560 Jan 20  2023 usr/bin/clevis-decrypt-tang
-rwxr-xr-x   1 root     root         5121 Oct 28  2020 usr/bin/clevis-decrypt-tpm2
-rw-r--r--   1 root     root        32885 Jan 20  2023 usr/bin/clevis-luks-common-functions
-rwxr-xr-x   1 root     root         2115 Oct 28  2020 usr/bin/clevis-luks-list
-rwxr-xr-x   1 root     root         2466 Jan 20  2023 usr/libexec/clevis-luks-askpass
-rw-r--r--   1 root     root          302 Oct 28  2020 usr/lib/systemd/system/clevis-luks-askpass.path
-rw-r--r--   1 root     root          190 Jan 20  2023 usr/lib/systemd/system/clevis-luks-askpass.service

lsinitrd /boot/initramfs-$(uname -r).img etc/cmdline.d/01-default.conf
 rd.neednet=1

systemctl status clevis-luks-askpass.path
● clevis-luks-askpass.path - Forward Password Requests to Clevis Directory Watch
   Loaded: loaded (/usr/lib/systemd/system/clevis-luks-askpass.path; enabled; vendor preset: enabled)
   Active: active (waiting) since Fri 2024-02-02 13:54:24 UTC; 24s ago
     Docs: man:clevis-luks-unlockers(7)

clevis luks list -d /dev/vda2
1: sss '{"t":1,"pins":{"tang":[{"url":"http://tang1"},{"url":"http://tang2"}]}}'

@sergio-correia
Member

At first glance, it looks OK. Could you also check journalctl to see if any useful information shows up, please? (I forgot to mention it beforehand, but feel free to redact any IP addresses if required.)

journalctl -xf -u clevis-luks-askpass.service

@aheath1992
Author

aheath1992 commented Feb 2, 2024

Feb 02 15:21:20 clevis-test.ansi-001.prod.iad2.dc.redhat.com clevis-luks-askpass[11941]: Error communicating with the server http://tang1
Feb 02 15:21:20 clevis-test.ansi-001.prod.iad2.dc.redhat.com clevis-luks-askpass[11942]: Error communicating with the server http://tang2

telnet tang1 80
Trying tang1...
Connected to tang1.
Escape character is '^]'.

telnet tang2 80
Trying tang2...
Connected to tang2.
Escape character is '^]'.
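
Note that telnet only confirms that the TCP port is open. A closer approximation of what clevis does is to request the Tang advertisement over HTTP, which Tang serves at the `/adv` path. The following is a minimal sketch, assuming curl is installed on the client; the helper names and the 10-second timeout are illustrative choices, and a check run from a fully booted system does not reproduce the network conditions at unlock time.

```shell
# Build the advertisement URL for a Tang server base URL.
tang_adv_url() {
    printf '%s/adv\n' "${1%/}"
}

# Return 0 if the server answers the advertisement request over HTTP.
check_tang() {
    curl --silent --fail --max-time 10 --output /dev/null "$(tang_adv_url "$1")"
}

# Possible usage with the servers from this issue:
#   for s in http://tang1 http://tang2; do
#       if check_tang "$s"; then echo "$s ok"; else echo "$s FAILED"; fi
#   done
```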

@richm
Contributor

richm commented Jul 17, 2024

@sergio-correia any idea?

@xeluior
Contributor

xeluior commented Jul 30, 2024

I've been seeing this as well. I have found that adding the _netdev option to the relevant fstab entry allows the unlocking to proceed (tested on Rocky 8 and 9 clients, both early and late boot, and Debian 11 and 12 clients, late boot only). I have added an awk script task into my playbook after the role runs to add this option.

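- name: Update fstab options
  ansible.builtin.shell: |
    set -eu
    # Look up the mapper name for this device in /etc/crypttab (field 2 is the device).
    name="$(awk '$2 == "{{ item.device }}" { print $1 }' /etc/crypttab | head -n 1)"
    tmp="$(mktemp)"
    # Append _netdev to the mount options (field 4) of the matching fstab entry.
    awk -v mapper_path="/dev/mapper/$name" '{
      if ($1 == mapper_path && index($4, "_netdev") == 0) {
        $4 = $4 ",_netdev"
      }
      print
    }' /etc/fstab > "$tmp"
    # Rewrite /etc/fstab in place (keeps its mode and owner) only when it differs.
    if ! cmp -s "$tmp" /etc/fstab; then
      cat "$tmp" > /etc/fstab
      echo changed
    fi
    rm -f "$tmp"
  loop: '{{ nbde_client_bindings }}'
  register: fstab
  changed_when: '"changed" in fstab.stdout'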

I believe this behavior is tied to systemd's ordering of mount units: it orders fstab entries marked _netdev after network-online.target, which clevis needs in order to reach the Tang servers. (ref)
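
To make the effect concrete, the following sketch applies the same awk rule to a sample line read from standard input instead of /etc/fstab. The function name `add_netdev` and the mapper name `cryptroot` are made up for illustration.

```shell
# Append _netdev to the options (field 4) of the fstab entry whose
# first field is /dev/mapper/<name>; other lines pass through untouched.
add_netdev() {
    awk -v mapper_path="/dev/mapper/$1" '
        $1 == mapper_path && index($4, "_netdev") == 0 { $4 = $4 ",_netdev" }
        { print }
    '
}

# Example with a hypothetical entry:
printf '%s\n' '/dev/mapper/cryptroot / xfs defaults 0 0' | add_netdev cryptroot
# prints: /dev/mapper/cryptroot / xfs defaults,_netdev 0 0
```

One side effect worth knowing: assigning to a field makes awk rebuild the line with single spaces, so any tab alignment on the modified entry is lost. Lines that are not modified keep their original whitespace.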

@sergio-correia
Member

Yeah, this is likely in the right direction.

We may need _netdev in crypttab to mark the device as requiring the network. To prevent a dependency loop, we would then also need _netdev in fstab, if the device is listed there for a mount point. Additionally, we may have to enable the remote-cryptsetup.target unit.
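
A rough sketch of the crypttab side of this, under a few assumptions: the mapper name `cryptroot` and the UUID are placeholders, the function name is hypothetical, and the entry is matched by its first field. The function reads crypttab content on standard input so it can be tried on a copy before touching /etc/crypttab.

```shell
# Add _netdev to the options (field 4) of the crypttab entry named $1.
# crypttab fields: name, device, key file, options; the last two are optional.
crypttab_add_netdev() {
    awk -v name="$1" '
        $1 == name {
            if (NF < 3) {
                $3 = "none"                 # no key file given
            }
            if (NF < 4) {
                $4 = "_netdev"              # no options yet
            } else if (index($4, "_netdev") == 0) {
                $4 = $4 ",_netdev"
            }
        }
        { print }
    '
}

printf '%s\n' 'cryptroot UUID=0000-placeholder none luks' | crypttab_add_netdev cryptroot
# prints: cryptroot UUID=0000-placeholder none luks,_netdev

# Enabling the target mentioned above (needs root on a systemd host):
#   systemctl enable remote-cryptsetup.target
```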

@xeluior
Contributor

xeluior commented Aug 20, 2024

I have some more information from testing with this role that should probably be considered here. I did have to add the _netdev option in both /etc/fstab and /etc/crypttab for automatic unlock. This works fine. However, on systemd versions < 245, the crypttab generator creates an ordering issue with the dev-mapper-{name}.device unit that hangs shutdown indefinitely. This can be fixed by also adding the x-systemd.requires=systemd-cryptsetup@{name}.service option to the appropriate device in /etc/fstab. I have an Ansible-native solution in the playbook I used to deploy this, which I could turn into a PR, but it requires several new options per device in nbde_client_bindings so that it can create the appropriate crypttab and fstab entries.

EDIT: the systemd issue mentioned is systemd/systemd#8472
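
For illustration, the extra fstab option described above could be composed like this. The sketch assumes the mapper name contains no characters that need systemd unit-name escaping (a dash, for example, would); `cryptroot` and the function name are placeholders.

```shell
# Append x-systemd.requires=systemd-cryptsetup@<name>.service to the
# options (field 4) of the fstab entry for /dev/mapper/<name>.
add_cryptsetup_requires() {
    awk -v mapper_path="/dev/mapper/$1" \
        -v opt="x-systemd.requires=systemd-cryptsetup@$1.service" '
        $1 == mapper_path && index($4, opt) == 0 { $4 = $4 "," opt }
        { print }
    '
}

printf '%s\n' '/dev/mapper/cryptroot / xfs defaults,_netdev 0 0' |
    add_cryptsetup_requires cryptroot
# prints: /dev/mapper/cryptroot / xfs defaults,_netdev,x-systemd.requires=systemd-cryptsetup@cryptroot.service 0 0
```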
