test: Factorize and fix timeout for contacting domain #19615

Merged
merged 3 commits into from
Nov 16, 2023

@martinpitt martinpitt commented Nov 16, 2023

In most cases this is fast, but quite often Samba takes annoyingly long to answer. Make the timeout consistent and enforce this with helper functions.


Since yesterday's services image update, this test and related ones now flake annoyingly often 👍

@martinpitt martinpitt added the flake unstable test label Nov 16, 2023
@martinpitt martinpitt marked this pull request as draft November 16, 2023 06:58
@martinpitt martinpitt force-pushed the realms-address-helper branch from 4506844 to 63a6f9a Compare November 16, 2023 13:23
@martinpitt

I can reproduce this in some 10% of local runs, when I do a parallel c8s and rhel-9-4 run. I amplified and sped up the test a bit:

--- test/verify/check-system-realms
+++ test/verify/check-system-realms
@@ -353,9 +353,18 @@ class CommonTests:
         m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
         # join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
         self.login_and_go("/system")
-        b.click("#system_information_domain_button")
-        b.wait_popup("realms-join-dialog")
-        b.wait_attr("#realms-op-address", "data-discover", "done")
+        b.eval_js('window.debugging = "dbus"')
+        b.cdp.trace = True
+        m.verbose = True
+        for _ in range(20):
+            b.click("#system_information_domain_button")
+            b.wait_popup("realms-join-dialog")
+            b.wait_attr("#realms-op-address", "data-discover", "done")
+            b.click("#realms-join-dialog button.pf-m-link")
+            b.wait_not_present("#realms-join-dialog")
+
+        return

There is indeed a vast range of response times when opening and closing the dialog interactively. Sometimes it's near-instant, sometimes it takes 10 seconds. So we need to wait longer for the discovery as well. With that, the amplified test loop is stable locally. Let's see what CI thinks.
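As a rough illustration of the "one consistent timeout behind a helper" idea, here is a minimal, self-contained sketch. `FakeBrowser` is a hypothetical stand-in written for this example, not the real Cockpit test library; only the selector, the attribute name, the `wait_discover` helper name and the 60 second value are taken from this PR.

```python
import contextlib
import time


class FakeBrowser:
    """Hypothetical stand-in for the test browser (assumption, not the real API)."""

    def __init__(self, timeout=15):
        self.timeout = timeout
        self.attrs = {}  # maps (selector, attribute) to its current value

    @contextlib.contextmanager
    def wait_timeout(self, seconds):
        # Temporarily override the default timeout and restore it afterwards.
        old = self.timeout
        self.timeout = seconds
        try:
            yield
        finally:
            self.timeout = old

    def wait_attr(self, selector, attr, value, poll=0.01):
        # Poll until the attribute has the expected value or the timeout expires.
        deadline = time.monotonic() + self.timeout
        while self.attrs.get((selector, attr)) != value:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{selector}[{attr}] never became {value!r}")
            time.sleep(poll)


DISCOVER_TIMEOUT = 60  # seconds; one shared value instead of per-call numbers


def wait_discover(browser):
    # The only place that knows how long domain discovery may take.
    with browser.wait_timeout(DISCOVER_TIMEOUT):
        browser.wait_attr("#realms-op-address", "data-discover", "done")
```

With a helper like this, individual tests call `wait_discover(b)` and never spell out a timeout, so a later bump only has to happen in one place.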

In most cases this is fast, but quite often Samba takes annoyingly long
to answer. Make the timeout consistent and enforce this with helper
functions, except for the instance in TestPackageInstall as that doesn't
derive from CommonTests.

Restarting sssd in a loop is prone to run into

> systemd[1]: sssd.service: Start request repeated too quickly.
> systemd[1]: sssd.service: Failed with result 'start-limit-hit'.

With 30 seconds we are running into occasional timeout failures.
@martinpitt

This round has several sssd.service failures due to

systemd[1]: sssd.service: Start request repeated too quickly.
systemd[1]: sssd.service: Failed with result 'start-limit-hit'.

That's an easy fix.
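For illustration, the polling loop with the start-limit workaround could be sketched in Python as below. This is an assumption-laden sketch, not the code in the PR (which runs a shell loop on the test machine): `wait_for_user` and the injected `run` callable are hypothetical names, and `run` is assumed to execute a command and return its exit status.

```python
import time
from typing import Callable


def wait_for_user(run: Callable[[str], int], user: str, timeout: float = 300,
                  interval: float = 5, sleep=time.sleep, clock=time.monotonic) -> None:
    """Poll `id USER`, restarting sssd between attempts until the user resolves."""
    deadline = clock() + timeout
    while run(f"id {user}") != 0:
        if clock() >= deadline:
            raise TimeoutError(f"user {user} did not appear within {timeout}s")
        sleep(interval)
        # Clear the unit's failed state first, so that repeated restarts
        # are not refused with 'start-limit-hit'.
        run("systemctl reset-failed sssd")
        run("systemctl restart sssd")
```

Passing `run`, `sleep` and `clock` in as parameters keeps the loop testable without a real machine or real waiting.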

This failure is different. Let's try and bump the timeout.

@martinpitt martinpitt force-pushed the realms-address-helper branch from 63a6f9a to 650c603 Compare November 16, 2023 13:55
@martinpitt

martinpitt commented Nov 16, 2023

Two testClientCertAuthentication failures on rhel-8-10 and rhel-9-4 where the client machine never picks up the new 'alice' user. Argh! These pass for me locally in a loop. Retrying once, as this PR already helps a lot.

@martinpitt martinpitt marked this pull request as ready for review November 16, 2023 14:35
@martinpitt martinpitt requested a review from jelly November 16, 2023 14:35
@martinpitt

@jelly Tests are still running, and there are probably other flakes, but I'm marking this for review now, as it will remedy the worst of our current main flakes.

        with self.browser.wait_timeout(60):
            self.browser.wait_attr("#realms-op-address", "data-discover", "done")

    def wait_address_helper(self, expected=None):

This could have been expected="Contacted domain" which is a bit nicer I'd say. Anyway not critical.

         b.set_input_text("#realms-op-admin", self.admin_user)
         b.set_input_text("#realms-op-admin-password", self.admin_password)
         b.click(f"#realms-join-dialog button{self.primary_btn_class}")
         with b.wait_timeout(300):
             b.wait_not_present("#realms-join-dialog")
         b.logout()
-        m.execute('while ! id alice; do sleep 5; systemctl restart sssd; done', timeout=300)
+        m.execute('while ! id alice; do sleep 5; systemctl reset-failed sssd; systemctl restart sssd; done', timeout=300)

Annoying that this still flakes here, but that can be a follow-up.

@martinpitt

> where the client machine never picks up the new 'alice' user.

I can reproduce this locally after umpteen retries. This isn't a matter of waiting longer, it still doesn't work after 45 mins of sitting, running sss_cache -E, etc. But in the samba container, samba-tool user list does show alice.

I tried to reboot the client machine, doesn't help. So this smells like a real bug in the new samba container. I'll report and naughty it tomorrow.

@martinpitt martinpitt merged commit 6ef43c6 into cockpit-project:main Nov 16, 2023
90 of 91 checks passed
@martinpitt martinpitt deleted the realms-address-helper branch November 16, 2023 16:23
@martinpitt

martinpitt commented Nov 17, 2023

Some ideas, for my notes:

  • Try the alice polling loop with just sss_cache -E without restarting sssd: Still fails with AD (didn't try IPA exhaustively)
  • Try the alice polling loop without any extra fiddling: Still fails with AD, IPA seems fine. Let's do that.
  • This is the test that modifies the user by adding a cert; try if this still happens without: yes, it does
  • Try joining with realm CLI, for a cockpit-independent reproducer: succeeds reliably now (but see below)
  • Test with the latest container; the container build was fixed yesterday, see "How to get a [global] option into smb.conf?" (samba-in-kubernetes/samba-container#159): at least one run took a long time to see the user, but eventually succeeded; a 10x run passed, a 30x run eventually failed
    bots/image-customize -vr 'podman pull quay.io/samba.org/samba-ad-server' services
    
    Sent "Image refresh for services" (bots#5557) to cross-check

When this happens, alice is also not present in sssctl user-show alice. What's weird is this:

# sssctl domain-status COCKPIT.LAN
Online status: Offline

Active servers:
AD Global Catalog: not connected
AD Domain Controller: cockpit.lan
[...]

/var/log/sssd/sssd_cockpit.lan.log has a similar error:

   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_get_account_info_send] (0x0200): Got request for [0x1][BE_REQ_USER][[email protected]]
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] DP Request [Account #5]: REQ_TRACE: New request. [sssd.nss CID #4] Flags [0x0001].
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] [CID #4] Backend is offline! Using cached data if available
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] Number of active DP request: 1
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [sss_domain_get_state] (0x1000): [RID#5] Domain cockpit.lan is Active
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [_dp_req_recv] (0x0400): DP Request [Account #5]: Receiving request data.
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_req_destructor] (0x0400): DP Request [Account #5]: Request removed.
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_req_destructor] (0x0400): Number of active DP request: 0
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [sbus_issue_request_done] (0x0040): sssd.dataprovider.getAccountInfo: Error [1432158212]: SSSD is offline
********************** BACKTRACE DUMP ENDS HERE *********************************

(2023-11-17  0:47:15): [be[cockpit.lan]] [ad_sasl_log] (0x0040): [RID#6] SASL: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Server krbtgt/[email protected] not found in Kerberos database)
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:15): [be[cockpit.lan]] [sasl_bind_send] (0x0020): [RID#6] ldap_sasl_interactive_bind_s failed (-2)[Local error]
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:15): [be[cockpit.lan]] [sdap_cli_connect_recv] (0x0040): [RID#6] Unable to establish connection [1432158227]: Authentication Failed
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:19): [be[cockpit.lan]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:

On a run which works, /var/log/sssd/sssd_cockpit.lan.log just has a single "Starting..." line, and sssctl domain-status COCKPIT.LAN says:

Online status: Online

Active servers:
AD Global Catalog: f0.cockpit.lan
AD Domain Controller: f0.cockpit.lan
[...]
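To compare the broken and working states automatically, the status output could be parsed along these lines. This is only a sketch based on the two excerpts quoted in this comment; real `sssctl domain-status` output has more sections, and `parse_domain_status` is a hypothetical helper name, not part of any existing tool.

```python
def parse_domain_status(output: str) -> dict:
    """Extract the online flag and the AD server lines from the quoted output format."""
    status = {"online": None, "servers": {}}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # lines without "key: value" shape, such as "[...]"
        key, value = key.strip(), value.strip()
        if key == "Online status":
            status["online"] = (value == "Online")
        elif key.startswith("AD "):
            # "not connected" is mapped to None so callers can test truthiness.
            status["servers"][key] = None if value == "not connected" else value
    return status


broken = """Online status: Offline

Active servers:
AD Global Catalog: not connected
AD Domain Controller: cockpit.lan
"""
print(parse_domain_status(broken))
```

A test could then assert that `online` is true and that every server entry is set before it starts polling for alice.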

podman logs samba looks comparable on both services VMs, also the running processes.

The client side journal on the broken instance is interesting: sssd.service is caught in a restarting loop:

GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Server krbtgt/[email protected] not found in Kerberos database)

I can recover by leaving and re-joining the domain, so the server-side isn't persistently broken.

Logging into cockpit and not opening the realm join dialog, then joining via CLI still fails, although less often:

diff --git test/verify/check-system-realms test/verify/check-system-realms
index 9947c595a..e50e01eb9 100755
--- test/verify/check-system-realms
+++ test/verify/check-system-realms
@@ -357,22 +357,21 @@ class CommonTests:
 
         # join domain, wait until it works
         m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
+
         # join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
         self.login_and_go("/system")
-        b.click("#system_information_domain_button")
-        b.wait_popup("realms-join-dialog")
-        self.wait_discover()
+        b.wait_visible("#system_information_domain_button")
+        self.assertIn("cockpit.lan", m.execute("realm discover"))
+        m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} cockpit.lan")
 
-        b.set_input_text("#realms-op-address", "cockpit.lan")
-        self.wait_address_helper()
-        b.set_input_text("#realms-op-admin", self.admin_user)
-        b.set_input_text("#realms-op-admin-password", self.admin_password)
-        b.click(f"#realms-join-dialog button{self.primary_btn_class}")
-        with b.wait_timeout(300):
-            b.wait_not_present("#realms-join-dialog")
         b.logout()
+
         m.execute('while ! id alice; do sleep 5; done', timeout=300)
 
+        # testlib.sit()
+
+        return
+
         # alice's certificate was written by testClientCertAuthentication()
         alice_cert_key = ['--cert', "/var/tmp/alice.pem", '--key', "/var/tmp/alice.key"]
         alice_user_pass = ['-u', 'alice:' + self.alice_password]
@@ -896,45 +895,14 @@ class TestAD(TestRealms, CommonTests):
         m = self.machine
 
         services_machine = self.machines['services']
-        # samba has no default CA and no helpers, so just re-use our completely independent cockpit-tls unit test one
-        m.upload(["alice.pem", "alice.key"], "/var/tmp", relative_dir="src/tls/ca/")
-
-        with open("src/tls/ca/alice.pem") as f:
-            alice_cert = f.read().strip()
-        # mangle into form palatable for LDAP
-        alice_cert = ''.join([line for line in alice_cert.splitlines() if not line.startswith("----")])
-        # set up an AD user and import their TLS certificate
-        services_machine.write("/tmp/alice_edit", f'''#!/bin/sh -eu
-sed -i "/^$/d" "$1"
-echo "userCertificate: {alice_cert}" >> "$1"
-''', perm="755")
+
         services_machine.execute(f"""
-podman cp /tmp/alice_edit samba:/tmp/
 podman exec -i samba sh -exc '
 samba-tool user add alice {self.alice_password}
-samba-tool user edit --editor=/tmp/alice_edit alice
 # for debugging:
 samba-tool user show alice
 ' """, stdout=None)
 
-        # set up sssd for certificate mapping to AD
-        # see sssd.conf(5) "CERTIFICATE MAPPING SECTION" and sss-certmap(5)
-        m.write("/etc/sssd/conf.d/certmap.conf", """
-[certmap/cockpit.lan/certs]
-# our test certificates don't have EKU, and as we match full certificates it is not important to check anything here
-matchrule = <KU>digitalSignature
-# default rule; doesn't work because samba's LDAP doesn't understand ";binary"
-# maprule = LDAP:(userCertificate;binary={cert!bin})
-# match verbatim base64 certificate
-maprule = LDAP:(userCertificate={cert!base64})
-# match cert properties only; this looks at SubjectAlternativeName, which our test certs don't have
-# this also requires CA validation in cockpit-tls or sssd, which we don't have yet
-# maprule = (|(userPrincipalName={subject_principal})(sAMAccountName={subject_principal.short_name}))
-""", perm="0600")
-        # tell sssd about our CA for validating certs
-        with open("src/tls/ca/ca.pem") as f:
-            m.write("/etc/sssd/pki/sssd_auth_ca_db.pem", f.read())
-
         self.checkClientCertAuthentication()
 

Leaving the server running and merely looping the client side passes reliably:

--- test/verify/check-system-realms
+++ test/verify/check-system-realms
@@ -357,21 +357,27 @@ class CommonTests:
 
         # join domain, wait until it works
         m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
-        # join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
-        self.login_and_go("/system")
-        b.click("#system_information_domain_button")
-        b.wait_popup("realms-join-dialog")
-        self.wait_discover()
 
-        b.set_input_text("#realms-op-address", "cockpit.lan")
-        self.wait_address_helper()
-        b.set_input_text("#realms-op-admin", self.admin_user)
-        b.set_input_text("#realms-op-admin-password", self.admin_password)
-        b.click(f"#realms-join-dialog button{self.primary_btn_class}")
-        with b.wait_timeout(300):
-            b.wait_not_present("#realms-join-dialog")
-        b.logout()
-        m.execute('while ! id alice; do sleep 5; done', timeout=300)
+        m.execute("cp /etc/nsswitch.conf /etc/nsswitch.conf.orig")
+
+        # join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
+        for _retry in range(10):
+            self.login_and_go("/system")
+            b.wait_visible("#system_information_domain_button")
+            m.execute("until realm discover | grep -q COCKPIT.LAN; do sleep 5; done")
+            m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} COCKPIT.LAN")
+            b.logout()
+            m.execute('while ! id alice; do sleep 5; done')
+            m.execute("realm leave")
+            self.assertEqual(m.execute("realm list"), "")
+            m.execute("! id alice")
+            # clean up
+            m.execute("systemctl stop realmd sssd")
+            m.execute("authselect backup-list | cut -f1 -d' ' | xargs authselect backup-restore")
+            m.execute("authselect backup-list | cut -f1 -d' ' | xargs authselect backup-remove")
+            m.execute("diff -u /etc/nsswitch.conf.orig /etc/nsswitch.conf", stdout=None)
+
+        return
  • I extended this to completely restart the server (samba container), that works reliably as well.
  • I restored everything relevant in /etc and /var (that can't be removed entirely) and running services, still works reliably.

@martinpitt

Interesting, I sometimes run into a completely different flake:

+ until realm discover | grep -q COCKPIT.LAN; do sleep 5; done
+ echo 'foobarFoo123' | realm join -vU Administrator COCKPIT.LAN
 * Resolving: _ldap._tcp.cockpit.lan
 * Resolving: cockpit.lan
 ! Discovery timed out after 15 seconds
realm: No such realm found

Shelving that for now.

@martinpitt

martinpitt commented Nov 17, 2023

Current test: This includes the current samba container (cockpit-project/bots#5557), and this bit:

        self.login_and_go("/system")
        b.wait_visible("#system_information_domain_button")
        # CURRENT TEST: significantly passes when logging out here
        # b.logout()
        self.assertIn("cockpit.lan", m.execute("realm discover"))
        m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} cockpit.lan")
        # CURRENT TEST: definitively fails when logging out here
        b.logout()

So joining while cockpit is running seems to make the significant difference.

However, this result isn't very reliable yet: the flake is fairly hard to reproduce now, and it resists attempts to make it trigger faster, so a 2x 10x iteration takes awfully long, and even 10 serial runs are not completely reliable.
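A back-of-the-envelope calculation shows why 10 passing runs prove little. The sketch below assumes independent runs with a fixed per-run failure probability `p`; the 10% figure is only borrowed from the earlier local reproduction rate in this thread and is an assumption here, since the current rate is unknown.

```python
import math


def prob_all_pass(p: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass despite a flake rate p."""
    return (1 - p) ** runs


def runs_needed(p: float, confidence: float) -> int:
    """Smallest number of runs that shows at least one failure with the given confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


# With an assumed 10% flake rate, 10 clean runs still happen about a third of the time.
print(round(prob_all_pass(0.1, 10), 3))
print(runs_needed(0.1, 0.95))
```

If the real failure rate is lower than the assumed 10%, the number of runs needed grows accordingly.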

  • Disable preloads: fail
  • Add 5s sleep after login to settle down machine (changes timing): fails
  • Replace cockpit with IO/CPU eater: FAIL!

This fails:

        m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
        m.spawn("for i in $(seq 10); do grep -r . /usr >&2; done", "noise")
        time.sleep(1)
        self.assertIn("cockpit.lan", m.execute("realm discover"))
        m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} cockpit.lan")
        m.execute('while ! id alice; do sleep 5; done', timeout=300)

@martinpitt

I sent samba-in-kubernetes/samba-container#160 to hopefully get some help with debugging this, I'm running out of ideas.
