test: Factorize and fix timeout for contacting domain #19615
Conversation
I can reproduce this in about 10% of local runs when I do a parallel c8s and rhel-9-4 run. I amplified and sped up the test a bit:

--- test/verify/check-system-realms
+++ test/verify/check-system-realms
@@ -353,9 +353,18 @@ class CommonTests:
m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
# join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
self.login_and_go("/system")
- b.click("#system_information_domain_button")
- b.wait_popup("realms-join-dialog")
- b.wait_attr("#realms-op-address", "data-discover", "done")
+ b.eval_js('window.debugging = "dbus"')
+ b.cdp.trace = True
+ m.verbose = True
+ for _ in range(20):
+ b.click("#system_information_domain_button")
+ b.wait_popup("realms-join-dialog")
+ b.wait_attr("#realms-op-address", "data-discover", "done")
+ b.click("#realms-join-dialog button.pf-m-link")
+ b.wait_not_present("#realms-join-dialog")
+
+ return
There is indeed a vast range of response times when opening and closing the dialog interactively. Sometimes it's near-instant, sometimes it takes 10 seconds. So we need to wait longer for the discovery as well. With that, the amplified test loop is stable locally. Let's see what CI thinks.
In most cases this is fast, but quite often Samba takes annoyingly long to answer. Make the timeout consistent and enforce this with helper functions, except for the instance in TestPackageInstall as that doesn't derive from CommonTests.
Restarting sssd in a loop is prone to run into

> systemd[1]: sssd.service: Start request repeated too quickly.
> systemd[1]: sssd.service: Failed with result 'start-limit-hit'.
With 30 seconds we are running into occasional timeout failures.
This round has several sssd.service failures due to
That's an easy fix. This failure is different. Let's try to bump the timeout.
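Bumping the timeout hinges on being able to widen the wait budget for one slow step without touching the global default. As a hedged sketch (the real testlib Browser differs; the class here is a minimal stand-in, not the actual API), a context manager can do that:

```python
from contextlib import contextmanager

class Browser:
    """Minimal stand-in for a testlib-style browser with a default wait timeout."""
    def __init__(self, timeout=15):
        self.timeout = timeout

    @contextmanager
    def wait_timeout(self, seconds):
        # temporarily raise the timeout for a slow operation, then restore it
        saved = self.timeout
        self.timeout = seconds
        try:
            yield
        finally:
            self.timeout = saved

b = Browser()
with b.wait_timeout(60):
    assert b.timeout == 60  # slow discovery waits get the larger budget
assert b.timeout == 15      # default restored afterwards
```

This keeps the longer timeout scoped to the discovery wait, so other waits still fail fast.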
@jelly Tests are still running, and probably there are other flakes, but I'm already marking this for review as it will remedy the worst of our current main flakes.
    with self.browser.wait_timeout(60):
        self.browser.wait_attr("#realms-op-address", "data-discover", "done")

    def wait_address_helper(self, expected=None):
This could have been expected="Contacted domain", which is a bit nicer I'd say. Anyway, not critical.
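A hedged sketch of what such a helper could look like, using a hypothetical `get_helper_text` callable standing in for the real browser query (the actual testlib helper is different):

```python
import time

def wait_address_helper(get_helper_text, expected=None, timeout=60, interval=0.2):
    """Poll the dialog's address helper text until it contains `expected`,
    or until it is non-empty when no expectation is given."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        text = get_helper_text()
        if (expected in text) if expected else bool(text):
            return text
        time.sleep(interval)
    raise TimeoutError(f"helper text did not show {expected!r} within {timeout}s")

# usage with a fake getter standing in for the browser query
print(wait_address_helper(lambda: "Contacted domain", expected="Contacted domain"))
```

Passing an explicit `expected` makes the assertion self-documenting, which is the point of the review suggestion above.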
  b.set_input_text("#realms-op-admin", self.admin_user)
  b.set_input_text("#realms-op-admin-password", self.admin_password)
  b.click(f"#realms-join-dialog button{self.primary_btn_class}")
  with b.wait_timeout(300):
      b.wait_not_present("#realms-join-dialog")
  b.logout()
- m.execute('while ! id alice; do sleep 5; systemctl restart sssd; done', timeout=300)
+ m.execute('while ! id alice; do sleep 5; systemctl reset-failed sssd; systemctl restart sssd; done', timeout=300)
Annoying still flakes here, but can be a follow up.
I can reproduce this locally after umpteen retries. This isn't a matter of waiting longer; it still doesn't work after 45 minutes of sitting and running. I tried to reboot the client machine, which doesn't help. So this smells like a real bug in the new samba container. I'll report and naughty it tomorrow.
Two ideas, for my notes:
When this happens, alice is also not present in
/var/log/sssd/sssd_cockpit.lan.log has a similar error:
On a run which works, /var/log/sssd/sssd_cockpit.lan.log just has a single "Starting..." line, and
The client side journal on the broken instance is interesting:
I can recover by leaving and re-joining the domain, so the server side isn't persistently broken. Logging into Cockpit without opening the realm join dialog, then joining via CLI, still fails, although less often:

diff --git test/verify/check-system-realms test/verify/check-system-realms
index 9947c595a..e50e01eb9 100755
--- test/verify/check-system-realms
+++ test/verify/check-system-realms
@@ -357,22 +357,21 @@ class CommonTests:
# join domain, wait until it works
m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
+
# join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
self.login_and_go("/system")
- b.click("#system_information_domain_button")
- b.wait_popup("realms-join-dialog")
- self.wait_discover()
+ b.wait_visible("#system_information_domain_button")
+ self.assertIn("cockpit.lan", m.execute("realm discover"))
+ m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} cockpit.lan")
- b.set_input_text("#realms-op-address", "cockpit.lan")
- self.wait_address_helper()
- b.set_input_text("#realms-op-admin", self.admin_user)
- b.set_input_text("#realms-op-admin-password", self.admin_password)
- b.click(f"#realms-join-dialog button{self.primary_btn_class}")
- with b.wait_timeout(300):
- b.wait_not_present("#realms-join-dialog")
b.logout()
+
m.execute('while ! id alice; do sleep 5; done', timeout=300)
+ # testlib.sit()
+
+ return
+
# alice's certificate was written by testClientCertAuthentication()
alice_cert_key = ['--cert', "/var/tmp/alice.pem", '--key', "/var/tmp/alice.key"]
alice_user_pass = ['-u', 'alice:' + self.alice_password]
@@ -896,45 +895,14 @@ class TestAD(TestRealms, CommonTests):
m = self.machine
services_machine = self.machines['services']
- # samba has no default CA and no helpers, so just re-use our completely independent cockpit-tls unit test one
- m.upload(["alice.pem", "alice.key"], "/var/tmp", relative_dir="src/tls/ca/")
-
- with open("src/tls/ca/alice.pem") as f:
- alice_cert = f.read().strip()
- # mangle into form palatable for LDAP
- alice_cert = ''.join([line for line in alice_cert.splitlines() if not line.startswith("----")])
- # set up an AD user and import their TLS certificate
- services_machine.write("/tmp/alice_edit", f'''#!/bin/sh -eu
-sed -i "/^$/d" "$1"
-echo "userCertificate: {alice_cert}" >> "$1"
-''', perm="755")
+
services_machine.execute(f"""
-podman cp /tmp/alice_edit samba:/tmp/
podman exec -i samba sh -exc '
samba-tool user add alice {self.alice_password}
-samba-tool user edit --editor=/tmp/alice_edit alice
# for debugging:
samba-tool user show alice
' """, stdout=None)
- # set up sssd for certificate mapping to AD
- # see sssd.conf(5) "CERTIFICATE MAPPING SECTION" and sss-certmap(5)
- m.write("/etc/sssd/conf.d/certmap.conf", """
-[certmap/cockpit.lan/certs]
-# our test certificates don't have EKU, and as we match full certificates it is not important to check anything here
-matchrule = <KU>digitalSignature
-# default rule; doesn't work because samba's LDAP doesn't understand ";binary"
-# maprule = LDAP:(userCertificate;binary={cert!bin})
-# match verbatim base64 certificate
-maprule = LDAP:(userCertificate={cert!base64})
-# match cert properties only; this looks at SubjectAlternativeName, which our test certs don't have
-# this also requires CA validation in cockpit-tls or sssd, which we don't have yet
-# maprule = (|(userPrincipalName={subject_principal})(sAMAccountName={subject_principal.short_name}))
-""", perm="0600")
- # tell sssd about our CA for validating certs
- with open("src/tls/ca/ca.pem") as f:
- m.write("/etc/sssd/pki/sssd_auth_ca_db.pem", f.read())
-
self.checkClientCertAuthentication()
Leaving the server running and merely looping the client side passes reliably: --- test/verify/check-system-realms
+++ test/verify/check-system-realms
@@ -357,21 +357,27 @@ class CommonTests:
# join domain, wait until it works
m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
- # join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
- self.login_and_go("/system")
- b.click("#system_information_domain_button")
- b.wait_popup("realms-join-dialog")
- self.wait_discover()
- b.set_input_text("#realms-op-address", "cockpit.lan")
- self.wait_address_helper()
- b.set_input_text("#realms-op-admin", self.admin_user)
- b.set_input_text("#realms-op-admin-password", self.admin_password)
- b.click(f"#realms-join-dialog button{self.primary_btn_class}")
- with b.wait_timeout(300):
- b.wait_not_present("#realms-join-dialog")
- b.logout()
- m.execute('while ! id alice; do sleep 5; done', timeout=300)
+ m.execute("cp /etc/nsswitch.conf /etc/nsswitch.conf.orig")
+
+ # join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
+ for _retry in range(10):
+ self.login_and_go("/system")
+ b.wait_visible("#system_information_domain_button")
+ m.execute("until realm discover | grep -q COCKPIT.LAN; do sleep 5; done")
+ m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} COCKPIT.LAN")
+ b.logout()
+ m.execute('while ! id alice; do sleep 5; done')
+ m.execute("realm leave")
+ self.assertEqual(m.execute("realm list"), "")
+ m.execute("! id alice")
+ # clean up
+ m.execute("systemctl stop realmd sssd")
+ m.execute("authselect backup-list | cut -f1 -d' ' | xargs authselect backup-restore")
+ m.execute("authselect backup-list | cut -f1 -d' ' | xargs authselect backup-remove")
+ m.execute("diff -u /etc/nsswitch.conf.orig /etc/nsswitch.conf", stdout=None)
+
+ return
Interesting, I sometimes run into a completely different flake:
Shelving that for now.
Current test: This includes the current samba container (cockpit-project/bots#5557), and this bit:
So joining while Cockpit is running seems to make the significant difference. However, this result isn't very reliable yet: the flake is fairly hard to reproduce now, and since it resists being sped up, a 2x 10x iteration takes awfully long, and 10 serial runs are not completely conclusive.
This fails:

m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
m.spawn("for i in $(seq 10); do grep -r . /usr >&2; done", "noise")
time.sleep(1)
self.assertIn("cockpit.lan", m.execute("realm discover"))
m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} cockpit.lan")
m.execute('while ! id alice; do sleep 5; done', timeout=300)
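The final polling line above (`while ! id alice; do sleep 5; done`) can be sketched in plain Python, in case the wait logic ever needs richer diagnostics than the shell loop gives; this is an illustrative equivalent, not code from the test suite:

```python
import subprocess
import time

def wait_for_user(user, timeout=300, interval=5):
    """Python equivalent of `while ! id <user>; do sleep 5; done` with a
    timeout: poll `id` until the user resolves or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if subprocess.run(["id", user], capture_output=True).returncode == 0:
            return True
        time.sleep(interval)
    return False
```

A Python loop like this would also make it easy to log intermediate `id` failures, which the silent shell loop swallows.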
I sent samba-in-kubernetes/samba-container#160 to hopefully get some help with debugging this; I'm running out of ideas.
Since yesterday's services image update, this test and related ones now flake annoyingly often 👍