test: Factorize and fix timeout for contacting domain #19615
Conversation
I can reproduce this in about 10% of local runs when I do a parallel c8s and rhel-9-4 run. I amplified and sped up the test a bit:

--- test/verify/check-system-realms
+++ test/verify/check-system-realms
@@ -353,9 +353,18 @@ class CommonTests:
m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
# join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
self.login_and_go("/system")
- b.click("#system_information_domain_button")
- b.wait_popup("realms-join-dialog")
- b.wait_attr("#realms-op-address", "data-discover", "done")
+ b.eval_js('window.debugging = "dbus"')
+ b.cdp.trace = True
+ m.verbose = True
+ for _ in range(20):
+ b.click("#system_information_domain_button")
+ b.wait_popup("realms-join-dialog")
+ b.wait_attr("#realms-op-address", "data-discover", "done")
+ b.click("#realms-join-dialog button.pf-m-link")
+ b.wait_not_present("#realms-join-dialog")
+
+ return
There is indeed a vast range of response times when opening and closing the dialog interactively. Sometimes it's near-instant, sometimes it takes 10 seconds. So we need to wait longer for the discovery as well. With that, the amplified test loop is stable locally. Let's see what CI thinks.
In most cases this is fast, but quite often Samba takes annoyingly long to answer. Make the timeout consistent and enforce this with helper functions, except for the instance in TestPackageInstall as that doesn't derive from CommonTests.
Restarting sssd in a loop is prone to run into

> systemd[1]: sssd.service: Start request repeated too quickly.
> systemd[1]: sssd.service: Failed with result 'start-limit-hit'.
With 30 seconds we are running into occasional timeout failures.
This round has several sssd.service failures due to
That's an easy fix. This failure is different. Let's try to bump the timeout.
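Bumping the timeout hinges on being able to widen the wait budget for one slow step without touching the global default. As a hedged sketch (the real testlib Browser differs; the class here is a minimal stand-in, not the actual API), a context manager can do that:

```python
from contextlib import contextmanager

class Browser:
    """Minimal stand-in for a testlib-style browser with a default wait timeout."""
    def __init__(self, timeout=15):
        self.timeout = timeout

    @contextmanager
    def wait_timeout(self, seconds):
        # temporarily raise the timeout for a slow operation, then restore it
        saved = self.timeout
        self.timeout = seconds
        try:
            yield
        finally:
            self.timeout = saved

b = Browser()
with b.wait_timeout(60):
    assert b.timeout == 60  # slow discovery waits get the larger budget
assert b.timeout == 15      # default restored afterwards
```

This keeps the longer timeout scoped to the discovery wait, so other waits still fail fast.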
@jelly Tests are still running, and probably there are other flakes, but I'm already marking this for review as it will remedy the worst of our current main flakes.
    with self.browser.wait_timeout(60):
        self.browser.wait_attr("#realms-op-address", "data-discover", "done")

    def wait_address_helper(self, expected=None):
This could have been expected="Contacted domain", which is a bit nicer I'd say. Anyway, not critical.
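A hedged sketch of what such a helper could look like, using a hypothetical `get_helper_text` callable standing in for the real browser query (the actual testlib helper is different):

```python
import time

def wait_address_helper(get_helper_text, expected=None, timeout=60, interval=0.2):
    """Poll the dialog's address helper text until it contains `expected`,
    or until it is non-empty when no expectation is given."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        text = get_helper_text()
        if (expected in text) if expected else bool(text):
            return text
        time.sleep(interval)
    raise TimeoutError(f"helper text did not show {expected!r} within {timeout}s")

# usage with a fake getter standing in for the browser query
print(wait_address_helper(lambda: "Contacted domain", expected="Contacted domain"))
```

Passing an explicit `expected` makes the assertion self-documenting, which is the point of the review suggestion above.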
  b.set_input_text("#realms-op-admin", self.admin_user)
  b.set_input_text("#realms-op-admin-password", self.admin_password)
  b.click(f"#realms-join-dialog button{self.primary_btn_class}")
  with b.wait_timeout(300):
      b.wait_not_present("#realms-join-dialog")
  b.logout()
- m.execute('while ! id alice; do sleep 5; systemctl restart sssd; done', timeout=300)
+ m.execute('while ! id alice; do sleep 5; systemctl reset-failed sssd; systemctl restart sssd; done', timeout=300)
Annoying still flakes here, but can be a follow up.
I can reproduce this locally after umpteen retries. This isn't a matter of waiting longer; it still doesn't work after 45 minutes of sitting and running. I tried to reboot the client machine, which doesn't help. So this smells like a real bug in the new samba container. I'll report and naughty it tomorrow.
Two ideas, for my notes:
When this happens, alice is also not present in
/var/log/sssd/sssd_cockpit.lan.log has a similar error:
On a run which works, /var/log/sssd/sssd_cockpit.lan.log just has a single "Starting..." line, and
The client side journal on the broken instance is interesting:
I can recover by leaving and re-joining the domain, so the server side isn't persistently broken. Logging into Cockpit without opening the realm join dialog, then joining via CLI, still fails, although less often:

diff --git test/verify/check-system-realms test/verify/check-system-realms
index 9947c595a..e50e01eb9 100755
--- test/verify/check-system-realms
+++ test/verify/check-system-realms
@@ -357,22 +357,21 @@ class CommonTests:
# join domain, wait until it works
m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
+
# join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
self.login_and_go("/system")
- b.click("#system_information_domain_button")
- b.wait_popup("realms-join-dialog")
- self.wait_discover()
+ b.wait_visible("#system_information_domain_button")
+ self.assertIn("cockpit.lan", m.execute("realm discover"))
+ m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} cockpit.lan")
- b.set_input_text("#realms-op-address", "cockpit.lan")
- self.wait_address_helper()
- b.set_input_text("#realms-op-admin", self.admin_user)
- b.set_input_text("#realms-op-admin-password", self.admin_password)
- b.click(f"#realms-join-dialog button{self.primary_btn_class}")
- with b.wait_timeout(300):
- b.wait_not_present("#realms-join-dialog")
b.logout()
+
m.execute('while ! id alice; do sleep 5; done', timeout=300)
+ # testlib.sit()
+
+ return
+
# alice's certificate was written by testClientCertAuthentication()
alice_cert_key = ['--cert', "/var/tmp/alice.pem", '--key', "/var/tmp/alice.key"]
alice_user_pass = ['-u', 'alice:' + self.alice_password]
@@ -896,45 +895,14 @@ class TestAD(TestRealms, CommonTests):
m = self.machine
services_machine = self.machines['services']
- # samba has no default CA and no helpers, so just re-use our completely independent cockpit-tls unit test one
- m.upload(["alice.pem", "alice.key"], "/var/tmp", relative_dir="src/tls/ca/")
-
- with open("src/tls/ca/alice.pem") as f:
- alice_cert = f.read().strip()
- # mangle into form palatable for LDAP
- alice_cert = ''.join([line for line in alice_cert.splitlines() if not line.startswith("----")])
- # set up an AD user and import their TLS certificate
- services_machine.write("/tmp/alice_edit", f'''#!/bin/sh -eu
-sed -i "/^$/d" "$1"
-echo "userCertificate: {alice_cert}" >> "$1"
-''', perm="755")
+
services_machine.execute(f"""
-podman cp /tmp/alice_edit samba:/tmp/
podman exec -i samba sh -exc '
samba-tool user add alice {self.alice_password}
-samba-tool user edit --editor=/tmp/alice_edit alice
# for debugging:
samba-tool user show alice
' """, stdout=None)
- # set up sssd for certificate mapping to AD
- # see sssd.conf(5) "CERTIFICATE MAPPING SECTION" and sss-certmap(5)
- m.write("/etc/sssd/conf.d/certmap.conf", """
-[certmap/cockpit.lan/certs]
-# our test certificates don't have EKU, and as we match full certificates it is not important to check anything here
-matchrule = <KU>digitalSignature
-# default rule; doesn't work because samba's LDAP doesn't understand ";binary"
-# maprule = LDAP:(userCertificate;binary={cert!bin})
-# match verbatim base64 certificate
-maprule = LDAP:(userCertificate={cert!base64})
-# match cert properties only; this looks at SubjectAlternativeName, which our test certs don't have
-# this also requires CA validation in cockpit-tls or sssd, which we don't have yet
-# maprule = (|(userPrincipalName={subject_principal})(sAMAccountName={subject_principal.short_name}))
-""", perm="0600")
- # tell sssd about our CA for validating certs
- with open("src/tls/ca/ca.pem") as f:
- m.write("/etc/sssd/pki/sssd_auth_ca_db.pem", f.read())
-
self.checkClientCertAuthentication()
Leaving the server running and merely looping the client side passes reliably: --- test/verify/check-system-realms
+++ test/verify/check-system-realms
@@ -357,21 +357,27 @@ class CommonTests:
# join domain, wait until it works
m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
- # join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
- self.login_and_go("/system")
- b.click("#system_information_domain_button")
- b.wait_popup("realms-join-dialog")
- self.wait_discover()
- b.set_input_text("#realms-op-address", "cockpit.lan")
- self.wait_address_helper()
- b.set_input_text("#realms-op-admin", self.admin_user)
- b.set_input_text("#realms-op-admin-password", self.admin_password)
- b.click(f"#realms-join-dialog button{self.primary_btn_class}")
- with b.wait_timeout(300):
- b.wait_not_present("#realms-join-dialog")
- b.logout()
- m.execute('while ! id alice; do sleep 5; done', timeout=300)
+ m.execute("cp /etc/nsswitch.conf /etc/nsswitch.conf.orig")
+
+ # join client machine with Cockpit, to create the HTTP/ principal and /etc/cockpit/krb5.keytab
+ for _retry in range(10):
+ self.login_and_go("/system")
+ b.wait_visible("#system_information_domain_button")
+ m.execute("until realm discover | grep -q COCKPIT.LAN; do sleep 5; done")
+ m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} COCKPIT.LAN")
+ b.logout()
+ m.execute('while ! id alice; do sleep 5; done')
+ m.execute("realm leave")
+ self.assertEqual(m.execute("realm list"), "")
+ m.execute("! id alice")
+ # clean up
+ m.execute("systemctl stop realmd sssd")
+ m.execute("authselect backup-list | cut -f1 -d' ' | xargs authselect backup-restore")
+ m.execute("authselect backup-list | cut -f1 -d' ' | xargs authselect backup-remove")
+ m.execute("diff -u /etc/nsswitch.conf.orig /etc/nsswitch.conf", stdout=None)
+
+ return
Interesting, I sometimes run into a completely different flake:
Shelving that for now.
Current test: This includes the current samba container (cockpit-project/bots#5557), and this bit:
So joining while Cockpit is running seems to make the significant difference. However, this result isn't very reliable yet: the flake is fairly hard to reproduce now, and since it resists being sped up, a 2x 10x iteration takes awfully long, and 10 serial runs are not completely conclusive.
This fails:

m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
m.spawn("for i in $(seq 10); do grep -r . /usr >&2; done", "noise")
time.sleep(1)
self.assertIn("cockpit.lan", m.execute("realm discover"))
m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} cockpit.lan")
m.execute('while ! id alice; do sleep 5; done', timeout=300)
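The final polling line above (`while ! id alice; do sleep 5; done`) can be sketched in plain Python, in case the wait logic ever needs richer diagnostics than the shell loop gives; this is an illustrative equivalent, not code from the test suite:

```python
import subprocess
import time

def wait_for_user(user, timeout=300, interval=5):
    """Python equivalent of `while ! id <user>; do sleep 5; done` with a
    timeout: poll `id` until the user resolves or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if subprocess.run(["id", user], capture_output=True).returncode == 0:
            return True
        time.sleep(interval)
    return False
```

A Python loop like this would also make it easy to log intermediate `id` failures, which the silent shell loop swallows.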
I sent samba-in-kubernetes/samba-container#160 to hopefully get some help with debugging this; I'm running out of ideas.
Since yesterday's services image update, this test and related ones now flake annoyingly often 👍