Occasionally clients can't discover AD Global Catalog server #160

Open
martinpitt opened this issue Nov 20, 2023 · 5 comments
@martinpitt

I've been debugging a big Cockpit AD test flake for three days now and still can't put my finger on it, so maybe you have an idea. This started failing when we moved from https://github.com/Fmstrat/samba-domain/ to https://quay.io/repository/samba.org/samba-ad-server , i.e. the client side didn't change. What this test does is roughly this:

  • Start a "services" VM with a samba-ad-server podman container (called f0.cockpit.lan), exporting all its ports
  • Start a "client/cockpit" VM x0.cockpit.lan with realmd, adcli, and so on.
  • Create an alice user in Samba AD on "services"
  • On the client, join the domain, and wait until the alice user is visible, i.e. id alice succeeds.

This works most of the time. After joining:

# sssctl domain-status cockpit.lan
Online status: Online

Active servers:
AD Global Catalog: f0.cockpit.lan
AD Domain Controller: f0.cockpit.lan

But in about 10% of local runs and 50% of runs in CI, it looks like this:

Online status: Offline

Active servers:
AD Global Catalog: not connected
AD Domain Controller: cockpit.lan

and /var/log/sssd/sssd_cockpit.lan.log shows the corresponding errors:

   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_get_account_info_send] (0x0200): Got request for [0x1][BE_REQ_USER][[email protected]]
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] DP Request [Account #5]: REQ_TRACE: New request. [sssd.nss CID #4] Flags [0x0001].
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] [CID #4] Backend is offline! Using cached data if available
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] Number of active DP request: 1
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [sss_domain_get_state] (0x1000): [RID#5] Domain cockpit.lan is Active
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [_dp_req_recv] (0x0400): DP Request [Account #5]: Receiving request data.
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_req_destructor] (0x0400): DP Request [Account #5]: Request removed.
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_req_destructor] (0x0400): Number of active DP request: 0
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [sbus_issue_request_done] (0x0040): sssd.dataprovider.getAccountInfo: Error [1432158212]: SSSD is offline
********************** BACKTRACE DUMP ENDS HERE *********************************

(2023-11-17  0:47:15): [be[cockpit.lan]] [ad_sasl_log] (0x0040): [RID#6] SASL: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Server krbtgt/[email protected] not found in Kerberos database)
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:15): [be[cockpit.lan]] [sasl_bind_send] (0x0020): [RID#6] ldap_sasl_interactive_bind_s failed (-2)[Local error]
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:15): [be[cockpit.lan]] [sdap_cli_connect_recv] (0x0040): [RID#6] Unable to establish connection [1432158227]: Authentication Failed
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:19): [be[cockpit.lan]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
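
One sanity check when it happens (untried so far) would be whether Global Catalog discovery can work at all from the client: AD publishes the GC via the _gc._tcp SRV record, and the GC listens on LDAP port 3268:

# does the client see the GC SRV record?
nslookup -type=SRV _gc._tcp.cockpit.lan
# is the GC LDAP port reachable on the DC?
nc -z f0.cockpit.lan 3268 && echo "GC port reachable"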

This is a race condition: I can gradually strip down the test until it doesn't involve Cockpit at all any more; Cockpit's only effect is to cause some I/O and CPU noise (like packagekit checking for updates). I can synthesize this with client-side commands like this:

        m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
        m.spawn("for i in $(seq 10); do grep -r . /usr >&2; done", "noise")
        time.sleep(1)
        self.assertIn("cockpit.lan", m.execute("realm discover"))
        m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} cockpit.lan")m
        m.execute('while ! id alice; do sleep 5; done', timeout=300)

This is Cockpit test API lingo: m.execute() just runs a shell command on the client VM, while m.spawn() runs one in the background.
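
For reference, a rough plain-shell equivalent of that snippet (an untested transliteration; the admin credentials are placeholders):

printf '[cockpit.lan]\nfully-qualified-names = no\n' >> /etc/realmd.conf
for i in $(seq 10); do grep -r . /usr >&2; done &   # background I/O noise
sleep 1
realm discover | grep cockpit.lan
echo "$ADMIN_PASSWORD" | realm join -vU "$ADMIN_USER" cockpit.lan
timeout 300 sh -c 'while ! id alice; do sleep 5; done'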

Do you happen to have any idea to investigate further what exactly fails here?

@martinpitt
Author

I tried to create a reproducer that runs on a standard RHEL 9.4 cloud image (as that's where it fails most often, but it also fails on C8S, Fedora 39, etc.).

First, some prep:

systemctl stop firewalld
hostnamectl set-hostname x0.cockpit.lan
logout
# log back in to pick up changed host name

Set up the Samba container:

cat <<EOF > /tmp/samba-ad.json
{
  "samba-container-config": "v0",
  "configs": {
    "demo": {
      "instance_features": ["addc"],
      "domain_settings": "sink",
      "instance_name": "smb"
    }
  },
  "domain_settings": {
    "sink": {
      "realm": "COCKPIT.LAN",
      "short_domain": "COCKPIT",
      "admin_password": "foobarFoo123"
    }
  }
}
EOF

SERVER_IP=$(ip route show | grep -oP 'src \K\S+' | head -n1)

# necessary?
# echo "$SERVER_IP x0.cockpit.lan x0" >> /etc/hosts

podman run -d --rm --name samba --privileged \
    -p $SERVER_IP:53:53/udp -p 389:389 -p 389:389/udp -p 445:445 \
    -p 88:88 \
    -p 88:88/udp \
    -p 135:135 \
    -p 137-138:137-138/udp \
    -p 139:139 \
    -p 464:464 \
    -p 464:464/udp \
    -p 636:636 \
    -p 1024-1044:1024-1044 \
    -p 3268-3269:3268-3269 \
    -v /tmp/samba-ad.json:/etc/samba/container.json \
    -h smb.cockpit.lan \
    quay.io/samba.org/samba-ad-server


nmcli con mod 'System eth0' ipv4.ignore-auto-dns yes ipv4.dns $SERVER_IP
systemctl restart NetworkManager
# echo "nameserver $SERVER_IP" > /etc/resolv.conf
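# sanity check (assumption: NetworkManager now writes the container DNS
# into resolv.conf; adjust if your image manages resolv.conf differently)
grep "nameserver $SERVER_IP" /etc/resolv.conf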

# wait until server is running
until nslookup -type=SRV _ldap._tcp.cockpit.lan; do sleep 1; done
until nc -z $SERVER_IP 389; do sleep 1; done
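# (hedged suggestion, not part of the original recipe: also wait for the
# Global Catalog port, since GC discovery is the part that fails)
until nc -z $SERVER_IP 3268; do sleep 1; done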

# add AD user
podman exec -i samba samba-tool user add alice foobarFoo123
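# optional sanity check that the user exists (samba-tool user list is a
# standard samba-tool subcommand)
podman exec samba samba-tool user list | grep -x alice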

Now the AD client side:

printf '[cockpit.lan]\nfully-qualified-names = no\n' > /etc/realmd.conf
# this should discover COCKPIT.LAN
realm discover
# cockpit.lan type kerberos, client-software: sssd, etc

echo foobarFoo123 | realm join -vU Administrator cockpit.lan

This succeeds.

id alice fails, and sssctl domain-status cockpit.lan is in a semi-broken state: it says "Online" (instead of "Offline" as in our failing test), but it still cannot find the Global Catalog:

Online status: Online

Active servers:
AD Global Catalog: not connected
AD Domain Controller: smb.cockpit.lan

Discovered AD Global Catalog servers:
None so far.
Discovered AD Domain Controller servers:
- smb.cockpit.lan
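
One way to probe whether the GC service itself is up (a hedged suggestion; assumes openldap-clients on the client VM) is an anonymous rootDSE query against the Global Catalog port 3268, which AD permits without a bind:

ldapsearch -x -H ldap://smb.cockpit.lan:3268 -s base -b '' defaultNamingContext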

The sssd log is rather empty:

# cat /var/log/sssd/sssd_cockpit.lan.log
(2023-11-20  6:07:28): [be[cockpit.lan]] [server_setup] (0x3f7c0): Starting with debug level = 0x0070

All other log files look similar.
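
The debug level above is the default (0x0070); a hedged next step would be to crank up SSSD's verbosity and retrigger the lookup (sssctl debug-level ships with sssd-tools):

sssctl debug-level 9      # raises the level on the running sssd processes
id alice                  # retrigger the failing lookup
less /var/log/sssd/sssd_cockpit.lan.log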

So this clearly does not reproduce the actual flake/error, but I'm lost here. Do you have a hint on how to fix this CLI reproducer? Once it works in general, I hope I can make it flake/error like our actual test (which is hard to debug, as there are so many moving parts).

Thanks!

@phlogistonjohn
Collaborator

CC: @gd

martinpitt added a commit to martinpitt/cockpit that referenced this issue Nov 21, 2023
There is a race condition with the current Samba AD container on the
services image: Sometimes joining the domain doesn't pick up the global
directory server, and queries fail with "SSSD is offline" / "Unspecified
GSS failure".

This is too hard for us to track down ourselves; it needs help from [1].
In the meantime, leave and re-join the domain to give it another chance
of succeeding. This avoids the extra I/O/CPU noise that goes along with
the cockpit session (such as checking for package updates), and has a
higher chance of succeeding.

Note that joining AD via cockpit is still covered by
test{Un,}QualifiedUser.

[1] samba-in-kubernetes/samba-container#160
martinpitt added a commit to martinpitt/bots that referenced this issue Nov 22, 2023
Commit 392d6b2 moved to quay.io/samba.org/samba-ad-server, but this
has a serious and difficult bug [1] with connecting to the Global
Directory.

Go back to the previous https://github.com/Fmstrat/samba-domain container. The
official dockerhub image actually works very well now, but we still have to
build ourselves due to the docker.io pull rate limits.

Also don't re-add the external volumes -- we are not interested in permanently
keeping any Samba data.

[1] samba-in-kubernetes/samba-container#160
martinpitt added a commit to cockpit-project/bots that referenced this issue Nov 22, 2023
(same commit message as above)

[1] samba-in-kubernetes/samba-container#160
@martinpitt
Author

We moved back to https://github.com/Fmstrat/samba-domain a while ago, and while that made things better, we still see this bug a lot. As it happens on two completely different OSes/Samba packagings (Fedora and Ubuntu), this looks like a regression in Samba itself. Our current container has Samba 4.15.13. But I still have no idea where to go from here.

@phlogistonjohn
Collaborator

> We moved back to https://github.com/Fmstrat/samba-domain a while ago, and while that made things better, we still see this bug a lot.

I'm sorry to hear that. Both for the change and for the issue.

> As it happens on two completely different OSes/Samba packagings (Fedora and Ubuntu), this looks like a regression in Samba itself. Our current container has Samba 4.15.13. But I still have no idea where to go from here.

It is certainly possible.

We build images tagged nightly that include nightly builds of samba master. Could you try quay.io/samba.org/samba-ad-server:nightly and see if the issue occurs there too? If so, we may want to report the issue at the samba bugzilla.
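
(Assuming the reproducer above, that should only require pulling the other tag and rerunning the same podman run command with it:)

podman pull quay.io/samba.org/samba-ad-server:nightly
# then rerun the 'podman run ...' from the reproducer with this image reference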

Also, sorry for the lack of response earlier. I saw this issue while I was on vacation and pinged my manager at work hoping he'd have someone else look into it. But I guess that didn't happen, and from my POV it fell through the cracks.

@martinpitt
Author

No worries at all @phlogistonjohn ! Thanks for the hint, I'll try the nightly image, in January (this is EOY for me as well). Happy holidays!
