Occasionally clients can't discover AD Global Catalog server #160

Open
martinpitt opened this issue Nov 20, 2023 · 5 comments
@martinpitt

I've been debugging a big Cockpit AD test flake for three days now and still can't put my finger on it, so maybe you have an idea. This started failing when we moved from https://github.com/Fmstrat/samba-domain/ to https://quay.io/repository/samba.org/samba-ad-server , i.e. the client side didn't change. What this test does is roughly this:

  • Start a "services" VM with a samba-ad-server podman container (called f0.cockpit.lan), exporting all its ports
  • Start a "client/cockpit" VM x0.cockpit.lan with realmd, adcli, and so on.
  • Create an alice user in Samba AD on "services"
  • On the client, join the domain, and wait until the alice user is visible, i.e. id alice succeeds.

This works most of the time. After joining:

# sssctl domain-status cockpit.lan
Online status: Online

Active servers:
AD Global Catalog: f0.cockpit.lan
AD Domain Controller: f0.cockpit.lan

But in about 10% of local runs and 50% of runs in CI, it looks like this:

Online status: Offline

Active servers:
AD Global Catalog: not connected
AD Domain Controller: cockpit.lan

and /var/log/sssd/sssd_cockpit.lan.log shows the corresponding errors:

   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_get_account_info_send] (0x0200): Got request for [0x1][BE_REQ_USER][[email protected]]
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] DP Request [Account #5]: REQ_TRACE: New request. [sssd.nss CID #4] Flags [0x0001].
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] [CID #4] Backend is offline! Using cached data if available
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] Number of active DP request: 1
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [sss_domain_get_state] (0x1000): [RID#5] Domain cockpit.lan is Active
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [_dp_req_recv] (0x0400): DP Request [Account #5]: Receiving request data.
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_req_destructor] (0x0400): DP Request [Account #5]: Request removed.
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_req_destructor] (0x0400): Number of active DP request: 0
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [sbus_issue_request_done] (0x0040): sssd.dataprovider.getAccountInfo: Error [1432158212]: SSSD is offline
********************** BACKTRACE DUMP ENDS HERE *********************************

(2023-11-17  0:47:15): [be[cockpit.lan]] [ad_sasl_log] (0x0040): [RID#6] SASL: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Server krbtgt/[email protected] not found in Kerberos database)
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:15): [be[cockpit.lan]] [sasl_bind_send] (0x0020): [RID#6] ldap_sasl_interactive_bind_s failed (-2)[Local error]
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:15): [be[cockpit.lan]] [sdap_cli_connect_recv] (0x0040): [RID#6] Unable to establish connection [1432158227]: Authentication Failed
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:19): [be[cockpit.lan]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
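
One sanity check when it happens (untried so far) would be whether Global Catalog discovery can work at all from the client: AD publishes the GC via the _gc._tcp SRV record, and the GC listens on LDAP port 3268:

# does the client see the GC SRV record?
nslookup -type=SRV _gc._tcp.cockpit.lan
# is the GC LDAP port reachable on the DC?
nc -z f0.cockpit.lan 3268 && echo "GC port reachable"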

This is a race condition: I can gradually strip down the test until it doesn't involve Cockpit at all any more; Cockpit's only effect is to cause some I/O and CPU noise (like packagekit checking for updates). I can synthesize this with client-side commands like this:

        m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
        m.spawn("for i in $(seq 10); do grep -r . /usr >&2; done", "noise")
        time.sleep(1)
        self.assertIn("cockpit.lan", m.execute("realm discover"))
        m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} cockpit.lan")m
        m.execute('while ! id alice; do sleep 5; done', timeout=300)

This is Cockpit test API lingo: m.execute() just runs a shell command on the client VM, while m.spawn() runs one in the background.
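
For reference, a rough plain-shell equivalent of that snippet (an untested transliteration; the admin credentials are placeholders):

printf '[cockpit.lan]\nfully-qualified-names = no\n' >> /etc/realmd.conf
for i in $(seq 10); do grep -r . /usr >&2; done &   # background I/O noise
sleep 1
realm discover | grep cockpit.lan
echo "$ADMIN_PASSWORD" | realm join -vU "$ADMIN_USER" cockpit.lan
timeout 300 sh -c 'while ! id alice; do sleep 5; done'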

Do you happen to have any idea to investigate further what exactly fails here?

@martinpitt
Author

I tried to create a reproducer that runs on a standard RHEL 9.4 cloud image (as that's where it fails most often, but it also fails on C8S, Fedora 39, etc.).

First, some prep:

systemctl stop firewalld
hostnamectl set-hostname x0.cockpit.lan
logout
# log back in to pick up changed host name

Set up the Samba container:

cat <<EOF > /tmp/samba-ad.json
{
  "samba-container-config": "v0",
  "configs": {
    "demo": {
      "instance_features": ["addc"],
      "domain_settings": "sink",
      "instance_name": "smb"
    }
  },
  "domain_settings": {
    "sink": {
      "realm": "COCKPIT.LAN",
      "short_domain": "COCKPIT",
      "admin_password": "foobarFoo123"
    }
  }
}
EOF

SERVER_IP=$(ip route show | grep -oP 'src \K\S+' | head -n1)

# necessary?
# echo "$SERVER_IP x0.cockpit.lan x0" >> /etc/hosts

podman run -d --rm --name samba --privileged \
    -p $SERVER_IP:53:53/udp -p 389:389 -p 389:389/udp -p 445:445 \
    -p 88:88 \
    -p 88:88/udp \
    -p 135:135 \
    -p 137-138:137-138/udp \
    -p 139:139 \
    -p 464:464 \
    -p 464:464/udp \
    -p 636:636 \
    -p 1024-1044:1024-1044 \
    -p 3268-3269:3268-3269 \
    -v /tmp/samba-ad.json:/etc/samba/container.json \
    -h smb.cockpit.lan \
    quay.io/samba.org/samba-ad-server


nmcli con mod 'System eth0' ipv4.ignore-auto-dns yes ipv4.dns $SERVER_IP
systemctl restart NetworkManager
# echo "nameserver $SERVER_IP" > /etc/resolv.conf
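# sanity check (assumption: NetworkManager now writes the container DNS
# into resolv.conf; adjust if your image manages resolv.conf differently)
grep "nameserver $SERVER_IP" /etc/resolv.conf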

# wait until server is running
until nslookup -type=SRV _ldap._tcp.cockpit.lan; do sleep 1; done
until nc -z $SERVER_IP 389; do sleep 1; done
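# (hedged suggestion, not part of the original recipe: also wait for the
# Global Catalog port, since GC discovery is the part that fails)
until nc -z $SERVER_IP 3268; do sleep 1; done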

# add AD user
podman exec -i samba samba-tool user add alice foobarFoo123
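# optional sanity check that the user exists (samba-tool user list is a
# standard samba-tool subcommand)
podman exec samba samba-tool user list | grep -x alice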

Now the AD client side:

printf '[cockpit.lan]\nfully-qualified-names = no\n' > /etc/realmd.conf
# this should discover COCKPIT.LAN
realm discover
# cockpit.lan type kerberos, client-software: sssd, etc

echo foobarFoo123 | realm join -vU Administrator cockpit.lan

This succeeds.

id alice fails, and sssctl domain-status cockpit.lan is in a semi-broken state: it says "Online" (instead of "Offline" as in our failing test), but it still cannot find the Global Catalog:

Online status: Online

Active servers:
AD Global Catalog: not connected
AD Domain Controller: smb.cockpit.lan

Discovered AD Global Catalog servers:
None so far.
Discovered AD Domain Controller servers:
- smb.cockpit.lan
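
One way to probe whether the GC service itself is up (a hedged suggestion; assumes openldap-clients on the client VM) is an anonymous rootDSE query against the Global Catalog port 3268, which AD permits without a bind:

ldapsearch -x -H ldap://smb.cockpit.lan:3268 -s base -b '' defaultNamingContext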

The sssd log is rather empty:

# cat /var/log/sssd/sssd_cockpit.lan.log
(2023-11-20  6:07:28): [be[cockpit.lan]] [server_setup] (0x3f7c0): Starting with debug level = 0x0070

All other log files look similar.
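
The debug level above is the default (0x0070); a hedged next step would be to crank up SSSD's verbosity and retrigger the lookup (sssctl debug-level ships with sssd-tools):

sssctl debug-level 9      # raises the level on the running sssd processes
id alice                  # retrigger the failing lookup
less /var/log/sssd/sssd_cockpit.lan.log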

So this clearly does not reproduce the actual flake/error, but I'm lost here. Do you have a hint on how to fix this CLI reproducer? Once it works in general, I hope I can make it flake/error like our actual test (which is hard to debug, as there are so many moving parts).

Thanks!

@phlogistonjohn
Collaborator

CC: @gd

martinpitt added a commit to martinpitt/cockpit that referenced this issue Nov 21, 2023
There is a race condition with the current Samba AD container on the
services image: Sometimes joining the domain doesn't pick up the global
directory server, and queries fail with "SSSD is offline" / "Unspecified
GSS failure".

This is too hard for us to track down ourselves; it needs help from [1].
In the meantime, leave and re-join the domain to give it another chance
of succeeding. This avoids the extra I/O/CPU noise that goes along with
the cockpit session (such as checking for package updates), and has a
higher chance of succeeding.

Note that joining AD via cockpit is still covered by
test{Un,}QualifiedUser.

[1] samba-in-kubernetes/samba-container#160
martinpitt added a commit to martinpitt/bots that referenced this issue Nov 22, 2023
Commit 392d6b2 moved to quay.io/samba.org/samba-ad-server, but this
has a serious and difficult bug [1] with connecting to the Global
Directory.

Go back to the previous https://github.com/Fmstrat/samba-domain container. The
official dockerhub image actually works very well now, but we still have to
build ourselves due to the docker.io pull rate limits.

Also don't re-add the external volumes -- we are not interested in permanently
keeping any Samba data.

[1] samba-in-kubernetes/samba-container#160
martinpitt added a commit to cockpit-project/bots that referenced this issue Nov 22, 2023
(same commit message as above)

[1] samba-in-kubernetes/samba-container#160
@martinpitt
Author

We moved back to https://github.com/Fmstrat/samba-domain a while ago, and while that made things better, we still see this bug a lot. As it happens on two completely different OSes/Samba packagings (Fedora and Ubuntu), this looks like a regression in Samba itself. Our current container has Samba 4.15.13. But I still have no idea where to go from here.

@phlogistonjohn
Collaborator

> We moved back to https://github.com/Fmstrat/samba-domain a while ago, and while that made things better, we still see this bug a lot.

I'm sorry to hear that. Both for the change and for the issue.

> As it happens on two completely different OSes/Samba packagings (Fedora and Ubuntu), this looks like a regression in Samba itself. Our current container has Samba 4.15.13. But I still have no idea where to go from here.

It is certainly possible.

We build images tagged nightly that include nightly builds of samba master. Could you try quay.io/samba.org/samba-ad-server:nightly and see if the issue occurs there too? If so, we may want to report the issue at the samba bugzilla.
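
(Assuming the reproducer above, that should only require pulling the other tag and rerunning the same podman run command with it:)

podman pull quay.io/samba.org/samba-ad-server:nightly
# then rerun the 'podman run ...' from the reproducer with this image reference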

Also, sorry for the lack of response earlier. I saw this issue while I was on vacation and pinged my manager at work hoping he'd have someone else look into it. But I guess that didn't happen, and from my POV it fell through the cracks.

@martinpitt
Author

No worries at all @phlogistonjohn ! Thanks for the hint, I'll try the nightly image, in January (this is EOY for me as well). Happy holidays!
