Occasionally clients can't discover AD Global Catalog server #160
Comments
I tried to create a reproducer that runs on a standard RHEL 9.4 cloud image (as that's where it fails most often, but it also fails on C8S, Fedora 39, etc.). First, some prep:
Set up the Samba container:
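The actual prep commands were not preserved in this copy of the thread. A minimal sketch of what running the quay.io/samba.org/samba-ad-server image could look like (the container name and host networking are assumptions, and any realm/user provisioning the original prep did is omitted):

```shell
# Sketch only: run the Samba AD server container with podman.
# Flags are standard podman; realm configuration is not preserved in this thread.
IMAGE="quay.io/samba.org/samba-ad-server"
if command -v podman >/dev/null; then
    podman run -d --name samba-ad --network host "$IMAGE" || true
fi
```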
Now the AD client side:
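The client-side commands were likewise stripped here; a hedged sketch using realm(8) from realmd, assuming the cockpit.lan domain from this thread and a placeholder administrator password:

```shell
# Sketch: discover and join the AD domain from the client.
# "cockpit.lan" and the "alice" user come from this thread; the password is a placeholder.
DOMAIN="cockpit.lan"
if command -v realm >/dev/null; then
    realm discover "$DOMAIN" || true
    echo 'ADMIN_PASSWORD_PLACEHOLDER' | realm join -U Administrator "$DOMAIN" || true
    id alice || true    # verify the AD user resolves via SSSD
fi
```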
This succeeds.
The sssd log is rather empty:
All other log files look similar. So this clearly does not reproduce the actual flake/error, but I'm lost here. Do you have a hint how to fix this CLI reproducer? Once it works in general, I hope I can make it flake/error like our actual test (which is hard to debug, as there are so many moving parts). Thanks!
CC: @gd
There is a race condition with the current Samba AD container on the services image: sometimes joining the domain doesn't pick up the Global Catalog server, and queries fail with "SSSD is offline" / "Unspecified GSS failure". This is too hard for us to track down ourselves; it needs help from [1]. In the meantime, leave and re-join the domain to give it another chance of succeeding. This avoids the extra I/O/CPU noise that goes along with the cockpit session (such as checking for package updates), and has a higher chance of succeeding. Note that joining AD via cockpit is still covered by test{Un,}QualifiedUser. [1] samba-in-kubernetes/samba-container#160
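The leave-and-re-join workaround described above could be scripted roughly like this (a sketch, not the actual test code; the retry count and domain name are assumptions):

```shell
# Retry a command a fixed number of times; used to give `realm join` another
# chance when the Global Catalog lookup races.
retry() {
    tries=$1; shift
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" && return 0
        i=$((i + 1))
    done
    return 1
}

# Usage on the client would be roughly:
#   realm leave cockpit.lan || true
#   retry 3 realm join -U Administrator cockpit.lan
```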
Commit 392d6b2 moved to quay.io/samba.org/samba-ad-server, but this has a serious and difficult bug [1] with connecting to the Global Catalog. Go back to the previous https://github.com/Fmstrat/samba-domain container. The official Docker Hub image actually works very well now, but we still have to build it ourselves due to the docker.io pull rate limits. Also don't re-add the external volumes -- we are not interested in permanently keeping any Samba data. [1] samba-in-kubernetes/samba-container#160
We moved back to https://github.com/Fmstrat/samba-domain a while ago, and while that made things better, we still see this bug a lot. As it happens on two completely different OSes/Samba packagings (Fedora and Ubuntu), this looks like a regression in Samba itself. Our current container has Samba 4.15.13. But I still have no idea where to go from here.
I'm sorry to hear that. Both for the change and for the issue.
It is certainly possible. We build images tagged

Also sorry for the lack of response earlier. I saw this issue when I was on vacation and pinged my manager at work, hoping he'd have someone else look into it. But I guess that didn't happen, and from my POV it fell through the cracks.
No worries at all @phlogistonjohn! Thanks for the hint, I'll try the nightly image in January (this is EOY for me as well). Happy holidays!
I've been debugging a big Cockpit AD test flake for three days now, and still can't put my finger on it, so maybe you have an idea. This started to fail when we moved from https://github.com/Fmstrat/samba-domain/ to https://quay.io/repository/samba.org/samba-ad-server , i.e. the client side didn't change. What this test does is roughly this:

- Start the Samba AD server container on a "services" VM (f0.cockpit.lan), with exporting all ports.
- Start a client VM (x0.cockpit.lan) with realmd, adcli and such.
- Create an alice user in Samba AD on "services".
- Join the domain and check that the alice user is visible, i.e. id alice succeeds.

This works most of the time. After joining:
But in about 10% of local runs and 50% of runs in CI, it looks like this:
and /var/log/sssd/sssd_cockpit.lan.log has a similar error:
This is a race condition -- I can gradually strip down the test until it doesn't involve Cockpit at all any more -- the only effect that it has is to cause some I/O and CPU noise (like packagekit checking for updates). I can synthesize this with client-side commands like this:
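The actual noise commands were lost in this copy of the thread; representative stand-ins (assumptions, they merely generate background I/O and CPU load the way the original test does) might look like:

```shell
# Sketch: synthesize background CPU and I/O noise on the client, then attempt
# the join while the noise is running.
timeout 5 sh -c 'while :; do :; done' &                                # CPU noise
cpu_pid=$!
timeout 5 dd if=/dev/zero of=/tmp/noise bs=1M count=64 2>/dev/null &   # I/O noise
io_pid=$!

# ... run `realm join` here while the noise is active ...

kill "$cpu_pid" "$io_pid" 2>/dev/null || true
rm -f /tmp/noise
```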
This is cockpit test API lingo, but m.execute just runs a shell command on the client VM, while m.spawn() runs it in the background.

Do you happen to have any idea how to investigate further what exactly fails here?