test: Add hack to retry a failed AD joining #19638

Closed
martinpitt wants to merge 1 commit from ad-hack

@martinpitt martinpitt commented Nov 21, 2023

There is a race condition with the current Samba AD container on the services image: Sometimes joining the domain doesn't pick up the global directory server, and queries fail with "SSSD is offline" / "Unspecified GSS failure".

This is too hard for us to track down ourselves; it needs help from [1]. In the meantime, leave and re-join the domain to give it another chance. The retry avoids the extra I/O/CPU noise that goes along with the Cockpit session (such as checking for package updates), and so has a higher chance of succeeding.

Note that joining AD via cockpit is still covered by test{Un,}QualifiedUser.

[1] samba-in-kubernetes/samba-container#160
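
For illustration, the leave-and-re-join retry described above could look roughly like the sketch below. This is not the code from this PR: `join_with_retry` and its three callables are hypothetical names. In a real test they would presumably wrap commands such as `realm join`, `realm leave`, and a user lookup run on the test machine.

```python
import time
from typing import Callable


def join_with_retry(
    join: Callable[[], None],
    leave: Callable[[], None],
    is_online: Callable[[], bool],
    attempts: int = 2,
    delay: float = 0.0,
) -> int:
    """Join the domain, leaving and re-joining if the health check fails.

    All three callables are hypothetical hooks supplied by the caller.
    Returns the number of join attempts that were needed.
    Raises RuntimeError if the domain is still unusable after all attempts.
    """
    for attempt in range(1, attempts + 1):
        join()
        if is_online():
            return attempt
        # The join itself went through, but queries fail (for example with
        # "SSSD is offline"): leave the domain so the next attempt starts clean.
        leave()
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError(f"domain still unusable after {attempts} join attempts")
```

Every extra attempt adds a full join to the run time, so the test's timeout may need raising accordingly.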


I hate this, but I've already sunk 3½ days into debugging this, and am none the wiser. This is beyond me.

This is one of our worst flakes right now:

image

@martinpitt martinpitt added the flake unstable test label Nov 21, 2023

martinpitt commented Nov 21, 2023

There's more work to be done here (sigh), but that's a different test. It probably has the same root cause, though, and it is also already a known top flake.

@martinpitt martinpitt requested a review from jelly November 21, 2023 16:36
@martinpitt
Member Author

With the retry, the test can time out, so the timeout needs a bump.

Also, this really effing hates me

@martinpitt
Member Author

Idea for tomorrow: Search for a different Samba container, or possibly go back to https://github.com/Fmstrat/samba-domain

@martinpitt martinpitt removed the request for review from jelly November 21, 2023 16:48
@martinpitt martinpitt marked this pull request as draft November 21, 2023 16:50
@martinpitt
Member Author

We can use https://quay.io/repository/bedrock/ubuntu?tab=info as the Ubuntu base container; this seems to work fine. Running podman run -it --rm quay.io/bedrock/ubuntu gives a bog-standard jammy (current LTS) with current builds.

@martinpitt
Member Author

I sent cockpit-project/bots#5580 to replace the samba container. That looks more promising.

@martinpitt martinpitt closed this Nov 22, 2023
@martinpitt martinpitt deleted the ad-hack branch November 22, 2023 05:13