test: Add hack to retry a failed AD joining #19638
Closed
There is a race condition with the current Samba AD container on the services image: sometimes joining the domain doesn't pick up the global directory server, and queries then fail with "SSSD is offline" / "Unspecified GSS failure".
This is too hard for us to track down ourselves; it needs help from [1]. In the meantime, leave and re-join the domain to give it another chance of succeeding. This avoids the extra I/O and CPU noise that goes along with the cockpit session (such as checking for package updates), and so has a higher chance of succeeding.
Note that joining AD via cockpit is still covered by test{Un,}QualifiedUser.
[1] samba-in-kubernetes/samba-container#160
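For illustration only, the leave-and-re-join workaround described above could be sketched roughly as follows. This is not the code from this PR: `join_with_retry` and its `join`, `leave`, and `is_online` parameters are hypothetical names, and the callables stand in for whatever the test actually runs on the machine (for example, wrappers around `realm join` and `realm leave`, which is an assumption).

```python
from typing import Callable


def join_with_retry(
    join: Callable[[], None],
    leave: Callable[[], None],
    is_online: Callable[[], bool],
    attempts: int = 2,
) -> int:
    """Join the domain, re-joining if the first attempt leaves it offline.

    Returns the number of the attempt that succeeded (1-based) and raises
    RuntimeError if the domain is still offline after all attempts.
    All three callables are hypothetical hooks supplied by the caller.
    """
    for attempt in range(1, attempts + 1):
        join()
        if is_online():
            return attempt
        # The join did not pick up the directory server; undo it so the
        # next iteration starts from a clean, un-joined state.
        leave()
    raise RuntimeError(f"domain still offline after {attempts} join attempts")
```

Passing the three steps in as callables keeps the retry logic independent of how the join is performed, so the same loop works whether the commands run over SSH on a test VM or locally.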
I hate this, but I've already sunk 3½ days into debugging this, and am none the wiser. This is beyond me.
This is one of our worst flakes right now: