-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement keepalived load balancer #4344
Conversation
62f99f7
to
0d7de41
Compare
7c5ef55
to
0871a4b
Compare
7406008
to
d5797f1
Compare
cmd/controller/controller.go
Outdated
@@ -232,11 +232,16 @@ func (c *command) start(ctx context.Context) error { | |||
if c.SingleNode { | |||
return errors.New("control plane load balancing cannot be used in a single-node cluster") | |||
} | |||
if cplb.Type != v1beta1.CPLBTypeKeepalived { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should do this in config validation already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this is done in a separate commit because rewriting the git history was quite complex...
docs/cplb.md
Outdated
@@ -3,17 +3,23 @@ | |||
For clusters that don't have an [externally managed load balancer](high-availability.md#load-balancer) for the k0s | |||
control plane, there is another option to get a highly available control plane called control plane load balancing (CPLB). | |||
|
|||
CPLB allows automatic assigned of predefined IP addresses using VRRP across masters. | |||
CPLB has two features that often will be combined, but normally will be used together: VRRP Instances, which allows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CPLB has two features that often will be combined, but normally will be used together
This doesn't sound right 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't 😅 . Fixed
docs/cplb.md
Outdated
* If `VirtualServers` are used, the cluster configuration doesn't specify a non-empty | ||
[`spec.api.externalAddress`][specapi]. `VRRPInstances` are compatible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I don't really get what this means. Could you rephrase this a bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rephrased it, please review it again, I think it's better now but I'm not quite convinced to be honest
|
||
func (r *CPLBReconciler) watchAPIServers(ctx context.Context) { | ||
// Before starting check that the API is actually responding | ||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could utilize our internal watch
helper here? IMO would be much simpler and that already handles retries etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this as a separate commit. Can be squashed.
if err := k.configureDummy(); err != nil { | ||
return fmt.Errorf("failed to configure dummy interface: %w", err) | ||
} | ||
if err := k.Config.ValidateVRRPInstances(nil); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, IMO we need to hook validation to general config validation "phase"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, it's added as a separate commit because rewriting the git history wasn't easy...
Converting to draft while I address the concerns. |
// Wait for the supervisor to start keepalived before | ||
// watching for endpoint changes | ||
process := k.supervisor.GetProcess() | ||
for process == nil { | ||
k.log.Info("Waiting for keepalived to start") | ||
time.Sleep(5 * time.Second) | ||
process = k.supervisor.GetProcess() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might work without this loop, if the nil check would be moved into the update loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you feel strongly about it, I will change it, but I prefer it this way.
Even if the process is dead, s.GetProcess
won't return nil once keepalived is started the first time, it will just return a process with an invalid PID. Adding it into the loop means we're doing a check that will always return false. Also we don't care if the PID is of an older dead process because the PID is obtained AFTER writing the template which means the new process will be using the new file.
But if you feel very strongly about it I'll move the nil check inside, there aren't big consecuences anyway, maybe faster reloads when the cluster is bootstrapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you feel strongly about it, I will change it, but I prefer it this way.
Currently, the loop cannot be cancelled externally. That's why I figured that we could just inline the nil check below. Would be less code, too. I'm fine with keeping it, as long as it can be cancelled.
it will just return a process with an invalid PID
Right. I'd argue that this is a current shortcoming of Supervisor, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@twz123 I accidentally editted your comment with my reply 🤦
The reply is:
I added a limit of 6 times, that's 30 seconds and should be way more than enough time to start keepalived. It gets cancelled eventually, it's not the prettiest but should fix the problem.
Unfortunatelly I deleted your comment saying that it had to be possible to cancel it from the outside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunatelly I deleted your comment saying that it had to be possible to cancel it from the outside.
Restored it from the history 🙃
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
dd92d37
to
7eff376
Compare
pkg/apis/k0s/v1beta1/cplb.go
Outdated
TUNLBKind KeepalivedLBKind = "TUN" | ||
) | ||
|
||
type RealServer struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are they? Still not able to spot them 🙈
@@ -108,6 +140,10 @@ func (k *Keepalived) Start(_ context.Context) error { | |||
DataDir: k.K0sVars.DataDir, | |||
UID: k.uid, | |||
} | |||
|
|||
if len(k.Config.VirtualServers) > 0 { | |||
go k.watchReconcilerUpdates() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot spot it. Can you verify?
// Wait for the supervisor to start keepalived before | ||
// watching for endpoint changes | ||
process := k.supervisor.GetProcess() | ||
for process == nil { | ||
k.log.Info("Waiting for keepalived to start") | ||
time.Sleep(5 * time.Second) | ||
process = k.supervisor.GetProcess() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you feel strongly about it, I will change it, but I prefer it this way.
Currently, the loop cannot be cancelled externally. That's why I figured that we could just inline the nil check below. Would be less code, too. I'm fine with keeping it, as long as it can be cancelled.
it will just return a process with an invalid PID
Right. I'd argue that this is a current shortcoming of Supervisor, though.
Everything should be addressed now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last thing™. I think we can just omit the delay_loop if it's zero.
{{ if gt (len $RealServers) 0 }} | ||
{{ range .VirtualServers }} | ||
virtual_server {{ .IPAddress }} {{ $APIServerPort }} { | ||
delay_loop {{ .DelayLoop.Seconds }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delay_loop {{ .DelayLoop.Seconds }} | |
{{- if gt .DelayLoop.Seconds 0.0 }} | |
delay_loop {{ .DelayLoop.Seconds }} | |
{{- end }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't. In keepalived it defaults to 60. I set it to 0 because I think it doesn't make sense to delay it at all in CPLB because it's added after kubernetes.default.svc is reconciled and hence has passed all the relevant health local health checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't in keepalived it defaults to 60.
I see. Apparently I found yet another delay_loop in the keepalived codebase 😅
I set it to 0 because I think it doesn't make sense to delay it at all in CPLB because it's added after kubernetes.default.svc is reconciled and hence has passed all the relevant health local health checks.
I don't fully understand the implications, but a delay_loop of 0 is not a thing in keepalived (it will use the default of 60, then):
# /var/lib/k0s/bin/keepalived --dont-fork --use-file /run/k0s/keepalived.conf --no-syslog --log-console --log-detail --dump-conf -t
(/run/k0s/keepalived.conf: Line 33) number '0' outside range [0.000001, 18446744073709.551614]
(/run/k0s/keepalived.conf: Line 33) virtual server delay loop '0' invalid - ignoring
But letz address this in a subsequent PR 😄
cf8c331
to
1c9689a
Compare
And move validation to clusterconfig. Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
The field wasn't required and didn't serve any actual purpose, so remove it and auto generate it always. Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
Co-authored-by: Tom Wieczorek <[email protected]> Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <[email protected]>
Description
For easier review I split the PR in 5 commits:
1- Add API types and autogenerated code
2- Controller changes to make keepalived work
3- Updated inttest
4- Encapsulate keepalived config in a new struct
5- Documentations
I also found a typo so I sneaked in a tiny 1 line commit
This is part of #4181
Type of change
How Has This Been Tested?
Checklist: