
[question] Distributed controlplane #2350

Closed · till opened this issue Nov 2, 2022 · 9 comments
Labels: question (Further information is requested), Stale

Comments

@till (Contributor) commented Nov 2, 2022

Hi,

Was wondering if anyone has given any thought to a completely distributed controlplane setup?

As in, I want to run controlplane nodes across different providers/datacenters. Right now the loadbalancer setup, etc. seems to assume that I run a loadbalancer which proxies all my controlplane nodes. But how would this work across different providers (where I don't have the liberty of announcing IP space, giant mesh networks, etc.)?

Or let's say in an IoT use case where clusters are heavily distributed. Is there an alternative (proposed) setup for these cases? E.g. run lots of single-node k0s instances?

The alternative is running the controlplane in one place and workers in others, but then it's kinda like putting all 🥚 in one 🧺.

Thanks!

@jnummelin added the "question" label (Further information is requested) on Nov 2, 2022
@jnummelin (Member) commented:

Running a geo-distributed controlplane does impose some challenges, for example:

  • Etcd clustering
  • LB (as you mentioned too)
  • networking in general

I'll break this down in a bit more detail below.

As in many other cases in engineering, this boils down to compromises. 😄 What I'd propose, and what I've seen a lot of k0s deployments (IoT, manufacturing, etc. use cases) doing, is the following:

  • controlplane in a single "region", distributed across multiple "zones", to mitigate the etcd latency constraints
  • some sort of LB in front; in the cloud this is usually an ELB or the like
  • workers on "edge" networks
  • depending on the needs, Calico with WireGuard as the CNI layer (WireGuard provides encryption for the cluster networking); a rough config sketch follows below
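
For reference, a rough k0s.yaml sketch of that kind of setup could look something like the following. This is only a sketch: the addresses are placeholders, and the exact Calico/WireGuard field names should be double-checked against the k0s docs for your version.

```yaml
# k0s.yaml -- rough sketch only; hostnames/addresses are placeholders
apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
metadata:
  name: k0s
spec:
  api:
    # the single address workers use to reach the controlplane (LB, DNS, ...)
    externalAddress: k0s-api.example.com
    sans:
      - k0s-api.example.com
  network:
    provider: calico
    calico:
      mode: vxlan
      # WireGuard-based encryption for inter-node pod traffic
      wireguard: true
```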

Etcd

The issue is that etcd has some practical limits on latency when spread over a network, so it is most often deployed as a cluster within a single "region". To mitigate HA concerns you could run it e.g. as a 3-node setup where each node sits in a different "zone".
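
In k0s terms that would mean each of the three controllers advertises a zone-local address for etcd peering, roughly like this (sketch only; the IP is a placeholder and the field names should be verified against the k0s docs):

```yaml
# per-controller k0s.yaml fragment -- sketch only, IP is a placeholder
spec:
  storage:
    type: etcd
    etcd:
      # this controller's own address; each of the three controllers
      # sits in a different "zone" of the same "region"
      peerAddress: 10.0.1.10
```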

LB

LB is indeed a bit of a challenge, not only for this kind of deployment scenario but for k0s/k8s in general. The alternatives I can see are something like:

  • an LB in a single DC distributing to many DCs; this has the drawback of the LB being a kind of SPOF
  • distributed DNS with round-robin (or some health-check based balancing) backend selection; see the sketch below
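
For the DNS round-robin option, the k0s side would basically just be the shared DNS name as externalAddress plus all controller addresses in sans, something like this (sketch only; the name and IPs are placeholders):

```yaml
# sketch: api.example.com resolves (round-robin) to all controllers
spec:
  api:
    externalAddress: api.example.com
    sans:
      - api.example.com
      - 203.0.113.10   # controller in provider/DC A (placeholder)
      - 198.51.100.20  # controller in provider/DC B (placeholder)
      - 192.0.2.30     # controller in provider/DC C (placeholder)
```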

We're working on a solution in k0s itself to mitigate the need for an external LB. The WIP solution is to have each worker run a node-local LB (of sorts), with all the node-local components (kubelet, kube-proxy, etc.) connecting to the API via that. This provides HA connectivity to all controlplane nodes (mostly the kube API). We're hoping to get that into the 1.26 release.

@till (Contributor, author) commented Nov 2, 2022

@jnummelin yes, I'm aware of the etcd constraints, but I would probably use kine with either postgres/mysql or even nats?
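
Roughly something like this in k0s.yaml (sketch only; the hostname and DSN are placeholders, and the exact connection string format should be checked against the kine docs):

```yaml
# sketch: k0s with kine backed by an external postgres -- DSN is a placeholder
spec:
  storage:
    type: kine
    kine:
      dataSource: postgres://kine:secret@db.example.com:5432/kine?sslmode=verify-full
```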

@jnummelin (Member) commented:

In the case of mysql/postgres, if they can tolerate higher latencies etc., then it might work much better. (I'm definitely no expert in clustering those.)

So the main problem in your use case boils down to how to create a single address (an LB or something else) to which all node components can connect. That includes kubelet, kube-proxy, and konnectivity-agent. Konnectivity-agent is probably the most problematic one, due to how it manages HA connections: there has to be an LB of sorts, as the agent can only be configured with a single address (the LB) through which it expects to get "routed" to all the servers. I know this is a somewhat awkward way of doing it, and we've opened a discussion upstream to see if there are other ways to handle this.

@jnummelin (Member) commented:

One idea that came to mind for the "LB" is to utilize services like ngrok, Cloudflare Tunnel, etc. In a nutshell, each controller registers a (TCP) "tunnel" with the service (e.g. ngrok). A tunnel has a dedicated DNS name (I assume), which you'd use as externalAddress and in sans.

Note: I have not tested this approach at all, but conceptually it should work :)
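
So, conceptually (again untested, and the hostname below is just a placeholder), the k0s config would simply point at the tunnel endpoint:

```yaml
# sketch: a tunnel hostname used as the API address -- hostname is a placeholder;
# the tunnel's public TCP port would also have to forward to the kube API port (6443 by default)
spec:
  api:
    externalAddress: k0s-api.tunnel.example.net
    sans:
      - k0s-api.tunnel.example.net
```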

@till (Contributor, author) commented Nov 3, 2022

@jnummelin interesting, I hadn't thought about ngrok etc., but that would be interesting as well. Not sure if either of these can distribute requests between hosts though? I thought they were all about exposing a single host, but it looks like they have fallbacks in place.

I really like your idea of running local loadbalancers to basically ensure that local worker nodes use local controlplane nodes as well. Not entirely sure where it would abstract the "these are the cp nodes" part though. As in, would it only use local nodes, or prefer local and then fall back to cp nodes further away?

"Clustering" mysql/postgresql is not simple either, but maybe a replication topology would help, for example to be able to do proper backup and DR. It doesn't look like kine can split operations (yet), and I am not sure if e.g. MySQL Galera is the answer, or MaxScale (or whatever the MariaDB proxy was called). Maybe kine would take a contribution so it would act as a proxy layer itself, splitting reads to replicas and using the main server for INSERT/UPDATE/DELETE.

Besides latency: I think ~80ms is normal for mysql/postgresql if you run a distributed database setup in, say, AWS us-east and AWS us-west. I've seen these work even across continents (or oceans). Not entirely sure how k8s deals with the lag or if it matters. In the video I posted, they mention that nats is fine up to ~250ms.

@jnummelin (Member) commented:

> I really like your idea of running local loadbalancers to basically ensure that local worker nodes use local controlplane nodes as well. Not entirely sure where it would abstract the "these are the cp nodes" part though. As in, would it only use local nodes, or prefer local and then fall back to cp nodes further away?

In these sorts of deployment cases you'd probably run plain controller nodes (i.e. no --enable-worker); at least that would be my suggestion. 😄 k0s configures all the kube bits to talk to localhost only, so there's really no controller<->controller communication other than etcd replication, and for that we don't need any local LBs on pure controller nodes. For pure workers, we'd set this up so that it basically does round-robin across all controller nodes.

> Maybe kine would take a contribution so it would act as a proxy layer itself, splitting reads to replicas and using the main server for INSERT/UPDATE/DELETE.

As said, I'm no expert in MySQL/Postgres, but in many other data stores these sorts of things are often managed/configured at the driver level, i.e. the driver selects the "master" for writes and the replicas for read operations. But maybe the driver compiled into kine for MySQL does not support that.

@till (Contributor, author) commented Nov 5, 2022

It seems like r/w splitting is not supported by the Go drivers used by kine. I asked in k3s-io/kine#117.

I guess, depending on the SQL flavor, you can use something like:

  • pgpool/pgbouncer (postgresql)
  • MariaDB MaxScale / ProxySQL (mysql)

I remember pgpool also had a "nice" consensus algorithm which was painful to work with (in my experience). Not sure about the others.

My colleague also mentioned Tailscale. It could be helpful for making the controller nodes available globally.

@jnummelin (Member) commented:

> My colleague also mentioned Tailscale.

Tailscale is pretty cool in general, and it actually works with k0s in this sort of use case; we did a demo with it about a year ago. Of course, the Tailscale part is something you need to handle yourself, as k0s is not able to integrate with it directly.
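
Roughly, you'd put the controllers on the tailnet yourself and then use the Tailscale addresses (or MagicDNS names) in the k0s config, along these lines (sketch only; the names and 100.x IPs are placeholders):

```yaml
# sketch: controllers reachable over the tailnet -- names/IPs are placeholders
spec:
  api:
    externalAddress: k0s-api.tailnet-example.ts.net
    sans:
      - k0s-api.tailnet-example.ts.net
      - 100.64.0.1   # controller A's Tailscale IP (placeholder)
      - 100.64.0.2   # controller B's Tailscale IP (placeholder)
  storage:
    type: etcd
    etcd:
      # etcd peering over the tailnet as well
      peerAddress: 100.64.0.1
```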

@github-actions (bot) commented Dec 7, 2022

This issue has been marked as stale since no activity has been recorded in 30 days.

@github-actions bot added the Stale label Dec 7, 2022
@github-actions bot closed this as not planned (stale) Dec 14, 2022