multi cluster support #44

Open
wants to merge 11 commits into
base: main
349 changes: 349 additions & 0 deletions text/2022-05-12-multi-cluster-support.md
@@ -0,0 +1,349 @@
# Multi Cluster Support

## Summary

Multi-cluster injection is a huge feature that touches the full lifecycle of
Chaos Mesh, so this RFC is organized differently from usual. It walks through
the design in the order Deploy -> Usage -> Authentication -> Architecture, and
each section explains the reasoning behind the design and other considerations
in a Q&A form.

## Deploy

A new cluster-scoped resource, `RemoteCluster`, will be used to install and
manage Chaos Mesh on another cluster. This resource will be responsible for
installing, configuring, upgrading and uninstalling Chaos Mesh in that cluster.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: RemoteCluster
metadata:
  name: cluster-xxxxxxx
spec:
  namespace: "chaos-mesh"
  kubeConfig:
    secretRef:
      name: "cluster-xxx-kubeconfig"
      key: xxx
  configOverride:
    chaosDaemon:
      runtime: containerd
status:
  currentVersion: "v2.1.0"
  conditions:
    - type: Installed
      status: True
    - type: Ready
      status: True
```

Comment on lines +34 to +37
Contributor

It seems that the `Ready` condition already includes `Installed`.

The `Spec` defines the cluster in which we want to install Chaos Mesh. The
installed version always matches the controller’s: if the controller is v2.2.0,
it will install Chaos Mesh v2.2.0 in the target cluster. Only the same minor
version is supported (e.g. a parent cluster at "v2.2.0" could control a remote
cluster at "v2.2.1"). The `configOverride` field overrides the global
configuration. The full configuration is a combination of the default
configuration, the global configuration (provided through a constant ConfigMap)
and the override. We should ship a trimmed-down default configuration with the
controller, as there
are too many things not needed in the remote chaos mesh.
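
The layering described above (default, then global, then per-cluster override) could be sketched as a recursive map merge. This is only an illustration under assumptions: the `Config` type, the `mergeConfig` helper, and the sample keys in the usage note are hypothetical and are not defined by this RFC.

```go
package main

// Config is a nested configuration tree, e.g. decoded helm values.
// It is a type alias so plain map[string]interface{} values also work.
type Config = map[string]interface{}

// mergeConfig returns a new Config in which later layers win.
// Nested maps are merged key by key; any other value is replaced wholesale.
// The input layers are never modified.
func mergeConfig(layers ...Config) Config {
	out := Config{}
	for _, layer := range layers {
		for k, v := range layer {
			sub, isMap := v.(Config)
			prev, prevIsMap := out[k].(Config)
			switch {
			case isMap && prevIsMap:
				// Both sides are maps: merge them recursively.
				out[k] = mergeConfig(prev, sub)
			case isMap:
				// Copy the nested map so the input is not aliased.
				out[k] = mergeConfig(sub)
			default:
				out[k] = v
			}
		}
	}
	return out
}
```

Calling `mergeConfig(defaults, global, override)` with an override of `chaosDaemon.runtime: containerd` would keep every other `chaosDaemon` key from the earlier layers and replace only `runtime`.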

The `Status` shows the current status of the target cluster. All of these
fields can be read from the target cluster, either through the helm package or
from the pods’ conditions. The condition `Installed` is True iff the release has
been installed in the target cluster and namespace. The condition `Ready` is
True iff all pods are ready. The `currentVersion` in the status gives us a
chance to automatically migrate the configuration.

The overall workflow can be described as follows:

```
User creates RemoteCluster -> RemoteCluster controller Reconcile

User removes the RemoteCluster -> RemoteCluster controller Reconcile

Pods in the target cluster/namespace changed -> RemoteCluster controller Reconcile (resources will be mapped to the `RemoteCluster`)

RemoteCluster controller Reconcile
  -> If the resource is being deleted
    -> Helm list the installed release in remote cluster
    -> If the chaos-mesh chart is not installed, and there is no chaos using this cluster
       remove the finalizer
    -> Apply the chaos and return
  -> If the finalizer doesn't exist, add a finalizer
  -> Helm list the installed release/chart in remote cluster
  -> If the chaos-mesh chart is not installed, and the `RemoteCluster` itself is not being
     deleted, install the Chaos Mesh through helm and list the helm release again
  -> If the chaos-mesh chart is installed, `Installed` conditions turn to true, else, turn to false
  -> If the chaos-mesh chart is installed, but the config is different from the current merged config, run helm upgrade
  -> If all pods of target Chaos Mesh are running, `Ready` conditions turn to true, else, turn to false
```
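
The decision logic of the reconcile loop above can be sketched as a pure function that maps one observation of the remote cluster to a list of actions and the two conditions. This is a rough sketch, not the controller's actual code: `RemoteState`, `Plan`, `planReconcile` and the action names are hypothetical, and the re-listing after an install is left to the next reconcile pass.

```go
package main

// RemoteState is what a single reconcile pass observes.
// All field names are assumptions made for this sketch.
type RemoteState struct {
	Deleting       bool // the RemoteCluster resource is being deleted
	HasFinalizer   bool // the finalizer is already present
	ChartInstalled bool // result of listing helm releases in the remote cluster
	ChaosInUse     bool // some chaos still targets this cluster
	ConfigDrifted  bool // installed values differ from the merged config
	AllPodsReady   bool // all Chaos Mesh pods in the remote cluster are ready
}

// Plan is the outcome of one pass: actions to run and conditions to record.
type Plan struct {
	Actions   []string
	Installed bool
	Ready     bool
}

// planReconcile follows the workflow step by step without touching a cluster.
func planReconcile(s RemoteState) Plan {
	var p Plan
	if s.Deleting {
		// The finalizer is only released once nothing is left to clean up.
		if !s.ChartInstalled && !s.ChaosInUse {
			p.Actions = append(p.Actions, "remove-finalizer")
		}
		return p
	}
	if !s.HasFinalizer {
		p.Actions = append(p.Actions, "add-finalizer")
	}
	if !s.ChartInstalled {
		p.Actions = append(p.Actions, "helm-install")
	} else if s.ConfigDrifted {
		p.Actions = append(p.Actions, "helm-upgrade")
	}
	// Conditions reflect what was observed in this pass.
	p.Installed = s.ChartInstalled
	p.Ready = s.AllPodsReady
	return p
}
```

Keeping the decision separate from the helm and Kubernetes calls makes each branch of the workflow easy to test in isolation.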

### Q&A

1. Why do we need to manage the installation of Chaos Mesh?

Consider a normal SaaS provider, a typical user of Chaos Mesh, who needs to
automatically create Kubernetes clusters for their users. To use Chaos Mesh,
they also need to install Chaos Mesh in each cluster that the software creates
automatically. It’s inconvenient to deploy Chaos Mesh manually in this
situation.

Instead of every such user writing their own program to manage the
installation of Chaos Mesh, we could provide an official way to install /
upgrade / uninstall Chaos Mesh.

2. Why do we define a new API to manage the deployment, rather than a go package
or cli tools?

Because the k8s API is just intuitive and easy to integrate, and it’s only
one step further than a go package or cli tools.

3. Why is the `RemoteCluster` resource cluster scoped?

Different *Chaos (e.g. NetworkChaos) will want to inject chaos in the same
cluster. If the `RemoteCluster` is namespace scoped, and we have to allow
selecting a remote cluster across the namespaces, then the namespace itself
is meaningless.

4. Where will the controller exist?

In the chaos-controller-manager.

5. What will it install?

A normal Chaos Mesh, with some controllers (like dashboard, workflow,
schedule) disabled.

6. When will the cluster be upgraded?

As discussed with @iguoyr, automatic upgrading is dangerous behavior. The
controller will never upgrade the remote cluster automatically, unless the
user adds an annotation to tell us they want to upgrade the cluster. We could
also provide a global configuration to always upgrade the cluster
automatically.

If the remote cluster is running an outdated version, all functions except
upgrade and uninstallation (like modifying the configuration) will not
work.

Upgrading the cluster will also `replace` the CRD.
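
The version rule (same minor version only, and an out-of-date remote accepts nothing but upgrade and uninstall) could be checked roughly as follows. This is a sketch under assumptions: the helper names and the operation strings are hypothetical, and any minor-version mismatch is treated as "outdated", which the RFC does not spell out.

```go
package main

import (
	"fmt"
	"strings"
)

// majorMinor extracts "2.2" from a version such as "v2.2.1".
func majorMinor(version string) (string, error) {
	parts := strings.Split(strings.TrimPrefix(version, "v"), ".")
	if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
		return "", fmt.Errorf("malformed version %q", version)
	}
	return parts[0] + "." + parts[1], nil
}

// sameMinor reports whether parent and remote share major and minor version.
func sameMinor(parent, remote string) (bool, error) {
	p, err := majorMinor(parent)
	if err != nil {
		return false, err
	}
	r, err := majorMinor(remote)
	if err != nil {
		return false, err
	}
	return p == r, nil
}

// operationAllowed applies the rule from this section: on a version mismatch
// only "upgrade" and "uninstall" remain available (operation names assumed).
func operationAllowed(op, parent, remote string) (bool, error) {
	ok, err := sameMinor(parent, remote)
	if err != nil {
		return false, err
	}
	if ok {
		return true, nil
	}
	return op == "upgrade" || op == "uninstall", nil
}
```

For example, a parent at "v2.2.0" and a remote at "v2.2.1" would pass `sameMinor`, while a remote at "v2.1.0" would only be allowed the upgrade and uninstall operations.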

## Inject

To inject chaos in another cluster, add a `cluster` and `namespace` field to
every selector. For example:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: burn-cpu
spec:
  remoteCluster:
    cluster: cluster-xxxxxxx
    namespace: remote-cluster-test
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "tikv"
  stressors:
    cpu:
      workers: 1
      load: 100
      options: ["--cpu 2", "--timeout 600", "--hdd 1"]
  duration: "30s"
```

Member

Can remoteCluster be used as a selector instead of a single field?

Member Author

Indeed, we cannot assume the different clusters are isomorphic, so if you are trying to run the same selector on different remote clusters, I doubt whether it's meaningful. Just like the containerNames, which is also embarrassing, as its existence shows our assumption of isomorphism between pods.

And it's also hard to manage the status for chaos in multiple different clusters.

For a chaos whose `remoteCluster` is not nil, the controller will synchronize
this resource with the target namespace in the target cluster. Synchronization
runs in two directions: the spec is distributed from the parent to the remote
cluster, and the status is synced from the remote cluster back to the parent.
As we don’t support modifying the spec yet, the spec half of the
synchronization is executed only once. The remote chaos resource will be
created with some annotations to help the controller find the parent (in the
parent cluster).

For implementation, Chaos Mesh will set up multiple “Managers”, one for each
cluster. These managers will set up controllers to perform the synchronization.

Comment on lines +166 to +167
Member

BTW, about the implementation, are these "Managers" actually controller-runtime Reconcilers?


Creation and deletion are straightforward. Another thing to consider is GC.
As there is nothing like `ownerReference` across clusters, we will need to
handle unexpected situations, e.g. the parent is deleted without
notification… We should also provide a way to force-remove both parent and
remote (maybe propagating the annotation is enough).

The overall workflow can be described as follows:

```
User creates chaos -> if this chaos contains a `remoteCluster` field, the remote chaos controller reconciles

User deletes chaos -> if this chaos contains a `remoteCluster` field, the remote chaos controller reconciles

Chaos in remote cluster changed -> the remote chaos controller reconciles (resources will be mapped to the chaos in parent cluster)

Controller Reconcile
  -> If the Target Cluster is not ready, do nothing and return
  -> If the chaos is being deleted
    -> If the chaos in the remote cluster doesn't exist, remove the finalizer
    -> Apply and return
  -> If the finalizer doesn't exist, add a finalizer
  -> Get the chaos in remote cluster
  -> If the chaos in the remote cluster doesn't exist, create the chaos in the remote cluster
  -> If the chaos in the remote cluster exists, read the `Status` of the chaos in the remote cluster
     and update the `Status` of the chaos in the parent cluster
```
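
The synchronization workflow above can be simulated in memory to make the ordering of the steps concrete. This is a sketch only: the `Chaos` and `RemoteCluster` types, the returned step names, and in particular the deletion of an existing remote copy while the parent is being deleted are assumptions, since the workflow does not say what happens in that branch.

```go
package main

// Chaos is a minimal stand-in for a chaos resource (hypothetical type).
type Chaos struct {
	Name         string
	Deleting     bool
	HasFinalizer bool
	Status       string
}

// RemoteCluster is an in-memory stand-in for the target cluster/namespace.
type RemoteCluster struct {
	Ready bool
	Chaos map[string]*Chaos // remote copies, keyed by name
}

// reconcileRemoteChaos performs one pass of the workflow and returns the
// step it took, so callers can observe the progression.
func reconcileRemoteChaos(parent *Chaos, remote *RemoteCluster) string {
	if !remote.Ready {
		return "waiting" // target cluster not ready: do nothing
	}
	if remote.Chaos == nil {
		remote.Chaos = map[string]*Chaos{}
	}
	remoteCopy, exists := remote.Chaos[parent.Name]
	if parent.Deleting {
		if !exists {
			parent.HasFinalizer = false
			return "finalizer-removed"
		}
		// Assumption: the remote copy is deleted first; the finalizer is
		// released on a later pass once the copy is gone.
		delete(remote.Chaos, parent.Name)
		return "remote-deleted"
	}
	parent.HasFinalizer = true
	if !exists {
		// The spec is distributed only once, at creation time.
		remote.Chaos[parent.Name] = &Chaos{Name: parent.Name}
		return "remote-created"
	}
	// The status flows back from the remote copy to the parent.
	parent.Status = remoteCopy.Status
	return "status-synced"
}
```

Running the function repeatedly on the same parent would go through "remote-created", then "status-synced" on every later change, and finally "remote-deleted" followed by "finalizer-removed" once the parent is deleted.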

### Q&A

1. Why should we install the chaos mesh controller in every cluster? Wouldn’t it
be better to manage all chaos in a single cluster (even process)?

It sounds great to listen and respond to every chaos in a single process,
which would mean we only need to install chaos daemon in the target cluster.
However, the reality is that there is no common way to communicate with the
chaos daemon in another cluster. I have thought about proxy, port-forward or
other things, but nothing is reliable enough.

I also don’t want to depend on the user to tell us the network topology: the
clusters are in the same network / different network / proxy / ingress / … .
It’s too hard for both us and the users. If we operate the chaos through
another controller in the target cluster, we only need an available
kubeconfig, which is much simpler than the network configuration.

2. Why is the `remoteCluster` an extra field? Would a standalone resource /
annotation / … be better?

I can’t say which one is better. It’s a blind choice. I don’t want to create
a new embedded standalone resource because the two existing embedded
resources, Schedule and Workflow, are too huge (though that is not strong
evidence). I prefer fields over annotations because fields sound more like an
official API, while annotations do not (or maybe we could start from
annotations? I’m not sure).

Comment on lines +212 to +220
Member

I prefer to use annotations with v1alpha1. ❤️ No more changes on api/v1alpha1. 🤣🤣

Using annotations would cost more effort when implementing the reconciler, but I think that would be OK/acceptable.


3. Can one chaos inject fault across different clusters?

Obviously, no. I don’t think we can provide a general way to implement this
in the near future.

4. (For a chaos with `remoteCluster`) Where will the webhook run?

In the parent chaos mesh. Because we will not create chaos in the webhook
(but in the reconcile), the validation should be done in the parent’s
webhook.

5. What about chaos events?

Thanks to @xlgao-zju, we have stored some events inside the chaos status, so
we don’t need to synchronize the events between two clusters.

## Authentication / Authorization

The Chaos Mesh project has paid a lot of attention to Authentication and
Authorization. In most situations, we choose to follow the RBAC pattern, and
multi-cluster injection will follow it too. As we already do in
validate_auth.go, if `remoteCluster` is not nil, a webhook will create a
`SubjectAccessReview` to check whether the user is able to create this object
in the target cluster/namespace.

The only problem is that the username could be different among the clusters. To
solve this problem, we could provide a `userMap` configuration that lets users
map usernames across different clusters. The `userMap` should be stored inside
a ConfigMap. The `cluster` field in `userMap` should allow globs.

```yaml
userMap:
  - originalName: yangkeao
    cluster: "*"
    name: admin
```

Member

> The only problem is that the username could be different among the clusters.

Not only the username but also the user credential.

It requires users to provide different credentials for different clusters. It seems we would save the credentials into userMap.

When the "Manager" syncs the *Chaos from "ParentCluster" to "RemoteCluster", it requires a certain kubeconfig with the users' credentials. (Or maybe we just use the given kubeconfig 🤷‍♂️)

Member Author

> Not only the username but also the user credential.

I think chaos mesh itself could use a super powerful credential, and impersonate anyone (with only the username) to create things.

Or is it possible to validate the authorization only in the parent cluster? As the chaos controller in the parent cluster could also create a SubjectAccessReview in the remote cluster. I think it's better, as the authorization error will block the user from creating the resource (which they don't actually have the privilege to create).

The webhook should also create a `SubjectAccessReview` in the remote cluster to
check whether the remote user is privileged enough to create the chaos. This
behavior makes it possible for the user to know the authorization error while
creating the chaos, but not after creation.
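
Resolving the remote username from a `userMap` with glob-matched clusters could look roughly like this, using the standard library's `path.Match` for the glob. This is a sketch under assumptions: the `UserMapping` type and `resolveRemoteUser` helper are hypothetical, the first matching entry wins, and an unmapped user keeps their original name; the RFC specifies none of these details.

```go
package main

import "path"

// UserMapping mirrors one entry of the userMap (hypothetical Go shape).
type UserMapping struct {
	OriginalName string // username in the parent cluster
	Cluster      string // glob over remote cluster names, e.g. "*" or "prod-*"
	Name         string // username to use in the matching remote clusters
}

// resolveRemoteUser returns the username to check in the given remote cluster.
// The first entry matching both the user and the cluster wins.
func resolveRemoteUser(userMap []UserMapping, user, cluster string) string {
	for _, m := range userMap {
		if m.OriginalName != user {
			continue
		}
		// path.Match implements shell-style globs; a malformed pattern
		// returns an error and is skipped here.
		if ok, err := path.Match(m.Cluster, cluster); err == nil && ok {
			return m.Name
		}
	}
	// Assumption: without a mapping, the username is used unchanged.
	return user
}
```

The returned name would then be the user placed in the `SubjectAccessReview` sent to the remote cluster.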

We should also provide an option to stop the webhook (in helm), as it will truly
bring some inconvenience to the users. If the users don’t care about
Authentication and Authorization, they can turn off this feature.

### Q&A

1. Why don’t we build a standalone authentication solution?

This request has never gone away since the first day we started considering
authentication. Subjectively, as a developer, I don’t want to build or ship
software with an awful experience and strange configuration. Objectively, as a
user, I don’t want to learn or manage the authentication for every single
application on my cluster. It would be a disaster to create users / teams /
groups in every application. Obviously, Chaos Mesh shouldn’t be the bad guy.

If there are some needs which cannot be solved through RBAC, I don’t regard
them as common (or suitable) needs, e.g. you want to list every namespace for
which you have privileges…

2. Why don't we store the user information inside the chaos?

Yes. We could store the username inside the chaos annotation, and during the
Reconcile, we impersonate the user. It is possible, but doesn’t solve the
problem better. With webhook, we can tell the user that you are not
privileged enough to create the resource at the creation time, which is
really expected (like every resource in a single cluster).

3. What about the `Workflow` and `Schedule`?

The `Workflow` can create chaos in different clusters, and it’s the only way
to orchestrate the chaos in different clusters. We could get a list of
affected clusters/namespaces, and evaluate the privilege.

4. What about creating a CRD for `userMap`?

I think a ConfigMap is enough. A standalone CRD cannot bring more benefits.

5. `userMap` is ugly!

Yes. It reminds me of uid_map in the linux user_namespace, which is also an
ugly thing. If you have a better idea, please tell me.

## Architecture

The total architecture is really simple.

In the management (or so-called parent) cluster, the chaos-controller-manager
will be responsible for

1. deploying chaos mesh to other clusters
2. validating chaos
3. authorization
4. controlling workflow and schedule
5. synchronizing the chaos status
6. normal chaos injecting / recovering, as the user may want to inject into the
current management cluster

Contributor

the parent cluster also manages the PhysicalMachine

In the remote cluster, the chaos-controller-manager and chaos-daemon will be
responsible for injecting / recovering chaos.

These two chaos-controller-managers could share a single binary, so the
chaos-controller-manager could work in three different modes:

1. parent mode, as discussed in the previous section
2. child mode, as discussed in the previous section
3. single mode, just the normal one we are using now

Member

I don't think sharing a single binary is a good idea. The logic of `parent mode` is different from `child mode`; if they share a single binary, it will make chaos-controller-manager more complicated.

Member Author

Well. It is not that different actually. Just like we can run Schedule and Workflow controller in a standalone binary, but running them in the same binary doesn't bring too much trouble.

At least, I found I'll need to clarify the statement. It seems that three modes are somehow duplicated, the remote cluster chaos mesh doesn't need to know itself is a child, and the parent chaos mesh should also be able to inject chaos into current (parent) cluster.

(maybe these modes can be described through turning on/off some controllers /
webhooks)
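
One way to read the idea of describing modes as controller / webhook toggles is a simple lookup from mode to enabled components. The sketch below is purely illustrative: the component names and the assignment of components to modes are a reading of the responsibility lists in this section, not something the RFC defines.

```go
package main

// Mode names follow the three modes listed above.
type Mode string

const (
	ParentMode Mode = "parent"
	ChildMode  Mode = "child"
	SingleMode Mode = "single"
)

// enabledComponents returns which controllers / webhooks would be switched on
// in a given mode. Component names are hypothetical.
func enabledComponents(m Mode) map[string]bool {
	switch m {
	case ParentMode:
		// The parent deploys, validates, authorizes, orchestrates, syncs
		// status, and can still inject into its own cluster.
		return map[string]bool{
			"remotecluster":      true,
			"validation-webhook": true,
			"auth-webhook":       true,
			"workflow":           true,
			"schedule":           true,
			"status-sync":        true,
			"chaos-injection":    true,
		}
	case ChildMode:
		// The remote side only injects and recovers chaos.
		return map[string]bool{"chaos-injection": true}
	case SingleMode:
		// Today's behavior: everything except the multi-cluster parts.
		return map[string]bool{
			"validation-webhook": true,
			"auth-webhook":       true,
			"workflow":           true,
			"schedule":           true,
			"chaos-injection":    true,
		}
	}
	return nil // unknown mode: nothing enabled
}
```

Under this reading a mode is just a preset of toggles, so a deployment could also enable or disable individual components directly.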
Comment on lines +326 to +333
Member

It seems there are no clear fields to describe the mode? And will we control the relationship between clusters, for example:

  • can a cluster be managed by different clusters? if not, how do we prevent this operation?
  • can a cluster be a parent and a child at the same time? if not, how do we prevent this operation?


## Dashboard

As the plan above shows, we don't need much modification to the dashboard. The
cluster installation status should be represented in `RemoteCluster`, and the
execution status of a standalone chaos should be recorded in the `.Status` of
every chaos. The `Workflow` and `Schedule` don't exist in the remote cluster.

Contributor

Suggested change
- execution status of a standalone chaos should be recorded in the `.Status` of
+ execution status of a standalone chaos should be recorded in the `Status` of

Given this design, all information about the execution of the current chaos can
be read in the current (parent) cluster, and shown in the dashboard like a
normal chaos.

The only modification is to use the events recorded in the chaos, rather than
the events in the Kubernetes cluster. As the events in the chaos status and the
events in the Kubernetes cluster are totally redundant now, we should pick one of
them.