multi cluster support #44

Open

wants to merge 11 commits into main
Conversation

Member

@YangKeao YangKeao commented May 12, 2022

Signed-off-by: YangKeao [email protected]

rendered version

Signed-off-by: YangKeao <[email protected]>
Signed-off-by: YangKeao <[email protected]>
Member

@Andrewmatilde Andrewmatilde left a comment


IMO, we need more info about the dashboard, and about how to display the status of the Chaos Mesh instances in multiple clusters and communicate with them.

Member

@STRRL STRRL left a comment


That's awesome!

And maybe we should keep the name of "parent" consistent in future documentation, like "Management Cluster".

Comment on lines 43 to 44
install or upgrade the chaos mesh to v2.2.0 in the target cluster. The
`configOverride` allows to override the global configurations. The full
Member

> The configOverride allows to override the global configurations.

So the content of configOverride would be similar to the helm chart values?

What happens if Spec.configOverride changes?

Member Author

@YangKeao YangKeao May 24, 2022

It seems that the best way is to run `helm upgrade`. We could compare the configs, and if the config has changed, upgrade accordingly.

This feature is really needed, because we hope the user can update the config on the parent cluster and have it dispatched to the remote clusters automatically.
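For illustration, the drift check could hash the rendered values and compare against a hash recorded on the `RemoteCluster` object; a stdlib-only sketch (the function names and the recorded-hash idea are mine, not the RFC's — real code would then invoke the Helm SDK or `helm upgrade`):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configHash returns a stable fingerprint of the rendered Helm values.
// Recording it on the RemoteCluster object lets the reconciler detect drift.
func configHash(values string) string {
	sum := sha256.Sum256([]byte(values))
	return hex.EncodeToString(sum[:])
}

// needsUpgrade compares the recorded hash with the hash of the current
// Spec.configOverride; a mismatch means the reconciler should upgrade
// the installation in the remote cluster.
func needsUpgrade(recordedHash, currentValues string) bool {
	return recordedHash != configHash(currentValues)
}

func main() {
	old := configHash("controllerManager:\n  replicas: 1\n")
	fmt.Println(needsUpgrade(old, "controllerManager:\n  replicas: 1\n")) // false
	fmt.Println(needsUpgrade(old, "controllerManager:\n  replicas: 3\n")) // true
}
```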

Comment on lines 40 to 43
The `Spec` defines the cluster which we want to install the chaos mesh, and the
version of chaos mesh. It will install the same version of Chaos Mesh as the
controller’s, which means if the version of controller is v2.2.0, it will
install or upgrade the chaos mesh to v2.2.0 in the target cluster. The
Member

> if the version of controller is v2.2.0, it will install or upgrade the chaos mesh to v2.2.0 in the target cluster

IIUC, if the target RemoteCluster already contains a Chaos Mesh installation, we want to "align" the version to be the same as the "Parent Cluster".

What would happen if "RemoteCluster Version" > "ParentCluster Version"? Maybe we need some version skew policy like https://kubernetes.io/releases/version-skew-policy/?

Is it possible that I want to use a patched version on the RemoteCluster? I think that would be important when developing and debugging/profiling.

Member Author

> What would happen if "RemoteCluster Version" > "ParentCluster Version"? Maybe we need some version skew policy like https://kubernetes.io/releases/version-skew-policy/?

I don't think we will have enough energy to manage a long skew, so only the same (minor) version could be supported. I'll add this to the RFC.
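A sketch of such a same-minor policy check (the function and the policy details are my assumption, not part of the RFC yet):

```go
package main

import (
	"fmt"
	"strings"
)

// sameMinor reports whether two Chaos Mesh versions (e.g. "v2.2.0")
// share the same major and minor version — the only skew the parent
// cluster would accept under this policy; patch-level skew is allowed.
func sameMinor(a, b string) bool {
	pa := strings.SplitN(strings.TrimPrefix(a, "v"), ".", 3)
	pb := strings.SplitN(strings.TrimPrefix(b, "v"), ".", 3)
	if len(pa) < 2 || len(pb) < 2 {
		return false
	}
	return pa[0] == pb[0] && pa[1] == pb[1]
}

func main() {
	fmt.Println(sameMinor("v2.2.0", "v2.2.3")) // true: patch skew is fine
	fmt.Println(sameMinor("v2.3.0", "v2.2.0")) // false: minor skew is rejected
}
```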

> Is it possible that I want to use a patched version on the RemoteCluster? I think that would be important when developing and debugging/profiling.

You could use configOverride to override the image.
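For example, a hypothetical `RemoteCluster` manifest using `configOverride` to pin a patched image (all field names beyond `configOverride` are assumptions based on the RFC excerpts quoted above, and the image is illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: RemoteCluster
metadata:
  name: staging-cluster
spec:
  # how the kubeconfig is referenced is an assumption, not the RFC's wording
  kubeConfig:
    secretRef:
      name: staging-kubeconfig
      namespace: chaos-mesh
  configOverride:
    # same shape as the helm chart values; here a patched image for debugging
    controllerManager:
      image: ghcr.io/example/chaos-mesh:v2.2.0-debug
```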

Comment on lines +196 to +204
2. Why is the `remoteCluster` an extra field? Would a standalone resource /
annotation / … be better?

I can’t say which one is better. It’s a blind choice. I don’t want to create
a new embedded standalone resource because the existing two embedded
resources: Schedule and Workflow are too huge (but it’s actually not a strong
evidence). I prefer fields over annotations because fields sound more like an
official API, but annotations do not (or maybe we could start from
annotations? I’m not sure).
Member

I prefer to use annotations with v1alpha1. ❤️ No more changes on api/v1alpha1. 🤣🤣

Using annotations would cost more effort to implement the reconciler, but I think that would be OK/acceptable.

Comment on lines +150 to +151
For implementation, the chaos mesh will set up multiple “Managers”, one for each
cluster. These managers will set up controllers to synchronize the things.
Member

BTW, about the implementation: are these "Managers" actually controller-runtime Reconcilers?

Comment on lines 211 to 215
4. Where will the webhook run?

In the parent chaos mesh. Because we will not create chaos in the webhook
(but in the reconcile), the validation should be done in the parent’s
webhook.
Member

I am not actually sure about that. I prefer to keep all the webhooks for both "ParentCluster" and "RemoteCluster" because of the possible version skew.

Member

And it seems we need to prevent users from creating *Chaos in RemoteCluster?

Member Author

> I am not actually sure about that. I prefer to keep all the webhooks for both "ParentCluster" and "RemoteCluster" because of the possible version skew.

It's fine to run the duplicated webhook multiple times (as our webhook doesn't have side effects now 😃), so that doesn't mean we couldn't run the webhook in the remote cluster.

> And it seems we need to prevent users from creating *Chaos in RemoteCluster?

I don't think so. Creating *Chaos in the RemoteCluster doesn't break the whole design. We have a label/annotation in the remote cluster to mark it as managed by the one in the parent cluster (and it's already needed, for GC).
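Such an ownership marker could be as simple as a label check; a minimal sketch (the label key is a placeholder — the RFC doesn't name one):

```go
package main

import "fmt"

// managedByParent reports whether an object in the remote cluster was
// created by the parent cluster's controller, based on a marker label.
// The key "chaos-mesh.org/managed-by" is a placeholder, not the RFC's
// actual choice; GC would only collect objects carrying the marker.
func managedByParent(labels map[string]string) bool {
	_, ok := labels["chaos-mesh.org/managed-by"]
	return ok
}

func main() {
	fmt.Println(managedByParent(map[string]string{"chaos-mesh.org/managed-by": "parent"})) // true
	fmt.Println(managedByParent(map[string]string{}))                                      // false: user-created, skip GC
}
```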

will create a SubjectAccessReview to check whether the user is able to create
this object in the target cluster/namespace.

The only problem is that the username could be different among the clusters. To
Member

> The only problem is that the username could be different among the clusters.

Not only the username but also the user credential.

It requires users to provide different credentials for different clusters. It seems we would save the credentials into a userMap.

When the "Manager" syncs the *Chaos from the "ParentCluster" to the "RemoteCluster", it requires a kubeconfig with the user's credential. (Or maybe we just use the given kubeconfig 🤷‍♂️)

Member Author

> Not only the username but also the user credential.

I think Chaos Mesh itself could use a super powerful credential and impersonate anyone (with only the username) to create things.

Or is it possible to validate the authorization only in the parent cluster? As the chaos controller in the parent cluster could also create a SubjectAccessReview in the remote cluster. I think it's better, as the authorization error will block the user from creating a resource they don't actually have the privilege to create.
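For reference, the SubjectAccessReview the parent's controller could create in the remote cluster might look like this (the username and namespace are illustrative; the API group and fields are the standard `authorization.k8s.io/v1` ones):

```yaml
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  # the user as they are known in the remote cluster
  user: alice
  resourceAttributes:
    group: chaos-mesh.org
    resource: stresschaos
    verb: create
    namespace: chaos-testing
```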

Signed-off-by: YangKeao <[email protected]>
In the remote cluster, the chaos-controller-manager and chaos-daemon will be
responsible for injecting / recovering chaos

These two chaos-controller-manager could share a single binary, so the
Member

I don't think sharing a single binary is a good idea. The logic of `parent mode` is different from `child mode`; sharing a single binary will make chaos-controller-manager more complicated.

Member Author

Well, it is not that different actually. Just like how we can run the Schedule and Workflow controllers in a standalone binary, running them in the same binary doesn't bring too much trouble.

At least, I found I'll need to clarify the statement. It seems that the three modes overlap somewhat: the remote cluster's Chaos Mesh doesn't need to know it is a child, and the parent Chaos Mesh should also be able to inject chaos into the current (parent) cluster.
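Under that reading, the "modes" could reduce to which controllers a single binary starts; a hedged sketch (the mode and controller names are mine, not settled in the RFC):

```go
package main

import "fmt"

// controllersFor maps a deployment mode onto the set of controllers to
// start. Every mode keeps the ordinary chaos controllers — so a parent
// can still inject chaos into its own cluster — and only "parent" adds
// the remote-cluster management controllers.
func controllersFor(mode string) []string {
	base := []string{"chaos", "schedule", "workflow"}
	if mode == "parent" {
		return append(base, "remotecluster", "remote-sync")
	}
	return base // "child" and "single" run the same set in this sketch
}

func main() {
	fmt.Println(controllersFor("single"))
	fmt.Println(controllersFor("parent"))
}
```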

metadata:
  name: burn-cpu
spec:
  remoteCluster:
Member

Can remoteCluster be used as a selector instead of a single field?

Member Author

Indeed, we cannot assume that the different clusters are isomorphic, so if you try to run the same selector on different remote clusters, I doubt whether it's meaningful. Just like `containerNames`, which is also embarrassing, as its existence shows our assumption of isomorphism between pods.

And it's also hard to manage the status for a chaos spread across multiple different clusters.

Comment on lines +34 to +37
- type: Installed
  status: True
- type: Ready
  status: True
Contributor

Seems that `Ready` includes `Installed`.

The total architecture is really simple.

In the management (or so-called parent) cluster, the chaos-controller-manager
will be responsible for
Contributor

The parent cluster also manages the `PhysicalMachine`.


As the plan shown above, we don't need too much modification on dashboard. The
cluster installation status should be represented in `RemoteCluster`, and the
execution status of a standalone chaos should be recorded in the `.Status` of
Contributor

Suggested change:
- execution status of a standalone chaos should be recorded in the `.Status` of
+ execution status of a standalone chaos should be recorded in the `Status` of

Comment on lines +326 to +333
chaos-controller-manager could work in three different modes:

1. parent mode, like discussed in former section
2. child mode, like discussed in former section
3. single mode, just the normal one we are using now

(maybe these modes can be described through turning on/off some controllers /
webhooks)
Member

It seems there are no clear fields to describe the mode? And will we control the relationship between clusters? For example:

  • can a cluster be managed by different clusters? If not, how do we prevent this operation?
  • can a cluster be a parent and a child at the same time? If not, how do we prevent this operation?

6 participants