multi cluster support #44

Open

wants to merge 11 commits into main
Conversation

Member

@YangKeao YangKeao commented May 12, 2022

Signed-off-by: YangKeao [email protected]

rendered version

Signed-off-by: YangKeao <[email protected]>
Signed-off-by: YangKeao <[email protected]>
Member

@Andrewmatilde Andrewmatilde left a comment


IMO, we need more info about the dashboard, and about how to display the status of the Chaos Mesh instances in multiple clusters and communicate with them.

Member

@STRRL STRRL left a comment


That's awesome!

And maybe we should keep the name of "parent" consistent in future documentation, like "Management Cluster".

Comment on lines 43 to 44
install or upgrade the chaos mesh to v2.2.0 in the target cluster. The
`configOverride` allows to override the global configurations. The full
Member

> The configOverride allows to override the global configurations.

So the content of configOverride would be similar to the helm chart values?

What happens if Spec.configOverride changes?

Member Author

@YangKeao YangKeao May 24, 2022

It seems that the best way is to run `helm upgrade`. We could compare the configs, and if the config has changed, upgrade accordingly.

This feature is really needed, because we hope the user can update the config on the parent cluster and have it dispatched to the remote clusters automatically.
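For illustration, the drift check could hash the rendered values and compare against a hash recorded on the `RemoteCluster` object; a stdlib-only sketch (the function names and the recorded-hash idea are mine, not the RFC's — real code would then invoke the Helm SDK or `helm upgrade`):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configHash returns a stable fingerprint of the rendered Helm values.
// Recording it on the RemoteCluster object lets the reconciler detect drift.
func configHash(values string) string {
	sum := sha256.Sum256([]byte(values))
	return hex.EncodeToString(sum[:])
}

// needsUpgrade compares the recorded hash with the hash of the current
// Spec.configOverride; a mismatch means the reconciler should upgrade
// the installation in the remote cluster.
func needsUpgrade(recordedHash, currentValues string) bool {
	return recordedHash != configHash(currentValues)
}

func main() {
	old := configHash("controllerManager:\n  replicas: 1\n")
	fmt.Println(needsUpgrade(old, "controllerManager:\n  replicas: 1\n")) // false
	fmt.Println(needsUpgrade(old, "controllerManager:\n  replicas: 3\n")) // true
}
```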

Comment on lines 40 to 43
The `Spec` defines the cluster which we want to install the chaos mesh, and the
version of chaos mesh. It will install the same version of Chaos Mesh as the
controller’s, which means if the version of controller is v2.2.0, it will
install or upgrade the chaos mesh to v2.2.0 in the target cluster. The
Member

> if the version of controller is v2.2.0, it will install or upgrade the chaos mesh to v2.2.0 in the target cluster

IIUC, if the target RemoteCluster already contains a Chaos Mesh installation, we want to "align" the version to be the same as the "Parent Cluster".

What would happen if "RemoteCluster Version" > "ParentCluster Version"? Maybe we need some version skew policy like https://kubernetes.io/releases/version-skew-policy/?

Is it possible that I want to use a patched version on the RemoteCluster? I think that would be important when developing and debugging/profiling.

Member Author

> What would happen if "RemoteCluster Version" > "ParentCluster Version"? Maybe we need some version skew policy like https://kubernetes.io/releases/version-skew-policy/?

I don't think we will have enough energy to manage a long skew, so only the same (minor) version could be supported. I'll add this to the RFC.
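A sketch of such a same-minor policy check (the function and the policy details are my assumption, not part of the RFC yet):

```go
package main

import (
	"fmt"
	"strings"
)

// sameMinor reports whether two Chaos Mesh versions (e.g. "v2.2.0")
// share the same major and minor version — the only skew the parent
// cluster would accept under this policy; patch-level skew is allowed.
func sameMinor(a, b string) bool {
	pa := strings.SplitN(strings.TrimPrefix(a, "v"), ".", 3)
	pb := strings.SplitN(strings.TrimPrefix(b, "v"), ".", 3)
	if len(pa) < 2 || len(pb) < 2 {
		return false
	}
	return pa[0] == pb[0] && pa[1] == pb[1]
}

func main() {
	fmt.Println(sameMinor("v2.2.0", "v2.2.3")) // true: patch skew is fine
	fmt.Println(sameMinor("v2.3.0", "v2.2.0")) // false: minor skew is rejected
}
```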

> Is it possible that I want to use a patched version on the RemoteCluster? I think that would be important when developing and debugging/profiling.

You could use configOverride to override the image.
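For example, a hypothetical `RemoteCluster` manifest using `configOverride` to pin a patched image (all field names beyond `configOverride` are assumptions based on the RFC excerpts quoted above, and the image is illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: RemoteCluster
metadata:
  name: staging-cluster
spec:
  # how the kubeconfig is referenced is an assumption, not the RFC's wording
  kubeConfig:
    secretRef:
      name: staging-kubeconfig
      namespace: chaos-mesh
  configOverride:
    # same shape as the helm chart values; here a patched image for debugging
    controllerManager:
      image: ghcr.io/example/chaos-mesh:v2.2.0-debug
```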

Comment on lines +196 to +204
2. Why is the `remoteCluster` an extra field? Would a standalone resource /
annotation / … be better?

I can’t say which one is better. It’s a blind choice. I don’t want to create
a new embedded standalone resource because the existing two embedded
resources: Schedule and Workflow are too huge (but it’s actually not a strong
evidence). I prefer fields over annotations because fields sound more like an
official API, but annotations do not (or maybe we could start from
annotations? I’m not sure).
Member

I prefer to use annotations with v1alpha1. ❤️ No more changes on api/v1alpha1. 🤣🤣

Using annotations would cost more effort to implement the reconciler, but I think that would be OK/acceptable.

Comment on lines +150 to +151
For implementation, the chaos mesh will set up multiple “Managers”, one for each
cluster. These managers will set up controllers to synchronize the things.
Member

BTW, about the implementation: are these "Managers" actually controller-runtime Reconcilers?

Comment on lines 211 to 215
4. Where will the webhook run?

In the parent chaos mesh. Because we will not create chaos in the webhook
(but in the reconcile), the validation should be done in the parent’s
webhook.
Member

I am not actually sure about that. I prefer to keep all the webhooks for both "ParentCluster" and "RemoteCluster" because of the possible version skew.

Member

And it seems we need to prevent users from creating *Chaos in RemoteCluster?

Member Author

> I am not actually sure about that. I prefer to keep all the webhooks for both "ParentCluster" and "RemoteCluster" because of the possible version skew.

It's fine to run the duplicated webhook multiple times (as our webhook doesn't have side effects now 😃), so that doesn't mean we couldn't run the webhook in the remote cluster.

> And it seems we need to prevent users from creating *Chaos in RemoteCluster?

I don't think so. Creating *Chaos in the RemoteCluster doesn't break the whole design. We have a label/annotation in the remote cluster to mark it as managed by the one in the parent cluster (and it's already needed, for GC).
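Such an ownership marker could be as simple as a label check; a minimal sketch (the label key is a placeholder — the RFC doesn't name one):

```go
package main

import "fmt"

// managedByParent reports whether an object in the remote cluster was
// created by the parent cluster's controller, based on a marker label.
// The key "chaos-mesh.org/managed-by" is a placeholder, not the RFC's
// actual choice; GC would only collect objects carrying the marker.
func managedByParent(labels map[string]string) bool {
	_, ok := labels["chaos-mesh.org/managed-by"]
	return ok
}

func main() {
	fmt.Println(managedByParent(map[string]string{"chaos-mesh.org/managed-by": "parent"})) // true
	fmt.Println(managedByParent(map[string]string{}))                                      // false: user-created, skip GC
}
```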

will create a SubjectAccessReview to check whether the user is able to create
this object in the target cluster/namespace.

The only problem is that the username could be different among the clusters. To
Member

> The only problem is that the username could be different among the clusters.

Not only the username but also the user credential.

It requires users to provide different credentials for different clusters. It seems we would save the credentials into a userMap.

When the "Manager" syncs the *Chaos from the "ParentCluster" to the "RemoteCluster", it requires a kubeconfig with the user's credential. (Or maybe we just use the given kubeconfig 🤷‍♂️)

Member Author

> Not only the username but also the user credential.

I think Chaos Mesh itself could use a super powerful credential and impersonate anyone (with only the username) to create things.

Or is it possible to validate the authorization only in the parent cluster? As the chaos controller in the parent cluster could also create a SubjectAccessReview in the remote cluster. I think it's better, as the authorization error will block the user from creating a resource they don't actually have the privilege to create.
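For reference, the SubjectAccessReview the parent's controller could create in the remote cluster might look like this (the username and namespace are illustrative; the API group and fields are the standard `authorization.k8s.io/v1` ones):

```yaml
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  # the user as they are known in the remote cluster
  user: alice
  resourceAttributes:
    group: chaos-mesh.org
    resource: stresschaos
    verb: create
    namespace: chaos-testing
```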

Signed-off-by: YangKeao <[email protected]>
In the remote cluster, the chaos-controller-manager and chaos-daemon will be
responsible for injecting / recovering chaos

These two chaos-controller-manager could share a single binary, so the
Member

I don't think sharing a single binary is a good idea. The logic of `parent mode` is different from `child mode`; sharing a single binary will make chaos-controller-manager more complicated.

Member Author

Well, it is not that different actually. Just like how we can run the Schedule and Workflow controllers in a standalone binary, running them in the same binary doesn't bring too much trouble.

At least, I found I'll need to clarify the statement. It seems that the three modes overlap somewhat: the remote cluster's Chaos Mesh doesn't need to know it is a child, and the parent Chaos Mesh should also be able to inject chaos into the current (parent) cluster.
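Under that reading, the "modes" could reduce to which controllers a single binary starts; a hedged sketch (the mode and controller names are mine, not settled in the RFC):

```go
package main

import "fmt"

// controllersFor maps a deployment mode onto the set of controllers to
// start. Every mode keeps the ordinary chaos controllers — so a parent
// can still inject chaos into its own cluster — and only "parent" adds
// the remote-cluster management controllers.
func controllersFor(mode string) []string {
	base := []string{"chaos", "schedule", "workflow"}
	if mode == "parent" {
		return append(base, "remotecluster", "remote-sync")
	}
	return base // "child" and "single" run the same set in this sketch
}

func main() {
	fmt.Println(controllersFor("single"))
	fmt.Println(controllersFor("parent"))
}
```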

metadata:
  name: burn-cpu
spec:
  remoteCluster:
Member

Can remoteCluster be used as a selector instead of a single field?

Member Author

Indeed, we cannot assume that the different clusters are isomorphic, so if you try to run the same selector on different remote clusters, I doubt whether it's meaningful. Just like `containerNames`, which is also embarrassing, as its existence shows our assumption of isomorphism between pods.

And it's also hard to manage the status for a chaos spread across multiple different clusters.

Comment on lines +34 to +37
- type: Installed
  status: True
- type: Ready
  status: True
Contributor

Seems that `Ready` includes `Installed`.

The total architecture is really simple.

In the management (or so-called parent) cluster, the chaos-controller-manager
will be responsible for
Contributor

The parent cluster also manages the `PhysicalMachine`.


As the plan shown above, we don't need too much modification on dashboard. The
cluster installation status should be represented in `RemoteCluster`, and the
execution status of a standalone chaos should be recorded in the `.Status` of
Contributor

Suggested change:
- execution status of a standalone chaos should be recorded in the `.Status` of
+ execution status of a standalone chaos should be recorded in the `Status` of

Comment on lines +326 to +333
chaos-controller-manager could work in three different modes:

1. parent mode, like discussed in former section
2. child mode, like discussed in former section
3. single mode, just the normal one we are using now

(maybe these modes can be described through turning on/off some controllers /
webhooks)
Member

It seems there are no clear fields to describe the mode? And will we control the relationship between clusters? For example:

  • can a cluster be managed by different clusters? If not, how do we prevent this operation?
  • can a cluster be a parent and a child at the same time? If not, how do we prevent this operation?

6 participants