RFC: Improve Cloud Chaos Providers #46
Open
miketonks-form3
wants to merge
5
commits into
chaos-mesh:main
from
miketonks:mike-cloud-chaos
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,126 @@ | ||
# Chaos Types for Cloud Providers | ||
|
||
## Summary | ||
|
||
The current AWSChaos, GCPChaos and AzureChaos experiment types are quite basic. | ||
|
||
We would like to extend them to support selecting instances using additional | ||
filters such as tags or labels, instance type, availability zone, etc, as | ||
supported by the different cloud providers. | ||
|
||
In future we would like to add more types for different cloud resources | ||
such as networking. | ||
|
||
## Motivation | ||
|
||
We would like to use chaos-mesh to orchestrate more complex testing in cloud | ||
environments. | ||
|
||
For example, when stopping an instance it should be possible to select by name, | ||
tags or labels, instance type, availability zone, and other properties as | ||
supported by the cloud provider. | ||
|
||
Additionally, it would be useful to simulate outages by changing network ACLs | ||
in a similar way to how NetworkChaos works. | ||
|
||
| ChaosType | Selector | Notes | | ||
| --------- | -------- | ----- | | ||
| AWSChaos | Single instance by `InstanceID`| No support for selectors | | ||
| GCPChaos | Single instance by `name`| No support for selectors | | ||
| AzureChaos | Single instance by `resource_group` and `name`| The docs mention a `mode` parameter, but its behaviour is not documented | | ||
|
||
It would be good to make these types similar, so users of chaos-mesh with | ||
multiple cloud providers can create similar experiments easily across all | ||
clouds. However, since we rely on the underlying SDK libraries from each | ||
provider, it seems sensible to keep close to the naming | ||
conventions used in each SDK. | ||
|
||
None of these types currently use `Selectors`. We should use the | ||
`RecordsController` with `impl.Select` to look up matching resources in the | ||
cloud and save them in the `records`, as described in the | ||
[records README](https://github.com/chaos-mesh/chaos-mesh/tree/master/controllers/common/records#readme). | ||
Since the `filters` field is new, adding it is not a breaking change, so it is safe to update the existing chaos types. | ||
|
||
The initial types AWSChaos, GCPChaos and AzureChaos represent the common Instance or VirtualMachine resource types. Cloud providers have many other resource types that we may want to target for chaos in the future. We can introduce new chaos experiment types for these, such as AWSNetworkChaos or AWSAutoscalerChaos. | ||
|
||
## Detailed design | ||
|
||
Add new fields to the existing AWS Chaos, GCP Chaos and Azure Chaos | ||
definitions, to allow specifying filters. | ||
|
||
### AWS Chaos selecting multiple instances using tag | ||
|
||
``` | ||
apiVersion: chaos-mesh.org/v1alpha1 | ||
kind: AWSChaos | ||
metadata: | ||
  name: ec2-stop-example | ||
  namespace: chaos-mesh | ||
spec: | ||
  action: ec2-stop | ||
  awsRegion: 'us-east-2' | ||
  filters: | ||
    - name: 'tag:environment' | ||
      value: 'staging' | ||
    - name: 'instance-type' | ||
      value: 't2.micro' | ||
  mode: 'all' | ||
  duration: '5m' | ||
``` | ||
|
||
Supported filters are those accepted by the EC2 `DescribeInstances` API. For details see: | ||
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstances.html | ||
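The following Go sketch illustrates one way the `filters` list above could be turned into the name/values shape that `DescribeInstances` expects. It is not taken from the chaos-mesh code base: `SpecFilter`, `EC2Filter` and `groupFilters` are hypothetical names, `EC2Filter` is a local stand-in rather than the SDK type, and merging entries that share a name is an assumption about how the field might behave. It relies on the EC2 rule that values within one filter are ORed while separate filters are ANDed.

```go
package main

import (
	"fmt"
	"sort"
)

// SpecFilter mirrors one entry of the proposed `filters` list.
// Hypothetical type name, used only for this sketch.
type SpecFilter struct {
	Name  string
	Value string
}

// EC2Filter is a local stand-in with the same shape as an EC2 API filter
// (one name, many values). It is NOT the AWS SDK type.
type EC2Filter struct {
	Name   string
	Values []string
}

// groupFilters merges spec entries that share a name into a single filter,
// so repeated names become alternative values for that filter.
// Output is sorted by name to keep it deterministic.
func groupFilters(spec []SpecFilter) []EC2Filter {
	byName := map[string][]string{}
	for _, f := range spec {
		byName[f.Name] = append(byName[f.Name], f.Value)
	}

	names := make([]string, 0, len(byName))
	for name := range byName {
		names = append(names, name)
	}
	sort.Strings(names)

	out := make([]EC2Filter, 0, len(names))
	for _, name := range names {
		out = append(out, EC2Filter{Name: name, Values: byName[name]})
	}
	return out
}

func main() {
	filters := groupFilters([]SpecFilter{
		{Name: "tag:environment", Value: "staging"},
		{Name: "instance-type", Value: "t2.micro"},
	})
	for _, f := range filters {
		fmt.Printf("%s = %v\n", f.Name, f.Values)
	}
}
```

In a real controller the resulting slice would be copied into the SDK's own filter type before calling `DescribeInstances`; that step is left out here to keep the sketch free of external dependencies.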
|
||
### GCP Chaos selecting multiple instances using labels | ||
|
||
``` | ||
apiVersion: chaos-mesh.org/v1alpha1 | ||
kind: GCPChaos | ||
metadata: | ||
  name: node-stop-example | ||
  namespace: chaos-mesh | ||
spec: | ||
  action: node-stop | ||
  secretName: 'cloud-key-secret' | ||
  project: 'your-project-id' | ||
  zone: 'your-zone' | ||
  filters: | ||
    - 'labels.environment = staging' | ||
    - 'name ne .*web' | ||
  mode: 'all' | ||
  duration: '5m' | ||
``` | ||
|
||
Supported filters are described in the Google Cloud Compute Engine reference. For details see: | ||
https://cloud.google.com/compute/docs/reference/rest/v1/instances/list | ||
|
||
If multiple filters are provided, they will be combined with an AND expression. | ||
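As a minimal sketch of the AND combination described above, the helper below wraps each filter expression in parentheses and joins them with ` AND `, which is the form the Compute Engine `instances.list` reference describes for multiple expressions. The function name `combineFilters` is hypothetical, and skipping blank entries is an assumption rather than specified behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// combineFilters joins individual filter expressions into one string,
// wrapping each in parentheses and ANDing them together.
// Blank entries are skipped (an assumption made for this sketch).
func combineFilters(filters []string) string {
	parts := make([]string, 0, len(filters))
	for _, f := range filters {
		f = strings.TrimSpace(f)
		if f == "" {
			continue
		}
		parts = append(parts, "("+f+")")
	}
	return strings.Join(parts, " AND ")
}

func main() {
	combined := combineFilters([]string{
		"labels.environment = staging",
		"name ne .*web",
	})
	fmt.Println(combined)
}
```

The combined string would then be passed as the single `filter` parameter of the list request. Whether every pair of expressions can legally be combined is decided by the API, so the controller should surface any error the list call returns.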
|
||
|
||
|
||
## Drawbacks | ||
|
||
|
||
## Alternatives | ||
|
||
- Cloud extensions to live in a separate repository | ||
|
||
It would perhaps be more scalable for the cloud provider types to live in | ||
separate source repositories, allowing them to be maintained without | ||
impacting the main chaos-mesh code base. | ||
|
||
However, this doesn't seem feasible at this time. We would need some kind | ||
of plugin / extension model to support this. | ||
|
||
|
||
## Unresolved questions | ||
|
||
- How to filter VMs list using Azure SDK. | ||
|
||
The documentation isn't the best. Also, the autorest library used in chaos-mesh | ||
is out of support: https://github.com/Azure/go-autorest | ||
|
||
The following does work: `az vm list -d --query "[?tags.env=='staging']"` | ||
but I couldn't easily find how to do this in the Go SDK. I have asked on | ||
the Gophers Slack. |
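One possible fallback, sketched below, follows from the fact that the `az` CLI's `--query` option is a JMESPath expression evaluated on the client after the list call returns. The same effect can be had in Go by listing the VMs and filtering on tags in code. This is only an illustration under stated assumptions: `VM`, `filterByTags` and `hasAllTags` are hypothetical names, and `VM` is a local stand-in rather than an Azure SDK type.

```go
package main

import "fmt"

// VM is a local stand-in for a virtual machine record returned by a list
// call. Hypothetical type; the Azure SDK's own type has a different shape.
type VM struct {
	Name          string
	ResourceGroup string
	Tags          map[string]string
}

// hasAllTags reports whether tags contains every key/value pair in want.
func hasAllTags(tags, want map[string]string) bool {
	for k, v := range want {
		if got, ok := tags[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// filterByTags returns the VMs whose tags match every entry in want,
// mirroring a client-side query such as [?tags.env=='staging'].
func filterByTags(vms []VM, want map[string]string) []VM {
	var out []VM
	for _, vm := range vms {
		if hasAllTags(vm.Tags, want) {
			out = append(out, vm)
		}
	}
	return out
}

func main() {
	vms := []VM{
		{Name: "web-1", ResourceGroup: "rg-a", Tags: map[string]string{"env": "staging"}},
		{Name: "web-2", ResourceGroup: "rg-a", Tags: map[string]string{"env": "prod"}},
		{Name: "db-1", ResourceGroup: "rg-b"}, // no tags at all
	}
	for _, vm := range filterByTags(vms, map[string]string{"env": "staging"}) {
		fmt.Println(vm.ResourceGroup, vm.Name)
	}
}
```

Client-side filtering costs a full list call per selection, so a server-side filter would still be preferable if the SDK turns out to support one.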
For GCP and Azure, could you provide the related docs like
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstances.html please? ❤️