This document explains the action API, control flow and the contracts behind it. It starts with a high-level overview and then explains every API in detail.
Actions are implemented with the help of ActionKit and the action API through an implementation of an extension. Extensions are HTTP servers implementing the action API to describe which actions are supported and how to execute these. The following diagram illustrates who is issuing calls and in what phases.
As can be seen above, the extension is called by the Steadybit agent in two phases:
- In the action registration phase, Steadybit learns about the supported actions. Once this phase is completed, actions will be usable within Steadybit, e.g., within the experiment editor.
- The action execution phase occurs whenever an action is executed, e.g., as part of an experiment.
The following sections explain the various API endpoints, their responsibilities and structures in more detail.
As the name implies, the action list returns a list of supported actions. Or, more specifically, HTTP endpoints that the agent should call to learn more about the actions.
This endpoint needs to be registered with Steadybit agents.
// Request: GET /actions
// Response: 200
{
"actions": [
{
"method": "GET",
"path": "/actions/rollout-restart"
}
]
}
- Go API: ActionListResponse
- OpenAPI Schema: ActionListResponse
An action description is required for each action. The HTTP endpoint serving the description is discovered through the action list endpoint.
Action descriptions expose information about the presentation, configuration and behavior of actions. For example:
- What should the action be called?
- Which configuration options should be presented to end-users within the user interface?
- Can the action be stopped, or is this an instantaneous event, e.g., host reboots?
Action description is a somewhat involved topic. For more information on action parameters, please refer to our parameter types documentation.
// Request: GET /actions/rollout-restart
// Response: 200
{
"id": "com.steadybit.example.actions.kubernetes.rollout-restart",
"label": "Kubernetes Rollout Restart Deployment",
"description": "Execute a rollout restart for a Kubernetes deployment",
"version": "1.0.0",
"icon": "",
"kind": "attack",
"category": "resource",
"targetSelection": {
"targetType": "kubernetes-deployment",
"selectionTemplates": [
{
"label": "default",
"description": "Find deployment by cluster, namespace and deployment",
"query": "k8s.cluster-name=\"\" AND k8s.namespace=\"\" AND k8s.deployment=\"\""
}
],
"quantityRestriction": "none",
"missingQuerySelection": "include_none",
"defaultBlastRadius": {
"mode": "percentage",
"value": 100
}
},
"timeControl": "internal",
"hint": {
"type": "warning",
"content": "This can be dangerous! Please have a look [here](https://foo.bar/baz) first."
},
"parameters": [
{
"label": "Wait for rollout completion?",
"name": "wait",
"type": "boolean",
"description": "",
"required": false,
"advanced": true,
"order": 0,
"defaultValue": "false",
"hint": {
"type": "info",
"content": "We just want to inform you that this is an awesome action."
}
}
],
"prepare": {
"method": "POST",
"path": "/actions/rollout-restart/prepare"
},
"start": {
"method": "POST",
"path": "/actions/rollout-restart/start"
},
"status": {
"method": "POST",
"path": "/actions/rollout-restart/status"
},
"stop": {
"method": "POST",
"path": "/actions/rollout-restart/stop"
}
}
- Go API: DescribeActionResponse
- OpenAPI Schema: DescribeActionResponse
Actions are versioned strictly, and Steadybit will ignore definition changes for the same version. Remember to update the version every time you update the action description. You can use extbuild.GetSemverVersionStringOrUnknown() from our extension-kit to use the build version here.
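To picture why a forgotten version bump hides a changed description, here is a small illustrative model. It is an assumption made for explanation only: the registry type and its keying by id and version are hypothetical and do not describe how Steadybit is actually implemented.

```go
package main

import "fmt"

// description holds only the fields relevant to this sketch.
type description struct {
	ID, Version, Label string
}

// registry is a hypothetical model that keeps the first description seen
// for every (id, version) pair.
type registry map[string]description

// register reports whether d was stored. A description whose id and version
// are already known is ignored, even if other fields changed.
func (r registry) register(d description) bool {
	key := d.ID + "@" + d.Version
	if _, known := r[key]; known {
		return false
	}
	r[key] = d
	return true
}

func main() {
	r := registry{}
	fmt.Println(r.register(description{"example.restart", "1.0.0", "Restart"}))
	// Same version with a changed label: ignored in this model.
	fmt.Println(r.register(description{"example.restart", "1.0.0", "Restart now"}))
	// Bumped version: the change becomes visible.
	fmt.Println(r.register(description{"example.restart", "1.0.1", "Restart now"}))
}
```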
If your action requires a target, you can specify the target type here. The given value should match the targetType of one of the existing discoveries.
Creating the right target query might be difficult for the user. Therefore, you can provide a list of target selection templates which the user can select in the UI.
Actions can fine-tune the target selection in the UI. The following values are supported:
- "quantityRestriction": "None" - The user needs to define the target selection and has an option to randomize the selection, e.g., 50%. This is the default value and is usually the correct setting for all actions of kind "Attack".
- "quantityRestriction": "ExactlyOne" - The user must select exactly one target. The restriction will be validated as soon as you start the experiment.
- "quantityRestriction": "All" - The user needs to define the target selection, but the Steadybit UI will not show the randomization part, e.g., 50%.
The default blast radius is set to 100%, meaning all targets specified by the query will be used. The action can define a different default blast radius, like "use a single target".
The user can always adjust the blast radius in the experiment editor. This setting is only relevant for "quantityRestriction": "None".
If the user does not provide a target selection, the action can decide how to proceed. The following values are supported:
- "missingQuerySelection": "include_none" - Default. No targets will be included, and the experiment will be invalid.
- "missingQuerySelection": "include_all" - All available targets will be included. The experiment will be valid, and all targets will be used if no blast radius limits the selection.
Time control informs Steadybit about behavioral aspects of the action. At this moment, there are three options:
- Instantaneous actions that cannot be undone, e.g., killing processes or shutting down servers: "timeControl": "INSTANTANEOUS"
- Actions spanning a configurable amount of time that are stoppable, e.g., causing CPU/memory stress, network configuration changes: "timeControl": "EXTERNAL". Note that these actions require a parameter named duration with type duration.
- Actions spanning an unknown amount of time, e.g., waiting for a service to roll over or for a deployment to finish: "timeControl": "INTERNAL"
As with every distributed system, things can go wrong, and the Steadybit agent and extensions are no exception. Steadybit has multiple safety measures in place to prevent uncontrolled harm to your system:
The Platform calculates an estimation for the total experiment duration. If the experiment exceeds this duration by 15 minutes, it will be canceled. The estimate is updated while conducting the experiment and is computed by summing up the durations of the steps.
Hint: An action that doesn't specify a duration parameter but exceeds the timeout will cause the experiment to be canceled.
The Platform expects each step to start within three minutes. If your action takes longer than three minutes to start, the experiment will be canceled.
The Golang Action SDK will add a synthetic status callback if none is present. It expects the status callback to be called periodically. If more than three calls are missed, the extension will roll back all active actions on its own.
Actions and action parameters can contain hints. These will be rendered inside the UI experiment editor. The following screenshot shows an action with a hint and hint type "hint_warning" and a parameter hint with type "hint_info". The content of the hint also supports Markdown.
Action execution is divided into four steps:
- preparation
- start
- status
- stop
HTTP endpoints represent each step. Steadybit learns about these endpoints through the action description documented in the previous sections. The following sub-sections explain the responsibilities of each of the endpoints in detail.
The preparation (prepare for short) step receives the action's configuration options (representing the parameters defined in the action description) and a selected target. The HTTP endpoint must respond with an HTTP status code 200 and a JSON response body containing a state object. Details about Error Handling can be found in this chapter.
The state object is later used in HTTP requests to the start and stop endpoints. So you will want to include all execution-relevant information within the state object, e.g., a subset of the target's attributes, the configuration options and the original state (in case you are going to do some system modification as part of the start step).
If a parameter of type file is defined, the request will be a multipart request. The first part will contain a JSON defined by PrepareActionRequestBody (part='request'). The following will be the files with the name of the parameter as key (part='example-parameter').
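The multipart layout described above can be read with the standard library's multipart reader. The following is a sketch under assumptions: the prepareRequest type is a reduced, hypothetical stand-in for PrepareActionRequestBody, and the demo request is built locally only to exercise the parser. How the agent sets filenames or content types on each part is not specified here, so the parser relies only on the part names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
)

// prepareRequest is a reduced, hypothetical stand-in for the JSON part.
type prepareRequest struct {
	Config map[string]any `json:"config"`
}

// parsePrepare reads the JSON part named "request" and collects every other
// part as file content, keyed by the part name (the parameter name).
func parsePrepare(r *http.Request) (prepareRequest, map[string][]byte, error) {
	var req prepareRequest
	files := map[string][]byte{}
	mr, err := r.MultipartReader()
	if err != nil {
		return req, nil, err
	}
	for {
		part, err := mr.NextPart()
		if err == io.EOF {
			break
		}
		if err != nil {
			return req, nil, err
		}
		data, err := io.ReadAll(part)
		if err != nil {
			return req, nil, err
		}
		if part.FormName() == "request" {
			if err := json.Unmarshal(data, &req); err != nil {
				return req, nil, err
			}
			continue
		}
		files[part.FormName()] = data
	}
	return req, files, nil
}

// demoRequest builds a local multipart request; errors are ignored because
// writing to an in-memory buffer does not fail in this demo.
func demoRequest() *http.Request {
	var body bytes.Buffer
	mw := multipart.NewWriter(&body)
	mw.WriteField("request", `{"config":{"wait":true}}`)
	fw, _ := mw.CreateFormFile("example-parameter", "payload.txt")
	fw.Write([]byte("hello"))
	mw.Close()
	req := httptest.NewRequest(http.MethodPost, "/actions/rollout-restart/prepare", &body)
	req.Header.Set("Content-Type", mw.FormDataContentType())
	return req
}

func main() {
	req, files, err := parsePrepare(demoRequest())
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Config["wait"], string(files["example-parameter"]))
}
```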
// Request: POST /actions/rollout-restart/prepare
{
"config": {
"wait": true
},
"target": {
"name": "demo-dev/steadybit-demo/gateway",
"attributes": {
"k8s.deployment": [
"gateway"
],
"k8s.namespace": [
"steadybit-demo"
],
"k8s.cluster-name": [
"demo-dev"
]
}
}
}
// Response: 200
{
"state": {
"Cluster": "demo-dev",
"Namespace": "steadybit-demo",
"Deployment": "gateway",
"Wait": true
}
}
- Go API: PrepareActionRequestBody, PrepareActionResponse
- OpenAPI Schema: PrepareActionRequestBody, PrepareActionResponse
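The mapping from the prepare request to the state object in the example above could be sketched as follows. The types prepareRequest, state and prepareResponse are hypothetical local types shaped after the JSON shown; a real extension would normally use the ActionKit Go API types instead.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// prepareRequest mirrors the request JSON shown above (hypothetical type).
type prepareRequest struct {
	Config map[string]any `json:"config"`
	Target struct {
		Name       string              `json:"name"`
		Attributes map[string][]string `json:"attributes"`
	} `json:"target"`
}

// state carries everything the later steps need.
type state struct {
	Cluster    string
	Namespace  string
	Deployment string
	Wait       bool
}

type prepareResponse struct {
	State state `json:"state"`
}

// firstValue returns the first value of a target attribute, or "" if absent.
func firstValue(attrs map[string][]string, key string) string {
	if values := attrs[key]; len(values) > 0 {
		return values[0]
	}
	return ""
}

// prepare turns the request body into the state object for later steps.
func prepare(body []byte) (prepareResponse, error) {
	var req prepareRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return prepareResponse{}, err
	}
	wait, _ := req.Config["wait"].(bool) // defaults to false when missing
	return prepareResponse{State: state{
		Cluster:    firstValue(req.Target.Attributes, "k8s.cluster-name"),
		Namespace:  firstValue(req.Target.Attributes, "k8s.namespace"),
		Deployment: firstValue(req.Target.Attributes, "k8s.deployment"),
		Wait:       wait,
	}}, nil
}

func main() {
	body := []byte(`{"config":{"wait":true},"target":{"name":"demo-dev/steadybit-demo/gateway","attributes":{"k8s.deployment":["gateway"],"k8s.namespace":["steadybit-demo"],"k8s.cluster-name":["demo-dev"]}}}`)
	res, err := prepare(body)
	if err != nil {
		panic(err)
	}
	out, _ := json.Marshal(res)
	fmt.Println(string(out))
}
```

Keeping the state small and self-contained matters because the same object is sent back to the extension in the later steps.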
The actual action happens within the start step, i.e., this is where you will typically modify the system, kill processes or reboot servers.
The start step receives the prepare step's state object. The HTTP endpoint must respond with an HTTP status code 200 on success. Details about Error Handling can be found in this chapter. A JSON response body containing a state object may be returned. This state object is later passed to the stop step.
This endpoint must respond within a few seconds. It is not permitted to block until the action execution is completed within the start endpoint. For example, you can trigger a deployment change within the start endpoint, but the start endpoint may not block until the deployment change is fully rolled out (this is what the status endpoint is for).
// Request: POST /actions/rollout-restart/start
{
"state": {
"Cluster": "demo-dev",
"Namespace": "steadybit-demo",
"Deployment": "gateway",
"Wait": true
}
}
// Response: 200
{
"state": {
"Cluster": "demo-dev",
"Namespace": "steadybit-demo",
"Deployment": "gateway",
"Wait": true
}
}
- Go API: StartActionRequestBody, StartActionResponse
- OpenAPI Schema: StartActionRequestBody, StartActionResponse
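The non-blocking requirement for the start endpoint can be sketched with a goroutine: start only triggers the work and returns, and completion is checked separately. The execution type and the in-memory handle are simplifications assumed for this example; a real extension would typically find the running work again through the state object.

```go
package main

import "fmt"

// execution tracks work that was triggered by start (hypothetical type).
type execution struct {
	done chan struct{}
}

// start triggers the work in the background and returns immediately.
func start(work func()) *execution {
	e := &execution{done: make(chan struct{})}
	go func() {
		defer close(e.done) // signal completion when the work returns
		work()
	}()
	return e
}

// completed reports, without blocking, whether the work has finished.
func (e *execution) completed() bool {
	select {
	case <-e.done:
		return true
	default:
		return false
	}
}

func main() {
	release := make(chan struct{})
	e := start(func() { <-release }) // simulated long-running rollout
	fmt.Println(e.completed())       // work is still running
	close(release)                   // let the simulated rollout finish
	<-e.done
	fmt.Println(e.completed()) // work has finished
}
```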
The status step exists to observe the status of the action execution. For example, when triggering a deployment change you would use the status endpoint to inspect whether the deployment change was processed.
The status step receives the state object of the prepare, start or previous status step. The HTTP endpoint must respond with an HTTP status code 200 on success. Details about Error Handling can be found in this chapter.
This endpoint must respond within a few seconds. It is not permitted to block until the action execution is completed within the status endpoint. For example, you can inspect a deployment change's state within the status endpoint, but the status endpoint may not block until the deployment change is fully rolled out.
The status endpoint is continuously called until it responds with completed=true.
// Request: POST /actions/rollout-restart/status
{
"state": {
"Cluster": "demo-dev",
"Namespace": "steadybit-demo",
"Deployment": "gateway",
"Wait": true
}
}
// Response: 200
{
"completed": true
}
- Go API: ActionStatusRequestBody, ActionStatusResponse
- OpenAPI Schema: ActionStatusRequestBody, ActionStatusResponse
The stop step exists to revert system modifications, stop CPU/memory stress or any other actions.
The stop step receives the state object of the prepare, status or start step. The HTTP endpoint must respond with an HTTP status code 200 on success. Details about Error Handling can be found in this chapter.
// Request: POST /actions/rollout-restart/stop
{
"state": {
"Cluster": "demo-dev",
"Namespace": "steadybit-demo",
"Deployment": "gateway",
"Wait": true
}
}
// Response: 200
- Go API: StopActionRequestBody, StopActionResponse
- OpenAPI Schema: StopActionRequestBody, StopActionResponse
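Reverting a modification in the stop step is easiest when the original value was captured in the state object during prepare. The following sketch uses an entirely hypothetical scenario (scaling a deployment to zero replicas against an in-memory fakeCluster) to show the pattern; it is not what the rollout-restart example does.

```go
package main

import "fmt"

// state carries the original value captured during prepare (hypothetical).
type state struct {
	Deployment       string
	OriginalReplicas int
}

// fakeCluster stands in for the real system: deployment name to replicas.
type fakeCluster map[string]int

// prepare records the original replica count so that stop can restore it.
func prepare(c fakeCluster, deployment string) state {
	return state{Deployment: deployment, OriginalReplicas: c[deployment]}
}

// start performs the system modification.
func start(c fakeCluster, s state) { c[s.Deployment] = 0 }

// stop restores the original value. It is safe to call repeatedly, and also
// when start never ran, because it only writes the recorded value back.
func stop(c fakeCluster, s state) { c[s.Deployment] = s.OriginalReplicas }

func main() {
	cluster := fakeCluster{"gateway": 3}
	s := prepare(cluster, "gateway")
	start(cluster, s)
	fmt.Println(cluster["gateway"]) // modified by start
	stop(cluster, s)
	fmt.Println(cluster["gateway"]) // restored by stop
}
```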
The prepare, start, status and stop endpoints share the same mechanisms for error handling.
The agent will stop the experiment execution if the extension:
- returns an HTTP status code which is not 200
- returns a body of type ActionKitError
- returns its specific response type and an attribute error containing ActionKitError
The attribute status in ActionKitError defines how Steadybit will show the error.
- failed - The action has detected some failures, for example a failing test which has been implemented by the action. The action will be stopped if this status is returned by the status endpoint.
- errored - There was a technical error while executing the action. It will be marked as red in the platform. The action will be stopped if this status is returned by the status endpoint.
- Go API: ActionKitError
- OpenAPI Schema: ActionKitError
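The three stop conditions listed above can be summarized as one decision function. This is a sketch under assumptions: the title field used to recognize a top-level ActionKitError is assumed for illustration (only status and the error attribute are described in this document), and treating an unparseable body as an error is a choice of this example rather than documented agent behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// actionKitError is a reduced, hypothetical model. The title field is an
// assumption of this sketch.
type actionKitError struct {
	Title  string `json:"title"`
	Status string `json:"status"`
}

// responseBody covers both a top-level error and a step-specific response
// that carries an error attribute.
type responseBody struct {
	Title string          `json:"title"`
	Error *actionKitError `json:"error"`
}

// stopsExperiment reports whether a response matches one of the three
// conditions listed above.
func stopsExperiment(statusCode int, body []byte) bool {
	if statusCode != 200 {
		return true // condition 1: status code is not 200
	}
	if len(body) == 0 {
		return false // e.g., a successful stop response without a body
	}
	var parsed responseBody
	if err := json.Unmarshal(body, &parsed); err != nil {
		return true // assumption: a malformed body counts as an error
	}
	// condition 2: the body itself is an error; condition 3: error attribute
	return parsed.Title != "" || parsed.Error != nil
}

func main() {
	fmt.Println(stopsExperiment(500, nil))
	fmt.Println(stopsExperiment(200, []byte(`{"completed":true}`)))
	fmt.Println(stopsExperiment(200, []byte(`{"completed":false,"error":{"title":"check failed","status":"failed"}}`)))
	fmt.Println(stopsExperiment(200, []byte(`{"title":"cluster unreachable","status":"errored"}`)))
}
```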