Deployment of the k8sensor is currently broken. Fix this, then run tests to ensure that upgrading from a configuration using the deprecated Kubernetes sensor is seamless. To test this, install the agent using the old (2.x), Helm-based version of the operator and configure the agent to use the deprecated Kubernetes sensor. Then upgrade the operator to the new (3.x) version and verify that the upgrade completes seamlessly and that the new k8sensor is deployed. Finally, check the Instana backend to confirm that Kubernetes data collection continued to function as expected through the transition.
If the user misconfigures the agent, the attempt to create or update the Agent CR should be rejected. This can be achieved using one of, or a combination of, the following methods:

- Validation rules in the CRD's generated OpenAPI schema can be used to verify the validity of user-provided configuration and to provide useful feedback for troubleshooting.
- Defaulting and validating webhooks could be used for more advanced validation and to ensure that defaulted values appear on the CR stored in the cluster without the controller having to update the CR itself, which could cause performance issues if another controller is managing the Agent CR.
- Beginning with Kubernetes v1.26 (alpha) or v1.28 (beta), a ValidatingAdmissionPolicy may also be used to define validation rules using the Common Expression Language (CEL); a sketch follows this list.
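A minimal sketch of such a policy, targeting the v1beta1 API available from Kubernetes v1.28. The resource plural `agents` and the `spec.agent.key` / `spec.agent.keysSecret` fields referenced in the expression are assumptions and must be checked against the actual CRD schema:

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: instana-agent-policy
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["instana.io"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["agents"]          # assumption: plural of the InstanaAgent CRD
  validations:
    # Assumption: a valid CR must provide either an inline agent key or a Secret reference.
    - expression: "has(object.spec) && has(object.spec.agent) && (has(object.spec.agent.key) || has(object.spec.agent.keysSecret))"
      message: "either spec.agent.key or spec.agent.keysSecret must be provided"
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: instana-agent-policy-binding
spec:
  policyName: instana-agent-policy
  validationActions: ["Deny"]
```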
Unit test coverage now exists for most of the repo, but I recommend adding coverage for status.go and event-filter.go. Unit tests should also be added for the k8sensor deployment builder; I recommend mostly whitebox testing, very similar to the tests that already exist for the agent daemonset. You will need to use the mockgen tool to generate mocks for some of the interfaces used by the deployment builder, and then add the relevant mockgen commands to the Makefile so the mocks can easily be regenerated in the future if needed. For an example of this, see the mockgen commands for the daemonset tests in the Makefile.
Some behavioral tests exist at controllers/suite_test.go. These run a mock k8s server and test the agent controller's behavior to ensure that the lifecycle of owned resources is managed as expected during create, update, and delete of the Agent CR. Additional tests can be added here as needed or desired.
The end-to-end tests that currently run in the helm and webhook builds will need to be set up to run against PRs for this repo, to ensure that the end-to-end functionality of the agent works as expected when changes are made to the operator logic.
The Helm chart should be updated to wrap the operator plus an instance of the CR whose spec is built by applying toYaml directly to the chart values. This should end up looking something like this:
```yaml
apiVersion: instana.io/v1
kind: InstanaAgent
metadata:
  name: "{{ .Release.Name }}"
  namespace: "{{ .Release.Namespace }}"
spec:
{{- toYaml .Values | nindent 2}}
```
You may need to tweak this slightly to get the formatting exactly right. The new Helm chart should include only this template and the CRD, plus the resources needed to deploy the operator. Those can be generated by running `kustomize build config/default` and then replacing the namespace in each namespaced resource with `{{ .Release.Namespace }}` so that all resources are deployed into the desired namespace. Helm itself will ensure that CRDs placed in the chart's crds/ directory are installed before the InstanaAgent CR; see the Helm documentation on CRDs for more information.
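For instance, a namespaced resource from the kustomize output would end up templated along these lines (the Deployment shown here is illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: instana-agent-operator              # name as emitted by kustomize; illustrative
  namespace: "{{ .Release.Namespace }}"     # replaces the hard-coded namespace from the kustomize output
spec:
  # ...remainder of the generated manifest left unchanged...
```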
Some automation to update the chart based on changes to the CRD or operator deployment may be desirable if these are expected to change often. It may also be useful to add logic to the PR builds that automatically regenerates manifests and bundle YAMLs and commits the changes to the source branch. Additionally, if you find yourself frequently changing operator deployment files or sample CRs in ways that affect the Helm chart, consider adding automation that pushes changes to the chart whenever the relevant files in the operator repo change.
Settings for static code linting should be reviewed and updated to suit the team's preferences. A basic set of rules is currently enabled and runs during PR builds for this repo.
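As a starting point for that review, a minimal `.golangci.yml` sketch; the linter selection below is a suggestion, not the configuration currently in the repo:

```yaml
# .golangci.yml -- sketch only; adjust linters and settings to team preferences
run:
  timeout: 5m
linters:
  enable:
    - errcheck
    - govet
    - staticcheck
    - revive
    - gofmt
    - goimports
issues:
  exclude-rules:
    # Example: relax error-check strictness in test files
    - path: _test\.go
      linters:
        - errcheck
```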
Custom error types should be created to wrap errors being passed up the stack with messages that carry relevant information for debugging.
Currently, sensitive data (agent key, download key, certificates, etc.) can be configured directly within the Agent CR. This is considered bad practice and can be a security risk in some cases; however, it may be acceptable to keep as a means to deploy agents easily in development environments. Customers should be advised to place sensitive data directly into secrets and reference those secrets from the agent spec.
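For example, the recommended pattern would look something like the following. The `keysSecret` field name and the `key` / `downloadKey` entries follow the existing Helm chart convention and should be verified against the actual CRD:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: instana-agent-keys
  namespace: instana-agent
type: Opaque
stringData:
  key: <agent-key>            # placeholder values supplied by the customer
  downloadKey: <download-key>
---
# In the Agent CR, reference the pre-created Secret instead of inlining the keys.
spec:
  agent:
    keysSecret: instana-agent-keys
```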
Rate-limiting for the controller should be configured to prevent potential performance issues in cases where the cluster is inaccessible or the agent otherwise cannot be deployed for some reason.
Options could be added to the CR spec to enable agents to run on control-plane (master) nodes by automatically setting the appropriate tolerations for the node taints.
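A sketch of what such an option might produce on the agent pods; the `spec.agent.pod.tolerations` path is an assumption about the CR layout, and both the legacy and current taint keys are covered:

```yaml
spec:
  agent:
    pod:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/master   # legacy taint key on older clusters
          operator: Exists
          effect: NoSchedule
```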
If desired, an option could be added to automatically assign zone names to agent instances based on the value of the `topology.kubernetes.io/zone` label on the node on which they are running.
It may be worth considering different default logging settings to improve readability (e.g. `--zap-time-encoding=rfc3339 --zap-encoder=console`).
The agent configuration YAML, which is currently embedded in the CR as a string, could potentially be deprecated and replaced by a field using the `json.RawMessage` type, which would allow the configuration to be written as native YAML within the CR rather than as an embedded string.
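To illustrate the difference, a sketch follows. The `configuration_yaml` field name reflects the existing Helm chart convention, the `configuration` field is hypothetical, and the plugin settings shown are only illustrative:

```yaml
# Today: agent configuration embedded as a string
spec:
  agent:
    configuration_yaml: |
      com.instana.plugin.host:
        tags:
          - 'production'
---
# With a json.RawMessage-backed field: native YAML in the CR (hypothetical field name)
spec:
  agent:
    configuration:
      com.instana.plugin.host:
        tags:
          - 'production'
```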
The agent should have a readiness probe in addition to its liveness probe, and the k8sensor should have liveness and readiness probes as well. A startup probe can also be added to the agent now that startup probes are supported in all stable versions of k8s; this allows `initialDelaySeconds` on the liveness probe to be reduced, giving faster readiness and faster recovery from soft-locks (if they occur). The agent and k8sensor may also want to expose dedicated readiness endpoints so that their actual availability is reflected more accurately in the CR status. In the agent's case, the availability of the independent servers it runs on different ports should be considered when deciding whether to do this, since traffic directed at k8s services will not be forwarded to pods that are marked as unready.
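A sketch of the probe configuration on the agent container (a fragment of the daemonset pod spec). The `/status` path and port 42699 are assumed to match the endpoint used by the current liveness probe and would change if a dedicated readiness endpoint is added:

```yaml
containers:
  - name: instana-agent
    # The startup probe covers slow bootstraps, so the liveness probe no longer
    # needs a large initialDelaySeconds.
    startupProbe:
      httpGet:
        path: /status
        port: 42699
      periodSeconds: 10
      failureThreshold: 30     # up to ~5 minutes to start before restarts kick in
    livenessProbe:
      httpGet:
        path: /status
        port: 42699
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /status
        port: 42699
      periodSeconds: 10
```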
Optional persistent volumes could be used to cache dynamically downloaded updates and packages between agent restarts.
Runtime status information from the agent could be scraped and incorporated into the status tracked by the CR if this is deemed useful.
Requests and limits for ephemeral storage can be set to ensure that agent pods are assigned to nodes containing appropriate storage space for dynamic agent package downloads or to ensure agents do not exceed some limit of storage use on the host node.
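For example, on the agent container (the values are illustrative and should be tuned to expected package sizes):

```yaml
resources:
  requests:
    ephemeral-storage: "1Gi"   # ensures scheduling onto nodes with enough free local storage
  limits:
    ephemeral-storage: "4Gi"   # the pod is evicted if its usage on the node exceeds this
```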
If desired, certificates could be automatically generated and configured when appropriate if cert-manager or OpenShift's certificate generation is available.
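For the cert-manager case, the operator could create something along these lines when an issuer is available; the names and the issuer reference below are assumptions:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: instana-agent-tls            # assumed name
  namespace: instana-agent
spec:
  secretName: instana-agent-tls      # Secret that the Agent CR would then reference
  dnsNames:
    - instana-agent.instana-agent.svc
  issuerRef:
    name: my-issuer                  # assumed pre-existing Issuer/ClusterIssuer
    kind: Issuer
```

On OpenShift, annotating the relevant Service with `service.beta.openshift.io/serving-cert-secret-name` provides similar automatic certificate generation without cert-manager.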
Network policies can be used as a security measure to restrict inbound traffic on ports that the agent or k8sensor do not use. (This may not work for the agent itself because it runs with `hostNetwork: true`.)
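A sketch for the k8sensor; the pod selector label is an assumption and must match the actual deployment, and an empty ingress list denies all inbound traffic to the selected pods:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: k8sensor-deny-ingress
  namespace: instana-agent
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: instana-agent   # assumption: adjust to the k8sensor pod labels
  policyTypes:
    - Ingress
  ingress: []    # no rules -> all inbound traffic to the selected pods is denied
```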
Currently, the agent key and download key are read by the agent via environment variables that reference keys in one or more k8s secrets. It would be beneficial to watch the secret(s) and trigger a restart of the agent daemonset if a change to the secret(s) is detected.
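One common pattern (a sketch, not the current implementation) is for the operator to watch the referenced Secret(s) and stamp a hash of their data onto the daemonset's pod template, so that any change to the secret rolls the pods. The annotation key and value below are hypothetical:

```yaml
spec:
  template:
    metadata:
      annotations:
        # Hypothetical annotation written by the operator: a hash of the referenced
        # Secret's data. When the secret changes, the hash changes, and the modified
        # pod template triggers a rolling restart of the daemonset.
        instana.io/keys-secret-hash: "6f1ed002ab5595859014ebf0951522d9"
```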