dogswatch: initial kubernetes operator #239
Conversation
Updated branch with fixed, rebased history.
I can't read Go and don't really know how k8s works but looks fine at a glance :)
How does this end up on the system? Are we going to integrate it with the usual spec file builder or does this need something different?
workspaces/dogswatch/main.go (Outdated)
log.WithError(err).Fatalf("agent stopped")
}
}
log.Info("bark bark! 🐕")
ship it
if it's dogswatch should it be 🐕👀
I spent some time reading through some of this and would love to give it a proper code review, but without some deeper knowledge of Kubernetes I can't say whether what you're doing is correct. If you are confident it works, I think we need to go with that for now.
From a code perspective, there are a ton of abstractions here. Future readers would definitely appreciate a bunch more comments and some module-level docs around what these abstractions represent in the greater puzzle.
🍰
Indeed! I was sprinkling in what I was thinking where I had reflective moments, but I do intend to spend time expanding the comments and adding docstrings to the exported symbols, at a minimum.
OperatorBuildVersion = OperatorDevelopmentDoNotUseInProduction
)

// OperatorVersion describes compatibility versioning at the Platform level (the
This is a lot of versioning machinery that I'm not convinced will lead to a great experience.
Does this mean that if we break compatibility then customers will need to run one operator per version on their cluster? That's going to be pretty tough to communicate, and if a customer misses the notification, updates will silently stop working.
I'd prefer some form of negotiation where we select the "best" version of the upgrade workflow based on what the node supports.
If this isn't yet implemented I'd recommend dropping it and revisiting it in the form of a new feature.
There isn't a hard check on matching version compatibility, though the version is used for selecting Nodes (based on the presence of the label). In my opinion, having this versioning in place now and asserting that it's a 1.0.0, wherein the appropriate set of label and annotation changes is part of the bound contract, is necessary. Having it here, even if not yet acted on, allows us to act on it later.
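To make that selection concrete, here is a minimal sketch (not the PR's actual code) of listing Nodes by the presence of the thar.amazonaws.com/platform-version label with client-go; the in-cluster configuration and the context-taking List signature (client-go v0.18+) are assumptions for illustration.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// listManagedNodes returns the Nodes advertising a platform-version label,
// which is how the operator could scope itself to participating hosts.
func listManagedNodes(clientset kubernetes.Interface) error {
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		// Select on label presence only; a stricter selector could pin an
		// exact version, e.g. "thar.amazonaws.com/platform-version=1.0.0".
		LabelSelector: "thar.amazonaws.com/platform-version",
	})
	if err != nil {
		return err
	}
	for _, n := range nodes.Items {
		fmt.Println(n.Name, n.Labels["thar.amazonaws.com/platform-version"])
	}
	return nil
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	if err := listManagedNodes(kubernetes.NewForConfigOrDie(config)); err != nil {
		panic(err)
	}
}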
// NodeName limits the nodestream to a single Node resource with the
// provided name.
NodeName string
// ResyncPeriod is the time between complete resynchronization of the cached
Is this jittered somehow? It sounds expensive.
The resync period must be specified for the informer, which will regularly resync itself with the configured resource and selection. I'm soliciting k8s feedback/review through other channels - this is one of the points I'm hoping to get a better understanding of.
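For context, a rough sketch (not the PR's code) of how a client-go shared informer is given a resync period; the 30-second value and the in-cluster configuration are assumptions for illustration.

package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Every resyncPeriod the informer replays its cached objects to the
	// registered handlers, in addition to watch-driven add/update/delete events.
	const resyncPeriod = 30 * time.Second
	factory := informers.NewSharedInformerFactory(clientset, resyncPeriod)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// React to Node label/annotation changes here.
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block; real code would wire in signal handling and shutdown
}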
Talked to @jahkeup, just confirming that the force-push today was only a rebase onto latest development.
Force-pushed b4a6742 to 7d21b08
This'll probably need some incremental changes on top of it to make sure it's working with the changes in bottlerocket-os/bottlerocket-update-operator#16 and actually acts on an update (there are none right now and the commands aren't there). Otherwise, these changes can be reviewed at this point, though a handful of earlier feedback items are still pending fixes that were punted.
We'll still want eyes on this before merging. The testing requires an update starting point with the updated interface as called by the updog integration. Otherwise, additional testing and fixes have been added - these uncovered some unhandled scenarios, which are now fixed! I would encourage folks to take a look at this code with a careful eye around the event handling.
For those wanting to try this out, I'll assume you have an ECR repository, a k8s cluster (with the ECR credential helper set up), and kubectl configured locally:
# Set the registry you're using
DOCKER_IMAGE="$account_id.dkr.ecr.us-west-2.amazonaws.com/dogswatch"
# Build the container image
make DOCKER_IMAGE="$DOCKER_IMAGE" container
# Push it to ECR
$(aws --region us-west-2 ecr get-login --registry-id $account_id --no-include-email)
docker push "$DOCKER_IMAGE"
# Deploy to Kubernetes - the resources are created in the "thar" namespace.
make DOCKER_IMAGE="$DOCKER_IMAGE" deploy
# For each node you want to run dogswatch on:
kubectl label node $NODE_NAME thar.amazonaws.com/platform-version=1.0.0
# Check it out!
kubectl get -n thar deployments,daemonsets,pods
There's also some config and targets in there for https://github.com/kubernetes-sigs/kind if anyone wants to skip launching a cluster - I can't rattle the commands off the top of my head at the moment.
After some struggle and learning, I managed to get a full e2e test working consistently. Most of the issues I encountered were not related to this PR. I made some small changes to get the tests working, which I put in the dogswatch-test branch. These changes contain hacky communication with updog until a standard interface is agreed upon, which is discussed in #184. The code itself has some sharp edges, but it works. And I also agree with the decision to use Go.
LGTM!
// Host is the integration for this platform and the host - this is a very light
// mapping between the platform requirements and the implementation backing the
// interaction itself.
type Host interface {
This interface and the Platform interface are almost exactly the same, and I feel like these could be collapsed into a single layer. Open to discussion.
This was another area that was knowingly complected :( There are a few places where we want to be able to insert implementations for testing as well as for altered cluster behavior (ex: a "chaotic" mode where clusters bounce near continuously). I'm not keen on collapsing these, as we'd lose the ability to insert those implementations - though as we discussed before, I'm game to unexport them to reduce the surface area of the API itself. Some of these probably shouldn't have been exported to begin with (thanks to Me!) 😝
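To illustrate the kind of substitution being defended here, a small sketch with made-up interface and method names (not the PR's actual Platform/Host interfaces):

package platform

// hostActions is illustrative only; the real Host/Platform interfaces in the
// PR have different names and methods.
type hostActions interface {
	ListAvailableUpdates() ([]string, error)
	ApplyUpdate(id string) error
	Reboot() error
}

// chaoticHost is a test double that always claims an update is available,
// useful for exercising a cluster where nodes bounce near continuously.
type chaoticHost struct{}

func (chaoticHost) ListAvailableUpdates() ([]string, error) { return []string{"fake-update"}, nil }
func (chaoticHost) ApplyUpdate(id string) error             { return nil }
func (chaoticHost) Reboot() error                           { return nil }

// agent depends on the interface, not a concrete host, so tests and "chaotic"
// cluster experiments can inject doubles without touching the real system.
type agent struct{ host hostActions }

func newAgent(h hostActions) *agent { return &agent{host: h} }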
Force-pushed 895cf95 to 7cffc98
Construct a view on the policy that's summarized for the policy check to act on. There are some issues that may be introduced with this approach and need to be better formed; however, I think this will get this functionality off the ground.
We can't yet check for a usable configured state from updog; that'll have to come later. For now, checking that it exists on disk is enough to convince ourselves that we're in the right place.
This allows for the testing that's now present.
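As a minimal sketch of the existence check described above (the binary path here is an assumption, not necessarily where updog actually lives):

package updog

import (
	"fmt"
	"os"
)

// updogPath is an assumed location for illustration; the real path may differ.
const updogPath = "/usr/bin/updog"

// hostIsCompatible reports whether the updog binary is present on disk, which
// for now stands in for a real check of a usable configured state.
func hostIsCompatible() (bool, error) {
	if _, err := os.Stat(updogPath); err != nil {
		if os.IsNotExist(err) {
			return false, nil
		}
		return false, fmt.Errorf("checking for updog: %w", err)
	}
	return true, nil
}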
Force-pushed with rebase onto current. (Matt says: I think this comment used to point to bottlerocket-os/bottlerocket-update-operator#15)
Looks good, thanks for the additions @patraw! There are a couple of nits and a small change that I'd like to see in place before we merge this - namely the stub ID, to avoid confusion in code and by humans.
var buf bytes.Buffer
writer := bufio.NewWriter(&buf)
cmd.Stdout = writer
cmd.Stderr = writer
The human-oriented messages and machine-oriented messages are split between these two FDs, so we probably shouldn't combine them. I think we can punt on this for now, but once deserialization comes into play, they will definitely need to be kept separate.
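A sketch of what keeping the two streams separate could look like - the command invocation is a placeholder, not the PR's actual call:

package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

func main() {
	// Keep machine-oriented output (stdout) apart from human-oriented
	// diagnostics (stderr) so stdout can later be deserialized directly.
	var stdout, stderr bytes.Buffer
	cmd := exec.Command("updog", "check-update") // placeholder invocation
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	if err := cmd.Run(); err != nil {
		fmt.Printf("command failed: %v\nstderr: %s\n", err, stderr.String())
		return
	}
	fmt.Printf("stdout (parse me): %s\n", stdout.String())
}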
* Add logging to command execution.
* Increase, temporarily, memory of running pod.
* Update ListUpdate logic.
Issue #, if available:
#184 #185 #186
Description of changes:
The dogswatch Kubernetes Operator implements interfaces to the cluster's orchestrator and to Thar itself to coordinate updates as described in #184. This implementation uses labels, annotations, and Kubernetes' SDK primitives to stream updates and post wanted changes, to which the Nodes respond with appropriate actions.
The controller and agent alike concentrate their communicated state and progress into "intents", in which their current state, their "wanted" state, and their activity status are reported to drive a Node's upgrade. Driven Nodes regularly check for updates and, once one is available, post their need for an update by way of labels that cause a controller to target the Node when the controller deems it appropriate to do so. Controllers limit the number of ongoing actions and may implement their own policy dictating how and when Nodes may proceed.
As it stands today, the controller does not handle rollback scenarios, nor is it capable of the deeper understanding of an update that would permit it to halt a rollback in the cluster, as update metadata is not yet propagated to it from the requesting Node. These checks, and richer ones, are anticipated to be added over time.
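For illustration, a rough sketch of the "intent" idea described above; the struct fields and annotation keys are made up and are not the PR's actual names (only the thar.amazonaws.com prefix appears elsewhere in this PR):

package intent

// Intent is an illustrative model of the state exchanged between agent and
// controller; the real implementation's fields and keys differ.
type Intent struct {
	// Current is the state the Node reports it is in (e.g. "idle", "rebooting").
	Current string
	// Wanted is the state the controller (or agent) wants the Node to reach.
	Wanted string
	// Active reports whether the Node is currently acting on the wanted state.
	Active bool
}

// Annotations renders the intent as Node annotations so the controller and
// agent can exchange progress through the Kubernetes API.
func (i Intent) Annotations() map[string]string {
	active := "false"
	if i.Active {
		active = "true"
	}
	return map[string]string{
		"thar.amazonaws.com/intent-current": i.Current, // hypothetical keys
		"thar.amazonaws.com/intent-wanted":  i.Wanted,
		"thar.amazonaws.com/intent-active":  active,
	}
}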
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Remaining items