dogswatch: initial kubernetes operator #239
Conversation
Updated branch with fixed, rebased history.
I can't read Go and don't really know how k8s works but looks fine at a glance :)
How does this end up on the system? Are we going to integrate it with the usual spec file builder or does this need something different?
workspaces/dogswatch/main.go (Outdated)
log.WithError(err).Fatalf("agent stopped")
}
}
log.Info("bark bark! 🐕")
ship it
if it's dogswatch should it be 🐕👀
I spent some time reading through some of this and would love to give it a proper code review, but without some deeper knowledge of Kubernetes I can't say whether what you're doing is correct. If you are confident it works, I think we need to go with that for now.
From a code perspective, there are a ton of abstractions here. Future readers would definitely appreciate a bunch more comments and some module-level docs around what these abstractions represent in the greater puzzle.
🍰
Indeed! I was sprinkling in what I was thinking where I had reflective moments, but I do intend to spend time expanding the comments and adding docstrings to the exported symbols, at a minimum.
OperatorBuildVersion = OperatorDevelopmentDoNotUseInProduction
)

// OperatorVersion describes compatibility versioning at the Platform level (the
This is a lot of versioning machinery that I'm not convinced will lead to a great experience.
Does this mean that if we break compatibility then customers will need to run one operator per version on their cluster? That's going to be pretty tough to communicate, and if a customer misses the notification, updates will silently stop working.
I'd prefer some form of negotiation where we select the "best" version of the upgrade workflow based on what the node supports.
If this isn't yet implemented I'd recommend dropping it and revisiting it in the form of a new feature.
There isn't a hard check on matching version compatibility, though the version is used for selecting Nodes (based on the presence of the label). In my opinion, having this versioning in place now and asserting that it's a 1.0.0, wherein the appropriate set of label and annotation changes is part of the bound contract, is necessary. Having it here, even if not yet acted on, allows us to act on it later.
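To make that selection concrete, here is a minimal sketch (not the PR's actual code) of listing Nodes by the presence of the thar.amazonaws.com/platform-version label with client-go; the in-cluster configuration and the context-taking List signature (client-go v0.18+) are assumptions for illustration.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// listManagedNodes returns the Nodes advertising a platform-version label,
// which is how the operator could scope itself to participating hosts.
func listManagedNodes(clientset kubernetes.Interface) error {
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		// Select on label presence only; a stricter selector could pin an
		// exact version, e.g. "thar.amazonaws.com/platform-version=1.0.0".
		LabelSelector: "thar.amazonaws.com/platform-version",
	})
	if err != nil {
		return err
	}
	for _, n := range nodes.Items {
		fmt.Println(n.Name, n.Labels["thar.amazonaws.com/platform-version"])
	}
	return nil
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	if err := listManagedNodes(kubernetes.NewForConfigOrDie(config)); err != nil {
		panic(err)
	}
}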
// NodeName limits the nodestream to a single Node resource with the
// provided name.
NodeName string
// ResyncPeriod is the time between complete resynchronization of the cached
Is this jittered somehow? It sounds expensive.
The resync period must be specified for the informer, which will regularly resync itself with the configured resource and selection. I'm soliciting k8s feedback/review through other channels - this is one of the points I'm hoping to get a better understanding of.
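For context, a rough sketch (not the PR's code) of how a client-go shared informer is given a resync period; the 30-second value and the in-cluster configuration are assumptions for illustration.

package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Every resyncPeriod the informer replays its cached objects to the
	// registered handlers, in addition to watch-driven add/update/delete events.
	const resyncPeriod = 30 * time.Second
	factory := informers.NewSharedInformerFactory(clientset, resyncPeriod)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// React to Node label/annotation changes here.
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block; real code would wire in signal handling and shutdown
}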
Talked to @jahkeup, just confirming that the force-push today was only a rebase onto latest development.
Force-pushed b4a6742 to 7d21b08
This'll probably need some incremental changes on top of it to make sure it's working with the changes in bottlerocket-os/bottlerocket-update-operator#16 and actually acts on an update (there are none right now and the commands aren't there). Otherwise, these changes can be reviewed at this point, though a handful of earlier feedback items are still pending fixes that were punted.
We'll still want eyes on this before merging. The testing requires an update starting point with the updated interface as called by the updog integration. Otherwise, additional testing and fixes have been added - these uncovered some unhandled scenarios, which are now fixed! I would encourage folks to take a look at this code with a careful eye around the event handling.
For those wanting to try this out, I'll assume you have an ECR repository, a k8s cluster (with the ECR credential helper set up), and kubectl configured locally:
# Set the registry you're using
DOCKER_IMAGE="$account_id.dkr.ecr.us-west-2.amazonaws.com/dogswatch"
# Build the container image
make DOCKER_IMAGE="$DOCKER_IMAGE" container
# Push it to ECR
$(aws --region us-west-2 ecr get-login --registry-id $account_id --no-include-email)
docker push "$DOCKER_IMAGE"
# Deploy to Kubernetes - the resources are created in the "thar" namespace.
make DOCKER_IMAGE="$DOCKER_IMAGE" deploy
# For each node you want to run dogswatch on:
kubectl label node $NODE_NAME thar.amazonaws.com/platform-version=1.0.0
# Check it out!
kubectl get -n thar deployments,daemonsets,pods
There's also some config and targets in there for https://github.com/kubernetes-sigs/kind if anyone wants to skip launching a cluster - I can't rattle the commands off the top of my head at the moment.
After some struggle and learning, I managed to get a full e2e test working consistently. Most of the issues I encountered were not related to this PR. I made some small changes to get the tests working, which I put in the dogswatch-test branch. These changes contain hacky communication with updog until a standard interface is agreed upon, which is discussed in #184. The code itself has some sharp edges, but it works. And I also agree with the decision to use Go.
LGTM!
// Host is the integration for this platform and the host - this is a very light
// mapping between the platform requirements and the implementation backing the
// interaction itself.
type Host interface {
This interface and the Platform interface are almost exactly the same, and I feel like these could be collapsed into a single layer. Open to discussion.
This was another area that was knowingly complected :( There are a few places where we want to be able to insert implementations for testing as well as for altered cluster behavior (ex: a "chaotic" mode where clusters bounce near continuously). I'm not keen on collapsing these, as we'd lose the ability to insert those implementations - though as we discussed before, I'm game to unexport them to reduce the surface area of the API itself. Some of these probably shouldn't have been exported to begin with (thanks to Me!) 😝
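To illustrate the kind of substitution being defended here, a small sketch with made-up interface and method names (not the PR's actual Platform/Host interfaces):

package platform

// hostActions is illustrative only; the real Host/Platform interfaces in the
// PR have different names and methods.
type hostActions interface {
	ListAvailableUpdates() ([]string, error)
	ApplyUpdate(id string) error
	Reboot() error
}

// chaoticHost is a test double that always claims an update is available,
// useful for exercising a cluster where nodes bounce near continuously.
type chaoticHost struct{}

func (chaoticHost) ListAvailableUpdates() ([]string, error) { return []string{"fake-update"}, nil }
func (chaoticHost) ApplyUpdate(id string) error             { return nil }
func (chaoticHost) Reboot() error                           { return nil }

// agent depends on the interface, not a concrete host, so tests and "chaotic"
// cluster experiments can inject doubles without touching the real system.
type agent struct{ host hostActions }

func newAgent(h hostActions) *agent { return &agent{host: h} }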
Force-pushed 895cf95 to 7cffc98
Construct a view on the policy that's summarized for the policy check to act on. There are some issues that may be introduced with this approach and need to be better formed; however, I think this will get this functionality off the ground.
We can't yet check for a usable configured state from updog; that'll have to come later. For now, checking that it exists on disk is enough to convince ourselves that we're in the right place.
This allows for the testing that's now present.
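As a minimal sketch of the existence check described above (the binary path here is an assumption, not necessarily where updog actually lives):

package updog

import (
	"fmt"
	"os"
)

// updogPath is an assumed location for illustration; the real path may differ.
const updogPath = "/usr/bin/updog"

// hostIsCompatible reports whether the updog binary is present on disk, which
// for now stands in for a real check of a usable configured state.
func hostIsCompatible() (bool, error) {
	if _, err := os.Stat(updogPath); err != nil {
		if os.IsNotExist(err) {
			return false, nil
		}
		return false, fmt.Errorf("checking for updog: %w", err)
	}
	return true, nil
}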
Force-pushed with rebase onto current. (Matt says: I think this comment used to point to bottlerocket-os/bottlerocket-update-operator#15)
Looks good, thanks for the additions @patraw! There are a couple of nits and a small change that I'd like to see in place before we merge this - namely the stub ID, to avoid confusion in code and by humans.
var buf bytes.Buffer
writer := bufio.NewWriter(&buf)
cmd.Stdout = writer
cmd.Stderr = writer
The human-oriented messages and machine-oriented messages are split between these two FDs, so we probably shouldn't combine them. I think we can punt on this for now, but once deserialization comes into play, they will definitely need to be kept separate.
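A sketch of what keeping the two streams separate could look like - the command invocation is a placeholder, not the PR's actual call:

package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

func main() {
	// Keep machine-oriented output (stdout) apart from human-oriented
	// diagnostics (stderr) so stdout can later be deserialized directly.
	var stdout, stderr bytes.Buffer
	cmd := exec.Command("updog", "check-update") // placeholder invocation
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	if err := cmd.Run(); err != nil {
		fmt.Printf("command failed: %v\nstderr: %s\n", err, stderr.String())
		return
	}
	fmt.Printf("stdout (parse me): %s\n", stdout.String())
}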
* Add logging to command execution.
* Increase, temporarily, memory of running pod.
* Update ListUpdate logic.
Issue #, if available:
#184 #185 #186
Description of changes:
The dogswatch Kubernetes Operator implements interfaces to the cluster's orchestrator and to Thar itself to coordinate updates as described in #184. This implementation uses labels, annotations, and Kubernetes' SDK primitives to stream updates and post wanted changes, to which the Nodes respond with appropriate actions.
The controller and agent alike concentrate their communicated state and progress into "intents", in which their current state, their "wanted" state, and their activity status are reported to drive a Node's upgrade. Driven Nodes regularly check for updates and, once one is available, post their need for an update by way of labels that cause a controller to target the Node when the controller deems it appropriate to do so. Controllers limit the number of ongoing actions and may implement their own policy dictating how and when Nodes may proceed.
As it stands today, the controller does not handle rollback scenarios, nor is it capable of the deeper understanding of an update that would permit it to halt a rollback in the cluster, as update metadata is not yet propagated to it from the requesting Node. These checks, and richer ones, are anticipated to be added over time.
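For illustration, a rough sketch of the "intent" idea described above; the struct fields and annotation keys are made up and are not the PR's actual names (only the thar.amazonaws.com prefix appears elsewhere in this PR):

package intent

// Intent is an illustrative model of the state exchanged between agent and
// controller; the real implementation's fields and keys differ.
type Intent struct {
	// Current is the state the Node reports it is in (e.g. "idle", "rebooting").
	Current string
	// Wanted is the state the controller (or agent) wants the Node to reach.
	Wanted string
	// Active reports whether the Node is currently acting on the wanted state.
	Active bool
}

// Annotations renders the intent as Node annotations so the controller and
// agent can exchange progress through the Kubernetes API.
func (i Intent) Annotations() map[string]string {
	active := "false"
	if i.Active {
		active = "true"
	}
	return map[string]string{
		"thar.amazonaws.com/intent-current": i.Current, // hypothetical keys
		"thar.amazonaws.com/intent-wanted":  i.Wanted,
		"thar.amazonaws.com/intent-active":  active,
	}
}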
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Remaining items