
dogswatch: initial kubernetes operator #239

Merged: 51 commits, Nov 15, 2019

Conversation

@jahkeup (Member) commented Sep 16, 2019

Change

Issue #, if available:

#184 #185 #186

Description of changes:

The dogswatch Kubernetes Operator implements interfaces to the cluster's orchestrator and to Thar itself to coordinate updates as described in #184. This implementation uses labels, annotations, and Kubernetes' SDK primitives to stream updates and post wanted changes, to which the Nodes respond with appropriate actions.

The controller and agent alike concentrate their communicated state and progress into "intents", in which their current state, their "wanted" state, and their activity status are reported to drive a Node's upgrade. Driven Nodes regularly check for updates and, once one is available, post their need for an update by way of labels that cause a controller to target the Node when the controller deems it appropriate to do so. Controllers limit the number of ongoing actions and may implement their own policy dictating how and when Nodes may proceed.
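
To make the "intent" shape concrete, here is a minimal, hypothetical Go sketch of that summarized view; the annotation keys and state names below are illustrative placeholders, not the exact markers dogswatch defines.

package intent

import corev1 "k8s.io/api/core/v1"

const (
	annoState  = "thar.amazonaws.com/state"        // hypothetical key: where the node is now
	annoWanted = "thar.amazonaws.com/wanted-state" // hypothetical key: where the controller wants it
	annoActive = "thar.amazonaws.com/active"       // hypothetical key: whether work is in flight
)

// Intent is the summarized view that both the controller and the agent reason about.
type Intent struct {
	NodeName string
	State    string
	Wanted   string
	Active   string
}

// Given reads an Intent off a Node's annotations.
func Given(node *corev1.Node) Intent {
	a := node.GetAnnotations()
	return Intent{
		NodeName: node.GetName(),
		State:    a[annoState],
		Wanted:   a[annoWanted],
		Active:   a[annoActive],
	}
}

// Realized reports whether the Node has reached the controller's wanted state
// and no action is still in flight.
func (i Intent) Realized() bool {
	return i.State == i.Wanted && i.Active != "true"
}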

As it stands today, the controller does not handle rollback scenarios, nor is it capable of the deeper understanding of an update that would permit it to halt a rollback in the cluster, since update metadata is not yet propagated to it from the requesting Node. These checks, and richer ones, are anticipated to be added over time.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Remaining items

  • Test end-to-end update
  • Additional stubbed "functional tests"
  • Uncordon after successful update

@jahkeup (Member, Author) commented Sep 17, 2019

Updated the branch with fixed, rebased history.

@iliana (Contributor) left a comment:

I can't read Go and don't really know how k8s works, but this looks fine at a glance :)

How does this end up on the system? Are we going to integrate it with the usual spec file builder or does this need something different?

log.WithError(err).Fatalf("agent stopped")
}
}
log.Info("bark bark! 🐕")
Contributor:

ship it

Contributor:

if it's dogswatch should it be 🐕👀

Review thread on workspaces/dogswatch/pkg/agent/agent.go (outdated, resolved)
@zmrow (Contributor) left a comment:

I spent some time reading through some of this and would love to give it a proper code review, but without some deeper knowledge of Kubernetes I can't say whether what you're doing is correct. If you are confident it works, I think we need to go with that for now.

From a code perspective, there are a ton of abstractions here. Future readers would definitely appreciate a bunch more comments and some module level docs around what these abstractions represent in the greater puzzle.

🍰

@jahkeup (Member, Author) commented Sep 18, 2019

From a code perspective, there are a ton of abstractions here. Future readers would definitely appreciate a bunch more comments and some module level docs around what these abstractions represent in the greater puzzle.

Indeed! I was sprinkling in what I was thinking where I had reflective moments, but I do intend to spend time expanding the comments and docstrings, for the exported symbols at a minimum.

Review threads (outdated, resolved):
  • workspaces/dogswatch/pkg/k8sutil/marker.go
  • workspaces/dogswatch/pkg/marker/keys.go
  • workspaces/dogswatch/pkg/marker/values.go
OperatorBuildVersion = OperatorDevelopmentDoNotUseInProduction
)

// OperatorVersion describes compatibility versioning at the Platform level (the
Contributor:

This is a lot of versioning machinery that I'm not convinced will lead to a great experience.

Does this mean that if we break compatibility then customers will need to run one operator per version on their cluster? That's going to be pretty tough to communicate, and if a customer misses the notification then updates will silently stop working.

I'd prefer some form of negotiation where we select the "best" version of the upgrade workflow based on what the node supports.

If this isn't yet implemented I'd recommend dropping it and revisiting it in the form of a new feature.

@jahkeup (Member, Author) replied:

There isn't a hard check on matching version compatibility, though the version is used for selecting Nodes (based on the presence of the label). In my opinion, it is necessary to impart this versioning now and to assert that it's a 1.0.0 in which the appropriate set of label and annotation changes are part of the bound contract. Having it here, even if not necessarily acted on yet, allows us to act on it later.
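
For illustration, a minimal Go sketch (using client-go) of how a controller might select only Nodes carrying the platform-version label; the selection logic in dogswatch itself may differ.

package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// candidateNodes lists only the Nodes that have opted in by carrying the
// platform-version label; Nodes without the label are never targeted.
func candidateNodes(client kubernetes.Interface) ([]string, error) {
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: "thar.amazonaws.com/platform-version=1.0.0",
	})
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(nodes.Items))
	for _, n := range nodes.Items {
		names = append(names, n.Name)
	}
	return names, nil
}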

// NodeName limits the nodestream to a single Node resource with the
// provided name.
NodeName string
// ResyncPeriod is the time between complete resynchronization of the cached
Contributor:

Is this jittered somehow? It sounds expensive.

@jahkeup (Member, Author) replied:

The resync period must be specified for the informer, which will regularly resync itself with the configured resource and selection. I am soliciting k8s feedback/review through other channels - this is one of the points I'm hoping to get a better understanding of.
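
For context, a rough Go sketch of the informer wiring under discussion, using client-go's SharedInformerFactory; the 30-minute period and the handler body are assumptions for illustration, not dogswatch's actual values.

package agent

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func watchNode(client kubernetes.Interface, nodeName string, stop <-chan struct{}) {
	// ResyncPeriod: every 30 minutes the informer replays its cached objects to
	// the registered handlers, even if nothing changed upstream.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		30*time.Minute,
		// NodeName: limit the stream to the single Node this agent runs on.
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("metadata.name", nodeName).String()
		}),
	)

	informer := factory.Core().V1().Nodes().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// React to label/annotation changes on the Node here.
		},
	})

	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}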

Review threads (outdated, resolved):
  • workspaces/dogswatch/pkg/platform/updog/platform.go
  • workspaces/dogswatch/pkg/platform/updog/updog.go
  • workspaces/dogswatch/pkg/intent/intent.go (two threads)
  • workspaces/dogswatch/pkg/controller/kubernetes.go
@tjkirch (Contributor) commented Sep 26, 2019

Talked to @jahkeup, just confirming that the force-push today was only a rebase onto latest development.

@jahkeup force-pushed the dogswatch branch 2 times, most recently from b4a6742 to 7d21b08 on October 1, 2019 00:47
@jahkeup (Member, Author) commented Oct 1, 2019

This'll probably need some incremental changes on top of it to make sure it's working with the changes in bottlerocket-os/bottlerocket-update-operator#16 and to actually act on an update (there are no updates right now, and the commands aren't there). Otherwise, these changes can be reviewed at this point, though a handful of earlier feedback items are still pending fixes that were punted.

@jahkeup requested a review from patraw on October 3, 2019
@patraw removed their request for review on October 3, 2019
@jahkeup requested a review from patraw and removed the request for jhaynes on October 3, 2019
@patraw requested a review from jhaynes on October 3, 2019
@jahkeup removed the request for review from jhaynes on October 4, 2019
@jahkeup (Member, Author) commented Oct 5, 2019

We'll still want eyes on this before merging. The testing requires an update starting point with the updated interface as called by the updog integration. Otherwise, additional testing and fixes have been added - these uncovered some unhandled scenarios, which are now fixed!

I would encourage folks to take a look at this code with a careful eye around the event handling and at the TODOs sprinkled around - this is a fit-and-trim implementation and has some MVP tint to it.


For those wanting to try this out:

I'll assume you have an ECR repository, a k8s cluster (with the ECR credential helper set up), and kubectl configured locally:

# Set the registry you're using
DOCKER_IMAGE="$account_id.dkr.ecr.us-west-2.amazonaws.com/dogswatch"
# Build the container image
make DOCKER_IMAGE="$DOCKER_IMAGE" container
# Push it to ECR
$(aws --region us-west-2 ecr get-login --registry-id $account_id --no-include-email)
docker push "$DOCKER_IMAGE"
# Deploy to Kubernetes - the resources are created in the "thar" namespace.
make DOCKER_IMAGE="$DOCKER_IMAGE" deploy
# For each node you want to run dogswatch on:
kubectl label node $NODE_NAME thar.amazonaws.com/platform-version=1.0.0
# Check it out!
kubectl get -n thar deployments,daemonsets,pods

@jahkeup (Member, Author) commented Oct 5, 2019

There's also some config and targets in there for https://github.com/kubernetes-sigs/kind if anyone wants to skip launching a cluster - I can't rattle the commands off the top of my head at the moment.

@patraw (Contributor) left a comment:

After some struggle and learning I managed to get a full e2e test working consistently. Most of the issues I encountered were not related to this PR. I made some small changes to get the tests working, which I put in the dogswatch-test branch. These changes contain hacky communication with updog until a standard interface is agreed upon, which is discussed in #184. The code itself has some sharp edges, but it works. And I also agree with the decision to use Go.

LGTM!

Review threads:
  • extras/dogswatch/Makefile (outdated, resolved)
  • extras/dogswatch/pkg/intent/intent.go (resolved)
  • extras/dogswatch/pkg/k8sutil/marker.go (outdated, resolved)
// Host is the integration for this platform and the host - this is a very light
// mapping between the platform requirements and the implementation backing the
// interaction itself.
type Host interface {
Contributor:

This interface and the Platform interface are almost exactly the same, and I feel like they could be collapsed into a single layer. Open to discussion.

@jahkeup (Member, Author) replied:

This was another area that was knowingly complected :( There are a few places where we want to be able to insert implementations for testing as well as for altered cluster behavior (ex: a "chaotic" mode where clusters bounce near-continuously). I'm not keen on collapsing these, as we'd lose the ability to insert those implementations - though, as we discussed before, I'm game to unexport some of these to reduce the surface area of the API itself. Some of them probably shouldn't have been exported to begin with (thanks to Me!) 😝
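
To illustrate the injection point being defended here, a hypothetical Go sketch of the layering; the method sets below are invented for this example, not dogswatch's actual interfaces.

package platform

import (
	"errors"
	"math/rand"
)

// Update identifies an available update; the fields are placeholders.
type Update struct{ ID string }

// Host abstracts the node-local update tooling (e.g. updog).
type Host interface {
	ListAvailable() ([]Update, error)
	BootUpdate(id string) error
}

// Platform is the orchestrator-facing layer the agent drives. Keeping it
// separate from Host lets tests stub either layer independently.
type Platform interface {
	Update() error
}

// chaoticHost wraps a real Host and randomly injects failures, exercising the
// agent's error paths without touching the underlying system ("chaotic" mode).
type chaoticHost struct{ real Host }

func (c *chaoticHost) ListAvailable() ([]Update, error) { return c.real.ListAvailable() }

func (c *chaoticHost) BootUpdate(id string) error {
	if rand.Intn(2) == 0 {
		return errors.New("chaos: simulated boot failure")
	}
	return c.real.BootUpdate(id)
}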

@patraw force-pushed the dogswatch branch 2 times, most recently from 895cf95 to 7cffc98 on November 5, 2019 18:35
* Construct a view on the policy that's summarized for the policy check to act on. There are some issues that may be introduced with this approach and need to be better formed; however, I think this will get this functionality off the ground.
* We can't yet check for a usable configured state from updog; that'll have to come later. For now, checking that it exists on disk is enough to convince ourselves that we're in the right place.
* This allows for the testing that's now present.
@jahkeup (Member, Author) commented Nov 12, 2019

Force-pushed with rebase on current develop. This accompanies a new README.md and small edits as pointed out in earlier review.

(Matt says) I think this comment used to point to: bottlerocket-os/bottlerocket-update-operator#15

@jahkeup (Member, Author) left a comment:

Looks good, thanks for the additions @patraw! There are a couple of nits and a small change that I'd like to see in place before we merge this - namely the stub ID, to avoid confusion in code and by humans.

Review threads on extras/dogswatch/pkg/platform/updog/updog.go (five, all outdated and resolved)
var buf bytes.Buffer
writer := bufio.NewWriter(&buf)
cmd.Stdout = writer
cmd.Stderr = writer
@jahkeup (Member, Author):

The human-oriented messages and machine-oriented messages are split between these two FDs, so we probably shouldn't combine them. I think we can punt on this for now, but once deserialization comes into play, these will definitely need to be split.
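
A minimal Go sketch of the eventual split, with machine-oriented stdout captured separately from human-oriented stderr; the updog subcommand shown is illustrative, not necessarily the one dogswatch invokes.

package updog

import (
	"bytes"
	"fmt"
	"os/exec"
)

func checkUpdate() error {
	var stdout, stderr bytes.Buffer

	cmd := exec.Command("updog", "check-update")
	cmd.Stdout = &stdout // machine-oriented: deserialize this later
	cmd.Stderr = &stderr // human-oriented: surface this in logs

	if err := cmd.Run(); err != nil {
		return fmt.Errorf("updog failed: %v (stderr: %s)", err, stderr.String())
	}
	fmt.Printf("stdout (machine-readable): %s\n", stdout.String())
	return nil
}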

* Add logging to command execution.
* Increase, temporarily, memory of running pod.
* Update ListUpdate logic.