Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block RHCOS gcp-routes service on both masters and workers #3619

Merged

Conversation

danwinship
Copy link
Contributor

@danwinship danwinship commented Mar 20, 2023

RHCOS includes a gcp-routes.service that is (allegedly?) needed on the bootstrap host, but which we replace with a better version on the masters. However, we were still leaving it running on workers, which should be a no-op, but is actually theoretically bad because of bugs / unwanted design features in the script. (https://issues.redhat.com/browse/OCPBUGSM-45189)

@openshift-ci openshift-ci bot requested review from cgwalters and sinnykumari March 20, 2023 16:41
@cgwalters
Copy link
Member

WDYT about moving gcp-routes into something managed by https://github.com/openshift/cluster-network-operator ?

@cgwalters
Copy link
Member

Oh wow, I hadn't realized we had duplicated this. I think we can just nuke the script from RHCOS right?

@danwinship
Copy link
Contributor Author

danwinship commented Mar 21, 2023

Oh wow, I hadn't realized we had duplicated this. I think we can just nuke the script from RHCOS right?

From what I can tell, it was left behind on purpose because we believed that it was necessary to run it on the bootstrap host. However, the azure-routes and alibaba-routes services exist only in MCO and not in RHCOS, and it's not clear to me why gcp-routes would be needed on the bootstrap host but azure-routes and alibaba-routes wouldn't. [EDIT: See comment below]

But yeah, probably inheriting this from RHCOS and then disabling it everywhere except the bootstrap host is not the best idea...

How could we test this? There is no simple way to build an OCP image which will use a patched version of RHCOS at bootstrap time is there?

WDYT about moving gcp-routes into something managed by https://github.com/openshift/cluster-network-operator ?

OK, so a few parts to that:

  • The code currently has ambiguous ownership and the network team may not want to resolve the ambiguity. 🙂 ({gcp,azure,alibaba}-routes is only needed for the manually-created apiserver loadbalancer, not for ordinary Kubernetes-managed loadbalancers.) OTOH it's true that we wrote most of the current MCO version of the code, and while we don't really understand it that well, I don't think any other team does either.
  • That said, even if you assume network team ownership of the code, it is functionally part of the apiserver loadbalancer, so it would make more sense to have it live with the other apiserver-loadbalancer-managing code, which is not in CNO. (Part of that is in the installer but I'm not sure what manages the loadbalancer post-install-time.)
  • Even if we wanted to have CNO manage it, we can't currently, because CNO has no install-time/bootstrap involvement, so we can't plausibly have it create a MachineConfig for the masters. There's been discussion about fixing that in the past (eg so CNO could take over ovs-configuration) but it's not currently planned. (SDN-1615)

@cgwalters
Copy link
Member

How could we test this? There is no simple way to build an OCP image which will use a patched version of RHCOS at bootstrap time is there?

It's all very doable with https://github.com/coreos/coreos-assembler then use the override env var for the installer.

That said, I think if we want to just test this being disabled on the bootstrap, it's a lot easier to patch the Ignition generated by openshift-install for the bootstrap node.

@danwinship
Copy link
Contributor Author

That said, I think if we want to just test this being disabled on the bootstrap, it's a lot easier to patch the Ignition generated by openshift-install for the bootstrap node.

OK, done. It fails. So it is needed on bootstrap. So... should we leave it in RHCOS and merge this PR to disable it in OCP, or would it be better to remove it from RHCOS and get it onto the bootstrap via some other means (something in the installer I guess?)

@cgwalters
Copy link
Member

cgwalters commented Mar 22, 2023

Thanks for testing that. Here's my thoughts:

  • Let's merge this PR
  • There's general agreement to have some movement on Rework build process to generate rhel-coreos-base distinct from ocp-rhel-coreos os#799 which would totally solve this problem because we'd end up creating e.g. github.com/openshift/node-stuff (perhaps it should have a less boring name) and it would hold this service in one place, and that one place would also appear on the bootstrap node too. (Because the bootstrap will start doing an in-place update from a more "pristine OS" to our quay.io/openshift/node:4.12 image). EDIT: Also to be clear, a huge, in fact giant, really EPIC improvement from having a clear github.com/openshift/node-stuff is that code in that repository can be written in Go (or Rust...) and we can stop injecting horrible shell scripts...

So from my PoV,
/approve
I'll leave it to an MCO team member to add lgtm.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 22, 2023
@danwinship
Copy link
Contributor Author

/retest-required

@danwinship
Copy link
Contributor Author

/retest-required

(I've confirmed with cluster-bot that the PR works. Tests are just being flaky...)

@danwinship
Copy link
Contributor Author

/retest

@cgwalters
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 6, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD c748a0f and 2 for PR HEAD e9bccbf in total

@danwinship
Copy link
Contributor Author

/retest-required

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD b0c3b2b and 1 for PR HEAD e9bccbf in total

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 6, 2023

@danwinship: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-gcp-ovn-upgrade e9bccbf link false /test okd-scos-e2e-gcp-ovn-upgrade
ci/prow/e2e-hypershift e9bccbf link false /test e2e-hypershift
ci/prow/e2e-alibabacloud-ovn e9bccbf link false /test e2e-alibabacloud-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@danwinship
Copy link
Contributor Author

/skip

@danwinship
Copy link
Contributor Author

/retest-required

@openshift-merge-robot openshift-merge-robot merged commit 580aa60 into openshift:master Apr 7, 2023
@danwinship danwinship deleted the gcp-routes-bootstrap-only branch April 7, 2023 19:37
@danwinship
Copy link
Contributor Author

However, the azure-routes and alibaba-routes services exist only in MCO and not in RHCOS, and it's not clear to me why gcp-routes would be needed on the bootstrap host but azure-routes and alibaba-routes wouldn't.

(After digging into this some more...)

It turns out that azure-routes and alibaba-routes are solving a slightly different problem; the Azure/Alibaba LBs do DNAT, but they don't SNAT, so they mishandle hairpin traffic. So azure-routes/alibaba-routes just add rules so that clients on a master connecting to the apiserver LB get redirected to the local apiserver rather than using the LB.

So, if they aren't needed on the bootstrap host, then it must be the case that there are no clients running on the bootstrap host that use the apiserver LB?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants