OCPBUGS-25406: Azure Run ovs-configuration.service before dnsmasq.service #4087
Conversation
@sdodson: This pull request references Jira Issue OCPBUGS-25406, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@sdodson: This pull request references Jira Issue OCPBUGS-25406, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact.
@rioliu-rh could you take a look at this? As Scott mentioned, there are some contacts to collaborate with for verification.
@@ -8,7 +8,7 @@ contents: |
       Wants=NetworkManager-wait-online.service
       After=firstboot-osupdate.target
       After=NetworkManager-wait-online.service openvswitch.service network.service nodeip-configuration.service
-      Before=kubelet-dependencies.target node-valid-hostname.service
+      Before=kubelet-dependencies.target node-valid-hostname.service dnsmasq.service
As long as this does not break any of the e2e tests (and maybe we should run some payloads?), this makes sense.
Sure, I'll run the same payload jobs that ran on the original change as soon as the standard pre-submits complete.
To reduce potential impact, would you consider it acceptable if the Before=dnsmasq.service ordering only applies to Azure? i.e. add this to the Azure templates instead?
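For illustration, a minimal sketch of what such an Azure-only ordering could look like as a node-level systemd drop-in; the drop-in path and file name here are assumptions, not what the PR implements (the actual change ends up in the Azure template, per the follow-up below):

# Sketch only: add the ordering via a drop-in on ovs-configuration.service,
# so the shared unit definition stays untouched on other platforms.
mkdir -p /etc/systemd/system/ovs-configuration.service.d
tee /etc/systemd/system/ovs-configuration.service.d/10-before-dnsmasq.conf <<'EOF' >/dev/null
[Unit]
# Make sure br-ex/OVS configuration completes before ARO's dnsmasq starts.
Before=dnsmasq.service
EOF
systemctl daemon-reload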
OK, I was originally concerned that it would introduce a risk of divergent changes, but it looks like this file is pretty stable: it hasn't changed in three years. Updated the PR.
/test e2e-gcp-op-single-node
/payload 4.16 nightly blocking
@sdodson: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b7ea8bc0-a99f-11ee-8308-3b168ea866a2-0
ARO doesn't have a way to test 4.14 builds at installation time, so I just tested against the latest 4.13 stream + MCO image override on upgrade.
The above worked, and we saw the expected behavior. We also tested the change by making a drop-in on the unit. @ventifus, if you have anything to add from your testing please do. LGTM otherwise.
LGTM. I can confirm that dnsmasq starts later as expected; it now starts as soon as ovs-configuration completes.
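One way to spot-check that ordering from a node's journal, assuming both units ran during the current boot:

# Compare the start/finish timestamps of the two units within the same boot.
journalctl -b -u ovs-configuration.service -u dnsmasq.service --no-pager | grep -E 'Starting|Started|Finished'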
/hold
Installed ARO cluster:

$ oc get infrastructures/cluster -o yaml | yq -y '.status'
apiServerInternalURI: https://api-int.m64e0l2i.northcentralus.aroapp.io:6443
apiServerURL: https://api.m64e0l2i.northcentralus.aroapp.io:6443
controlPlaneTopology: HighlyAvailable
cpuPartitioning: None
etcdDiscoveryDomain: ''
infrastructureName: rioliu-0103a-8m6ml
infrastructureTopology: HighlyAvailable
platform: Azure
platformStatus:
azure:
cloudName: AzurePublicCloud
networkResourceGroupName: rioliu-0103a-rg
resourceGroupName: aro-m64e0l2i
type: Azure
$ oc get clusterversion version -o yaml | yq '.status.history[]|[.version, .state]'
[
"4.15.0-0.ci.test-2024-01-03-020219-ci-ln-kfz9szk-latest",
"Completed"
]
[
"4.14.7",
"Completed"
]
[
"4.13.27",
"Completed"
]
[
"4.12.25",
"Completed"
]
[
"4.12.25",
"Completed"
]

Check status of ovs-configuration.service:

sh-5.1# systemctl status ovs-configuration.service
○ ovs-configuration.service - Configures OVS with proper host networking configuration
Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; preset: disabled)
Active: inactive (dead) since Wed 2024-01-03 06:47:19 UTC; 1h 0min ago
Main PID: 1461 (code=exited, status=0/SUCCESS)
CPU: 2.378s
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[2363]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[2363]: fe80::/64 dev enP29997s1 proto kernel metric 1024 pref medium
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[2363]: fe80::/64 dev br-ex proto kernel metric 1024 pref medium
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[2364]: +++ dirname /tmp/configure-ovs-2024-01-03-06-47-19-CJL9xpL0Ul
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1463]: ++ rm -rf '/tmp/configure-ovs-*'
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1463]: ++ exit 0
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1461]: + e=0
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1461]: + trap handle_termination INT TERM
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1461]: + '[' 0 -eq 0 ']'
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1461]: + exit 0
sh-5.1# systemctl status dnsmasq.service
● dnsmasq.service - DNS caching server.
Loaded: loaded (/etc/systemd/system/dnsmasq.service; enabled; preset: disabled)
Active: active (running) since Wed 2024-01-03 06:47:19 UTC; 1h 1min ago
Main PID: 2446 (dnsmasq)
Tasks: 1 (limit: 101686)
Memory: 1.8M
CPU: 249ms
CGroup: /system.slice/dnsmasq.service
└─2446 /usr/sbin/dnsmasq -k
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h bash[2387]: IP6.ADDRESS[1]: fe80::4478:eda5:9ef1:2b6b/64
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h bash[2387]: IP6.GATEWAY: --
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h bash[2387]: IP6.ROUTE[1]: dst = fe80::/64, nh = ::, mt = 1024
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h bash[2376]: OVN mode - br-ex device exists
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h systemd[1]: Starting DNS caching server....
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h dnsmasq[2446]: started, version 2.85 cache disabled
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h dnsmasq[2446]: compile time options: IPv6 GNU-getopt DBus no-UBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth cryptohash DNSSEC loop-detect inotify dumpfile
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h dnsmasq[2446]: reading /etc/resolv.conf.dnsmasq
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h dnsmasq[2446]: using nameserver 168.63.129.16#53
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h systemd[1]: Started DNS caching server..

Check systemd unit dependencies of ovs-configuration.service:

sh-5.1# systemctl show ovs-configuration.service | grep -i before
Before=node-valid-hostname.service shutdown.target kubelet-dependencies.target network-online.target dnsmasq.service
sh-5.1# systemctl list-dependencies --before ovs-configuration
ovs-configuration.service
● ├─dnsmasq.service
● ├─node-valid-hostname.service
● ├─kubelet-dependencies.target
● │ ├─crio.service
● │ ├─kubelet.service
○ │ └─shutdown.target
○ │ └─final.target
● ├─network-online.target
● │ ├─crio.service
● │ ├─dnsmasq.service
○ │ ├─iscsi.service
○ │ ├─iscsid.service
● │ ├─kubelet-auto-node-size.service
○ │ ├─nfs-mountd.service
○ │ ├─nfs-server.service
○ │ ├─rc-local.service
● │ ├─rpc-statd-notify.service
● │ ├─rpc-statd.service
● │ ├─kubelet-dependencies.target
● │ │ ├─crio.service
● │ │ ├─kubelet.service
○ │ │ └─shutdown.target
○ │ │ └─final.target
○ │ └─shutdown.target
○ │ └─final.target
○ └─shutdown.target
○ └─final.target
sh-5.1# systemctl list-dependencies --after dnsmasq.service
dnsmasq.service
○ ├─ovs-configuration.service
● ├─system.slice
● ├─systemd-journald.socket
...
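A small convenience loop (not part of the original verification) for confirming the same ordering on every node, reusing the oc debug pattern that appears later in this thread:

# Assumes cluster-admin access from a workstation with oc configured.
for node in $(oc get nodes -o name); do
  echo "== ${node}"
  oc debug -q "${node}" -- chroot /host systemctl show ovs-configuration.service -p Before
done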
Scale machineset to provision a new node:

$ oc scale --replicas=4 machineset.machine.openshift.io/rioliu-0103a-8m6ml-worker-northcentralus -n openshift-machine-api
machineset.machine.openshift.io/rioliu-0103a-8m6ml-worker-northcentralus scaled

New node can be provisioned and joins the cluster successfully:

$ oc get machine.machine.openshift.io -n openshift-machine-api --sort-by metadata.creationTimestamp
NAME PHASE TYPE REGION ZONE AGE
rioliu-0103a-8m6ml-master-0 Running Standard_D8s_v3 northcentralus 5h39m
rioliu-0103a-8m6ml-master-1 Running Standard_D8s_v3 northcentralus 5h39m
rioliu-0103a-8m6ml-master-2 Running Standard_D8s_v3 northcentralus 5h39m
rioliu-0103a-8m6ml-worker-northcentralus-g9t8h Running Standard_D4s_v3 northcentralus 5h36m
rioliu-0103a-8m6ml-worker-northcentralus-llc9g Running Standard_D4s_v3 northcentralus 5h36m
rioliu-0103a-8m6ml-worker-northcentralus-lt2gw Running Standard_D4s_v3 northcentralus 5h36m
rioliu-0103a-8m6ml-worker-northcentralus-5k48j Running Standard_D4s_v3 northcentralus 9m46s
$ oc get node --sort-by metadata.creationTimestamp
NAME STATUS ROLES AGE VERSION
rioliu-0103a-8m6ml-master-1 Ready control-plane,master 5h38m v1.28.4+7aa0a74
rioliu-0103a-8m6ml-master-2 Ready control-plane,master 5h38m v1.28.4+7aa0a74
rioliu-0103a-8m6ml-master-0 Ready control-plane,master 5h38m v1.28.4+7aa0a74
rioliu-0103a-8m6ml-worker-northcentralus-lt2gw Ready worker 5h29m v1.28.4+7aa0a74
rioliu-0103a-8m6ml-worker-northcentralus-g9t8h Ready worker 5h29m v1.28.4+7aa0a74
rioliu-0103a-8m6ml-worker-northcentralus-llc9g Ready worker 5h28m v1.28.4+7aa0a74
rioliu-0103a-8m6ml-worker-northcentralus-5k48j Ready worker 3m27s v1.28.4+7aa0a74
$ oc get node/rioliu-0103a-8m6ml-worker-northcentralus-5k48j -o yaml | yq -y '.metadata.annotations' | grep machine
machine.openshift.io/machine: openshift-machine-api/rioliu-0103a-8m6ml-worker-northcentralus-5k48j
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-c217ff932861b35333e7adaae434a04e
machineconfiguration.openshift.io/desiredConfig: rendered-worker-c217ff932861b35333e7adaae434a04e
machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-c217ff932861b35333e7adaae434a04e
machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-c217ff932861b35333e7adaae434a04e
machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: '192916'
machineconfiguration.openshift.io/reason: ''
machineconfiguration.openshift.io/state: Done
Run critical cases against the cluster installed with the CI image:

sh-5.1# systemctl status dnsmasq
○ dnsmasq.service - DNS caching server.
Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; disabled; preset: disabled)
Active: inactive (dead)
sh-5.1#
sh-5.1# systemctl status ovs-configuration
○ ovs-configuration.service - Configures OVS with proper host networking configuration
Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; preset: disabled)
Active: inactive (dead) since Wed 2024-01-03 08:20:23 UTC; 28min ago
Main PID: 1277 (code=exited, status=0/SUCCESS)
CPU: 306ms
sh-5.1# systemctl show ovs-configuration | grep -i before
Before=dnsmasq.service kubelet-dependencies.target node-valid-hostname.service shutdown.target

01-03 16:24:17.721 The Case Execution Summary:
01-03 16:24:17.721 PASS OCP-42347 Author:rioliu health check for machine-config-operator [Serial]
01-03 16:24:17.721 PASS OCP-42361 Author:rioliu add chrony systemd config [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-42364 Author:mhanss add selinux kernel argument [Disruptive] [Serial]
01-03 16:24:17.721 SKIP OCP-42365 Author:mhanss add real time kernel argument [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-42367 Author:mhanss add extension to RHCOS [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-42368 Author:mhanss add max pods to the kubelet config [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-42369 Author:mhanss add container runtime config [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-42390 Author:rioliu add machine config without ignition version. Block the MCO upgrade rollout if any of the pools are Degraded [Serial]
01-03 16:24:17.721 PASS OCP-42438 Author:mhanss add journald systemd config [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-43048 Author:mhanss create/delete custom machine config pool [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-43064 Author:mhanss create/delete custom machine config pool [Disruptive] [Serial]
01-03 16:24:17.721 SKIP OCP-43310 Author:mhanss add kernel arguments, kernel type and extension to the RHCOS and RHEL [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-45318 Author:rioliu add machine config without ignition version. Block the MCO upgrade rollout if any of the pools are Degraded [Serial]
01-03 16:24:17.721 SKIP OCP-53960 Author:sregidor No failed units in the bootstrap machine
01-03 16:24:17.721 PASS OCP-54085 Author:sregidor Update osImage changing /etc /usr and rpm [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-54159 Author:sregidor Apply a new osImage on a cluster with already installed rpms [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-56131 Author:sregidor Install all extensions [Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-62084 Author:sregidor Certificate rotation in paused pools[Disruptive] [Serial]
01-03 16:24:17.721 PASS OCP-64781 Author:sregidor MCO should be compliant with CIS benchmark rule
01-03 16:24:17.721 PASS OCP-67395 Author:sregidor rotate kubernetes certificate authority. Certificates managed via non-MC path.[Disruptive] [Serial]
01-03 16:24:17.721 SKIP OCP-67790 Author:sregidor create MC with extensions, 64k-pages kernel type and kernel argument [Disruptive] [Serial]
01-03 16:24:17.721 SKIP OCP-68695 Author:rioliu MCO should not be degraded when image-registry is not installed [Serial]
01-03 16:24:17.721 SKIP OCP-69184 Author:rioliu Enable feature gate MachineConfigNodes [Serial]
01-03 16:24:17.721 PASS OCP-70090 Author:rioliu apiserver-url.env file can be created on all cluster nodes [Serial]

No failure found.
/unhold
/label qe-approved
@sdodson: This pull request references Jira Issue OCPBUGS-25406, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
/payload 4.16 nightly blocking
@sdodson: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/454243a0-aa45-11ee-90ed-f96768b166ae-0
/payload 4.16 nightly blocking
@sdodson: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1427ca50-aa7d-11ee-98ff-23e48ca9acef-0
ARO's dnsmasq.service, if present, must be run after ovs-configuration. This could, of course, be fixed in that service's definition, but doing so is not operationally easy at this point. There is a potential risk that some other dnsmasq.service implementation needs to run before ovs-configuration, and this change would break it. Given the temporal proximity to the breaking change, let's go ahead and make this change. We should also ensure that ARO makes their service more robust in the future.
Force-pushed 8f78ef8 to d638d96.
Set up a cluster on AWS:

$ oc get infrastructures/cluster -o jsonpath='{.status}' | jq
{
"apiServerInternalURI": "https://api-int.ci-ln-h48wb7b-76ef8.origin-ci-int-aws.dev.rhcloud.com:6443",
"apiServerURL": "https://api.ci-ln-h48wb7b-76ef8.origin-ci-int-aws.dev.rhcloud.com:6443",
"controlPlaneTopology": "HighlyAvailable",
"cpuPartitioning": "None",
"etcdDiscoveryDomain": "",
"infrastructureName": "ci-ln-h48wb7b-76ef8-b2xgc",
"infrastructureTopology": "HighlyAvailable",
"platform": "AWS",
"platformStatus": {
"aws": {
"region": "us-west-1"
},
"type": "AWS"
}
}
sh-5.1# systemctl show ovs-configuration | grep -i before
Before=node-valid-hostname.service shutdown.target kubelet-dependencies.target
sh-5.1#
sh-5.1# systemctl list-dependencies --before ovs-configuration
ovs-configuration.service
● ├─node-valid-hostname.service
● ├─kubelet-dependencies.target
● │ ├─crio.service
● │ ├─kubelet.service
○ │ └─shutdown.target
○ │ └─final.target
○ └─shutdown.target
○ └─final.target
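As expected, the ordering edge only shows up where ARO's own dnsmasq unit is installed and enabled; elsewhere the stock unit stays disabled, as in the status output earlier in this thread. A quick hedged check (the node name is a placeholder):

# is-enabled exits non-zero for a disabled unit, hence the trailing || true.
oc debug -q node/<worker-node> -- chroot /host systemctl is-enabled dnsmasq.service || true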
Verified the same scenario on an ARO cluster: a new machine can be provisioned and the node joins the cluster.

$ oc debug -q node/rioliu-0104a-6977r-worker-northcentralus-42s8h -- chroot /host systemctl show ovs-configuration | grep Before
Before=kubelet-dependencies.target node-valid-hostname.service dnsmasq.service shutdown.target
$ oc debug -q node/rioliu-0104a-6977r-worker-northcentralus-42s8h -- chroot /host systemctl list-dependencies --before ovs-configuration
ovs-configuration.service
● ├─dnsmasq.service
● ├─node-valid-hostname.service
● ├─kubelet-dependencies.target
● │ ├─crio.service
● │ ├─kubelet.service
○ │ └─shutdown.target
○ │ └─final.target
○ └─shutdown.target
○ └─final.target
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cdoern, sdodson, yuqi-zhang

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest-required
@sdodson: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
/override ci/prow/e2e-gcp-op-single-node
@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-gcp-op-single-node
/cherry-pick release-4.15
@sdodson: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.
Merged 854a866 into openshift:master.
@sdodson: Jira Issue OCPBUGS-25406: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-25406 has been moved to the MODIFIED state.
@sdodson: new pull request created: #4101
[ART PR BUILD NOTIFIER] This PR has been included in build openshift-proxy-pull-test-container-v4.16.0-202401052011.p0.g854a866.assembly.stream for distgit openshift-proxy-pull-test.
ARO's dnsmasq.service, if present, must be run after ovs-configuration. This could, of course, be fixed in that service's definition, but doing so is not operationally easy at this point. There is a potential risk that some other dnsmasq.service implementation needs to run before ovs-configuration, and this change would break it. Given the temporal proximity to the breaking change in #3865, let's go ahead and make this change.
We should also ensure that ARO makes their service more robust in the future.
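One hedged idea for that hardening, as an illustration only and not something this PR implements: ARO's own unit could declare the ordering from its side, for example via a drop-in (path and file name are assumptions):

# Illustrative drop-in on the dnsmasq side so the dependency lives with ARO's unit.
mkdir -p /etc/systemd/system/dnsmasq.service.d
tee /etc/systemd/system/dnsmasq.service.d/10-after-ovs-configuration.conf <<'EOF' >/dev/null
[Unit]
After=ovs-configuration.service
EOF
systemctl daemon-reload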
Fixes: OCPBUGS-25406
- What I did
Added dnsmasq.service to the list of units that ovs-configuration.service should run before.
- How to verify it
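A minimal check, mirroring the QE verification in this thread (the node name is a placeholder for an Azure/ARO worker):

# dnsmasq.service should appear in the unit's Before= ordering...
oc debug -q node/<azure-worker> -- chroot /host systemctl show ovs-configuration.service | grep -i '^Before='
# ...and among the units that ovs-configuration is ordered before.
oc debug -q node/<azure-worker> -- chroot /host systemctl list-dependencies --before ovs-configuration.service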
- Description for the changelog
Run ovs-configuration.service before dnsmasq.service