Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-25406: Azure Run ovs-configuration.service before dnsmasq.service #4087

Merged
merged 1 commit into from
Jan 5, 2024

Conversation

sdodson
Copy link
Member

@sdodson sdodson commented Jan 2, 2024

ARO's dnsmasq.service, if present, must be run after ovs-configuration. This, could of course be fixed in that service's definition but doing so is not operationally easy to do at this point. There is potential risk that some other dnsmasq.service implementation needs to be run before ovs-configuration and this change would break them. Given the temporal proximity to the breaking change in #3865 lets go ahead and make this.

We should also ensure that ARO makes their service more robust in the future.

Fixes: OCPBUGS-25406

- What I did
Added dnsmasq.service to the list of units that ovs-configuration.service should run before.

- How to verify it

  1. Regression test in normal OCP
  2. Confirm in ARO that this addresses OCPBUGS-25406, QE should consult with ARO @rogbas and @bennerv good points of contact.

- Description for the changelog
Run ovs-configuration.service before dnsmasq.service

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 2, 2024
@openshift-ci-robot
Copy link
Contributor

@sdodson: This pull request references Jira Issue OCPBUGS-25406, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

ARO's dnsmasq.service, if present, must be run after ovs-configuration. This, could of course be fixed in that service's definition but doing so is not operationally easy to do at this point. There is potential risk that some other dnsmasq.service implementation needs to be run before ovs-configuration and this change would break them. Given the temporal proximity to the breaking change in #3865 lets go ahead and make this.

We should also ensure that ARO makes their service more robust in the future.

Fixes: OCPBUGS-25406

- What I did
Added dnsmasq.service to the list of units that ovs-configuration.service should run before.

- How to verify it

  1. Regression test in normal OCP
  2. Confirm in ARO that this addresses OCPBUGS-25406, QE should consult with ARO @rogbas and @bennerv good points of contact.

- Description for the changelog
Run ovs-configuration.service before dnsmasq.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from jkyros and yuqi-zhang January 2, 2024 15:24
@sdodson
Copy link
Member Author

sdodson commented Jan 2, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 2, 2024
@openshift-ci-robot
Copy link
Contributor

@sdodson: This pull request references Jira Issue OCPBUGS-25406, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from sergiordlr January 2, 2024 15:25
@sdodson
Copy link
Member Author

sdodson commented Jan 2, 2024

/uncc @jkyros
/cc @cdoern

@openshift-ci openshift-ci bot requested review from cdoern and removed request for jkyros January 2, 2024 15:29
@cdoern
Copy link
Contributor

cdoern commented Jan 2, 2024

@rioliu-rh could you take a look at this? As scott mentioned, there are some contacts to collab with for verification.

@@ -8,7 +8,7 @@ contents: |
Wants=NetworkManager-wait-online.service
After=firstboot-osupdate.target
After=NetworkManager-wait-online.service openvswitch.service network.service nodeip-configuration.service
Before=kubelet-dependencies.target node-valid-hostname.service
Before=kubelet-dependencies.target node-valid-hostname.service dnsmasq.service
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as this does not break any of the e2e tests (and maybe we should run some payloads?) this makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll run the same payload jobs that ran on the original change as soon as the standard pre-submits complete.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to reduce potential impact, would you consider it acceptable if the Before dnsmasq.service only applies to Azure? i.e. add this to Azure templates instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I was originally concerned that it would introduce risk of divergent changes, but looks like this file is pretty stable as it hasn't changed in three years. Updated the PR.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 2, 2024
@sdodson
Copy link
Member Author

sdodson commented Jan 2, 2024

/test e2e-gcp-op-single-node

@sdodson
Copy link
Member Author

sdodson commented Jan 2, 2024

/payload 4.16 nightly blocking

Copy link
Contributor

openshift-ci bot commented Jan 2, 2024

@sdodson: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

  • periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b7ea8bc0-a99f-11ee-8308-3b168ea866a2-0

@bennerv
Copy link

bennerv commented Jan 2, 2024

ARO doesn't have a way to test 4.14 builds at installation time, so i just tested against the latest 4.13 stream + MCO image override on upgrade.

  1. cherry-pick'd this commit into release-4.13 of MCO.
  2. build MCO images
  3. built new OCP release image with overridden MCO image
  4. install 4.13.23 ARO cluster
  5. force upgrade to new OCP release image built in step 3
  6. scale up new node

The above worked, and saw the dnsmasq.service in the Before directive on the ovs-configuration.service systemd unit.

We also tested the change by making a drop-in on the ovs-configuration.service on a 4.14 cluster and it worked as well.

@ventifus if you have anything to add in your testing please do. Lgtm otherwise.

@ventifus
Copy link

ventifus commented Jan 2, 2024

LGTM. I can confirm that dnsmasq starts later as expected. It now starts as soon as ovs-configuration.service has completed and all network configuration has finished.

@rioliu-rh
Copy link

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 2, 2024
@rioliu-rh
Copy link

rioliu-rh commented Jan 3, 2024

installed ARO cluster with 4.12.25 and upgraded it to CI image built based on this PR 4.15.0-0.ci.test-2024-01-03-020219-ci-ln-kfz9szk-latest

$ oc get infrastructures/cluster -o yaml | yq -y '.status'
apiServerInternalURI: https://api-int.m64e0l2i.northcentralus.aroapp.io:6443
apiServerURL: https://api.m64e0l2i.northcentralus.aroapp.io:6443
controlPlaneTopology: HighlyAvailable
cpuPartitioning: None
etcdDiscoveryDomain: ''
infrastructureName: rioliu-0103a-8m6ml
infrastructureTopology: HighlyAvailable
platform: Azure
platformStatus:
  azure:
    cloudName: AzurePublicCloud
    networkResourceGroupName: rioliu-0103a-rg
    resourceGroupName: aro-m64e0l2i
  type: Azure

$ oc get clusterversion version -o yaml | yq '.status.history[]|[.version, .state]'
[
  "4.15.0-0.ci.test-2024-01-03-020219-ci-ln-kfz9szk-latest",
  "Completed"
]
[
  "4.14.7",
  "Completed"
]
[
  "4.13.27",
  "Completed"
]
[
  "4.12.25",
  "Completed"
]
[
  "4.12.25",
  "Completed"
]

check status of ovs-configuration.service and dnsmasq.service

sh-5.1# systemctl status ovs-configuration.service
○ ovs-configuration.service - Configures OVS with proper host networking configuration
     Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; preset: disabled)
     Active: inactive (dead) since Wed 2024-01-03 06:47:19 UTC; 1h 0min ago
   Main PID: 1461 (code=exited, status=0/SUCCESS)
        CPU: 2.378s

Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[2363]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[2363]: fe80::/64 dev enP29997s1 proto kernel metric 1024 pref medium
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[2363]: fe80::/64 dev br-ex proto kernel metric 1024 pref medium
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[2364]: +++ dirname /tmp/configure-ovs-2024-01-03-06-47-19-CJL9xpL0Ul
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1463]: ++ rm -rf '/tmp/configure-ovs-*'
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1463]: ++ exit 0
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1461]: + e=0
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1461]: + trap handle_termination INT TERM
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1461]: + '[' 0 -eq 0 ']'
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h configure-ovs.sh[1461]: + exit 0

sh-5.1# systemctl status dnsmasq.service
● dnsmasq.service - DNS caching server.
     Loaded: loaded (/etc/systemd/system/dnsmasq.service; enabled; preset: disabled)
     Active: active (running) since Wed 2024-01-03 06:47:19 UTC; 1h 1min ago
   Main PID: 2446 (dnsmasq)
      Tasks: 1 (limit: 101686)
     Memory: 1.8M
        CPU: 249ms
     CGroup: /system.slice/dnsmasq.service
             └─2446 /usr/sbin/dnsmasq -k

Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h bash[2387]: IP6.ADDRESS[1]:                         fe80::4478:eda5:9ef1:2b6b/64
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h bash[2387]: IP6.GATEWAY:                            --
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h bash[2387]: IP6.ROUTE[1]:                           dst = fe80::/64, nh = ::, mt = 1024
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h bash[2376]: OVN mode - br-ex device exists
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h systemd[1]: Starting DNS caching server....
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h dnsmasq[2446]: started, version 2.85 cache disabled
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h dnsmasq[2446]: compile time options: IPv6 GNU-getopt DBus no-UBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth cryptohash DNSSEC loop-detect inotify dumpfile
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h dnsmasq[2446]: reading /etc/resolv.conf.dnsmasq
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h dnsmasq[2446]: using nameserver 168.63.129.16#53
Jan 03 06:47:19 rioliu-0103a-8m6ml-worker-northcentralus-g9t8h systemd[1]: Started DNS caching server..

check systemd units dependency of ovs-configuration.service

sh-5.1# systemctl show ovs-configuration.service | grep -i before
Before=node-valid-hostname.service shutdown.target kubelet-dependencies.target network-online.target dnsmasq.service

sh-5.1# systemctl list-dependencies --before ovs-configuration
ovs-configuration.service
● ├─dnsmasq.service
● ├─node-valid-hostname.service
● ├─kubelet-dependencies.target
● │ ├─crio.service
● │ ├─kubelet.service
○ │ └─shutdown.target
○ │   └─final.target
● ├─network-online.target
● │ ├─crio.service
● │ ├─dnsmasq.service
○ │ ├─iscsi.service
○ │ ├─iscsid.service
● │ ├─kubelet-auto-node-size.service
○ │ ├─nfs-mountd.service
○ │ ├─nfs-server.service
○ │ ├─rc-local.service
● │ ├─rpc-statd-notify.service
● │ ├─rpc-statd.service
● │ ├─kubelet-dependencies.target
● │ │ ├─crio.service
● │ │ ├─kubelet.service
○ │ │ └─shutdown.target
○ │ │   └─final.target
○ │ └─shutdown.target
○ │   └─final.target
○ └─shutdown.target
○   └─final.target

sh-5.1# systemctl list-dependencies --after dnsmasq.service
dnsmasq.service
○ ├─ovs-configuration.service
● ├─system.slice
● ├─systemd-journald.socket
...

scale machineset to provision new node

$ oc scale --replicas=4 machineset.machine.openshift.io/rioliu-0103a-8m6ml-worker-northcentralus -n openshift-machine-api
machineset.machine.openshift.io/rioliu-0103a-8m6ml-worker-northcentralus scaled

new node can be provisioned and joined the cluster successfully

$ oc get machine.machine.openshift.io -n openshift-machine-api --sort-by metadata.creationTimestamp
NAME                                             PHASE     TYPE              REGION           ZONE   AGE
rioliu-0103a-8m6ml-master-0                      Running   Standard_D8s_v3   northcentralus          5h39m
rioliu-0103a-8m6ml-master-1                      Running   Standard_D8s_v3   northcentralus          5h39m
rioliu-0103a-8m6ml-master-2                      Running   Standard_D8s_v3   northcentralus          5h39m
rioliu-0103a-8m6ml-worker-northcentralus-g9t8h   Running   Standard_D4s_v3   northcentralus          5h36m
rioliu-0103a-8m6ml-worker-northcentralus-llc9g   Running   Standard_D4s_v3   northcentralus          5h36m
rioliu-0103a-8m6ml-worker-northcentralus-lt2gw   Running   Standard_D4s_v3   northcentralus          5h36m
rioliu-0103a-8m6ml-worker-northcentralus-5k48j   Running   Standard_D4s_v3   northcentralus          9m46s

$ oc get node --sort-by metadata.creationTimestamp
NAME                                             STATUS   ROLES                  AGE     VERSION
rioliu-0103a-8m6ml-master-1                      Ready    control-plane,master   5h38m   v1.28.4+7aa0a74
rioliu-0103a-8m6ml-master-2                      Ready    control-plane,master   5h38m   v1.28.4+7aa0a74
rioliu-0103a-8m6ml-master-0                      Ready    control-plane,master   5h38m   v1.28.4+7aa0a74
rioliu-0103a-8m6ml-worker-northcentralus-lt2gw   Ready    worker                 5h29m   v1.28.4+7aa0a74
rioliu-0103a-8m6ml-worker-northcentralus-g9t8h   Ready    worker                 5h29m   v1.28.4+7aa0a74
rioliu-0103a-8m6ml-worker-northcentralus-llc9g   Ready    worker                 5h28m   v1.28.4+7aa0a74
rioliu-0103a-8m6ml-worker-northcentralus-5k48j   Ready    worker                 3m27s   v1.28.4+7aa0a74

$ oc get node/rioliu-0103a-8m6ml-worker-northcentralus-5k48j -o yaml | yq -y '.metadata.annotations' | grep machine
machine.openshift.io/machine: openshift-machine-api/rioliu-0103a-8m6ml-worker-northcentralus-5k48j
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-c217ff932861b35333e7adaae434a04e
machineconfiguration.openshift.io/desiredConfig: rendered-worker-c217ff932861b35333e7adaae434a04e
machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-c217ff932861b35333e7adaae434a04e
machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-c217ff932861b35333e7adaae434a04e
machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: '192916'
machineconfiguration.openshift.io/reason: ''
machineconfiguration.openshift.io/state: Done

@rioliu-rh
Copy link

run critical cases against the cluster installed with CI image 4.15.0-0.ci.test-2024-01-03-020219-ci-ln-kfz9szk-latest
system unit status

sh-5.1# systemctl status dnsmasq
○ dnsmasq.service - DNS caching server.
     Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; disabled; preset: disabled)
     Active: inactive (dead)
sh-5.1#
sh-5.1# systemctl status ovs-configuration
○ ovs-configuration.service - Configures OVS with proper host networking configuration
     Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; preset: disabled)
     Active: inactive (dead) since Wed 2024-01-03 08:20:23 UTC; 28min ago
   Main PID: 1277 (code=exited, status=0/SUCCESS)
        CPU: 306ms

sh-5.1# systemctl show ovs-configuration | grep -i before
Before=dnsmasq.service kubelet-dependencies.target node-valid-hostname.service shutdown.target

Job link: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/ginkgo-test/222139/console

01-03 16:24:17.721  The Case Execution Summary:
01-03 16:24:17.721   PASS OCP-42347 Author:rioliu health check for machine-config-operator [Serial]
01-03 16:24:17.721   PASS OCP-42361 Author:rioliu add chrony systemd config [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-42364 Author:mhanss add selinux kernel argument [Disruptive] [Serial]
01-03 16:24:17.721   SKIP OCP-42365 Author:mhanss add real time kernel argument [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-42367 Author:mhanss add extension to RHCOS [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-42368 Author:mhanss add max pods to the kubelet config [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-42369 Author:mhanss add container runtime config [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-42390 Author:rioliu add machine config without ignition version. Block the MCO upgrade rollout if any of the pools are Degraded [Serial]
01-03 16:24:17.721   PASS OCP-42438 Author:mhanss add journald systemd config [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-43048 Author:mhanss create/delete custom machine config pool [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-43064 Author:mhanss create/delete custom machine config pool [Disruptive] [Serial]
01-03 16:24:17.721   SKIP OCP-43310 Author:mhanss add kernel arguments, kernel type and extension to the RHCOS and RHEL [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-45318 Author:rioliu add machine config without ignition version. Block the MCO upgrade rollout if any of the pools are Degraded [Serial]
01-03 16:24:17.721   SKIP OCP-53960 Author:sregidor No failed units in the bootstrap machine
01-03 16:24:17.721   PASS OCP-54085 Author:sregidor Update osImage changing /etc /usr and rpm [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-54159 Author:sregidor Apply a new osImage on a cluster with already installed rpms [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-56131 Author:sregidor Install all extensions [Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-62084 Author:sregidor Certificate rotation in paused pools[Disruptive] [Serial]
01-03 16:24:17.721   PASS OCP-64781 Author:sregidor MCO should be compliant with CIS benchmark rule
01-03 16:24:17.721   PASS OCP-67395 Author:sregidor rotate kubernetes certificate authority. Certificates managed via non-MC path.[Disruptive] [Serial]
01-03 16:24:17.721   SKIP OCP-67790 Author:sregidor create MC with extensions, 64k-pages kernel type and kernel argument [Disruptive] [Serial]
01-03 16:24:17.721   SKIP OCP-68695 Author:rioliu MCO should not be degraded when image-registry is not installed [Serial]
01-03 16:24:17.721   SKIP OCP-69184 Author:rioliu Enable feature gate MachineConfigNodes [Serial]
01-03 16:24:17.721   PASS OCP-70090 Author:rioliu apiserver-url.env file can be created on all cluster nodes [Serial]

No failure found.

@rioliu-rh
Copy link

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 3, 2024
@rioliu-rh
Copy link

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Jan 3, 2024
@openshift-ci-robot
Copy link
Contributor

@sdodson: This pull request references Jira Issue OCPBUGS-25406, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

ARO's dnsmasq.service, if present, must be run after ovs-configuration. This, could of course be fixed in that service's definition but doing so is not operationally easy to do at this point. There is potential risk that some other dnsmasq.service implementation needs to be run before ovs-configuration and this change would break them. Given the temporal proximity to the breaking change in #3865 lets go ahead and make this.

We should also ensure that ARO makes their service more robust in the future.

Fixes: OCPBUGS-25406

- What I did
Added dnsmasq.service to the list of units that ovs-configuration.service should run before.

- How to verify it

  1. Regression test in normal OCP
  2. Confirm in ARO that this addresses OCPBUGS-25406, QE should consult with ARO @rogbas and @bennerv good points of contact.

- Description for the changelog
Run ovs-configuration.service before dnsmasq.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sdodson
Copy link
Member Author

sdodson commented Jan 3, 2024

/payload 4.16 nightly blocking
No failures in previous run but some jobs were never properly triggered.

Copy link
Contributor

openshift-ci bot commented Jan 3, 2024

@sdodson: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

  • periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/454243a0-aa45-11ee-90ed-f96768b166ae-0

@sdodson
Copy link
Member Author

sdodson commented Jan 3, 2024

/payload 4.16 nightly blocking

Copy link
Contributor

openshift-ci bot commented Jan 3, 2024

@sdodson: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

  • periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1427ca50-aa7d-11ee-98ff-23e48ca9acef-0

ARO's dnsmasq.service, if present, must be run after ovs-configuration. This,
could of course be fixed in that service's definition but doing so is not
operationally easy to do at this point. There is potential risk that some
other dnsmasq.service implementation needs to be run before ovs-configuration
and this change would break them. Given the temporal proximity to the breaking
change lets go ahead and make this.

We should also ensure that ARO makes their service more robust in the future.
@sdodson sdodson force-pushed the OCPBUGS-25406-master branch from 8f78ef8 to d638d96 Compare January 3, 2024 21:25
@sdodson sdodson changed the title OCPBUGS-25406: Run ovs-configuration.service before dnsmasq.service OCPBUGS-25406: Azure Run ovs-configuration.service before dnsmasq.service Jan 3, 2024
@rioliu-rh
Copy link

setup a cluster on aws. dnsmasq service is not in dependency list of ovs-configuration

$ oc get infrastructures/cluster -o jsonpath='{.status}' | jq
{
  "apiServerInternalURI": "https://api-int.ci-ln-h48wb7b-76ef8.origin-ci-int-aws.dev.rhcloud.com:6443",
  "apiServerURL": "https://api.ci-ln-h48wb7b-76ef8.origin-ci-int-aws.dev.rhcloud.com:6443",
  "controlPlaneTopology": "HighlyAvailable",
  "cpuPartitioning": "None",
  "etcdDiscoveryDomain": "",
  "infrastructureName": "ci-ln-h48wb7b-76ef8-b2xgc",
  "infrastructureTopology": "HighlyAvailable",
  "platform": "AWS",
  "platformStatus": {
    "aws": {
      "region": "us-west-1"
    },
    "type": "AWS"
  }
}

sh-5.1# systemctl show ovs-configuration | grep -i before
Before=node-valid-hostname.service shutdown.target kubelet-dependencies.target
sh-5.1#
sh-5.1# systemctl list-dependencies --before ovs-configuration
ovs-configuration.service
● ├─node-valid-hostname.service
● ├─kubelet-dependencies.target
● │ ├─crio.service
● │ ├─kubelet.service
○ │ └─shutdown.target
○ │   └─final.target
○ └─shutdown.target
○   └─final.target

@rioliu-rh
Copy link

rioliu-rh commented Jan 4, 2024

verified the same scenario on ARO cluster, new machine can be provisioned, node can be joined the cluster
service dependency on new node

$ oc debug -q node/rioliu-0104a-6977r-worker-northcentralus-42s8h -- chroot /host systemctl show ovs-configuration | grep Before
Before=kubelet-dependencies.target node-valid-hostname.service dnsmasq.service shutdown.target

$ oc debug -q node/rioliu-0104a-6977r-worker-northcentralus-42s8h -- chroot /host systemctl list-dependencies --before ovs-configuration
ovs-configuration.service
● ├─dnsmasq.service
● ├─node-valid-hostname.service
● ├─kubelet-dependencies.target
● │ ├─crio.service
● │ ├─kubelet.service
○ │ └─shutdown.target
○ │   └─final.target
○ └─shutdown.target
○   └─final.target

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 4, 2024
Copy link
Contributor

openshift-ci bot commented Jan 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cdoern, sdodson, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD e0dda82 and 2 for PR HEAD d638d96 in total

@sdodson
Copy link
Member Author

sdodson commented Jan 4, 2024

/retest-required
e2e-gcp-op-single-node doesn't seem to have passed since December 20th :-(

Copy link
Contributor

openshift-ci bot commented Jan 4, 2024

@sdodson: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-op-layering d638d96 link false /test e2e-gcp-op-layering

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang
Copy link
Contributor

/override ci/prow/e2e-gcp-op-single-node

Copy link
Contributor

openshift-ci bot commented Jan 5, 2024

@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-gcp-op-single-node

In response to this:

/override ci/prow/e2e-gcp-op-single-node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sdodson
Copy link
Member Author

sdodson commented Jan 5, 2024

/cherry-pick release-4.15

@openshift-cherrypick-robot

@sdodson: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 854a866 into openshift:master Jan 5, 2024
13 of 14 checks passed
@openshift-ci-robot
Copy link
Contributor

@sdodson: Jira Issue OCPBUGS-25406: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-25406 has been moved to the MODIFIED state.

In response to this:

ARO's dnsmasq.service, if present, must be run after ovs-configuration. This, could of course be fixed in that service's definition but doing so is not operationally easy to do at this point. There is potential risk that some other dnsmasq.service implementation needs to be run before ovs-configuration and this change would break them. Given the temporal proximity to the breaking change in #3865 lets go ahead and make this.

We should also ensure that ARO makes their service more robust in the future.

Fixes: OCPBUGS-25406

- What I did
Added dnsmasq.service to the list of units that ovs-configuration.service should run before.

- How to verify it

  1. Regression test in normal OCP
  2. Confirm in ARO that this addresses OCPBUGS-25406, QE should consult with ARO @rogbas and @bennerv good points of contact.

- Description for the changelog
Run ovs-configuration.service before dnsmasq.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@sdodson: new pull request created: #4101

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-proxy-pull-test-container-v4.16.0-202401052011.p0.g854a866.assembly.stream for distgit openshift-proxy-pull-test.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants