OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts #9310

r4f4 · 2024-12-11T22:14:56Z

Using the workaround of a separate /var partition until the issue is fixed in RHCOS.

r4f4 · 2024-12-11T22:28:32Z

/test e2e-azure-ovn

r4f4 · 2024-12-11T22:28:43Z

/assign @patrickdillon

patrickdillon · 2024-12-12T03:05:47Z

/test ?

openshift-ci · 2024-12-12T03:05:50Z

@patrickdillon: The following commands are available to trigger required jobs:

/test altinfra-images

/test aro-unit

/test artifacts-images

/test e2e-agent-compact-ipv4

/test e2e-aws-ovn

/test e2e-aws-ovn-edge-zones-manifest-validation

/test e2e-aws-ovn-upi

/test e2e-azure-ovn

/test e2e-azure-ovn-upi

/test e2e-gcp-ovn

/test e2e-gcp-ovn-upi

/test e2e-metal-ipi-ovn-ipv6

/test e2e-openstack-ovn

/test e2e-vsphere-ovn

/test e2e-vsphere-ovn-upi

/test gofmt

/test golint

/test govet

/test images

/test integration-tests

/test integration-tests-nodejoiner

/test openstack-manifests

/test shellcheck

/test terraform-images

/test terraform-verify-vendor

/test tf-lint

/test unit

/test verify-codegen

/test verify-vendor

/test yaml-lint

The following commands are available to trigger optional jobs:

/test altinfra-e2e-aws-custom-security-groups

/test altinfra-e2e-aws-ovn

/test altinfra-e2e-aws-ovn-fips

/test altinfra-e2e-aws-ovn-imdsv2

/test altinfra-e2e-aws-ovn-localzones

/test altinfra-e2e-aws-ovn-proxy

/test altinfra-e2e-aws-ovn-shared-vpc

/test altinfra-e2e-aws-ovn-shared-vpc-local-zones

/test altinfra-e2e-aws-ovn-shared-vpc-wavelength-zones

/test altinfra-e2e-aws-ovn-single-node

/test altinfra-e2e-aws-ovn-wavelengthzones

/test altinfra-e2e-azure-capi-ovn

/test altinfra-e2e-azure-ovn-shared-vpc

/test altinfra-e2e-gcp-capi-ovn

/test altinfra-e2e-gcp-ovn-byo-network-capi

/test altinfra-e2e-gcp-ovn-secureboot-capi

/test altinfra-e2e-gcp-ovn-xpn-capi

/test altinfra-e2e-ibmcloud-capi-ovn

/test altinfra-e2e-nutanix-capi-ovn

/test altinfra-e2e-openstack-capi-ccpmso

/test altinfra-e2e-openstack-capi-ccpmso-zone

/test altinfra-e2e-openstack-capi-dualstack

/test altinfra-e2e-openstack-capi-dualstack-upi

/test altinfra-e2e-openstack-capi-dualstack-v6primary

/test altinfra-e2e-openstack-capi-externallb

/test altinfra-e2e-openstack-capi-nfv-intel

/test altinfra-e2e-openstack-capi-ovn

/test altinfra-e2e-openstack-capi-proxy

/test altinfra-e2e-vsphere-capi-multi-vcenter-ovn

/test altinfra-e2e-vsphere-capi-ovn

/test altinfra-e2e-vsphere-capi-static-ovn

/test altinfra-e2e-vsphere-capi-zones

/test azure-ovn-marketplace-images

/test e2e-agent-4control-ipv4

/test e2e-agent-5control-ipv4

/test e2e-agent-compact-ipv4-appliance-diskimage

/test e2e-agent-compact-ipv4-none-platform

/test e2e-agent-compact-ipv6-minimaliso

/test e2e-agent-ha-dualstack

/test e2e-agent-sno-ipv4-pxe

/test e2e-agent-sno-ipv6

/test e2e-aws-default-config

/test e2e-aws-overlay-mtu-ovn-1200

/test e2e-aws-ovn-custom-iam-profile

/test e2e-aws-ovn-edge-zones

/test e2e-aws-ovn-fips

/test e2e-aws-ovn-heterogeneous

/test e2e-aws-ovn-imdsv2

/test e2e-aws-ovn-proxy

/test e2e-aws-ovn-public-ipv4-pool

/test e2e-aws-ovn-public-ipv4-pool-disabled

/test e2e-aws-ovn-public-subnets

/test e2e-aws-ovn-shared-vpc-custom-security-groups

/test e2e-aws-ovn-shared-vpc-edge-zones

/test e2e-aws-ovn-single-node

/test e2e-aws-ovn-techpreview

/test e2e-aws-ovn-upgrade

/test e2e-aws-ovn-workers-rhel8

/test e2e-aws-upi-proxy

/test e2e-azure-default-config

/test e2e-azure-ovn-resourcegroup

/test e2e-azure-ovn-shared-vpc

/test e2e-azure-ovn-techpreview

/test e2e-azurestack

/test e2e-azurestack-upi

/test e2e-crc

/test e2e-external-aws

/test e2e-external-aws-ccm

/test e2e-gcp-ovn-byo-vpc

/test e2e-gcp-ovn-heterogeneous

/test e2e-gcp-ovn-techpreview

/test e2e-gcp-ovn-xpn

/test e2e-gcp-secureboot

/test e2e-gcp-upgrade

/test e2e-gcp-upi-xpn

/test e2e-gcp-user-provisioned-dns

/test e2e-ibmcloud-ovn

/test e2e-metal-assisted

/test e2e-metal-ipi-ovn

/test e2e-metal-ipi-ovn-dualstack

/test e2e-metal-ipi-ovn-swapped-hosts

/test e2e-metal-ipi-ovn-virtualmedia

/test e2e-metal-single-node-live-iso

/test e2e-nutanix-ovn

/test e2e-openstack-ccpmso

/test e2e-openstack-ccpmso-zone

/test e2e-openstack-dualstack

/test e2e-openstack-dualstack-upi

/test e2e-openstack-externallb

/test e2e-openstack-nfv-intel

/test e2e-openstack-proxy

/test e2e-openstack-singlestackv6

/test e2e-powervs-capi-ovn

/test e2e-vsphere-multi-vcenter-ovn

/test e2e-vsphere-ovn-multi-network

/test e2e-vsphere-ovn-techpreview

/test e2e-vsphere-ovn-upi-zones

/test e2e-vsphere-ovn-zones

/test e2e-vsphere-ovn-zones-techpreview

/test e2e-vsphere-static-ovn

/test okd-scos-e2e-aws-ovn

/test okd-scos-images

/test tf-fmt

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-installer-master-altinfra-images

pull-ci-openshift-installer-master-aro-unit

pull-ci-openshift-installer-master-artifacts-images

pull-ci-openshift-installer-master-e2e-aws-ovn

pull-ci-openshift-installer-master-gofmt

pull-ci-openshift-installer-master-golint

pull-ci-openshift-installer-master-govet

pull-ci-openshift-installer-master-images

pull-ci-openshift-installer-master-okd-scos-e2e-aws-ovn

pull-ci-openshift-installer-master-shellcheck

pull-ci-openshift-installer-master-tf-fmt

pull-ci-openshift-installer-master-tf-lint

pull-ci-openshift-installer-master-unit

pull-ci-openshift-installer-master-verify-codegen

pull-ci-openshift-installer-master-verify-vendor

pull-ci-openshift-installer-master-yaml-lint

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

patrickdillon · 2024-12-12T03:19:59Z

Wow. This e2e-azure run shows a dramatic 50%+ decrease in machine provisioning time:

time="2024-12-12T01:30:21Z" level=debug msg="                  Machine Provisioning: 4m30s"

compared to two other ci runs I spot checked:

time="2024-11-20T23:15:58Z" level=debug msg="                  Machine Provisioning: 10m46s"

[1] [2, took exactly the same amount of time]

time="2024-12-06T02:54:34Z" level=debug msg="                  Machine Provisioning: 10m16s"

[3]

Let's get some more samples.

/test e2e-azure-ovn
/test e2e-azure-default-config
/test e2e-azure-ovn-resourcegroup
/test e2e-azure-ovn-shared-vpc
/test e2e-azure-ovn-techpreview

patrickdillon · 2024-12-12T03:25:01Z

This issue escaped our notice with Terraform installs, I think, because all resource creation was wrapped in a thirty-minute timeout. In CAPI installs we break up resource creation into two phases with 15 min timeouts, which should be more than sufficient, so it exposed this problem.

patrickdillon · 2024-12-12T03:56:22Z

@jlebon can you check this implementation instead? I believe the current #9305 did not yield results, but early testing of this is looking great (see above).

In response to #9305 (comment)

Yes, Azure has excessively large disks (1TB) because that is the only way to guarantee iops on an os disk in this wonderful cloud. So this is a significant improvement for Azure. On all other clouds, control plane osdisk size is configurable, but we have no telemetry to help inform how common it is to increase. I think this is fine as a workaround--I believe you mentioned that a fix has already landed upstream.

Perhaps we need to consider how the installer will remove this in the future--a jira tied to a release?

r4f4 · 2024-12-12T08:52:53Z

/retitle OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts

openshift-ci-robot · 2024-12-12T08:53:03Z

@r4f4: This pull request references Jira Issue OCPBUGS-46144, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.19.0) matches configured target version for branch (4.19.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jinyunma

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Using the workaround of a separate /var partition until the issue is fixed in RHCOS.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

barbacbd

/lgtm

jlebon

Yes, this works and should be quicker. The caveat is that it introduces a layout difference across platforms, but at the same time we do document that having a separate /var partition is required for large disks, so this is just us following our own advice.

Docs: https://docs.openshift.com/container-platform/4.17/installing/installing_platform_agnostic/installing-platform-agnostic.html#installation-user-infra-machines-advanced_vardisk_installing-platform-agnostic

pkg/asset/ignition/node.go

Using the workaround of a separate /var partition until the issue is fixed in RHCOS.

r4f4 · 2024-12-17T21:48:58Z

Update: addressed review comments.

r4f4 · 2024-12-18T08:50:11Z

/retest-required

patrickdillon · 2024-12-18T19:11:18Z

/approve

I have created https://issues.redhat.com/browse/CORS-3800 so that we will watch the upstream rhcos issue and can revert this once it has been resolved.

This lgtm. @jlebon thanks for reviewing.

openshift-ci · 2024-12-18T19:11:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [patrickdillon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

barbacbd

/lgtm

r4f4 · 2024-12-19T21:10:17Z

/label acknowledge-critical-fixes-only

openshift-ci-robot · 2024-12-19T22:08:54Z

/retest-required

Remaining retests: 0 against base HEAD ec72ce6 and 2 for PR HEAD a10520c in total

openshift-ci-robot · 2024-12-20T04:27:10Z

/retest-required

Remaining retests: 0 against base HEAD ec72ce6 and 2 for PR HEAD a10520c in total

r4f4 · 2024-12-20T14:31:14Z

/override ci/prow/e2e-azure-ovn-upi
e2e failures not related to PR change. The NotReady nodes issue is known and it's affecting multiple clouds.

openshift-ci · 2024-12-20T14:31:47Z

@r4f4: Overrode contexts on behalf of r4f4: ci/prow/e2e-azure-ovn-upi

In response to this:

/override ci/prow/e2e-azure-ovn-upi
e2e failures not related to PR change. The NotReady nodes issue is known and it's affecting multiple clouds.


openshift-ci · 2024-12-20T17:44:43Z

@r4f4: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/okd-scos-e2e-aws-ovn	`c15d459`	link	false	`/test okd-scos-e2e-aws-ovn`

Full PR test history. Your PR dashboard.


openshift-ci-robot · 2024-12-20T18:30:32Z

/retest-required

Remaining retests: 0 against base HEAD f7a8032 and 1 for PR HEAD a10520c in total

openshift-ci-robot · 2024-12-20T21:11:22Z

@r4f4: Jira Issue OCPBUGS-46144: All pull requests linked via external trackers have merged:

openshift/installer#9310

Jira Issue OCPBUGS-46144 has been moved to the MODIFIED state.

In response to this:

Using the workaround of a separate /var partition until the issue is fixed in RHCOS.


openshift-bot · 2024-12-21T00:38:53Z

[ART PR BUILD NOTIFIER]

Distgit: ose-installer-altinfra
This PR has been included in build ose-installer-altinfra-container-v4.19.0-202412202338.p0.g7c28220.assembly.stream.el9.
All builds following this will include this PR.

openshift-bot · 2024-12-21T00:44:51Z

[ART PR BUILD NOTIFIER]

Distgit: ose-installer-terraform-providers
This PR has been included in build ose-installer-terraform-providers-container-v4.19.0-202412202338.p0.g7c28220.assembly.stream.el9.
All builds following this will include this PR.

openshift-bot · 2024-12-21T00:47:51Z

[ART PR BUILD NOTIFIER]

Distgit: ose-baremetal-installer
This PR has been included in build ose-baremetal-installer-container-v4.19.0-202412202338.p0.g7c28220.assembly.stream.el9.
All builds following this will include this PR.

openshift-bot · 2024-12-21T02:15:10Z

[ART PR BUILD NOTIFIER]

Distgit: ose-installer-artifacts
This PR has been included in build ose-installer-artifacts-container-v4.19.0-202412202338.p0.g7c28220.assembly.stream.el9.
All builds following this will include this PR.

openshift-ci bot requested review from jhixson74 and rna-afk December 11, 2024 22:15

r4f4 mentioned this pull request Dec 11, 2024

azure: avoid timeouts due to growfs #9305

Closed

openshift-ci bot assigned patrickdillon Dec 11, 2024

openshift-ci bot changed the title ~~azure: use separate /var to avoid growfs timeouts~~ OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts Dec 12, 2024

openshift-ci bot requested a review from jinyunma December 12, 2024 08:53

barbacbd reviewed Dec 13, 2024


openshift-ci bot assigned barbacbd Dec 13, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 13, 2024

jlebon reviewed Dec 17, 2024


pkg/asset/ignition/node.go (outdated)

pkg/asset/ignition/node.go

pkg/asset/ignition/node.go (outdated)

pkg/asset/ignition/node.go (outdated)

OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts

a10520c

Using the workaround of a separate /var partition until the issue is fixed in RHCOS.

r4f4 force-pushed the azure-avoid-growfs-var branch from c15d459 to a10520c Compare December 17, 2024 21:48

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 17, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 18, 2024

barbacbd reviewed Dec 19, 2024


openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 19, 2024

openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Dec 19, 2024

openshift-merge-bot bot merged commit 7c28220 into openshift:main Dec 20, 2024
29 checks passed
