OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts #9310

Merged
merged 1 commit into from
Dec 20, 2024

Conversation

r4f4
Contributor

@r4f4 r4f4 commented Dec 11, 2024

Using the workaround of a separate /var partition until the issue is fixed in RHCOS.

@r4f4
Contributor Author

r4f4 commented Dec 11, 2024

/test e2e-azure-ovn

@r4f4
Contributor Author

r4f4 commented Dec 11, 2024

/assign @patrickdillon

@patrickdillon
Contributor

/test ?

Contributor

openshift-ci bot commented Dec 12, 2024

@patrickdillon: The following commands are available to trigger required jobs:

/test altinfra-images
/test aro-unit
/test artifacts-images
/test e2e-agent-compact-ipv4
/test e2e-aws-ovn
/test e2e-aws-ovn-edge-zones-manifest-validation
/test e2e-aws-ovn-upi
/test e2e-azure-ovn
/test e2e-azure-ovn-upi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upi
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack-ovn
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi
/test gofmt
/test golint
/test govet
/test images
/test integration-tests
/test integration-tests-nodejoiner
/test openstack-manifests
/test shellcheck
/test terraform-images
/test terraform-verify-vendor
/test tf-lint
/test unit
/test verify-codegen
/test verify-vendor
/test yaml-lint

The following commands are available to trigger optional jobs:

/test altinfra-e2e-aws-custom-security-groups
/test altinfra-e2e-aws-ovn
/test altinfra-e2e-aws-ovn-fips
/test altinfra-e2e-aws-ovn-imdsv2
/test altinfra-e2e-aws-ovn-localzones
/test altinfra-e2e-aws-ovn-proxy
/test altinfra-e2e-aws-ovn-shared-vpc
/test altinfra-e2e-aws-ovn-shared-vpc-local-zones
/test altinfra-e2e-aws-ovn-shared-vpc-wavelength-zones
/test altinfra-e2e-aws-ovn-single-node
/test altinfra-e2e-aws-ovn-wavelengthzones
/test altinfra-e2e-azure-capi-ovn
/test altinfra-e2e-azure-ovn-shared-vpc
/test altinfra-e2e-gcp-capi-ovn
/test altinfra-e2e-gcp-ovn-byo-network-capi
/test altinfra-e2e-gcp-ovn-secureboot-capi
/test altinfra-e2e-gcp-ovn-xpn-capi
/test altinfra-e2e-ibmcloud-capi-ovn
/test altinfra-e2e-nutanix-capi-ovn
/test altinfra-e2e-openstack-capi-ccpmso
/test altinfra-e2e-openstack-capi-ccpmso-zone
/test altinfra-e2e-openstack-capi-dualstack
/test altinfra-e2e-openstack-capi-dualstack-upi
/test altinfra-e2e-openstack-capi-dualstack-v6primary
/test altinfra-e2e-openstack-capi-externallb
/test altinfra-e2e-openstack-capi-nfv-intel
/test altinfra-e2e-openstack-capi-ovn
/test altinfra-e2e-openstack-capi-proxy
/test altinfra-e2e-vsphere-capi-multi-vcenter-ovn
/test altinfra-e2e-vsphere-capi-ovn
/test altinfra-e2e-vsphere-capi-static-ovn
/test altinfra-e2e-vsphere-capi-zones
/test azure-ovn-marketplace-images
/test e2e-agent-4control-ipv4
/test e2e-agent-5control-ipv4
/test e2e-agent-compact-ipv4-appliance-diskimage
/test e2e-agent-compact-ipv4-none-platform
/test e2e-agent-compact-ipv6-minimaliso
/test e2e-agent-ha-dualstack
/test e2e-agent-sno-ipv4-pxe
/test e2e-agent-sno-ipv6
/test e2e-aws-default-config
/test e2e-aws-overlay-mtu-ovn-1200
/test e2e-aws-ovn-custom-iam-profile
/test e2e-aws-ovn-edge-zones
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-heterogeneous
/test e2e-aws-ovn-imdsv2
/test e2e-aws-ovn-proxy
/test e2e-aws-ovn-public-ipv4-pool
/test e2e-aws-ovn-public-ipv4-pool-disabled
/test e2e-aws-ovn-public-subnets
/test e2e-aws-ovn-shared-vpc-custom-security-groups
/test e2e-aws-ovn-shared-vpc-edge-zones
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-techpreview
/test e2e-aws-ovn-upgrade
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-upi-proxy
/test e2e-azure-default-config
/test e2e-azure-ovn-resourcegroup
/test e2e-azure-ovn-shared-vpc
/test e2e-azure-ovn-techpreview
/test e2e-azurestack
/test e2e-azurestack-upi
/test e2e-crc
/test e2e-external-aws
/test e2e-external-aws-ccm
/test e2e-gcp-ovn-byo-vpc
/test e2e-gcp-ovn-heterogeneous
/test e2e-gcp-ovn-techpreview
/test e2e-gcp-ovn-xpn
/test e2e-gcp-secureboot
/test e2e-gcp-upgrade
/test e2e-gcp-upi-xpn
/test e2e-gcp-user-provisioned-dns
/test e2e-ibmcloud-ovn
/test e2e-metal-assisted
/test e2e-metal-ipi-ovn
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-swapped-hosts
/test e2e-metal-ipi-ovn-virtualmedia
/test e2e-metal-single-node-live-iso
/test e2e-nutanix-ovn
/test e2e-openstack-ccpmso
/test e2e-openstack-ccpmso-zone
/test e2e-openstack-dualstack
/test e2e-openstack-dualstack-upi
/test e2e-openstack-externallb
/test e2e-openstack-nfv-intel
/test e2e-openstack-proxy
/test e2e-openstack-singlestackv6
/test e2e-powervs-capi-ovn
/test e2e-vsphere-multi-vcenter-ovn
/test e2e-vsphere-ovn-multi-network
/test e2e-vsphere-ovn-techpreview
/test e2e-vsphere-ovn-upi-zones
/test e2e-vsphere-ovn-zones
/test e2e-vsphere-ovn-zones-techpreview
/test e2e-vsphere-static-ovn
/test okd-scos-e2e-aws-ovn
/test okd-scos-images
/test tf-fmt

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-installer-master-altinfra-images
pull-ci-openshift-installer-master-aro-unit
pull-ci-openshift-installer-master-artifacts-images
pull-ci-openshift-installer-master-e2e-aws-ovn
pull-ci-openshift-installer-master-gofmt
pull-ci-openshift-installer-master-golint
pull-ci-openshift-installer-master-govet
pull-ci-openshift-installer-master-images
pull-ci-openshift-installer-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-installer-master-shellcheck
pull-ci-openshift-installer-master-tf-fmt
pull-ci-openshift-installer-master-tf-lint
pull-ci-openshift-installer-master-unit
pull-ci-openshift-installer-master-verify-codegen
pull-ci-openshift-installer-master-verify-vendor
pull-ci-openshift-installer-master-yaml-lint

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@patrickdillon
Contributor

Wow. This e2e-azure run shows a dramatic 50%+ reduction in machine provisioning time:

time="2024-12-12T01:30:21Z" level=debug msg="                  Machine Provisioning: 4m30s"

compared to two other CI runs I spot-checked:

time="2024-11-20T23:15:58Z" level=debug msg="                  Machine Provisioning: 10m46s"

[1] [2, took exactly the same amount of time]

time="2024-12-06T02:54:34Z" level=debug msg="                  Machine Provisioning: 10m16s"

[3]

Let's get some more samples.

/test e2e-azure-ovn
/test e2e-azure-default-config
/test e2e-azure-ovn-resourcegroup
/test e2e-azure-ovn-shared-vpc
/test e2e-azure-ovn-techpreview

@patrickdillon
Contributor

This issue escaped our notice with Terraform installs, I think, because all resource creation was wrapped in a single thirty-minute timeout. In CAPI installs we break up resource creation into two phases, each with a 15-minute timeout, which should be more than sufficient, so the split exposed this problem.

@patrickdillon
Contributor

@jlebon can you check this implementation instead? I believe the current #9305 did not yield results, but early testing of this is looking great (see above).

In response to #9305 (comment)

Yes, Azure has excessively large disks (1 TB) because that is the only way to guarantee IOPS on an OS disk in this wonderful cloud. So this is a significant improvement for Azure. On all other clouds, control plane OS disk size is configurable, but we have no telemetry to tell us how commonly it is increased. I think this is fine as a workaround; I believe you mentioned that a fix has already landed upstream.

Perhaps we need to consider how the installer will remove this in the future. A Jira ticket tied to a release?

@r4f4
Contributor Author

r4f4 commented Dec 12, 2024

/retitle OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts

@openshift-ci openshift-ci bot changed the title azure: use separate /var to avoid growfs timeouts OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts Dec 12, 2024
@openshift-ci-robot openshift-ci-robot added labels Dec 12, 2024:
  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
@openshift-ci-robot
Contributor

@r4f4: This pull request references Jira Issue OCPBUGS-46144, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jinyunma

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Using the workaround of a separate /var partition until the issue is fixed in RHCOS.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from jinyunma December 12, 2024 08:53
Contributor

@barbacbd barbacbd left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 13, 2024
Member

@jlebon jlebon left a comment

Yes, this works and should be quicker. The caveat is that it introduces a layout difference across platforms, but at the same time we do document that having a separate /var partition is required for large disks, so this is just us following our own advice.

Docs: https://docs.openshift.com/container-platform/4.17/installing/installing_platform_agnostic/installing-platform-agnostic.html#installation-user-infra-machines-advanced_vardisk_installing-platform-agnostic

Review threads on pkg/asset/ignition/node.go: 4 (all resolved; 3 on outdated code)
Using the workaround of a separate /var partition until the issue is
fixed in RHCOS.
@r4f4 r4f4 force-pushed the azure-avoid-growfs-var branch from c15d459 to a10520c Compare December 17, 2024 21:48
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 17, 2024
@r4f4
Contributor Author

r4f4 commented Dec 17, 2024

Update: addressed review comments.

@r4f4
Contributor Author

r4f4 commented Dec 18, 2024

/retest-required

@patrickdillon
Contributor

/approve

I have created https://issues.redhat.com/browse/CORS-3800 so that we can track the upstream RHCOS issue and revert this once it has been resolved.

This lgtm. @jlebon thanks for reviewing.

Contributor

openshift-ci bot commented Dec 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 18, 2024
Contributor

@barbacbd barbacbd left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 19, 2024
@r4f4
Contributor Author

r4f4 commented Dec 19, 2024

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Dec 19, 2024
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ec72ce6 and 2 for PR HEAD a10520c in total

@r4f4
Contributor Author

r4f4 commented Dec 20, 2024

/override ci/prow/e2e-azure-ovn-upi
e2e failures not related to PR change. The NotReady nodes issue is known and it's affecting multiple clouds.

Contributor

openshift-ci bot commented Dec 20, 2024

@r4f4: Overrode contexts on behalf of r4f4: ci/prow/e2e-azure-ovn-upi

In response to this:

/override ci/prow/e2e-azure-ovn-upi
e2e failures not related to PR change. The NotReady nodes issue is known and it's affecting multiple clouds.

Contributor

openshift-ci bot commented Dec 20, 2024

@r4f4: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/okd-scos-e2e-aws-ovn | c15d459 | link | false | /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD f7a8032 and 1 for PR HEAD a10520c in total

@openshift-merge-bot openshift-merge-bot bot merged commit 7c28220 into openshift:main Dec 20, 2024
29 checks passed
@openshift-ci-robot
Contributor

@r4f4: Jira Issue OCPBUGS-46144: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-46144 has been moved to the MODIFIED state.

In response to this:

Using the workaround of a separate /var partition until the issue is fixed in RHCOS.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-installer-altinfra
This PR has been included in build ose-installer-altinfra-container-v4.19.0-202412202338.p0.g7c28220.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-installer-terraform-providers
This PR has been included in build ose-installer-terraform-providers-container-v4.19.0-202412202338.p0.g7c28220.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-baremetal-installer
This PR has been included in build ose-baremetal-installer-container-v4.19.0-202412202338.p0.g7c28220.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-installer-artifacts
This PR has been included in build ose-installer-artifacts-container-v4.19.0-202412202338.p0.g7c28220.assembly.stream.el9.
All builds following this will include this PR.

Labels
  • acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
6 participants