OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts #9310
/test e2e-azure-ovn |
/assign @patrickdillon |
/test ? |
@patrickdillon: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Wow. This e2e-azure run shows a dramatic 50%+ increase in machine provisioning time:
compared to two other CI runs I spot-checked:
[1] [2, took exactly the same amount of time]
Let's get some more samples. /test e2e-azure-ovn |
This issue escaped our notice with Terraform installs, I think, because all resource creation was wrapped in a single 30-minute timeout. In CAPI installs we break up resource creation into two phases, each with a 15-minute timeout, which should be more than sufficient, so it exposed this problem. |
@jlebon can you check this implementation instead? I believe the current #9305 did not yield results, but early testing of this is looking great (see above). In response to #9305 (comment): Yes, Azure has excessively large disks (1 TB) because that is the only way to guarantee IOPS on an OS disk in this wonderful cloud. So this is a significant improvement for Azure. On all other clouds, the control plane OS disk size is configurable, but we have no telemetry to tell us how common it is to increase it. I think this is fine as a workaround; I believe you mentioned that a fix has already landed upstream. Perhaps we need to consider how the installer will remove this in the future. A Jira tied to a release? |
/retitle OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts |
@r4f4: This pull request references Jira Issue OCPBUGS-46144, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/lgtm
Yes, this works and should be quicker. The caveat is that it introduces a layout difference across platforms, but at the same time we do document that having a separate /var partition is required for large disks, so this is just us following our own advice.
Using the workaround of a separate /var partition until the issue is fixed in RHCOS.
c15d459 to a10520c
Update: addressed review comments. |
/retest-required |
/approve I have created https://issues.redhat.com/browse/CORS-3800 so that we will watch the upstream RHCOS issue and can revert this once it has been resolved. This lgtm. @jlebon thanks for reviewing. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: patrickdillon The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm
/label acknowledge-critical-fixes-only |
/override ci/prow/e2e-azure-ovn-upi |
@r4f4: Overrode contexts on behalf of r4f4: ci/prow/e2e-azure-ovn-upi In response to this:
@r4f4: The following test failed, say
Full PR test history. Your PR dashboard. I understand the commands that are listed here. |
@r4f4: Jira Issue OCPBUGS-46144: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-46144 has been moved to the MODIFIED state. In response to this:
[ART PR BUILD NOTIFIER] Distgit: ose-installer-altinfra |
[ART PR BUILD NOTIFIER] Distgit: ose-installer-terraform-providers |
[ART PR BUILD NOTIFIER] Distgit: ose-baremetal-installer |
[ART PR BUILD NOTIFIER] Distgit: ose-installer-artifacts |