WIP: cmd/openshift-install/gather: Gather bootstrap console logs #2811
Before this commit, bootstrap machines that failed to come up would look like [1]:

  level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443..."
  level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.221.214.197:6443: connect: connection refused"
  level=info msg="Pulling debug logs from the bootstrap machine"
  level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 3.84.188.207:22: connect: connection refused"
  level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

With this commit, that last error will look like:

  level=error msg="Attempted to gather debug logs after installation failure: failed to connect to the bootstrap machine: dial tcp 3.84.188.207:22: connect: connection refused"

without the unrelated (to this failure mode) distraction about SSH keys.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12076
The benefit of baking this into the installer instead of using the CI templates is that we don't have to teach the templates how to extract the IP from the Terraform state. It will also help end users if AWS has a hardware error or whatever.
If we can't reach the bootstrap machine via SSH.

Before this commit, we would occasionally see connection issues like [1]:

  level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443..."
  level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.221.214.197:6443: connect: connection refused"
  level=info msg="Pulling debug logs from the bootstrap machine"
  level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 3.84.188.207:22: connect: connection refused"
  level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

With this commit, when we see those connection-refused errors, we attempt to retrieve console logs for the bootstrap instance. This will make it easier for users to see why the machine failed to come up. It should be especially useful in continuous integration when bumping RHCOS boot images [2], when such boot-time failures are more likely.

I've only implemented it on AWS for the moment, but I've set it up so we can extend it to other platforms going forward.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12076
[2]: openshift#2777 (comment)
Force-pushed from bb13454 to 1ac911d.
Added a debugging commit so we can see this in action.
On the flip side, this is only an issue for CI; otherwise it's not very useful to end users...
CI is important too ;). And the AWS Go SDK is likely to be more stable than Terraform state inspection, so I'd rather not dup state inspection between this repo and the release repo. But if you want the console-gathering itself to land in openshift/release, how about a new |
@wking: PR needs rebase.
@wking: The following tests failed, say
I think for now we have punted on having console logs captured. In CI, where this is most relevant because of breakages, we already capture these. /close
@abhinavdahiya: Closed this PR. In response to this:
Builds on #2810; review that first. This PR adds AWS console-log retrieval to those connection-refused cases. It should be especially useful in continuous integration when bumping RHCOS boot images (e.g. this job), when such boot-time failures are more likely.