WIP: cmd/openshift-install/gather: Gather bootstrap console logs #2811

Closed


wking
Member

@wking wking commented Dec 13, 2019

Builds on #2810; review that first. This PR adds AWS console-log retrieval to those connection-refused cases. It should be especially useful in continuous integration when bumping RHCOS boot images (e.g. this job), when such boot-time failures are more likely.

Before this commit, bootstrap machines that failed to come up would
look like [1]:

  level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443..."
  level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.221.214.197:6443: connect: connection refused"
  level=info msg="Pulling debug logs from the bootstrap machine"
  level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 3.84.188.207:22: connect: connection refused"
  level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

With this commit, that last error will look like:

  level=error msg="Attempted to gather debug logs after installation failure: failed to connect to the bootstrap machine: dial tcp 3.84.188.207:22: connect: connection refused"

without the unrelated (to this failure mode) distraction about SSH
keys.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12076
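
As a rough illustration of the message change described above, here is a minimal Go sketch that tells a refused TCP connection apart from other SSH setup failures. The helper names (`isConnectionRefused`, `wrapDialError`, `refusedDialError`) are hypothetical and are not taken from the installer; the sketch assumes a Unix-like platform where a refused dial surfaces as `syscall.ECONNREFUSED`.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// isConnectionRefused reports whether err wraps ECONNREFUSED.
// Hypothetical helper; assumes a Unix-like errno.
func isConnectionRefused(err error) bool {
	return errors.Is(err, syscall.ECONNREFUSED)
}

// wrapDialError picks the message prefix based on the failure mode.
// Hypothetical helper, not the installer's actual code.
func wrapDialError(err error) error {
	if isConnectionRefused(err) {
		// Nothing is listening on the port, so SSH keys are not the problem.
		return fmt.Errorf("failed to connect to the bootstrap machine: %w", err)
	}
	return fmt.Errorf("failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: %w", err)
}

// refusedDialError builds an error shaped like the one a refused
// TCP dial produces, so the example runs without a network.
func refusedDialError() error {
	return &net.OpError{
		Op:  "dial",
		Net: "tcp",
		Err: os.NewSyscallError("connect", syscall.ECONNREFUSED),
	}
}

func main() {
	fmt.Println(wrapDialError(refusedDialError()))
	fmt.Println(wrapDialError(errors.New("ssh: handshake failed")))
}
```

Wrapping with `%w` keeps the underlying errno reachable, so a caller further up the stack can still check for the connection-refused case.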
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign wking
You can assign the PR to them by writing /assign @wking in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 13, 2019
@wking
Member Author

wking commented Dec 13, 2019

The benefit of baking this into the installer instead of using the CI templates is that we don't have to teach the templates how to extract the IP from the Terraform state. It will also help end users if, for example, AWS has a hardware error.

Gather console logs if we can't reach the bootstrap machine via SSH.  Before this commit,
we would occasionally see connection issues like [1]:

  level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443..."
  level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.221.214.197:6443: connect: connection refused"
  level=info msg="Pulling debug logs from the bootstrap machine"
  level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 3.84.188.207:22: connect: connection refused"
  level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

With this commit, when we see those connection-refused errors, we
attempt to retrieve console logs for the bootstrap instance.  This
will make it easier for users to see why the machine failed
to come up.  It should be especially useful in continuous integration
when bumping RHCOS boot images [2], when such boot-time failures are
more likely.

I've only implemented it on AWS for the moment, but I've set it up so
we can extend it to other platforms going forward.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12076
[2]: openshift#2777 (comment)
@wking wking force-pushed the gather-ssh-connection-refused-console-logs branch from bb13454 to 1ac911d Compare December 13, 2019 02:10
@wking wking changed the title cmd/openshift-install/gather: Gather bootstrap console logs WIP: cmd/openshift-install/gather: Gather bootstrap console logs Dec 13, 2019
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 13, 2019
@wking
Member Author

wking commented Dec 13, 2019

Added a debugging commit so we can see this in action.

@abhinavdahiya
Contributor

On the flip side, this is only an issue for CI; otherwise it's not very useful to end users...

@wking
Member Author

wking commented Dec 16, 2019

On the flip side, this is only an issue for CI...

CI is important too ;). And the AWS Go SDK is likely to be more stable than Terraform state inspection, so I'd rather not duplicate state inspection between this repo and the release repo. But if you want the console-gathering itself to land in openshift/release, how about a new openshift-install gather subcommand (ips?) that prints the IP addresses, or something similar, to isolate the CI-side code from the Terraform state format?

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 31, 2020
@openshift-ci-robot
Contributor

@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

@wking: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws | d92d9e8 | link | /test e2e-aws |
| ci/prow/e2e-aws-scaleup-rhel7 | d92d9e8 | link | /test e2e-aws-scaleup-rhel7 |
| ci/prow/e2e-openstack | d92d9e8 | link | /test e2e-openstack |
| ci/prow/e2e-aws-fips | d92d9e8 | link | /test e2e-aws-fips |
| ci/prow/e2e-libvirt | d92d9e8 | link | /test e2e-libvirt |
| ci/prow/shellcheck | d92d9e8 | link | /test shellcheck |
| ci/prow/yaml-lint | d92d9e8 | link | /test yaml-lint |
| ci/prow/tf-lint | d92d9e8 | link | /test tf-lint |
| ci/prow/e2e-aws-upgrade | d92d9e8 | link | /test e2e-aws-upgrade |
| ci/prow/images | d92d9e8 | link | /test images |
| ci/prow/gofmt | d92d9e8 | link | /test gofmt |
| ci/prow/govet | d92d9e8 | link | /test govet |
| ci/prow/unit | d92d9e8 | link | /test unit |
| ci/prow/verify-vendor | d92d9e8 | link | /test verify-vendor |
| ci/prow/golint | d92d9e8 | link | /test golint |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@abhinavdahiya
Contributor

I think for now we have punted on having console logs captured.

In CI, where this is most relevant because of breakages, we already capture these.

/close

@openshift-ci-robot
Contributor

@abhinavdahiya: Closed this PR.


Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. version/4.5

3 participants