WIP: cmd/openshift-install/gather: Gather bootstrap console logs #2811

wking · 2019-12-13T01:01:43Z

Builds on #2810; review that first. This PR adds AWS console-log retrieval to those connection-refused cases. It should be especially useful in continuous integration when bumping RHCOS boot images (e.g. this job), when such boot-time failures are more likely.

Before this commit, bootstrap machines that failed to come up would look like [1]:

    level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443..."
    level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.221.214.197:6443: connect: connection refused"
    level=info msg="Pulling debug logs from the bootstrap machine"
    level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 3.84.188.207:22: connect: connection refused"
    level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

With this commit, that last error will look like:

    level=error msg="Attempted to gather debug logs after installation failure: failed to connect to the bootstrap machine: dial tcp 3.84.188.207:22: connect: connection refused"

without the unrelated (to this failure mode) distraction about SSH keys.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12076

openshift-ci-robot · 2019-12-13T01:01:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign wking
You can assign the PR to them by writing /assign @wking in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2019-12-13T01:33:48Z

The benefit of baking this into the installer, instead of using the CI templates, is that we don't have to teach the templates how to extract the IP from the Terraform state. And this way it will also help end users if AWS has a hardware error or whatever.

Gather bootstrap console logs if we can't reach the bootstrap machine via SSH. Before this commit, we would occasionally see connection issues like [1]:

    level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443..."
    level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.221.214.197:6443: connect: connection refused"
    level=info msg="Pulling debug logs from the bootstrap machine"
    level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 3.84.188.207:22: connect: connection refused"
    level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

With this commit, when we see those connection-refused errors, we attempt to retrieve console logs for the bootstrap instance. This will make it easier for users to see why the machine failed to come up. It should be especially useful in continuous integration when bumping RHCOS boot images [2], when such boot-time failures are more likely.

I've only implemented it on AWS for the moment, but I've set it up so we can extend it to other platforms going forward.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12076
[2]: openshift#2777 (comment)

wking · 2019-12-13T02:10:39Z

Added a debugging commit so we can see this in action.

abhinavdahiya · 2019-12-13T14:47:49Z

On the flip side, this is only an issue for CI; otherwise it's not very useful to end users...

wking · 2019-12-16T19:12:30Z

> On the flip side, this is only an issue for CI...

CI is important too ;). And the AWS Go SDK is likely to be more stable than Terraform state inspection, so I'd rather not duplicate state inspection between this repo and the release repo. But if you want the console-gathering itself to land in openshift/release, how about a new openshift-install gather subcommand (ips?) that spits out the IP addresses or some such, to isolate the CI-side code from the Terraform state format?

openshift-ci-robot · 2020-01-31T01:24:16Z

@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2020-04-22T00:48:13Z

@wking: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws | `d92d9e8` | link | `/test e2e-aws` |
| ci/prow/e2e-aws-scaleup-rhel7 | `d92d9e8` | link | `/test e2e-aws-scaleup-rhel7` |
| ci/prow/e2e-openstack | `d92d9e8` | link | `/test e2e-openstack` |
| ci/prow/e2e-aws-fips | `d92d9e8` | link | `/test e2e-aws-fips` |
| ci/prow/e2e-libvirt | `d92d9e8` | link | `/test e2e-libvirt` |
| ci/prow/shellcheck | `d92d9e8` | link | `/test shellcheck` |
| ci/prow/yaml-lint | `d92d9e8` | link | `/test yaml-lint` |
| ci/prow/tf-lint | `d92d9e8` | link | `/test tf-lint` |
| ci/prow/e2e-aws-upgrade | `d92d9e8` | link | `/test e2e-aws-upgrade` |
| ci/prow/images | `d92d9e8` | link | `/test images` |
| ci/prow/gofmt | `d92d9e8` | link | `/test gofmt` |
| ci/prow/govet | `d92d9e8` | link | `/test govet` |
| ci/prow/unit | `d92d9e8` | link | `/test unit` |
| ci/prow/verify-vendor | `d92d9e8` | link | `/test verify-vendor` |
| ci/prow/golint | `d92d9e8` | link | `/test golint` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

abhinavdahiya · 2020-05-08T23:07:44Z

I think for now we have punted on having console logs captured.

In CI, where this is most relevant because of breakages, we already capture these.

/close

openshift-ci-robot · 2020-05-08T23:07:58Z

@abhinavdahiya: Closed this PR.

In response to this:

> I think for now we have punted on having console logs captured.
>
> In CI, where this is most relevant because of breakages, we already capture these.

/close


openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 13, 2019

openshift-ci-robot requested review from jhixson74 and jstuever December 13, 2019 01:02

wking added 2 commits December 12, 2019 18:09

WIP: DEBUG

1ac911d

wking force-pushed the gather-ssh-connection-refused-console-logs branch from bb13454 to 1ac911d Compare December 13, 2019 02:10

wking changed the title ~~cmd/openshift-install/gather: Gather bootstrap console logs~~ WIP: cmd/openshift-install/gather: Gather bootstrap console logs Dec 13, 2019

openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 13, 2019

WIP: More debugging hacks

d92d9e8

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 31, 2020

abhinavdahiya added the version/4.5 label Feb 10, 2020

openshift-ci-robot closed this May 8, 2020
