Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

manifest: Add conntrack (tools) but without the daemon #502

Merged
merged 1 commit into from
Mar 10, 2021

Conversation

cgwalters
Copy link
Member

coreos/fedora-coreos-tracker#404
https://bugzilla.redhat.com/show_bug.cgi?id=1925698
openshift/machine-config-operator#2421

This will help us work around a believed kernel bug for OpenShift right now. We may remove this later.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2021
@travier
Copy link
Member

travier commented Feb 19, 2021

Will need a release-4.7 backport

@LorbusChris
Copy link
Member

I wonder whether this should be done in upstream FCOS for now. Things like this would otherwise have to be managed in https://github.com/openshift/okd-machine-os/, broadening the gap between FCOS and the OKD machine OS

@cgwalters
Copy link
Member Author

I agree this is creating an OKD gap; we could address that by having okd-machine-os just install the conntrack-tools package alongside the rest of the stuff like kubelet etc.

I am less certain about blocking this quick-fix-for-OCP on adding this to FCOS - that's basically a permanent commitment, although sentiment seemed to be in favor-ish.

@jlebon
Copy link
Member

jlebon commented Feb 19, 2021

As a quick hack, this seems fine to me!

/approve

But long-term we should either get that package split out, or e.g. moved to the MCD as was discussed. Given that, I'd rather not do this hack at all in FCOS if we can, because remove-from-packages is really not great (and honestly, we should be looking at dropping the ones we currently have).

@jlebon
Copy link
Member

jlebon commented Feb 19, 2021

Things like this would otherwise have to be managed in openshift/okd-machine-os, broadening the gap between FCOS and the OKD machine OS

Because FCOS is more general than RHCOS, I think some gap will basically always exist and we should embrace that and figure out how to manage it best (e.g. like the extensions strengthening work),

@lucab
Copy link
Contributor

lucab commented Feb 19, 2021

This is going to change cri-o behavior, which performs some networking-related logic based on auto-detection of binary presence.
Ideally, before landing this, there should be a tri-state knob (on | off | autodetect) in cri-o configuration so that the expected behavior can be pinned.

@cgwalters
Copy link
Member Author

This is going to change cri-o behavior,

Right but...is that code always something we want anyways? I am not sure.

One option to make this even more obviously a hack for current OCP is to move the binary to e.g. /usr/opt/openshift-private/bin/conntrack, and that would also defeat crio finding it too - if indeed that is a problem.

@mrunalp
Copy link
Member

mrunalp commented Feb 19, 2021

@aojea ptal from the crio hostport manager perspective.

@aojea
Copy link

aojea commented Feb 19, 2021

This is going to change cri-o behavior, which performs some networking-related logic based on auto-detection of binary presence.
Ideally, before landing this, there should be a tri-state knob (on | off | autodetect) in cri-o configuration so that the expected behavior can be pinned.

cri-o manage the hostport in the containers, without the conntrack binary crio has a bug in certain UDP scenarios and it also doesn't pass one e2e test kubernetes/kubernetes#91216.

I don't know the reasons why this is this way honestly, but the correct behaviour for crio, since is the owner of the hostport logic, is to use the conntrack binary to delete the stale entries ... the conntrack logic doesn't modify any behaviour, it fixes a bug

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2021
@mrunalp
Copy link
Member

mrunalp commented Feb 19, 2021

@aojea Thanks! Now the question is if the crio code is enough for the gcp issue or we still need the changes proposed in openshift/machine-config-operator#2421 cc: @michaelgugino

@aojea
Copy link

aojea commented Feb 19, 2021

@aojea Thanks! Now the question is if the crio code is enough for the gcp issue or we still need the changes proposed in openshift/machine-config-operator#2421 cc: @michaelgugino

that issue is orthogonal to crio, that is due to tcp connection, this is about stale udp connections.

@cgwalters
Copy link
Member Author

/retest

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

9 similar comments
@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments
@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@cgwalters
Copy link
Member Author

/hold
CI failure looks unrelated, not sure what's going on. Will ping DPTP Monday.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 21, 2021
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, ashcrow, cgwalters, jlebon, travier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ashcrow,cgwalters,jlebon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Copy link
Member Author

/retest

@cgwalters
Copy link
Member Author

cgwalters commented Feb 23, 2021

Ohh I see the problem:

"0/26 nodes are available: 18 Insufficient devices.kubevirt.io/kvm, 2 node(s) had taint {ci.openshift.io/ci-search: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."

Edit: pinged DPTP to scale up nodes with kvm.

@cgwalters
Copy link
Member Author

We debugged this to openshift/release@7a961fa which steered jobs for this repo back to build01, which means we lost nested virt. DPTP is investigating a fix.

@cgwalters
Copy link
Member Author

openshift/release#16193 merged, let's see
/retest

@cgwalters
Copy link
Member Author

OK now we ran on build02, but are missing /dev/kvm.

@cgwalters
Copy link
Member Author

/retest

@cgwalters
Copy link
Member Author

Now retrying after openshift/release#16197

@cgwalters
Copy link
Member Author

/retest

@cgwalters cgwalters mentioned this pull request Feb 24, 2021
@cgwalters
Copy link
Member Author

I added some more debugging
/retest

@cgwalters
Copy link
Member Author

/retest

4 similar comments
@cgwalters
Copy link
Member Author

/retest

@cgwalters
Copy link
Member Author

/retest

@cgwalters
Copy link
Member Author

/retest

@cgwalters
Copy link
Member Author

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 3, 2021

@cgwalters: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/unit 269de9a link /test unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@miabbott
Copy link
Member

/retest

@miabbott
Copy link
Member

So nice to have working CI again! Thank you so much @cgwalters!

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2021
@openshift-merge-robot openshift-merge-robot merged commit 32ce8cb into openshift:master Mar 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.