Kube-proxy should exempt local traffic from going through conntrack #109672

Closed

linxiulei opened this issue Apr 26, 2022 · 22 comments
Labels
kind/feature (Categorizes issue or PR as related to a new feature.) · sig/network (Categorizes an issue or PR as relevant to SIG Network.) · triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

Comments

@linxiulei
Contributor

What would you like to be added?

An accept rule should be installed before the conntrack rule to exempt local traffic, such as requests to localhost:10248/healthz:

...
-N KUBE-SERVICES
-A INPUT -m comment --comment "kubernetes health check service ports" -j KUBE-NODEPORTS
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes externally-visible service portals" -j KUBE-EXTERNAL-SERVICES
-A INPUT -j KUBE-FIREWALL
...

Why is this needed?

Tracking traffic over the loopback interface seems unnecessary and adds overhead. In particular, when the conntrack table is full this traffic gets dropped, which causes further problems.
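
(For context, one way to observe this condition on a node, assuming the standard netfilter sysctls are exposed:)

    # Compare current conntrack usage against the configured limit
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max

    # When the table is full, the kernel typically logs drops such as
    # "nf_conntrack: table full, dropping packet"
    dmesg | grep -i conntrack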

@linxiulei linxiulei added the kind/feature label Apr 26, 2022
@k8s-ci-robot k8s-ci-robot added the needs-sig and needs-triage labels Apr 26, 2022
@aojea
Member

aojea commented Apr 26, 2022

Locally generated traffic doesn't go through INPUT, it enters the OUTPUT chain ... is there an issue you are seeing, or is this an optimization you want to add?


@linxiulei
Contributor Author

To clarify, local traffic (localhost to localhost) first goes through the OUTPUT chain and then the INPUT chain. Is that right?

Either way, it doesn't matter if it only goes through the OUTPUT chain, because the OUTPUT chain also uses conntrack:

-A OUTPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES

The issue is that when the conntrack table is full, the localhost health check against the kubelet fails. However, that failure doesn't reflect a real problem with the kubelet.

@aojea
Member

aojea commented Apr 26, 2022

The issue is that when the conntrack table is full, the localhost health check against the kubelet fails.

You should report the bug first. I don't think this is going to solve any problems if you have a full conntrack table; it's just one connection.

@linxiulei
Contributor Author

IMHO, this is expected, as the conntrack table can't be infinite; applications can exhaust it no matter how big it is. The point is that, ideally, a full conntrack table should not impact the kubelet (and its health check).

@neolit123
Member

/sig network

@k8s-ci-robot k8s-ci-robot added the sig/network label and removed the needs-sig label Apr 26, 2022
@aojea
Member

aojea commented Apr 27, 2022

I understand your reasoning, but if your conntrack table is full, the kubelet probe is going to be the least of your problems.
We should determine the root cause of the conntrack table exhaustion and see whether its origin is an architectural problem, but we cannot start patching edge cases or this will not scale ... are we going to add rules for the apiserver, controller-manager, ... and the rest of the components?

@linxiulei
Contributor Author

I agree. My understanding of this problem is:

  • Conntrack table exhaustion by workloads is inevitable. K8S should consider the worst-case scenario where possible to avoid impact from the workload; ideally, communication between K8S components should still work when the conntrack table is full.
  • Since conntrack is set up by kube-proxy, the architectural problem could be inside kube-proxy. For example, we could have an exempt rule before entering the rules that use conntrack (kube-proxy already does something similar; see the KUBE-NODEPORTS chain). This exempt rule could cover all basic kubelet traffic.

@aojea
Member

aojea commented Apr 27, 2022

Hmm, the kube-proxy rules are not creating the conntrack entries; those rules are using the entries ... the kernel enables the conntrack hooks and tracks the protocols, and all the NAT kernel modules use those entries to perform the different operations ...

If you want some traffic not to be "tracked" by the kernel, you have to use something like:

iptables -t raw -A PREROUTING <your traffic filter here> -j NOTRACK

But if you hit the conntrack table limits, I think your system is incorrectly dimensioned, the same as if you exhaust memory, CPU, or disk: you should be able to know the number of conntrack entries you want to handle and configure the limit correctly in advance.

@thockin
Member

thockin commented Apr 28, 2022

I'm in favor of adding NOTRACK rules if and when we can. The problem is that it's not always clear when we can do that.

E.g. the iptables-mode kube-proxy supports nodeports on localhost - we need to track those. So the rules have to at least check -d 127.0.0.0/8 AND compare -p and --dport against all known nodeports.
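
(For illustration only, a hypothetical sketch of the kind of rule that would be needed, assuming the default nodeport range 30000-32767 and that the local ephemeral port range does not overlap it; this is not what kube-proxy actually installs:)

    # Skip conntrack for loopback TCP traffic, except connections that touch
    # the nodeport range in either direction (sketch only, not kube-proxy's rules)
    iptables -t raw -A OUTPUT     -o lo -p tcp ! --sport 30000:32767 ! --dport 30000:32767 -j NOTRACK
    iptables -t raw -A PREROUTING -i lo -p tcp ! --sport 30000:32767 ! --dport 30000:32767 -j NOTRACK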

I seem to recall trying to do this a while back and finding more corner cases. TCP connections do clear after the connection closes, maybe there's more we can do there.

Can you run conntrack -L just to be sure your problems are all localhost?
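
(For reference, one way to check that, assuming the conntrack CLI from conntrack-tools is installed:)

    # How many tracked entries involve localhost vs. the table total
    conntrack -L 2>/dev/null | grep -c '127.0.0.1'
    conntrack -L 2>/dev/null | wc -l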

@linxiulei
Contributor Author

@aojea ah, I missed that conntrack is enabled as long as iptables/netfilter is in use. Thanks!

By exempting, I meant setting up NOTRACK rules for critical traffic so the components relying on that traffic can be more robust under extreme circumstances. However, it's debatable where we should put those NOTRACK rules.

@thockin I don't have the environment now, but I'd bet most of the usage comes from pods rather than localhost. So the small amount of traffic over localhost is a victim of the conntrack table being full, even though it is very important to critical K8S components on the node.

@dcbw
Member

dcbw commented May 26, 2022

@linxiulei you should be able to adjust the conntrack table size with net.netfilter.nf_conntrack_max. Is that not an option on your systems?
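
(For reference, inspecting and raising that limit with sysctl; the value below is an example only:)

    # Inspect the current limit and usage
    sysctl net.netfilter.nf_conntrack_max net.netfilter.nf_conntrack_count

    # Raise the limit (example value; persist via /etc/sysctl.d/ if needed)
    sysctl -w net.netfilter.nf_conntrack_max=262144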

@thockin
Member

thockin commented May 26, 2022

kube-proxy sets those sysctls, but it has a scaling config you can set if you really need more: --conntrack-max-per-core

As I mentioned above - I do think it would be neat to NOTRACK things when we know we can, it's just harder than it sounded (or I just failed the first time I tried :)
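
(For reference, the scaling knob mentioned above can be set either as a flag or via the kube-proxy configuration file; the value here is an example only:)

    # Flag form (example value)
    kube-proxy --conntrack-max-per-core=131072

    # Equivalent KubeProxyConfiguration field: conntrack.maxPerCore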

@linxiulei
Contributor Author

@dcbw Not really, as nf_conntrack_max may be used up no matter how high you set it. Also, setting a high value uses memory that is not charged to any pod, so it impacts the whole system.

@thockin I have created NOTRACK iptables rules in my environment, and they worked well for local traffic (e.g. health checks to the kubelet). But it is a standalone configuration script that inserts the rules; it'd be neat if kube-proxy did this by default.

@thockin
Member

thockin commented Jun 9, 2022

Can you provide details on exactly what NOTRACK rules you are using?

@linxiulei
Contributor Author

Yeah, something like:

    iptables -w -t raw -A PREROUTING -s 127.0.0.1 -p tcp --sport 10248 -j NOTRACK
    iptables -w -t raw -A OUTPUT -d 127.0.0.1 -p tcp --dport 10248 -j NOTRACK

    iptables -w -A INPUT -s 127.0.0.1 -p tcp --sport 10248 -j ACCEPT
    iptables -w -A OUTPUT -d 127.0.0.1 -p tcp --dport 10248 -j ACCEPT

Port 10248 is the port that the kubelet's healthz endpoint listens on.
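
(For reference, one way to confirm rules like these are taking effect, assuming the conntrack CLI is installed:)

    # Packet counters on the raw-table rules should increase as health checks run
    iptables -w -t raw -L -n -v | grep 10248

    # And no conntrack entries should appear for the healthz port
    conntrack -L 2>/dev/null | grep -c 10248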

@thockin
Member

thockin commented Jun 23, 2022

Traffic to localhost is tracked in some cases (nodeports in iptables mode).

If you change kube-proxy to do the above, it should fail e2e tests. :(

@thockin
Member

thockin commented Aug 4, 2022

I'm going to triage accept this, because I agree, but it's not as easy as it sounds. :(

@thockin thockin added the triage/accepted label and removed the needs-triage label Aug 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Nov 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 16, 2022
@thockin thockin removed the lifecycle/rotten label Dec 16, 2022
@aojea
Member

aojea commented Mar 19, 2023

Just for reference, as Eric and I were finding new corner cases with the NOTRACK approach: something important to consider is that if there is firewalling in place that depends on connection state, and the connection is not tracked, the returning traffic will not match any ESTABLISHED state and will be dropped.
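
(To illustrate the interaction: with a state-based ruleset like the sketch below, a reply belonging to an untracked connection never matches ESTABLISHED and falls through to the drop rule:)

    # Typical stateful filter: only tracked, established connections are let back in
    iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -j DROP
    # Packets exempted with NOTRACK carry the UNTRACKED state, so returning
    # traffic hits the final DROP unless it is explicitly allowed first.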

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added the needs-triage label and removed the triage/accepted label Mar 18, 2024
@thockin thockin added the triage/accepted label Mar 28, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage label Mar 28, 2024
@thockin
Member

thockin commented Oct 24, 2024

Closing in favor of #127259 as the accumulator issue.

@thockin thockin closed this as completed Oct 24, 2024