Kube-proxy should exempt local traffic from going through conntrack #109672

Closed

linxiulei opened this issue Apr 26, 2022 · 22 comments
Labels
kind/feature (Categorizes issue or PR as related to a new feature.) · sig/network (Categorizes an issue or PR as relevant to SIG Network.) · triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

Comments

@linxiulei
Contributor

What would you like to be added?

An accept rule should be installed before the conntrack rule to exempt local traffic, such as requests to localhost:10248/healthz:

...
-N KUBE-SERVICES
-A INPUT -m comment --comment "kubernetes health check service ports" -j KUBE-NODEPORTS
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes externally-visible service portals" -j KUBE-EXTERNAL-SERVICES
-A INPUT -j KUBE-FIREWALL
...

Why is this needed?

Tracking traffic over the loopback interface seems unnecessary and adds overhead. In particular, when the conntrack table is full this traffic gets dropped, which causes further problems.
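
(For context, one way to observe this condition on a node, assuming the standard netfilter sysctls are exposed:)

    # Compare current conntrack usage against the configured limit
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max

    # When the table is full, the kernel typically logs drops such as
    # "nf_conntrack: table full, dropping packet"
    dmesg | grep -i conntrack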

@linxiulei linxiulei added the kind/feature label Apr 26, 2022
@k8s-ci-robot k8s-ci-robot added the needs-sig and needs-triage labels Apr 26, 2022
@aojea
Member

aojea commented Apr 26, 2022

Locally generated traffic doesn't go through INPUT, it enters the OUTPUT chain ... is there an issue you are seeing, or is this an optimization you want to add?


@linxiulei
Contributor Author

To clarify, local traffic (localhost to localhost) first goes through the OUTPUT chain and then the INPUT chain. Is that right?

Either way, it doesn't matter if it only goes through the OUTPUT chain, because the OUTPUT chain also uses conntrack:

-A OUTPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES

The issue is that when the conntrack table is full, the localhost health check against the kubelet fails. However, that failure doesn't reflect a real problem with the kubelet.

@aojea
Member

aojea commented Apr 26, 2022

The issue is that when the conntrack table is full, the localhost health check against the kubelet fails.

You should report the bug first. I don't think this is going to solve any problems if you have a full conntrack table; it's just one connection.

@linxiulei
Contributor Author

IMHO, this is expected, as the conntrack table can't be infinite; applications can exhaust it no matter how big it is. The point is that, ideally, a full conntrack table should not impact the kubelet (and its health check).

@neolit123
Member

/sig network

@k8s-ci-robot k8s-ci-robot added the sig/network label and removed the needs-sig label Apr 26, 2022
@aojea
Member

aojea commented Apr 27, 2022

I understand your reasoning, but if your conntrack table is full, the kubelet probe is going to be the least of your problems.
We should determine the root cause of the conntrack table exhaustion and see whether its origin is an architectural problem, but we cannot start patching edge cases or this will not scale ... are we going to add rules for the apiserver, controller-manager, ... and the rest of the components?

@linxiulei
Contributor Author

I agree. My understanding of this problem is:

  • Conntrack table exhaustion by workloads is inevitable. K8S should consider the worst-case scenario where possible to avoid impact from the workload; ideally, communication between K8S components should still work when the conntrack table is full.
  • Since conntrack is set up by kube-proxy, the architectural problem could be inside kube-proxy. For example, we could have an exempt rule before entering the rules that use conntrack (kube-proxy already does something similar; see the KUBE-NODEPORTS chain). This exempt rule could cover all basic kubelet traffic.

@aojea
Member

aojea commented Apr 27, 2022

Hmm, the kube-proxy rules are not creating the conntrack entries; those rules are using the entries ... the kernel enables the conntrack hooks and tracks the protocols, and all the NAT kernel modules use those entries to perform the different operations ...

If you want some traffic not to be "tracked" by the kernel, you have to use something like:

iptables -t raw -A PREROUTING <your traffic filter here> -j NOTRACK

But if you hit the conntrack table limits, I think your system is incorrectly dimensioned, the same as if you exhaust memory, CPU, or disk: you should be able to know the number of conntrack entries you want to handle and configure the limit correctly in advance.

@thockin
Member

thockin commented Apr 28, 2022

I'm in favor of adding NOTRACK rules if and when we can. The problem is that it's not always clear when we can do that.

E.g. the iptables-mode kube-proxy supports nodeports on localhost - we need to track those. So the rules have to at least check -d 127.0.0.0/8 AND compare -p and --dport against all known nodeports.
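
(For illustration only, a hypothetical sketch of the kind of rule that would be needed, assuming the default nodeport range 30000-32767 and that the local ephemeral port range does not overlap it; this is not what kube-proxy actually installs:)

    # Skip conntrack for loopback TCP traffic, except connections that touch
    # the nodeport range in either direction (sketch only, not kube-proxy's rules)
    iptables -t raw -A OUTPUT     -o lo -p tcp ! --sport 30000:32767 ! --dport 30000:32767 -j NOTRACK
    iptables -t raw -A PREROUTING -i lo -p tcp ! --sport 30000:32767 ! --dport 30000:32767 -j NOTRACK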

I seem to recall trying to do this a while back and finding more corner cases. TCP connections do clear after the connection closes, maybe there's more we can do there.

Can you run conntrack -L just to be sure your problems are all localhost?
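
(For reference, one way to check that, assuming the conntrack CLI from conntrack-tools is installed:)

    # How many tracked entries involve localhost vs. the table total
    conntrack -L 2>/dev/null | grep -c '127.0.0.1'
    conntrack -L 2>/dev/null | wc -l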

@linxiulei
Contributor Author

@aojea ah, I missed that conntrack is enabled as long as iptables/netfilter is in use. Thanks!

By exempting, I meant setting up NOTRACK rules for critical traffic so the components relying on that traffic can be more robust under extreme circumstances. However, it's debatable where we should put those NOTRACK rules.

@thockin I don't have the environment now, but I'd bet most of the usage comes from pods rather than localhost. So the small amount of traffic over localhost is a victim of the conntrack table being full, even though it is very important to critical K8S components on the node.

@dcbw
Member

dcbw commented May 26, 2022

@linxiulei you should be able to adjust the conntrack table size with net.netfilter.nf_conntrack_max. Is that not an option on your systems?
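
(For reference, inspecting and raising that limit with sysctl; the value below is an example only:)

    # Inspect the current limit and usage
    sysctl net.netfilter.nf_conntrack_max net.netfilter.nf_conntrack_count

    # Raise the limit (example value; persist via /etc/sysctl.d/ if needed)
    sysctl -w net.netfilter.nf_conntrack_max=262144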

@thockin
Member

thockin commented May 26, 2022

kube-proxy sets those sysctls, but it has a scaling config you can set if you really need more: --conntrack-max-per-core

As I mentioned above - I do think it would be neat to NOTRACK things when we know we can, it's just harder than it sounded (or I just failed the first time I tried :)
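
(For reference, the scaling knob mentioned above can be set either as a flag or via the kube-proxy configuration file; the value here is an example only:)

    # Flag form (example value)
    kube-proxy --conntrack-max-per-core=131072

    # Equivalent KubeProxyConfiguration field: conntrack.maxPerCore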

@linxiulei
Contributor Author

@dcbw Not really, as nf_conntrack_max may be used up no matter how high you set it. Also, setting a high value uses memory that is not charged to any pod, so it impacts the whole system.

@thockin I have created NOTRACK iptables rules in my environment, and they worked well for local traffic (e.g. health checks to the kubelet). But it is a standalone configuration script that inserts the rules; it'd be neat if kube-proxy did this by default.

@thockin
Member

thockin commented Jun 9, 2022

Can you provide details on exactly what NOTRACK rules you are using?

@linxiulei
Contributor Author

Yeah, something like:

    iptables -w -t raw -A PREROUTING -s 127.0.0.1 -p tcp --sport 10248 -j NOTRACK
    iptables -w -t raw -A OUTPUT -d 127.0.0.1 -p tcp --dport 10248 -j NOTRACK

    iptables -w -A INPUT -s 127.0.0.1 -p tcp --sport 10248 -j ACCEPT
    iptables -w -A OUTPUT -d 127.0.0.1 -p tcp --dport 10248 -j ACCEPT

Port 10248 is the port that the kubelet's healthz endpoint listens on.
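
(For reference, one way to confirm rules like these are taking effect, assuming the conntrack CLI is installed:)

    # Packet counters on the raw-table rules should increase as health checks run
    iptables -w -t raw -L -n -v | grep 10248

    # And no conntrack entries should appear for the healthz port
    conntrack -L 2>/dev/null | grep -c 10248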

@thockin
Member

thockin commented Jun 23, 2022

Traffic to localhost is tracked in some cases (nodeports in iptables mode).

If you change kube-proxy to do the above, it should fail e2e tests. :(

@thockin
Member

thockin commented Aug 4, 2022

I'm going to triage accept this, because I agree, but it's not as easy as it sounds. :(

@thockin thockin added the triage/accepted label and removed the needs-triage label Aug 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Nov 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 16, 2022
@thockin thockin removed the lifecycle/rotten label Dec 16, 2022
@aojea
Member

aojea commented Mar 19, 2023

Just for reference, as Eric and I were finding new corner cases with the NOTRACK approach: something important to consider is that if there is firewalling in place that depends on connection state, and the connection is not tracked, the returning traffic will not match any ESTABLISHED state and will be dropped.
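
(To illustrate the interaction: with a state-based ruleset like the sketch below, a reply belonging to an untracked connection never matches ESTABLISHED and falls through to the drop rule:)

    # Typical stateful filter: only tracked, established connections are let back in
    iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -j DROP
    # Packets exempted with NOTRACK carry the UNTRACKED state, so returning
    # traffic hits the final DROP unless it is explicitly allowed first.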

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added the needs-triage label and removed the triage/accepted label Mar 18, 2024
@thockin thockin added the triage/accepted label Mar 28, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage label Mar 28, 2024
@thockin
Member

thockin commented Oct 24, 2024

Closing in favor of #127259 as the accumulator issue.

@thockin thockin closed this as completed Oct 24, 2024