
add CALICO_DISABLE_TUNNEL env #9394

Open
gojoy opened this issue Oct 25, 2024 · 11 comments · May be fixed by #9506

Comments

@gojoy

gojoy commented Oct 25, 2024

Expected Behavior

An environment variable is needed to control whether the calico-node -allocate-tunnel-addrs process is started when running the calico pod in Kubernetes.

Current Behavior

Possible Solution

Add a check on the CALICO_DISABLE_TUNNEL env in rc.local.

Steps to Reproduce (for bugs)

Context

In our scenario, the Kubernetes cluster uses Calico's BGP mode exclusively, and we no longer want to start the calico-node -allocate-tunnel-addrs process, in order to reduce the pressure Calico puts on the backend.

Your Environment

  • Calico version: v3.27.2
  • Calico dataplane (iptables, windows etc.): bgp
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes
  • Operating System and version: linux
  • Link to your project (optional):
@gojoy gojoy changed the title add CALICO_DISABLE_TUNNTL env add CALICO_DISABLE_TUNNEL env Oct 25, 2024
@fasaxc
Member

fasaxc commented Oct 25, 2024

Are you running Typha? The allocate-tunnel-ips program will use Typha, our fan-out proxy, if it is enabled. Using Typha removes the load from the API server.

@caseydavenport
Member

Yeah, agree that Typha is likely the solution here. If there's a reason that this is causing load even with Typha, that is most likely a bug in the allocateip.go code - I don't think we need (or should have) an env var to control whether this code is enabled, assuming the code is written efficiently.

@cyclinder
Contributor

Some users install Calico without Typha, so we may still need the env var until we figure out the potential bug in allocateip.go. What do you think? @caseydavenport

@caseydavenport
Member

@cyclinder if there truly is a bug in allocateip.go that is causing excessive resource usage, I'd rather we just fix that - that way, everyone benefits and not just those who discover this new arcane environment variable.

In our scenario, the Kubernetes cluster uses Calico's BGP mode exclusively, and we no longer want to start the calico-node -allocate-tunnel-addrs process, in order to reduce the pressure Calico puts on the backend.

@gojoy do you have evidence that allocate-tunnel-addrs was causing load in your cluster? If so, could you provide that evidence so we can use it to diagnose what might be going wrong?

@gojoy
Author

gojoy commented Nov 5, 2024

@cyclinder if there truly is a bug in allocateip.go that is causing excessive resource usage, I'd rather we just fix that - that way, everyone benefits and not just those who discover this new arcane environment variable.

In our scenario, the Kubernetes cluster uses Calico's BGP mode exclusively, and we no longer want to start the calico-node -allocate-tunnel-addrs process, in order to reduce the pressure Calico puts on the backend.

@gojoy do you have evidence that allocate-tunnel-addrs was causing load in your cluster? If so, could you provide that evidence so we can use it to diagnose what might be going wrong?

Yes, we've noticed that each time reconcileTunnelAddrs is executed, the backend service is called several times. When a large number of new nodes join the cluster, this increases the load on the backend. With tunnels enabled this is normal behaviour, but the key point is that when the cluster administrator knows tunnels will never be used in the cluster, these tunnel-related reconciles have no effect, so the question is whether there could be a variable that stops this process. This is not necessarily in conflict with enabling Typha.

@gojoy
Author

gojoy commented Nov 5, 2024

allocateip.go

One more point: when using the Kubernetes API datastore, allocateip.go queries the Kubernetes apiserver with resourceVersion="", which bypasses the apiserver's cache and increases the cost of each request.
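
A minimal client-go sketch of the underlying Kubernetes API behaviour (illustrative only, not Calico's actual libcalico-go code path; "node-1" is a placeholder node name): a Get with the default resourceVersion of "" is served as a quorum read from etcd, while resourceVersion "0" allows the apiserver to answer from its watch cache.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Default GetOptions: resourceVersion "" forces a quorum read from etcd.
	fresh, err := clientset.CoreV1().Nodes().Get(context.TODO(), "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// resourceVersion "0" lets the apiserver serve the object from its watch
	// cache, trading strict freshness for a much cheaper request.
	cached, err := clientset.CoreV1().Nodes().Get(context.TODO(), "node-1",
		metav1.GetOptions{ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}

	fmt.Println(fresh.ResourceVersion, cached.ResourceVersion)
}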

@caseydavenport
Member

@gojoy I still am not convinced that an environment variable to control this is the correct approach.

Felix also watches Nodes(), and will have similar scale characteristics that require the use of Typha at even a moderate number of nodes. I think the correct solution here is to enable Typha in your cluster - it's exactly what it was designed and intended for!

@caseydavenport
Member

caseydavenport commented Nov 5, 2024

that is most likely a bug in the allocateip.go code

There is an actual "bug" here, which is that we are performing a Get() directly against the API server for the node instead of using a cached version. Specifically these lines:

// Get node resource for given nodename.
node, err := c.Nodes().Get(ctx, nodename, options.GetOptions{})
if err != nil {
	log.WithError(err).Fatalf("failed to fetch node resource '%s'", nodename)
}
// Get list of ip pools
ipPoolList, err := c.IPPools().List(ctx, options.ListOptions{})
if err != nil {
	log.WithError(err).Fatal("Unable to query IP pool configuration")
}

I think there is a strong case to be made for improving that to be cache driven instead of generating API calls.
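
For illustration, a cache-driven read could look roughly like the sketch below, which uses a plain client-go shared informer against the Kubernetes Node API (an assumption for the sake of example, not the libcalico-go client that allocateip.go actually uses; "node-1" is a placeholder node name).

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Build a shared informer factory; the Node informer keeps a local,
	// watch-driven cache so repeated reads do not hit the apiserver.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	nodeLister := factory.Core().V1().Nodes().Lister()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	// Served from the informer's in-memory cache, not a fresh API call.
	node, err := nodeLister.Get("node-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(node.Name)
}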

@gojoy
Author

gojoy commented Nov 7, 2024

@gojoy I still am not convinced that an environment variable to control this is the correct approach.

Felix also watches Nodes(), and will have similar scale characteristics that require the use of Typha at even a moderate number of nodes. I think the correct solution here is to enable Typha in your cluster - it's exactly what it was designed and intended for!

Got it. We will deploy the Typha component. Thanks!

@cyclinder
Contributor

@caseydavenport That makes sense, let me improve this part, thanks!

@liuxu623 liuxu623 linked a pull request Nov 20, 2024 that will close this issue
@liuxu623

that is most likely a bug in the allocateip.go code

There is an actual "bug" here, which is that we are performing a Get() directly against the API server for the node instead of using a cached version. Specifically these lines:

// Get node resource for given nodename.
node, err := c.Nodes().Get(ctx, nodename, options.GetOptions{})
if err != nil {
	log.WithError(err).Fatalf("failed to fetch node resource '%s'", nodename)
}
// Get list of ip pools
ipPoolList, err := c.IPPools().List(ctx, options.ListOptions{})
if err != nil {
	log.WithError(err).Fatal("Unable to query IP pool configuration")
}

I think there is a strong case to be made for improving that to be cache driven instead of generating API calls.

@caseydavenport I found that Typha does not help with these client operations; calls like c.Nodes().Get() and c.IPPools().List() send requests directly to the apiserver even when Typha is in use.
