
Bootstrap request to first cpn takes a long time #133

Open
magicite opened this issue Jul 22, 2022 · 4 comments

@magicite

Might be the same as #109.

I'm creating a cluster on some old Dell R620s with Sidero, and I'm noticing that once the first control plane node is ready to receive the bootstrap request, it takes a variable amount of time for the request to arrive; I've seen anywhere from 6 to 15 minutes. If I kill the cacppt-controller-manager pod, that seems to kickstart things.
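
In case it matters, "killing" the pod here just means deleting it and letting the Deployment recreate it, roughly like this (the label selector is a guess at the standard kubebuilder one, so adjust it to whatever your deployment actually sets):

# label selector is an assumption; verify with: kubectl -n cacppt-system get pods --show-labels
kubectl -n cacppt-system delete pod -l control-plane=controller-manager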

I'm running the latest of everything.

[root@dill04 sidero]# clusterctl upgrade plan --kubeconfig-context admin@ben-sidero-demo-2
Checking cert-manager version...
Cert-Manager is already up to date

Checking new release availability...

Latest release available for the v1beta1 API Version of Cluster API (contract):

NAME                    NAMESPACE       TYPE                     CURRENT VERSION   NEXT VERSION
bootstrap-talos         cabpt-system    BootstrapProvider        v0.5.4            Already up to date
control-plane-talos     cacppt-system   ControlPlaneProvider     v0.4.6            Already up to date
cluster-api             capi-system     CoreProvider             v1.2.0            Already up to date
infrastructure-sidero   sidero-system   InfrastructureProvider   v0.5.2            Already up to date

You are already up to date!

Attached is the cacppt-controller-manager log. It covers two cluster provisions: the first one did not hit the "takes a long time" issue, and the second one did. I think the interesting bits start at 1.658505243132774e+09 (last failed message).
cacppt-controller-manager-delayed-bootstrap.txt
cpn_console.txt
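
For reference, those log timestamps are just Unix epoch seconds, so the point where the interesting bits start converts to roughly:

date -ud @1658505243
# Fri Jul 22 15:54:03 UTC 2022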

smira commented Jul 22, 2022

@Unix4ever any ideas?

@magicite

A few updates in case they're helpful.

  • I've provisioned workload clusters from two kinds of management clusters: one created with talosctl cluster create on Docker, and one running on bare-metal Talos. I've only seen this issue with the Docker-based management cluster.
  • Even on the Docker cluster I sometimes don't hit the issue at all; I'd say it happens roughly half the time.

Here's how I'm creating the Docker management cluster:

talosctl cluster create \
  --name bootstrap \
  --kubernetes-version 1.24.3 \
  -p 69:69/udp,8081:8081/tcp,51821:51821/udp \
  --memory 4096 \
  --workers 0 \
  --nameservers 16.110.135.51,16.110.135.52 \
  --registry-mirror docker.io=http://dill04.us.cray.com:2022 \
  --registry-mirror k8s.gcr.io=http://dill04.us.cray.com:2023 \
  --registry-mirror quay.io=http://dill04.us.cray.com:2024 \
  --registry-mirror gcr.io=http://dill04.us.cray.com:2025 \
  --registry-mirror ghcr.io=http://dill04.us.cray.com:2026 \
  --registry-mirror registry.k8s.io=http://dill04.us.cray.com:2027 \
  --with-cluster-discovery=false \
  --config-patch @env.yaml \
  --config-patch-control-plane @env.yaml \
  --config-patch-worker @env.yaml \
  --endpoint $HOST_IP

with the patch file being

- op: add
  path: /machine/env
  value:
     http_proxy: xxx
     https_proxy: xxx
     no_proxy: xxx
- op: add
  path: /cluster/allowSchedulingOnMasters
  value: true
- op: add
  path: /machine/time
  value:
    servers:
    - 16.110.135.123
    - 16.229.168.10
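
For context, the intent of that patch is to end up with machine configs that look roughly like this (field paths per the Talos machine config schema; proxy values redacted):

# sketch of the merged result only, not a complete machine config
machine:
  env:
    http_proxy: xxx
    https_proxy: xxx
    no_proxy: xxx
  time:
    servers:
      - 16.110.135.123
      - 16.229.168.10
cluster:
  allowSchedulingOnMasters: true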

smira commented Oct 27, 2022

Are you using the latest versions of the providers? We've had some fixes since then.

@magicite

Yes - I am using the latest released versions of the providers.

[root@dill04 demo-1.2]# clusterctl --kubeconfig-context admin@bootstrap upgrade plan
Checking cert-manager version...
Cert-Manager is already up to date

Checking new release availability...

Latest release available for the v1beta1 API Version of Cluster API (contract):

NAME                    NAMESPACE       TYPE                     CURRENT VERSION   NEXT VERSION
bootstrap-talos         cabpt-system    BootstrapProvider        v0.5.5            Already up to date
control-plane-talos     cacppt-system   ControlPlaneProvider     v0.4.10           Already up to date
cluster-api             capi-system     CoreProvider             v1.2.4            Already up to date
infrastructure-sidero   sidero-system   InfrastructureProvider   v0.5.5            Already up to date

You are already up to date!
