
openstack-scs-1-29-v1 / openstack-scs-1-28-v2 not deployable (cilium issues) #143

Closed
Nils98Ar opened this issue Jul 18, 2024 · 14 comments · Fixed by #147 or SovereignCloudStack/cluster-stack-operator#243
Labels: bug (Something isn't working), Container (Issues or pull requests relevant for Team 2: Container Infra and Tooling)
Milestone: R7 (v8.0.0)

Comments

@Nils98Ar
Member

Nils98Ar commented Jul 18, 2024

/kind bug

What steps did you take and what happened:

Create an openstack-scs-1-29-v1 or openstack-scs-1-28-v2 cluster.

The cluster deployment gets stuck at 3/3 worker nodes and 1/3 control plane nodes. All nodes remain in the status NotReady.
The nodes do not get an internal IP:

NAME                                   STATUS     ROLES           VERSION   INTERNAL-IP
cluster-scs-n64mk-f4xgt                NotReady   control-plane   v1.29.6   <none>
cluster-scs-worker-dsvk6-56cwn-n596p   NotReady   <none>          v1.29.6   <none>
cluster-scs-worker-dsvk6-56cwn-npxzf   NotReady   <none>          v1.29.6   <none>
cluster-scs-worker-dsvk6-56cwn-vrtdl   NotReady   <none>          v1.29.6   <none>

Different pods have the following line in their logs:

Error from server: no preferred addresses found; known addresses: []

One of the first errors in the nodes' /var/log/syslog is:

cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config

The directory /etc/cni/net.d is empty on the nodes.

What did you expect to happen:

The cluster is created successfully and is usable.

@Nils98Ar
Member Author

Nils98Ar commented Jul 18, 2024

This could be the reason (cso-controller-manager logs):

{
  "level": "ERROR",
  "time": "2024-07-18T15:36:24.881Z",
  "file": "kube/kube.go:206",
  "message": "failed to apply object",
  "controller": "clusteraddon",
  "controllerGroup": "clusterstack.x-k8s.io",
  "controllerKind": "ClusterAddon",
  "ClusterAddon": {
    "name": "cluster-addon-cluster-scs",
    "namespace": "project-test"
  },
  "namespace": "kube-system",
  "name": "cilium",
  "reconcileID": "ca7c0a4b-19a8-47f6-a99a-04c254712b1d",
  "obj": "apps/v1, Kind=DaemonSet",
  "error": "failed to apply object: failed to create typed patch object (kube-system/cilium; apps/v1, Kind=DaemonSet): .spec.template.spec.securityContext.appArmorProfile: field not declared in schema",
  "stacktrace": "github.com/SovereignCloudStack/cluster-stack-operator/pkg/kube.(*kube).Apply\n\t/src/cluster-stack-operator/pkg/kube/kube.go:206\ngithub.com/SovereignCloudStack/cluster-stack-operator/internal/controller.(*ClusterAddonReconciler).templateAndApplyClusterAddonHelmChart\n\t/src/cluster-stack-operator/internal/controller/clusteraddon_controller.go:737\ngithub.com/SovereignCloudStack/cluster-stack-operator/internal/controller.(*ClusterAddonReconciler).Reconcile\n\t/src/cluster-stack-operator/internal/controller/clusteraddon_controller.go:276\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/src/cluster-stack-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/src/cluster-stack-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/src/cluster-stack-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/src/cluster-stack-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"
}

@Nils98Ar
Member Author

Nils98Ar commented Jul 18, 2024

It seems that .spec.template.spec.securityContext.appArmorProfile was introduced in Kubernetes 1.30 and in the Cilium Helm chart in version 1.15.5 (the mentioned ClusterStack releases use version 1.15.6).
https://kubernetes.io/docs/tutorials/security/apparmor/#securing-a-pod

The Helm chart should normally check the Kubernetes version via .Capabilities.KubeVersion.Version during helm install and skip the appArmorProfile field for Kubernetes versions < 1.30. Maybe this does not work in the ClusterStacks scenario? I am not sure in which context the templating is done.
E.g. https://github.com/cilium/cilium/blob/v1.15.6/install/kubernetes/cilium/templates/cilium-agent/daemonset.yaml#L86-L94
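For reference, the gate in the linked daemonset.yaml looks roughly like this (a paraphrase of the Cilium 1.15.6 chart, not a verbatim copy):

```yaml
# Sketch of the chart's version gate: the appArmorProfile block is only
# rendered when the Kubernetes version reported at templating time by
# .Capabilities.KubeVersion is 1.30 or newer.
securityContext:
  {{- if semverCompare ">=1.30.0" .Capabilities.KubeVersion.Version }}
  appArmorProfile:
    type: Unconfined
  {{- end }}
```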

@Nils98Ar Nils98Ar changed the title openstack-scs-* not working on our SCS OpenStack cloud openstack-scs-* 1.28/1.29 not working on our SCS OpenStack cloud Jul 18, 2024
@Nils98Ar
Member Author

These should be all relevant parts of the helm chart with checks for Kubernetes < 1.30: https://github.com/search?q=repo%3Acilium%2Fcilium%20%22%3C1.30.0%22&type=code

@Nils98Ar Nils98Ar changed the title openstack-scs-* 1.28/1.29 not working on our SCS OpenStack cloud openstack-scs-1-29-v1 / openstack-scs-1-29-v2 not deployable (cilium issues) Jul 18, 2024
@Nils98Ar Nils98Ar changed the title openstack-scs-1-29-v1 / openstack-scs-1-29-v2 not deployable (cilium issues) openstack-scs-1-29-v1 / openstack-scs-1-28-v2 not deployable (cilium issues) Jul 18, 2024
@chess-knight
Member

CSO does helm template | kubectl apply -f -, which is why the semverCompare logic in Cilium's Helm chart doesn't work here: helm template does not query the workload cluster, so .Capabilities.KubeVersion does not reflect the cluster's real version. It should work for 1.30, as you wrote, but for <1.30.0 it is a bug.
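To illustrate the failure mode, here is a small Python sketch (hypothetical code, not Cilium's actual chart) of what such a semverCompare gate does, and why templating with an assumed version >= 1.30 emits a field that a 1.29 API server does not know:

```python
# Hypothetical sketch (not Cilium's actual chart code) of a
# semverCompare-style gate on .Capabilities.KubeVersion.Version.

def parse(version):
    """'v1.29.6' -> (1, 29, 6) for tuple comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))

def render_security_context(kube_version):
    """Emit the appArmorProfile field only for Kubernetes >= 1.30.0."""
    ctx = {}
    if parse(kube_version) >= (1, 30, 0):
        ctx["appArmorProfile"] = {"type": "Unconfined"}
    return ctx

# `helm install` against the real cluster sees v1.29.6: field skipped.
print(render_security_context("v1.29.6"))   # -> {}

# `helm template` does not query the workload cluster; if the version it
# assumes is >= 1.30, the field is rendered, and the 1.29 API server then
# rejects it as "field not declared in schema".
print(render_security_context("v1.30.0"))   # -> {'appArmorProfile': {'type': 'Unconfined'}}
```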

@chess-knight chess-knight added the bug (Something isn't working) and Container (Issues or pull requests relevant for Team 2: Container Infra and Tooling) labels Jul 19, 2024
@Nils98Ar
Member Author

Yes it does work for 1.30.

@Nils98Ar
Member Author

By the way: it seems that older Kubernetes 1.28/1.29 openstack-scs releases do not work either, because of a missing security group „0“ according to CSPO. But I guess as soon as the new versions work, the old ones are obsolete anyway.

@chess-knight
Member

AFAIK CSPO only cares about node images. What do you mean by security group „0“?

@chess-knight chess-knight moved this from Backlog to Todo in Sovereign Cloud Stack Jul 23, 2024
@chess-knight chess-knight added this to the R7 (v8.0.0) milestone Jul 23, 2024
@michal-gubricky
Contributor


Hi @Nils98Ar, I just tested the creation of the cluster using the main branch of the cluster-stacks repo, built it via csctl, and did not encounter your error. The Kubernetes version is 1.28.11.

NAME                                            STATUS   ROLES           AGE     VERSION
test-cluster-5cgr8-4pj6m                        Ready    control-plane   3m38s   v1.28.11
test-cluster-5cgr8-tst5n                        Ready    control-plane   31m     v1.28.11
test-cluster-5cgr8-xdwvh                        Ready    control-plane   24m     v1.28.11
test-cluster-default-worker-b6fx8-8zrmf-2v865   Ready    <none>          28m     v1.28.11
test-cluster-default-worker-b6fx8-8zrmf-jdldc   Ready    <none>          24m     v1.28.11
test-cluster-default-worker-b6fx8-8zrmf-p5wh7   Ready    <none>          24m     v1.28.11

@chess-knight
Member

@michal-gubricky, what is the state of the ClusterAddon object?

@michal-gubricky
Contributor


Here are all pods in the kube-system namespace and also the state of the ClusterAddon resource:

ubuntu@mg-cluster-stack-vm:~$ k get clusteraddons.clusterstack.x-k8s.io cluster-addon-test-cluster 
NAME                         CLUSTER        HOOK   READY   AGE   REASON   MESSAGE
cluster-addon-test-cluster   test-cluster          true    79m 
ubuntu@mg-cluster-stack-vm:~$ k get po -n kube-system --kubeconfig test-cluster.kubeconfig 
NAME                                                     READY   STATUS    RESTARTS         AGE
cilium-fk2b9                                             1/1     Running   1                66m
cilium-gmh4x                                             1/1     Running   0                39m
cilium-l9jgw                                             1/1     Running   0                63m
cilium-lgmsv                                             1/1     Running   0                60m
cilium-mj7qz                                             1/1     Running   1 (49m ago)      60m
cilium-ncxr4                                             1/1     Running   0                52m
cilium-operator-8645b8bb4f-ppd9l                         1/1     Running   9 (3m28s ago)    66m
cilium-operator-8645b8bb4f-v9vl7                         1/1     Running   9 (5m46s ago)    66m
coredns-5dd5756b68-fhdn2                                 1/1     Running   0                66m
coredns-5dd5756b68-r7mwx                                 1/1     Running   0                66m
etcd-test-cluster-5cgr8-4pj6m                            1/1     Running   1 (19m ago)      39m
etcd-test-cluster-5cgr8-tst5n                            1/1     Running   1 (19m ago)      66m
etcd-test-cluster-5cgr8-xdwvh                            1/1     Running   0                60m
kube-apiserver-test-cluster-5cgr8-4pj6m                  1/1     Running   2 (21m ago)      39m
kube-apiserver-test-cluster-5cgr8-tst5n                  1/1     Running   5 (23m ago)      67m
kube-apiserver-test-cluster-5cgr8-xdwvh                  1/1     Running   4 (23m ago)      60m
kube-controller-manager-test-cluster-5cgr8-4pj6m         1/1     Running   1 (27m ago)      39m
kube-controller-manager-test-cluster-5cgr8-tst5n         1/1     Running   9 (3m33s ago)    66m
kube-controller-manager-test-cluster-5cgr8-xdwvh         1/1     Running   3 (5m48s ago)    60m
kube-proxy-5dhjg                                         1/1     Running   0                39m
kube-proxy-649sl                                         1/1     Running   0                60m
kube-proxy-7gs4w                                         1/1     Running   0                63m
kube-proxy-7hpxb                                         1/1     Running   0                66m
kube-proxy-c62mx                                         1/1     Running   0                52m
kube-proxy-ch5fd                                         1/1     Running   0                52m
kube-scheduler-test-cluster-5cgr8-4pj6m                  1/1     Running   2 (3m28s ago)    39m
kube-scheduler-test-cluster-5cgr8-tst5n                  1/1     Running   8 (24m ago)      67m
kube-scheduler-test-cluster-5cgr8-xdwvh                  1/1     Running   3 (5m46s ago)    60m
metrics-server-666c6745d5-d6nvf                          1/1     Running   0                66m
openstack-cinder-csi-controllerplugin-78c4557887-qhvjr   6/6     Running   17 (3m30s ago)   66m
openstack-cinder-csi-nodeplugin-qxmdq                    3/3     Running   0                60m
openstack-cinder-csi-nodeplugin-rvppn                    3/3     Running   0                63m
openstack-cinder-csi-nodeplugin-ssll9                    3/3     Running   0                39m
openstack-cinder-csi-nodeplugin-t6cql                    3/3     Running   0                66m
openstack-cinder-csi-nodeplugin-vt9z6                    3/3     Running   0                60m
openstack-cinder-csi-nodeplugin-w8lcr                    3/3     Running   0                60m
openstack-cloud-controller-manager-4vdcl                 1/1     Running   2 (2m47s ago)    52m
openstack-cloud-controller-manager-6dkfw                 1/1     Running   2 (5m48s ago)    35m
openstack-cloud-controller-manager-f84zp                 1/1     Running   4 (19m ago)      46m

@chess-knight
Member

chess-knight commented Jul 24, 2024

As @Nils98Ar wrote, the breaking change was introduced in Cilium chart version 1.15.5. The main branch installs version 1.15.2, which is why it works for you, @michal-gubricky. I compared cluster-addon/Chart.lock with cluster-addon/Chart.yaml, and they differ: we are missing the helm dependency update step there. Please also check the release- branches, where it is correct.

@michal-gubricky
Contributor


Yeah, I was just looking at the version in Chart.yaml, and it is 1.15.6.

@chess-knight
Member

chess-knight commented Aug 8, 2024

Hi @janiskemper, can you please take a look? IMO we have three options here:

  1. Use the workaround from 🐛 Fix issue with cilium when k8s < 1.30 #150
  2. Use helm template --kube-version ... in the CSO cluster-addon logic, if it is possible for this controller to know the workload cluster's Kubernetes version. This of course needs to be tested first to confirm it is enough. Templating charts with the known Kubernetes version is a good idea not only for the Cilium chart, because other Helm charts probably use Capabilities.KubeVersion as well.
  3. Downgrade the Cilium chart for k8s < 1.30
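A rough sketch of what option 2 could look like on the command line (the chart path and versions here are placeholders, and CSO would do the equivalent internally rather than shell out like this):

```shell
# Hypothetical sketch: pass the workload cluster's Kubernetes version so
# .Capabilities.KubeVersion matches reality during client-side templating.
# Chart path, version, and namespace are placeholders.
helm template cilium ./cluster-addon/charts/cilium \
  --kube-version v1.29.6 \
  --namespace kube-system \
  | kubectl apply -f -
```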
