
VMware Instances not able to reach to management server #10012

Closed
vishesh92 opened this issue Nov 29, 2024 · 13 comments

@vishesh92
Member

ISSUE TYPE
  • Bug Report
COMPONENT NAME
VR
CLOUDSTACK VERSION
Checked in 4.18 & 4.19
CONFIGURATION
OS / ENVIRONMENT

CloudStack with VMware

SUMMARY

The management server is not reachable from guest instances. This causes issues, especially with CKS, which is not able to create load balancer rules.

STEPS TO REPRODUCE
1. Create an isolated network
2. Launch an instance
3. Try to connect to the management server from the instance (for example, with the check shown after these steps)
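A minimal way to perform step 3 (the management server address 10.1.35.76:8080 below is taken from the logs later in this thread; replace it with your own endpoint.url):

# run inside the guest instance; on affected VMware setups this times out
curl -v -m 5 http://10.1.35.76:8080/client/api
# or, if curl is not installed:
nc -zv -w 5 10.1.35.76 8080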
EXPECTED RESULTS
Able to reach the management server from the instance
ACTUAL RESULTS
Not able to reach the management server from the instance
@weizhouapache weizhouapache changed the title from "Instances not able to reach to management server" to "VMware Instances not able to reach to management server" Nov 29, 2024
@weizhouapache
Member

@vishesh92
I have updated the title to emphasize that this issue happens with VMware only.

As Vishesh said, this means CKS control nodes are not able to create LB rules because they cannot connect to the CloudStack management server.

@rajujith
Collaborator

rajujith commented Dec 3, 2024

We could add a policy routing rule in the VR to forward traffic from instances to the management server / endpoint.url via the public interface. CC @vishesh92 @weizhouapache
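For illustration only, a rough sketch of what such a rule could look like on the VR (this is not an implemented feature; interface names, the guest CIDR and the addresses in angle brackets are assumptions, with 10.1.35.76 taken from the logs below):

# route guest traffic destined for the management server out via the public interface (eth2)
ip route add 10.1.35.76/32 via <public-gateway-ip> dev eth2 table 100
ip rule add from <guest-cidr> to 10.1.35.76/32 table 100
# SNAT to the VR's public IP so replies return over the same path
iptables -t nat -A POSTROUTING -d 10.1.35.76/32 -o eth2 -j SNAT --to-source <vr-public-ip>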

@DaanHoogland
Contributor

@rajujith ,
Why do we want VMs to be able to reach the MS? They could go through the north net if needed. It does not seem like a good idea to route them directly. I understand that it would be convenient for CKS nodes, but it does not seem like a good idea to me.
cc @weizhouapache @vishesh92

@weizhouapache
Member

@rajujith , Why do we want VMs to be able to reach the MS? They could go through the north net if needed. It does not seem like a good idea to route them directly. I understand that it would be convenient for CKS nodes, but it does not seem like a good idea to me. cc @weizhouapache @vishesh92

@DaanHoogland
I agree we/users should have a better network design.
The VM should reach the MS via the public network, not the private network.

I think we could consider this a known limitation for CKS on VMware.

@rajujith
Collaborator

rajujith commented Dec 4, 2024

@rajujith , Why do we want VMs to be able to reach the MS? They could go through the north net if needed. It does not seem like a good idea to route them directly. I understand that it would be convenient for CKS nodes, but it does not seem like a good idea to me. cc @weizhouapache @vishesh92

@DaanHoogland, consider the management server URL public/intranet. It is expected that all intended users, including guest instances, should be able to access it from their client devices. In this specific case, the client is CKS. If there is a use case to allow access only from the CKS nodes but not from regular guest instances, even that could be implemented. The traffic traversal is CKS node -> VR guest interface -> VR public interface -> other hops in the path -> management server public interface, directly or via an LB. Since the guest instance traffic is not traversing the management networks, I believe it is regular traffic that can be allowed.

@kiranchavala
Contributor

@DaanHoogland

To provide further context on this

  1. A Kubernetes cluster is deployed on VMware by CloudStack

  2. The end user downloads the Kubernetes config (kubeconfig) to interact with the cluster via the kubectl tool

  3. The end user populates the kubeconfig file so that kubectl can interact with the cluster (a minimal sketch of the file follows)

vi .kube/config
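A minimal sketch of what the populated ~/.kube/config looks like (all values here are placeholders; the real contents come from the kubeconfig downloaded from CloudStack):

apiVersion: v1
kind: Config
clusters:
- name: test
  cluster:
    server: https://<cluster-api-endpoint>:6443
    certificate-authority-data: <base64-ca-cert>
users:
- name: test-admin
  user:
    client-certificate-data: <base64-client-cert>
    client-key-data: <base64-client-key>
contexts:
- name: test-admin@test
  context:
    cluster: test
    user: test-admin
current-context: test-admin@test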

  4. Check the status of the cluster and nodes
kubectl get nodes
NAME                       STATUS   ROLES           AGE     VERSION
test-control-19377251026   Ready    control-plane   4d22h   v1.28.4
test-node-19377261815      Ready    <none>          4d22h   v1.28.4
  5. Deploy a sample application (example: nginx) on the Kubernetes cluster

kubectl apply -f nginx.yaml


apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  6. Check the application status
 kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-86dcfdf4c6-fvjxp   1/1     Running   0          66m
nginx-deployment-86dcfdf4c6-jqjxw   1/1     Running   0          66m
  7. Now the end user wants to access the application via a public IP (from the public IP range provided by CloudStack)

kubectl expose deploy/nginx-deployment --port=80 --type=LoadBalancer

  8. Now check the external public IP; it will be stuck in the pending state
k get svc
NAME               TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kubernetes         ClusterIP      10.96.0.1        <none>        443/TCP        4d22h
nginx-deployment   LoadBalancer   10.104.200.145   <pending>     80:32333/TCP   66m
  9. This is because the controller pods (cloud-controller-manager and kube-controller-manager), which are responsible for assigning a public IP address to the application, fail

kube-system            kube-controller-manager-test-control-19377251026   1/1     Running   79 (6h15m ago)   4d22h   172.30.1.109   test-control-19377251026
kube-system            cloud-controller-manager-574bcb86c-9fcgd           1/1     Running   77 (8h ago)    5d1h

  10. On checking the logs of the cloud-controller pod, we can see a timeout issue.

k logs -f cloud-controller-manager-574bcb86c-9fcgd -n kube-system

47DXsoV%2BOnK945EvKkfLmLj4tU%3D: dial tcp 10.1.35.76:8080: i/o timeout, and error by node name: error retrieving node addresses: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listVirtualMachines&name=test-control-19377251026&response=json&signature=N6Ykg6Es%2BQWJYXIQ5KKXjyAHoiI%3D: dial tcp 10.1.35.76:8080: i/o timeout
E1204 07:44:13.663811       1 controller.go:244] error processing service kube-system/nginx-lb (will retry): failed to ensure load balancer: error retrieving load balancer rules: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listLoadBalancerRules&keyword=ab9e4156d8f724ccfa61f379a8aaa0a6&listall=true&response=json&signature=MYL8q14GpEp5QxnRDg4nRXnkhQg%3D: dial tcp 10.1.35.76:8080: i/o timeout
I1204 07:44:13.664131       1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"kube-system", Name:"nginx-lb", UID:"b9e4156d-8f72-4ccf-a61f-379a8aaa0a63", APIVersion:"v1", ResourceVersion:"937", FieldPath:""}): type: 'Warning' reason: 'SyncLoadBalancerFailed' Error syncing load balancer: failed to ensure load balancer: error retrieving load balancer rules: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listLoadBalancerRules&keyword=ab9e4156d8f724ccfa61f379a8aaa0a6&listall=true&response=json&signature=MYL8q14GpEp5QxnRDg4nRXnkhQg%3D: dial tcp 10.1.35.76:8080: i/o timeout
I1204 07:44:13.665040       1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"default", Name:"nginx-deployment", UID:"efa6bff6-3d27-411f-a5be-2a94f7416df7", APIVersion:"v1", ResourceVersion:"995877", FieldPath:""}): type: 'Normal' reason: 'EnsuringLoadBalancer' Ensuring load balancer
E1204 07:44:39.309010       1 node_controller.go:237] error retrieving instance ID: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listVirtualMachines&name=test-node-19377261815&response=json&signature=1ONErKvq5OZxjpR26hzPmd8PLtY%3D: dial tcp 10.1.35.76:8080: i/o timeout
E1204 07:44:43.665211       1 controller.go:244] error processing service default/nginx-deployment (will retry): failed to ensure load balancer: error retrieving load balancer rules: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listLoadBalancerRules&keyword=aefa6bff63d27411fa5be2a94f7416df&listall=true&response=json&signature=AHQ1nikqf3Vi7ISV9zwO%2FuxgjgU%3D: dial tcp 10.1.35.76:8080: i/o timeout
I1204 07:44:43.665448       1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"default", Name:"nginx-deployment", UID:"efa6bff6-3d27-411f-a5be-2a94f7416df7", APIVersion:"v1", ResourceVersion:"995877", FieldPath:""}): type: 'Warning' reason: 'SyncLoadBalancerFailed' Error syncing load balancer: failed to ensure load balancer: error retrieving load balancer rules: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listLoadBalancerRules&keyword=aefa6bff63d27411fa5be2a94f7416df&listall=true&response=json&signature=AHQ1nikqf3Vi7ISV9zwO%2FuxgjgU%3D: dial tcp 10.1.35.76:8080: i/o timeout
E1204 07:45:39.310773       1 node_controller.go:245] Error getting node addresses for node "test-node-19377261815": error fetching node by provider ID: error retrieving node addresses: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listVirtualMachines&id=&response=json&signature=547DXsoV%2BOnK945EvKkfLmLj4tU%3D: dial tcp 10.1.35.76:8080: i/o timeout, and error by node name: error retrieving node addresses: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listVirtualMachines&name=test-node-19377261815&response=json&signature=1ONErKvq5OZxjpR26hzPmd8PLtY%3D: dial tcp 10.1.35.76:8080: i/o timeout
E1204 07:46:20.106490       1 controller.go:719] failed to check if load balancer exists for service kube-system/nginx-lb: error retrieving load balancer rules: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listLoadBalancerRules&keyword=ab9e4156d8f724ccfa61f379a8aaa0a6&listall=true&response=json&signature=MYL8q14GpEp5QxnRDg4nRXnkhQg%3D: dial tcp 10.1.35.76:8080: i/o timeout
E1204 07:46:20.106727       1 controller.go:685] failed to update load balancer hosts for service kube-system/nginx-lb: error retrieving load balancer rules: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listLoadBalancerRules&keyword=ab9e4156d8f724ccfa61f379a8aaa0a6&listall=true&response=json&signature=MYL8q14GpEp5QxnRDg4nRXnkhQg%3D: dial tcp 10.1.35.76:8080: i/o timeout
I1204 07:46:20.107496       1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"kube-system", Name:"nginx-lb", UID:"b9e4156d-8f72-4ccf-a61f-379a8aaa0a63", APIVersion:"v1", ResourceVersion:"937", FieldPath:""}): type: 'Warning' reason: 'UpdateLoadBalancerFailed' Error updating load balancer with new hosts map[test-control-19377251026:{} test-node-19377261815:{}]: error retrieving load balancer rules: Get http://10.1.35.76:8080/client/api?apiKey=m8GvJ-HIcVY9XurofyAvOLn6mV7aAURljtgpCVH13b5O48ej_ewMNjNo32iKe64oeSuYrzI_gEgg7JCTo_UjPA&command=listLoadBalancerRules&keyword=ab9e4156d8f724ccfa61f379a8aaa0a6&listall=true&response=json&signature=MYL8q14GpEp5QxnRDg4nRXnkhQg%3D: dial tcp 10.1.35.76:8080: i/o timeout

So, basically, a user on VMware is unable to access the Kubernetes application via the Kubernetes LoadBalancer service.

The workaround is to expose the application via a NodePort service.

Follow steps 1 to 6

  7. Expose the application via NodePort
kubectl expose deploy/nginx-deployment --port=80 --type=NodePort

NAME                TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kubernetes          ClusterIP      10.96.0.1        <none>        443/TCP        5d1h
nginx-deployment2   NodePort       10.103.111.85    <none>        80:32014/TCP   4s
  8. Navigate to the network and acquire a public IP (steps 8-11 can also be scripted via the API; see the sketch after this list)


  9. Allow port 80 in the firewall on the public IP address


  10. Add a load balancer rule specifying the private node port


  11. Add the Kubernetes node to the load balancer rule


  12. Access the application on the manually acquired public IP on port 80

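For reference, the manual UI steps 8-11 above can also be scripted against the CloudStack API; a rough sketch with CloudMonkey (cmk), using placeholder IDs and the example NodePort 32014 from step 7:

# 8. acquire a public IP on the isolated network
cmk associate ipaddress networkid=<network-id>
# 9. open port 80 on that public IP
cmk create firewallrule ipaddressid=<public-ip-id> protocol=tcp startport=80 endport=80
# 10. create a load balancer rule mapping public port 80 to the NodePort
cmk create loadbalancerrule name=nginx-nodeport publicipid=<public-ip-id> publicport=80 privateport=32014 algorithm=roundrobin
# 11. assign the Kubernetes node VMs to the rule (this list must be kept up to date as nodes change)
cmk assign toloadbalancerrule id=<lb-rule-id> virtualmachineids=<node-vm-id-1>,<node-vm-id-2>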

I think for now we can document this workaround for CKS deployments on VMware and mention that the LoadBalancer service type is not supported on VMware.

cc @rajujith @vishesh92 @weizhouapache

Ref

https://kubernetes.io/docs/tutorials/kubernetes-basics/expose/expose-intro/
https://kubernetes.io/docs/concepts/services-networking/

@weizhouapache
Member

Thanks @kiranchavala for providing the workaround.

I agree we could document it for now.

We will work on static route improvements and policy-based routes for the next release.
With those, user VMs could reach the management server by adding a new route via the public network.
cc @rajujith @DaanHoogland @alexandremattioli

@DaanHoogland
Contributor

So, for now we need to tell VMware/CKS users to create load balancer rules by hand. Would that work, @weizhouapache @vishesh92?

@vishesh92
Member Author

So, for now we need to tell VMware/CKS users to create load balancer rules by hand. Would that work, @weizhouapache @vishesh92?

We can tell users to create the LB rules by hand. But the user would also have to ensure that the list of VMs stays up to date in the load balancer rules.
IMO, the long-term solution is to allow connectivity between the MS and the nodes. As of now, the lack of it also prevents the user from running a VM with automation scripts to manage CloudStack resources.

@DaanHoogland DaanHoogland moved this from Todo to Discuss in Apache CloudStack BugFest - Issues Dec 16, 2024
@DaanHoogland
Contributor

So, for now we need to tell VMware/CKS users to create load balancer rules by hand. Would that work, @weizhouapache @vishesh92?

We can tell users to create the LB rules by hand. But the user would also have to ensure that the list of VMs stays up to date in the load balancer rules. IMO, the long-term solution is to allow connectivity between the MS and the nodes. As of now, the lack of it also prevents the user from running a VM with automation scripts to manage CloudStack resources.

How about setting a rule on the VR, but only in the case where the VM is a CKS control node?

@weizhouapache
Member

So, for now we need to tell VMware/CKS users to create load balancer rules by hand. Would that work, @weizhouapache @vishesh92?

We can tell users to create the LB rules by hand. But the user would also have to ensure that the list of VMs stays up to date in the load balancer rules. IMO, the long-term solution is to allow connectivity between the MS and the nodes. As of now, the lack of it also prevents the user from running a VM with automation scripts to manage CloudStack resources.

How about setting a rule on the VR, but only in the case where the VM is a CKS control node?

@DaanHoogland
Technically it is feasible; there are multiple options to achieve it.
However, from a security standpoint, I would suggest not enabling it. The control IP of the VR on VMware is private, and user VMs should not be able to use it.

@DaanHoogland
Contributor

OK, for 4.19.2 we will put the text from #10012 (comment) in the documentation and, after that, convert this issue to a discussion on how to address it in the end.

@kiranchavala
Contributor

Created a doc PR:

apache/cloudstack-documentation#466

@DaanHoogland DaanHoogland modified the milestones: 4.19.2, unplanned Jan 3, 2025
@apache apache locked and limited conversation to collaborators Jan 23, 2025
@DaanHoogland DaanHoogland converted this issue into discussion #10258 Jan 23, 2025

