Service and Endpoints for the node exporters are not correctly configured #826

SSvilen · 2021-11-23T14:08:13Z

The Windows node exporter is installed on all Windows worker nodes, but the required Service and Endpoints resources are not created at all.
There is a Service object created, but it is of type ClusterIP, which won't work in this case.
The Service should be of type 'ExternalName', and the Endpoints should be updated by the operator on every node join/deletion operation.
For instance:

apiVersion: v1
kind: Service
metadata:
 labels:
   name: windows-exporter
 name: windows-exporter
 namespace: openshift-windows-machine-config-operator
spec:
 type: ExternalName
 ports:
   - name: metrics
     port: 9182
     protocol: TCP
     targetPort: 9182
 externalName: nodexporter
---
apiVersion: v1
kind: Endpoints
metadata:
 labels:
   name: windows-exporter
 name: windows-exporter
 namespace: openshift-windows-machine-config-operator
subsets:
 - addresses:
     - ip: 1.1.1.1
       targetRef:
         kind: Node
         name: winmach-q84jj
         uid: ab8028e7-a0ed-4f83-89e5-b577be2231ed
     - ip: 1.1.1.1
       targetRef:
         kind: Node
         name: winmach-t5vgm
         uid: 1b710328-88d5-4142-a78f-dd414705cc19
   ports:
     - name: metrics
       port: 9182
       protocol: TCP

mansikulkarni96 · 2021-11-23T15:17:38Z

@SSvilen thanks for the provided information.

As you can see in manifests/windows-exporter_v1_service.yaml, the type is not set to ClusterIP.
The Service and the Endpoints object should both be named windows-exporter, as that name is used to get the resources in the operator code.
I suspect monitoring is not enabled in the operator namespace. Please ensure the label openshift.io/cluster-monitoring=true is present on the openshift-windows-machine-config-operator namespace; it is required for WMCO to create monitoring resources in that namespace.
If it is not enabled, the WMCO logs will contain a message like: install the prometheus-operator to enable Prometheus configuration
Community Operators have a checkbox to enable monitoring in the operator namespace. If you are building from source, you can set the label with oc label ns openshift-windows-machine-config-operator openshift.io/cluster-monitoring=true --overwrite
Let us know if that resolves the issue!

SSvilen · 2021-11-23T15:36:26Z

@mansikulkarni96,

Monitoring is enabled for the namespace. The problem is that the node exporter is installed directly on the Windows worker nodes and does not run as a pod, as it does on Linux-based nodes. So the Prometheus operator cannot properly discover the endpoint for that ServiceMonitor.
I therefore had to recreate the Service and manually create the Endpoints object, which in turn points to the Windows nodes.
Or am I overthinking this?

mansikulkarni96 · 2021-11-23T15:53:15Z

@SSvilen Thanks for confirming that.
The behaviour you see is expected: windows-exporter runs as a Windows service on the Windows nodes, which differs from its Linux counterpart for support reasons.
The Prometheus operator should still be able to discover the endpoint. Take a look at service_monitor.yaml to see how the relabelings are applied to make the endpoint discoverable.
What you are expecting is exactly what the operator does: it updates the Endpoints object on every node join/deletion operation. There are more details in metrics.go if you are interested in the code base.
If you could provide the Windows Machine Config Operator logs and details about the exact operator version, the OCP version and the steps followed to reach this point, I should be able to help you further.
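
To make the "update on every node join/deletion" behaviour concrete, here is a minimal Go sketch of rebuilding an address list from the current set of Windows nodes. It is not taken from metrics.go. The Node and EndpointAddress types and the endpointAddresses function are hypothetical stand-ins, and skipping nodes without an IP is an assumption made for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, simplified view of a Windows node.
type Node struct {
	Name       string
	InternalIP string
}

// EndpointAddress is a hypothetical, simplified Endpoints address entry.
type EndpointAddress struct {
	IP       string
	NodeName string
}

// endpointAddresses rebuilds the full address list from the current nodes.
// Assumption: nodes without an internal IP are skipped.
func endpointAddresses(nodes []Node) []EndpointAddress {
	addrs := make([]EndpointAddress, 0, len(nodes))
	for _, n := range nodes {
		if n.InternalIP == "" {
			continue
		}
		addrs = append(addrs, EndpointAddress{IP: n.InternalIP, NodeName: n.Name})
	}
	// Sort by node name so repeated runs produce a stable result.
	sort.Slice(addrs, func(i, j int) bool { return addrs[i].NodeName < addrs[j].NodeName })
	return addrs
}

func main() {
	// Node names are from this thread; the IP is a made-up example value.
	nodes := []Node{
		{Name: "winmach-t5vgm", InternalIP: ""},
		{Name: "winmach-q84jj", InternalIP: "10.0.0.5"},
	}
	for _, a := range endpointAddresses(nodes) {
		fmt.Printf("%s -> %s:9182\n", a.NodeName, a.IP)
	}
}
```

Recomputing the whole list on each join or deletion, rather than patching single entries, keeps the sketch idempotent: the result depends only on the current nodes.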

SSvilen · 2021-11-23T17:16:09Z

OK, I see what's happening.

controller.windowsmachine    invalid Machine    {"name": "winmach-t5vgm", "error": "no internal IP address associated",

and based on the code in metrics.go an internal IP address is expected.

The status field of the machine shows type 'InternalDNS'

status:
  addresses:
   - address: winmach-q84jj
     type: InternalDNS

I'm not sure why that is.
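
For readers following along, a minimal Go sketch of why a status that only carries an InternalDNS entry leads to this error. This is not the operator's actual code. MachineAddress and internalIP are hypothetical names, and only the error text and the "InternalDNS" value come from the output above; the "InternalIP" type string and the 10.0.0.5 address are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// MachineAddress is a hypothetical stand-in for one entry under
// status.addresses; it is not the real Machine API type.
type MachineAddress struct {
	Type    string
	Address string
}

// internalIP returns the first non-empty address of type "InternalIP".
// Illustrative helper only.
func internalIP(addrs []MachineAddress) (string, error) {
	for _, a := range addrs {
		if a.Type == "InternalIP" && a.Address != "" {
			return a.Address, nil
		}
	}
	return "", errors.New("no internal IP address associated")
}

func main() {
	// Status as reported in this thread: only an InternalDNS entry.
	dnsOnly := []MachineAddress{{Type: "InternalDNS", Address: "winmach-q84jj"}}
	if _, err := internalIP(dnsOnly); err != nil {
		fmt.Println("dnsOnly:", err)
	}

	// Hypothetical status once an IP is also reported (made-up address).
	withIP := append(dnsOnly, MachineAddress{Type: "InternalIP", Address: "10.0.0.5"})
	if ip, err := internalIP(withIP); err == nil {
		fmt.Println("withIP:", ip)
	}
}
```

Under these assumptions, a DNS name alone is never accepted as a substitute for the IP, which matches the symptom described here.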

mansikulkarni96 · 2021-11-23T19:17:24Z

@SSvilen can you provide details about the WMCO version, cloud provider, OCP version and the Windows Server version used for the VM?
The support matrix is described in "Supported Cloud Providers based on OKD/OCP Version and WMCO version" and "Supported Windows Server versions".

SSvilen · 2021-11-24T14:06:38Z

@mansikulkarni96 ,

WMCO 3.1
OCP 4.8
Windows 20H2

It would also be beneficial to have a bit more logging, for instance here. That would make troubleshooting easier.

mansikulkarni96 · 2021-11-24T14:57:17Z

@SSvilen logging info noted.
According to your comment, the Windows worker node is present. Was it added using WMCO?
If yes, then the "no internal IP address associated" error should have resolved on its own, as the IP address is required not just for metrics but also for the SSH connection to the VM.
I would like to request some more information for further debugging:

1. Cloud provider information: is it vmware vSphere?
2. Node configuration method used here, provide info from one of the two:
- BYOH
- machinesSet
3. Full output of oc logs -f deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator
4. Windows MachineSet yaml/ configMap yaml depending on the Node configuration method used.
5. Output of oc get network.operator cluster -o yaml

SSvilen · 2021-11-25T09:09:55Z

@mansikulkarni96 ,

> 1. Cloud provider information: is it vmware vSphere?

Yes.

> 2. Node configuration method used here, provide info from one of the two:
> - BYOH
> - machinesSet

machinesSet

network.txt
operatorlogs.txt
machineSet.txt

Thanks!

mansikulkarni96 · 2021-11-30T20:22:56Z

@SSvilen Thanks for providing the logs. From the operator logs I can see that the IP address needed to configure the Windows machine into a node cannot be found. You should see the same issue if you run oc describe on the Machine object it is trying to configure. I suspect it has to do with the golden image creation for vSphere. Please make sure you have followed all the steps described in vsphere-golden-image.md

SSvilen · 2021-12-06T09:45:27Z

@mansikulkarni96,

ok thanks. We'll look at it again.

openshift-bot · 2022-03-06T11:15:11Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2022-04-05T11:40:05Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

MattPOlson · 2022-04-22T18:12:58Z

> @mansikulkarni96,
>
> ok thanks. We'll look at it again.

I'm seeing the same issue on vsphere, did you ever figure anything out?

mansikulkarni96 · 2022-05-09T16:37:37Z

@MattPOlson Can you provide the details about your setup requested in this comment so I can help you further?

SSvilen · 2022-05-10T07:37:52Z

> > @mansikulkarni96,
> > ok thanks. We'll look at it again.
>
> I'm seeing the same issue on vsphere, did you ever figure anything out?

You need a working reverse DNS - during the addition of the windows worker node, the operator creates the endpoints.

MattPOlson · 2022-05-10T12:58:16Z

> @MattPOlson Can you provide the details about your setup requested in this comment so I can help you further?

The cluster is running in vsphere and we are using machinesets to provision the servers. If I change the service to be of type 'ExternalName' and create an endpoint that includes the node, it works fine; it's just not happening automatically like it should.

MattPOlson · 2022-05-10T14:09:46Z

> > > @mansikulkarni96,
> > > ok thanks. We'll look at it again.
> >
> > I'm seeing the same issue on vsphere, did you ever figure anything out?
>
> You need a working reverse DNS - during the addition of the windows worker node, the operator creates the endpoints.

Reverse DNS lookup works fine in our network, but the internal IP still isn't being populated on the machine, so the endpoint isn't being created.

ping -a 10.33..

Pinging k8s-se-****************** [10.33..] with 32 bytes of data:
Reply from 10.33..: bytes=32 time=2ms TTL=121
Reply from 10.33..: bytes=32 time=2ms TTL=121

SSvilen · 2022-05-10T18:33:28Z

@MattPOlson ,

what do the logs from the operator say when you add a new machine?
Are they BYOH or do you provision with machine sets?

MattPOlson · 2022-05-10T18:39:45Z

> @MattPOlson ,
>
> what do the logs from the operator say when you add a new machine? Are they BYOH or do you provision with machine sets?

It's throwing this error. I'm trying to figure out where/how in the code the operator gets the external IP address. They are provisioned as machine sets.

DEBUG controller.windowsmachine invalid Machine {"name": "k8s-se-platform-01-bq57b-win-lprdv", "error": "no internal IP address associated", "errorVerbose": "no internal IP address associated\ngithub.com/openshift/windows-machine-config-operator/controllers.getInternalIPAddress\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:523\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).isValidMachine\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:203\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).SetupWithManager.func2\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:114\nsigs.k8s.io/controller-runtime/pkg/predicate.Funcs.Update\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/predicate/predicate.go:87\nsigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnUpdate\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/source/internal/eventsource.go:88\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/build/windows-machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:775\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/build/windows-machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"}

mansikulkarni96 · 2022-05-10T18:54:36Z

@MattPOlson can you add the full WMCO log snippet? Those are just the initial debug logs, which should resolve themselves once the IP for the machine is available.

MattPOlson · 2022-05-10T19:08:31Z

@mansikulkarni96 sure.
windows-machine-config-operator-8dc56cbb7-wfhdh-manager.log

MattPOlson · 2022-05-25T12:48:13Z

Any updates on this? I feel like this is either a legit issue or something isn't documented correctly as far as the setup goes. I looked through the code, but I can't figure out why the internal IP still isn't being populated on the machine, so the endpoint isn't being created.

saifshaikh48 · 2022-05-25T15:27:29Z

@MattPOlson can I ask what OCP and WMCO version you are using?
In the log you shared, I see some failures to watch/get the OperatorCondition k8s resource. The fix for this was backported to WMCO 3.1.1 and 4.0.1 for OCP 4.8 and 4.9 respectively.

MattPOlson · 2022-05-25T16:02:20Z

@saifshaikh48 sure:
operator: community-windows-machine-config-operator.v4.0.1
cluster: 4.9.0-0.okd-2022-02-12-140851

saifshaikh48 · 2022-05-25T16:14:13Z

Interesting, that version should have the proper permissions.

openshift-bot · 2022-06-24T18:52:04Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2022-06-24T18:56:02Z

@openshift-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting /reopen.
> Mark the issue as fresh by commenting /remove-lifecycle rotten.
> Exclude this issue from closing again by commenting /lifecycle frozen.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

MattPOlson · 2022-07-12T20:54:28Z

This is still an issue in version 5.1.1. I have to update the endpoint manually to get any metrics back from the Windows nodes.

/reopen

openshift-ci · 2022-07-12T20:54:53Z

@MattPOlson: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

> This is still an issue in version 5.1.1. I have to update the endpoint manually to get any metrics back from the Windows nodes.
>
> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sebsoto · 2022-07-13T13:04:55Z

I'll look into this today

/reopen

openshift-ci · 2022-07-13T13:06:48Z

@sebsoto: Reopened this issue.

In response to this:

> I'll look into this today
>
> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

MattPOlson · 2022-07-13T14:16:51Z

windows-machine-config-operator-76cd78c4f5-45kv9-manager.log

sebsoto · 2022-07-13T14:28:23Z

Seeing

1.657650479492904e+09	DEBUG	events	Warning	{"object": {"kind":"Namespace","name":"openshift-windows-machine-config-operator","uid":"6fabb20a-a268-4c58-8fc7-30e887bb7dce","apiVersion":"v1","resourceVersion":"27258196"}, "reason": "labelValidationFailed", "message": "Cluster monitoring openshift.io/cluster-monitoring=true label is not enabled in openshift-windows-machine-config-operator namespace"}

and

1.6576521713493032e+09	INFO	metrics	install the prometheus-operator to enable Prometheus configuration

in the logs, but the namespace has the correct openshift.io/cluster-monitoring=true label on it.

sebsoto · 2022-07-13T14:50:32Z

WMCO checks whether metrics are enabled on the namespace it is deployed in only at startup.
If metrics are enabled or disabled while WMCO is running, WMCO ignores the change.

Thinking about two potential options to fix this:

- WMCO watches the namespace and enables/disables its metrics functionality depending on the label
- WMCO checks the namespace label any time it needs to reconcile the Endpoints object
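
A rough Go sketch of the second option, checking the label at reconcile time instead of caching the answer at startup. This is not WMCO code: monitoringEnabled and reconcileEndpoints are hypothetical names, and getLabels stands in for a live lookup of the namespace labels. Only the label key and value come from the log message quoted above.

```go
package main

import "fmt"

// monitoringLabel is the namespace label named in the WMCO log message.
const monitoringLabel = "openshift.io/cluster-monitoring"

// monitoringEnabled reports whether the label is present and set to "true".
func monitoringEnabled(nsLabels map[string]string) bool {
	return nsLabels[monitoringLabel] == "true"
}

// reconcileEndpoints is a hypothetical reconcile step. It calls getLabels on
// every invocation, so a label added after startup is picked up next time.
func reconcileEndpoints(getLabels func() map[string]string) string {
	if !monitoringEnabled(getLabels()) {
		return "skipped"
	}
	return "endpoints updated"
}

func main() {
	labels := map[string]string{}
	get := func() map[string]string { return labels }

	// Label missing at startup: nothing is reconciled.
	fmt.Println(reconcileEndpoints(get))

	// Label added while the process is running: the next reconcile sees it.
	labels[monitoringLabel] = "true"
	fmt.Println(reconcileEndpoints(get))
}
```

Compared with the first option, this avoids an extra watch on the namespace, at the cost of one label lookup per reconcile.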

mtnbikenc · 2022-07-19T13:49:24Z

/remove-lifecycle rotten

openshift-bot · 2022-10-18T01:00:30Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sebsoto · 2022-10-18T15:36:37Z

This can be solved through https://issues.redhat.com/browse/WINC-545

sebsoto · 2022-10-18T15:36:54Z

/remove-lifecycle stale

sebsoto · 2022-10-18T15:36:59Z

/lifecycle frozen

aravindhp mentioned this issue Nov 25, 2021

Pods metrics endpoint #829

Closed

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2022

openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 5, 2022

openshift-ci bot closed this as completed Jun 24, 2022

openshift-ci bot reopened this Jul 13, 2022

openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 19, 2022

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2022

openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2022

openshift-ci bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Oct 18, 2022
