[Windows] CoreDNS pod IP is not updated on Windows node #10901

Open
ValeriiVozniuk opened this issue Sep 16, 2024 · 1 comment

@ValeriiVozniuk

Environmental Info:
K3s Version:
k3s version v1.28.13+k3s1 (47737e1)
go version go1.22.5

Node(s) CPU architecture, OS, and Version:
Windows Server 2019

Cluster Configuration:
3 Linux master servers, 1 Windows worker

Describe the bug:
Pods on the Windows worker resolve the kube-dns service IP to a non-existent pod IP, and are therefore unable to resolve cluster records.

Steps To Reproduce:
See the detailed reproduction steps in my comment below.

Expected behavior:
Workloads in pods are able to resolve both cluster and external records.

Actual behavior:
Name resolution inside a Windows pod fails, e.g.:
Ping request could not find host c.cc. Please check the name and try again.

Additional context / logs:
I set up a tcpdump capture on all Linux nodes and saw that all DNS requests were going to the IP 10.42.0.41, while the CoreDNS pod actually had a different IP, 10.42.0.6. After restarting the CoreDNS pod and the Windows pods, CoreDNS was assigned a new IP, but the Windows pod was still sending requests to 10.42.0.41. I checked the Endpoints/EndpointSlices on the Linux nodes, and all addresses there were up to date.
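For reference, this can be verified with something like the following (a sketch, assuming the default kube-dns Service name and EndpointSlice label):

# service VIP and the pod IPs currently backing it
kubectl -n kube-system get svc kube-dns -o wide
kubectl -n kube-system get endpoints kube-dns
kubectl -n kube-system get endpointslices -l kubernetes.io/service-name=kube-dns -o wide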
Next, on the Windows node I enabled debug logging in k3s, searched for 10.42.0.41, and found a line starting with this:

time="2024-09-16T10:46:15Z" level=debug msg="Network Response : [{\"ActivityId\":\"9F7F991D-780A-4F1A-8ADC-BF1158013E49\",\"AdditionalParams\":{},\"EncapOverhead\":50,\"Health\":{\"LastErrorCode\":0,\"LastUpdateTime\":133697534994858093},\"ID\":\"FA77E64E-43DB-40C7-ACC8-ED5653B16C37\",\"IPAddress\":\"10.42.1.36\",\"InterfaceConstraint\":{\"InterfaceGuid\":\"00000000-0000-0000-0000-000000000000\"},\"IsRemoteEndpoint\":true,\"MacAddress\":\"02-11-0a-2a-01-24\",\"Name\":\"Ethernet\",\"Namespace\":{\"CompartmentGuid\":\"00000000-0000-0000-0000-000000000000\",\"ID\":\"00000000-0000-0000-0000-000000000000\"},\"Policies\":[{\"PA\":\"172.20.100.142\",\"Type\":\"PA\"}],\"PrefixLength\":32,\"Resources\":{\"AdditionalParams\":{},\"AllocationOrder\":1,\"Allocators\":[{\"AdditionalParams\":{},\"AllocationOrder\":0,\"CA\":\"10.42.1.36\",\"Health\":{\"LastErrorCode\":0,\"LastUpdateTime\":133697534995285484},\"ID\":\"649D7778-55AD-4C0B-8A1F-E0F14DCA205F\",\"IsLocal\":false,\"IsPolicy\":true,\"PA\":\"172.20.100.142\",\"State\":3,\"Tag\":\"VNET Policy\"}],\"Health\":{\"LastErrorCode\":0,\"LastUpdateTime\":133697534994858093}

The relevant part (the whole line is ~120 KB) looks like this after formatting:

  {
    "ActivityId": "6632602D-4A8B-4BB0-9F87-0557A3C08E5F",
    "AdditionalParams": {},
    "EncapOverhead": 50,
    "Health": {
      "LastErrorCode": 0,
      "LastUpdateTime": 133697535028756110
    },
    "ID": "6F995C07-0B4A-4CFB-BA6E-C62DD48A554D",
    "IPAddress": "10.42.0.41",
    "InterfaceConstraint": {
      "InterfaceGuid": "00000000-0000-0000-0000-000000000000"
    },
    "IsRemoteEndpoint": true,
    "MacAddress": "02-11-0a-2a-00-29",
    "Name": "Ethernet",
    "Namespace": {
      "CompartmentGuid": "00000000-0000-0000-0000-000000000000",
      "ID": "00000000-0000-0000-0000-000000000000"
    },
    "Policies": [
      {
        "PA": "172.30.10.142",
        "Type": "PA"
      }
    ],
    "PrefixLength": 32,
    "Resources": {
      "AdditionalParams": {},
      "AllocationOrder": 1,
      "Allocators": [
        {
          "AdditionalParams": {},
          "AllocationOrder": 0,
          "CA": "10.42.0.41",
          "Health": {
            "LastErrorCode": 0,
            "LastUpdateTime": 133697535029095230
          },
          "ID": "84BF1BFC-7A97-4264-B096-08EE7FECA8C8",
          "IsLocal": false,
          "IsPolicy": true,
          "PA": "172.30.10.142",
          "State": 3,
          "Tag": "VNET Policy"
        }
      ],
      "Health": {
        "LastErrorCode": 0,
        "LastUpdateTime": 133697535028756110
      },
      "ID": "6632602D-4A8B-4BB0-9F87-0557A3C08E5F",
      "PortOperationTime": 0,
      "State": 1,
      "SwitchOperationTime": 0,
      "VfpOperationTime": 0,
      "parentId": "4A458A1E-A97F-4722-B2F7-B031D040B8B7"
    },
    "SharedContainers": [],
    "State": 1,
    "Type": "Overlay",
    "Version": 38654705669,
    "VirtualNetwork": "d036ee19-52d3-4650-89a4-622b91ac9277",
    "VirtualNetworkName": "flannel.4096"
  },

Searching for ID 6F995C07-0B4A-4CFB-BA6E-C62DD48A554D in the hnsdiag list all output showed that this endpoint is indeed attached to the kube-dns service:

PS C:\Users\Administrator> hnsdiag list all | findstr 6f995c07-0b4a-4cfb-ba6e-c62dd48a554d
Ethernet         6f995c07-0b4a-4cfb-ba6e-c62dd48a554d flannel.4096
ce87ca71-adfd-4334-a5ce-760eae297bef |  10.43.0.10      | 6f995c07-0b4a-4cfb-ba6e-c62dd48a554d
302522f6-f540-4120-881c-b7a2eeecf0f6 |  10.43.0.10      | 6f995c07-0b4a-4cfb-ba6e-c62dd48a554d
ddd666f5-c1e9-4fe1-af6b-da97be5cdb29 |  10.43.0.10      | 6f995c07-0b4a-4cfb-ba6e-c62dd48a554d

But where this "Network Response" comes from, and why it contains outdated data, is not clear to me.
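For completeness, the HNS state on the Windows node can also be inspected directly, e.g. with the hns.psm1 helper module from the microsoft/SDN repo (a sketch; I'm assuming the module is available on the node):

# PowerShell, with hns.psm1 from https://github.com/microsoft/SDN
Import-Module .\hns.psm1
Get-HnsEndpoint | Select-Object Name, ID, IPAddress, IsRemoteEndpoint   # look for stale pod IPs
Get-HnsPolicyList                                                       # load-balancer policies backing the kube-dns VIP 10.43.0.10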

@ValeriiVozniuk (Author)

I've tested this behavior on all current releases (1.28.14/1.29.9/1.30.5/1.31.1), and the issue is present in all of them.
The steps to reproduce:

  1. Create a 3-master Linux cluster.
  2. Tweak the flannel settings as follows to have connectivity with the Windows worker (per https://github.com/microsoft/SDN/tree/master/Kubernetes/flannel/overlay):
    /var/lib/rancher/k3s/agent/etc/flannel/net-fix.json
{
  "Network": "10.42.0.0/16",
  "EnableIPv6": false,
  "EnableIPv4": true,
  "IPv6Network": "::/0",
  "Backend": {
    "Type": "vxlan",
    "VNI": 4096,
    "Name": "vxlan0",
    "Port": 4789
  }
}

/var/lib/rancher/k3s/agent/etc/cni/net.d/20-flannel.conflist

{
  "name":"vxlan0",
  "cniVersion":"1.0.0",
  "plugins":[
    {
      "type":"flannel",
      "delegate":{
        "hairpinMode":true,
        "forceAddress":true,
        "isDefaultGateway":true
      }
    },
    {
      "type":"portmap",
      "capabilities":{
        "portMappings":true
      }
    },
    {
      "type":"bandwidth",
      "capabilities":{
        "bandwidth":true
      }
    }
  ]
}
  3. Add the following flags to the k3s service:
--disable-network-policy --flannel-conf=/var/lib/rancher/k3s/agent/etc/flannel/net-fix.json --flannel-cni-conf=/var/lib/rancher/k3s/agent/etc/cni/net.d/20-flannel.conflist
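One way to apply these on each server, assuming a standard k3s install, is the k3s config file (a sketch equivalent to the flags above; restart k3s afterwards):

# /etc/rancher/k3s/config.yaml
disable-network-policy: true
flannel-conf: /var/lib/rancher/k3s/agent/etc/flannel/net-fix.json
flannel-cni-conf: /var/lib/rancher/k3s/agent/etc/cni/net.d/20-flannel.conflist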
  4. Join the Windows worker and start a test workload on it (I took the sample app from https://learn.microsoft.com/en-us/azure/aks/learn/quick-windows-container-deploy-portal?tabs=azure-cli#deploy-the-application).
  5. Exec into the started pod (cmd or PowerShell) and run nslookup.
  6. If the CoreDNS pod has not been restarted, nslookup starts quickly and prompts you with:
C:\inetpub\wwwroot>nslookup
Default Server:  kube-dns.kube-system.svc.cluster.local
Address:  10.43.0.10

>
  7. Now record the CoreDNS pod IP and delete the pod. Wait for the replacement pod to start and check its IP; it should differ from the previous one (e.g. with the kubectl commands below).
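A sketch of this step with kubectl (assuming the default CoreDNS label k8s-app=kube-dns):

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide    # note the current pod IP
kubectl -n kube-system delete pod -l k8s-app=kube-dns          # the replacement pod gets a new IP
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide    # verify the IP changed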
  8. Exec into the Windows pod again and run nslookup; this time you will get:
DNS request timed out.
    timeout was 2 seconds.
Default Server:  UnKnown
Address:  10.43.0.10

>

If you try to resolve any name, it will fail:

> c.cc
Server:  UnKnown
Address:  10.43.0.10

DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
*** Request to UnKnown timed-out
  9. Type server <CoreDNS_Pod_New_IP>; once you get the prompt back, try to resolve any domain again. It will succeed.
> server 10.42.1.5
DNS request timed out.
    timeout was 2 seconds.
Default Server:  [10.42.1.5]
Address:  10.42.1.5

> c.cc
Server:  [10.42.1.5]
Address:  10.42.1.5

Non-authoritative answer:
Name:    c.cc
Address:  54.252.112.134

>
  10. Thus the network connectivity is working; the issue is that Windows does not update the destination IP for the kube-dns service. If you run tcpdump on the node where the original CoreDNS pod was running (see the sketch after this list), you will see DNS requests from the Windows node addressed to the old pod IP.
  11. With 1.30.5 I saw that resolution through CoreDNS sometimes keeps working after a pod restart until the Windows worker is rebooted; then it starts to fail as described above.
  12. The only way to restore DNS after that is to stop k3s on the Windows node, delete the node from Kubernetes, delete all the created folders on the Windows node, and re-join it. After that it works again until the next pod/node restart.
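A sketch of the tcpdump check mentioned in step 10 (run on the Linux node that hosted the old CoreDNS pod; capturing on all interfaces so the flannel interface name does not have to be assumed):

tcpdump -ni any udp port 53
# the captured queries from the Windows node are still addressed to the old CoreDNS pod IP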
