Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Suddenly unable to work #3183

Open
1 task done
wuwo1952368901 opened this issue Nov 7, 2024 · 12 comments
Open
1 task done

[Bug]: Suddenly unable to work #3183

wuwo1952368901 opened this issue Nov 7, 2024 · 12 comments
Assignees
Labels
bug Something isn't working

Comments

@wuwo1952368901
Copy link

Contact Details

No response

What happened?

Suddenly unable to ping between nodes.

Version

v0.24.2

What OS are you using?

No response

Relevant log output

No response

Contributing guidelines

  • Yes, I did.
@wuwo1952368901 wuwo1952368901 added the bug Something isn't working label Nov 7, 2024
@wuwo1952368901
Copy link
Author

After running normally for a period of time, some nodes may experience ping failure. The netclient service needs to be restarted before it can be restored, but after a period of recovery, there may be issues with the system. How can we investigate the specific cause? @afeiszli

@abhishek9686
Copy link
Member

After running normally for a period of time, some nodes may experience ping failure. The netclient service needs to be restarted before it can be restored, but after a period of recovery, there may be issues with the system. How can we investigate the specific cause? @afeiszli

can you provide more information on your environment?

  1. clients are running on which OS?
  2. Are they behind NAT?

@wuwo1952368901
Copy link
Author

wuwo1952368901 commented Nov 8, 2024

They are not behind NAT.

OS:

  Debian
  debian_version:12.7
  kernel: Linux  6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux

  Debian
  debian_version:12.4
  kernel: Linux  6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux

@yabinma
Copy link
Collaborator

yabinma commented Nov 8, 2024

They are not behind NAT.

OS:

  Debian
  debian_version:12.7
  kernel: Linux  6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux

  Debian
  debian_version:12.4
  kernel: Linux  6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux

When the issue happened, there are several places to check usually:

  1. wg command to check if the target host ip in the peer list.
  2. journalctl -u netclient > ./netclient.log import the netclient log and check if any error or what may be doing at the time when the issue occurs.
  3. Maybe it's worth of checking the system log if there is anything unusual at the time being.

@wuwo1952368901
Copy link
Author

They are not behind NAT.
OS:

  Debian
  debian_version:12.7
  kernel: Linux  6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux

  Debian
  debian_version:12.4
  kernel: Linux  6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux

When the issue happened, there are several places to check usually:

  1. wg command to check if the target host ip in the peer list.
  2. journalctl -u netclient > ./netclient.log import the netclient log and check if any error or what may be doing at the time when the issue occurs.
  3. Maybe it's worth of checking the system log if there is anything unusual at the time being.

Through the wg command, I found that the endpoint IP of the peer is incorrect. It automatically obtained the network IP of my k8s cluster.

peer: publickey
  endpoint: 10.42.6.133:51821
  allowed ips: 10.103.0.6/32
  transfer: 0 B received, 4.47 MiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: 10.42.9.197:51821
  allowed ips: 10.103.0.9/32
  transfer: 0 B received, 4.60 MiB sent
  persistent keepalive: every 20 seconds

@yabinma
Copy link
Collaborator

yabinma commented Nov 9, 2024

They are not behind NAT.
OS:

  Debian
  debian_version:12.7
  kernel: Linux  6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux

  Debian
  debian_version:12.4
  kernel: Linux  6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux

When the issue happened, there are several places to check usually:

  1. wg command to check if the target host ip in the peer list.
  2. journalctl -u netclient > ./netclient.log import the netclient log and check if any error or what may be doing at the time when the issue occurs.
  3. Maybe it's worth of checking the system log if there is anything unusual at the time being.

Through the wg command, I found that the endpoint IP of the peer is incorrect. It automatically obtained the network IP of my k8s cluster.

peer: publickey
  endpoint: 10.42.6.133:51821
  allowed ips: 10.103.0.6/32
  transfer: 0 B received, 4.47 MiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: 10.42.9.197:51821
  allowed ips: 10.103.0.9/32
  transfer: 0 B received, 4.60 MiB sent
  persistent keepalive: every 20 seconds

Auto Endpoint detection is enabled by default. So that the hosts are able to communicate each other with internal ip if they are in the same sub network.

In your setup, the host could not communicate each other with the network IP of k8s cluster.
Or you may disable the auto endpoint detection. In netmaker.env, set ENDPOINT_DETECTION=false and restart the containers with docker compose down & docker compose up -d

@wuwo1952368901
Copy link
Author

After synchronizing the configuration through "netclient pull", the node still cannot ping. Use the "wg show" command to check for the following:

interface: netmaker
  public key: publickey
  private key: (hidden)
  listening port: 51821

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.4/32
  latest handshake: 1 minute, 3 seconds ago
  transfer: 209.23 KiB received, 143.68 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.3/32
  latest handshake: 1 minute, 35 seconds ago
  transfer: 5.31 MiB received, 958.77 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.5/32
  transfer: 0 B received, 39.17 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.2/32
  transfer: 0 B received, 39.31 KiB sent
  persistent keepalive: every 20 seconds

The last two nodes cannot be pinged properly. The wg show command shows that the problematic nodes do not have a "latest handshake".

@yabinma @afeiszli

@abhishek9686
Copy link
Member

10.104.0.5
Hi,
can share the output of wg show of this peer 10.104.0.5

@wuwo1952368901
Copy link
Author

10.104.0.5
Hi,
can share the output of wg show of this peer 10.104.0.5

This is the information for the "wg show" on 10.104.0.5:

interface: netmaker
  public key: publickey
  private key: (hidden)
  listening port: 51821

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.4/32
  latest handshake: 1 minute, 11 seconds ago
  transfer: 11.06 MiB received, 56.23 MiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.3/32
  latest handshake: 1 minute, 21 seconds ago
  transfer: 368.95 MiB received, 321.13 MiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.2/32
  transfer: 0 B received, 489.67 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.1/32
  transfer: 0 B received, 465.68 KiB sent
  persistent keepalive: every 20 seconds

@wuwo1952368901
Copy link
Author

Through tcpdump packet capture, it was found that the netmaker network card has packets, but the external network card does not have packets. The commands are as follows (all of which are operated on peer 10.104.0.1):

tcpdump -i netmaker host 10.104.0.2 and icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on netmaker, link-type RAW (Raw IP), snapshot length 262144 bytes
12:18:18.400768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 616, length 64
12:18:19.424768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 617, length 64
12:18:20.448792 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 618, length 64
12:18:21.472789 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 619, length 64
12:18:22.496784 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 620, length 64
12:18:23.520791 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 621, length 64
12:18:24.544723 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 622, length 64
12:18:25.568768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 623, length 64
12:18:26.592790 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 624, length 64
12:18:27.616776 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 625, length 64
12:18:28.640789 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 626, length 64
12:18:29.664798 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 627, length 64
12:18:30.688800 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 628, length 64
12:18:31.712777 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 629, length 64
tcpdump -i eth0 host xxx.xxx.xxx.xxx
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:18:14.305050 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:19.328962 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:24.545006 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:29.665054 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148

@abhishek9686
Copy link
Member

Through tcpdump packet capture, it was found that the netmaker network card has packets, but the external network card does not have packets. The commands are as follows (all of which are operated on peer 10.104.0.1):

tcpdump -i netmaker host 10.104.0.2 and icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on netmaker, link-type RAW (Raw IP), snapshot length 262144 bytes
12:18:18.400768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 616, length 64
12:18:19.424768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 617, length 64
12:18:20.448792 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 618, length 64
12:18:21.472789 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 619, length 64
12:18:22.496784 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 620, length 64
12:18:23.520791 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 621, length 64
12:18:24.544723 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 622, length 64
12:18:25.568768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 623, length 64
12:18:26.592790 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 624, length 64
12:18:27.616776 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 625, length 64
12:18:28.640789 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 626, length 64
12:18:29.664798 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 627, length 64
12:18:30.688800 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 628, length 64
12:18:31.712777 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 629, length 64
tcpdump -i eth0 host xxx.xxx.xxx.xxx
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:18:14.305050 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:19.328962 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:24.545006 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:29.665054 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148

can you share your network diagram?

@wuwo1952368901
Copy link
Author

Through tcpdump packet capture, it was found that the netmaker network card has packets, but the external network card does not have packets. The commands are as follows (all of which are operated on peer 10.104.0.1):

tcpdump -i netmaker host 10.104.0.2 and icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on netmaker, link-type RAW (Raw IP), snapshot length 262144 bytes
12:18:18.400768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 616, length 64
12:18:19.424768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 617, length 64
12:18:20.448792 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 618, length 64
12:18:21.472789 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 619, length 64
12:18:22.496784 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 620, length 64
12:18:23.520791 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 621, length 64
12:18:24.544723 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 622, length 64
12:18:25.568768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 623, length 64
12:18:26.592790 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 624, length 64
12:18:27.616776 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 625, length 64
12:18:28.640789 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 626, length 64
12:18:29.664798 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 627, length 64
12:18:30.688800 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 628, length 64
12:18:31.712777 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 629, length 64
tcpdump -i eth0 host xxx.xxx.xxx.xxx
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:18:14.305050 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:19.328962 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:24.545006 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:29.665054 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148

can you share your network diagram?

Is this what you want?

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants