New instances in a rebooted sled are unable to reach existing instances in other sleds on their private IPs #5214

Closed
askfongjojo opened this issue Mar 7, 2024 · 10 comments · Fixed by #5568
Labels: known issue (To include in customer documentation and training)

askfongjojo commented Mar 7, 2024

I noticed this issue after running a bunch of scrimlet reboot tests on rack2. One of the instances in question happens to be on a scrimlet I rebooted at the tail end of the testing. It was, however, created at least an hour after the reboot happened, so it's unclear how it could be related.

Here are the instance details:

#  instance name      uuid                                  sled         external IP    private IP
1  prov-time-16c-32m  c856f03c-f45a-4288-94fa-c68b3a283482  BRM44220011  172.20.26.186  172.30.0.24
2  sbmysql-9          29e7866a-d504-4629-9ece-8bee96fbab73  BRM42220014  172.20.26.72   172.30.0.21

Instance 1 is able to reach all other instances in the subnet on their private IPs, except for instance 2:

ubuntu@vm-16c-32m:~$ ping 172.30.0.21
PING 172.30.0.21 (172.30.0.21) 56(84) bytes of data.
^C
--- 172.30.0.21 ping statistics ---
9 packets transmitted, 0 received, 100% packet loss, time 8179ms

ubuntu@vm-16c-32m:~$ ping 172.30.0.9
PING 172.30.0.9 (172.30.0.9) 56(84) bytes of data.
64 bytes from 172.30.0.9: icmp_seq=1 ttl=64 time=0.468 ms
64 bytes from 172.30.0.9: icmp_seq=2 ttl=64 time=0.345 ms
^C
--- 172.30.0.9 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1027ms
rtt min/avg/max/mdev = 0.345/0.406/0.468/0.061 ms

But it can reach instance 2 on its external IP:

ubuntu@vm-16c-32m:~$ ping 172.20.26.72
PING 172.20.26.72 (172.20.26.72) 56(84) bytes of data.
64 bytes from 172.20.26.72: icmp_seq=1 ttl=62 time=0.620 ms
64 bytes from 172.20.26.72: icmp_seq=2 ttl=62 time=0.405 ms
^C
--- 172.20.26.72 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1009ms
rtt min/avg/max/mdev = 0.405/0.512/0.620/0.107 ms

The same pattern holds for instance 2: it cannot reach instance 1's private IP, but it can reach other instances in the subnet as well as instance 1's external IP:

ubuntu@sbmysql9:~$ ping 172.30.0.24
PING 172.30.0.24 (172.30.0.24) 56(84) bytes of data.
^C
--- 172.30.0.24 ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9202ms

ubuntu@sbmysql9:~$ ping 172.30.0.9
PING 172.30.0.9 (172.30.0.9) 56(84) bytes of data.
64 bytes from 172.30.0.9: icmp_seq=1 ttl=64 time=0.430 ms
64 bytes from 172.30.0.9: icmp_seq=2 ttl=64 time=0.293 ms
64 bytes from 172.30.0.9: icmp_seq=3 ttl=64 time=0.376 ms
^C
--- 172.30.0.9 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2035ms
rtt min/avg/max/mdev = 0.293/0.366/0.430/0.056 ms

ubuntu@sbmysql9:~$ ping 172.20.26.186 
PING 172.20.26.186 (172.20.26.186) 56(84) bytes of data.
64 bytes from 172.20.26.186: icmp_seq=1 ttl=62 time=0.423 ms
64 bytes from 172.20.26.186: icmp_seq=2 ttl=62 time=0.385 ms
^C
--- 172.20.26.186 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1007ms
rtt min/avg/max/mdev = 0.385/0.404/0.423/0.019 ms

Instance 2 was rebooted (stopped/started) once after it was created. I didn't check the private-IP connectivity between the two events, so it's unclear whether the connectivity was there prior to the instance reboot.

Here are the OPTE firewall entries from opteadm. The list is VERY long and I haven't been able to interpret what it means, but I'm dumping it here in case it helps.

BRM42220014 # /opt/oxide/opte/bin/opteadm list-ports
LINK                             MAC ADDRESS              IPv4 ADDRESS     EPHEMERAL IPv4   FLOATING IPv4    IPv6 ADDRESS                             EXTERNAL IPv6                            FLOATING IPv6                            STATE   
opte3                            A8:40:25:F9:F8:1B        172.30.0.21      172.20.26.72                      None                                     None                                     None                                     running 

opte3-dump-layer-firewall.log
(Note: I was using the instances for some netperf and iperf3 tests, which is why there are a gazillion ports in use.)

@askfongjojo (Author)

The firewall entries of the OPTE port for instance 1 look a lot more normal.

BRM44220011 #  /opt/oxide/opte/bin/opteadm list-ports
LINK                             MAC ADDRESS              IPv4 ADDRESS     EPHEMERAL IPv4   FLOATING IPv4    IPv6 ADDRESS                             EXTERNAL IPv6                            FLOATING IPv6                            STATE   
opte0                            A8:40:25:FF:A6:83        172.30.2.5       None             172.20.26.3      None                                     None                                     None                                     running 
opte1                            A8:40:25:F3:E1:06        172.30.0.5       172.20.26.23                      None                                     None                                     None                                     running 
opte3                            A8:40:25:FA:29:E1        172.30.0.24      172.20.26.186                     None                                     None                                     None                                     running 
opte8                            A8:40:25:F7:09:D1        172.30.0.12      172.20.26.51                      None                                     None                                     None                                     running 

opte3-dump-layer-firewall-prov-time-16c-32m.log

FelixMcFelix (Contributor) commented Mar 7, 2024

So looking into the firewall stats on both sides using kstat -m xde -n opte3_firewall:

Instance 1:
module: xde                             instance: 0
name:   opte3_firewall                  class:    net
        add_rule_called                 0
        crtime                          10037.794454369
        flow_ttl                        60
        flows                           0
        in_deny                         0
        in_lft_full                     0
        in_lft_hit                      171
        in_lft_miss                     109
        in_rule_match                   109
        in_rule_nomatch                 0
        in_rules                        26
        lft_capacity                    8096
        out_deny                        0
        out_lft_full                    0
        out_lft_hit                     134
        out_lft_miss                    192
        out_rule_match                  0
        out_rule_nomatch                192
        out_rules                       0
        remove_rule_called              0
        set_rules_called                2
        snaptime                        155957.526347067

Instance 2:
module: xde                             instance: 0
name:   opte3_firewall                  class:    net
        add_rule_called                 0
        crtime                          10001.242892785
        flow_ttl                        60
        flows                           0
        in_deny                         0
        in_lft_full                     0
        in_lft_hit                      52
        in_lft_miss                     21
        in_rule_match                   21
        in_rule_nomatch                 0
        in_rules                        26
        lft_capacity                    8096
        out_deny                        0
        out_lft_full                    0
        out_lft_hit                     65
        out_lft_miss                    58
        out_rule_match                  0
        out_rule_nomatch                58
        out_rules                       0
        remove_rule_called              0
        set_rules_called                2
        snaptime                        48804.923530470

It doesn't look like a firewalling issue, which is supported by the default (DEF) deny inbound action having 0 hits. (I've opened oxidecomputer/opte#468 about compressing this output.)

Taking a look at the V2P mappings (opteadm dump-v2p) under VNI 1508093, it looks like BRM42220014 is missing some entries:

kyle@KyleOxide scraps % git diff sled8.log sled16.log
diff --git a/sled8.log b/sled16.log
index 0eddfba..e41d136 100644
--- a/sled8.log
+++ b/sled16.log
@@ -4,15 +4,9 @@ VPC 1508093
 IPv4 mappings
 ----------------------------------------------------------------------
 VPC IP                   VPC MAC ADDR      UNDERLAY IP
-172.30.0.6               A8:40:25:FA:A2:20 fd00:1122:3344:105::1
-172.30.0.8               A8:40:25:FD:E4:2F fd00:1122:3344:105::1
 172.30.0.9               A8:40:25:FB:4B:4C fd00:1122:3344:106::1
 172.30.0.10              A8:40:25:FC:D7:FF fd00:1122:3344:106::1
-172.30.0.11              A8:40:25:F0:F6:95 fd00:1122:3344:105::1
 172.30.0.12              A8:40:25:F7:09:D1 fd00:1122:3344:103::1
-172.30.0.13              A8:40:25:F2:DD:C6 fd00:1122:3344:10a::1
-172.30.0.14              A8:40:25:FA:A8:4D fd00:1122:3344:101::1
-172.30.0.15              A8:40:25:F8:1C:AC fd00:1122:3344:106::1
 172.30.0.16              A8:40:25:F3:7D:A8 fd00:1122:3344:106::1
 172.30.0.17              A8:40:25:FB:E7:50 fd00:1122:3344:10a::1
 172.30.0.18              A8:40:25:F1:A9:EA fd00:1122:3344:105::1
@@ -21,10 +15,6 @@ VPC IP                   VPC MAC ADDR      UNDERLAY IP
 172.30.0.21              A8:40:25:F9:F8:1B fd00:1122:3344:108::1
 172.30.0.22              A8:40:25:F0:1C:50 fd00:1122:3344:105::1
 172.30.0.23              A8:40:25:F4:C8:59 fd00:1122:3344:109::1
-172.30.0.24              A8:40:25:FA:29:E1 fd00:1122:3344:103::1
-172.30.0.25              A8:40:25:F9:D1:DB fd00:1122:3344:105::1
-172.30.0.26              A8:40:25:F1:A3:87 fd00:1122:3344:105::1
-172.30.0.27              A8:40:25:FB:14:E4 fd00:1122:3344:10a::1
 192.168.32.5             A8:40:25:F7:90:73 fd00:1122:3344:101::1
 192.168.32.6             A8:40:25:FD:AD:A7 fd00:1122:3344:106::1
 192.168.32.7             A8:40:25:FC:E0:AE fd00:1122:3344:106::1
@@ -39,11 +29,11 @@ VPC IP                   VPC MAC ADDR      UNDERLAY IP
 192.168.32.16            A8:40:25:F3:F3:4C fd00:1122:3344:10b::1
 192.168.32.17            A8:40:25:F9:B2:26 fd00:1122:3344:109::1
 192.168.32.18            A8:40:25:F8:46:69 fd00:1122:3344:106::1
+192.168.32.19            A8:40:25:F8:55:40 fd00:1122:3344:103::1
 192.168.32.20            A8:40:25:F0:B4:99 fd00:1122:3344:106::1
 192.168.32.21            A8:40:25:F8:3D:31 fd00:1122:3344:10a::1
 192.168.32.22            A8:40:25:F0:B0:86 fd00:1122:3344:101::1
 192.168.32.23            A8:40:25:F2:F5:1C fd00:1122:3344:109::1
-192.168.32.24            A8:40:25:FA:D3:B7 fd00:1122:3344:108::1

 IPv6 mappings
 ----------------------------------------------------------------------

Specifically, BRM42220014 (sled 16) does not have a mapping on your VPC from 172.30.0.24 to fd00:1122:3344:103::1/BRM44220011 (sled 8). But the only OPTE-recorded drops I'm seeing are on the "gateway" layer – I'd expect them to show up on the overlay layer when the VPC lookup for the destination fails. E.g.:

https://github.com/oxidecomputer/opte/blob/b85995f92ae94cdc78b97b0a610c69e103e00423/lib/oxide-vpc/src/engine/overlay.rs#L291-L318
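
For context, here is a much simplified sketch (not the actual OPTE code linked above) of why a missing V2P entry matters: the overlay layer looks up the destination's VPC-private IP to find the underlay (sled) address to tunnel to, and when no mapping exists there is nowhere to send the packet, so it is dropped. All type and function names below are illustrative only:

```rust
use std::collections::HashMap;
use std::net::{Ipv4Addr, Ipv6Addr};

/// Illustrative stand-in for OPTE's virtual-to-physical (V2P) table: maps a
/// guest's VPC-private IPv4 address to the underlay IPv6 address of the sled
/// currently hosting that guest.
struct V2pTable {
    mappings: HashMap<Ipv4Addr, Ipv6Addr>,
}

enum Disposition {
    /// Encapsulate (Geneve) and send to this underlay address.
    TunnelTo(Ipv6Addr),
    /// No mapping exists, so the packet cannot be delivered and is dropped --
    /// the behavior the two instances in this issue are seeing.
    Deny,
}

impl V2pTable {
    fn route(&self, dst: Ipv4Addr) -> Disposition {
        match self.mappings.get(&dst) {
            Some(underlay) => Disposition::TunnelTo(*underlay),
            None => Disposition::Deny,
        }
    }
}

fn main() {
    // Sled 16's table is missing 172.30.0.24, so traffic from instance 2 to
    // instance 1's private IP has no underlay destination and is denied.
    let table = V2pTable { mappings: HashMap::new() };
    match table.route("172.30.0.24".parse().unwrap()) {
        Disposition::TunnelTo(addr) => println!("tunnel to {addr}"),
        Disposition::Deny => println!("deny: no V2P mapping for destination"),
    }
}
```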

askfongjojo (Author) commented Mar 7, 2024

Here are the most recent start times of the sled-agent and dendrite services on BRM42220014 (sled 16):

BRM42220014 # svcs sled-agent
STATE          STIME    FMRI
online         21:53:17 svc:/oxide/sled-agent:default

root@oxz_switch1:~# svcs dendrite
STATE          STIME    FMRI
online         21:53:59 svc:/oxide/dendrite:default

BRM44220011 has not been rebooted and its sled-agent has been running since the last rack update:

BRM44220011 # svcs sled-agent
STATE          STIME    FMRI
online         1986     svc:/oxide/sled-agent:default

Instance 2 was created after the scrimlet/service restarts:

select id, time_created, time_state_updated, time_deleted from vmm where instance_id = '29e7866a-d504-4629-9ece-8bee96fbab73';
                   id                  |         time_created          |      time_state_updated       |         time_deleted
---------------------------------------+-------------------------------+-------------------------------+--------------------------------
  092e837e-d5af-4a44-835f-2fd56859166b | 2024-03-07 00:30:18.721837+00 | 2024-03-07 00:30:30.829001+00 | NULL
  f57b7848-ef16-4493-ad66-5435f0e74ac8 | 2024-03-07 00:00:59.858202+00 | 2024-03-07 00:27:43.759553+00 | 2024-03-07 00:27:45.674282+00

Instance 1 was created before the scrimlet reboots and remained up and running during the scrimlet/service restarts:

select id, time_created, time_state_updated, time_deleted from vmm where instance_id = 'c856f03c-f45a-4288-94fa-c68b3a283482';
                   id                  |         time_created          |      time_state_updated       |         time_deleted
---------------------------------------+-------------------------------+-------------------------------+--------------------------------
  0933dcfb-da57-4122-af70-485403a5cfbd | 2024-03-02 05:02:53.455631+00 | 2024-03-05 07:05:44.386334+00 | 2024-03-05 07:05:45.051707+00
  cd2e027c-78f8-4a3e-bcea-89a1c83cd295 | 2024-03-05 18:41:40.765275+00 | 2024-03-05 18:41:52.520954+00 | NULL
(2 rows)

@askfongjojo (Author)

I've checked that the v2p entries highlighted as missing on BRM42220014 correspond to instances created prior to the reboot. So the issue is more broadly a failure to backfill v2p entries that existed prior to the sled reboot. This seems to be an area for an RPW (reliable persistent workflow), so I'm reassigning the ticket to @internet-diglett.

askfongjojo modified the milestones: 7, 8 (Mar 8, 2024)
askfongjojo changed the title from "A certain pair of instances are unable to reach each other on private IPs" to "New instances in a rebooted sled are unable to reach existing instances in other sleds on their private IPs" (Mar 8, 2024)
askfongjojo (Author) commented Mar 8, 2024

To be clear, this issue is not a regression and has always been there because v2p mappings are created only during an instance start event. The saga/push approach is a linear way of broadcasting information and doesn't account for exceptions such as sled reboot/panic and sled outage (#4259). The issue is masked to some extent because we usually stop all running instances prior to planned sled reboots or let them fail (and eventually get destroyed) otherwise. We/customers could have run into it in the past during random sled panics but worked around it unknowingly by stopping/starting the unreachable instances.
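
As an illustration of the direction such a fix takes (all names below are hypothetical, not the actual Omicron implementation): instead of a saga pushing mappings once at instance start, a background reconciliation task repeatedly reads the full desired set of V2P mappings from the database and pushes it to every sled, so a sled that rebooted and lost its in-memory OPTE state converges back to the correct mappings on the next pass:

```rust
use std::net::{Ipv4Addr, Ipv6Addr};
use std::time::Duration;

/// One desired virtual-to-physical mapping, as recorded in the control plane.
#[derive(Clone)]
struct V2pMapping {
    vpc_ip: Ipv4Addr,
    vpc_mac: [u8; 6],
    underlay_ip: Ipv6Addr,
}

/// Hypothetical interfaces; the real system talks to CockroachDB and the
/// sled-agent HTTP API instead.
trait Datastore {
    /// The complete desired state: mappings for every running instance NIC.
    fn all_v2p_mappings(&self) -> Vec<V2pMapping>;
}

trait SledClient {
    /// Idempotently install the full mapping set on one sled.
    fn set_v2p_mappings(&self, mappings: &[V2pMapping]) -> Result<(), String>;
}

/// Unlike a saga that fires once at instance start, this loop repeatedly
/// converges every sled toward the desired state, so a sled that rebooted
/// (and lost its in-memory OPTE state) gets its mappings backfilled.
fn v2p_reconciler(db: &dyn Datastore, sleds: &[Box<dyn SledClient>]) {
    loop {
        let desired = db.all_v2p_mappings();
        for sled in sleds {
            if let Err(e) = sled.set_v2p_mappings(&desired) {
                // A failed push is simply retried on the next pass rather
                // than leaving the sled permanently out of date.
                eprintln!("v2p push to sled failed, will retry: {e}");
            }
        }
        std::thread::sleep(Duration::from_secs(30));
    }
}
```

The instance start/stop sagas can then simply trigger an activation of such a task (as the TODO list in the fix commit below notes) rather than configuring mappings directly.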

askfongjojo added the "known issue" (To include in customer documentation and training) label (Mar 9, 2024)
@davepacheco (Collaborator)

For the record: I asked whether this was a blocker for R8 (for delivery of "add sled"). We determined that it's likely not. That's because in R8 we'd be doing "add sled" during the upgrade maintenance window. Because of the way updates work today, all instances would be started after that point (even those that had been running prior to the window). So we shouldn't run into this just because of "add sled".

@internet-diglett (Contributor)

@davepacheco that seems correct. I don't see this causing any issues in that scenario.

askfongjojo modified the milestones: 8, 9 (May 7, 2024)
internet-diglett added a commit that referenced this issue May 22, 2024
TODO
---
- [x] Extend db view to include probe v2p mappings
- [x] Update sagas to trigger rpw activation instead of directly
configuring v2p mappings
- [x] Test that the `delete` functionality cleans up v2p mappings

Related
---
Resolves #5214 
Resolves #4259 
Resolves #3107

- [x] Depends on oxidecomputer/opte#494
- [x] Depends on oxidecomputer/meta#409
- [x] Depends on oxidecomputer/maghemite#244

---------

Co-authored-by: Levon Tarver <[email protected]>
@internet-diglett (Contributor)

@morlandi7 this should be resolved, but I've left it open until someone verifies that the work done in #5568 has actually resolved this issue on dogfood.

@askfongjojo (Author)

I've checked that the issue is not reproducible on rack2 (which has #5568). I'll repeat my verifications once another related fix in this area (#5845) has landed.

@askfongjojo (Author)

Confirmed that the issue can be closed.
