
Cold boot should handle scrimlet sled-agent restarts #4592

Closed
rcgoodfellow opened this issue Dec 1, 2023 · 5 comments

@rcgoodfellow (Contributor)

During an update of rack 2, we encountered the following.

As sled agents began to launch, there was a bug (introduced by yours truly) that prevented the agents from getting out of early bootstrap. A new field added to the early network config caused a deserialization error that prevented sled agents from fully starting up. To work around this error, we read the persistent early network config file kept by the bootstore in /pool/int, added the missing field, and serialized the file back to /pool/int. We then restarted sled-agent. This caused sled-agent to read the updated early network config, which it was now able to parse. We had also bumped the generation number of the config, which caused the bootstore protocol to propagate this new value to all the other sled-agents.
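For illustration, here is a minimal sketch (using a hypothetical struct and field names, not the actual sled-agent types) of how a newly added config field can break deserialization of an older persisted JSON blob, and how the manual workaround above maps onto the data: make the field tolerant of being absent, set it by hand, bump the generation, and write the config back so the bootstore propagates it.

```rust
use serde::{Deserialize, Serialize};

// Hypothetical stand-in for the persisted early network config; the real
// struct lives in sled-agent. The JSON persisted in /pool/int was written
// before `new_field` existed, so requiring the field breaks deserialization.
#[derive(Serialize, Deserialize)]
struct EarlyNetworkConfig {
    generation: u64,
    // Without a default, older on-disk JSON that lacks this key fails to parse:
    //   new_field: String,
    // Marking it optional (or defaulted) lets older configs still load:
    #[serde(default)]
    new_field: Option<String>,
}

fn main() -> Result<(), serde_json::Error> {
    // JSON persisted by an older sled-agent, missing the new field.
    let old_json = r#"{ "generation": 1 }"#;
    let mut cfg: EarlyNetworkConfig = serde_json::from_str(old_json)?;

    // The manual workaround described above: add the missing field and bump
    // the generation so the bootstore protocol propagates the updated config.
    cfg.new_field = Some("value-added-by-hand".to_string());
    cfg.generation += 1;

    println!("{}", serde_json::to_string_pretty(&cfg)?);
    Ok(())
}
```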

At this point, things started to move forward again. Sled agents were transitioning from bootstrap-agent to sled-agent. However, we then hit another roadblock: the switches were not fully initialized. The sled-agent we restarted was a scrimlet sled-agent, so restarting it took down the switch zone and everything in it. When the switch zone came back up, it came up without any configuration: the dendrite service was not listening on the underlay, links had not been configured, addresses had not been configured, and so on.

After looking through logs and various states in the system, we decided to restart the same sled-agent again. It got much further this time, with configured links and various other dpd state. However, the system was still not coming up. One node in the cluster had synchronized with an upstream NTP server and had already launched Nexus (presumably during a brief period when the network was fully set up). The other nodes in the cluster had not made any real progress because their NTP zones had not reached synchronization yet. After looking around more, we discovered this was because the switches were missing NAT entries, as well as some address entries.

It appears that there were NAT entries created before our scrimlet sled-agent restart, and restarting that sled-agent took out the switch zone, clobbering those entries. I believe these entries were created by a different sled-agent, one with a boundary NTP zone that needed NAT, so when we restarted the scrimlet sled-agent, it had no idea it had missing NAT entries to repopulate. As for the missing address entries, these were uplink addresses: they were present in the uplink SMF service properties, but they had not been added to the ASIC via Dendrite as local addresses. We're not sure how that happened.

The takeaway here is that we need to be able to handle scrimlet sled-agent restarts during cold boot and keep driving forward toward the system coming back online, rather than getting stuck in half-configured states.
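As a rough illustration of that takeaway, here is a hedged sketch of an idempotent reconciliation loop; the `SwitchClient` trait and its methods are hypothetical stand-ins, not the actual dpd/Dendrite client API. The idea is that switch state (NAT entries, uplink addresses) gets re-asserted periodically from the desired configuration, so a switch zone restart self-heals instead of leaving the system half-configured.

```rust
use std::time::Duration;

// Hypothetical interface to the switch configuration plane; not the real
// dpd-client API, just a sketch of the shape of a reconciliation loop.
trait SwitchClient {
    fn current_nat_entries(&self) -> Vec<String>;
    fn ensure_nat_entry(&self, entry: &str);
    fn ensure_uplink_address(&self, addr: &str);
}

fn reconcile_forever<C: SwitchClient>(client: &C, desired_nat: &[String], uplinks: &[String]) {
    loop {
        // Re-add any NAT entries that a switch zone restart clobbered.
        let current = client.current_nat_entries();
        for entry in desired_nat {
            if !current.contains(entry) {
                client.ensure_nat_entry(entry);
            }
        }
        // Idempotently re-assert uplink addresses on the ASIC.
        for addr in uplinks {
            client.ensure_uplink_address(addr);
        }
        std::thread::sleep(Duration::from_secs(30));
    }
}
```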

@morlandi7 morlandi7 added this to the 5 milestone Dec 1, 2023
@morlandi7 morlandi7 modified the milestones: 5, 6 Dec 5, 2023
@internet-diglett (Contributor)

@rcgoodfellow was #4857 sufficient to solve this, or do we also need the changes from #4822?

@rcgoodfellow (Contributor, Author)

We've confirmed we're in good shape here on a reboot of the scrimlet, but we should also test restarting just the sled-agent service.

@rcgoodfellow (Contributor, Author)

This can now be closed; both scrimlet reboots and sled-agent restarts have been tested.

@askfongjojo commented Mar 6, 2024

Test1: reboot without any ongoing orchestration activities

  • ignition-cycle scrimlet0 from wicket on scrimlet1 (sled comes back up with ntp, crucible and CRDB recreated)
  • perform post-reboot checks:
    • can ssh to existing VMs
    • existing VMs can send outbound internet traffic (e.g. ping 1.1.1.1, apt-get install)
    • intra-VPC VM-to-VM (load generator and database) workload runs without hitting connectivity errors
    • can provision new instances
  • svcadm restart sled-agent and repeat the above checks

Test2: reboot with ongoing orchestration activities

  • kick off a script that stops a bunch of running instances
  • while the script is running, ignition-cycle scrimlet0 from wicket on scrimlet1
  • perform post-reboot checks:
    • same as Test 1, plus checking if the stop-instance script can finish
  • kick off a script that deletes the stopped instances
  • while the script is running, do svcadm restart sled-agent and repeat the above checks
  • Observations:
    • Some instances were stuck in the "stopping" state and eventually moved to "stopped" after the crucible zones were recreated on scrimlet0. (this behavior is expected)
    • Instances/disks placed on scrimlet0 during the reboot and crucible zone recreation hit 500 errors. (also expected)
    • The script that stopped existing instances hung and had to be canceled and rerun. (perhaps not expected, but it seems like an issue not specific to the scrimlet reboot; will replicate it with a non-scrimlet)

Test3: reboot with in-progress vm-to-vm traffic and a guest OS image import

  • kick off the workload that has intra-VPC inter-VM traffic
  • kick off a create instance script that provisions new instances in a loop
  • start a guest OS image import on the console
  • while all the things above are in progress, ignition-cycle scrimlet0 from wicket on scrimlet1
  • perform post-reboot checks
    • same as Test 1, plus checking if the instance creation script and image import can finish
  • svcadm restart sled-agent and repeat the above checks

Test4: repeat test3 on scrimlet1

@askfongjojo

I saw an issue after the scrimlet reboot: #5214. I haven't lined up all the timeline events, but the affected instances were all created after the reboot testing, so it may be related to the scrimlet cold boot testing. Regardless, this ticket can stay closed while we track down the more specific issues in #5214.
