Instance going into failed state after hitting instance stop timeout #5235

askfongjojo · 2024-03-08T23:48:36Z

We got a report in the field about an instance being marked failed at the end of a stop-instance request. The instance happened to be ephemeral and the user intended to delete it anyway, so there was no data loss. It could be a bigger problem, however, if the instance were meant to be kept around and powered up again later.

From the customer ticket, the suspected sequence of events was:

1. The instance began to come to a stop
2. Propolis successfully stopped the instance and destroyed the VMM
3. The instance runner began to execute its terminate function
4. In the intervening 25 minutes, an API request came to Nexus asking to stop the instance
5. Nexus asked sled agent to stop the instance; this did nothing and timed out because the instance runner was busy doing something in InstanceRunner::terminate_inner and so was not servicing new Nexus requests
6. Nexus's request to sled agent hit its 60-second client timeout, causing the instance to go to Failed
7. After this the user deleted the instance
8. Sled agent finally decided to tear down the Propolis zone and publish a state update to Nexus, producing the 404 Not Found we see in the Nexus logs

The time turned out to be spent on two back-to-back zone bundle creations: one for the instance in question and one for another instance on the same sled (this will be tracked in a separate issue). The problem reported here is about how the interaction between sled agent and Nexus can be improved to avoid hitting the client timeout.
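
To make the failure mode concrete, here is a minimal Rust sketch of a runner that handles requests one at a time and is stuck in a long step when a stop request arrives. It is not sled agent's actual code: `StopRequest`, `spawn_runner`, and `request_stop` are hypothetical names, plain threads and channels stand in for the real async machinery, and the durations are parameters so the 25-minute step and 60-second timeout can be shrunk to milliseconds.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// A request to the (hypothetical) runner, carrying a channel for the reply.
struct StopRequest {
    reply: mpsc::Sender<&'static str>,
}

/// Spawns a runner that handles requests one at a time. Before it looks at
/// its queue it spends `busy_for` on uninterruptible work, standing in for
/// the long zone-bundle step inside terminate.
fn spawn_runner(busy_for: Duration) -> mpsc::Sender<StopRequest> {
    let (tx, rx) = mpsc::channel::<StopRequest>();
    thread::spawn(move || {
        // The runner is busy; queued requests are not serviced during this time.
        thread::sleep(busy_for);
        for req in rx {
            // The caller may have given up already, so ignore send errors.
            let _ = req.reply.send("stopped");
        }
    });
    tx
}

/// Sends a stop request and waits up to `client_timeout` for the answer.
fn request_stop(
    runner: &mpsc::Sender<StopRequest>,
    client_timeout: Duration,
) -> Result<&'static str, RecvTimeoutError> {
    let (reply_tx, reply_rx) = mpsc::channel();
    runner
        .send(StopRequest { reply: reply_tx })
        .expect("runner thread should still be alive");
    reply_rx.recv_timeout(client_timeout)
}
```

With a runner that is busy for 300 ms and a client timeout of 50 ms, `request_stop` returns `Err(RecvTimeoutError::Timeout)` even though the runner is healthy and would eventually answer. The timeout therefore says nothing definite about the state of the instance, which is why treating it as a failure is questionable.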

gjcolombo · 2024-03-09T00:34:30Z

See #5237 (I cross-posted with this issue; mea culpa) for more discussion of how this could be improved in sled agent.

gjcolombo · 2024-09-26T21:38:02Z

A timeout while stopping an instance will no longer send that instance to Failed (see SledAgentInstanceError::vmm_gone), so I think this can be resolved.
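
As a rough illustration of that behavior, the sketch below separates an inconclusive error (a timeout) from one that proves the VMM no longer exists. It is a sketch under stated assumptions, not the Omicron implementation: `StopError`, `Action`, and `action_for` are hypothetical, the `NoSuchVmm` variant is an assumed example of a conclusive error, and the `vmm_gone` method here only borrows its name from SledAgentInstanceError::vmm_gone.

```rust
/// Hypothetical outcomes of a failed stop request to a sled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StopError {
    /// The client gave up waiting; the VMM's real state is unknown.
    Timeout,
    /// Assumed example of a conclusive error: the sled says it has no such VMM.
    NoSuchVmm,
}

/// What the caller does with the instance record after a failed stop request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    MarkFailed,
    LeaveStateUnchanged,
}

impl StopError {
    /// True only when the error proves the VMM no longer exists.
    fn vmm_gone(self) -> bool {
        matches!(self, StopError::NoSuchVmm)
    }
}

/// A timeout is inconclusive, so only a conclusive error marks the instance failed.
fn action_for(err: StopError) -> Action {
    if err.vmm_gone() {
        Action::MarkFailed
    } else {
        Action::LeaveStateUnchanged
    }
}
```

Under these assumptions, `action_for(StopError::Timeout)` yields `Action::LeaveStateUnchanged`, so a slow sled agent alone cannot push an instance to Failed.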

askfongjojo mentioned this issue Mar 9, 2024

Improve zone-bundle creation performance to reduce stop-instance response times and avoid timeout situations #5236

Open

askfongjojo added the "known issue" label (To include in customer documentation and training) Mar 9, 2024

askfongjojo added this to the 8 milestone Mar 9, 2024

askfongjojo modified the milestones: 8, 9 Apr 24, 2024

morlandi7 modified the milestones: 9, 10 Jul 1, 2024

morlandi7 modified the milestones: 10, 11 Aug 14, 2024

gjcolombo closed this as completed Sep 26, 2024
