Instance going into failed state after hitting instance stop timeout #5235

Closed
askfongjojo opened this issue Mar 8, 2024 · 2 comments
Labels
known issue To include in customer documentation and training

@askfongjojo

We got a report in the field about an instance being marked failed at the end of a stop-instance request. The instance happened to be ephemeral in nature and the user's intent was to delete it anyway so there was no data loss. But it could be a bigger problem if the instance was meant to be kept around and powered up again for later use.

From the customer ticket, the suspected sequence of events was:

  1. The instance began to come to a stop
  2. Propolis successfully stopped the instance and destroyed the VMM
  3. The instance runner began to execute its terminate function
  4. In the intervening 25 minutes, an API request came to Nexus asking to stop the instance
  5. Nexus asked sled agent to stop the instance; this did nothing and timed out because the instance runner was busy doing something in InstanceRunner::terminate_inner and so was not servicing new Nexus requests
  6. Nexus's request to sled agent hit its 60-second client timeout, causing the instance to go to Failed
  7. After this the user deleted the instance
  8. Sled agent finally decided to tear down the Propolis zone and publish a state update to Nexus, producing the 404 Not Found we see in the Nexus logs

The time turned out to be spent on two back-to-back zone bundle creations, one for the instance in question and one for another instance on the same sled (which will be tracked in a separate issue). The problem reported here is about how the interaction between sled-agent and Nexus can be improved to avoid hitting the client timeout.
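The suspected sequence above can be reproduced in miniature. The following is only a sketch, not sled-agent code: `StopRequest`, `spawn_runner`, and `request_stop` are hypothetical names, a thread with a channel stands in for the instance runner, and a sleep stands in for the time spent in termination work. It shows how a runner that is busy before it reads its queue leaves a queued request unanswered until the caller's timeout fires.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical request type: a stop request carrying its own reply channel.
struct StopRequest {
    reply: mpsc::Sender<&'static str>,
}

/// Spawn a toy runner that blocks for `busy_for` (standing in for long-running
/// termination work) before it starts servicing its request queue.
fn spawn_runner(busy_for: Duration) -> mpsc::Sender<StopRequest> {
    let (tx, rx) = mpsc::channel::<StopRequest>();
    thread::spawn(move || {
        // While this sleep runs, requests pile up in `rx` unanswered.
        thread::sleep(busy_for);
        for req in rx {
            // The caller may have given up already; ignore a closed reply channel.
            let _ = req.reply.send("stopped");
        }
    });
    tx
}

/// Toy client: enqueue a stop request and wait at most `timeout` for the reply.
fn request_stop(
    runner: &mpsc::Sender<StopRequest>,
    timeout: Duration,
) -> Result<&'static str, RecvTimeoutError> {
    let (reply_tx, reply_rx) = mpsc::channel();
    runner
        .send(StopRequest { reply: reply_tx })
        .expect("runner thread should still be alive");
    reply_rx.recv_timeout(timeout)
}
```

Calling `request_stop` with a timeout shorter than the runner's busy period returns `Err(RecvTimeoutError::Timeout)` even though the runner is healthy and will eventually drain its queue. The timeout alone therefore says nothing about the state of the instance.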

@gjcolombo
Contributor

See #5237 (I cross-posted with this issue; mea culpa) for more discussion of how this could be improved in sled agent.

@askfongjojo askfongjojo added the known issue To include in customer documentation and training label Mar 9, 2024
@askfongjojo askfongjojo added this to the 8 milestone Mar 9, 2024
@askfongjojo askfongjojo modified the milestones: 8, 9 Apr 24, 2024
@morlandi7 morlandi7 modified the milestones: 9, 10 Jul 1, 2024
@morlandi7 morlandi7 modified the milestones: 10, 11 Aug 14, 2024
@gjcolombo
Contributor

A timeout while stopping an instance will no longer send that instance to Failed (see SledAgentInstanceError::vmm_gone), so I think this can be resolved.
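As a rough illustration of the behavior described here, the sketch below separates "the request timed out" from "the sled agent positively reported that the instance is gone". It is not the Omicron implementation and does not reproduce `SledAgentInstanceError::vmm_gone`. `StopOutcome`, `NextAction`, and `next_action` are hypothetical names, and the rule that only a positive "unknown instance" report leads to Failed is an assumption made for the example.

```rust
/// Hypothetical outcomes of asking a sled agent to stop an instance.
#[derive(Debug, PartialEq)]
enum StopOutcome {
    /// The sled agent accepted the request.
    Acknowledged,
    /// The client gave up waiting; the sled agent may simply be busy.
    TimedOut,
    /// The sled agent answered and said it has no such instance (assumed case).
    InstanceUnknown,
}

/// Hypothetical follow-up actions for the caller.
#[derive(Debug, PartialEq)]
enum NextAction {
    RecordStopping,
    LeaveStateUnchanged,
    MarkFailed,
}

/// Decide what to do with the instance record after a stop attempt.
fn next_action(outcome: &StopOutcome) -> NextAction {
    match outcome {
        StopOutcome::Acknowledged => NextAction::RecordStopping,
        // A timeout carries no information about the instance itself, so the
        // recorded state is left alone and a later state update can resolve it.
        StopOutcome::TimedOut => NextAction::LeaveStateUnchanged,
        // Assumption for this sketch: only a definite negative answer is fatal.
        StopOutcome::InstanceUnknown => NextAction::MarkFailed,
    }
}
```

The design point is that a communication failure and a definite answer are different kinds of evidence, so they map to different actions.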
