
Timing Issue at Poseidon Restart leads to ignored Runners #598

Closed
mpass99 opened this issue May 22, 2024 · 0 comments · Fixed by #614
Labels
bug Something isn't working

Comments

Contributor

mpass99 commented May 22, 2024

Today we became aware of another event of idle runner count not matching the prewarming pool size.
Via our Poseidon Dashboard, we can trace this deviation back to the deployment on the 19th (#465).

Evaluation
In Poseidon's logs we can follow the events:

  • Ansible connects
  • Poseidon gets notified about all allocations being stopped
  • Poseidon starts multiple Prewarming Pool Alert checks
  • Before the timeout of the checks runs out, Poseidon gets restarted
  • Poseidon recovers all environments and runners
  • It notices that 7 runners were lost (for whatever reason) and starts Creating new runners at 11:22:04.40
  • It starts Watching Event Stream at 11:22:06.43
  • Only at 11:22:11.33 do we see the first Runner started acknowledgment, with a startup duration of 647 ms

Discussion
There are about 2 seconds between the runners being requested and Poseidon being able to acknowledge new runners via the Event Stream. Since runners usually start in less than one second, we assume the 7 runners started before Poseidon was ready to notice them.
Validation: In the Nomad UI, we can see that 7 runners were created on the 19th; all others were created today.
Preliminary Fix Suggestion: Recover the runners after starting to listen to the event stream.

Extra Question: Why did the Prewarming Pool Alert not catch this issue?
The Alert Threshold is configured at 50%, and the idle count stayed at or above 50% of the prewarming pool (8/15) most of the time.

@mpass99 mpass99 added the bug Something isn't working label May 22, 2024