
Timing Issue at Poseidon Restart leads to ignored Runners #598

Closed
mpass99 opened this issue May 22, 2024 · 0 comments · Fixed by #614
Labels
bug Something isn't working

Comments

Contributor

mpass99 commented May 22, 2024

Today we became aware of another event of idle runner count not matching the prewarming pool size.
Via our Poseidon Dashboard, we can trace this deviation back to the deployment on the 19th (#465).

Evaluation
In Poseidon's logs we can follow the events:

  • Ansible connects
  • Poseidon gets notified about all allocations being stopped
  • Poseidon starts multiple Prewarming Pool Alert checks
  • Before the timeout of the checks runs out, Poseidon gets restarted
  • Poseidon recovers all environments and runners
  • It notices that 7 runners were lost (for whatever reason) and starts Creating new runners at 11:22:04.40
  • It starts Watching Event Stream at 11:22:06.43
  • Only at 11:22:11.33 do we see the first Runner started acknowledgment, with a startup duration of 647 ms

Discussion
There are about 2 seconds between the runners being requested and Poseidon being able to acknowledge new runners via the Event Stream. Since runners usually start in less than one second, we assume the 7 runners started before Poseidon was ready to notice them.
Validation: In the Nomad UI, we can see that 7 runners were created on the 19th; all others were created today.
Preliminary Fix Suggestion: Recover the runners after starting to listen to the event stream.

Extra Question: Why did the Prewarming Pool Alert not catch this issue?
The Alert Threshold is configured at 50%, and the idle count stayed at or above 50% of the prewarming pool (8/15) most of the time.

@mpass99 mpass99 added the bug Something isn't working label May 22, 2024