Today we became aware of another incident in which the idle runner count did not match the prewarming pool size.
Via our Poseidon Dashboard, we can trace this deviation back to the deployment of the 19th (#465).
Evaluation
In Poseidon's logs we can follow the events:
Ansible connects
Poseidon gets notified about all allocations being stopped
Poseidon starts multiple Prewarming Pool Alert checks
Before the checks time out, Poseidon is restarted
Poseidon recovers all environments and runners
It notices that 7 runners were lost (for an unknown reason) and logs Creating new runners at 11:22:04.40
It starts Watching Event Stream at 11:22:06.43
Only at 11:22:11.33 do we see the first Runner started acknowledgment, with a startup duration of 647 ms
Discussion
There is a gap of about 2 seconds between the runners being requested and Poseidon being able to acknowledge new runners via the Event Stream. Runners usually start in less than one second. Therefore, we assume that the 7 runners had already started before Poseidon was ready to notice them.
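The timing argument can be checked with a few lines of Go. This is only a sketch: the helper `gapBetween` is a hypothetical name, and the timestamps are copied from the log excerpts above.

```go
package main

import (
	"fmt"
	"time"
)

// gapBetween (hypothetical helper) parses two wall-clock timestamps as they
// appear in the log excerpts and returns how much later the second one is.
func gapBetween(first, second string) (time.Duration, error) {
	// When parsing, Go accepts a fractional-second field after the seconds
	// even if the layout does not contain one.
	const layout = "15:04:05"
	a, err := time.Parse(layout, first)
	if err != nil {
		return 0, err
	}
	b, err := time.Parse(layout, second)
	if err != nil {
		return 0, err
	}
	return b.Sub(a), nil
}

func main() {
	gap, err := gapBetween("11:22:04.40", "11:22:06.43")
	if err != nil {
		panic(err)
	}
	startup := 647 * time.Millisecond // startup duration of the first acknowledged runner
	fmt.Println("gap between request and watching:", gap)
	fmt.Println("runner can start inside the gap:", startup < gap)
}
```

With a startup time well below the gap, a runner requested at 11:22:04.40 can be fully started before the event stream is being watched.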
Validation: In the Nomad UI, we can see 7 runners created on the 19th. All others were created today.
Preliminary Fix Suggestion: Start listening to the event stream first, and only then recover the runners.
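To illustrate why the order matters, here is a minimal, deterministic sketch. It is not Poseidon's actual code: `eventSource`, `pool`, `watch`, `recoverRunners`, and `startup` are hypothetical names, and the event stream is modeled as a source that silently drops events while nobody is listening.

```go
package main

import "fmt"

// eventSource is a hypothetical stand-in for the event stream:
// events emitted while no listener is registered are lost.
type eventSource struct {
	listener func(runnerID string)
}

func (s *eventSource) emit(runnerID string) {
	if s.listener != nil {
		s.listener(runnerID)
	}
}

// pool tracks the idle runners that were acknowledged via the event source.
type pool struct {
	src  *eventSource
	idle []string
}

// watch registers the pool as listener for "runner started" events.
func (p *pool) watch() {
	p.src.listener = func(runnerID string) {
		p.idle = append(p.idle, runnerID)
	}
}

// recoverRunners requests the missing runners. Runners start quickly, which is
// modeled here (as an assumption) by emitting the start event immediately.
func (p *pool) recoverRunners(missing int) {
	for i := 0; i < missing; i++ {
		p.src.emit(fmt.Sprintf("runner-%d", i))
	}
}

// startup runs both steps in the given order and returns the number of
// runners the pool has acknowledged afterwards.
func startup(watchFirst bool, missing int) int {
	p := &pool{src: &eventSource{}}
	if watchFirst {
		p.watch()
		p.recoverRunners(missing)
	} else {
		p.recoverRunners(missing)
		p.watch()
	}
	return len(p.idle)
}

func main() {
	fmt.Println("recover, then watch:", startup(false, 7)) // prints 0
	fmt.Println("watch, then recover:", startup(true, 7))  // prints 7
}
```

In this model, recovering first loses all 7 start events, while watching first acknowledges all of them.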
Extra Question: Why did the Prewarming Pool Alert not catch this issue?
We have configured the Alert Threshold to 50%, and for most of the time at least 50% of the Prewarming Pool was available (8/15).
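The arithmetic behind this can be sketched as follows. The function `alertFires` is a hypothetical name, and the comparison is an assumption: the alert is taken to fire only when the idle share drops below the threshold.

```go
package main

import "fmt"

// alertFires (hypothetical) reports whether the idle runner count falls
// below the given share of the prewarming pool size.
// Assumption: the alert triggers strictly below the threshold.
func alertFires(idle, poolSize int, threshold float64) bool {
	return float64(idle) < threshold*float64(poolSize)
}

func main() {
	// 8 of 15 idle runners is about 53%, which is not below the 50% threshold.
	fmt.Println("8/15 idle, alert:", alertFires(8, 15, 0.5)) // prints false
	// One runner fewer (7 of 15, about 47%) would cross the threshold.
	fmt.Println("7/15 idle, alert:", alertFires(7, 15, 0.5)) // prints true
}
```

Under this assumption, losing 7 of 15 runners leaves the pool just above the threshold, so the alert stays silent.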