ps cmd can show inaccurate state #1046

matt2e · 2024-03-08T03:39:03Z

Sometimes old runners are shown for deployments.

Working on getting steps to repro.

alecthomas · 2024-03-08T03:43:40Z

There's a window between when a runner is terminated and when it is cleaned up. Could that be what you're seeing?

matt2e · 2024-03-08T05:19:33Z

I think this reproduces when a runner loses its connection or doesn't end cleanly.

I'm imitating that by:

- Commenting out the code in controller.go:900 that updates postgres when terminating a runner.
- Commenting out the code in runner.g:267,268 that updates the known state of the runner and deployment when terminating, so the soft kill happens but the internal state doesn't know about it.

I think this would imitate a runner dying and the controller then having to react to it.

Then bring up replicas and scale them back down, so that we trigger the pseudo runner crash described above:

ftl serve --recreate --idle-runners 0 --log-level=DEBUG <-- TBD: is idle-runners 0 a requirement to repro because we want the killed runner to have an active module?
ftl deploy examples/go/time -n 5
ftl ps -v
ftl update <name of time deployment> -n 1

After this:

ftl status: Shows 1 deployment, 1 runner
postgres: 5 rows, 4 dead, 1 assigned. All 5 have module_name = time. Normally module_name = null for dead runners
ftl ps -v shows five "live" runners:

DEPLOYMENT                               REPLICAS   STATE      RUNNER                      ENDPOINT                                          
time-c95eb3fe67-k9nbg                    1/1        live       R01HRE7P98ARASRQE0X48PK9NBG http://localhost:8897                             
time-c95eb3fe67-yt87e                    2/1        live       R01HRE7P96HQ793FYJBHAFYT87E http://localhost:8893                             
time-c95eb3fe67-d82gq                    3/1        live       R01HRE7P97EGH76H3RWARKD82GQ http://localhost:8895                             
time-c95eb3fe67-10qdb                    4/1        live       R01HRE7P97WCVYMMHP0P0210QDB http://localhost:8896                             
time-c95eb3fe67-18psv                    5/1        live       R01HRE7P970GCN1NS7YAR618PSV http://localhost:8894
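
The inconsistency above can be stated as an invariant: a dead runner should not still carry a module name. Below is a minimal Go sketch of that check. The `Runner` struct, the `staleAssignments` function, and the `runner-N` keys are hypothetical names for illustration only, not FTL's actual types or schema; the sample rows just mirror the postgres state reported above.

```go
package main

import "fmt"

// Runner is a hypothetical, simplified view of one row in the runners table.
// ModuleName is a pointer so that nil can stand in for SQL NULL.
type Runner struct {
	Key        string
	State      string // "dead" or "assigned" in this sketch
	ModuleName *string
}

// staleAssignments returns the keys of dead runners that still have a
// module name set, which is the inconsistent state described in this issue.
func staleAssignments(runners []Runner) []string {
	var keys []string
	for _, r := range runners {
		if r.State == "dead" && r.ModuleName != nil {
			keys = append(keys, r.Key)
		}
	}
	return keys
}

func main() {
	module := "time"
	// Mirrors the reported state: 5 rows, 4 dead, 1 assigned,
	// all with module_name = time. Keys are placeholders.
	runners := []Runner{{Key: "runner-1", State: "assigned", ModuleName: &module}}
	for i := 2; i <= 5; i++ {
		runners = append(runners, Runner{
			Key:        fmt.Sprintf("runner-%d", i),
			State:      "dead",
			ModuleName: &module,
		})
	}
	fmt.Println("dead runners still assigned to a module:", staleAssignments(runners))
}
```

A check like this could back a test that fails whenever a dead row keeps its module assignment, assuming the real rows can be mapped onto a similar shape.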

fixes #1046

Runners that were found to have died (rather than cleanly being killed) end up with deployment_id & module_name not set to null.

Changes:
- removes all dead runners from showing in the `ps` command
- correctly updates `deployment_id` and `module_name` to null in the following cases:
  - Runner terminates unexpectedly
  - Runner is stale
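
As a rough illustration of the two changes described in this fix, here is a minimal Go sketch. It is not the code from the linked PR: `Runner`, `RunnerState`, `markDead`, and `visibleInPS` are hypothetical names, and the in-memory slice stands in for the postgres table.

```go
package main

import "fmt"

// RunnerState is a hypothetical enum covering only the states named in this issue.
type RunnerState string

const (
	StateAssigned RunnerState = "assigned"
	StateDead     RunnerState = "dead"
)

// Runner is a hypothetical in-memory stand-in for a runners table row.
// Pointer fields use nil to represent SQL NULL.
type Runner struct {
	Key          string
	State        RunnerState
	DeploymentID *int64
	ModuleName   *string
}

// markDead records that a runner has died (unexpected termination or stale)
// and clears its deployment and module assignment at the same time.
func markDead(r *Runner) {
	r.State = StateDead
	r.DeploymentID = nil
	r.ModuleName = nil
}

// visibleInPS returns only the runners that a ps-style listing should show,
// skipping every dead runner.
func visibleInPS(runners []Runner) []Runner {
	visible := make([]Runner, 0, len(runners))
	for _, r := range runners {
		if r.State != StateDead {
			visible = append(visible, r)
		}
	}
	return visible
}

func main() {
	module := "time"
	deployment := int64(1) // placeholder deployment id
	runners := make([]Runner, 5)
	for i := range runners {
		runners[i] = Runner{
			Key:          fmt.Sprintf("runner-%d", i+1),
			State:        StateAssigned,
			DeploymentID: &deployment,
			ModuleName:   &module,
		}
	}
	// Simulate scaling from 5 replicas to 1: four runners die.
	for i := 1; i < len(runners); i++ {
		markDead(&runners[i])
	}
	fmt.Println("runners shown by ps:", len(visibleInPS(runners)))
}
```

Clearing the assignment in the same step that marks the runner dead keeps the two fields from drifting apart, while the filter guards the listing even if an older row was left in the inconsistent state.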

matt2e self-assigned this Mar 8, 2024

alecthomas mentioned this issue Mar 8, 2024

Dashboard #728

Open

matt2e mentioned this issue Mar 8, 2024

fix: ps cmd should ignore dead runners #1050

Merged

matt2e closed this as completed in #1050 Mar 12, 2024
