ps cmd can show inaccurate state #1046

Closed
matt2e opened this issue Mar 8, 2024 · 2 comments · Fixed by #1050

@matt2e
Collaborator

matt2e commented Mar 8, 2024

Sometimes old runners are shown for deployments.

Working on getting steps to repro

@matt2e matt2e self-assigned this Mar 8, 2024
@alecthomas alecthomas mentioned this issue Mar 8, 2024
@alecthomas
Collaborator

There's a window between when a runner is terminated and when it is cleaned up. Could that be what you're seeing?
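To make that window concrete, here is a minimal sketch in Go. Everything in it is hypothetical: the `runner` type, the `reap` and `inCleanupWindow` helpers, and the grace period are illustrations, not FTL's actual controller code. Timestamps are plain Unix seconds to keep the sketch dependency-free.

```go
package main

// runner is a hypothetical stand-in for a runner record.
type runner struct {
	key          string
	terminatedAt *int64 // Unix seconds; nil while the runner is alive
}

// inCleanupWindow reports whether r has been terminated but is not yet
// old enough to be cleaned up. A listing taken in this window can still
// include the runner.
func inCleanupWindow(r runner, now, grace int64) bool {
	return r.terminatedAt != nil && now-*r.terminatedAt <= grace
}

// reap returns the records that survive a cleanup pass at time now:
// live runners, plus terminated runners still inside the grace period.
func reap(runners []runner, now, grace int64) []runner {
	kept := make([]runner, 0, len(runners))
	for _, r := range runners {
		if r.terminatedAt != nil && now-*r.terminatedAt > grace {
			continue // past the grace period, so the record is dropped
		}
		kept = append(kept, r)
	}
	return kept
}
```

With a 10-second grace period, a runner terminated at t=100 is still returned by `reap` at t=105 and is gone at t=111. If the stale rows in `ps` disappeared after a short wait, this window would explain them.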

@matt2e
Collaborator Author

matt2e commented Mar 8, 2024

I think this reproduces when a runner loses its connection or doesn't terminate cleanly.

I'm imitating that by:

  • Commenting out the code in controller.go:900 that updates Postgres when terminating a runner
  • Commenting out the code in runner.g:267,268 that updates the known state of the runner and deployment when terminating, so the soft kill happens but the internal state doesn't know about it
  • I think this imitates a runner dying and the controller then having to react to it

Then bring up replicas and scale them back down (which triggers the pseudo runner crash above):

  • `ftl serve --recreate --idle-runners 0 --log-level=DEBUG` (TBD: is `--idle-runners 0` required to repro, because we want the killed runner to have an active module?)
  • `ftl deploy examples/go/time -n 5`
  • `ftl ps -v`
  • `ftl update <name of time deployment> -n 1`

After this:

  • `ftl status`: shows 1 deployment, 1 runner
  • Postgres: 5 rows, 4 dead and 1 assigned. All 5 have `module_name = time`; normally `module_name` is null for dead runners
  • `ftl ps -v`: shows five "live" runners:
DEPLOYMENT                               REPLICAS   STATE      RUNNER                      ENDPOINT                                          
time-c95eb3fe67-k9nbg                    1/1        live       R01HRE7P98ARASRQE0X48PK9NBG http://localhost:8897                             
time-c95eb3fe67-yt87e                    2/1        live       R01HRE7P96HQ793FYJBHAFYT87E http://localhost:8893                             
time-c95eb3fe67-d82gq                    3/1        live       R01HRE7P97EGH76H3RWARKD82GQ http://localhost:8895                             
time-c95eb3fe67-10qdb                    4/1        live       R01HRE7P97WCVYMMHP0P0210QDB http://localhost:8896                             
time-c95eb3fe67-18psv                    5/1        live       R01HRE7P970GCN1NS7YAR618PSV http://localhost:8894       
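One possible reading of these numbers is that the two commands filter the same rows differently. The Go sketch below models that with entirely hypothetical names (`runnerRow`, `reproRows`, `countAssigned`, `listByModule`); it does not reflect the real schema or queries, and only reproduces the row counts reported above.

```go
package main

// runnerRow mirrors only the columns mentioned in this issue. The struct
// and its field names are assumptions, not the real schema.
type runnerRow struct {
	key        string
	state      string  // "assigned" or "dead"
	moduleName *string // nil models SQL NULL
}

// reproRows builds the reported state: 5 rows, 1 assigned and 4 dead,
// all still carrying module_name = "time".
func reproRows() []runnerRow {
	module := "time"
	rows := make([]runnerRow, 0, 5)
	for i, key := range []string{"r1", "r2", "r3", "r4", "r5"} {
		state := "dead"
		if i == 0 {
			state = "assigned"
		}
		rows = append(rows, runnerRow{key: key, state: state, moduleName: &module})
	}
	return rows
}

// countAssigned models a status-style view that filters on state.
func countAssigned(rows []runnerRow) int {
	n := 0
	for _, r := range rows {
		if r.state == "assigned" {
			n++
		}
	}
	return n
}

// listByModule models a ps-style view that matches on module_name only
// and never looks at state, so dead rows with a stale module_name leak in.
func listByModule(rows []runnerRow, module string) []runnerRow {
	var out []runnerRow
	for _, r := range rows {
		if r.moduleName != nil && *r.moduleName == module {
			out = append(out, r)
		}
	}
	return out
}
```

On the rows from `reproRows()`, `countAssigned` returns 1 while `listByModule(rows, "time")` returns all 5, matching the mismatch between `ftl status` and `ftl ps -v`.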

matt2e added a commit that referenced this issue Mar 12, 2024
fixes #1046

Runners that were found to have died (rather than cleanly being killed)
end up with deployment_id & module_name not set to null.
Changes:
- removes all dead runners from showing in the `ps` command
- correctly updates `deployment_id` and `module_name` to null in the
following cases:
    -  Runner terminates unexpectedly
    -  Runner is stale
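The two changes in this commit message can be sketched in Go as follows. This is an illustration under assumptions, not the code from the fix: `runnerRecord`, `markDead`, and `psRunners` are hypothetical names, and the real change operates on Postgres rows rather than an in-memory slice.

```go
package main

type runnerState string

const (
	stateAssigned runnerState = "assigned"
	stateDead     runnerState = "dead"
)

// runnerRecord is a hypothetical in-memory view of a runner row.
// Pointer fields model nullable columns.
type runnerRecord struct {
	key          string
	state        runnerState
	deploymentID *string
	moduleName   *string
}

// markDead applies the same transition whether the runner was killed
// cleanly, terminated unexpectedly, or went stale: the state becomes
// dead and the deployment association is cleared.
func markDead(r *runnerRecord) {
	r.state = stateDead
	r.deploymentID = nil
	r.moduleName = nil
}

// psRunners returns the runners a ps-style listing would show.
// Dead runners are always excluded, even if stale fields remain.
func psRunners(all []runnerRecord) []runnerRecord {
	shown := make([]runnerRecord, 0, len(all))
	for _, r := range all {
		if r.state == stateDead {
			continue
		}
		shown = append(shown, r)
	}
	return shown
}
```

Filtering on state in the listing is a second line of defence: even if some code path forgets to clear `deployment_id` and `module_name`, a dead runner no longer appears in the output.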