Lagoon cronjobs can hang forever #327

Open
smlx opened this issue Jun 20, 2024 · 2 comments · May be fixed by #405
Labels
bug Something isn't working

Comments

@smlx
Member

smlx commented Jun 20, 2024

Kubernetes doesn't set a runtime limit on cronjobs by default, which means that "stuck" cronjobs can run forever.

  • From an administrator's perspective, there is no point in pods sitting around doing nothing.
  • From a user's perspective, the stuck pods block any further executions of the cronjob. So it will appear that the cronjob has just stopped running:
Events:
  Type    Reason            Age                 From                Message
  ----    ------            ----                ----                -------
  Normal  JobAlreadyActive  45m (x28 over 28h)  cronjob-controller  Not starting job because prior execution is running and concurrency policy is Forbid

IMO Lagoon should set some reasonable time limit on cronjobs so that stuck pods don't sit around forever. This can be done by adding activeDeadlineSeconds to the Job template (docs).
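As a rough illustration of where that field sits, here is a minimal sketch that builds a CronJob manifest as a plain Python dict. The helper name `cronjob_manifest`, the example cronjob name, image, and command, and the 4h default are all assumptions made for illustration; this is not how Lagoon generates its cronjobs. The one Kubernetes-specific point is that `activeDeadlineSeconds` goes in the Job spec, i.e. `spec.jobTemplate.spec` of the CronJob.

```python
import json


def cronjob_manifest(name, schedule, image, command,
                     active_deadline_seconds=4 * 60 * 60):
    """Build a batch/v1 CronJob manifest as a plain dict.

    Hypothetical helper for illustration only. The 4h default deadline
    is an assumption taken from the range suggested in this issue.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Forbid is the policy shown in the JobAlreadyActive event above.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    # Job-level deadline: once a Job has been active this
                    # long, Kubernetes terminates its pods and marks the
                    # Job as failed.
                    "activeDeadlineSeconds": active_deadline_seconds,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {"name": name, "image": image,
                                 "command": command},
                            ],
                        }
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    # Example values are made up; JSON is also valid input for kubectl.
    manifest = cronjob_manifest(
        "example-cron", "*/15 * * * *", "example/cli:latest", ["drush", "cron"]
    )
    print(json.dumps(manifest, indent=2))
```

With a deadline in place, a stuck Job ends up failed rather than active, so the next scheduled run is no longer blocked by the `Forbid` concurrency policy.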

What that limit should be is up for debate, but I'd say something like 2h-4h would be reasonable? This default limit would also need to be added to the Lagoon docs.

@smlx smlx added the bug Something isn't working label Jun 20, 2024
@shreddedbacon
Member

Some users have long-running cronjobs on purpose (dumb, yes). Whatever solution is implemented needs to make the time limit adjustable, with a sane default and guardrails, in case a 2-4h deadline impacts users negatively.
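One way to read "adjustable with a sane default and guardrails" is a per-cronjob override that is clamped to platform-wide bounds. The sketch below assumes exactly that; the function name `resolve_deadline`, the constants, and the specific bounds (1 minute to 24h, 4h default) are hypothetical and not taken from Lagoon.

```python
# All names and values here are illustrative assumptions, not Lagoon settings.
DEFAULT_DEADLINE_SECONDS = 4 * 60 * 60   # assumed default: 4h
MIN_DEADLINE_SECONDS = 60                # assumed lower guardrail: 1 minute
MAX_DEADLINE_SECONDS = 24 * 60 * 60      # assumed upper guardrail: 24h


def resolve_deadline(raw=None,
                     default=DEFAULT_DEADLINE_SECONDS,
                     minimum=MIN_DEADLINE_SECONDS,
                     maximum=MAX_DEADLINE_SECONDS):
    """Return the deadline in seconds for a cronjob.

    `raw` is an optional user-supplied override (string or int).
    Missing or unparseable values fall back to the default; parseable
    values are clamped into [minimum, maximum].
    """
    if raw is None or str(raw).strip() == "":
        return default
    try:
        seconds = int(str(raw).strip())
    except ValueError:
        # Unparseable override: fall back rather than fail.
        return default
    return max(minimum, min(seconds, maximum))
```

Falling back silently on a bad value is one possible design choice; rejecting the configuration with an error at build time would be an equally defensible alternative.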

@smlx
Member Author

smlx commented Dec 2, 2024

Right now, platform engineers manually kill cronjobs running longer than 24h, so a hard-coded limit would just be encoding the existing platform behaviour.

@shreddedbacon shreddedbacon linked a pull request Dec 19, 2024 that will close this issue