-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-32683 Fix issues with postmortem and container death #19296
Conversation
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32683 Jirabot Action Result: |
53d82b9
to
99b7b5c
Compare
Mark I would value you reviewing this change as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good as far as I can tell. I didn't see any issues.
helm/hpcc/templates/thor.yaml
Outdated
@@ -234,31 +241,36 @@ data: | |||
spec: | |||
{{- include "hpcc.placementsByJobTargetType" (dict "root" .root "job" $thorWorkerJobName "target" .me.name "type" "thor") | indent 10 }} | |||
serviceAccountName: hpcc-default | |||
terminationGracePeriodSeconds: {{ .terminationGracePeriodSeconds | default 60 }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
60 rather than 600 used elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, will change it to 600.
@@ -0,0 +1,158 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be valuable to have a comment at the head of each of these bash scripts to describe what it is doing, and how it fits into the overall postmortem process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have added comment blocks.
@ghalliday - see 2nd commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
Approved.
(But I am not an expert in this area)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
Approved.
(But I am not an expert in the area)
helm changes: - Consistently generate command via addCommandAndLifecycle, and collect added container information in lifeCycleCtx based on containers added to lifeCycleCtx - Add terminationGracePeriodSeconds option - Mount ephemeral directories in distinct subPaths. - Fix issues with postmortem's from different containers overwriting one another - Add addPostRunContainer to generate postrun container, to monitor running containers. script changes: - add container_watch.sh for postrun pod. - move all options handling into check_executes.sh code changes: - code grace time into k8s job file into, so that postjob clearup can also use - Clear wuid file (prevents spurious error association) Signed-off-by: Jake Smith <[email protected]>
69675a9
to
2e1e047
Compare
@ghalliday - now squashed. |
Jirabot Action Result: |
helm changes:
script changes:
code changes:
Type of change:
Checklist:
Smoketest:
Testing: