HPCC-32683 Fix issues with postmortem and container death #19296

jakesmith · 2024-11-13T18:03:48Z

helm changes:

- Consistently generate commands via addCommandAndLifecycle, and collect information about the added containers in lifeCycleCtx.
- Add terminationGracePeriodSeconds option.
- Mount ephemeral directories in distinct subPaths.
- Fix issues with postmortems from different containers overwriting one another.
- Add addPostRunContainer to generate a postrun container that monitors the running containers.
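
The overwrite problem above arises when several containers write postmortem output to one shared location. A minimal shell sketch of the idea behind distinct subPaths follows. It is not the chart's template code: the helper `postmortem_dir`, the container names, and the file name are all hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# Hypothetical helper, not part of the HPCC chart or scripts: map a shared
# base directory plus a container name to a per-container subdirectory,
# so files written by one container cannot overwrite another's.
postmortem_dir() {
  local base="$1" container="$2"
  local dir="${base}/${container}"
  mkdir -p "${dir}" || return 1
  printf '%s\n' "${dir}"
}

# Usage: two (hypothetically named) containers write a file with the same
# name under the same base directory without colliding.
base="$(mktemp -d)"
echo "info from manager" > "$(postmortem_dir "${base}" thormanager)/info.log"
echo "info from postrun" > "$(postmortem_dir "${base}" postrun)/info.log"
```

Keying the directory on the container name is only one possible choice; any identifier that is unique per container would serve the same purpose.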

script changes:

- Add container_watch.sh for the postrun pod.
- Move all option handling into check_executes.sh.

code changes:

- Encode the grace time in the k8s job file, so that post-job cleanup can also use it.
- Clear the wuid file (prevents spurious error association).
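
The grace time matters because whatever monitors the other containers has to decide how long to wait before giving up. The sketch below shows one way a bounded wait could look in shell. It is not the contents of container_watch.sh: the function name `wait_for_marker`, the marker-file convention, and the one-second poll interval are assumptions made purely for illustration.

```shell
#!/bin/bash
# Hypothetical sketch, not the real container_watch.sh: wait until a marker
# file appears, but give up once the grace period (in seconds) has elapsed.
# Returns 0 if the marker was seen, 1 if the grace period expired first.
wait_for_marker() {
  local marker="$1" grace="$2"
  local waited=0
  while [ ! -e "${marker}" ]; do
    if [ "${waited}" -ge "${grace}" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A caller would pass the same grace value that the job was configured with, so that the watcher and the job agree on when to stop waiting.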

Type of change:

This change is a bug fix (non-breaking change which fixes an issue).
This change is a new feature (non-breaking change which adds functionality).
This change improves the code (refactor or other change that does not change the functionality)
This change fixes warnings (the fix does not alter the functionality or the generated code)
This change is a breaking change (fix or feature that will cause existing behavior to change).
This change alters the query API (existing queries will have to be recompiled)

Checklist:

Smoketest:

Send notifications about my Pull Request position in Smoketest queue.
Test my draft Pull Request.

Testing:

github-actions · 2024-11-13T18:09:00Z

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32683

Jirabot Action Result:
Workflow Transition To: Merge Pending
Updated PR

ghalliday · 2024-11-15T12:59:12Z

Mark, I would value your review of this change as well.

ghalliday

Looks good as far as I can tell. I didn't see any issues.

ghalliday · 2024-11-15T14:22:16Z

helm/hpcc/templates/thor.yaml

@@ -234,31 +241,36 @@ data:
        spec:
          {{- include "hpcc.placementsByJobTargetType" (dict "root" .root "job" $thorWorkerJobName "target" .me.name "type" "thor") | indent 10 }}
          serviceAccountName: hpcc-default
+          terminationGracePeriodSeconds: {{ .terminationGracePeriodSeconds | default 60 }}


60 rather than 600 used elsewhere?

yes, will change it to 600.

ghalliday · 2024-11-28T14:21:48Z

initfiles/bin/container_watch.sh

@@ -0,0 +1,158 @@
+#!/bin/bash


Would be valuable to have a comment at the head of each of these bash scripts to describe what it is doing, and how it fits into the overall postmortem process.

Have added comment blocks.

jakesmith · 2024-12-03T20:05:23Z

@ghalliday - see 2nd commit.

mckellyln

Looks good to me.
Approved.
(But I am not an expert in this area)

typo

mckellyln

Looks good to me.
Approved.
(But I am not an expert in the area)

helm changes:
- Consistently generate commands via addCommandAndLifecycle, and collect information about the added containers in lifeCycleCtx.
- Add terminationGracePeriodSeconds option.
- Mount ephemeral directories in distinct subPaths.
- Fix issues with postmortems from different containers overwriting one another.
- Add addPostRunContainer to generate a postrun container that monitors the running containers.

script changes:
- Add container_watch.sh for the postrun pod.
- Move all option handling into check_executes.sh.

code changes:
- Encode the grace time in the k8s job file, so that post-job cleanup can also use it.
- Clear the wuid file (prevents spurious error association).

Signed-off-by: Jake Smith <[email protected]>

jakesmith · 2024-12-04T17:31:11Z

@ghalliday - now squashed.

github-actions · 2024-12-09T16:44:48Z

Jirabot Action Result:
Added fix version: 9.10.0
Workflow Transition: 'Resolve issue'

jakesmith force-pushed the HPCC-32683-postrun branch 2 times, most recently from 53d82b9 to 99b7b5c, November 14, 2024 11:05

jakesmith requested a review from ghalliday November 14, 2024 11:29

ghalliday requested a review from mckellyln November 15, 2024 12:58

ghalliday reviewed Nov 28, 2024

jakesmith requested a review from ghalliday December 3, 2024 20:05

ghalliday approved these changes Dec 4, 2024

mckellyln previously requested changes Dec 4, 2024

mckellyln approved these changes Dec 4, 2024

jakesmith force-pushed the HPCC-32683-postrun branch from 69675a9 to 2e1e047, December 4, 2024 17:30

ghalliday merged commit 6ffae6c into hpcc-systems:master Dec 9, 2024
47 of 50 checks passed
