feature proposal: health-check script support #739

jose-d · 2024-08-13T10:11:39Z


Hello,

Would it be possible (and interesting) to support a user-defined worker health check?

High level goal:

Detect workers running on unhealthy nodes early and avoid running jobs on them, so that faulty nodes cannot exhaust the jobs' --max-fails budget.

Proposed usage:

hq worker start --health-check /dir/to/health-check_script.sh

where health-check_script.sh is an executable or script that is expected to return 0 on a healthy node.
The hq worker would execute health-check_script.sh before the start of every(?) job.

It is up to the user to decide what to check. I believe that [availability of filesystems, IB interfaces, /dev/nvidia*, Kerberos tickets, available memory, ...] could all be subject to checks.
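A minimal sketch of what such a script might look like, assuming a Linux node with bash. Everything here is illustrative: the HEALTH_* variables, the function names and the default values are hypothetical and not part of HyperQueue. Only the contract from the proposal is taken as given: exit status 0 means the node is healthy.

```shell
#!/usr/bin/env bash
# Hypothetical health-check script: exit status 0 on a healthy node, non-zero otherwise.
# All HEALTH_* variables, function names and defaults are illustrative assumptions.

# Directories that must exist and be writable (space-separated).
HEALTH_DIRS="${HEALTH_DIRS:-/tmp}"
# Minimum available memory in KiB; 0 disables the check.
HEALTH_MIN_MEM_KB="${HEALTH_MIN_MEM_KB:-0}"
# Set to 1 to require a valid Kerberos ticket or an NVIDIA device node.
HEALTH_NEED_KRB="${HEALTH_NEED_KRB:-0}"
HEALTH_NEED_GPU="${HEALTH_NEED_GPU:-0}"

check_dir() {
    # Healthy if the path is a directory we can write into.
    [ -d "$1" ] && [ -w "$1" ]
}

check_memory() {
    # Healthy if MemAvailable in /proc/meminfo is at least $1 KiB.
    local min_kb="$1" avail_kb
    [ "$min_kb" -le 0 ] && return 0
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null)
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$min_kb" ]
}

check_kerberos() {
    # `klist -s` prints nothing and returns 0 only if a valid ticket cache exists.
    klist -s 2>/dev/null
}

check_gpu() {
    # Healthy if at least one /dev/nvidia* device node is present.
    compgen -G "/dev/nvidia*" > /dev/null
}

node_is_healthy() {
    local dir
    for dir in $HEALTH_DIRS; do
        check_dir "$dir" || { echo "unhealthy: $dir is not writable" >&2; return 1; }
    done
    check_memory "$HEALTH_MIN_MEM_KB" || { echo "unhealthy: low memory" >&2; return 1; }
    if [ "$HEALTH_NEED_KRB" = 1 ]; then
        check_kerberos || { echo "unhealthy: no Kerberos ticket" >&2; return 1; }
    fi
    if [ "$HEALTH_NEED_GPU" = 1 ]; then
        check_gpu || { echo "unhealthy: no NVIDIA device" >&2; return 1; }
    fi
    return 0
}

# Run the checks only when executed directly, not when sourced.
if [ "${BASH_SOURCE[0]}" = "$0" ]; then
    node_is_healthy
fi
```

For example, `HEALTH_NEED_KRB=1 HEALTH_DIRS="/tmp /scratch" ./health-check_script.sh; echo $?` would report a non-zero status on a node without a valid ticket (the /scratch path is only a placeholder). An IB interface check is left out because what counts as healthy there is site-specific.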

Motivation:

Recently we hit a Kerberos-related glitch at Metacentrum. Although the nodes on another cluster were healthy, the jobs' --max-fails budget was exhausted on the faulty nodes (= the faulty nodes ate all our jobs 😢).

To be discussed:

Does it make sense at all, or is this already the responsibility of the underlying scheduler/manager?
When should the health check be called: at job start, periodically, or should the user decide?
Instead of a health check, would it be better to create a more generic job prolog?

Inspiration

I come from a Slurm environment, so I had its HealthCheckProgram in mind. [1]

[1] https://slurm.schedmd.com/SUG14/node_health_check.pdf

spirali (Maintainer) · 2024-08-13T10:19:02Z

What exactly should happen when the health-check script fails? Should the task just be rescheduled somewhere else, or should the whole worker be terminated?

3 replies

jose-d Aug 13, 2024
Author

How to deal with an unhealthy worker:

Print something to stderr and terminate the worker? Since it cannot do the job for us, keeping it alive just wastes accounted core-hours on the underlying cluster.

One could instead wait for the health check to return "good values". But this would introduce a new state into the worker, possibly bringing too much complexity.

spirali Aug 13, 2024
Maintainer

What can be done now without modifying the code: run a health check as part of a task, and kill the worker from the task if the health check fails. In that case the tasks will be rescheduled and will not be counted as failed. The only problem is that when a worker fails, every task running on it has its "crash" counter increased, so these tasks are accounted as suspicious tasks that may crash workers. This can be worked around temporarily by setting the job's --crash-limit to some high number.

However, I think it is a good idea to provide some way to signal that we want to terminate a worker and that this is not the fault of any particular task (so their crash counters should not be increased).

spirali Sep 26, 2024
Maintainer

Hi,
we have released HQ v0.20.0, which does not increase crash counters when a worker is stopped via "hq worker stop ...". So it is now safe to call "hq worker stop ID" in your health-check script.
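The workaround discussed in this thread could be sketched as a small task wrapper, under several assumptions. HEALTH_CHECK, HQ_BIN, WORKER_ID and STOP_GRACE_SECONDS are hypothetical names introduced here; in particular, how the task learns the ID of the worker it runs on is left open, and the pause after stopping the worker is a guess meant to keep the task from exiting with an error before the worker shuts down. Only the "hq worker stop ID" call itself comes from the thread.

```shell
#!/usr/bin/env bash
# Hypothetical task wrapper: run the health check first, stop the worker if it fails,
# otherwise run the real task command given as arguments.
# Assumptions (not HyperQueue features): HEALTH_CHECK, HQ_BIN, WORKER_ID and
# STOP_GRACE_SECONDS are illustrative variables supplied by the user.

HEALTH_CHECK="${HEALTH_CHECK:-/dir/to/health-check_script.sh}"
HQ_BIN="${HQ_BIN:-hq}"
STOP_GRACE_SECONDS="${STOP_GRACE_SECONDS:-30}"

run_with_health_check() {
    if [ -z "${WORKER_ID:-}" ]; then
        echo "WORKER_ID is not set; cannot stop the worker" >&2
        return 2
    fi
    if ! "$HEALTH_CHECK"; then
        echo "health check failed, stopping worker $WORKER_ID" >&2
        # Ask the server to stop this worker; its tasks get rescheduled elsewhere.
        "$HQ_BIN" worker stop "$WORKER_ID"
        # Assumed grace period so the worker can shut down before this task exits.
        sleep "$STOP_GRACE_SECONDS"
        return 1
    fi
    # Node looks healthy: run the actual task command.
    "$@"
}

# Run only when executed directly with a command to wrap, not when sourced.
if [ "${BASH_SOURCE[0]}" = "$0" ] && [ "$#" -gt 0 ]; then
    run_with_health_check "$@"
fi
```

A task would then be submitted as the wrapper followed by the real command, for example `./task-wrapper.sh ./my_program`, where both file names are placeholders.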
