-
What exactly should happen when the health check script fails? Should the task just be rescheduled somewhere else, or should the whole worker be terminated?
-
Hello,
would it be possible (and interesting) to implement support for a user-defined worker health check?
High level goal: keep faulty nodes from exhausting a job's `--max-fails` budget.
Proposed usage:
where `health-check_script.sh` is an executable or script expected to return 0 on a healthy node. `health-check_script.sh` would be executed by the hq worker before the start of every(?) job.
It is up to the user to decide what to check. I believe the availability of filesystems, IB interfaces, `/dev/nvidia*`, Kerberos tickets, free memory, ... could all be subject to checks.
Motivation:
Recently we faced a Kerberos-related glitch at Metacentrum and, despite having healthy nodes on another cluster, the jobs' `--max-fails` budget was exhausted on the faulty nodes (= the faulty nodes ate all our jobs 😢).
To be discussed:
Does it make sense at all, or is this already up to the underlying scheduler/manager?
When should the health check be called: at job start, periodically, or should the user decide?
Instead of a health check, wouldn't it be better to create a more generic job prolog?
Inspiration
I'm coming from a Slurm environment, so I indeed had `HealthCheckProgram` in mind [1].
[1] https://slurm.schedmd.com/SUG14/node_health_check.pdf
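To make the idea a bit more concrete, here is a minimal sketch of what such a health-check executable could look like, written in Python purely for illustration. The only contract taken from the proposal above is "exit code 0 on a healthy node". Everything else is an assumption: the function names, the `/scratch` path, the 4 GiB memory threshold, and the choice of checks are hypothetical and would be site-specific. The Kerberos check assumes `klist -s` is available on the node.

```python
#!/usr/bin/env python3
"""Illustrative node health check: exit code 0 = healthy, 1 = unhealthy."""
import glob
import os
import subprocess
import sys


def path_available(path):
    """True if the path exists and is readable (e.g. a mounted filesystem)."""
    return os.path.exists(path) and os.access(path, os.R_OK)


def devices_present(pattern, minimum=1):
    """True if at least `minimum` paths match the glob pattern."""
    return len(glob.glob(pattern)) >= minimum


def mem_available_kib(meminfo_text):
    """Parse the MemAvailable value (KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def enough_memory(min_kib, meminfo_path="/proc/meminfo"):
    """True if the node reports at least `min_kib` KiB of available memory."""
    with open(meminfo_path) as f:
        available = mem_available_kib(f.read())
    return available is not None and available >= min_kib


def command_succeeds(argv):
    """True if the command runs and exits with status 0."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def run_checks(checks):
    """Run (name, callable) pairs; return 0 if all pass, 1 otherwise."""
    healthy = True
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception as exc:  # a crashing check counts as a failure
            print(f"health check '{name}' raised: {exc}", file=sys.stderr)
            healthy = False
            continue
        if not ok:
            print(f"health check '{name}' failed", file=sys.stderr)
            healthy = False
    return 0 if healthy else 1


def main():
    # Hypothetical, site-specific checks; adjust paths and thresholds.
    return run_checks([
        ("scratch filesystem", lambda: path_available("/scratch")),
        ("GPU devices", lambda: devices_present("/dev/nvidia*")),
        ("free memory", lambda: enough_memory(4 * 1024 * 1024)),
        ("kerberos ticket", lambda: command_succeeds(["klist", "-s"])),
    ])

# When used as a standalone script, finish with: raise SystemExit(main())
```

All checks are run even after the first failure, so that stderr lists everything that is wrong with the node. Whether a non-zero exit should only skip the node for that job or take the whole worker down is exactly the open question above.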