WIP fix: make a task engine job stoppable #444
Draft
This PR shows an approach that uses process parallelism to stop jobs: the experiment runs in a child process with a SIGTERM handler that sets a flag in the task engine, causing it to stop between steps. That alone can't interrupt a long-running step, so the parent process sends SIGTERM first and, if the child doesn't stop quickly enough, follows up with SIGKILL. The parent process could poll an endpoint for an instruction to stop the job, but we don't have such an endpoint yet; there is some commented-out code where that polling would happen.
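Roughly, the child-process side looks like the sketch below. The names (`stop_requested`, `run_experiment_in_child`, the `steps` attribute) are illustrative, not the exact identifiers on this branch:

```python
import signal

# Illustrative sketch of the child-process side; the actual flag and
# entry-point names in this branch differ.
stop_requested = False

def _handle_sigterm(signum, frame):
    # Record that a stop was requested; the task engine checks this flag
    # between steps and exits gracefully when it is set.
    global stop_requested
    stop_requested = True

def run_experiment_in_child(experiment_desc):
    signal.signal(signal.SIGTERM, _handle_sigterm)
    for step in experiment_desc.steps:
        if stop_requested:
            break  # stop between steps; a step already running still finishes
        step.run()
```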
Forcible stopping is not testable as-is on this branch, since we don't have endpoints to poll yet. There are no additional unit tests right now; as written, the child process is transparent to the existing test_run_task_engine unit test, so much of the code is already exercised. Perhaps a test could be added that stops an experiment early rather than letting it run to completion, to cover that path, though that could be trickier to do; a rough sketch is below.
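One possible shape for such a test (purely a sketch; the stand-in job and timings are made up, not code from this branch) is to start the child, send SIGTERM partway through, and assert it exits cleanly between steps:

```python
import multiprocessing
import signal
import time

def _long_job():
    # Stand-in for an experiment: install a SIGTERM handler that sets a
    # flag, then run several slow "steps", checking the flag in between.
    stop = {"requested": False}
    signal.signal(signal.SIGTERM, lambda *_: stop.update(requested=True))
    for _ in range(100):
        if stop["requested"]:
            return  # graceful early exit between steps
        time.sleep(0.1)

def test_job_can_be_stopped_early():
    proc = multiprocessing.Process(target=_long_job)
    proc.start()
    time.sleep(0.3)            # let a few steps run
    proc.terminate()           # delivers SIGTERM to the child
    proc.join(timeout=5)
    assert proc.exitcode == 0  # child exited cleanly between steps
```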
One thing I discovered is that mlflow has a special RunStatus value, KILLED, that seems applicable to a forcibly terminated run, but dioptra has no corresponding job status value. For now, this PR uses "failed"; we might want to add one.
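For reference, the mlflow side would look roughly like this. `RunStatus.KILLED` and `MlflowClient.set_terminated` are real mlflow APIs; the dioptra status mapping below is hypothetical and just illustrates the gap:

```python
from mlflow.entities import RunStatus
from mlflow.tracking import MlflowClient

run_id = "..."  # MLflow run ID of the forcibly terminated run

# Mark the run as KILLED in the MLflow tracking server.
client = MlflowClient()
client.set_terminated(run_id, status=RunStatus.to_string(RunStatus.KILLED))

# Hypothetical mapping to dioptra job statuses; "killed" does not exist
# yet, so this branch currently falls back to "failed".
MLFLOW_TO_DIOPTRA_STATUS = {
    "FINISHED": "finished",
    "FAILED": "failed",
    "KILLED": "failed",  # would become "killed" if we add that status
}
```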
You also can't test this using the legacy mlflow job submission system; you must use the newTaskEngine endpoint. The MLproject system runs the command-line run-experiment tool, which I did not change. It could technically be changed to use the same child process and polling, but that tool is supposed to be able to run independently of the dioptra containers, so it wouldn't make sense for it to poll the restapi.
The Python standard library has a multiprocessing module, which makes starting and monitoring a subprocess pretty easy, so that's what I used for this.
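The parent-side pattern is roughly the following (a sketch only; the `should_stop` callback and the grace period are placeholders, since the stop endpoint doesn't exist yet):

```python
import multiprocessing

def run_with_stop_support(target, args, should_stop, grace_seconds=10):
    # Run the experiment in a child process and watch for a stop request.
    proc = multiprocessing.Process(target=target, args=args)
    proc.start()
    while proc.is_alive():
        if should_stop():          # e.g. poll a (future) restapi endpoint
            proc.terminate()       # SIGTERM: lets the child stop between steps
            proc.join(timeout=grace_seconds)
            if proc.is_alive():
                proc.kill()        # SIGKILL: force-stop a long-running step
                proc.join()
            break
        proc.join(timeout=1)       # wake up periodically to poll again
    return proc.exitcode
```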