WIP fix: make a task engine job stoppable #444
Draft
This PR shows an approach that uses process parallelism to stop jobs: the experiment runs in a child process with a SIGTERM handler that sets a flag in the task engine, causing it to stop between steps. That alone can't interrupt a long-running step, so the parent process sends SIGTERM first and, if the child doesn't stop quickly enough, follows up with SIGKILL. The parent process could poll an endpoint for an instruction to stop the job, but we don't have such an endpoint yet; there is some commented-out code where that polling would happen.
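Roughly, the child-process side looks like the sketch below. The names (`stop_requested`, `run_experiment_in_child`, the `steps` attribute) are illustrative, not the exact identifiers on this branch:

```python
import signal

# Illustrative sketch of the child-process side; the actual flag and
# entry-point names in this branch differ.
stop_requested = False

def _handle_sigterm(signum, frame):
    # Record that a stop was requested; the task engine checks this flag
    # between steps and exits gracefully when it is set.
    global stop_requested
    stop_requested = True

def run_experiment_in_child(experiment_desc):
    signal.signal(signal.SIGTERM, _handle_sigterm)
    for step in experiment_desc.steps:
        if stop_requested:
            break  # stop between steps; a step already running still finishes
        step.run()
```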
Forcible stopping is not testable as-is on this branch, since we don't have endpoints to poll yet. There are no additional unit tests right now; as written, the child process is transparent to the existing test_run_task_engine unit test, so much of the code is already exercised. Perhaps a test could be added that stops an experiment early rather than letting it run to completion, to cover that path, though that could be trickier to do; a rough sketch is below.
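One possible shape for such a test (purely a sketch; the stand-in job and timings are made up, not code from this branch) is to start the child, send SIGTERM partway through, and assert it exits cleanly between steps:

```python
import multiprocessing
import signal
import time

def _long_job():
    # Stand-in for an experiment: install a SIGTERM handler that sets a
    # flag, then run several slow "steps", checking the flag in between.
    stop = {"requested": False}
    signal.signal(signal.SIGTERM, lambda *_: stop.update(requested=True))
    for _ in range(100):
        if stop["requested"]:
            return  # graceful early exit between steps
        time.sleep(0.1)

def test_job_can_be_stopped_early():
    proc = multiprocessing.Process(target=_long_job)
    proc.start()
    time.sleep(0.3)            # let a few steps run
    proc.terminate()           # delivers SIGTERM to the child
    proc.join(timeout=5)
    assert proc.exitcode == 0  # child exited cleanly between steps
```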
One thing I discovered is that mlflow has a special RunStatus value, KILLED, that seems applicable to a forcibly terminated run, but dioptra has no corresponding job status value. For now, this PR uses "failed"; we might want to add one.
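For reference, the mlflow side would look roughly like this. `RunStatus.KILLED` and `MlflowClient.set_terminated` are real mlflow APIs; the dioptra status mapping below is hypothetical and just illustrates the gap:

```python
from mlflow.entities import RunStatus
from mlflow.tracking import MlflowClient

run_id = "..."  # MLflow run ID of the forcibly terminated run

# Mark the run as KILLED in the MLflow tracking server.
client = MlflowClient()
client.set_terminated(run_id, status=RunStatus.to_string(RunStatus.KILLED))

# Hypothetical mapping to dioptra job statuses; "killed" does not exist
# yet, so this branch currently falls back to "failed".
MLFLOW_TO_DIOPTRA_STATUS = {
    "FINISHED": "finished",
    "FAILED": "failed",
    "KILLED": "failed",  # would become "killed" if we add that status
}
```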
You also can't test this using the legacy mlflow job submission system; you must use the newTaskEngine endpoint. The MLproject system runs the command-line run-experiment tool, which I did not change. It could technically be changed to use the same child process and polling, but that tool is supposed to be able to run independently of the dioptra containers, so it wouldn't make sense for it to poll the restapi.
The Python standard library has a multiprocessing module, which makes starting and monitoring a subprocess pretty easy, so that's what I used for this.
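The parent-side pattern is roughly the following (a sketch only; the `should_stop` callback and the grace period are placeholders, since the stop endpoint doesn't exist yet):

```python
import multiprocessing

def run_with_stop_support(target, args, should_stop, grace_seconds=10):
    # Run the experiment in a child process and watch for a stop request.
    proc = multiprocessing.Process(target=target, args=args)
    proc.start()
    while proc.is_alive():
        if should_stop():          # e.g. poll a (future) restapi endpoint
            proc.terminate()       # SIGTERM: lets the child stop between steps
            proc.join(timeout=grace_seconds)
            if proc.is_alive():
                proc.kill()        # SIGKILL: force-stop a long-running step
                proc.join()
            break
        proc.join(timeout=1)       # wake up periodically to poll again
    return proc.exitcode
```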