Fix deadlock with parallel SBY procs each with parallel tasks #245
When multiple SBY processes run in parallel (launched from a Makefile or another job-server-aware tool), and each SBY process runs its tasks in parallel with enough tasks to be limited by the total job count, the processes can race in such a way that every SBY process's helper process is stuck in a blocking read from the job-server, while a job-token would only become available as soon as some SBY process exits.
In that situation SBY doesn't actually need the job-token anymore; it only requested it earlier because there was an opportunity for parallelism, and it would return the token immediately upon acquiring it. That is usually sufficient to deal with requested-but-no-longer-needed tokens. But when SBY is done, it needs to return the job-token held by the parent process as soon as possible, which it can only do by actually exiting, so we need to interrupt the blocking read of SBY's helper process.
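The acquire-then-return-immediately pattern can be sketched as follows. This is an illustration, not SBY's actual code: the fds here are a local pipe standing in for the job-server pipe inherited from the parent, and the `b'+'` token byte is an assumption.

```python
import os

# Local pipe standing in for the job-server pipe (real fds would be
# inherited from the parent, e.g. via MAKEFLAGS for GNU make).
server_r, server_w = os.pipe()
os.write(server_w, b'+')      # the parent makes one token available

token = os.read(server_r, 1)  # acquire: blocking read of one byte
# ...if the opportunity for parallelism is gone by the time the read
# returns, the token is handed straight back:
os.write(server_w, token)     # release: write the same byte back
```

The deadlock arises when the read blocks because no token is currently in the pipe, which is exactly the case the rest of this description deals with.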
This could be done by sending a signal to the helper process, except that since 3.5, Python automatically retries most system calls on EINTR, with no way to opt out. That was part of the reason for choosing this specific helper-process design, which avoids interrupting a blocking read in the first place.
Raising an exception from the signal handler instead could lose a token when the signal arrives after the read returns but before the token is stored in a variable. A lost token cannot be recovered within the job-server protocol, so that is not an option. (This can't happen with recent Python versions, but relying on that would depend on undocumented behavior that could plausibly change again.)
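To make the race window concrete, here is an annotated sketch (again using a local pipe in place of the job-server; the setup is illustrative, only the comments matter):

```python
import os

server_r, server_w = os.pipe()
os.write(server_w, b'+')  # one token available

token = os.read(server_r, 1)
# ^ an exception raised from a signal handler in the window after the
#   read() call consumed the byte from the pipe, but before the
#   assignment to `token` completed, would unwind the stack with the
#   byte bound to nothing: it is gone from the pipe, and the
#   job-server protocol offers no way to mint a replacement.
os.write(server_w, token)  # the return that such an exception would skip
```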
Thankfully, the only case where we need to interrupt the read is when SBY is about to exit and will not request any further tokens. This allows us to use a signal handler that uses dup2 to close the read-side fd and replace it with one that is already at EOF, making the next retry return immediately. (If we needed to interrupt a read and keep running, we could still do this, but the fd shuffling would be more involved.)
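The dup2 trick can be demonstrated in a self-contained way. This sketch uses SIGALRM and local pipes for illustration; the blocked read would otherwise never return, because Python retries it on EINTR after the handler runs:

```python
import os
import signal

# Simulated job-server pipe with no tokens available: the os.read()
# below blocks, and Python >= 3.5 retries it automatically after a
# signal, so an ordinary handler cannot break it out.
server_r, server_w = os.pipe()

# An fd that is already at EOF: a pipe whose write end is closed.
eof_r, eof_w = os.pipe()
os.close(eof_w)

def unblock(signum, frame):
    # Atomically close server_r and make it refer to the EOF pipe;
    # the EINTR retry then reads from an fd at EOF and returns b''.
    os.dup2(eof_r, server_r)

signal.signal(signal.SIGALRM, unblock)
signal.alarm(1)  # deliver SIGALRM while os.read() is blocked

token = os.read(server_r, 1)  # blocks until the handler swaps the fd
# token is now b'': EOF rather than a real token, signaling shutdown
```

Reading `b''` is unambiguous here because a real job-server token is always exactly one byte, so the helper can treat EOF as "stop waiting and exit".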