Fix deadlock with parallel SBY procs each with parallel tasks #245

Merged
merged 1 commit into from
Jul 17, 2023

Commits on Jul 17, 2023

  1. Fix deadlock with parallel SBY procs each with parallel tasks

    When multiple SBY processes run in parallel (from a Makefile or other
    job-server aware tool) and each SBY process runs tasks in parallel, each
    with enough tasks to be limited by the total job count, it is possible
    for the processes to race in such a way that every SBY process's helper
    process is blocked in a read from the job-server, while a job-token would
    only become available once one of the SBY processes exits.
    
    In that situation SBY doesn't actually need the job-token anymore and
    only previously requested it as there was opportunity for parallelism.
    It would immediately return the token as soon as it is acquired. That's
    usually sufficient to deal with no-longer-needed-but-requested tokens,
    but when SBY is done, it needs to return the job-token held by the
    parent process ASAP which it can only do by actually exiting, so we need
    to interrupt the blocking read of SBY's helper process.
    
    This could be done by sending a signal to the helper process, except
    that Python made the decision in 3.5 to have automatic EINTR retry loops
    around most system calls with no opt-out. That was part of the reason to
    go with this specific helper process design that avoids interrupting a
    blocking read in the first place.
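
The retry behaviour described above (introduced by PEP 475) can be observed with a small, Unix-only sketch. It is not SBY code: the function name, the use of `SIGALRM` with an interval timer, and the plain pipe standing in for the job-server are all assumptions made for illustration. The handler returns normally, so `os.read` keeps blocking across several signals and only returns once a byte is actually written.

```python
import os
import signal


def read_despite_signals(signals_before_token: int = 3):
    """Block in os.read while signals arrive; return (token, signals_seen)."""
    r, w = os.pipe()  # hypothetical stand-in for the job-server pipe
    seen = 0

    def handler(signum, frame):
        nonlocal seen
        seen += 1
        # Only the Nth signal makes a token available; the earlier ones
        # return normally, so Python silently retries the read.
        if seen == signals_before_token:
            os.write(w, b"+")

    previous = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, 0.05, 0.05)  # fire every 50 ms
    try:
        token = os.read(r, 1)  # not interrupted by the handler returning
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # stop the timer
        signal.signal(signal.SIGALRM, previous)
        os.close(r)
        os.close(w)
    return token, seen
```

Calling `read_despite_signals()` from the main thread returns the token only after the handler has run at least `signals_before_token` times, which is why a plain signal is not enough to break out of the helper's blocking read.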
    
    Using an exception raised from the signal handler instead might lose a
    token when the signal arrives after the read returns, but before the
    token is stored in a variable. You cannot recover from a lost token in
    the context of the job-server protocol, so that's not an option. (This
    can't happen with recent Python versions but that would depend on
    undocumented behavior that could plausibly change again.)
    
    Thankfully the only case where we need to interrupt the read is when SBY
    is about to exit and will not request any further tokens. This allows us
    to use a signal handler that uses dup2 to close and replace the
    read-from fd with one that already is at EOF, making the next retry
    return immediately. (If we needed to interrupt a read and continue
    running, we could also do this, but the fd shuffling would be more
    involved.)
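
A minimal sketch of the dup2 approach described in the last paragraph, under stated assumptions: it is not the code from this commit, the names `read_token_until_exit` and `demo` are hypothetical, a one-shot `SIGALRM` timer stands in for whatever signal SBY sends to its helper, and an ordinary pipe stands in for the job-server. The handler replaces the read fd with the read end of a pipe whose write end is already closed, so the automatic retry of the read returns EOF instead of blocking again.

```python
import os
import signal


def read_token_until_exit(read_fd: int, exit_after: float) -> bytes:
    """Read one token byte, or return b"" once the 'exit' signal arrives."""
    eof_r, eof_w = os.pipe()
    os.close(eof_w)  # no writer left, so reads from eof_r hit EOF at once

    def on_exit_signal(signum, frame):
        # Close read_fd and make it refer to the EOF pipe in one step.
        # The retried read then uses the same fd number and returns b"".
        os.dup2(eof_r, read_fd)

    previous = signal.signal(signal.SIGALRM, on_exit_signal)
    signal.setitimer(signal.ITIMER_REAL, exit_after)  # simulated exit request
    try:
        return os.read(read_fd, 1)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
        os.close(eof_r)


def demo(token_available: bool) -> bytes:
    jobserver_r, jobserver_w = os.pipe()  # hypothetical job-server pipe
    if token_available:
        os.write(jobserver_w, b"+")
    try:
        return read_token_until_exit(jobserver_r, exit_after=0.1)
    finally:
        os.close(jobserver_r)
        os.close(jobserver_w)
```

With a token already in the pipe, `demo(True)` returns it before the timer fires; with an empty pipe, `demo(False)` returns `b""` after the handler swaps the fd. Because the handler never raises, a token that the read has already returned cannot be dropped on the way to the caller.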
    jix committed Jul 17, 2023
    edbc054