
Wrap compilation in a cross-os-process lock #13860

Open · wants to merge 16 commits into main

Conversation

@jonatanklosko (Member)

This wraps mix compile in a lock, so that only one compilation runs at a time. This helps ensure that no compilation work is duplicated for a given target.

More specifically, we wrap tasks such as compile.all, compile.elixir, compile.erlang and compile.deps in the lock. Technically, wrapping compile.all should be enough, but child tasks (such as compile.elixir) can still be invoked directly, hence we wrap each of them in the lock (lock calls can be nested safely).
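
Conceptually, the wrapping looks roughly like this (a minimal sketch; Mix.Lock.with_lock/2 is an illustrative name, not necessarily the final public API):

```elixir
# Illustrative sketch only. `Mix.Lock.with_lock/2` stands in for the
# actual locking API introduced by this PR.
defmodule Mix.Tasks.Compile.Example do
  use Mix.Task

  @impl true
  def run(args) do
    # The lock is keyed per build, and lock calls can be nested safely,
    # so both compile.all and each child compiler can take it.
    Mix.Lock.with_lock(Mix.Project.build_path(), fn ->
      do_compile(args)
    end)
  end

  defp do_compile(_args), do: {:ok, []}
end
```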

As for the lock implementation, we initially planned to use Unix domain sockets, but that turned out to be non-ideal. The :socket implementation on Windows currently has an important bug (likely to be patched soon). The main reason we ruled out this approach, though, is that the socket path is limited to 104 characters. This makes it too brittle, because even the temporary directory can be long (on macOS, or via a custom TMPDIR), and combined with the additional naming we would need, we would be close to the limit.

Instead, the current implementation relies on TCP sockets and hard links. The module has a detailed comment outlining the algorithm. For the record, both implementations can be found in jonatanklosko/elixir_lock. The implementation in this PR has a different public API, tailored for the Mix use case.

Demo

elixircompilelock.mp4

@lukaszsamson (Contributor)

Nice to see my idea taken further and different approaches explored, @jonatanklosko.

My main concern is interaction with corporate firewalls and EDRs that may block socket open/bind. It would be troublesome if running elixir required security approval on dev machines.

If anybody is interested, here's my Unix-domain-socket-based PoC from back in April. The main issue was Windows compatibility: it would require OTP 26, and only stream mode works there (no datagram support yet). See https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/ for details.

Still, it would be nice if file locks were available, but my PR adding that has been in limbo since 2022.

@josevalim (Member)

@lukaszsamson @jonatanklosko tried the socket approach too, but there are bugs in OTP 27.1, so that would require OTP 27.2+. There is also a 100-character limit on the socket path, but we may be able to work around that.

My main concern is interaction with corporate firewalls and EDRs that may block socket open/bind.

Do you think this will be the case? If we cannot open up a socket, then it means they cannot start a Phoenix server either?

@jonatanklosko (Member, Author)

@lukaszsamson I was concerned about firewalls too, because when I tried opening a TCP socket, the Windows firewall alert would pop up talking about network access. But I realised it was because the socket was listening on 0.0.0.0 by default. Listening on the loopback interface avoided the alert, which makes sense, because it skips the network. I don't know about other firewalls, though, but then running any local server would hit the same issue.
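
For illustration, the difference boils down to the bind address (not the exact PR code):

```elixir
# Listening on all interfaces (the default): the socket is reachable
# from the network, so Windows may show a firewall prompt.
{:ok, _socket} = :gen_tcp.listen(0, [:binary])

# Listening on loopback only: traffic never leaves the machine,
# so no firewall prompt is triggered.
{:ok, socket} = :gen_tcp.listen(0, [:binary, ip: {127, 0, 0, 1}])
{:ok, {_ip, port}} = :inet.sockname(socket)
```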

@whatyouhide (Member) left a comment

The algorithm looks solid at first glance. I left a pass of mostly code-related comments, but I'll take a look at this again later 🙃

@lukaszsamson (Contributor)

There are a few ways the lock or unlock may fail, leaving lock files behind. Would a mix unlock command be needed?

@jonatanklosko (Member, Author)

There are a few ways the lock or unlock may fail, leaving lock files behind. Would a mix unlock command be needed?

It's not just lock and unlock; the process may be abruptly terminated at any point. The locking procedure is designed precisely to handle this and to avoid race conditions while doing so. I went through a bunch of other ideas and I believe this one is resilient to failures.

To be more specific about your question: leaving the files behind is not an issue, because another process can check whether a file is stale by probing the corresponding port.
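
Roughly speaking, the staleness check looks like this (an illustrative sketch with a made-up file layout, not the exact implementation):

```elixir
# Illustrative sketch: a lock file records a TCP port; if nothing accepts
# on that port, the file is considered stale and can be taken over.
# (The real implementation also expects probe data from the socket to
# make sure it belongs to a Mix compilation.)
defp stale?(lock_path) do
  port = lock_path |> File.read!() |> String.trim() |> String.to_integer()

  case :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: false], 5_000) do
    {:ok, socket} ->
      :gen_tcp.close(socket)
      false

    {:error, _reason} ->
      true
  end
end
```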

@whatyouhide (Member) left a comment

Left another round of comments, fantastic work here. Also took the liberty of turning the algo into a draft of a Mermaid diagram, in case it helps others.

```mermaid
sequenceDiagram
  participant process
  participant port_P as port_P file (for port P)
  participant lock_n
  participant lock_np1 as lock_(n+1)

  process->>process: Open TCP socket on port P
  process->>port_P: Write TCP socket port

  loop Try to acquire lock through port P, starting at n=0
    process->>lock_n: Attempt to link port_P to lock_n

    alt Linking succeeds
      port_P->>lock_n: Link
      Note over port_P,lock_n: Process owns the lock

      process->>process: Remove all port_* files
      process->>process: Remove all lock_* files<br/>(except for lock_0)
    else Linking fails
      process->>lock_n: Connect to port in lock_n
      alt Connection succeeds
        lock_n->>process: Socket sends probe data
        Note over lock_n,process: This ensures that the socket is<br/>related to the Mix compilation, not<br/>just a random socket
        process-->>process: Wait for socket to close
        Note over process,process: Someone else has the lock
      else Connection fails (lock is stale)
        process->>lock_np1: Repeat loop with n+1
      end
    end
  end
```
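
In Elixir-flavoured pseudocode, the loop above corresponds to roughly the following (helper names such as probe/1 and wait_for_close/1 are placeholders, not the PR's actual functions):

```elixir
# Rough sketch of the acquisition loop from the diagram; not the PR's code.
defp acquire(lock_dir, port_path, n) do
  lock_path = Path.join(lock_dir, "lock_#{n}")

  case File.ln(port_path, lock_path) do
    :ok ->
      # Linking succeeded: we own the lock. Remove all port_* files,
      # then all lock_* files except lock_0.
      take_over(lock_dir)
      :ok

    {:error, _reason} ->
      case probe(lock_path) do
        {:ok, socket} ->
          # The recorded port answered with probe data: someone else
          # holds the lock, so wait for their socket to close and retry.
          wait_for_close(socket)
          acquire(lock_dir, port_path, 0)

        :stale ->
          # Nothing answers on the recorded port: the lock file is stale,
          # so try the next slot.
          acquire(lock_dir, port_path, n + 1)
      end
  end
end
```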

```elixir
  end

  defp accept_loop(listen_socket) do
    case :gen_tcp.accept(listen_socket) do
```

Member

If any other error happens here, we'll get a nasty CaseClauseError. Some of those errors are transient (:system_limit?), some are not. Maybe we do a simple algorithm of:

  1. Accept.
  2. If {:error, reason} and we don't know whether reason is transient or not (basically a catch-all), then sleep for about a second and go back to 1.

With a max of 2 or 3 retries for now? I know that just rerunning mix compile would likely achieve the same, but it seems like a small price to pay for a potentially better UX. Thoughts?
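
Something roughly along these lines (just a sketch of the idea; handle_client/2, the retry count, and the delay are placeholders):

```elixir
# Sketch only: retry accept a couple of times on unexpected errors
# before giving up, instead of raising a CaseClauseError.
defp accept_loop(listen_socket, retries \\ 3) do
  case :gen_tcp.accept(listen_socket) do
    {:ok, socket} ->
      handle_client(socket, listen_socket)

    {:error, _reason} when retries > 0 ->
      # Possibly transient; wait a bit and try again.
      Process.sleep(1_000)
      accept_loop(listen_socket, retries - 1)

    {:error, reason} ->
      exit({:accept_failed, reason})
  end
end
```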

Member Author

I'm a bit concerned that if we fail to accept, it means the client fails to connect and assumes the socket is stale. So it's probably safer to crash the lock owner to avoid corruption? We could raise a more specific error to avoid CaseClauseError.

When we reach system_limit on open files/sockets, it usually blows everything up anyway? Are there any other transient errors you worry about? I'm looking at the list in the Linux manual and I'm not sure it makes sense to retry any of them (other than EINTR).

Member

The accepting process is sort of holding its own little lock, so this confuses me a bit:

it means the client fails to connect and assumes the socket is stale

The client would fail to connect with :closed if the socket is closed, not with other errors, no?

@jonatanklosko (Member, Author), Sep 27, 2024

Ah, you're right, I think these are not tied. The client is put in the accept queue, and at that point it has connected successfully; whether :gen_tcp.accept grabs it from the queue is a separate matter.

But if the idea is to sleep and retry accepting, this heavily increases the chance that we don't send the probe data to the client in time, which is critical.

Comment on lines +227 to +229

```elixir
# On Windows connecting to an unbound port takes a few seconds to
# fail, so instead we shortcut the check by attempting a listen,
# which succeeds or fails immediately
```

Member

Oh, nice trick!
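
For reference, the shortcut described above amounts to roughly this (an illustrative sketch, not the PR's exact code):

```elixir
# Sketch only: on Windows, instead of connecting (which times out slowly
# for unbound ports), try to listen on the port. Success means the port
# is free, so the lock file referencing it must be stale.
defp port_free?(port) do
  case :gen_tcp.listen(port, ip: {127, 0, 0, 1}) do
    {:ok, listen_socket} ->
      :gen_tcp.close(listen_socket)
      true

    {:error, :eaddrinuse} ->
      false
  end
end
```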

Comment on lines +265 to +273

```elixir
names = File.ls!(path)

for "port_" <> _ = name <- names do
  File.rm!(Path.join(path, name))
end

for "lock_" <> _ = name <- names, name != "lock_0" do
  File.rm!(Path.join(path, name))
end
```

Member

Would this work? Nitpick, because your version works too 🙃

```elixir
for file <- path |> Path.join("{port_,lock_}*") |> Path.wildcard(),
    Path.basename(file) != "lock_0",
    do: File.rm!(file)
```

Member Author

We want to remove all port_* files first, only then all lock_* files. Not sure if wildcard guarantees the order, but even if it does, I'd rather be explicit about the order of operations.

Member

It doesn't guarantee that as far as I know. Why do we need to delete all port files first? Might be worth leaving a comment.

@jonatanklosko (Member, Author), Sep 27, 2024

Sidenote: we actually want all files in the directory, so semantically there's no need for wildcard.

Why do we need to delete all port files first?

This part from the doc comment:

```elixir
# [...] By removing
# all port_P files we make sure that currently paused processes
# that are about to link port_P at lock_N will fail to link, since
# the port_P file will no longer exist (once lock_N is removed).
```

This is best seen with an example, so imagine this:

port_100 (A)
port_200 (B)
lock_0 (stale)
lock_1 (stale)
lock_2 (owned by A)

Process A has port 100 and owns lock_2.
Process B has port 200 and is paused right before File.ln("port_200", "lock_1").

Process A first moves port_100 to lock_0, and now wants to remove files.

Let's say process A removes lock_1 first. Then process B wakes up and successfully links, and at that point process B overtakes A. Conversely, if we remove port_200 first, then when process B wakes up, it will fail to link.

In other words, once we are sure we are the new owner, we remove port_* files to invalidate any peers attempting to also get the lock.

Does that make sense? :)

Member

Yep, makes sense. My head hurts, but it makes sense 😉 Let's maybe leave a comment here too saying that we need to delete these files in the specific order then, wdyt?

Member Author

FWIW the module comment already says they need to be deleted in order. There was also a comment where take_over is called, but I moved it here, since it's more relevant here :)

Member

Perfecto!

@jonatanklosko (Member, Author)

@whatyouhide the diagram looks great :D

@whatyouhide (Member)

@jonatanklosko

the diagram looks great :D

Thanks! Is it accurate though? 😄

@jonatanklosko (Member, Author)

Thanks! Is it accurate though?

That's what I meant!

Though after writing the example, I've just realised there's one missing code path, where "[Linking fails]" happens because the source file doesn't exist (the "invalidated" case). In that case we go back to the loop with n=0. Similarly, after "Someone else has the lock" we also go back to n=0.

```elixir
    :gen_tcp.close(socket)
    {:error, :unexpected_port_owner}

  {:error, :eintr} ->
```

Contributor

I did some research and it seems this is not needed. It's not mentioned in the manual, nor was I able to find any code explicitly checking for that result.

Member Author

If I understand correctly, the C code for the Unix recv is here. There are no checks for EINTR, and the default error code path leads to returning erl_errno_id(saveErrno), which does recognise EINTR, so it seems to me that technically it could be returned.

I tried to test it by setting a SIGALRM handler and sending the signal while :gen_tcp.recv is blocked, and that didn't interrupt the call. But then I realised it's probably because the signal is delivered to the main thread while recv runs in another.
