
Wrap compilation in a cross-os-process lock #13860

Open · wants to merge 16 commits into main

Conversation

@jonatanklosko (Member)

This wraps mix compile in a lock, so that only one compilation runs at a time. This helps ensure that no compilation work is duplicated for a given target.

More specifically, we wrap tasks such as compile.all, compile.elixir, compile.erlang and compile.deps in the lock. Technically, wrapping compile.all should be enough, but child tasks (such as compile.elixir) can still be invoked directly, hence we wrap each of them in the lock (lock calls can be nested safely).
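
Conceptually, the wrapping looks roughly like this (a minimal sketch; Mix.Lock.with_lock/2 is an illustrative name, not necessarily the final public API):

```elixir
# Illustrative sketch only. `Mix.Lock.with_lock/2` stands in for the
# actual locking API introduced by this PR.
defmodule Mix.Tasks.Compile.Example do
  use Mix.Task

  @impl true
  def run(args) do
    # The lock is keyed per build, and lock calls can be nested safely,
    # so both compile.all and each child compiler can take it.
    Mix.Lock.with_lock(Mix.Project.build_path(), fn ->
      do_compile(args)
    end)
  end

  defp do_compile(_args), do: {:ok, []}
end
```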

As for the lock implementation, we initially planned to use Unix domain sockets, but that turned out to be non-ideal. The :socket implementation on Windows currently has an important bug (likely to be patched soon). The main reason we ruled out this approach, though, is that the socket path is limited to 104 characters. This makes it too brittle, because even the temporary directory can be long (on macOS, or via a custom TMPDIR), and combined with the additional naming we would need, we would be close to the limit.

Instead, the current implementation relies on TCP sockets and hard links. The module has a detailed comment outlining the algorithm. For the record, both implementations can be found in jonatanklosko/elixir_lock. The implementation in this PR has a different public API, tailored for the Mix use case.

Demo

elixircompilelock.mp4

@lukaszsamson (Contributor)

Nice to see my idea taken further and different approaches explored, @jonatanklosko.

My main concern is interaction with corporate firewalls and EDRs that may block socket open/bind. It would be troublesome if running elixir required security approval on dev machines.

If anybody is interested, here's my Unix-domain-socket-based PoC from back in April. The main issue was Windows compatibility: it would require OTP 26, and only stream mode works there (no datagram support yet). See https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/ for details.

Still, it would be nice if file locks were available, but my PR adding that has been in limbo since 2022.

@josevalim (Member)

@lukaszsamson @jonatanklosko tried the socket approach too, but there are bugs in OTP 27.1, so that would require OTP 27.2+. There is also a 100-character limit on the socket path, but we may be able to work around that.

My main concern is interaction with corporate firewalls and EDRs that may block socket open/bind.

Do you think this will be the case? If we cannot open up a socket, then it means they cannot start a Phoenix server either?

@jonatanklosko (Member, Author)

@lukaszsamson I was concerned about firewalls too, because when I tried opening a TCP socket, the Windows firewall alert would pop up talking about network access. But I realised it was because the socket was listening on 0.0.0.0 by default. Listening on the loopback interface avoided the alert, which makes sense, because it skips the network. I don't know about other firewalls, though, but then running any local server would hit the same issue.
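
For illustration, the difference boils down to the bind address (not the exact PR code):

```elixir
# Listening on all interfaces (the default): the socket is reachable
# from the network, so Windows may show a firewall prompt.
{:ok, _socket} = :gen_tcp.listen(0, [:binary])

# Listening on loopback only: traffic never leaves the machine,
# so no firewall prompt is triggered.
{:ok, socket} = :gen_tcp.listen(0, [:binary, ip: {127, 0, 0, 1}])
{:ok, {_ip, port}} = :inet.sockname(socket)
```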

@whatyouhide (Member) left a comment

The algorithm looks solid at first glance. I left a pass of mostly code-related comments, but I'll take a look at this again later 🙃

@lukaszsamson (Contributor)

There are a few ways the lock or unlock may fail, leaving lock files behind. Would a mix unlock command be needed?

@jonatanklosko (Member, Author)

There are a few ways the lock or unlock may fail, leaving lock files behind. Would a mix unlock command be needed?

It's not just lock and unlock; the process may be abruptly terminated at any point. The locking procedure is designed precisely to handle this and to avoid race conditions while doing so. I went through a bunch of other ideas and I believe this one is resilient to failures.

To be more specific about your question: leaving the files behind is not an issue, because another process can check whether a file is stale by probing the corresponding port.
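
Roughly speaking, the staleness check looks like this (an illustrative sketch with a made-up file layout, not the exact implementation):

```elixir
# Illustrative sketch: a lock file records a TCP port; if nothing accepts
# on that port, the file is considered stale and can be taken over.
# (The real implementation also expects probe data from the socket to
# make sure it belongs to a Mix compilation.)
defp stale?(lock_path) do
  port = lock_path |> File.read!() |> String.trim() |> String.to_integer()

  case :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: false], 5_000) do
    {:ok, socket} ->
      :gen_tcp.close(socket)
      false

    {:error, _reason} ->
      true
  end
end
```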

@whatyouhide (Member) left a comment

Left another round of comments, fantastic work here. Also took the liberty of turning the algo into a draft of a Mermaid diagram, in case it helps others.

```mermaid
sequenceDiagram
  participant process
  participant port_P as port_P file (for port P)
  participant lock_n
  participant lock_np1 as lock_(n+1)

  process->>process: Open TCP socket on port P
  process->>port_P: Write TCP socket port

  loop Try to acquire lock through port P, starting at n=0
    process->>lock_n: Attempt to link port_P to lock_n

    alt Linking succeeds
      port_P->>lock_n: Link
      Note over port_P,lock_n: Process owns the lock

      process->>process: Remove all port_* files
      process->>process: Remove all lock_* files<br/>(except for lock_0)
    else Linking fails
      process->>lock_n: Connect to port in lock_n
      alt Connection succeeds
        lock_n->>process: Socket sends probe data
        Note over lock_n,process: This ensures that the socket is<br/>related to the Mix compilation, not<br/>just a random socket
        process-->>process: Wait for socket to close
        Note over process,process: Someone else has the lock
      else Connection fails (lock is stale)
        process->>lock_np1: Repeat loop with n+1
      end
    end
  end
```
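
In Elixir-flavoured pseudocode, the loop above corresponds to roughly the following (helper names such as probe/1 and wait_for_close/1 are placeholders, not the PR's actual functions):

```elixir
# Rough sketch of the acquisition loop from the diagram; not the PR's code.
defp acquire(lock_dir, port_path, n) do
  lock_path = Path.join(lock_dir, "lock_#{n}")

  case File.ln(port_path, lock_path) do
    :ok ->
      # Linking succeeded: we own the lock. Remove all port_* files,
      # then all lock_* files except lock_0.
      take_over(lock_dir)
      :ok

    {:error, _reason} ->
      case probe(lock_path) do
        {:ok, socket} ->
          # The recorded port answered with probe data: someone else
          # holds the lock, so wait for their socket to close and retry.
          wait_for_close(socket)
          acquire(lock_dir, port_path, 0)

        :stale ->
          # Nothing answers on the recorded port: the lock file is stale,
          # so try the next slot.
          acquire(lock_dir, port_path, n + 1)
      end
  end
end
```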

```elixir
  end

  defp accept_loop(listen_socket) do
    case :gen_tcp.accept(listen_socket) do
```

Member

If any other error happens here, we'll get a nasty CaseClauseError. Some of those errors are transient (:system_limit?), some are not. Maybe we do a simple algorithm of:

  1. Accept.
  2. If {:error, reason} and we don't know whether reason is transient or not (basically a catch-all), then sleep for about a second and go back to 1.

With a max of 2 or 3 retries for now? I know that just rerunning mix compile would likely achieve the same, but it seems like a small price to pay for a potentially better UX. Thoughts?
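
Something roughly along these lines (just a sketch of the idea; handle_client/2, the retry count, and the delay are placeholders):

```elixir
# Sketch only: retry accept a couple of times on unexpected errors
# before giving up, instead of raising a CaseClauseError.
defp accept_loop(listen_socket, retries \\ 3) do
  case :gen_tcp.accept(listen_socket) do
    {:ok, socket} ->
      handle_client(socket, listen_socket)

    {:error, _reason} when retries > 0 ->
      # Possibly transient; wait a bit and try again.
      Process.sleep(1_000)
      accept_loop(listen_socket, retries - 1)

    {:error, reason} ->
      exit({:accept_failed, reason})
  end
end
```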

Member Author

I'm a bit concerned that if we fail to accept, it means the client fails to connect and assumes the socket is stale. So it's probably safer to crash the lock owner to avoid corruption? We could raise a more specific error to avoid CaseClauseError.

When we reach system_limit on open files/sockets, it usually blows everything up anyway? Are there any other transient errors you worry about? I'm looking at the list in the Linux manual and I'm not sure it makes sense to retry any of them (other than EINTR).

Member

The accepting process is sort of holding its own little lock, so this confuses me a bit:

it means the client fails to connect and assumes the socket is stale

The client would fail to connect with :closed if the socket is closed, not with other errors, no?

@jonatanklosko (Member, Author), Sep 27, 2024

Ah, you're right, I think these are not tied. The client is put in the accept queue, and at that point it has connected successfully; whether :gen_tcp.accept grabs it from the queue is a separate matter.

But if the idea is to sleep and retry accepting, this heavily increases the chance that we don't send the probe data to the client in time, which is critical.

Comment on lines +227 to +229

```elixir
# On Windows connecting to an unbound port takes a few seconds to
# fail, so instead we shortcut the check by attempting a listen,
# which succeeds or fails immediately
```

Member

Oh, nice trick!
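
For reference, the shortcut described above amounts to roughly this (an illustrative sketch, not the PR's exact code):

```elixir
# Sketch only: on Windows, instead of connecting (which times out slowly
# for unbound ports), try to listen on the port. Success means the port
# is free, so the lock file referencing it must be stale.
defp port_free?(port) do
  case :gen_tcp.listen(port, ip: {127, 0, 0, 1}) do
    {:ok, listen_socket} ->
      :gen_tcp.close(listen_socket)
      true

    {:error, :eaddrinuse} ->
      false
  end
end
```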

Comment on lines +265 to +273

```elixir
names = File.ls!(path)

for "port_" <> _ = name <- names do
  File.rm!(Path.join(path, name))
end

for "lock_" <> _ = name <- names, name != "lock_0" do
  File.rm!(Path.join(path, name))
end
```

Member

Would this work? Nitpick, because your version works too 🙃

```elixir
for file <- path |> Path.join("{port_,lock_}*") |> Path.wildcard(),
    Path.basename(file) != "lock_0",
    do: File.rm!(file)
```

Member Author

We want to remove all port_* files first, only then all lock_* files. Not sure if wildcard guarantees the order, but even if it does, I'd rather be explicit about the order of operations.

Member

It doesn't guarantee that as far as I know. Why do we need to delete all port files first? Might be worth leaving a comment.

@jonatanklosko (Member, Author), Sep 27, 2024

Sidenote: we actually want all files in the directory, so semantically there's no need for wildcard.

Why do we need to delete all port files first?

This part from the doc comment:

```elixir
# [...] By removing
# all port_P files we make sure that currently paused processes
# that are about to link port_P at lock_N will fail to link, since
# the port_P file will no longer exist (once lock_N is removed).
```

This is best seen with an example, so imagine this:

port_100 (A)
port_200 (B)
lock_0 (stale)
lock_1 (stale)
lock_2 (owned by A)

Process A has port 100 and owns lock_2.
Process B has port 200 and is paused right before File.ln("port_200", "lock_1").

Process A first moves port_100 to lock_0, and now wants to remove files.

Let's say process A removes lock_1 first. Then process B wakes up and successfully links, and at that point process B overtakes A. Conversely, if we remove port_200 first, then when process B wakes up, it will fail to link.

In other words, once we are sure we are the new owner, we remove port_* files to invalidate any peers attempting to also get the lock.

Does that make sense? :)

Member

Yep, makes sense. My head hurts, but it makes sense 😉 Let's maybe leave a comment here too saying that we need to delete these files in the specific order then, wdyt?

Member Author

FWIW the module comment already says they need to be deleted in order. There was also a comment where take_over is called, but I moved it here, since it's more relevant here :)

Member

Perfecto!

@jonatanklosko (Member, Author)

@whatyouhide the diagram looks great :D

@whatyouhide (Member)

@jonatanklosko

the diagram looks great :D

Thanks! Is it accurate though? 😄

@jonatanklosko (Member, Author)

Thanks! Is it accurate though?

That's what I meant!

Though after writing the example, I've just realised there's one missing code path, where "[Linking fails]" happens because the source file doesn't exist (the "invalidated" case). In that case we go back to the loop with n=0. Similarly, after "Someone else has the lock" we also go back to n=0.

```elixir
    :gen_tcp.close(socket)
    {:error, :unexpected_port_owner}

  {:error, :eintr} ->
```

Contributor

I did some research and it seems this is not needed. It's not mentioned in the manual, nor was I able to find any code explicitly checking for that result.

Member Author

If I understand correctly, the C code for the Unix recv is here. There are no checks for EINTR, and the default error code path leads to returning erl_errno_id(saveErrno), which does recognise EINTR, so it seems to me that technically it could be returned.

I tried to test it by setting a SIGALRM handler and sending the signal while :gen_tcp.recv is blocked, and that didn't interrupt the call. But then I realised it's probably because the signal is delivered to the main thread while recv runs in another.
