Wrap compilation in a cross-os-process lock #13860
base: main
Conversation
Nice to see my idea taken further and exploring different approaches @jonatanklosko. My main concern is interaction with corporate firewalls and EDRs that may block socket open/bind. It would be troublesome if running Elixir required security approval on dev machines. If anybody is interested, here's my Unix domain socket based PoC I did back in April. The main issue was Windows compatibility: it would require OTP 26, and only stream mode works there (no datagram support yet). See https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/ for details. Still, it would be nice if file locks were available, but my PR adding that has been in limbo since 2022.
@lukaszsamson @jonatanklosko tried the socket approach too, but there are bugs in OTP 27.1, so that would require OTP 27.2+. There is also a 100-character limit on the socket path, but we may be able to work around that.
Do you think this will be the case? If we cannot open a socket, does that mean they cannot start a Phoenix server either?
@lukaszsamson I was concerned about firewalls because, when I tried opening a TCP socket, the Windows firewall alert would pop up talking about network access, but I realised it was because it was listening on 0.0.0.0 by default. Listening on loopback avoided the alert, which makes sense, because it skips the network. I don't know about other firewalls, but then running any local server would hit the same issue.
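As a rough sketch (not this PR's actual code), binding a listener to the loopback interface in Elixir looks like this:

```elixir
# Listen on the loopback interface only, so the socket never touches
# the network and firewall prompts about network access are avoided.
# Port 0 asks the OS to pick any free port.
{:ok, listen_socket} = :gen_tcp.listen(0, ip: {127, 0, 0, 1}, active: false)

# Read back the port the OS assigned to us.
{:ok, port} = :inet.port(listen_socket)
```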
Algorithm looks solid at first glance. I left a pass of mostly code-related comments, but I'll take a look at this again later 🙃
There are a few ways the lock or unlock may fail leaving lock files. Would a
Not just lock and unlock: the process may be abruptly terminated at any given point. The locking procedure is designed precisely to handle this and to avoid race conditions while doing so. I went through a bunch of other ideas and I believe this one is resilient to failures. To be more specific about your question, leaving the files behind is not an issue, because another process can check whether a file is stale by probing the corresponding port.
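For illustration, a staleness check along those lines could look like this (a hedged sketch; the file format and function name are assumptions, not code from this PR):

```elixir
# Read the port recorded in the lock file and try to connect to it.
# If nobody is listening, the lock owner is gone and the file is stale.
defp stale_lock?(lock_path) do
  port = lock_path |> File.read!() |> String.to_integer()

  case :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: false], 5_000) do
    {:ok, socket} ->
      :gen_tcp.close(socket)
      false

    {:error, _reason} ->
      true
  end
end
```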
Left another round of comments, fantastic work here. Also took the liberty of turning the algo into a draft of a Mermaid diagram, in case it helps others.
```mermaid
sequenceDiagram
    participant process
    participant port_P as port_P file (for port P)
    participant lock_n
    participant lock_np1 as lock_(n+1)
    process->>process: Open TCP socket on port P
    process->>port_P: Write TCP socket port
    loop Try to acquire lock through port P, starting at n=0
        process->>lock_n: Attempt to link port_P to lock_n
        alt Linking succeeds
            port_P->>lock_n: Link
            Note over port_P,lock_n: Process owns the lock
            process->>process: Remove all port_* files
            process->>process: Remove all lock_* files<br/>(except for lock_0)
        else Linking fails
            process->>lock_n: Connect to port in lock_n
            alt Connection succeeds
                lock_n->>process: Socket sends probe data
                Note over lock_n,process: This ensures that the socket is<br/>related to the Mix compilation, not<br/>just a random socket
                process-->>process: Wait for socket to close
                Note over process,process: Someone else has the lock
            else Connection fails (lock is stale)
                process->>lock_np1: Repeat loop with n+1
            end
        end
    end
```
```elixir
defp accept_loop(listen_socket) do
  case :gen_tcp.accept(listen_socket) do
```
If any other error happens here, we'll get a nasty CaseClauseError. Some of those errors are transient (:system_limit?), some are not. Maybe we do a simple algorithm of:

1. Accept.
2. If we get {:error, reason} and we don't know whether the reason is transient (basically a catch-all), sleep for about a second and go back to 1.

With a max of 2 or 3 retries for now? I know that just rerunning mix compile would likely achieve the same, but it seems like a small price to pay for a potentially better UX. Thoughts?
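A minimal sketch of that retry loop (the retry count, sleep duration, and `handle_connection/1` helper are illustrative, not this PR's code):

```elixir
defp accept_loop(listen_socket, retries \\ 3) do
  case :gen_tcp.accept(listen_socket) do
    {:ok, socket} ->
      handle_connection(socket)
      accept_loop(listen_socket, retries)

    # Catch-all: we don't know whether the error is transient, so
    # sleep briefly and retry a bounded number of times.
    {:error, _reason} when retries > 0 ->
      Process.sleep(1_000)
      accept_loop(listen_socket, retries - 1)

    {:error, reason} ->
      raise "failed to accept lock connection: #{inspect(reason)}"
  end
end
```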
I'm a bit concerned that if we fail to accept, the client fails to connect and assumes the socket is stale. So it's probably safer to crash the lock owner to avoid corruption? We could raise a more specific error to avoid CaseClauseError.

When we reach :system_limit on open files/sockets, it usually blows everything up anyway? Are there any other transient errors you worry about? I'm looking at the list in the Linux manual and I'm not sure it makes sense to retry any of them (other than EINTR).
The accepting process is sort of holding its own little lock, so this confuses me a bit:

> it means the client fails to connect and assumes the socket is stale

The client would fail to connect with :closed if the socket is closed, not with other errors, no?
Ah, you're right, I think these are not tied. The client is put in the accept queue and at that point it connects successfully; whether :gen_tcp.accept grabs it from the queue is a separate matter.

But if the idea is to sleep and retry accepting, that heavily increases the chance that we don't send the probe data to the client in time, which is critical.
```elixir
# On Windows connecting to an unbound port takes a few seconds to
# fail, so instead we shortcut the check by attempting a listen,
# which succeeds or fails immediately
```
Oh, nice trick!
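For reference, the shortcut reads roughly like this (a sketch; the `port_free?/1` helper name is an assumption):

```elixir
# On Windows, connecting to an unbound port takes seconds to fail, so
# instead try to listen on it: success means nothing is bound there
# (the lock is stale); an error such as :eaddrinuse means the owner
# is still alive.
defp port_free?(port) do
  case :gen_tcp.listen(port, ip: {127, 0, 0, 1}) do
    {:ok, socket} ->
      :gen_tcp.close(socket)
      true

    {:error, _reason} ->
      false
  end
end
```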
```elixir
names = File.ls!(path)

for "port_" <> _ = name <- names do
  File.rm!(Path.join(path, name))
end

for "lock_" <> _ = name <- names, name != "lock_0" do
  File.rm!(Path.join(path, name))
end
```
Would this work? Nitpick, because your version works too 🙃
```elixir
for file <- path |> Path.join("{port_,lock_}*") |> Path.wildcard(),
    Path.basename(file) != "lock_0",
    do: File.rm!(file)
```
We want to remove all port_* files first, only then all lock_* files. Not sure if wildcard guarantees the order, but even if it does, I'd rather be explicit about the order of operations.
It doesn't guarantee that as far as I know. Why do we need to delete all port files first? Might be worth leaving a comment.
Sidenote: we actually want all files in the directory, so semantically there's no need for wildcard.
> Why do we need to delete all port files first?
This part from the doc comment:
```elixir
# [...] By removing
# all port_P files we make sure that currently paused processes
# that are about to link port_P at lock_N will fail to link, since
# the port_P file will no longer exist (once lock_N is removed).
```
This is best seen with an example, so imagine this:
```text
port_100 (A)
port_200 (B)
lock_0 (stale)
lock_1 (stale)
lock_2 (owned by A)
```
Process A has port 100 and owns lock_2.
Process B has port 200 and is paused right before `File.ln("port_200", "lock_1")`.
Process A first moves port_100 to lock_0. And now wants to remove files.
Let's say process A were to remove lock_1 first. Then process B wakes up and successfully links; at this point process B will overtake A. Conversely, if we remove port_200 first, when process B wakes up, it will fail to link.
In other words, once we are sure we are the new owner, we remove port_* files to invalidate any peers attempting to also get the lock.
Does that make sense? :)
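The reason linking works as a mutual exclusion primitive here is that creating a hard link fails when the target name already exists. A tiny illustration (file names taken from the example above):

```elixir
File.write!("port_100", "100")
# The first link succeeds: this process now owns lock_0.
:ok = File.ln("port_100", "lock_0")

File.write!("port_200", "200")
# Linking to an existing name fails, so the second process loses the race.
{:error, :eexist} = File.ln("port_200", "lock_0")
```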
Yep, makes sense. My head hurts, but it makes sense 😉 Let's maybe leave a comment here too saying that we need to delete these files in the specific order then, wdyt?
FWIW the module comment already says they need to be deleted in order. There was also a comment where `take_over` is called, but I moved it here, since it's more relevant here :)
Perfecto!
Co-authored-by: Andrea Leopardi <[email protected]>
@whatyouhide the diagram looks great :D
Thanks! Is it accurate though? 😄
That's what I meant! Though after writing the example, I've just realised there's one missing code path, where "[Linking fails]" happens because the source file doesn't exist (the "invalidated" case). In that case we go back to the loop with n=0. Similarly, after "Someone else has the lock" we also go back to n=0.
```elixir
    :gen_tcp.close(socket)
    {:error, :unexpected_port_owner}

  {:error, :eintr} ->
```
I did some research and it seems this is not needed. It's not mentioned in the manual, nor was I able to find any code explicitly checking for that result.
If I understand correctly, the C code for the Unix recv is here. There are no checks for EINTR, and the default error code path leads to returning erl_errno_id(saveErrno), which does recognise EINTR, so it seems to me that technically it could be returned.

I tried to test it by setting a SIGALRM handler and sending the signal while :gen_tcp.recv is blocked, and that didn't interrupt the call. But then I realised it's probably because the signal is delivered to the main thread while recv runs in another.
Co-authored-by: Jean Klingler <[email protected]>
This wraps mix compile in a lock, such that only one compilation runs at a time. This helps to ensure that no duplicated compilation work is done for the given target.

More specifically, we wrap tasks such as `compile.all`, `compile.elixir`, `compile.erlang`, and `compile.deps` in the lock. Technically, wrapping `compile.all` should be enough, but child tasks (such as `compile.elixir`) can still be invoked directly, hence we wrap each of them in the lock (lock calls can be nested safely).

As for the lock implementation, we initially planned to use Unix domain sockets, but that turned out to be non-ideal. Currently the `:socket` implementation on Windows has an important bug (though it's likely to be patched soon). The main reason we turned down this approach, however, is that the socket path is limited to 104 characters. This makes it too brittle, because even the temporary directory path can be long (on macOS or via a custom TMPDIR), and combined with the additional naming we would need, we would be close to the limit.

Instead, the current implementation relies on TCP sockets and hard links. The module has a detailed comment outlining the algorithm. For the record, both implementations can be found in jonatanklosko/elixir_lock. The implementation in this PR has a different public API, tailored for the Mix use case.
Demo
elixircompilelock.mp4