Unix Domain Socket Listeners #109

Open
wants to merge 7 commits into
base: main

@clundquist-stripe clundquist-stripe commented Aug 31, 2022

Based on: #60

This adds support for Unix Domain Socket listeners.

I'm not super thrilled about all the aspects here, but I wanted to open this sooner, rather than later:

  • Serializing the ruby class isn't super great
  • Abstracting all the tests to run against both listener families

# ip, host:port
(?:(?<host>[^:]+):(?<port>\d+)) |
# unix socket path
(?<path>[^,:]+)

I haven't tested this against abstract namespace sockets, but my wetware suggests it should work

raise "Invalid value for #{addr.inspect}: bind address must be of the form address:port[,flags...] or /path/to/unix/socket[,flags...]"
end

flags = $~["flags"].split(",").reject(&:empty?).map(&:downcase)

We can bind the match to a variable instead of using $~, but this is how the original patch did it.
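For illustration, binding the match to a local could look roughly like the sketch below. The pattern is a simplified stand-in (the full regex from the patch is only partly visible in this diff), and `BIND_PATTERN` and `parse_bind` are hypothetical names, not part of Einhorn.

```ruby
# Simplified, assumed pattern: host:port or a socket path, then optional ,flags.
BIND_PATTERN = /\A(?:(?<host>[^:,]+):(?<port>\d+)|(?<path>[^,:]+))(?<flags>(?:,[^,]*)*)\z/

# Parse a bind string into a plain hash, using a local MatchData instead of $~.
def parse_bind(addr)
  match = BIND_PATTERN.match(addr)
  raise ArgumentError, "Invalid bind address #{addr.inspect}" unless match

  flags = match[:flags].split(",").reject(&:empty?).map(&:downcase)
  { host: match[:host], port: match[:port]&.to_i, path: match[:path], flags: flags }
end
```

Keeping the MatchData in a local avoids depending on `$~`, which is overwritten by the next regex operation in the same scope.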

@@ -26,4 +25,6 @@ Gem::Specification.new do |gem|
gem.add_development_dependency "minitest", "~> 5"
gem.add_development_dependency "mocha", "~> 1"
gem.add_development_dependency "subprocess", "~> 1"
gem.add_development_dependency "pry"

debugging interprocess communication is still hard

end

def family
"AF_INET"

So, I tried making this Socket::AF_INET, but there are some gotchas. This eventually maps to socket.h, and I vaguely recall these values differing between systems (specifically OS X and Linux).

When dumping this state, the string "AF_INET" was more helpful than 2

irb(main):001:0> require 'socket'
irb(main):002:0> Socket::AF_INET
=> 2

This also lets one import state between platforms (probably only helpful for debugging/dev)

(likewise for Unix#family below)
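If a consumer of the dumped state needs the platform's numeric value back, one possible approach is to resolve the name at load time. This is only a sketch: `family_constant` and the allow-list are assumptions for illustration, not something this patch adds.

```ruby
require "socket"

# Names we are willing to resolve; anything else is rejected.
KNOWN_FAMILIES = %w[AF_INET AF_INET6 AF_UNIX].freeze

# Turn a portable family name such as "AF_UNIX" into this platform's integer.
def family_constant(name)
  raise ArgumentError, "unknown socket family #{name.inspect}" unless KNOWN_FAMILIES.include?(name)

  Socket.const_get(name)
end
```

The string stays readable in dumped state, and the integer is only looked up on the machine that actually uses it.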

@@ -349,6 +349,7 @@ def self.prepare_child_environment(index)

ENV["EINHORN_FD_COUNT"] = Einhorn::State.bind_fds.length.to_s
Einhorn::State.bind_fds.each_with_index { |fd, i| ENV["EINHORN_FD_#{i}"] = fd.to_s }
Einhorn::State.bind.each_with_index { |bind, i| ENV["EINHORN_FD_FAMILY_#{i}"] = bind.family.to_s }

Added to_s here to match the line above, and because of the Socket::AF_INET thing above.

It is currently silly, and boils down to: "AF_UNIX".to_s
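On the worker side, consuming these variables might look something like the sketch below. The environment variable names come from the diff above; `inherited_listeners`, `open_listener`, and the fallback to "AF_INET" when the family variable is missing are assumptions made for illustration.

```ruby
require "socket"

# Collect the inherited listeners described by the EINHORN_FD_* variables.
# Takes any Hash-like object so it can be exercised without running Einhorn.
def inherited_listeners(env = ENV)
  count = Integer(env.fetch("EINHORN_FD_COUNT", "0"))
  Array.new(count) do |i|
    {
      fd: Integer(env.fetch("EINHORN_FD_#{i}")),
      # Assumed default: an older master that does not export the family.
      family: env.fetch("EINHORN_FD_FAMILY_#{i}", "AF_INET")
    }
  end
end

# Wrap an inherited descriptor in the matching Ruby server class.
def open_listener(listener)
  case listener[:family]
  when "AF_UNIX" then UNIXServer.for_fd(listener[:fd])
  else TCPServer.for_fd(listener[:fd])
  end
end
```

A worker would then call `inherited_listeners.map { |l| open_listener(l) }` and accept on each server.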

@@ -6,11 +6,11 @@ module SafeYAML
YAML.safe_load("---", permitted_classes: [])
rescue ArgumentError
def self.load(payload)
-YAML.safe_load(payload, [Set, Symbol, Time], [], true)
+YAML.safe_load(payload, [Set, Symbol, Time, Einhorn::Bind::Inet, Einhorn::Bind::Unix], [], true)

This and the line below feel bad. I can parse out the struct / array instead, but it may get dicey compatibility-wise?
Granted, the "downgrade case" here isn't possible.
(if I have a running Einhorn that has Einhorn::Bind::Unix, a downgrade of einhorn can't happen without the class definition)
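One way the "plain struct / array" alternative could look is sketched below: each bind is dumped as a hash of primitives, so YAML.safe_load needs no extra permitted classes. `InetBind`, `UnixBind`, `to_state`, and `bind_from_state` are hypothetical stand-ins, not the classes in this patch.

```ruby
require "yaml"

# Hypothetical stand-ins for Einhorn::Bind::Inet and Einhorn::Bind::Unix.
InetBind = Struct.new(:address, :port, :flags) do
  def to_state
    { "family" => "AF_INET", "address" => address, "port" => port, "flags" => flags }
  end
end

UnixBind = Struct.new(:path, :flags) do
  def to_state
    { "family" => "AF_UNIX", "path" => path, "flags" => flags }
  end
end

# Rebuild a bind object from its primitive-only representation.
def bind_from_state(state)
  case state.fetch("family")
  when "AF_INET" then InetBind.new(state.fetch("address"), state.fetch("port"), state.fetch("flags"))
  when "AF_UNIX" then UnixBind.new(state.fetch("path"), state.fetch("flags"))
  else raise ArgumentError, "unknown bind family #{state["family"].inspect}"
  end
end

def dump_binds(binds)
  YAML.dump(binds.map(&:to_state))
end

def load_binds(payload)
  YAML.safe_load(payload).map { |state| bind_from_state(state) }
end
```

The trade-off is a versioned, hand-written mapping in exchange for not having to extend the safe-load class list.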

@@ -35,9 +35,28 @@ class UpgradeTests < EinhornIntegrationTestCase
@port = find_free_port
@server_program = File.join(@dir, "env_printer.rb")
@socket_path = File.join(@dir, "einhorn.sock")

mangler = ('a'..'z').to_a.shuffle[0,8].join
@unix_listener_socket_path = "unix-listener-einhorn-#{mangler}.sock"

This is in pwd rather than @dir because of Unix domain socket path length limitations (sun_path holds roughly 100 bytes: 108 on Linux, 104 on macOS).
I considered the abstract socket namespace, but didn't go with it, since I figured this was a more common pattern.
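A quick way to check whether a given path fits is to let Ruby build the sockaddr and see if it complains. This is a sketch; `fits_in_sun_path?` is a made-up helper name, and the exact limit depends on the platform.

```ruby
require "socket"

# True when the path is short enough to be packed into a sockaddr_un.
def fits_in_sun_path?(path)
  Socket.pack_sockaddr_un(path)
  true
rescue ArgumentError
  # Ruby raises ArgumentError when the path exceeds the platform's sun_path size.
  false
end
```

A temp directory such as the long /var/folders/... path that macOS hands out can eat most of that budget before the socket file name is even appended.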

# exec the new einhorn with the same environment:
reexec_cmdline = "env VAR=a bundle exec --keep-file-descriptors einhorn"

with_running_einhorn(%W[einhorn -m manual -b #{@unix_listener_socket_path} --reexec-as=#{reexec_cmdline} -d #{@socket_path} -- ruby #{@server_program} VAR],

duplicating every test was daunting, with the slight variation of wait_for_open_{socket, port}
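For readers unfamiliar with the helper pair mentioned here, a Unix-socket flavour of the wait could be sketched as follows. The real helper in the test suite is not shown in this diff, so the signature, timeout, and polling interval below are assumptions.

```ruby
require "socket"

# Poll until something accepts connections on the socket path, or give up.
def wait_for_open_socket(path, timeout: 5, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  begin
    UNIXSocket.new(path).close
    true
  rescue Errno::ENOENT, Errno::ECONNREFUSED
    # Not listening yet: retry until the deadline passes.
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline

    sleep interval
    retry
  end
end
```

Apart from the connect call, the TCP variant would be the same loop, which is why parameterizing the tests over both families is tempting.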

@mperham
Collaborator

mperham commented Aug 31, 2022

I have no idea what a Unix domain socket is. I have no idea why users would want this. Please educate and discuss before throwing a large pull request at an OSS maintainer.

@zanker-stripe
Collaborator

Hey @mperham sorry, I think we had some crossed wires internally! You can ignore this for now, it should be able to get closed out.

@clundquist-stripe
Author

clundquist-stripe commented Aug 31, 2022

While The Manual is terse to say the least, you probably know more than you think!
In fact, the einhorn pid already has a unix domain socket listener for the control socket that einhornsh connects to!

I'll aim to channel Julia Evans's Explanation, but may come up short.

From running the tests, we can see einhorn makes einhorn.sock:

clundquist       49353   0.0  0.1 34782536  28268 s000  S+    3:51PM   0:00.38 einhorn: einhorn -m manual -b 127.0.0.1:56065 --reexec-as=env VAR=b OINK=b bundle exec --keep-file-descriptors einhorn -d /var/folders/7w/_5yxq6ms4wlc6hv117xkws4r0000gn/T/env_printer20220830-49279-nmoaz/einhorn.sock -- ruby /var/folders/7w/_5yxq6ms4wlc6hv117xkws4r0000gn/T/env_printer20220830-49279-nmoaz/env_printer.rb VAR                

Zooming in, .../einhorn.sock is a Unix domain socket!

and finding one in the wild here:

ps aux | grep einhorn
# ...
example+     689  0.0  0.0  83480  2860 ?        S    Aug26   0:00  |   |   \_ einhorn: ruby example-srv
sudo lsof -p 689
COMMAND PID      USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
# ...
einhorn 689 example    7u     IPv4              25204      0t0     TCP localhost:15555 (LISTEN)
einhorn 689 example    8u     unix 0xffff92d61056e640      0t0   25198 /tmp/einhorn-example-srv.sock type=STREAM
einhorn 689 example    9u     IPv4              25205      0t0     TCP localhost:15556 (LISTEN)

Indeed lsof shows us that FD 8 above is a unix domain socket!

So here we see some examples, but what are they? Why do we care about them? How can they help us solve real problems? All great questions! Let's take them one by one.

What are they?

At a high level, they're a file handle you can read from and write to, that (should) have a process on the other side reading and writing back. Remember that, as the above lsof shows, the TCP listeners are also "just file handles where something is reading and writing back".

They're similar to a FIFO or named pipe, but offer some other properties around permissions, ordering, and bidirectional data flow. You can connect to a Unix socket, much like with TCP and UDP sockets.

sudo strace -- ruby -e 'require "socket"; UNIXSocket.new("/tmp/einhorn-example-srv.sock")' |& grep -C3 connect
mprotect(0x7f0f4403b000, 4096, PROT_READ) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 5
fcntl(5, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
connect(5, {sa_family=AF_UNIX, sun_path="/tmp/einhorn-example-srv.sock"}, 110) = 0
fstat(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
rt_sigaction(SIGINT, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f0f47b68090}, {sa_handler=0x7f0f47ebf570, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f0f47b68090}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f0f47b68090}, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f0f47b68090}, 8) = 0

All this to say, they're a way to accept connections, from things that share the same filesystem [namespace].
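To make that concrete, here is a minimal sketch of a listener and a client talking over a socket file, using only Ruby's standard library. `unix_echo_roundtrip` and the temp-directory path are made up for this example.

```ruby
require "socket"
require "tmpdir"

# Start a one-shot echo server on a socket file, send it a line, return the reply.
def unix_echo_roundtrip(message)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "echo.sock")
    server = UNIXServer.new(path) # bind + listen on the filesystem path

    echo = Thread.new do
      conn = server.accept
      conn.write(conn.gets) # read one line and write it straight back
      conn.close
    end

    client = UNIXSocket.new(path) # connect(2) with AF_UNIX, as in the strace output
    client.puts(message)
    reply = client.gets
    client.close
    echo.join
    server.close
    reply&.chomp
  end
end
```

Both ends only need to see the same path, so this works between any two processes sharing that filesystem view.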

Why do we care about them?

I think Julia answers this better, but in addition to her explanation, UDS listeners are a bit faster, and are sometimes a way to sidestep listening port conflicts. (While there is SO_REUSEPORT for TCP, in general only a single process can bind a given listening port, causing the dreaded EADDRINUSE error when a second process tries to listen.)

(Whereas with UDS, you could design things to listen at /var/run/example.<sha>.sock or /deploy/example/<sha>/run/example.sock rather than fighting over port 8080. In fact, Einhorn already does this for its distinct control sockets when you have multiple services running under einhorn!)
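The contrast can be sketched in a few lines of Ruby. The helper names and the `example.<sha>.sock` style file names are illustrative only.

```ruby
require "socket"
require "tmpdir"

# Try to listen twice on the same loopback port; report the error class, if any.
def second_tcp_bind_error
  first = TCPServer.new("127.0.0.1", 0) # port 0: let the kernel pick a free port
  port = first.addr[1]
  TCPServer.new("127.0.0.1", port)      # second listener on the same address:port
  nil
rescue Errno::EADDRINUSE => e
  e.class
ensure
  first&.close
end

# Two listeners on distinct socket paths coexist without any coordination.
def distinct_unix_paths_ok?
  Dir.mktmpdir do |dir|
    servers = %w[abc123 def456].map do |sha|
      UNIXServer.new(File.join(dir, "example.#{sha}.sock"))
    end
    servers.each(&:close)
    true
  end
end
```

With paths, the "namespace" is the filesystem, so each deploy can carve out its own name instead of competing for a single port number.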

How can they help us solve real problems?

Here is where we get to What I'm Really Trying To Do!
Spoiler: I'm hitting ephemeral port exhaustion

As our tests show, we basically always bind to TCP 127.0.0.1:@port. You've been doing this a minute, and I'm sure long ago you found that Ruby (and many other languages) is Not Good at serving index.css. So, you put your favorite web server in front of your app, served the static assets from the web server, then forwarded the dynamic stuff to the app, which listened on localhost, boom, done!

As we both know, Ruby services generally pre-fork, and have a one-process-per-request model. When we're running O(10s) or even O(100s) of Ruby Workers, listening on localhost, this works fine. (The core thing here is, you'll only ever have concurrency equal to the number of app worker pids from einhorn's perspective, unless you have EventMachine wired up just right in your ruby app)

However, let's say we swapped out Ruby for something like Golang or Java, where threading and concurrency are a bit easier. nginx and proxy_pass will only use HTTP/1.1, which means you use one file descriptor per concurrent request. Likewise, if you're using Einhorn to front an L4/TCP service (haproxy, a database, redis, memcached, or something else that isn't HTTP), you'll use one file descriptor per connection.

Let's stick to the nginx example.

sudo netstat -plant | grep 15555 | grep example | head
tcp        0      0 127.0.0.1:15555         127.0.0.1:47244         ESTABLISHED 1387/example        
tcp        0      0 127.0.0.1:15555         127.0.0.1:47272         ESTABLISHED 1387/example        
tcp        0      0 127.0.0.1:15555         127.0.0.1:50510         ESTABLISHED 1387/example        
tcp        0      0 127.0.0.1:15555         127.0.0.1:49380         ESTABLISHED 1387/example        
tcp        0      0 127.0.0.1:15555         127.0.0.1:47232         ESTABLISHED 1387/example        
tcp        0      0 127.0.0.1:15555         127.0.0.1:34398         ESTABLISHED 1387/example        
tcp        0      0 127.0.0.1:15555         127.0.0.1:50666         ESTABLISHED 1387/example        
tcp        0      0 127.0.0.1:15555         127.0.0.1:50550         ESTABLISHED 1387/example        
tcp        0      0 127.0.0.1:15555         127.0.0.1:49366         ESTABLISHED 1387/example        
tcp        0      0 127.0.0.1:15555         127.0.0.1:49556         ESTABLISHED 1387/example        
tcp        0      0 127.0.0.1:15555         127.0.0.1:48472         ESTABLISHED 1387/example
#...        

This is where you might end up. Each one of these is a connection from something like nginx to my example service, doing an HTTP request, and I have this many concurrent inflight requests.

One nitty-gritty detail that is important here is that a TCP connection is identified by a unique 5-tuple.
What is in this 5-tuple?

arbitrary citation

It includes a source IP address/port number, destination IP address/port number and the protocol in use.

So above, that's tcp, 127.0.0.1, 15555, 127.0.0.1, <FREE EPHEMERAL PORT>
This means we only have one "free variable" here, because we're listening on the same protocol and port, and both ends are connecting via 127.0.0.1!
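The single free variable is easy to observe from Ruby. The sketch below opens a few loopback connections to one listener and collects each tuple as seen from the client side; `loopback_five_tuples` is a hypothetical helper written for this explanation.

```ruby
require "socket"

# Open several connections to one loopback listener and return their 5-tuples
# as [protocol, client ip, client port, server ip, server port].
def loopback_five_tuples(connections: 3)
  server = TCPServer.new("127.0.0.1", 0)
  port = server.addr[1]
  clients = Array.new(connections) { TCPSocket.new("127.0.0.1", port) }

  clients.map do |client|
    local = client.local_address   # client side: the ephemeral port lives here
    remote = client.remote_address # server side: always the listening port
    ["tcp", local.ip_address, local.ip_port, remote.ip_address, remote.ip_port]
  end
ensure
  clients&.each(&:close)
  server&.close
end
```

Every tuple shares the protocol, both addresses, and the listening port; only the client's ephemeral port changes from one connection to the next.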

We can tweak this a little bit with sysctl and net.ipv4.ip_local_port_range (manual), but you run out of juice a bit before 64k, meaning you can only have ~64k unique 5-tuples when talking to yourself over localhost! This means that once you run out of ephemeral ports (and the rest of the 5-tuple is the same), our child process can no longer accept connections!
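As a rough sanity check on those numbers, the size of the range is simple arithmetic. The sketch below assumes the usual Linux format of two whitespace-separated integers; `ephemeral_port_count` is a made-up helper, and the /proc path only exists on Linux.

```ruby
# Number of ports in an inclusive "low high" range, as printed by
# `sysctl net.ipv4.ip_local_port_range`.
def ephemeral_port_count(range_text)
  low, high = range_text.split.map { |n| Integer(n) }
  high - low + 1
end

# The common Linux default range versus a range widened down to port 1024.
default_count = ephemeral_port_count("32768 60999") # 28232
widened_count = ephemeral_port_count("1024 65535")  # 64512

# On Linux, the live setting can be read directly (skipped elsewhere).
PORT_RANGE_FILE = "/proc/sys/net/ipv4/ip_local_port_range"
if File.readable?(PORT_RANGE_FILE)
  puts ephemeral_port_count(File.read(PORT_RANGE_FILE))
end
```

So the default leaves well under 30k concurrent loopback connections per listening address and port, and even an aggressively widened range stops short of 64k.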

For our web services, we've sidestepped this by using HTTP/2, which allows multiplexing requests over the same connection, but for TCP/L4-like services we can't do that, as those protocols don't have the same multiplexing.

All this to say, the goal here is to accept more than 64k connections per einhorn child group (since the 5 tuple will be shared with all siblings listening on the same FD)

@mperham
Collaborator

mperham commented Sep 1, 2022

Thanks, I appreciate the explanation and it sounds like that really could be a big issue for use at scale!

If you wish to close this PR, go ahead or we can continue discussing a merge. I'd like a wiki page or .md doc written on the feature, explaining how to configure and use it. Your comment would be an excellent start for that doc.

4 participants