
Very slow WebSocket binary frame parsing with Firefox client #389

Closed · liamwhite opened this issue Aug 10, 2024 · 6 comments · Fixed by #390

@liamwhite (Contributor) commented Aug 10, 2024

Test machine is a Ryzen 9 5950X
Elixir 1.15.7
Erlang/OTP 26
Linux 6.8.11

[Screenshots: throughput readings for Firefox 126.0 (Bandit), Chromium 125.0.6422.141 (Bandit), Firefox 126.0 (Cowboy), and Chromium 125.0.6422.141 (Cowboy)]

Cowboy is slightly slower than Bandit on Chromium, but Bandit is much slower than Cowboy on Firefox. This is 100% repeatable. Follow-up of phoenixframework/phoenix_live_view#3386.

Demo program

Mix.install([:bandit, :websock_adapter])

defmodule ParseDemoServer do
  def init(options) do
    {:ok, options}
  end

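  # Reply with a short text frame to each binary frame so the browser
  # client below can time one upload round trip per message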
  def handle_in({_data, [opcode: :binary]}, state) do
    {:reply, :ok, {:text, "ok"}, state}
  end

  def terminate(_reason, state) do
    {:ok, state}
  end
end

defmodule Router do
  use Plug.Router

  plug Plug.Logger
  plug :match
  plug :dispatch

  get "/" do
    conn
    |> put_resp_content_type("text/html")
    |> send_resp(200, """
    <!DOCTYPE html>
    <html>
    <body>
    <p id="performance"></p>
    <script>
    const size = 1 << 20; // 1 MiB
    async function nonCryptographicRandomBlock() {
      const key = await crypto.subtle.generateKey(
        {
          name: 'AES-CTR',
          length: 128
        },
        false,
        ['encrypt']
      );

      const buffer = new Uint8Array(size);
      const counter = new Uint8Array(16);

      return crypto.subtle.encrypt(
        {
          name: 'AES-CTR',
          counter: counter,
          length: 64
        },
        key,
        buffer
      );
    }

    const socket = new WebSocket('ws://localhost:4000/websocket');
    const performanceIndicator = document.getElementById('performance');

    let messageStart = performance.now();

    async function sendMessage() {
      const difference = performance.now() - messageStart;
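      // difference is in ms, so difference * 1000 is µs; bytes per µs equals MB/s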
      const throughput = (size / (difference * 1000));
      performanceIndicator.innerText = `Last request: ${Math.round(difference)} ms, ${Math.round(throughput)} MB/s`;

      socket.send(await nonCryptographicRandomBlock());
      messageStart = performance.now();
    }

    socket.addEventListener('open', sendMessage);
    socket.addEventListener('message', sendMessage);
    </script>
    </body>
    </html>
    """)
  end

  get "/websocket" do
    conn
    |> WebSockAdapter.upgrade(ParseDemoServer, [], timeout: 60_000)
    |> halt()
  end

  match _ do
    send_resp(conn, 404, "not found")
  end
end

require Logger
webserver = {Bandit, plug: Router, scheme: :http, port: 4000}
{:ok, _} = Supervisor.start_link([webserver], strategy: :one_for_one)
Logger.info("Plug now running on localhost:4000")
Process.sleep(:infinity)
@liamwhite (Contributor, Author) commented:

(state.buffer <> data)

eprof pointed me to this line. Some logging suggests massive numbers of binaries are being created when uploading large frames. The surrounding logic (in this file and in Frame) is suboptimal: ideally it should consume just enough bytes to parse the header, then collect the payload as iodata and do a single conversion at the end to avoid wasting memory.

Bandit will receive 1460 bytes in the handler on each transmission, the typical TCP maximum segment size. Given a frame payload of 1048576 bytes (1 MiB), this results in 1048576/1460 ≈ 719 calls. Each call allocates a binary that is 1460 bytes longer than the previous one. Applying the sum formula for the first n natural numbers, that works out to (1460)(719)(719 + 1) / 2 = 377906400 bytes of binaries allocated and garbage collected to receive a single 1 MiB frame (though not necessarily all live at the same time).
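To make the allocation pattern concrete, here is a minimal, hypothetical sketch (not Bandit's actual code) contrasting per-packet binary concatenation with iodata accumulation plus a single conversion; packets stands in for the ~719 recv results of one 1 MiB frame:

# Hypothetical stand-in for the ~719 recv results of one 1 MiB frame
packets = for _ <- 1..719, do: :crypto.strong_rand_bytes(1460)

# Quadratic: each <> copies every byte accumulated so far, which is what
# produces the ~378 MB of intermediate binaries computed above
quadratic = Enum.reduce(packets, <<>>, fn data, buffer -> buffer <> data end)

# Linear: collect the packets as iodata and convert once at the end, so
# each packet's bytes are copied only a single time
linear =
  packets
  |> Enum.reduce([], fn data, acc -> [acc | data] end)
  |> IO.iodata_to_binary()

true = quadratic == linear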

@liamwhite (Contributor, Author) commented Aug 11, 2024

I built a demo of the iodata idea with very poor code quality; hopefully you can improve it:
liamwhite@2afa7d5

With that I can get Last request: 5ms, 210 MB/s in Firefox, and Last request: 4ms, 291 MB/s in Chromium. So this is an all-around speedup and Bandit can easily beat Cowboy with this change.

@liamwhite (Contributor, Author) commented:

I also investigated a bottleneck with unmasking large frames. While less severe, I unfortunately suspect it may not be solvable without help from a NIF.

Right now Frame.mask/2 uses a tail-recursive function to perform the masking.

  • Changing to a binary comprehension, for <<h::32 <- words>>, into: <<>>, do: <<Bitwise.bxor(h, mask)::32>>, improved throughput a little over the tail-recursive solution.
  • Changing to :crypto.exor(words, :binary.copy(<<mask::32>>, size_in_words)) improved throughput a little over the comprehension, at the expense of wasting a lot of memory on the repeated key.

I cannot reliably get more than 300 MB/s of throughput, and I suspect it probably isn't possible without dropping to native code.
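For reference, a minimal sketch of the three strategies compared above, with illustrative names (this is not Bandit's actual Frame module); the payload is assumed here to be a whole number of 32-bit words, and real code must also handle the 0..3 trailing bytes:

defmodule MaskSketch do
  import Bitwise

  # 1. Tail-recursive XOR, one 32-bit word at a time
  def mask_recursive(payload, mask), do: do_mask(payload, mask, <<>>)

  defp do_mask(<<h::32, rest::binary>>, mask, acc),
    do: do_mask(rest, mask, <<acc::binary, bxor(h, mask)::32>>)

  defp do_mask(<<>>, _mask, acc), do: acc

  # 2. Binary comprehension over 32-bit words
  def mask_comprehension(payload, mask) do
    for <<h::32 <- payload>>, into: <<>>, do: <<bxor(h, mask)::32>>
  end

  # 3. Single :crypto.exor against a repeated copy of the key; fast, but
  #    allocates a payload-sized key binary
  def mask_crypto(payload, mask) do
    key = :binary.copy(<<mask::32>>, div(byte_size(payload), 4))
    :crypto.exor(payload, key)
  end
end

Since XOR masking is an involution, applying any of these twice with the same key returns the original payload, which makes the three variants easy to cross-check against each other.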

@mtrudel (Owner) commented Aug 12, 2024

The need to work with binaries when pattern matching against frames is a big performance cost here. In my experience to date it hasn't been too bad, mostly because the underlying system calls return more than a single packet's worth of data per recv call. I'm guessing that there's something about how FF sends on the client side that breaks that assumption, leading to exactly the performance issue you identify.

I like your fix! Could you open a PR for it and we can clean it up a bit & merge? Something similar would need to be done for HTTP/2 frames as well, but I think it makes sense to establish a pattern with WebSockets before copying it over there.

We've also mused about folding all of this upstream into ThousandIsland, where it could be implemented as a runtime-configurable 'framer' module on a connection. More of a 2.0 thing, though.

We've spent a bunch of effort golfing the performance of masking to about as good as we can make it across all sizes and shapes of frames; I don't think that's much of the issue here.

@liamwhite (Contributor, Author) commented:

"I don't think that that's much of the issue here."

My benchmarking suggests frame masking is by far the most significant remaining bottleneck to reaching higher throughput with socket frames. I'm open to data that would change my mind about that, of course.

@mtrudel (Owner) commented Aug 12, 2024

I think we should focus on getting the iolist work in #390 landed, then we can look at the remaining perf issues. Truthfully, it's been a while since I looked at the masking work. Maybe @moogle19 has some recollection?
