
Very slow WebSocket binary frame parsing with Firefox client #389

Closed · liamwhite opened this issue Aug 10, 2024 · 6 comments · Fixed by #390

@liamwhite (Contributor) commented Aug 10, 2024

Test machine is a Ryzen 9 5950X
Elixir 1.15.7
Erlang/OTP 26
Linux 6.8.11

[Screenshots: throughput readings for Firefox 126.0 (Bandit), Chromium 125.0.6422.141 (Bandit), Firefox 126.0 (Cowboy), and Chromium 125.0.6422.141 (Cowboy)]

Cowboy is slightly slower than Bandit on Chromium, but Bandit is much slower than Cowboy on Firefox. This is 100% repeatable. Follow-up of phoenixframework/phoenix_live_view#3386.

Demo program

Mix.install([:bandit, :websock_adapter])

defmodule ParseDemoServer do
  def init(options) do
    {:ok, options}
  end

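  # Reply with a short text frame to each binary frame so the browser
  # client below can time one upload round trip per message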
  def handle_in({_data, [opcode: :binary]}, state) do
    {:reply, :ok, {:text, "ok"}, state}
  end

  def terminate(_reason, state) do
    {:ok, state}
  end
end

defmodule Router do
  use Plug.Router

  plug Plug.Logger
  plug :match
  plug :dispatch

  get "/" do
    conn
    |> put_resp_content_type("text/html")
    |> send_resp(200, """
    <!DOCTYPE html>
    <html>
    <body>
    <p id="performance"></p>
    <script>
    const size = 1 << 20; // 1 MiB
    async function nonCryptographicRandomBlock() {
      const key = await crypto.subtle.generateKey(
        {
          name: 'AES-CTR',
          length: 128
        },
        false,
        ['encrypt']
      );

      const buffer = new Uint8Array(size);
      const counter = new Uint8Array(16);

      return crypto.subtle.encrypt(
        {
          name: 'AES-CTR',
          counter: counter,
          length: 64
        },
        key,
        buffer
      );
    }

    const socket = new WebSocket('ws://localhost:4000/websocket');
    const performanceIndicator = document.getElementById('performance');

    let messageStart = performance.now();

    async function sendMessage() {
      const difference = performance.now() - messageStart;
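      // difference is in ms, so difference * 1000 is µs; bytes per µs equals MB/s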
      const throughput = (size / (difference * 1000));
      performanceIndicator.innerText = `Last request: ${Math.round(difference)} ms, ${Math.round(throughput)} MB/s`;

      socket.send(await nonCryptographicRandomBlock());
      messageStart = performance.now();
    }

    socket.addEventListener('open', sendMessage);
    socket.addEventListener('message', sendMessage);
    </script>
    </body>
    </html>
    """)
  end

  get "/websocket" do
    conn
    |> WebSockAdapter.upgrade(ParseDemoServer, [], timeout: 60_000)
    |> halt()
  end

  match _ do
    send_resp(conn, 404, "not found")
  end
end

require Logger
webserver = {Bandit, plug: Router, scheme: :http, port: 4000}
{:ok, _} = Supervisor.start_link([webserver], strategy: :one_for_one)
Logger.info("Plug now running on localhost:4000")
Process.sleep(:infinity)
@liamwhite (Contributor, Author) commented:

(state.buffer <> data)

eprof pointed me to this line. Some logging suggests massive numbers of binaries are being created when uploading large frames. The surrounding logic (in this file and in Frame) is suboptimal: ideally it should consume just enough bytes to parse the header, then collect the payload as iodata and do a single conversion at the end to avoid wasting memory.

Bandit will receive 1460 bytes in the handler on each transmission, the typical TCP maximum segment size. Given a frame payload of 1048576 bytes (1 MiB), this results in 1048576/1460 ≈ 719 calls. Each call allocates a binary that is 1460 bytes longer than the previous one. Applying the sum formula for the first n natural numbers, that works out to (1460)(719)(719 + 1) / 2 = 377906400 bytes of binaries allocated and garbage collected to receive a single 1 MiB frame (though not necessarily all live at the same time).
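To make the allocation pattern concrete, here is a minimal, hypothetical sketch (not Bandit's actual code) contrasting per-packet binary concatenation with iodata accumulation plus a single conversion; packets stands in for the ~719 recv results of one 1 MiB frame:

# Hypothetical stand-in for the ~719 recv results of one 1 MiB frame
packets = for _ <- 1..719, do: :crypto.strong_rand_bytes(1460)

# Quadratic: each <> copies every byte accumulated so far, which is what
# produces the ~378 MB of intermediate binaries computed above
quadratic = Enum.reduce(packets, <<>>, fn data, buffer -> buffer <> data end)

# Linear: collect the packets as iodata and convert once at the end, so
# each packet's bytes are copied only a single time
linear =
  packets
  |> Enum.reduce([], fn data, acc -> [acc | data] end)
  |> IO.iodata_to_binary()

true = quadratic == linear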

@liamwhite (Contributor, Author) commented Aug 11, 2024

I built a demo of the iodata idea with very poor code quality; hopefully you can improve it:
liamwhite@2afa7d5

With that I can get Last request: 5ms, 210 MB/s in Firefox, and Last request: 4ms, 291 MB/s in Chromium. So this is an all-around speedup and Bandit can easily beat Cowboy with this change.

@liamwhite (Contributor, Author) commented:

I also investigated a bottleneck with unmasking large frames. While less severe, I unfortunately suspect it may not be solvable without help from a NIF.

Right now Frame.mask/2 uses a tail-recursive function to perform the masking.

  • Changing to a binary comprehension, for <<h::32 <- words>>, into: <<>>, do: <<Bitwise.bxor(h, mask)::32>>, improved throughput a little over the tail-recursive solution.
  • Changing to :crypto.exor(words, :binary.copy(<<mask::32>>, size_in_words)) improved throughput a little over the comprehension, at the expense of wasting a lot of memory on the repeated key.

I cannot reliably get more than 300 MB/s of throughput, and I suspect it probably isn't possible without dropping to native code.
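For reference, a minimal sketch of the three strategies compared above, with illustrative names (this is not Bandit's actual Frame module); the payload is assumed here to be a whole number of 32-bit words, and real code must also handle the 0..3 trailing bytes:

defmodule MaskSketch do
  import Bitwise

  # 1. Tail-recursive XOR, one 32-bit word at a time
  def mask_recursive(payload, mask), do: do_mask(payload, mask, <<>>)

  defp do_mask(<<h::32, rest::binary>>, mask, acc),
    do: do_mask(rest, mask, <<acc::binary, bxor(h, mask)::32>>)

  defp do_mask(<<>>, _mask, acc), do: acc

  # 2. Binary comprehension over 32-bit words
  def mask_comprehension(payload, mask) do
    for <<h::32 <- payload>>, into: <<>>, do: <<bxor(h, mask)::32>>
  end

  # 3. Single :crypto.exor against a repeated copy of the key; fast, but
  #    allocates a payload-sized key binary
  def mask_crypto(payload, mask) do
    key = :binary.copy(<<mask::32>>, div(byte_size(payload), 4))
    :crypto.exor(payload, key)
  end
end

Since XOR masking is an involution, applying any of these twice with the same key returns the original payload, which makes the three variants easy to cross-check against each other.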

@mtrudel (Owner) commented Aug 12, 2024

The need to work with binaries when pattern matching against frames is a big performance cost here. In my experience to date it hasn't been too bad, mostly because the underlying system calls return more than a single packet's worth of data per recv call. I'm guessing that there's something about how FF sends on the client side that breaks that assumption, leading to exactly the performance issue you identify.

I like your fix! Could you open a PR for it and we can clean it up a bit & merge? Something similar would need to be done for HTTP/2 frames as well, but I think it makes sense to establish a pattern with WebSockets before copying it over there.

We've also mused about folding all of this upstream into ThousandIsland, where it could be implemented as a runtime-configurable 'framer' module on a connection. More of a 2.0 thing, though.

We've spent a bunch of effort golfing the performance of masking to about as good as we can make it across all sizes and shapes of frames; I don't think that's much of the issue here.

@liamwhite (Contributor, Author) commented:

"I don't think that that's much of the issue here."

My benchmarking suggests frame masking is by far the most significant remaining bottleneck to reaching higher throughput with socket frames. I'm open to data that would change my mind about that, of course.

@mtrudel (Owner) commented Aug 12, 2024

I think we should focus on getting the iolist work in #390 landed, then we can look at the remaining perf issues. Truthfully, it's been a while since I looked at the masking work. Maybe @moogle19 has some recollection?
