Very slow WebSocket binary frame parsing with Firefox client #389
bandit/lib/bandit/websocket/handler.ex, line 38 at commit 6dd82d9
eprof pointed me to this line. Some logging suggests massive amounts of binaries are being created when uploading large frames. The surrounding logic (in this file and in Frame) is suboptimal: ideally it should trickle in just enough bytes to construct the header, then collect the payload as iodata and do a single conversion at the end to avoid wasting memory. Bandit will receive 1460 bytes in the handler on each transmission, the typical TCP maximum segment size. Given a frame payload of 1048576 bytes (1 MiB), this results in 1048576 / 1460 ≈ 719 calls, and each call allocates a binary that is 1460 bytes longer than the previous one. Applying the sum formula for the first n natural numbers, that works out to (1460)(719)(719 + 1) / 2 = 377906400 bytes of binaries allocated and garbage collected to receive a single 1 MiB frame (though not necessarily all live at the same time).
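For illustration, here is a minimal sketch of that iodata idea, assuming a hypothetical `collect/3` helper (module and function names are mine, not Bandit's actual API or the linked demo's code): each received chunk is appended to an iolist in O(1), and the payload is converted to a binary exactly once when complete.

```elixir
# Hypothetical demo module; not Bandit's actual code.
defmodule IodataDemo do
  # Still waiting on more bytes: appending via an improper list is O(1),
  # instead of re-copying a growing binary on every recv.
  def collect(buffer, chunk, remaining) when byte_size(chunk) < remaining do
    {:more, [buffer | chunk], remaining - byte_size(chunk)}
  end

  # Enough bytes have arrived: split off this frame's payload and do a
  # single O(n) conversion; `rest` belongs to the next frame.
  def collect(buffer, chunk, remaining) do
    <<final::binary-size(remaining), rest::binary>> = chunk
    {:ok, IO.iodata_to_binary([buffer | final]), rest}
  end
end
```

Under this scheme a 1 MiB payload costs roughly 719 constant-time list appends plus one final 1 MiB copy, rather than 719 progressively larger binary copies.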
I built a demo of the iodata idea with very poor code quality - hopefully you can improve it. With that I can get […]
I also investigated a bottleneck with unmasking large frames. While less severe, I unfortunately think it may not be solvable without help from a NIF: right now Frame.mask/2 uses a tail-recursive function to perform the mask, and I cannot reliably get more than 300 MB/s of throughput with it. I suspect that probably isn't possible without dropping to native code.
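For context, a hedged sketch of what tail-recursive masking looks like (an illustration of the technique, not Bandit's actual Frame.mask/2): per RFC 6455 the payload is XORed with a repeating 4-byte masking key, and since XOR is its own inverse the same function unmasks.

```elixir
# Illustrative only; the real Frame.mask/2 may differ.
defmodule MaskDemo do
  import Bitwise

  # `key` is the 32-bit masking key from the frame header.
  def mask(payload, key) when is_binary(payload) and is_integer(key) do
    do_mask(payload, key, [])
  end

  # Main loop: XOR 4 bytes at a time, accumulating the result as iodata.
  defp do_mask(<<word::32, rest::binary>>, key, acc) do
    do_mask(rest, key, [acc | <<bxor(word, key)::32>>])
  end

  # Tail: 0..3 leftover bytes are XORed with the leading bytes of the key.
  defp do_mask(rest, key, acc) do
    bits = bit_size(rest)
    <<partial_key::size(bits), _::bits>> = <<key::32>>
    <<partial::size(bits)>> = rest
    IO.iodata_to_binary([acc | <<bxor(partial, partial_key)::size(bits)>>])
  end
end
```

Each iteration still allocates a fresh sub-binary for the XORed word, which is the kind of per-step overhead consistent with the throughput ceiling described above.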
The need to work with binaries when pattern matching against frames is a big performance cost here; in my experience to date it hasn't been too bad, mostly since the underlying system calls return more than a single packet's worth of data per call.

I like your fix! Could you open a PR for it and we can clean it up a bit & merge? Something similar would need to be done for HTTP/2 frames as well, but I think it makes sense to establish a pattern with WebSockets before copying it over there. We've also mused about the idea of folding all of this upstream into ThousandIsland, whereby it could be implemented as a runtime-configurable 'framer' module on a connection. More of a 2.0 thing, though.

We've spent a bunch of effort golfing the performance of masking to about as good as we can make it at this point, across all sizes and shapes of frames - I don't think that's much of the issue here.
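As an aside, the binary pattern matching in question looks roughly like this (a sketch of the RFC 6455 header layout, not Bandit's actual parser): clauses like these can only match once the buffer holds a complete header, which is what couples parsing performance to the buffering strategy.

```elixir
# Sketch of WebSocket frame-header parsing via binary pattern matching.
defmodule HeaderDemo do
  # Client-to-server frames always carry a 4-byte masking key (mask bit = 1).
  # A payload length <= 125 fits directly in the 7-bit length field.
  def parse(<<fin::1, _rsv::3, opcode::4, 1::1, len::7, key::32, rest::binary>>)
      when len <= 125,
      do: {:ok, {fin, opcode, len, key}, rest}

  # A length field of 126 means a 16-bit extended length follows.
  def parse(<<fin::1, _rsv::3, opcode::4, 1::1, 126::7, len::16, key::32, rest::binary>>),
    do: {:ok, {fin, opcode, len, key}, rest}

  # A length field of 127 means a 64-bit extended length follows.
  def parse(<<fin::1, _rsv::3, opcode::4, 1::1, 127::7, len::64, key::32, rest::binary>>),
    do: {:ok, {fin, opcode, len, key}, rest}

  # Not enough bytes buffered yet for a complete header.
  def parse(_incomplete), do: :more
end
```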
My benchmarking suggests frame masking is by far the most significant remaining bottleneck to reaching higher throughput with WebSocket frames. I'm open to data that would change my mind about that, of course.
Test machine is a Ryzen 9 5950X
Elixir 1.15.7
Erlang/OTP 26
Linux 6.8.11
Cowboy is slightly slower than Bandit on Chromium; Bandit is much slower than Cowboy on Firefox. 100% repeatable. Follow-up of phoenixframework/phoenix_live_view#3386.
Demo program