I'm thoroughly pleased to be able to say that I finally understand the issue that's been bothering me on Report URI for a few weeks now, and this is the blog post that's going to explain everything!
If you haven't read the first two blog posts related to this issue, it might help you to get some background on what I'm going to be talking about here:
Unravelling The Mystery Of Truncated POST Requests On Report URI
Processing Truncated Requests? A PHP Debugging Deep Dive
This weird behaviour of our application receiving and then trying to process truncated POST requests has led me on quite a journey of learning, and I can finally explain what's happening!
The Story So Far
In the previous blog posts, I'd narrowed it down to what was happening, but not why it was happening, and I was still stumped. After patching PHP, I could see it was calling read() on the socket and reading until 0 was returned, but 0 was being returned earlier than expected; the log of that behaviour is in the previous blog post.
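Remember that read() returning 0 is just how "end of stream" is signalled, so a reader can't tell the difference between "the peer finished sending" and "the peer gave up and closed the socket early". As a minimal sketch (this is not PHP's code, just the general shape of a read-until-zero loop):

import socket

def read_body(conn: socket.socket) -> bytes:
    # Keep reading until recv() returns b"", i.e. the underlying read() returned 0.
    chunks = []
    while True:
        data = conn.recv(65536)
        if not data:       # 0 bytes: the peer closed its end...
            break          # ...whether or not it sent everything it promised
        chunks.append(data)
    return b"".join(chunks)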
I knew that Nginx was receiving the full request, so the problem was happening in the fastcgi_pass to PHP-FPM, but I couldn't wrap my head around the circumstances. I wanted to understand more on the underlying mechanics of how Nginx was talking to PHP and, more specifically, what was failing in that process.
bpftrace
In the last blog I had to get familiar with strace for the first time, and this time around I'd asked for some assistance from Ignat Korchagin, who kindly whipped up several bpftrace commands and could also reliably reproduce the issue. Hopefully, by monitoring things a little more closely between Nginx and PHP, I could gain an understanding of precisely what was happening.
After monitoring my test server when the issue was happening, I could see Nginx hitting EAGAIN when trying to write to the socket, and I could see PHP getting EPIPE when trying to read from the socket, but it still wasn't clear to me how that was occurring. The EAGAIN error is likely because the socket buffer is full when Nginx is sending, so it can't send any more data, and the EPIPE comes when PHP tries to read from the socket that has since been closed by Nginx. Here's the command.
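$ bpftrace -e 'kretfunc:unix_stream_sendmsg /retval < 0/ { printf("%d\n", retval); }'
Attaching 1 probe...
-11
-11
[-11 repeated 20x]
-11
-11
-32
-32
[-32 repeated 12x]
-32
-32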
The log reads like Nginx is trying to write to the socket which has a full buffer, and PHP is trying to read from a socket that's closed, but if PHP has called accept(), why would it not be able to read()? The accept() call is followed by read() calls in the PHP source almost immediately so it didn't make sense. I also saw 12x EPIPE (-32) errors coming from PHP and I had exactly 12x truncated payload errors in syslog using the same logging from the previous blog post.
Digging a little deeper, I wanted to see when the calls to connect() and accept() were being made.
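$ bpftrace -e 'tracepoint:syscalls:sys_enter_connect /comm == "nginx"/ { printf("connect: %s\n", strftime("%H:%M:%S:%f", nsecs)); } tracepoint:syscalls:sys_enter_accept /comm == "php-fpm8.2"/ { printf("accept: %s\n", strftime("%H:%M:%S:%f", nsecs)); }'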
After firing another batch of 20 requests, I could see all of the connect() calls from Nginx happen right away, but the accept() calls from PHP were much more spaced out.
I still wasn't clear on how this was working because PHP is calling accept() and read() very close together, so if PHP can accept, why is the read failing with a partial payload?!
The truncated payload errors showing in syslog matched up to the late accept() calls from PHP! It was at this point I was prompted to look at the backlog parameter of listen(), and it became clear that this behaviour can be explained by backlog, something I wasn't familiar with.
int listen(int sockfd, int backlog);
When calling listen(), the second parameter provided is backlog, which is described as follows in the docs:
The backlog argument defines the maximum length to which the
queue of pending connections for sockfd may grow.
So now we can have an application, like Nginx, call connect() and succeed without the application on the other side, PHP, ever calling accept()! This is finally starting to make a little more sense with the behaviour I'm seeing. It also turns out that Nginx can write() until the receive buffer on the socket is full, and that is when it's going to hit EAGAIN. I confirmed this using strace.
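To make that concrete, here's a minimal, self-contained sketch using plain UNIX sockets (nothing to do with Nginx or PHP-FPM, and the socket path is just an example) showing a client connecting and writing happily even though the listener never calls accept():

import os
import socket

PATH = "/tmp/backlog-demo.sock"   # illustrative path

try:
    os.unlink(PATH)
except FileNotFoundError:
    pass

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(PATH)
server.listen(4)          # a small backlog; note we never call server.accept()

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(PATH)      # succeeds: the kernel queues the connection in the backlog
client.setblocking(False)

sent = 0
try:
    while True:
        sent += client.send(b"x" * 65536)   # fills the in-kernel socket buffer
except BlockingIOError:                      # EAGAIN / EWOULDBLOCK
    print(f"send() hit EAGAIN after {sent} bytes, without accept() ever being called")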
I believe that Nginx is using writev() to send the fastcgi headers, and using sendfile() to send chunks of the body, which comes from the sendfile chain function in the Nginx source (linked at the end of this post). It certainly seems like this is starting to make a little sense. Looking at the PHP config, I wanted to see what the backlog value was, because it's not something I ever recall setting. I found the following in our config:
; Set listen(2) backlog.
; Default Value: 511 (-1 on Linux, FreeBSD and OpenBSD)
;listen.backlog = 511
With a default value of -1, I was guessing that backlog might be cast to an unsigned int by the kernel, which it is:
if ((unsigned int)backlog > somaxconn)
        backlog = somaxconn;
You can also see that, as stated in the manual, if the defined backlog value is greater than somaxconn, it will be silently truncated to that value. Because -1 will actually be 4294967295, by default, PHP will always set the backlog to the value of somaxconn, effectively letting the system decide.
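A quick way to convince yourself of that reinterpretation (a Python stand-in for the C cast, not anything from the kernel):

print((-1) & 0xFFFFFFFF)   # 4294967295, far larger than any sensible somaxconn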
# cat /proc/sys/net/core/somaxconn
4096
This means we can have up to 4,096 connections pending in the backlog queue, which is quite a few!
The Theory
At this point, I believe something like the following is happening during the lifecycle of the request.
1. Nginx makes the connect() syscall and begins writing to the socket.
2. While PHP is able to, it processes the initial requests; we see a small number of requests succeeding.
3. PHP becomes overloaded and stops processing new, inbound requests.
4. Nginx continues to make new connections and fill the socket buffer with data.
5. Once the receive buffer is full, Nginx will receive EAGAIN and likely wait for something like fastcgi_send_timeout.
6. The send will time out and Nginx will close the connection.
7. PHP becomes available and accepts the connection, starting to read until the receive buffer has been consumed.
8. Once the buffer has been read, PHP will get EPIPE and process the content read from the buffer.
Sounds logical, right? It should also be easy enough to test this now that I know what I'm aiming for, so I tinkered with some of the config parameters in Nginx, and it did indeed seem that fastcgi_send_timeout was the main factor. By reducing the send timeout, I would see more requests failing to parse in general, and I assume that's because Nginx is timing out sooner, so PHP has less time to get to the backlog.
Testing The Theory!
Now for the fun part, where you actually find out if you were right, or you were completely off the mark! I was looking into ways that I could reliably reproduce this and my first effort was to take a few steps. I reduced the number of child processes PHP can spawn to one, so it would be easy to keep it busy, and I reduced the send timeout on Nginx to three seconds so it would time out very quickly. With this basic setup I could send two requests sequentially and the first one would succeed and the second one would reliably fail, 100% of the time. I wanted something better than that though, something that would allow me to trigger this on a single request, and that's when João Poupino dropped this little gem in Slack!
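For reference, the settings involved were along these lines; treat the snippet below as a sketch of the idea rather than my exact configuration (the values simply match what I've described above):

; PHP-FPM pool config: a single worker so it's easy to keep busy
pm = static
pm.max_children = 1

# Nginx: give up on the send to PHP-FPM very quickly
fastcgi_send_timeout 3s;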
This gives me the ability to talk directly to the PHP socket, using the FastCGI protocol, and completely cut Nginx out of the equation. The script also has a feature that allows you to send only a partial request, where it will randomly close() the socket somewhere during the transmission of the payload. With this, it should now be possible to reproduce this problem with only a single request. The moment of truth....
python3 send-fastcgi-request.py unix:/run/php/php8.2-fpm.sock post.data --partial
[102935] Sending request 1/1
[102936] Connecting to unix:/run/php/php8.2-fpm.sock...
[102936] Sending 37 FastCGI records...
[102936] * Sent record 1/37 (8 + 8 bytes)
[102936] * Sent record 2/37 (8 + 512 bytes)
[102936] * Sent record 3/37 (8 + 0 bytes)
[102936] * Sent record 4/37 (8 + 32768 bytes)
[102936] * Sent record 5/37 (8 + 32768 bytes)
[102936] * Sent record 6/37 (8 + 32768 bytes)
[102936] * Sent record 7/37 (8 + 32768 bytes)
[102936] * Sent record 8/37 (8 + 32768 bytes)
[102936] * Sent record 9/37 (8 + 32768 bytes)
[102936] * Sent record 10/37 (8 + 32768 bytes)
[102936] * Sent record 11/37 (8 + 32768 bytes)
[102936] * Sent record 12/37 (8 + 32768 bytes)
[102936] * Sent record 13/37 (8 + 32768 bytes)
[102936] * Sent record 14/37 (8 + 32768 bytes)
[102936] * Sent record 15/37 (8 + 32768 bytes)
[102936] * Sent record 16/37 (8 + 32768 bytes)
[102936] * Sent record 17/37 (8 + 32768 bytes)
[102936] * Sent record 18/37 (8 + 32768 bytes)
[102936] * Sent record 19/37 (8 + 32768 bytes)
[102936] * Sent record 20/37 (8 + 32768 bytes)
[102936] * Sent record 21/37 (8 + 32768 bytes)
[102936] * Sent record 22/37 (8 + 32768 bytes)
[102936] * Sent record 23/37 (8 + 32768 bytes)
[102936] * Sent record 24/37 (8 + 32768 bytes)
[102936] Closing socket after record 24!
The script started to send my large POST payload and decided to close the socket after sending only 24 out of the 37 chunks of the request. Time to check syslog and see if we can find what I'm hoping for and expecting.
Oct 13 13:33:57 test-server php: POST data reading complete. Size: 688128 bytes (out of 1069923 bytes)
Oct 13 13:34:07 test-server php: JSON decode failed: 1069923 688128
Finally! 🍾🥂
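If you're wondering what "talking FastCGI directly to the socket" looks like, here's a rough sketch of the idea. To be clear, this is not João's script: the record building is simplified, the socket path is just the one from the output above, and SCRIPT_FILENAME is a placeholder.

import socket
import struct

FCGI_BEGIN_REQUEST, FCGI_PARAMS, FCGI_STDIN = 1, 4, 5
FCGI_RESPONDER = 1

def record(rec_type, content, request_id=1):
    # A FastCGI record: 8-byte header (version, type, request id, content length,
    # padding length, reserved) followed by the content itself.
    return struct.pack(">BBHHBx", 1, rec_type, request_id, len(content), 0) + content

def nv_pair(name, value):
    # Short-form FastCGI name-value pair (both lengths < 128 bytes).
    return bytes([len(name), len(value)]) + name + value

body = b'{"payload":"' + b"A" * 1_000_000 + b'"}'   # a large JSON-ish POST body

params = b"".join([
    nv_pair(b"REQUEST_METHOD", b"POST"),
    nv_pair(b"CONTENT_TYPE", b"application/json"),
    nv_pair(b"CONTENT_LENGTH", str(len(body)).encode()),
    nv_pair(b"SCRIPT_FILENAME", b"/var/www/html/index.php"),   # placeholder
])

records = [
    record(FCGI_BEGIN_REQUEST, struct.pack(">HB5x", FCGI_RESPONDER, 0)),
    record(FCGI_PARAMS, params),
    record(FCGI_PARAMS, b""),                                   # end of params
]
records += [record(FCGI_STDIN, body[i:i + 32768]) for i in range(0, len(body), 32768)]
# Deliberately no empty FCGI_STDIN terminator, and we stop sending partway through.

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect("/run/php/php8.2-fpm.sock")
for rec in records[:len(records) // 2]:   # send only some of the records...
    sock.sendall(rec)
sock.close()                              # ...then hang up mid-payload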
It feels awesome to get to the bottom of something that's been bugging you so much! I fixed the issue a whole two blog posts ago, and I could have moved on, but sometimes, I really just need to understand why something is happening, and honestly, it was quite a fun journey getting here. I owe a big thanks to João and Ignat, who collectively must have saved me countless days of effort that I might not even have had time for.
I've also skipped over a few of the debugging steps above because this was becoming another lengthy blog post, but I did, during and after debugging, spend a lot of time trawling through the Nginx source to understand fully what's happening and also just to confirm my understanding.
The Nginx behaviour of using writev() and sendfile() can be seen here: https://github.com/nginx/nginx/blob/9f8d60081cd4eefa5fcf0df275d784d621290b9b/src/os/unix/ngx_linux_sendfile_chain.c#L50
Nginx adds a timeout when receiving EAGAIN here, using fastcgi_send_timeout. I also noticed in the Nginx source that it makes use of epoll(), which explains why earlier I was only seeing a single EAGAIN per request: https://github.com/nginx/nginx/blob/9f8d60081cd4eefa5fcf0df275d784d621290b9b/src/http/ngx_http_upstream.c#L2093
The event timer expires in Nginx here: https://github.com/nginx/nginx/blob/9f8d60081cd4eefa5fcf0df275d784d621290b9b/src/event/ngx_event_timer.c#L54
I also spotted that PHP was respawning the child processes when I was testing and that they'd always have new PIDs after a test run. It seems that if the child is idle for too long, PHP will kill and respawn it, possibly based on this timeout value: https://github.com/php/php-src/blob/47c6b3bd452d2932af5c2f021b10ab2aaed01fb1/sapi/fpm/fpm/fpm_conf.c#L622
All in all, I can now say I'm happy that I understand everything that was happening with this issue and can finally put it to bed!
As a final closing note, just to nicely close things off, I got a response on the PHP bug that I filed in the previous blog post and it seems that this finding is indeed a bug and will be patched!