feat(bin): don't allocate in server UDP recv path #2202

Open · wants to merge 3 commits into main

Conversation

mxinden (Collaborator) commented Oct 26, 2024

Previously the neqo-bin server would read a set of datagrams from the socket and allocate them:

``` rust
let dgrams: Vec<Datagram> = dgrams.map(|d| d.to_owned()).collect();
```

This was done out of convenience: handling `Datagram<&[u8]>`s, each borrowing from `self.recv_buf`, is hard to get right across multiple `&mut self` functions, here `self.run`, `self.process`, and `self.find_socket`.

This commit combines `self.process` and `self.find_socket` and passes a socket index, instead of the read `Datagram`s, from `self.run` to `self.process`. That satisfies the Rust borrow checker and lets the server handle borrowed `Datagram<&[u8]>`s instead of owned `Datagram`s.
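
To illustrate the borrow-checker constraint, a minimal standalone sketch (hypothetical names, not the actual neqo-bin code):

``` rust
// Minimal sketch (hypothetical names): passing an index compiles where
// passing a borrowed datagram across `&mut self` methods does not.
struct Runner {
    recv_buf: Vec<u8>,
    counters: Vec<usize>, // stands in for per-socket state
}

impl Runner {
    // Does not compile: `dgram` borrows `self.recv_buf`, so `self` is
    // already borrowed and cannot also be passed as `&mut self`:
    //
    // fn run(&mut self) {
    //     let dgram: &[u8] = &self.recv_buf[..];
    //     self.process(dgram); // error[E0502]
    // }

    // Compiles: only an index crosses the method boundary; the borrow of
    // `self.recv_buf` starts and ends inside `process`, where the borrow
    // checker can see it does not overlap the use of `self.counters`.
    fn run(&mut self, ready_socket_index: usize) {
        self.process(ready_socket_index);
    }

    fn process(&mut self, index: usize) {
        let dgram: &[u8] = &self.recv_buf[..];
        self.counters[index] += dgram.len();
    }
}
```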


Follow-up to #2184.
Fixes #2190.
Hopefully speeds up #2199.

github-actions bot commented Oct 26, 2024

Benchmark results

Performance differences relative to 05b4af9.

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [99.529 ns 99.865 ns 100.21 ns]
       change: [-0.1174% +0.3717% +0.8660%] (p = 0.15 > 0.05)

Found 12 outliers among 100 measurements (12.00%)
8 (8.00%) high mild
4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [117.76 ns 118.07 ns 118.41 ns]
       change: [-0.6885% +0.0671% +0.7042%] (p = 0.86 > 0.05)

Found 14 outliers among 100 measurements (14.00%)
2 (2.00%) low mild
1 (1.00%) high mild
11 (11.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [117.04 ns 117.42 ns 117.90 ns]
       change: [-0.0271% +0.7334% +1.6988%] (p = 0.11 > 0.05)

Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) low mild
8 (8.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [97.116 ns 97.271 ns 97.452 ns]
       change: [-0.8285% +0.0213% +0.7736%] (p = 0.96 > 0.05)

Found 11 outliers among 100 measurements (11.00%)
6 (6.00%) high mild
5 (5.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.
       time:   [112.09 ms 112.14 ms 112.19 ms]
       change: [+0.2811% +0.3416% +0.4103%] (p = 0.00 < 0.05)

Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) low mild
1 (1.00%) high mild

transfer/pacing-false/varying-seeds: No change in performance detected.
       time:   [26.443 ms 27.478 ms 28.535 ms]
       change: [-3.4655% +1.9202% +7.4555%] (p = 0.51 > 0.05)

Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild

transfer/pacing-true/varying-seeds: No change in performance detected.
       time:   [34.793 ms 36.454 ms 38.119 ms]
       change: [-3.5960% +2.6120% +9.8175%] (p = 0.45 > 0.05)

transfer/pacing-false/same-seed: No change in performance detected.
       time:   [25.374 ms 26.217 ms 27.080 ms]
       change: [-4.1916% -0.0816% +4.6652%] (p = 0.97 > 0.05)

transfer/pacing-true/same-seed: No change in performance detected.
       time:   [40.728 ms 42.847 ms 45.005 ms]
       change: [-6.5312% -0.0175% +7.3814%] (p = 0.99 > 0.05)

1-conn/1-100mb-resp/mtu-1500 (aka. Download)/client: No change in performance detected.
       time:   [867.39 ms 877.58 ms 888.07 ms]
       thrpt:  [112.60 MiB/s 113.95 MiB/s 115.29 MiB/s]
change:
       time:   [-1.6565% -0.1340% +1.4007%] (p = 0.87 > 0.05)
       thrpt:  [-1.3814% +0.1342% +1.6844%]

1-conn/10_000-parallel-1b-resp/mtu-1500 (aka. RPS)/client: Change within noise threshold.
       time:   [326.95 ms 330.95 ms 334.98 ms]
       thrpt:  [29.853 Kelem/s 30.216 Kelem/s 30.586 Kelem/s]
change:
       time:   [+0.2293% +1.7788% +3.2955%] (p = 0.03 < 0.05)
       thrpt:  [-3.1903% -1.7477% -0.2288%]

1-conn/1-1b-resp/mtu-1500 (aka. HPS)/client: Change within noise threshold.
       time:   [34.381 ms 34.577 ms 34.787 ms]
       thrpt:  [28.746  elem/s 28.921  elem/s 29.085  elem/s]
change:
       time:   [+0.3836% +1.2617% +2.1159%] (p = 0.00 < 0.05)
       thrpt:  [-2.0720% -1.2460% -0.3821%]

Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe

1-conn/1-100mb-resp/mtu-1500 (aka. Upload)/client: 💚 Performance has improved.
       time:   [1.6054 s 1.6208 s 1.6362 s]
       thrpt:  [61.118 MiB/s 61.697 MiB/s 62.290 MiB/s]
change:
       time:   [-8.9362% -7.7417% -6.5331%] (p = 0.00 < 0.05)
       thrpt:  [+6.9898% +8.3913% +9.8131%]
1-conn/1-100mb-resp/mtu-65536 (aka. Download)/client: 💚 Performance has improved.
       time:   [100.83 ms 101.12 ms 101.40 ms]
       thrpt:  [986.15 MiB/s 988.95 MiB/s 991.78 MiB/s]
change:
       time:   [-12.288% -10.127% -8.8506%] (p = 0.00 < 0.05)
       thrpt:  [+9.7100% +11.269% +14.009%]

Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) low mild
1 (1.00%) high mild

1-conn/10_000-parallel-1b-resp/mtu-65536 (aka. RPS)/client: 💔 Performance has regressed.
       time:   [320.95 ms 323.75 ms 326.50 ms]
       thrpt:  [30.627 Kelem/s 30.888 Kelem/s 31.158 Kelem/s]
change:
       time:   [+1.0502% +2.3047% +3.6421%] (p = 0.00 < 0.05)
       thrpt:  [-3.5141% -2.2527% -1.0393%]

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild

1-conn/1-1b-resp/mtu-65536 (aka. HPS)/client: Change within noise threshold.
       time:   [34.484 ms 34.661 ms 34.847 ms]
       thrpt:  [28.697  elem/s 28.851  elem/s 28.999  elem/s]
change:
       time:   [+0.7538% +1.4183% +2.1085%] (p = 0.00 < 0.05)
       thrpt:  [-2.0650% -1.3985% -0.7481%]

Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild

1-conn/1-100mb-resp/mtu-65536 (aka. Upload)/client: No change in performance detected.
       time:   [254.10 ms 265.28 ms 277.36 ms]
       thrpt:  [360.54 MiB/s 376.95 MiB/s 393.55 MiB/s]
change:
       time:   [-12.552% -5.3656% +1.3585%] (p = 0.16 > 0.05)
       thrpt:  [-1.3403% +5.6698% +14.354%]

Found 6 outliers among 100 measurements (6.00%)
6 (6.00%) high mild

Client/server transfer results

Transfer of 33554432 bytes over loopback.

| Client | Server | CC | Pacing | MTU | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|:---|:---|---:|---:|---:|---:|---:|
| msquic | msquic | | | 1504 | 105.5 ± 21.2 | 88.7 | 173.4 | 1.00 |
| neqo | msquic | reno | on | 1504 | 214.3 ± 11.7 | 198.6 | 229.7 | 1.00 |
| neqo | msquic | reno | | 1504 | 216.0 ± 14.9 | 201.1 | 251.7 | 1.00 |
| neqo | msquic | cubic | on | 1504 | 227.6 ± 16.7 | 199.4 | 255.7 | 1.00 |
| neqo | msquic | cubic | | 1504 | 213.1 ± 15.8 | 200.6 | 248.7 | 1.00 |
| msquic | neqo | reno | on | 1504 | 696.4 ± 8.5 | 683.6 | 707.1 | 1.00 |
| msquic | neqo | reno | | 1504 | 700.8 ± 10.9 | 676.6 | 712.9 | 1.00 |
| msquic | neqo | cubic | on | 1504 | 710.3 ± 16.3 | 687.4 | 745.7 | 1.00 |
| msquic | neqo | cubic | | 1504 | 688.3 ± 14.8 | 665.6 | 712.8 | 1.00 |
| neqo | neqo | reno | on | 1504 | 424.2 ± 21.2 | 393.4 | 472.6 | 1.00 |
| neqo | neqo | reno | | 1504 | 395.7 ± 8.2 | 386.1 | 413.8 | 1.00 |
| neqo | neqo | cubic | on | 1504 | 417.3 ± 29.6 | 382.9 | 473.0 | 1.00 |
| neqo | neqo | cubic | | 1504 | 409.2 ± 15.3 | 387.5 | 428.8 | 1.00 |
| msquic | msquic | | | 65536 | 113.2 ± 28.5 | 90.9 | 175.0 | 1.00 |
| neqo | msquic | reno | on | 65536 | 212.5 ± 13.2 | 197.5 | 231.3 | 1.00 |
| neqo | msquic | reno | | 65536 | 216.0 ± 17.8 | 196.2 | 245.8 | 1.00 |
| neqo | msquic | cubic | on | 65536 | 222.3 ± 16.4 | 193.7 | 257.9 | 1.00 |
| neqo | msquic | cubic | | 65536 | 209.1 ± 12.4 | 199.5 | 234.9 | 1.00 |
| msquic | neqo | reno | on | 65536 | 98.7 ± 27.4 | 80.8 | 196.6 | 1.00 |
| msquic | neqo | reno | | 65536 | 94.8 ± 18.6 | 80.5 | 154.4 | 1.00 |
| msquic | neqo | cubic | on | 65536 | 91.0 ± 19.1 | 81.2 | 172.7 | 1.00 |
| msquic | neqo | cubic | | 65536 | 89.2 ± 15.5 | 80.9 | 157.9 | 1.00 |
| neqo | neqo | reno | on | 65536 | 117.7 ± 14.5 | 97.4 | 145.8 | 1.00 |
| neqo | neqo | reno | | 65536 | 118.8 ± 31.6 | 94.7 | 251.3 | 1.00 |
| neqo | neqo | cubic | on | 65536 | 110.6 ± 13.3 | 95.9 | 132.3 | 1.00 |
| neqo | neqo | cubic | | 65536 | 113.9 ± 14.0 | 94.8 | 148.6 | 1.00 |

⬇️ Download logs

codecov bot commented Oct 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.36%. Comparing base (05b4af9) to head (786c616).

Additional details and impacted files
``` diff
@@            Coverage Diff             @@
##             main    #2202      +/-   ##
==========================================
- Coverage   95.39%   95.36%   -0.03%     
==========================================
  Files         112      112              
  Lines       36447    36447              
==========================================
- Hits        34768    34759       -9     
- Misses       1679     1688       +9     
```

☔ View full report in Codecov by Sentry.

github-actions bot commented Oct 26, 2024

Failed Interop Tests

QUIC Interop Runner, client vs. server: neqo-latest as client · neqo-latest as server · All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server: neqo-latest as client · neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server: neqo-latest as client · neqo-latest as server

Comment on lines 241 to 249
``` rust
input_dgrams.iter_mut().flatten().next().map_or_else(
    || {
        // Reading from the socket returned no datagrams. Don't try again.
        ready_socket_index = None;
        input_dgrams = None;
        None
    },
    Some,
)
```
Member

I'm surprised that clippy didn't pick this one up.

Suggested change:

``` diff
-input_dgrams.iter_mut().flatten().next().map_or_else(
+input_dgrams.iter_mut().flatten().next().or_else(
     || {
         // Reading from the socket returned no datagrams. Don't try again.
         ready_socket_index = None;
         input_dgrams = None;
         None
     },
-    Some,
 )
```

Collaborator Author

> I'm surprised that clippy didn't pick this one up.

And I am surprised that I didn't see this. 🤦 Thank you Martin!

Addressed in d23aa74.
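
As an aside, a standalone demonstration (assumed example, not code from this PR) that the two forms agree on `Option` when the fallback closure itself returns an `Option`:

``` rust
// Assumed standalone example: `map_or_else(f, Some)` and `or_else(f)`
// produce the same result when `f` itself returns an `Option`.
fn main() {
    for opt in [Some(1), None::<i32>] {
        let via_map_or_else = opt.map_or_else(|| None, Some);
        let via_or_else = opt.or_else(|| None);
        assert_eq!(via_map_or_else, via_or_else);
    }
}
```

`or_else` simply drops the redundant identity mapping through `Some`, which is the whole of the suggested change.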

Comment on lines 255 to 265
``` rust
let ((_host, first_socket), rest) = self.sockets.split_first_mut().unwrap();
let socket = rest
    .iter_mut()
    .map(|(_host, socket)| socket)
    .find(|socket| {
        socket
            .local_addr()
            .ok()
            .map_or(false, |socket_addr| socket_addr == dgram.source())
    })
    .unwrap_or(first_socket);
```
Member

You inlined this, presumably to avoid having to pass `&mut self` to it. It's still a useful thing to have broken out. You can make a new function that takes `&mut self.sockets` and returns `&mut Socket`. Either that, or you could go with `Self::send(&mut self.sockets, &dgram).await?` and have that function do the sending part as well.

Collaborator Author

You are right. Keeping this logic separate is simpler. I re-introduced `find_socket` in 786c616.

``` diff
 loop {
-    match self.server.process(dgram.take(), (self.now)()) {
+    let input_dgram = if let Some(d) = input_dgrams.iter_mut().flatten().next() {
```
Member

I found this code to be a little unintuitive. You are taking a mutable iterator over the option, then flattening it. It's not clear that you are mutating the underlying iterator as a result of calling `next()`.

Taking a step back, I think that this code is fairly simple:

You read from the indicated socket, process every datagram that it produces, and stop when the socket stops producing datagrams.

Would this structure work?

``` rust
async fn whatever(&mut self, ready_socket_index: Option<usize>) -> Res<()> {
    let Some(inx) = ready_socket_index else {
        return Ok(());
    };

    let (host, socket) = self.sockets.get_mut(inx).unwrap();
    while let Some(input_dgrams) = socket.recv(*host, &mut self.recv_buf)? {
        for input in input_dgrams {
            match self.server.process(input, (self.now)()) {
                // see below for a note about sending.
                Output::Datagram(output) => Self::send(&mut self.sockets, &output).await?,
                Output::Callback(t) => {
                    self.timeout = Some(Box::pin(tokio::time::sleep(t)));
                }
                Output::None => {}
            }
        }
    }
    Ok(())
}
```

I think that's functionally equivalent to what you have, but a lot easier to read, at least for me.

I'm not a borrow checker, so I couldn't say if this works. There are a lot of references being held here. I can't see any obvious overlap.

Collaborator Author

> I found this code to be a little unintuitive.

Agreed. The complexity stems from the fact that `neqo_transport::Server` does not have a `process_multiple_input` function, so one has to handle individual output `Datagram`s while still buffering input `Datagram`s.

`neqo_transport::Server` does not have a `process_multiple_input` function because each of the provided input `Datagram`s might be for a different `neqo_transport::Connection` and thus each result in an `Output::Datagram`; `process_multiple_input` would therefore need to return a set of output `Datagram`s, not just one `Datagram`.
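
Such a batched API might look roughly like this (hypothetical sketch with placeholder types; `process_multiple_input` is not part of neqo_transport):

``` rust
use std::time::Instant;

// Placeholder types standing in for the real neqo_transport ones.
struct Datagram<'a>(&'a [u8]);
enum Output {
    Datagram(Vec<u8>),
    Callback(std::time::Duration),
    None,
}
struct Server;

impl Server {
    fn process(&mut self, _dgram: Option<Datagram>, _now: Instant) -> Output {
        Output::None // placeholder
    }

    // Hypothetical: each input datagram may belong to a different
    // `Connection` and thus produce its own `Output::Datagram`, so a
    // batched variant has to return a set of outputs, not a single one.
    fn process_multiple_input<'a>(
        &mut self,
        dgrams: impl IntoIterator<Item = Datagram<'a>>,
        now: Instant,
    ) -> Vec<Output> {
        dgrams
            .into_iter()
            .map(|d| self.process(Some(d), now))
            .collect()
    }
}
```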

If you think it is worth it, I can explore this pathway, i.e. adding `process_multiple_input` to `neqo_transport::Server`, preferably in a follow-up pull request.


Addressing the concrete suggestion above:

``` rust
let Some(inx) = ready_socket_index else {
    return Ok(());
};
```

`process` (or `whatever` above) might be called with `ready_socket_index` being `None`, in which case it is expected to drive the output path only, i.e. not just return early.

``` rust
while let Some(input_dgrams) = socket.recv(*host, &mut self.recv_buf)? {
    for input in input_dgrams {
        match self.server.process(input, (self.now)()) {
            // see below for a note about sending.
            Output::Datagram(output) => Self::send(&mut self.sockets, &output).await?,
            Output::Callback(t) => {
                self.timeout = Some(Box::pin(tokio::time::sleep(t)));
            }
            Output::None => {}
        }
    }
}
```

If `self.server.process` returns `Output::Datagram`, one has to call it again until it returns `Output::Callback` or `Output::None`. In the suggestion above, `self.server.process` is only called if more input datagrams are available.
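
A minimal standalone sketch (simplified types, hypothetical names, not the actual neqo-bin code) of the resulting drive pattern:

``` rust
// Sketch: feed one input datagram per call while inputs last, then keep
// calling with `None` until the server stops producing output datagrams.
enum Output {
    Datagram(Vec<u8>),
    Callback(std::time::Duration),
    None,
}

fn drive(
    inputs: Vec<Vec<u8>>,
    mut process: impl FnMut(Option<Vec<u8>>) -> Output,
    mut send: impl FnMut(Vec<u8>),
) {
    let mut inputs = inputs.into_iter().peekable();
    loop {
        match process(inputs.next()) {
            // An output datagram means `process` must be called again:
            // more output may be pending even after inputs are exhausted.
            Output::Datagram(d) => send(d),
            // Callback/None: the server is drained for now; stop once the
            // inputs are exhausted too.
            Output::Callback(_) | Output::None => {
                if inputs.peek().is_none() {
                    break;
                }
            }
        }
    }
}
```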

Comment on lines -227 to -239
``` rust
/// Tries to find a socket, but then just falls back to sending from the first.
fn find_socket(&mut self, addr: SocketAddr) -> &mut crate::udp::Socket {
    let ((_host, first_socket), rest) = self.sockets.split_first_mut().unwrap();
    rest.iter_mut()
        .map(|(_host, socket)| socket)
        .find(|socket| {
            socket
                .local_addr()
                .ok()
                .map_or(false, |socket_addr| socket_addr == addr)
        })
        .unwrap_or(first_socket)
}
```
mxinden (Collaborator Author) commented Oct 28, 2024

Moved outside of `impl ServerRunner {}`, i.e. below, as it no longer takes `&mut self` but instead `socket: &mut [(SocketAddr, crate::udp::Socket)]`.
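
The free-standing shape might look roughly like this (a sketch generic over the socket type `S`, since the real `crate::udp::Socket` is not reproduced here):

``` rust
use std::net::SocketAddr;

// Sketch of the broken-out helper: borrowing only the socket list rather
// than `&mut self` leaves sibling fields such as `recv_buf` free to be
// borrowed by the caller at the same time.
fn find_socket<S>(
    sockets: &mut [(SocketAddr, S)],
    addr: SocketAddr,
    local_addr: impl Fn(&S) -> Option<SocketAddr>,
) -> &mut S {
    let ((_host, first_socket), rest) = sockets.split_first_mut().unwrap();
    rest.iter_mut()
        .map(|(_host, socket)| socket)
        .find(|socket| local_addr(socket) == Some(addr))
        .unwrap_or(first_socket)
}
```

A call site would then borrow only the sockets field, e.g. `find_socket(&mut self.sockets, dgram.source(), |s| s.local_addr().ok())`.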
