-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: don't split header and body across TCP packets #168
feat: don't split header and body across TCP packets #168
Conversation
Thank you for digging into this! This looks like a great improvement. Is there a reason why this needs to be configurable? We will always need to write a header so can't we just always reserve 12 bytes for it? |
Could we also achieve something similar by just changing |
This would mean an additional allocation per frame which is not great. But, we could introduce a new constructor for For this to work, we'd also have to change how Thoughts? |
You're right we can assume that if we have enough space it must be 12 bytes, but in that case we have to expose header::HEADER_SIZE to the outside. |
I am not sure I fully understand what you mean? Regardless, I think that #168 (comment) is an avenue worth exploring because it should avoid the need for any of these APIs by merging header + body into one chunk for data frames which should always result in one TCP packet, assuming it fits. |
So you propose to allocate directly into a new Frame constructor: allocate 12 bytes more for the header, encode the header into the buffer then copy the slice into the buffer at offset 12. Eventually it means the Stream API set_reserved_header_size() becomes useless, isn't it ? |
Correct! I'd have to look into the details but off the top of my head I am thinking:
The last point makes sure we don't allocate again in |
OK that's makes sense. I will try to work on it. In the same time I was wondering if we could avoid the copy of the buffer in poll_write() (or in Frame as you suggest) by implementing a second API on Stream similar to the Sink trait. For instance a start_send() method that takes ownership of a BytesMut and indicates whether there is some room at the start of the buffer for a Yamux header. With this we could avoid an allocation and a copy. What do you think about it ? |
Thank you @stormshield-pj50 for posting your findings here!
Can you add more details on the setup where you are seeing a 50% throughput increase? What is the latency and bandwidth between the two endpoints? Is it a LAN, WAN or routed over the Internet? How many TCP connections (or other flows) run concurrently? What is the lifetime of a stream on a single connection? |
Hi Max, |
70f6777
to
1c06435
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Couple of questions regarding ergonomics of these new APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not yet convinced that this is a feature worth supporting. Keep in mind, I might be missing something.
Using libp2p + Yamux + rustls, we tunnel a single iperf3 TCP connection between two local docker containers.
Do I understand correctly, that this is localhost networking only? Is this your primary use-case? If not, can you please provide numbers on a more realistic network environment (i.e. an environment with significant latency and bandwidth limitations)?
In my eyes this pull request introduces significant complexity. At this point I don't think this complexity is worth the improvement of a minor use-case (localhost networking) of this library.
Is it really "significant"? I think we should be able to implement this such that it is almost all contained within I think it makes perfect sense and is quite intuitive that a frame can be encoded into a single byte buffer. Not doing that is the weird thing, no? I agree that we may need to polish the API a bit more but as long as it is contained, I am happy to accept this. |
The first aim of the dev is to prevent sending two TCP packets on the wire each time some data is written to the Yamux stream (one packet for Yamux header and one packet for data). Thus we can save one useless standalone header packet and at least 14 (ETH) + 20 (IP) + 20 (TCP) = 54 bytes for each bunch of data written to the stream. Take a look at the start of the PR for a packet capture that show this. That also means less network syscalls and a more scalable code. Our local test is just for benchmarking purpose, however I think with or without latency the important thing is to limit the number of packets sent on the wire.
|
@stormshield-pj50 I'll have a play with this API next week, ideally, I'd like to encapsulate more of this in the Do you think we can refactor this such that it is completely an implementation detail of |
I think, if we can contain this optimisation mostly in |
@stormshield-pj50 I pushed a commit that outlines a different approach for using Unfortunately, Let me know if you are willing to continue down this path :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some more notes on what still needs work!
Yes zerocopy poor enum handling is a pity. But their TryFromBytes wip could make the job in theory.
Containing zerocopy into the Frame module is a good idea! However I won't be able to work on it this week, maybe next week. I'll finish the work and push a new commit. Thank you for your help! |
No rush from my end! Let me know if you have any questions on the code! |
Note: Having this would be helpful in debugging the TCP RTT issue in libp2p/test-plans#304. At the moment, header and body are always split into two TCP packets which makes this rather cumbersome to follow along! |
I spent some more cycles on this and I think this is now a quite clean implementation. Would appreciate another review @mxinden! The implementation is now entirely contained within |
@stormshield-pj50 Can you do me two final favours?
|
yamux/src/frame/header.rs
Outdated
} | ||
|
||
/// Decode a [`Header`] value. | ||
// Decode a [`Header`] value. | ||
pub fn decode(buf: &[u8; HEADER_SIZE]) -> Result<Header<()>, HeaderDecodeError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that the decoding should be done on the raw buffer, as we have introduced zerocopy and can work instead on the struct header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't really good use of the type system to allow construction of types that are in an invalid state. Thus, until zerocopy
can give us these guarantees, I'd rather have a dedicated decode
function. This somewhat ensures that we are not working with a Header
that has invalid values set.
Sure! I will do it today. |
Integration tests PR in rust-libp2p: libp2p/rust-libp2p#4598 |
On our side, our CI tests are all OK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good from my end but also needs @mxinden's approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I am fine moving forward here. The rational behind the change makes sense. Thank you for the continuous work.
While intuitively I expect this to have no performance drawback, I would still like to see some benchmarks. Would you mind:
- Testing the latest version in your setup and reporting the latest numbers?
- Running the
concurrent.rs
benchmarks, especially theunconstrained
, comparing this patch with latestmaster
, e.g. usingcritcmp
? - Ideally (optional) plug this into https://github.com/libp2p/test-plans/blob/master/perf/README.md#libp2p-performance-benchmarking.
fn poll_new_outbound(&mut self, cx: &mut Context<'_>) -> Poll<Result<Stream>> { | ||
fn poll_new_outbound(&mut self, cx: &Context<'_>) -> Poll<Result<Stream>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why diverge from the Future
signature here? In other words, why remove the mut
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Triggered by a clippy lint, we can put it back but given that it is an internal API, I wouldn't bother.
yamux/Cargo.toml
Outdated
|
||
[dev-dependencies] | ||
quickcheck = "1.0" | ||
futures = "0.3.4" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this dev-dependency needed? futures
is already imported above using 0.3.12
.
yamux/src/frame.rs
Outdated
use self::header::HEADER_SIZE; | ||
|
||
/// A Yamux message frame consisting of a single buffer with header followed by body. | ||
/// The header can be zerocopy parsed into a Header struct by calling header()/header_mut(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// The header can be zerocopy parsed into a Header struct by calling header()/header_mut(). | |
/// The header can be zerocopy parsed into a Header struct by calling `Header::header` and `Header::header_mut`. |
yamux/src/frame/header.rs
Outdated
|
||
/// The message frame header. | ||
#[derive(Clone, Debug, PartialEq, Eq)] | ||
#[derive(Clone, PartialEq, Eq, FromBytes, AsBytes, FromZeroes)] | ||
#[repr(packed)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this attribute the struct is no longer aligned to the target machine architecture. Do I understand correctly that we don't expect many accesses to Header
fields and thus for this not to have a performance impact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary for zerocopy
to do its job so I don't think we have a choice unless we want to trade zerocopy
for more allocations.
I'd take these with a grain of salt. When I worked on the refactoring to |
Great results! Looks like we actually measurably improved performance :) |
Results from our iperf3 CI test:
|
We just discussed this in the open maintainers call and @umgefahren made a really good suggestion: Instead of doing all this work with It is unfortunate that this idea only comes up now but I think it is well worth exploring as it would avoid the complexity of |
You can't be sure that poll_write_vectored() is well implemented on the underlying stream, as the default implementation takes the first non empty buffer and call poll_write() on it. I tested it a couple of weeks ago before submitting this PR and it was working with a TcpStream but not with a TlsStream on top of it. |
But isn't that just a shortcoming of the |
On our side we won't spend more work time on this PR. We think the implementation is good enough:
|
Which TLS library are you using? |
tokio-rustls |
Thank you for writing down your view-points. If we didn't care about reduced allocations, we could easily keep the Unfortunately, despite being familiar with I cannot follow your reasoning why we shouldn't rely on Overall, I am hesitant to merge this as is knowing that there is an easier and more maintainable solution. Our time is limited and we need to carefully consider, what we take on as maintenance responsibility. Thank you for iterating on this solution with me. I wish we would have talked about If I find some free time next week, I might check how hard it would be change to |
I've opened issue with them to discuss the implementation: rustls/tokio-rustls#26. I think that properly implementing this could be a nice optimization that the entire async rustls ecosystem can benefit from which is another reason on why we shouldn't build a specialised implementation here in |
OK that's fine for us. However if you want to cover all Tokio ecosystem libp2p related, you should also take a look at Tokio Tunsgtenite WebSockets and also Yamux over Yamux as used by libp2p relay_v2. |
I am closing this as stale. For reference, I started working on correctly forwarding vectored writes but ran out of time. Here are the drafts / issues: |
Currently, header and body of a frame are written to the underlying IO source in two steps. Depending on how eager the underlying stream is about flushing data, this can lead to two TCP packets being sent instead of header and body ending up in one.
By refactoring the internal implementation of
Frame
, we can reference it as a single byte slice and as a result, only use a single call topoll_write
without additional allocations per frame.