Rework session keep-alive logic v2 #1359
Force-pushed b73b657 to 01a6c16:
no other changes except for `self` -> `session` rename and adding the new `session` argument
It feels a bit weird that this is basically a state machine but we're not handling it with `match`, the best thing for handling state machines in Rust. Maybe you had it that way in another version? I've lost track a bit, sorry.
But I'm nitpicking really. If it works then let's just get it fixed and release.
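Purely as an illustration of the `match` suggestion (the variant names and timings below are assumptions, not this PR's actual code), a keep-alive state machine driven by `match` could look roughly like this:

```rust
use tokio::time::{Duration, Instant};

// Hypothetical keep-alive states, for illustration only.
enum KeepAliveState {
    ExpectingPing,
    PendingPong,
    ExpectingPongAck,
}

fn next_deadline(state: &KeepAliveState, now: Instant) -> Instant {
    match state {
        // The server normally pings about once a minute; allow extra margin.
        KeepAliveState::ExpectingPing => now + Duration::from_secs(60 + 20),
        // Reply promptly once a Ping has been received.
        KeepAliveState::PendingPong => now,
        // Wait a bounded time for the PongAck before giving up.
        KeepAliveState::ExpectingPongAck => now + Duration::from_secs(20),
    }
}
```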
```rust
}
PendingPong => {
    trace!("Sending Pong");
    let _ = session.send_packet(PacketType::Pong, vec![0, 0, 0, 0]);
```
Is there a way to flush this onto the wire after (successfully) sending? Although the 5-second margin might mask any sending delay, that isn't really the point of it.
It seems to be possible to flush the stream in principle (i.e. await the `<Framed<TcpStream> as SinkExt<_>>::poll` method). However, as the `sender_task` is implemented right now, there's only a one-directional channel (`tx_connection`) to send data. It would require a redesign of the `Session` to handle flushing.
I'd argue that it's a bit too much to put in this PR. In addition, flushing behavior (or rather the lack thereof) is the same as it was before this PR, so it's an orthogonal issue to the changed keep-alive sequence.
In general, I know far from enough about TCP and its implementations to tell what is required to make flushing work: for example, there's the `TCP_NODELAY` socket option (i.e. `socket.set_nodelay(true)`). Would we need to set that additionally?
The 5-second delay in particular is pretty arbitrary; maybe it would make sense to raise it a bit to account for short network drop-outs? If I remember correctly, I've observed the PongAck within a few hundred milliseconds after sending the Pong on a wired connection.
I agree with everything you've said there. And I've also seen some relatively large pong-ack delays (which could be due to us or them or just the network, I never investigated). It's arguably unnecessary to invalidate the session over a missing pong-ack. The next missing ping will sort things out for us anyway.
Ok, so I guess the remaining question is under which condition exactly we should time out:
1. no Ping within (120s + x) after the last Ping
2. no Ping within (60s + x) after sending the last Pong
3. no Ping within (60s + x) after receiving the last PongAck
4. no PongAck within x after sending the last Pong

where x accounts for any extra latency in the network round-trip and processing of keep-alive events.
This PR implements 3 + 4, whereas the old behaviour is 1. To be honest, I have no idea what would be best (presumably, that would depend on how the server handles the timing of these events). Is your suggestion 1, 2 or 3 here?
In fact, I've also seen a few connection resets (spaced several hours apart) with this PR applied, but I didn't investigate exactly where they originated. Maybe 5s is in fact too little margin for our timeouts?
Two questions:
- Does 3 + 4 not also imply 2?
- What do you mean by "old behaviour": in Rework session keep-alive logic #1357 or in current `dev`?

First thought is that I agree with @kingosticks to not really care about the PongAcks.
> Two questions:
>
> - Does 3 + 4 not also imply 2?

If there's not much latency, essentially yes. I think it's not quite the same regarding how the tolerance `x` adds up and from which points exactly the timeouts are counted if some event (say the PongAck) arrives with significant delay (but I didn't really think this through in detail).

> - What do you mean by "old behaviour": in Rework session keep-alive logic #1357 or in current `dev`?

Current dev. #1357 has the same behavior as this PR; only the implementation is different.

> First thought is that I agree with @kingosticks to not really care about the PongAcks.

My first thought was to use as much information about the state of the connection as possible, which is how I ended up here (i.e. time out shortly after a missing PongAck instead of waiting for the next Ping). You may be right that this overcomplicates things.
In the end, this might mean that just taking the first commit of #1357 and dropping all other commits of #1357 and #1359 is the way to go for simplicity.
librespot-java also ignores PongAck, but if I understand the code correctly, it behaves like current librespot(-rs) `dev` and sends the `Pong` immediately after receiving a `Ping`.
I really don't get why it's an issue only for some accounts. I disabled my pongs entirely as an experiment, and while I did then manage to get connection resets, it was only a couple over a long period.
> In the end, this might mean that just taking the first commit of #1357 and dropping all other commits of #1357 and #1359 is the way to go for simplicity.

Though I stand by my earlier statement that correctness trumps simplicity, if the end result is the same then I would also say that Occam's razor applies.

> I really don't get why it's an issue only for some accounts. I disabled my pongs entirely as an experiment, and while I did then manage to get connection resets, it was only a couple over a long period.

Spotify is the weirdest; some APs respond and behave differently. I remember that for some users there's also no autoplay flag reported in the XML of the initial connection, yet for 99% there is. I would not even know how to get the autoplay state without it.
Due to the GeoIP thing it's hard, if at all possible, to make a manual connection to such APs unless you are actually there.
I'm not sure where you're missing a state machine here: Isn't the second half of
👍 I'll ask people over in the original issue to maybe also test this implementation. It should implement the exact same behavior as #1357 (and it does work for me), but of course there could be different bugs here.
So with both #1357 and this one working, I leave the choice up to you. Which one do you prefer?
Seeing as there is no response, could you maybe just make the final decision here instead? It would be really nice to get this one in for the next version.
we don't really know what the server expects and how quickly it usually reacts, so add some safety margin to avoid timing out too early
Force-pushed e0870a4 to 38f8d37.
Sorry for the delay, I had a pretty busy week. Let's go with this PR then; I think the resulting code is easier to understand (and probably also easier to change, should we need to modify the keepalive sequence). I've just pushed a small update to increase the buffer before timing out a bit, just in case the previous delay of 5s after PongAck or expected Ping could also be caused by network hiccups: this shouldn't impede our ability to detect connection loss much, but may avoid some spurious timeouts, in particular given that we know very little about the servers' behavior.
Woop woop! Thanks a lot! 🙏🤗
Great! Merged! That should make v0.5.0 a wrap. Gonna try and make time this evening 🤞
Thank you @wisp3rwind.
Another approach to #1357 using the ideas from #1357 (comment).

This looks like a big change, but much of it is just moving around code: `fn dispatch` is now a part of `DispatchTask` such that it has access to the fields of `DispatchTask` (namely, the timeout and state of the keep-alive state machinery).

This is a little different from @roderickvd's idea: I've kept the `KeepAliveState` enum instead of just the `ping_received` flag, and instead there's only a single `Sleep` future around. The deadline of the latter is modified as required.

EDIT: Split into a few more commits now to make this easier to review. For testing, keep-alive events are logged at `TRACE` level.

Fixes #1340
Closes #1357
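As a rough sketch of the single-`Sleep` pattern described above (the struct, field, and method names are illustrative assumptions, not the actual `DispatchTask` code):

```rust
use std::pin::Pin;
use tokio::time::{sleep, Duration, Instant, Sleep};

// Illustrative stand-in for the dispatch task's keep-alive bookkeeping.
struct KeepAlive {
    // One reusable timer; its deadline is moved instead of creating
    // a new Sleep future for every keep-alive event.
    timer: Pin<Box<Sleep>>,
}

impl KeepAlive {
    fn new() -> Self {
        Self {
            timer: Box::pin(sleep(Duration::from_secs(60 + 20))),
        }
    }

    // Called whenever the state machine transitions, e.g. after a Ping
    // was received or a Pong was sent.
    fn rearm(&mut self, timeout: Duration) {
        self.timer.as_mut().reset(Instant::now() + timeout);
    }
}
```

In the dispatch loop, such a timer would typically be polled alongside the packet stream, e.g. in a `tokio::select!`, and re-armed on every state transition.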