
Rework session keep-alive logic v2 #1359

Merged: 7 commits into librespot-org:dev, Oct 15, 2024

Conversation

@wisp3rwind (Contributor) commented Oct 2, 2024

Another approach to #1357 using the ideas from #1357 (comment).

This looks like a big change, but much of it is just code being moved around: fn dispatch is now part of DispatchTask, so that it has access to DispatchTask's fields (namely, the timeout and the state of the keep-alive machinery).

This is a little different from @roderickvd's idea: I've kept the KeepAliveState enum instead of just a ping_received flag, and there's only a single Sleep future around, whose deadline is modified as required.
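
To make the design concrete, here is a minimal sketch of that idea, a state enum plus a single reusable Sleep whose deadline gets reset; the type names, fields and timings below are illustrative assumptions, not the PR's actual code:

```rust
use std::pin::Pin;
use std::time::Duration;
use tokio::time::{Instant, Sleep};

// Hypothetical names and timings, for illustration only.
enum KeepAliveState {
    ExpectingPing,    // waiting for the server's next Ping
    PendingPong,      // Ping received, Pong not yet sent
    ExpectingPongAck, // Pong sent, waiting for the server's PongAck
}

struct KeepAlive {
    state: KeepAliveState,
    // A single Sleep future whose deadline is moved on every transition.
    timer: Pin<Box<Sleep>>,
}

impl KeepAlive {
    fn new() -> Self {
        Self {
            state: KeepAliveState::ExpectingPing,
            timer: Box::pin(tokio::time::sleep(Duration::from_secs(65))),
        }
    }

    // Called when a Ping arrives: schedule the Pong and rearm the same timer.
    fn on_ping(&mut self) {
        self.state = KeepAliveState::PendingPong;
        self.timer
            .as_mut()
            .reset(Instant::now() + Duration::from_secs(5));
    }
}
```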

EDIT: now split into a few more commits to make this easier to review. For testing, keep-alive events are logged at TRACE level.

Fixes #1340
Closes #1357

@wisp3rwind marked this pull request as ready for review October 2, 2024 14:43
Commit: no other changes except for `self` -> `session` rename and adding the new `session` argument
@kingosticks (Contributor) left a comment:

It feels a bit weird that this is basically a state machine but we're not handling it with match, which is the best tool for handling state machines in Rust. Maybe you had it that way in another version? I've lost track a bit, sorry.

But I'm really just nitpicking. If it works, then let's just get it fixed and release.

core/src/session.rs (review comment on an outdated diff):
}
PendingPong => {
trace!("Sending Pong");
let _ = session.send_packet(PacketType::Pong, vec![0, 0, 0, 0]);
Contributor:

Is there a way to flush this onto the wire after (successfully) sending? The 5-second margin might mask any sending delay, but that isn't really the point of it.

Contributor Author:

It seems to be possible to flush the stream in principle (i.e. await `<Framed<TcpStream> as SinkExt<_>>::flush`). However, as the sender_task is implemented right now, there's only a one-directional channel (tx_connection) for sending data; handling flushing would require a redesign of the Session.

I'd argue that's a bit too much to put into this PR. In addition, the flushing behavior (or rather the lack thereof) is the same as it was before this PR, so it's an orthogonal issue to the changed keep-alive sequence.

In general, I know far from enough about TCP and its implementations to tell what's required to make flushing work: for example, there's the TCP_NODELAY socket option (i.e. `socket.set_nodelay(true)`). Would we need to set that as well?
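
For the sake of illustration, flushing plus TCP_NODELAY could look roughly like the sketch below. It assumes a plain tokio TcpStream wrapped in tokio-util's Framed with BytesCodec, which is not how librespot's Session is actually wired up (it has its own codec and sends through a channel):

```rust
use bytes::Bytes;
use futures::SinkExt;
use tokio::net::TcpStream;
use tokio_util::codec::{BytesCodec, Framed};

// Illustrative only: librespot has its own codec and a sender task, so this
// is a standalone sketch rather than the Session's real send path.
async fn send_flushed(stream: TcpStream, payload: Bytes) -> std::io::Result<()> {
    // Disable Nagle's algorithm so small keep-alive packets go out immediately.
    stream.set_nodelay(true)?;

    let mut framed = Framed::new(stream, BytesCodec::new());

    // `feed` only queues the item; `flush` drives it onto the wire.
    framed.feed(payload).await?;
    framed.flush().await?;
    // (`send` would do both steps in one call.)
    Ok(())
}
```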

The 5-second margin in particular is pretty arbitrary; maybe it would make sense to raise it a bit to account for short network drop-outs? If I remember correctly, I've observed the PongAck within a few hundred milliseconds of sending the Pong on a wired connection.

Contributor:

I agree with everything you've said there. And I've also seen some relatively large PongAck delays (which could be due to us, them, or just the network; I never investigated). Arguably it's unnecessary to invalidate the session over a missing PongAck; the next missing Ping will sort things out for us anyway.

Contributor Author:

OK, so I guess the remaining question is under which condition exactly we should time out:

  1. no Ping within (120 s + x) of the last Ping
  2. no Ping within (60 s + x) of sending the last Pong
  3. no Ping within (60 s + x) of receiving the last PongAck
  4. no PongAck within x of sending the last Pong

where x accounts for any extra latency in the network round-trip and in the processing of keep-alive events.

This PR implements 3 + 4, whereas the old behaviour is 1. To be honest, I have no idea what would be best (presumably that depends on how the server handles the timing of these events). Is your suggestion 1, 2 or 3 here?
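
As a rough illustration of what 3 + 4 means for the timer (hypothetical names and margins, not the PR's actual constants):

```rust
use std::time::Duration;

// Hypothetical constants for illustration; the PR's actual margins differ.
const PING_INTERVAL: Duration = Duration::from_secs(60);
const MARGIN: Duration = Duration::from_secs(5);

enum KeepAliveState {
    ExpectingPing,    // last event: PongAck received (condition 3)
    ExpectingPongAck, // last event: Pong sent (condition 4)
}

// How long to wait before declaring the connection dead, per options 3 + 4.
fn timeout_for(state: &KeepAliveState) -> Duration {
    match state {
        KeepAliveState::ExpectingPing => PING_INTERVAL + MARGIN,
        KeepAliveState::ExpectingPongAck => MARGIN,
    }
}
```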

In fact, I've also seen a few connection resets (spaced several hours apart) with this PR applied, but I didn't investigate exactly where they originated. Maybe 5 s is indeed too little margin for our timeouts?

Member:

Two questions:

  1. Does 3 + 4 not also imply 2?
  2. What do you mean by "old behaviour" - where, in Rework session keep-alive logic #1357 or in current dev?

First thought is that I agree with @kingosticks to not really care about the PongAcks.

Contributor Author:

Two questions:

  1. Does 3 + 4 not also imply 2?

If there's not much latency, essentially yes. I think it's not quite the same with regard to how the tolerance x adds up, and to the exact points from which the timeouts are counted if some event (say the PongAck) arrives with significant delay (but I didn't really think this through in detail).

  2. What do you mean by "old behaviour" - where, in Rework session keep-alive logic #1357 or in current dev?

Current dev. #1357 has the same behavior as this PR; only the implementation is different.

First thought is that I agree with @kingosticks to not really care about the PongAcks.

My first thought was to use as much information about the state of the connection as possible, which is how I ended up here (i.e. timing out shortly after a missing PongAck instead of waiting for the next Ping). You may be right that this overcomplicates things.

In the end, this might mean that just taking the first commit of #1357 and dropping all other commits of #1357 and #1359 is the way to go for simplicity.

Contributor Author:

librespot-java also ignores the PongAck, but if I understand the code correctly, it behaves like current librespot(-rs) dev and sends the Pong immediately after receiving a Ping.

Contributor:

I really don't get why it's an issue only for some accounts. I disabled my Pongs entirely as an experiment, and while I did then manage to get connection resets, it was only a couple over a long period.

Member:

In the end, this might mean that just taking the first commit of #1357 and dropping all other commits of #1357 and #1359 is the way to go for simplicity.

Though I stand by my earlier statement that correctness trumps simplicity, if the end result is the same then I would also say that Occam's razor applies.

I really don't get why it's an issue only for some accounts. I disabled my Pongs entirely as an experiment, and while I did then manage to get connection resets, it was only a couple over a long period.

Spotify is the weirdest; some APs respond and behave differently. I remember that for some users there's also no autoplay flag reported in the XML of the initial connection, yet for 99% there is. I would not even know how to get the autoplay state without it.

Due to the GeoIP thing it's hard, if at all possible, to make a manual connection to such APs unless you are actually there.

@wisp3rwind (Contributor Author):

It feels a bit weird that this is basically a state machine but we're not handling it with match, which is the best tool for handling state machines in Rust. Maybe you had it that way in another version? I've lost track a bit, sorry.

I'm not sure which part you feel isn't handled as a state machine here: isn't the second half of fn poll() in impl Future for DispatchTask in this PR roughly what you're describing? The state is updated when receiving Ping and PongAck, as well as when sending Pong. On each keep-alive related timer event, a match statement is used to take the appropriate action (send a Pong or error out).
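
Roughly, the shape is something like this sketch (a simplified, hypothetical poll_keep_alive rather than the PR's actual fn poll; the state names and the 10-second PongAck deadline are assumptions):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Duration;
use tokio::time::{Instant, Sleep};

enum KeepAliveState {
    ExpectingPing,
    PendingPong,
    ExpectingPongAck,
}

struct DispatchTaskSketch {
    state: KeepAliveState,
    timer: Pin<Box<Sleep>>,
}

impl DispatchTaskSketch {
    fn poll_keep_alive(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), &'static str>> {
        // Packet handling elided: receiving a Ping or PongAck updates `state`
        // and resets `timer` elsewhere in poll().
        if self.timer.as_mut().poll(cx).is_ready() {
            match self.state {
                KeepAliveState::PendingPong => {
                    // Time to send the Pong (elided), then wait for the PongAck.
                    self.state = KeepAliveState::ExpectingPongAck;
                    self.timer
                        .as_mut()
                        .reset(Instant::now() + Duration::from_secs(10));
                }
                KeepAliveState::ExpectingPing | KeepAliveState::ExpectingPongAck => {
                    // The server went silent: give up on this session.
                    return Poll::Ready(Err("keep-alive timeout"));
                }
            }
        }
        Poll::Pending
    }
}
```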

But I'm really just nitpicking. If it works, then let's just get it fixed and release.

👍 I'll ask people over in the original issue to maybe also test this implementation. It should implement the exact same behavior as #1357 (and it does work for me), but of course there could be different bugs here.

@roderickvd (Member):

So with both #1357 and this one working, I leave the choice up to you. Which one do you prefer?

@Hibbins commented Oct 14, 2024:

So with both #1357 and this one working, I leave the choice up to you. Which one do you prefer?

Seeing as there is no response, could you maybe just make the final decision here instead? It would be really nice to get this in for the next version.

Commit: we don't really know what the server expects and how quickly it usually reacts, so add some safety margin to avoid timing out too early
@wisp3rwind (Contributor Author) commented Oct 15, 2024:

Sorry for the delay, I had a pretty busy week.

Let's go with this PR then, I think the resulting code is easier to understand (and probably also easier to change, should we need to modify the keepalive sequence).

I've just pushed a small update to increase the buffer before timing out a bit, just in case the previous 5 s margin after a PongAck or an expected Ping could also be exceeded due to network hiccups. This shouldn't impede our ability to detect connection loss much, but it may avoid some spurious timeouts, in particular given that we know very little about the servers' behavior.

@Hibbins commented Oct 15, 2024:

Sorry for the delay, I had a pretty busy week.

Let's go with this PR then, I think the resulting code is easier to understand (and probably also easier to change, should we need to modify the keepalive sequence).

Woop woop! Thanks a lot! 🙏🤗

@roderickvd merged commit ed766d2 into librespot-org:dev Oct 15, 2024
13 checks passed
@roderickvd (Member):

Great! Merged! That should make v0.5.0 a wrap. Gonna try and make time this evening 🤞

@kingosticks (Contributor):

Thank you @wisp3rwind.

Linked issue: Connection reset by peer (os error 104)