
Rework session keep-alive logic v2 #1359

Merged: 7 commits into librespot-org:dev, Oct 15, 2024

Conversation

@wisp3rwind (Contributor) commented Oct 2, 2024

Another approach to #1357 using the ideas from #1357 (comment).

This looks like a big change, but much of it is just code being moved around: fn dispatch is now part of DispatchTask, so that it has access to DispatchTask's fields (namely, the timeout and the state of the keep-alive machinery).

This is a little different from @roderickvd's idea: I've kept the KeepAliveState enum instead of just a ping_received flag, and there's only a single Sleep future around, whose deadline is modified as required.
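
To make the design concrete, here is a minimal sketch of that idea, a state enum plus a single reusable Sleep whose deadline gets reset; the type names, fields and timings below are illustrative assumptions, not the PR's actual code:

```rust
use std::pin::Pin;
use std::time::Duration;
use tokio::time::{Instant, Sleep};

// Hypothetical names and timings, for illustration only.
enum KeepAliveState {
    ExpectingPing,    // waiting for the server's next Ping
    PendingPong,      // Ping received, Pong not yet sent
    ExpectingPongAck, // Pong sent, waiting for the server's PongAck
}

struct KeepAlive {
    state: KeepAliveState,
    // A single Sleep future whose deadline is moved on every transition.
    timer: Pin<Box<Sleep>>,
}

impl KeepAlive {
    fn new() -> Self {
        Self {
            state: KeepAliveState::ExpectingPing,
            timer: Box::pin(tokio::time::sleep(Duration::from_secs(65))),
        }
    }

    // Called when a Ping arrives: schedule the Pong and rearm the same timer.
    fn on_ping(&mut self) {
        self.state = KeepAliveState::PendingPong;
        self.timer
            .as_mut()
            .reset(Instant::now() + Duration::from_secs(5));
    }
}
```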

EDIT: now split into a few more commits to make this easier to review. For testing, keep-alive events are logged at TRACE level.

Fixes #1340
Closes #1357

@wisp3rwind marked this pull request as ready for review October 2, 2024 14:43
Commit: no other changes except for `self` -> `session` rename and adding the new `session` argument
@kingosticks (Contributor) left a comment:

It feels a bit weird that this is basically a state machine but we're not handling it with match, which is the best tool for handling state machines in Rust. Maybe you had it that way in another version? I've lost track a bit, sorry.

But I'm really just nitpicking. If it works, then let's just get it fixed and release.

core/src/session.rs (review comment on an outdated diff):
}
PendingPong => {
trace!("Sending Pong");
let _ = session.send_packet(PacketType::Pong, vec![0, 0, 0, 0]);
Contributor:

Is there a way to flush this onto the wire after (successfully) sending? The 5-second margin might mask any sending delay, but that isn't really the point of it.

Contributor Author:

It seems to be possible to flush the stream in principle (i.e. await `<Framed<TcpStream> as SinkExt<_>>::flush`). However, as the sender_task is implemented right now, there's only a one-directional channel (tx_connection) for sending data; handling flushing would require a redesign of the Session.

I'd argue that's a bit too much to put into this PR. In addition, the flushing behavior (or rather the lack thereof) is the same as it was before this PR, so it's an orthogonal issue to the changed keep-alive sequence.

In general, I know far from enough about TCP and its implementations to tell what's required to make flushing work: for example, there's the TCP_NODELAY socket option (i.e. `socket.set_nodelay(true)`). Would we need to set that as well?
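
For the sake of illustration, flushing plus TCP_NODELAY could look roughly like the sketch below. It assumes a plain tokio TcpStream wrapped in tokio-util's Framed with BytesCodec, which is not how librespot's Session is actually wired up (it has its own codec and sends through a channel):

```rust
use bytes::Bytes;
use futures::SinkExt;
use tokio::net::TcpStream;
use tokio_util::codec::{BytesCodec, Framed};

// Illustrative only: librespot has its own codec and a sender task, so this
// is a standalone sketch rather than the Session's real send path.
async fn send_flushed(stream: TcpStream, payload: Bytes) -> std::io::Result<()> {
    // Disable Nagle's algorithm so small keep-alive packets go out immediately.
    stream.set_nodelay(true)?;

    let mut framed = Framed::new(stream, BytesCodec::new());

    // `feed` only queues the item; `flush` drives it onto the wire.
    framed.feed(payload).await?;
    framed.flush().await?;
    // (`send` would do both steps in one call.)
    Ok(())
}
```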

The 5-second margin in particular is pretty arbitrary; maybe it would make sense to raise it a bit to account for short network drop-outs? If I remember correctly, I've observed the PongAck within a few hundred milliseconds of sending the Pong on a wired connection.

Contributor:

I agree with everything you've said there. And I've also seen some relatively large PongAck delays (which could be due to us, them, or just the network; I never investigated). Arguably it's unnecessary to invalidate the session over a missing PongAck; the next missing Ping will sort things out for us anyway.

Contributor Author:

OK, so I guess the remaining question is under which condition exactly we should time out:

  1. no Ping within (120 s + x) of the last Ping
  2. no Ping within (60 s + x) of sending the last Pong
  3. no Ping within (60 s + x) of receiving the last PongAck
  4. no PongAck within x of sending the last Pong

where x accounts for any extra latency in the network round-trip and in the processing of keep-alive events.

This PR implements 3 + 4, whereas the old behaviour is 1. To be honest, I have no idea what would be best (presumably that depends on how the server handles the timing of these events). Is your suggestion 1, 2 or 3 here?
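
As a rough illustration of what 3 + 4 means for the timer (hypothetical names and margins, not the PR's actual constants):

```rust
use std::time::Duration;

// Hypothetical constants for illustration; the PR's actual margins differ.
const PING_INTERVAL: Duration = Duration::from_secs(60);
const MARGIN: Duration = Duration::from_secs(5);

enum KeepAliveState {
    ExpectingPing,    // last event: PongAck received (condition 3)
    ExpectingPongAck, // last event: Pong sent (condition 4)
}

// How long to wait before declaring the connection dead, per options 3 + 4.
fn timeout_for(state: &KeepAliveState) -> Duration {
    match state {
        KeepAliveState::ExpectingPing => PING_INTERVAL + MARGIN,
        KeepAliveState::ExpectingPongAck => MARGIN,
    }
}
```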

In fact, I've also seen a few connection resets (spaced several hours apart) with this PR applied, but I didn't investigate exactly where they originated. Maybe 5 s is indeed too little margin for our timeouts?

Member:

Two questions:

  1. Does 3 + 4 not also imply 2?
  2. What do you mean by "old behaviour" - where, in Rework session keep-alive logic #1357 or in current dev?

First thought is that I agree with @kingosticks to not really care about the PongAcks.

Contributor Author:

Two questions:

  1. Does 3 + 4 not also imply 2?

If there's not much latency, essentially yes. I think it's not quite the same with regard to how the tolerance x adds up, and to the exact points from which the timeouts are counted if some event (say the PongAck) arrives with significant delay (but I didn't really think this through in detail).

  2. What do you mean by "old behaviour" - where, in Rework session keep-alive logic #1357 or in current dev?

Current dev. #1357 has the same behavior as this PR; only the implementation is different.

First thought is that I agree with @kingosticks to not really care about the PongAcks.

My first thought was to use as much information about the state of the connection as possible, which is how I ended up here (i.e. timing out shortly after a missing PongAck instead of waiting for the next Ping). You may be right that this overcomplicates things.

In the end, this might mean that just taking the first commit of #1357 and dropping all other commits of #1357 and #1359 is the way to go for simplicity.

Contributor Author:

librespot-java also ignores the PongAck, but if I understand the code correctly, it behaves like current librespot(-rs) dev and sends the Pong immediately after receiving a Ping.

Contributor:

I really don't get why it's an issue only for some accounts. I disabled my Pongs entirely as an experiment, and while I did then manage to get connection resets, it was only a couple over a long period.

Member:

In the end, this might mean that just taking the first commit of #1357 and dropping all other commits of #1357 and #1359 is the way to go for simplicity.

Though I stand by my earlier statement that correctness trumps simplicity, if the end result is the same then I would also say that Occam's razor applies.

I really don't get why it's an issue only for some accounts. I disabled my Pongs entirely as an experiment, and while I did then manage to get connection resets, it was only a couple over a long period.

Spotify is the weirdest; some APs respond and behave differently. I remember that for some users there's also no autoplay flag reported in the XML of the initial connection, yet for 99% there is. I would not even know how to get the autoplay state without it.

Due to the GeoIP thing it's hard, if at all possible, to make a manual connection to such APs unless you are actually there.

@wisp3rwind (Contributor Author):

It feels a bit weird that this is basically a state machine but we're not handling it with match, which is the best tool for handling state machines in Rust. Maybe you had it that way in another version? I've lost track a bit, sorry.

I'm not sure which part you feel isn't handled as a state machine here: isn't the second half of fn poll() in impl Future for DispatchTask in this PR roughly what you're describing? The state is updated when receiving Ping and PongAck, as well as when sending Pong. On each keep-alive related timer event, a match statement is used to take the appropriate action (send a Pong or error out).
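
Roughly, the shape is something like this sketch (a simplified, hypothetical poll_keep_alive rather than the PR's actual fn poll; the state names and the 10-second PongAck deadline are assumptions):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Duration;
use tokio::time::{Instant, Sleep};

enum KeepAliveState {
    ExpectingPing,
    PendingPong,
    ExpectingPongAck,
}

struct DispatchTaskSketch {
    state: KeepAliveState,
    timer: Pin<Box<Sleep>>,
}

impl DispatchTaskSketch {
    fn poll_keep_alive(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), &'static str>> {
        // Packet handling elided: receiving a Ping or PongAck updates `state`
        // and resets `timer` elsewhere in poll().
        if self.timer.as_mut().poll(cx).is_ready() {
            match self.state {
                KeepAliveState::PendingPong => {
                    // Time to send the Pong (elided), then wait for the PongAck.
                    self.state = KeepAliveState::ExpectingPongAck;
                    self.timer
                        .as_mut()
                        .reset(Instant::now() + Duration::from_secs(10));
                }
                KeepAliveState::ExpectingPing | KeepAliveState::ExpectingPongAck => {
                    // The server went silent: give up on this session.
                    return Poll::Ready(Err("keep-alive timeout"));
                }
            }
        }
        Poll::Pending
    }
}
```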

But I'm really just nitpicking. If it works, then let's just get it fixed and release.

👍 I'll ask people over in the original issue to maybe also test this implementation. It should implement the exact same behavior as #1357 (and it does work for me), but of course there could be different bugs here.

@roderickvd (Member):

So with both #1357 and this one working, I leave the choice up to you. Which one do you prefer?

@Hibbins commented Oct 14, 2024:

So with both #1357 and this one working, I leave the choice up to you. Which one do you prefer?

Seeing as there is no response, could you maybe just make the final decision here instead? It would be really nice to get this in for the next version.

Commit: we don't really know what the server expects and how quickly it usually reacts, so add some safety margin to avoid timing out too early
@wisp3rwind (Contributor Author) commented Oct 15, 2024:

Sorry for the delay, I had a pretty busy week.

Let's go with this PR then, I think the resulting code is easier to understand (and probably also easier to change, should we need to modify the keepalive sequence).

I've just pushed a small update to increase the buffer before timing out a bit, just in case the previous 5 s margin after a PongAck or an expected Ping could also be exceeded due to network hiccups. This shouldn't impede our ability to detect connection loss much, but it may avoid some spurious timeouts, in particular given that we know very little about the servers' behavior.

@Hibbins commented Oct 15, 2024:

Sorry for the delay, I had a pretty busy week.

Let's go with this PR then, I think the resulting code is easier to understand (and probably also easier to change, should we need to modify the keepalive sequence).

Woop woop! Thanks a lot! 🙏🤗

@roderickvd merged commit ed766d2 into librespot-org:dev Oct 15, 2024
13 checks passed
@roderickvd (Member):

Great! Merged! That should make v0.5.0 a wrap. Gonna try and make time this evening 🤞

@kingosticks (Contributor):

Thank you @wisp3rwind.

Linked issue: Connection reset by peer (os error 104)