Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VP8: do not forward RTP packets which payload contains a higher temporal layer than current one. #1009

Draft
wants to merge 3 commits into
base: v3
Choose a base branch
from

Conversation

jmillan
Copy link
Member

@jmillan jmillan commented Feb 3, 2023

Motivation: #989, we have seen that we are sometimes sending more BW than provided by BWE.

We verified that SimulcastConsumer is properly selecting the target spatial and temporal layers based on the available bitrate. We switch to the target spatial layer as soon as we receive a keyframe from the corresponding stream. Such spatial layer becomes the current layer. From this point onwards, for the upcoming packets for the given stream we basically let the codec implementation decide whether they should be forwarded (based on the current and target temporal layer) by calling RtpPacket::ProcessPayload() which ends up calling the codec implementation Process() method.

v3 is forwarding every old packet (which picureId is lower than the last forwarded), even those with a temporal layer greater than the current one.

Example:

  • Our current temporal layer is 0 as per BW constraints or app decision.
    • We should only send temporal layer 0 packets.
  • As long as the packets with temporal layer 1 come in order, they will be discarded, as they should.
  • Every packet with temporal layer 1 that is delayed from a previous forwarded one (this is, not arrived in order) will be forwarded to the endpoint.

This means that, as shown in #989, we are sending more BW than the current available indicated by BWE (or app) and thus:

  • We are wasting network and cpu resources.
  • We are potentially congesting an endpoint that cannot afford receiving such temporal layer 1.
  • We are making BWE misbehave by sending more BW than it announces (this is an observation from @sarumjanuch and makes a lot of sense).

Since we respect the codec keyframes in order to update layers, I'm pretty confident we are not incurring in any issue by dropping packets which temporal layer is higher than the current one.

Can you see any drawbacks here @ibc @vpalmisano @ggarber @jcague ?

Copy link
Member

@ibc ibc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a deeper look on Monday. Question: doesn't this affect H264 or VP9?

@jmillan
Copy link
Member Author

jmillan commented Feb 3, 2023

Will take a deeper look on Monday. Question: doesn't this affect H264 or VP9?

I presume so, yes. I'm looking for side effects for this changes before applying them elsewhere.

So far it works as expected.

@jmillan
Copy link
Member Author

jmillan commented Feb 6, 2023

This is the only delicate scenario that I can think of. Ie: We are forwarding temporal layer 1 and the following payloads arrive:

  1. [PID: 1, TID: 0] -> It's forwarded, temporal layer 0.
  2. [PID: 2, TID: 1] -> It's forwarded, temporal layer 1.
  3. Due to BW restrictions or app decision, target temporal layer is set to 0.
  4. [PID: 4, TID: 0] -> golden frame, arrived before [PID: 3, TID: 1] payload. It's forwarded and current temporal layer set to 0.
  5. [PID: 3, TID: 1] -> discarded, TID is greater that current temporal layer.

Observations:

  • Discarding PID: 3 will NOT generate any ACK from receiver since we already handle this in SimulcastConsumer. RTP packet containing PID: 4 has a sequence number one unit higher than the RTP packet containing PID: 2.
  • Since we only switch temporal layers with golden frames, the decoder, already has all it needs to decode the new frame.

@vpalmisano
Copy link
Contributor

  • [PID: 1, TID: 0] -> It's forwarded, temporal layer 0.
  • [PID: 2, TID: 1] -> It's forwarded, temporal layer 1.
  • Due to BW restrictions or app decision, target temporal layer is set to 0.
  • [PID: 4, TID: 0] -> golden frame, arrived before [PID: 3, TID: 1] payload. It's forwarded and current temporal layer set to 0.
  • [PID: 3, TID: 1] -> discarded, TID is greater that current temporal layer.

At this point if the app decides to set the temporal layer to 1 again:

  1. [PID: 5, TID: 0] -> the current temporal layer is not updated, so it will be discarded ?
  2. [PID: 6, TID: 1] -> current temporal layer is set to 1 and the packet it's forwarded

@@ -325,7 +325,7 @@ namespace RTC
// clang-format off
if (
this->payloadDescriptor->hasTlIndex &&
this->payloadDescriptor->tlIndex > context->GetCurrentTemporalLayer()
this->payloadDescriptor->tlIndex == context->GetTargetTemporalLayer()
Copy link
Contributor

@vpalmisano vpalmisano Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to keep the packets with temporal layer <= target ?

Suggested change
this->payloadDescriptor->tlIndex == context->GetTargetTemporalLayer()
this->payloadDescriptor->tlIndex <= context->GetTargetTemporalLayer()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must only upgrade the temporal layer with a packet whose temporal layer is the target one.

@ibc
Copy link
Member

ibc commented Feb 6, 2023

What about if we allow packets with higher layers be forwarded only if their PID value is between S_PID - 10 and S_PID being S_PID the PID of the packet we used to switch current layers?

@jmillan
Copy link
Member Author

jmillan commented Feb 6, 2023

At this point if the app decides to set the temporal layer to 1 again:
[PID: 5, TID: 0] -> the current temporal layer is not updated, so it will be discarded ?

Packets with lower TID than the current are always forwarded, since current temporal layer depends on them.

[PID: 6, TID: 1] -> current temporal layer is set to 1 and the packet it's forwarded

If this is packet can be used to change to temporal layer 1 then the current temporal layer is updated and packet is forwarded.

@jmillan
Copy link
Member Author

jmillan commented Feb 6, 2023

What about if we allow packets with higher layers be forwarded only if their PID value is between S_PID - 10 and S_PID being S_PID the PID of the packet we used to switch current layers?

I'd avoid creating this kind of logic before knowing that the current proposal does not work as expected.

@ibc
Copy link
Member

ibc commented Feb 6, 2023

[PID: 6, TID: 1] -> current temporal layer is set to 1 and the packet it's forwarded
If this is packet can be used to change to temporal layer 1 then the current temporal layer is updated and packet is forwarded.

You mean current temporal layer is changed in the Consumer or in the packet context? Or in both?

@jcague
Copy link
Contributor

jcague commented Feb 7, 2023

Wouldn't this solution create additional PLIs and keyframes if the network between the Sender and the SFU is lossy?

Example: Given a steady state: no layer switching, current TL is 0, we filter out packets with TID > 0. WIth this PR this could happen:

     Sender ---> SN: 0, PID: 0, TID: 0 -----> MS forwards it as SN: 0, PID: 0, TID: 0
 +-- Sender ---> SN: 1, PID: 1, TID: 1 --x (Packet is lost before Mediasoup)
 |   Sender ---> SN: 2, PID: 2, TID: 0 -----> MS forwards it as SN: 2, PID: 2, TID: 0
 +-> Sender ---> SN: 1, PID: 1, TID: 1 -----> MS filters it out as it belongs to TL: 1

In this case, the receiver will never receive the packet with SN: 1, and it will send NACKs with no retransmission (as we dropped the packet), and it will end up sending a PLI. And this would happen to all receivers.

Something like what @ibc mentions above would reduce its impact I think. That said, I think that forwarding old packets with higher temporal layers could also break the video stream at the receiver side, but I haven't been able to demonstrate it so far :).

@jmillan
Copy link
Member Author

jmillan commented Feb 7, 2023

Yes @jcague, that's true. Indeed I misinterpreted that when reading this exactly from @ggarber article. I'll do some local tests to verify a second concern I have, and comment here later.

@jmillan
Copy link
Member Author

jmillan commented Feb 7, 2023

Before going forward with this draft I want to make sure what's causing the problem for #989. There, we can see that some Consumers are continuously receiving the T1 layer when they should receive the T0 only, and that extra T1 traffic cannot happen due to old packets (which is what this branch aims to avoid).

I have created a branch which logs info about each outgoing VP8 payloads. We'll try to repro #989 in the next days and see what is really happening.

@ibc
Copy link
Member

ibc commented Feb 7, 2023

Idea: let old packets with temporal layer greater than current temporal layer pass to the consuming endpoint UNTIL a keyframe (for the target spatial layer) is sent to it. Once a keyframe is received by the consuming endpoint there is no reason for the endpoint to request any keyframe via PLI.

@jmillan
Copy link
Member Author

jmillan commented Feb 8, 2023

Idea: let old packets with temporal layer greater than current temporal layer pass to the consuming endpoint UNTIL a keyframe (for the target spatial layer) is sent to it. Once a keyframe is received by the consuming endpoint there is no reason for the endpoint to request any keyframe via PLI.

Yep, that would make it.

EDIT: Indeed I can't see at the moment how it would do it. Can you expose it within a flow in order to understand it. Ie: #1009 (comment)?

@ibc
Copy link
Member

ibc commented Feb 8, 2023

Yep, that would make it.

EDIT: Indeed I can't see at the moment how it would do it. Can you expose it within a flow in order to understand it. Ie: #1009 (comment)?

It was a conceptual idea. I cannot correlate it with real changes in current code. Also, now I think it doesn't make sense. Imagine we are in spatial layer 1 all the time and suddenly we make temporal layer 0 the preferred one. At this point the issue described in this ticket may happen depending on circumstances (packet loss etc) but there is no keyframe involved at all since we have never switched the spatial layer. So ignore it please.

@ggarber
Copy link
Contributor

ggarber commented Feb 8, 2023

Sorry I'm late to the party. I don't think this change will help with #989 but any improvement in the layers forwarding that could help mitigate the (infrequent) decoding video artefacts we have seen it is welcomed.

This change looks good to me. At east the idea described, I havent' reviewed the implementation.

I think in the example shared by @jcague the browser will look at the tl0PictureIndex and given that there are no gaps in that index it will continue decoding the layer 0 without problems even if there is a gap in the pictureID.

@ibc
Copy link
Member

ibc commented Feb 8, 2023

Example: Given a steady state: no layer switching, current TL is 0, we filter out packets with TID > 0. WIth this PR this could happen:

 Sender ---> SN: 0, PID: 0, TID: 0 -----> MS forwards it as SN: 0, PID: 0, TID: 0

+-- Sender ---> SN: 1, PID: 1, TID: 1 --x (Packet is lost before Mediasoup)
| Sender ---> SN: 2, PID: 2, TID: 0 -----> MS forwards it as SN: 2, PID: 2, TID: 0
+-> Sender ---> SN: 1, PID: 1, TID: 1 -----> MS filters it out as it belongs to TL: 1
In this case, the receiver will never receive the packet with SN: 1, and it will send NACKs with no retransmission (as we dropped the packet), and it will end up sending a PLI. And this would happen to all receivers.

I think the answer is no, but... in this scenario can the SimulcastConsumer know in advance (by looking at PID or something else) them at the lost packet with SN 1 belongs to TL:1 so it must be discarded and hence packet with SN:2 is sent to the consulting endpoint with SN:1 instead? Pretty sure this is not possible so here another crazy proposal:

In this scenario, when delayed SN:1 is finally received by SimulcastConsumer and the device has sent NACK for SN:1, could SimulcastConsumer send an empty packet with SN:1? I mean, a packet with no payload so it would be silently discarded by the consuming device.

@ibc
Copy link
Member

ibc commented Feb 8, 2023

@ggarber not sure if I follow, there is no nice approach here but just 2 proposals with their own drawbacks:

  1. Don't change anything and let old packets with higher TL be forwarded. This produces higher bitrate than expected and this PR aims to fix that. Let's not ignore this issue please.
  2. Don't let any old packer with higher TL pass. This will produce NACKs for sequence numbers that the SimulcastConsumer won't be able to satisfy and hence it will trigger PLI from consuming devices.

@ggarber
Copy link
Contributor

ggarber commented Feb 8, 2023

@ggarber not sure if I follow, there is no nice approach here but just 2 proposals with their own drawbacks:

  1. Don't change anything and let old packets with higher TL be forwarded. This produces higher bitrate than expected and this PR aims to fix that. Let's not ignore this issue please.

0.1% higher bitrate imo.

  1. Don't let any old packer with higher TL pass. This will produce NACKs for sequence numbers that the SimulcastConsumer won't be able to satisfy and hence it will trigger PLI from consuming devices.

I was suggesting to go with Option 2.

The PLIs in the receiver side are not triggered based on packet loss but on not having decodeable frames for 2secs. In this case there will be always decodeable frames from layer 0 so there shouldn't be any extra PLIs/keyframes.

@ibc
Copy link
Member

ibc commented Feb 8, 2023

I was suggesting to go with Option 2.
The PLIs in the receiver side are not triggered based on packet loss but on not having decodeable frames for 2secs. In this case there will be always decodeable frames from layer 0 so there shouldn't be any extra PLIs/keyframes.

Oh, it makes sense. Then... this PR is good (assuming code does what it's supposed to do), right?

@ggarber
Copy link
Contributor

ggarber commented Feb 8, 2023

I was suggesting to go with Option 2.
The PLIs in the receiver side are not triggered based on packet loss but on not having decodeable frames for 2secs. In this case there will be always decodeable frames from layer 0 so there shouldn't be any extra PLIs/keyframes.

Oh, it makes sense. Then... this PR is good (assuming code does what it's supposed to do), right?

I didn't check the implementation but the concept of the PR is good imo, yes.

@ibc
Copy link
Member

ibc commented Feb 8, 2023

I didn't check the implementation but the concept of the PR is good imo, yes.

Requires further testing. Not sure yet whether it addresses the intended issue.

@jmillan
Copy link
Member Author

jmillan commented Jan 17, 2024

Self note: Retest locally and prepare for merge.

@jmillan
Copy link
Member Author

jmillan commented Oct 15, 2024

@namello-gather, was this a concerning issue for your environment? If so, did it have a positive impact without drawbacks?

namello-gather added a commit to gathertown/mediasoup that referenced this pull request Oct 22, 2024
Merge mediasoup issue versatica#1009, don't forward VP8 layers if not requested.
@namello-gather
Copy link

Hi @jmillan sorry I'm just getting to this ping. We've been running this in production for about a month now. We have observed better performance when targeting lower temporal layers with slightly better bandwidth estimation numbers. No drawbacks have been observed, if anything, we've increased our ratings in our lower end network users pool.

@jmillan
Copy link
Member Author

jmillan commented Nov 22, 2024

Thanks for replying @namello-gather ,

if anything, we've increased our ratings in our lower end network users pool

What does this mean?

@ibc
Copy link
Member

ibc commented Nov 22, 2024

Should this PR fix issue #989? If so let's link it please.

@namello-gather
Copy link

Thanks for replying @namello-gather ,

if anything, we've increased our ratings in our lower end network users pool

What does this mean?

We bucket our rating system for our AV service by their general network capabilities, so our lower ratings have decreased and converted to higher ratings after releasing this to our user.

@jmillan
Copy link
Member Author

jmillan commented Nov 22, 2024

Nice, thanks 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants