
Auto-pause capture when user switches captured content #4

Closed · eladalon1983 opened this issue Dec 20, 2022 · 103 comments · May be fixed by w3c/mediacapture-screen-share#289
Labels: enhancement (New feature or request)

@eladalon1983
Member

eladalon1983 commented Dec 20, 2022

Both Chrome and Safari now allow the user to change what surface is captured.

[Screenshots of the surface-switching UI; no alt text was provided.]

That's obviously great stuff. Can we make it better still?

  • What if an application intends to do some processing that depends on the captured content?
  • What if an application wants to set different constraints, for instance when capturing a window vs. when capturing a screen?
  • What if the application intends to save different surfaces to different files, and wants to start appending to a new file whenever the user changes the source?

So I propose that we add two things:

  1. A control which allows an application to instruct the browser - whenever the user changes what surface is shared, pause the track (by setting enabled to false).
  2. Fire an event whenever this happens. (With the general expectation that the application will apply new processing and then set enabled back to true.)
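
For concreteness, here is a rough sketch of how an application might use this. The 'surfacechange' event name is a placeholder, and applyNewProcessing() is app-defined:

const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
const [track] = stream.getVideoTracks();
track.addEventListener('surfacechange', async () => {
  // Per point 1, the UA has already paused the track (enabled === false).
  await applyNewProcessing(track.getSettings()); // app-defined reconfiguration
  track.enabled = true; // per point 2, resume once the app is ready
});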

Possibly the two can be combined, by specifying that setting an event handler signals that the pausing behavior is desired (@alvestrand's idea).

Another natural extension of this idea is to also apply it when a captured tab undergoes cross-origin navigation of the top-level document. When that happens, some applications might wish to stop capture momentarily and (in-content) prompt the user - "do you still want to keep sharing?"

Relevant previous discussion here.

@eladalon1983
Member Author

(One caveat: setting enabled to false does not stop frames from being delivered; it delivers black frames instead. So possibly we need another pausing mechanism. But hopefully the general intention is clear, and we can discuss details after we align on the general vision.)

@youennf

youennf commented Jan 27, 2023

One question here is whether pausing at track or at source level.
Pausing at source level makes more sense to me since it is more similar to track.muted than to track.enabled.

@eladalon1983
Member Author

Thank you for migrating the discussion here.

Repeating my last comment on the topic:

Is there a reason, other than implementation complexity[*], for auto-pause to happen on sources rather than tracks?

[*] Normally a valid concern, but here I suspect the complexity is quite low, as pausing could be implemented by dropping frames where applicable. Later, an optimization could be added to not post frames from the relevant source if all associated tracks are paused. Or wdyt?

I think the comparison to muted and enabled is a response to this. But I am not sure how it applies. From a Web developer's point of view, the ability to interact with tracks as though they were independent sounds useful. Imagine you wish to process one track but not the other; maybe you're applying background blur to a track posted remotely, but showing an unmodified local preview. Why would the latter track need to be auto-paused and unpaused?
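
A minimal sketch of that scenario, where blurTrack() stands in for app-defined processing (e.g. via insertable streams):

const [track] = stream.getVideoTracks();
pc.addTrack(blurTrack(track.clone())); // processed clone, sent remotely
localPreview.srcObject = new MediaStream([track]); // unmodified local preview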

@jan-ivar
Member

Both Chrome and Safari now allow the user to change what surface is captured.

I applaud user-agent-centered features like this! — I like how they tend to keep the end-user in charge.

But before exposing APIs for apps to potentially wrest control over such user-initiated changes, I'd like to understand the concrete problems that require app intervention, since gating output on the app seems directly counter to the user's wishes to inject different input.

What needs cannot be met without pausing?

  • What if an application intends to do some processing that depends on the captured content?

What processing would this be? What app would e.g. blur screen capture?

  • What if an application wants to set different constraints, for instance when capturing a window vs. when capturing a screen?

What constraints would this be? Since a window may be resized by the end-user, and be larger or smaller than a monitor, an app that only reacts to such changes upon source-switching would seem to have the wrong idea.

  • What if the application intends to save different surfaces to different files, and wants to start appending to a new file whenever the user changes the source?

That application can add buttons for the user to do exactly that.

What if the user intends to save different surfaces to the same file, using this user agent feature?

@eladalon1983
Member Author

eladalon1983 commented Feb 20, 2023

counter to the user's wishes to inject different input

The user's wish will NOT be subverted; this control will just give the app time to adjust to the change that the user sprung on it. For example, if the user has interacted with the app and asked for the captured media to be written to a file, then when the user clicks share-this-tab-instead, auto-pausing will give the app time to (1) close the old file, (2) open a new file, and (3) start piping new frames to the new file.
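
A sketch of that flow, assuming a hypothetical auto-pause hook (onSurfaceSwitch, resumeCapture) and app-defined file plumbing (appendToCurrentFile, openNewFile):

let recorder = new MediaRecorder(stream);
recorder.ondataavailable = ({ data }) => appendToCurrentFile(data); // app-defined
recorder.start(1000);
onSurfaceSwitch(async () => { // hypothetical; the track is auto-paused here
  recorder.stop(); // (1) close the old file
  await openNewFile(); // (2) open a new file, app-defined
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = ({ data }) => appendToCurrentFile(data);
  recorder.start(1000); // (3) pipe new frames to the new file
  resumeCapture(); // hypothetical un-pause
});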

What needs cannot be met without pausing?

Please read my earlier messages, as well as the previous paragraph of this message.

What processing would this be?

Cropping using Region Capture. If the user starts sharing a new tab, you might want to pause the capture until you can communicate with the other tab, obtain a CropTarget and apply it.
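
As a rough sketch: cropTo() and CropTarget are real Region Capture APIs, while the 'surfacechange' pause/resume hook and requestCropTargetFromCapturedTab() (app-defined messaging) are assumptions:

track.addEventListener('surfacechange', async () => {
  const cropTarget = await requestCropTargetFromCapturedTab(); // app-defined
  await track.cropTo(cropTarget); // Region Capture
  track.enabled = true; // resume only once the crop is applied
});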

What constraints would this be?

Meet uses a different frame-rate when capturing tabs, windows and screens.

What if the application intends to save different surfaces to different files, and wants to start appending to a new file whenever the user changes the source?

That application can add buttons for the user to do exactly that.

Why burden the user?

What if the user intends to save different surfaces to the same file, using this user agent feature?

Then that user will use an app that saves everything to the same file. I gave you examples of how this API could be used. Nowhere did I say that we'll be scouring the Internet and blocklisting any app that does not comport with my example.

@eladalon1983
Member Author

Jan-Ivar, have you perhaps missed that this new API is opt-in, and that the default behavior will remain to NOT auto-pause?

@eladalon1983
Member Author

I've rewatched the WebRTC WG January interim, Jan-Ivar, and I see you debated the shape with me. Now you seem to have forgotten the use-cases and are confused about the proposal. What has changed? Could you try rewatching my presentation there?

@youennf

youennf commented Feb 21, 2023

Following on today's call, here is some related feedback:

  1. It would be good to discuss the pros and cons of putting the API at source or track level. Having it at the source level is in general safer for the web app and simpler for the UA. Getting use cases that benefit from cloned tracks, with one being paused and not the other, would be useful.
  2. The proposed API shape could probably be reduced. For instance, events might not be well suited, since their cost means they require an API to disable/enable firing them.

Assuming we want it at the track level, the below API could fit the bill:

callback DisplaySurfaceChangeCallback = undefined(DisplaySurfaceChangeHandler handler);
interface DisplaySurfaceChangeHandler {
    // Call this method to suspend video frames for longer than the current callback task.
    undefined waitUntil(Promise<any> f);
};

partial interface MediaStreamTrack {
    // Callback should be executed after the corresponding onconfigurationchange event handlers.
    undefined setDisplaySurfaceChangeCallback(DisplaySurfaceChangeCallback? callback);
};
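
A usage sketch of this shape, with reconfigure() being app-defined:

videoTrack.setDisplaySurfaceChangeCallback((handler) => {
  // Suspend video frames beyond the current task until reconfiguration completes.
  handler.waitUntil(reconfigure(videoTrack.getSettings()));
});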

As an additional note, we could even reduce the API shape with a callback of the form:

callback DisplaySurfaceChangeCallback = Promise<any>();

@jan-ivar
Member

jan-ivar commented Feb 22, 2023

The use case as presented would seem solved by a simple event fired before a page is navigated in a captured tab. Why can't a user agent do that? Then pausing isn't necessary, and the API is super simple.

If we accept pausing is necessary, my preference would be to:

  1. Reuse the track mute and unmute UA events for this (relax the "black frame" requirement for video by some ms)
  2. Add some means for JS to distinguish the cause of the UA mute
  3. Add track.unmute() and align it with Solve user agent camera/microphone double-mute mediacapture-extensions#39
  4. Put the opt-in on CaptureController.

This would accomplish several things:

  • Reuses the existing MediaStreamTrack API which needs to make sense for a lot of sources.
  • Keeps screen-capture specific controls to CaptureController
  • Pauses both video and audio tracks
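
As a rough sketch of this shape (the opt-in attribute, the mute-cause accessor, and track.unmute() are all hypothetical at this point):

controller.pauseOnSurfaceSwitch = true; // hypothetical opt-in on CaptureController
track.onmute = async () => {
  if (track.muteReason === 'surface-switch') { // hypothetical cause accessor
    await reconfigure(track); // app-defined
    track.unmute(); // hypothetical, per mediacapture-extensions#39
  }
};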

@youennf

youennf commented Feb 22, 2023

  1. Reuse the track mute and unmute UA events for this

That makes sense if we go with a source level API, let's discuss this important design point first.

@dontcallmedom-bot
This issue was mentioned in WEBRTCWG-2023-06-27 (Page 16)

@eladalon1983
Member Author

As mentioned during the June meeting of the Screen Capture CG, the issue presented in this thread might be tackled along with a different one, using a mechanism other than auto-pause.

Other issue:

  1. Application calls getDisplayMedia() and obtains a MediaStream containing a video MediaStreamTrack vid.
  2. Application checks vid.getSettings().displaySurface and detects that the user chose to share a specific surface type (say 'monitor').
  3. User changes the capture to some other surface type (say 'window').
  4. Application attempts to call some method on vid, that behaves differently on the first surface type compared to the second. (Say throws an error.)

Reasonable applications deployed today are likely to only perform the check of surface type immediately after gDM resolves, and set cascading state based on it (modify user-facing controls that can later trigger step 4). Such applications would not have been written to check again before step 4, because dynamic switching across surface types did not yet exist.

The backwards-compatible solution I presented during the Screen Capture CG meeting is as follows:

const controller = new CaptureController();
controller.enableDynamicSwitching('switch', (event) => {
  const videoElement = document.getElementById('myVideoElement');
  videoElement.srcObject = event.mediaStream;
  // And/or outgoing peer connection; imagine whatever you find relevant.
});
const stream = await navigator.mediaDevices.getDisplayMedia({ controller });
...

The spec would then say that the user agent MAY constrain dynamic-switching options based on whether enableDynamicSwitching() was called (before gDM was called), thereby ensuring that cross-type surface-switching only happens to apps that can handle it gracefully.

Note how the solution provides auto-pause "for free" by virtue of creating a new stream (and new tracks).

@eladalon1983
Member Author

During TPAC, in the joint SCCG / WebRTC WG meeting, I believe we started forming consensus on this topic (slides). Any more thoughts on the general API shape I presented, or shall we send a PR soon and iterate on specifics there?
Tagging interested parties: @jan-ivar, @youennf

@jan-ivar
Member

I have some API nits (and agree we shouldn't overload addEventListener). I'd prefer:

const controller = new CaptureController();
controller.addEventListener("switch", ({stream}) => video.srcObject = stream);
controller.manualSwitching();
video.srcObject = await navigator.mediaDevices.getDisplayMedia({controller});

This is modeled on messagePort.start() which is "only needed when using EventTarget.addEventListener; it is implied when using onmessage", which makes this equivalent:

const controller = new CaptureController();
controller.onswitch = ({stream}) => video.srcObject = stream;
video.srcObject = await navigator.mediaDevices.getDisplayMedia({controller});

I.e. manualSwitching() is implicit with onswitch.

While I'm coming around on the API, I don't like the idea of limiting this existing and useful end-user feature only to sites that opt in. That seems like a loss for users. Hence the name manualSwitching() instead of enableDynamicSwitching().

I'd prefer if this new API came with language encouraging browsers to not gate the feature only on apps that opt in to the event model.

I have reservations about abandoning the old injection model which had some unique advantages:

  1. Works with all websites without needing their support (user benefits)
  2. No need to replaceTrack downstream peer-connection sinks
  3. No risk of ending a MediaRecorder sink recording prematurely
  4. Difficult for apps to deny user switching (user benefits)

So I'm not convinced the event model is always better just because it wins solving the most advanced case (the tightly integrated crop-enabled VC+SLIDE app). For VC apps that DON'T have integrated SLIDE apps, it's not clear why the event API is needed (apart from audio which I propose a solution for below).

When it comes to audio (which I see as the lone flaw in the injection model for most VC sites), the following might work well enough in the injection model:

const stream = await navigator.mediaDevices.getDisplayMedia({audio: true});
video.srcObject = stream;
stream.onaddtrack = ({track}) => console.log(`audio added`);

The UA could subsequently mute the audio track as needed rather than remove it. Thoughts?

@beaufortfrancois

Note that the 2 other occurrences I've found of doing something more in the setter onfoo are:

For the record, I'm not a huge fan of this, as developers may read controller.onswitch = ..., think they can/should use controller.addEventListener('switch', ...) instead (as they've been educated to not use onfoo anymore), and eventually realize their code doesn't work, without understanding why at first glance.

@eladalon1983
Member Author

@jan-ivar, I think we have general alignment on the solution and are now discussing the specifics of the shape. That works for me. In the interest of clarity, I am splitting this response in two. This comment covers where we agree; the next comment covers an open issue for discussion.

Hence the name manualSwitching() instead of enableDynamicSwitching().

That name change works for me. Almost. I think enable... would make it clearer that it's a setter. So enableManualSwitching().

I'd prefer if this new API came with language to encourage browsers to not gate the feature only on apps that opt-in to the event model.

Works for me.

I have reservations about abandoning the old injection model which had some unique advantages:

  1. Works with all websites without needing their support (user benefits)
  2. No need to replaceTrack downstream peer-connection sinks
  3. No risk of ending a MediaRecorder sink recording prematurely

We are not currently planning to abandon the old model in Chrome's implementation. Thank you for documenting some of the reasons to retain it.

  1. Difficult for apps to deny user switching (user benefits)

In the interest of transparency, I'd like to remind you of surfaceSwitching. But please note that it's an orthogonal discussion. We are neither introducing surfaceSwitching nor deprecating it. The semantics of the new event listener would be that it has an effect if the user chooses to switch surfaces.

The UA could subsequently mute the audio track as needed rather than remove it. Thoughts?

I think that we have multiple challenges and a single API that solves them all, so I don't think we need an additional API that only solves a subset of the challenges (audio).

@eladalon1983
Member Author

eladalon1983 commented Sep 22, 2023

... it is implied when using onmessage", which makes this equivalent:

const controller = new CaptureController();
controller.onswitch = ({stream}) => video.srcObject = stream;
video.srcObject = await navigator.mediaDevices.getDisplayMedia({controller});

I.e. manualSwitching() is implicit with onswitch.

This seems less than ideal to me. I think it's a footgun for developers when onevent and addEventListener behave differently. (See also François' comment.) But before we do a deep dive on this particular detail...

... I wonder if we should eschew both onswitch and addEventListener to begin with, because allowing the registration of multiple event listeners seems like a footgun in itself.

  • Do the events have...
    • ...multiple distinct tracks with the same source?
    • ...multiple distinct tracks with identical but distinct sources?
    • ...track clones?
    • ...separate references to the same track?
  • What happens if the first event handler stops the tracks, and then the next event is fired with stopped tracks?
  • What if the developer enables manual switching and neglects to register an event handler?
  • What if the developer registers an event handler and neglects to enable manual switching?

It seems better to shape the API as enableManualSwitching(handler). This ensures exactly one handler, neatly solving all of these issues.
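
A sketch of that shape, assuming the handler receives the new stream:

controller.enableManualSwitching(({ stream }) => {
  video.srcObject = stream; // rewire the local preview
  const [track] = stream.getVideoTracks();
  sender.replaceTrack(track); // rewire the outgoing RTCRtpSender
});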

@youennf

youennf commented Sep 22, 2023

  1. Works with all websites without needing their support (user benefits)

Agreed; new tracks require extra work for the web developer, so from this point of view it is appealing to keep using the current track.
This would require an additional API to allow delaying the generation of frames, typically as part of the onswitch event.

As for the source mutating, the configurationchange event may be good enough to tell the web application that, for instance, cropTo might no longer work. This would require deprecating BrowserCaptureMediaStreamTrack as well.

The main benefit of the new-track approach is that it is future-proof however the model evolves.
Another example that comes to mind is switching from one window to several windows (hence several video tracks, and potentially several audio tracks?).
In that case, applications will anyway need to deal with new video tracks if they want to leverage the user selection.

stream.onaddtrack

I am not a big fan, given many websites do tend to recreate MediaStreams on the fly or clone them. In the future, we might want to transfer MediaStream as well.
The application would also need to hold on to both the CaptureController and the MediaStream to be notified of what is happening.
Since we already have CaptureController, I tend to prefer having a single event telling the application what happened all at once.

@eladalon1983
Member Author

Youenn, could you please clarify which approach you are favoring here? (Note that I am not proposing abandoning the old model. So any application that has a reason to stick to it, can still do so.)

@jan-ivar
Member

(answering multiple people)

I think enable... would make it clearer that it's a setter. So enableManualSwitching().

👍

... developers may read controller.onswitch = ... and think they can/should use controller.addEventListener('switch', ... ... and eventually realize their code doesn't work, without understanding why at first glance.

Yes, this literally happened to me with web workers! So I sympathize, but I did figure it out, and there IS precedent here, which may help devs... This might come down to one's view of onfoo vs addEventListener("foo", ) ...

as they've been educated to not use onfoo anymore

Where is this advice? I recall it years back, but worry it may be outdated. See § 7.4. Always add event handler attributes.

IMHO onfoo is fine in top-level app code ("top-level" = the level owning the target). E.g. with

const video = document.createElement('video');
video.srcObject = stream;
await new Promise(r => video.onloadedmetadata = r);

...there's no problem since video is not available outside this async function yet. Just avoid hiding it inside sub-functions and helper functions.

In this issue, I'm even hearing an argument for a single callback which (I don't agree with but) I suspect means controller.onswitch = f would likely satisfy 99% of cases.

... allowing the registration of multiple event listeners seems like a footgun in itself.

I think it's generally agreed that not allowing multiple listeners is the footgun. E.g. two areas (functions) of an app both set onfoo or setMy(callback) whatever, causing one to remove the other, a hard-to-debug action at a distance.

Lots of precedent says multiple events are OK. See §7.3. Don’t invent your own event listener-like infrastructure.

stream.onaddtrack

I am not a big fan, given many websites do tend to recreate MediaStreams on the fly or clone them. In the future, we might want to transfer MediaStream as well.

Sure, but the same is true of the MediaStream(s) managed by RTCRtpReceiver. But I find it interesting that the video.srcObject sink handles it seamlessly in all browsers. But I agree real-world sinks would likely need to take some action, e.g. with the track-based RTCPeerConnection:

stream.onaddtrack = ({track}) => pc.addTrack(track);

The application would also need to hold on to both the CaptureController and the MediaStream to be notified of what is happening.

This API solves the audio problem in the injection case, which doesn't require CaptureController. The question here probably comes down to whether we want to solve the audio injection case or not...

@jan-ivar
Member

  1. Difficult for apps to deny user switching (user benefits)

In the interest of transparency, I'd like to remind you of surfaceSwitching.

Yes, but that merely "signals whether the application would like the user agent to offer the user an option to dynamically switch". I.e. UAs are free to ignore it.

@youennf

youennf commented Sep 29, 2023

Topic was discussed yesterday and it seems there is some convergence. AIUI, the proposal is:

  • An event is fired (or a callback is executed) to tell web app that surface switching happens.
  • The UA makes sure to not provide any media to the old and new tracks until the event/callback is fired/executed.
  • The event/callback exposes new audio/video tracks depending on the user selection.
  • When firing/execution is done, the UA ends the old tracks.
  • The UA starts providing media on the new tracks.

All these steps are synchronously executed.

@eladalon1983, @jan-ivar, can you clarify whether this is a correct summary?

If this is correct, we should consider removing step 4 (closing the old tracks). The web application could decide to stick with the old tracks and stop the new tracks instead if it is thought more adequate.

This would allow easy migration in the simple cases (just do nothing for video, check for new audio).
We could have a simple CaptureController onswitch event, given it would not have any side effect.
We could in the future extend the event with a waitUntil API if we think the web app needs more time to decide what to do before the UA starts generating media.
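
As an illustrative sketch (the event name and payload are assumed; waitUntil is the possible future extension):

controller.onswitch = (event) => {
  video.srcObject = event.stream; // adopt the new tracks
  // Possible future extension, if the web app needs more time:
  // event.waitUntil(reconfigure(event.stream));
};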

@eladalon1983
Member Author

Thanks, Youenn; I believe this summary is mostly accurate. [*]

we should consider removing step 4 (closing the old tracks). The web application could decide to stick with the old tracks and stop the new tracks instead if it is thought more adequate.

This would allow easy migration in the simple cases (just do nothing for video, check for new audio).

I don't think this would work, because the old track would not get any new frames, even though it's not ended. I'm also not sure what problem it's solving - if an existing application is already updated to call enableManualSwitching(), and it already does work on the new audio track, why can't it also do work on the video track? (In contrast to backwards compatibility with existing applications that don't update their code and never call enableManualSwitching(); it's preferable to ensure that these just keep working.)

We could have a simple CaptureController onswitch event, given it would not have any side effect.

Minor objection on my part here. I don't think I'd block on it, but I'd want a chance to argue against it, at a time that would not distract us away from the main discussion.


[*] With some nits. For example, I don't currently see why it's necessary to mandate that no new frames be emitted on the new track before the old one is ended. But I think that's a minor point that can be discussed later.

@youennf

youennf commented Sep 29, 2023

I don't think this would work, because the old track would not get any new frames, even if it's ended.

The point would be to delay sending frames to the old track until the event listeners are executed.
This gives enough time for the web app to stop the old track or continue using it.

I see two main benefits:

  1. We could remove enableManualSwitching(). The old track would stay in the legacy mode, as if enableManualSwitching() had not been called. The new track would be created as if enableManualSwitching() had been called.
  2. This is more convenient for web developers, as they might only have to deal with audio; the video setup will stay the same.

@eladalon1983
Member Author

I think we still want to go with a callback rather than an event, so as to avoid the issues with multiple tracks with the same source, and the interaction between multiple listeners.

With that, I don't really see how we could avoid enableManualSwitching(), or any method/attribute for setting the callback.

Once a callback is set - using whichever method or attribute - there is no way for the UA to determine whether the application wishes to receive a new audio track and retain the old video track. Not without additional API surfaces, at any rate, and those are past MVP in my opinion. We need to deliver a new video track, and we need to stop delivering frames to the old one (or else they effectively share a source).

I think we have a clear and simple solution in:

  • No callback set using enableManualSwitching() - old behavior.
  • Callback set using enableManualSwitching() - the new behavior when the user switches the shared surface is:
    1. Stop delivering frames on the old tracks.
    2. Invoke the callback with the new tracks. Frames might already be flowing on those tracks.
    3. End the old tracks. (TBD if an onended event should be fired.)

This is future-proof:

  • We can extend the API to treat navigation as a change of surface.
  • We can extend the API to handle user interaction with the UA as change of surface. (I am NOT saying that we should. I am only saying that we have the flexibility to consider it.)
  • We can extend the API to allow apps to ask for different behavior for video/audio. (E.g. ask the UA to keep the injection-model for one but issue new tracks for the other.)

@youennf

youennf commented Oct 2, 2024

My thinking is a sourceswitch event that fires whenever the user switches source (with no requirement to stop tracks)

The callback approach allows this as well.
I was not clear about this previously in the thread (sorry about that): setting the callback would not be a signal for the UA to go into switch mode and stop the previous tracks.

Instead, we stick with the injection model for old tracks. The web page can stop the old tracks anyway. I am OK with adding an option so that the web page tells the UA to stop the tracks (hence the various proposals I made on top of the callback).
We would need language, though, specifying that media does not flow into the old tracks until the callback has executed.

Having a callback to deliver the stream is better since there is one place where you decide what to do with the new tracks (clone them, stop them, ...). And the spec can be made clear that MediaStreamTracks are not created if the callback is not set. This is more difficult with events.
And I do not really see a case for multiple event listeners for this switch case (web devs already have configuration change anyway).

@jan-ivar
Member

jan-ivar commented Oct 3, 2024

The web page can stop the old tracks anyway

What enforces that the website can't keep both the old live injected track and the live new track? We need to specify this implicit action at a distance.

Having a callback to deliver the stream is better since there is one place where you decide what to do with the new tracks (clone it, stop it...)

If this means there's one place where you decide what happens with the old tracks (enforced by the aforementioned action at a distance), then I agree that might be a good reason for a callback.

Can we make it a settable attribute at least?

@youennf

youennf commented Oct 3, 2024

What enforces that the website can't keep both the old live injected track and the live new track? We need to specify this implicit action at a distance.

I do not see any implicit action at a distance; the website can keep both.
Is there an issue with that?

If this means there's one place where you decide what happens with the old tracks

In my mind, the default behavior (whether the callback is set or not) is that no track is stopped by the UA; the web page can deal with it by itself.

We can enrich the callback to make the UA stop the previous tracks, for instance:

  • captureController.setDisplaySurfaceChangeCallback(stream => { ... }, { mode: 'stop-previous-tracks' })
  • captureController.setDisplaySurfaceChangeCallback(stream => { ...; return 'stop-previous-tracks'; })
  • captureController.setDisplaySurfaceChangeCallback(stream => { ...; captureController.stopPreviousTracks();... })
  • captureController.setDisplaySurfaceChangeCallback(async stream => { await...; captureController.stopPreviousTracks();... });

It is a bit less straightforward to extend things with an event. And again, it is not really compelling to have several event listeners sharing the responsibility to stop the old tracks (or the new tracks).

Also, I could see a UA tailoring its UI based on whether the callback is registered (for instance, not showing the share-audio checkbox if no audio was shared before the surface switch).
This is sort of similar to MediaSession going with callbacks as action handlers.

Can we make it a settable attribute at least?

Ah, good point, I guess this would disallow option 1 above.

@dontcallmedom
Member

(for the record - this was discussed in the joint SCCWG/WebRTC meeting last week)

@jan-ivar
Member

jan-ivar commented Oct 3, 2024

And I do not really see a case for multiple event listeners for this switch case

A singular callback assumes a single downstream consumer. An app may have multiple consumers of an active screen-capture, e.g. a transmitter, a preview, and a recorder, each with distinct downstream needs.

Tracks can be cloned, but a CaptureController cannot. So this becomes a choke point. We don't want different parts of an app competing to set the same callback and overwrite each other.

The web platform tries hard to avoid single-consumer APIs. See § 7.3. Don’t invent your own event listener-like infrastructure, and requestVideoFrameCallback.

I think we need a good reason to deviate from these principles.

(web devs already have configuration change anyway).

Those are per-track and cannot tell you whether the source changed or e.g. was just resized.

And the spec can be made clear that MediaStreamTracks are not created if the callback is not set. This is more difficult with events.

This seems like a marginal optimization compared to such a significant user action.

This is sort of similar to MediaSession going with callbacks as action handlers.

That's a fairly recent API with its own flaws. But it has a good reason: Many of its actions rely on the website to maintain a singular state. What's our reason?

@jan-ivar
Member

jan-ivar commented Oct 3, 2024

Also, I could see a UA tailoring its UI based on whether the callback is registered (for instance, not showing the share-audio checkbox if no audio was shared before the surface switch).

We've gone around a few times on this point. Yes, the absence of a callback might preclude the app handling audio, but the presence of a callback does not guarantee it.

But § 7.3. specifically mentions this point: for "an API which allows authors to start ... a process which generates notifications, use the existing event infrastructure"

@youennf

youennf commented Oct 3, 2024

Those are per-track and cannot tell you whether the source changed or e.g. was just resized.

Just check for deviceId in track.getSettings(); there is no need to use source switch for that.
Source switch is about deciding whether to use the old tracks or the new tracks.
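
A sketch of that check (noting that deviceId exposure for capture tracks is questioned below):

let lastDeviceId = videoTrack.getSettings().deviceId;
videoTrack.addEventListener('configurationchange', () => {
  const { deviceId } = videoTrack.getSettings();
  if (deviceId !== lastDeviceId) {
    lastDeviceId = deviceId;
    // The source switched; a mere resize keeps the same deviceId.
  }
});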

That's a fairly recent API with its own flaws.

I don't see how this particular flaw applies here.

MediaStreamTrack events are where you distribute the info.
CaptureController is a single place for mission-critical information (running the callback may actually trigger a freeze of video frame generation).

@youennf

youennf commented Oct 3, 2024

Discussed at editor's meeting and we will try to converge via a design document.

@youennf

youennf commented Oct 3, 2024

Another proposal to consider; maybe it could help us converge:

  • Rely on the configurationchange event for apps to decide whether to continue processing or not. This means that when there is a source switch, video frame delivery to sinks will be suspended until the configurationchange event listeners are called (on a per-track basis). This ensures the injection model works.
  • Rely on a new notification mechanism (event/callback) to expose new tracks coming in (whether audio or video).
  • We leave it to the UA/OS for now to decide how tracks get stopped. Depending on the OS UX, user expectations may actually be different. We progressively converge on defining this behaviour as we experiment.

@jan-ivar
Member

jan-ivar commented Oct 3, 2024

Those are per-track and cannot tell you whether the source changed or e.g. was just resized.

Just check for deviceId in track.getSettings()

The spec says: "deviceIds are not exposed." It's not listed in § 5.4 Constrainable Properties for Captured Display Surfaces.

CaptureController is a single place for mission critical information (running the callback may actually trigger a freeze of video frame generation).

Why is that critical? This is the kind of action at a (maybe not so much) distance we should document. This might justify a callback.

@youennf

youennf commented Oct 3, 2024

The spec says: "deviceIds are not exposed."

Chrome and Safari are exposing deviceIds.

Wrt callback vs. event, let's rediscuss this when we know what signals we want to expose.
@tovepet is planning to create a design document we can all participate in to try reaching consensus on the underlying model.

@jan-ivar
Member

jan-ivar commented Oct 8, 2024

Chrome and Safari are exposing deviceIds.

I've filed crbug 372252497 and webkit 281077.

Agree callback vs. event seems secondary.

The main question seems to be over allowing late decision vs. limiting injection to tracks returned from getDisplayMedia().

What's the benefit of exposing surface tracks rather than new session tracks in the callback/event?

@eladalon1983
Member Author

eladalon1983 commented Oct 8, 2024

What's the benefit of exposing surface tracks rather than new session tracks in the callback/event?

I am a bit confused about the purpose of "new session tracks". Who would need them? The entire idea of a "session track" is that it follows the session wherever it goes, whatever the captured surface is, across user-initiated changes. If a developer needs multiple such session tracks, can't they just clone the original ones?

@youennf

youennf commented Oct 9, 2024

I've filed crbug 372252497 and webkit 281077.

It seems like useful information to provide; why not update the spec instead?

@youennf

youennf commented Oct 9, 2024

Thinking a bit more, I am not sure the separation between session tracks and surface tracks is helping shape this API.

Let's look at the following two scenarios:

  1. User agent U1 is exposing a switch-surface UX and the user clicks on it. The user is expecting the new surface content to be rendered where the past track content was rendered. It seems reasonable that the same track exposes the media content and that no new track is exposed: the session track model seems well suited.

  2. User agent U2 is exposing an add/remove surface UX. The user first adds a surface B and then removes the original surface A. I would think the user is expecting the app to react to the new track somehow, by using a new video element for rendering. This seems in line with the user agent exposing a new track for B and ending the original track A: the surface track model seems well suited.

Given this, and given UX in that area is relatively new, I am not sure we can design an API that specifies a particular flow. Having an API that exposes new tracks and having a requirement that video frames of the switching surface do not get provided to sinks until some event/callback actually runs might be good enough for now. Plus some guidelines...

That said, AFAIK, the only thing UAs are doing right now is scenario 1 above. And this is what the initial message of this issue is describing.
Based on this, I would suggest trying to fix this and only this for now by specifying the requirement that video frames of the switching surface do not get provided to sinks until some event/callback actually runs. I would tend to use the configuration change/deviceId combo for that as it does not require new API surface.

@tovepet

tovepet commented Oct 9, 2024

I have created a design document to provide for a more structured discussion of the different proposals (view-only): https://docs.google.com/document/d/16CUOJeuXimNPi4kZHOS9rF-WhMuVvOqOg9P--Dvqi_w/edit?usp=sharing

Edit permissions will be granted to members of the WebRTC working group upon request.

@jan-ivar
Member

jan-ivar commented Oct 9, 2024

What's the benefit of exposing surface tracks rather than new session tracks in the callback/event?

I am a bit confused about the purpose of "new session tracks". Who would need them?

To avoid confusion, I've defined a hybrid track to clarify what I mean. But it's really the surface track I'm questioning. [Edit 2: undid my edit to capture subtle differences]

I've written up the model I have in mind as the late decision model. PTAL (edit: links fixed)

@eladalon1983
Member Author

Jan-Ivar, I am unsure what your current position is, given the edits. Do you withdraw your question about "new session tracks"? My position is that there is no benefit to including new objects in a fired event, if these objects are identical to objects we had before. (That is - new session tracks are identical to the originals, and so "new" ones are useless.)

@jan-ivar
Member

Sorry for editing multiple times. Calling mine a hybrid track now, to distinguish it. It wasn't clear from the session track definition that its feature set would be limited, which would be backwards incompatible.

Do I understand correctly that a driving goal of the session/surface split is to maintain the subclassing of MST?

@eladalon1983
Member Author

eladalon1983 commented Oct 11, 2024

It wasn't clear from the session track definition that its feature set would be limited, which would be backwards incompatible.

That's one vision of it, out of two - either (1) a normal, fully-fledged MediaStreamTrack, or (2) a reduced-feature-set MediaStreamTrack. But which is used is a secondary matter.

The core offering of a session track, as I understand @tovepet here, is that it addresses your (Jan-Ivar's) expressed desire, to be able to seamlessly transition between two models (injection, switch-track).

  • Your models allow "switching between models at any time."
  • Tove's model allows "using both or either at any time, concurrently or consecutively."

I believe it is objectively true that Tove's model is more flexible.

@youennf

youennf commented Oct 11, 2024

I am a bit lost in what we are trying to solve here.
Can we add a scope/use case section to the design document?

@eladalon1983
Member Author

I am a bit lost in what we are trying to solve here. Can we add a scope/use case section to the design document?

We should also rely a bit more on an exploration of the use cases, which I see only includes a single use case atm. I have taken the liberty of adding 4 more.

@tovepet

tovepet commented Oct 11, 2024

I have added a Scope section with the following bullets that I believe we want to solve in the first step:

  • Cross-type surface-switching
  • Add audio-sharing after the fact
  • Frame delivery is clearly separated for the captured surfaces before and after the switch.

Is this in line with what the rest of you think?

@youennf

youennf commented Oct 12, 2024

  • Cross-type surface-switching

ok

  • Add audio-sharing after the fact

Does this already exist?
This seems like a difficult topic: multiple UX possibilities with different user intents, interaction with permissions, removal of audio sharing after the fact without removing display sharing...
This might deserve its own issue, though it is fine to keep it here as one of the longer-term issues we want to solve in that space.

  • Frame delivery is clearly separated for the captured surfaces before and after the switch.

The first two points are user-driven; this last one is already phrased in terms of API.
The question is what we are trying to achieve with this for the user.
How about:

  • Allow web applications to control whether and where to send video frames from a newly captured surface.

We have a solution for when capture starts, and we want a solution for when the user decides to switch surfaces, whether to the same type or a different type.

@tovepet

tovepet commented Oct 21, 2024

After the discussions in the Captured Surface Switching - Working Doc, it looks like the most promising way to reach agreement is to continue with the existing injection model and add the following extensions:

  • auto-pause, to give applications time to set up the capture of a new surface and avoid transmitting frames associated with the wrong capture.
  • allowCrossTypeSurfaceSwitching, an option to enable cross-type surface-switching.
  • alwaysReturnAudio, an option for getDisplayMedia to always return an audio track if audio is requested, so that audio can be muted/unmuted by the user at their leisure.

Together, these three extensions should cover most of the use-cases we have discussed in this issue.

I have created three new issues for these extensions so that we can tackle them individually:

  • auto-pause: #15
  • allowCrossTypeSurfaceSwitching: #16
  • alwaysReturnAudio: #17
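
A hedged sketch of how these opt-ins might surface (names are taken from the issues above; exact placement and semantics remain to be settled there):

const controller = new CaptureController();
const stream = await navigator.mediaDevices.getDisplayMedia({
  video: true,
  audio: true,
  controller,
  allowCrossTypeSurfaceSwitching: true, // hypothetical, see #16
  alwaysReturnAudio: true, // hypothetical, see #17
});
// auto-pause (#15) would then suspend frame delivery around user switches.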

@eladalon1983
Member Author

Burn the ships.
