Avoid circular definition of muted. #982
The problem with "fixing" these spec definitions that have been in place for years to try to better solve today's problems is that it is extremely difficult to update implementations that have followed the old definition for years. |
I would oppose any spec changes to the muted attribute and the corresponding events. |
@guidou, I understand the concerns. In this particular case, the change is about no longer firing mute events in odd cases. As for a new attribute, would it mean new event listeners? Given that the audio/video stats API will make it possible to simulate the mute events for these odd cases, would it not be possible to recommend JS polyfills for applications that would like to keep receiving these events? I'd like to avoid introducing a boolean whose definition would, from the start, mention that it is for legacy applications and that we plan to obsolete it. |
Overall, I like this proposal. |
Good question. Any apps treating mute as fatal would already fail to interoperate. |
What can the user agent do on platforms where it gets no advance knowledge that frames will not be forthcoming? I see great value in giving the application as much clarity as the user agent can muster. We live in a world of open source operating systems and browsers. Hundreds of millions of people use video-conferencing tools every day. The vendors of these VC applications have large engineering teams. Having clear metrics on where various issues lie allows these engineers to set out and fix issues in codebases beyond their own, to everyone's benefit. Specifying that user agents should not mute when they are not sure the issue is an explicit mute would be a step in the wrong direction. Having more fine-grained MuteReasons, as I had proposed elsewhere, would be a step in the right direction. |
We define APIs based on developer needs, not user agent needs. If the OS mutes, the user agent owns the problem of detecting that and conveying that as an "event" that happened. E.g. if the user agent has reason to believe lack of frames is instead due to an error, then ending the track may be more appropriate. If the user agent cannot tell whether the OS muted it or whether there was an error, that is its problem to solve. Punting hard questions like this to the webapp doesn't seem reasonable to me. The spec already defines muted and ended as separate events for this reason. Agreeing on these definitions is what we've committed to in order to have browsers interoperate.
An answer to this question would be helpful. |
Developers need to be able to debug their users' issues, even if those issues extend beyond the JS application. For a developer of a VC application, the user agent, the operating system, even the hardware: everything is in scope.
As mentioned in my previous message, the user agent might not be able to understand why they are not receiving new frames. Issuing a
Why frames are not arriving would not always be known.
Developers cannot afford to sit on their hands and pray that others will solve their problems. We live in a competitive world. He who solves his users' problems promptly gains the prize of retaining his customers. Let's empower developers in their quest to serve our mutual users. ("Our mutual users" are those shared by the browser and the Web app.)
You gave an example where you believe ending is better than muting. Even if I agreed, for the sake of argument, that this was correct - what about all other cases? Allow me to quote my colleague Guido: "We need to solve all use cases that arise in practice, not just the simplest one." |
Not all subsequent language aligns with muted being an intentional User Agent initiated change. In fact, I would argue that no language at all aligns with this. The word intentional does not appear anywhere in the spec. The muted/unmuted state of a track reflects whether the source provides any media at this moment. A MediaStreamTrack is muted when the source is temporarily unable to provide the track with data. And Section 8 says the mute event is fired when The MediaStreamTrack object's source is temporarily unable to provide data, and the unmute event is fired when A MediaStreamTrack has been removed from this stream. Note that this event is not fired when the script directly modifies the tracks of a MediaStream. This makes it clear that the model is that muted means no media from the source to the track, and disabled means no data from the track to its consumers.
Maybe it was a mistake that the spec defined the muted attribute and the corresponding events the way it did years ago. But, mistaken or not, that's how it was defined.
In this case, Chromium is just applying the model defined in the main spec to remote tracks. The WebRTC spec indicates some cases in which the muted attribute should be set/unset, but AFAICT it does not say anywhere that this overrides the model defined in the original MediaStreamTrack specification. It also does not state a new definition of muted specific to WebRTC tracks and does not even list the muted/unmuted events in its [Event Summary section]. Shouldn't specs that override/redefine concepts inherited from other specs explicitly state it? Until we make this more explicit in the WebRTC spec, my position is that https://crbug.com/941740 is not a spec-compliance bug in Chromium. If anything, it looks more like a spec bug in the WebRTC spec.
Maybe Firefox's behavior is the one in violation of the spec?
Can you clarify what this sentence means? Or is it a statement that if the UA detects a condition that should mute the track, then it should make sure the track does not receive any media? Either way, the change is not enough, since the concept of muted meaning no data from source to track is in many other places of the spec. Finally, I am opposed to an incompatible redefinition of the meaning of muted because experience shows that this type of change is difficult to deploy in practice and can lead to more interoperability issues. I am not opposed to a redefinition that provides a path for existing applications to use a newer, more useful definition, without making it impossible for applications to continue relying on the old definition. |
Yes, we are interested in introducing a new definition that can solve the multiple-mute problem (and even the single mute one), but in a way that doesn't break existing applications or that at least provides a path for existing applications to be easily updated to continue working.
In our experience, applications that break are the ones that are hard to think about in advance. We normally find out after rolling out the change. For example, when we implemented the requirement to wait for focus in getUserMedia() we thought nothing would break, and shortly after we started rolling out the change we received reports from some kiosk-like environments that broke because focus was impossible to obtain for those applications. We had to roll back the change.
Maybe that can be a solution. Support the old definition via stats and the new definition with muted. I'm not sure the stats spec in its current form supports this, but it's a valid possibility.
That wouldn't be ideal. It doesn't have to be the case here, though. |
I don't think user agents have needs other than the ones of their users (including developers).
To me all this sounds a lot like synthesizing events reactively from symptoms.
Yes. Chromium implements both according to the spec.
Already answered in a previous message. |
We define APIs so that developers can satisfy user needs for applications running on a specific UA. The UA and the OS are not friends. And the user has a direct relationship to both. When an OS-level mute is applied, and can only be rectified using the user's relationship with the OS, the user needs to know that they have to act in relation to the OS. If the OS offers an API to the UA so that the UA can let the app developer satisfy the user's need (in this case: to unmute), the user's needs will be simpler to satisfy. The difference between muted and ended in our specs is that one is reversible, the other isn't. So anything that is not based on a clear signal that the source is gone and won't come back should be "muted", not "ended". "Reason to believe" sounds like "probable cause", not "clear signal". |
Video deliveredFrames can be used with a timer-based approach to shim the existing Chromium muting events for video tracks. Alternatively, the already shipped requestVideoFrameCallback (rvfc) can be used to detect that frames are not flowing. Audio deliveredFrames can be used for microphone tracks; an AudioWorklet would most probably expose 0 in case of missing audio frames. This approach does not require creating new APIs and allows web applications to fine-tune their own detection heuristics. @guidou, do you think this migration path would work? |
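A minimal sketch of the timer-based approach described above, under stated assumptions: the class name `FrameStallDetector`, its `sample` method, and the 2-second default threshold are hypothetical and not taken from any spec or browser. The detector only looks at a delivered-frame counter supplied by the caller, so it runs outside a browser.

```typescript
// Hypothetical helper: the names and the threshold are illustrative assumptions.
type StallEvent = "mute" | "unmute" | null;

class FrameStallDetector {
  private lastCount: number | null = null;
  private lastProgressMs = 0;
  private stalled = false;

  constructor(private readonly stallThresholdMs: number = 2000) {}

  // Feed the latest delivered-frame counter and a timestamp in milliseconds.
  // Returns "mute" once the counter has not changed for at least the threshold,
  // "unmute" when it changes again after a stall, and null otherwise.
  sample(deliveredFrames: number, nowMs: number): StallEvent {
    if (this.lastCount === null || deliveredFrames !== this.lastCount) {
      const wasStalled = this.stalled;
      this.lastCount = deliveredFrames;
      this.lastProgressMs = nowMs;
      this.stalled = false;
      return wasStalled ? "unmute" : null;
    }
    if (!this.stalled && nowMs - this.lastProgressMs >= this.stallThresholdMs) {
      this.stalled = true;
      return "mute";
    }
    return null;
  }
}
```

In a page, an application might call `sample` from a `setInterval` callback with the track's delivered-frame counter and `performance.now()`, and dispatch its own synthetic events from the return value. Where the counter comes from (for example the track stats mentioned above) is left to the application here, and the threshold is exactly the kind of heuristic an app would tune.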
That's not a migration path, that's a redefinition. |
This seems wrong. I've filed w3c/webrtc-pc#2915 on this. Let's discuss that there. I think I see now how we came to have this vague language. MediaCapture-main is trying to establish both a model for all sources, while simultaneously specifying camera and microphone sources explicitly. I think it needs to do a better job separating when it's doing one or the other. At its core, I think most people consider muting to be a conscious action based on intent. A reason, not a reaction. |
Correct me if I am wrong, but at the time that |
It's in the OP: "There can be several reasons for a MediaStreamTrack to be muted: the user pushing a physical mute button on the microphone, the user closing a laptop lid with an embedded camera, the user toggling a control in the operating system, the user clicking a mute button in the User Agent chrome, the User Agent (on behalf of the user) mutes, etc." The "etc." refers to other "reasons" ... "the User Agent initiates such a change", including "access may get stolen ... in case of an incoming phone call on mobile OS". I dunno when Safari implemented its pause, but I think it was fairly early? But I don't understand why it matters since it's common and desirable for specs to exist before implementations. Specs define implementations. When I said "most people" I meant outside of WebRTC. Muting is a verb, a function. |
Could you link to it please? This issue is getting long. Please give an example of an application relying on Chrome's behavior and what action it takes. E.g. is it showing the user a message that "things are broken and no-one can hear you, please wait, maybe"? |
The answer is that, in our experience, applications that break with this type of change are the ones that are hard to think about in advance. We normally find out after rolling out the change. For example, when we implemented the requirement to wait for focus in getUserMedia() we thought nothing would break, and shortly after we started rolling out the change we received reports from some kiosk-like environments that broke because focus was impossible to obtain for those applications. IMO, the bar for changing a definition that has been in place for years both in spec and implementations should be very high, even if the proposed change is obviously better. |
I agree. And even if we could come to an agreement, it does not appear that it would come quickly or easily. Now, @jan-ivar has recently posted something I wholeheartedly agree with:
In the spirit of these wise words, I propose we now proceed with one of the backwards-compatible proposals currently under discussion, such as MuteReason or MediaSession. (Full disclosure: I have a strong preference for the former.) |
It seems we all agree this definition would be better. A path forward has been described, via a shim of current Chrome behaviour.
This would solidify a model of muted being open ended and loosely defined.
We need to make MediaSession and MediaStreamTrack consistent, let's do that whatever we decide here. |
Even with the proposal here "mute" would still cover both OS-based and UA-based muting. Letting the Web app know which it is does not make it open ended or loosely defined. Carving out an "unspecified" for hardware, issues, or anything we might not be thinking of, does not solidify the model; later migration would be equally challenging then as it is now.
I am not opposed to dedicated events, but they seem to be less elegant a solution, given the possibility of multiple concurrent mutes. Conversely, a single mute state with multiple reasons, allows observing the transition from empty set to non-empty set, which is great for apps that only care about that. |
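The "single mute state with multiple reasons" idea can be sketched as follows. Everything here is hypothetical: the reason names, the `MuteReasonTracker` class, and the callback shape are illustrative assumptions, not a proposed or existing API. The point is only that an app which cares solely about muted/unmuted observes the empty-to-non-empty transitions, while the reasons stay available to apps that want them.

```typescript
// Hypothetical reason names; no such enum is defined by the specs discussed here.
type MuteReason = "os" | "user-agent" | "hardware" | "unspecified";

class MuteReasonTracker {
  private readonly reasons = new Set<MuteReason>();

  // onMutedChange is called only when the overall muted state flips.
  constructor(private readonly onMutedChange: (muted: boolean) => void) {}

  get muted(): boolean {
    return this.reasons.size > 0;
  }

  add(reason: MuteReason): void {
    const wasMuted = this.muted;
    this.reasons.add(reason);
    if (!wasMuted) this.onMutedChange(true); // empty set -> non-empty set
  }

  remove(reason: MuteReason): void {
    const wasMuted = this.muted;
    this.reasons.delete(reason);
    if (wasMuted && !this.muted) this.onMutedChange(false); // non-empty -> empty
  }

  // Snapshot of the currently active reasons, for apps that care why.
  current(): MuteReason[] {
    return Array.from(this.reasons);
  }
}
```

With two concurrent mutes, lifting one of them leaves the tracker muted and produces no callback; only lifting the last one does.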
It seems to me that even with the greatest selection of mute reasons imaginable, there is likely to be the case of "this source is producing silence and I don't know why". I think that's a reasonable description of the cases where Chrome currently mutes and other browsers have not chosen to mute. Note: I'm unclear about whether Chrome fires mute events on "no signal" in audio. If we do, I think the signal Chrome is reacting to on audio is digital silence (all zeroes), which is different from "no speech detected" - there's always some noise in real audio. |
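To make the distinction between digital silence and "no speech detected" concrete, here is a small sketch. It assumes audio arrives as a `Float32Array` of samples (as it does, for instance, in Web Audio processing); the function names are hypothetical, and nothing here is a claim about what Chrome actually checks.

```typescript
// True only when every sample is exactly zero ("digital silence").
// Note: an empty buffer also returns true.
function isDigitalSilence(samples: Float32Array): boolean {
  for (let i = 0; i < samples.length; i++) {
    if (samples[i] !== 0) return false;
  }
  return true;
}

// Root-mean-square level: a rough loudness measure. A quiet room yields a
// small but non-zero value, which is why "quiet" is not the same as "all zeroes".
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}
```

A buffer of low-level noise fails `isDigitalSilence` even though its RMS is tiny, whereas a source that has stopped delivering real audio and is padded with zeroes passes it.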
It is hard to make progress without precisely knowing how/when Chrome is firing mute events on capture tracks. That seems valuable information to provide to the web page. For video, MediaStreamTrack stats is hopefully sufficient to detect these malfunctioning cases. |
When the |
What is "an upstream entity such as the OS or UA" distinct from, when all muting is "UA" by definition? This seems to be the definition problem we're having. Turning the question around: When the In other browsers, apps could detect this (e.g. using stats once implemented):
In Chromium, apps cannot, because Chromium circularly masks the symptom, making malfunction indistinguishable from "OS or UA" mute. This problem seems unique to Chromium, as does the need for a new mute-reason API to resolve it. |
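As a rough illustration of the detection argument in this comment (a sketch under assumptions, not the commenter's own code): if a user agent mutes only for explicit reasons, an app can treat "live, enabled, not muted, yet no frames advancing" as a suspected malfunction. The `TrackSnapshot` shape and the `framesAdvancing` flag are hypothetical; the flag stands for whatever frame-progress signal the app derives, for example from stats.

```typescript
// Plain snapshot so the sketch runs without a browser. The first three fields
// mirror MediaStreamTrack's readyState, enabled and muted attributes.
interface TrackSnapshot {
  readyState: "live" | "ended";
  enabled: boolean;
  muted: boolean;
  framesAdvancing: boolean; // assumption: computed by the app, e.g. from stats
}

type TrackCondition =
  | "ended"
  | "muted"
  | "disabled"
  | "flowing"
  | "suspected-malfunction";

function classifyTrack(t: TrackSnapshot): TrackCondition {
  if (t.readyState === "ended") return "ended";
  if (t.muted) return "muted";
  if (!t.enabled) return "disabled";
  // Live, enabled and not muted: frames are expected to be advancing.
  return t.framesAdvancing ? "flowing" : "suspected-malfunction";
}
```

If the user agent itself sets muted whenever frames stop, the last branch is never reached with `framesAdvancing` false, which is the masking effect the comment describes.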
To make progress, I think we should leave the UA vs. OS muting discussion out of this particular issue. This can be resolved orthogonally to this discussion. The proposal is something like:
Other than requiring changes in UAs, I do not see any drawback. Am I missing something? |
I don't see the justification for this. It draws a distinction between "malfunction" and "non-malfunction" that seems unwarranted and unenforceable (if a user unplugs the camera, it's a mute event; if the cat bites off the camera cable, it's a malfunction????)
Since 1 is unjustified, 2 is unreasonable. Also, shims don't belong in the spec.
Since I don't see the point of the change, I don't see any advantage in making it. |
When the referenced text says the UA "initiates such a change", I believe it is referring to the steps to mute the MediaStreamTrack JS object which only the UA can modify, i.e. the steps to make the muting visible to the web app. Do read the previous sentence about how the UA should expose this information to the app. Also read all the examples; they're full of things that happened that were not "initiated by the UA" (laptop lid closing, incoming phone call, etc). The only thing initiated by the UA is firing the event; it is reactive, not proactive.
This does not make it less confusing. It begs the question: why is it muted? Even under this definition, my reading is still that the UA should detect mute on a higher layer (including reasons of malfunction; the "etc" is really a catch-all) and then initiate the exposure of the mute event. My understanding is that Chromium is spec-compliant both with and without this sentence changed. In other words, today mute means "I'm not getting any frames despite the track being enabled". This is useful to know whether or not you care about the reason. And because we haven't exposed the reason yet, people haven't been able to care about why yet. So from a web developer POV, the use case this solves is still valid and it is backwards compatible not to change it. If we add the reason, then apps that do care about why have enough information to make the distinction, solving both the use case of caring and the use case of not caring, without causing backwards compat issues. Finally, let's ask ourselves: what value does it bring to developers to pretend a malfunctioning track is not mute? |
No, it should be an ended event, the OS knows the camera is gone and most probably surfaces an error to the UA.
The OS API will tell whether the camera capture is failing or device disappeared.
Malfunctioning, though, is not only about not getting any frames, as can be illustrated with BT microphones, where drops may happen frequently while some audio still gets through from time to time. I think these two signals would best be exposed independently. I wonder whether adding a malFunctioning boolean, maybe with corresponding events, might be a way forward for w3c/mediacapture-extensions#39, plus being more explicit about what muted means for capture tracks. |
Adding a mutedAndWeKnowWhy event (with a "reason" parameter) and leaving the current "muted" as-is would definitely be a reasonable way forward. |
Using |
Using
I understand the usefulness of conveying the malfunctioning information to the web app (though it is unclear whether there is agreement on what malfunctioning actually means). |
I would prefer a straight answer to my straight question. Exercises in turning questions around do not help us make progress, as this very thread demonstrates. Recall the question:
This question originally referred to Youenn's preceding message. Ironically, it now also refers to Jan-Ivar's subsequent message which sought to brush the very question aside! In this message Jan-Ivar also suggested:
The world is asynchronous. It is not possible to definitively correlate the presence/absence of recent frames with the presence/absence of recent mute events. Or if it is possible - please demonstrate how. This was the question. It warrants our attention. |
@youennf What would you suggest as a good way to expose reasonably detailed mute reasons? Phrased differently, how would |
The current plan is to use That said, as it is right now, togglemicrophone would provide a boolean value to muted tracks that we thought might be sufficient in the short term. Hence why I would not concentrate on this topic right now. Instead, I'd like to first validate that this minimal MediaSession API is good enough:
|
What is the right place to discuss the Media Session proposal? |
What is the right place to discuss the Media Session proposal? See w3c/mediasession#312, w3c/mediasession#307, w3c/mediasession#279 and w3c/mediasession#278. There is a plan to add support for screen share (w3c/mediasession#306).
When we discussed this particular topic, one idea was to add a member to MediaSessionActionDetails, like a deviceId. But it was unclear whether it was useful enough in the short term to work on it. Filing an issue in MediaSession repo might be a good idea to keep track of this. Similarly, we might want to add a state to MediaSessionActionDetails (to know whether action is about muting or unmuting). I'll probably work on this once the basic PRs are all landed. |
Note that the w3c/mediasession API does not admit of the existence of multiple microphones. So "the current plan" should be phrased differently - there is no WG consensus for any particular plan. |
Thanks. I left a couple of comments in some of the issues. I think a solution based on MediaStreamTrack directly would be more suitable for the VC use cases, since applications already have access to the tracks they are playing. |
This issue was discussed in WebRTC August 27 2024 meeting – 27 August 2024 (Moving Forward with Mute) |
This definition is backwards: "If live samples are not made available to the MediaStreamTrack it is muted".
Mute causes lack of frames, not the other way around: If a MediaStreamTrack is muted, no live samples are made available to it.
All subsequent language and examples align with muted being an intentional User Agent initiated change:
Crucially, the "change" of state (not just the event) is initiated by the User Agent.
This has caused confusion in implementations. E.g. @youennf replied in w3c/mediacapture-extensions#39 (comment):
In general, the value of an "event" is its intent, that something external happened. Therefore, synthesizing events reactively from symptoms seems a mistake. For example: crbug 941740 implements mute on remote tracks reactively based on (lack of) input, violating the WebRTC spec and causing web compat issues. Doing the same on capture tracks seems like a bug, and should be a violation of this spec, but is attributed to the aforementioned line in the spec.
Firefox fires mute as explained in the OP of w3c/mediacapture-extensions#39 (comment) (behind a pref) but never reactively from symptoms.
Proposal:
Replace the confusing sentence with "If a MediaStreamTrack is muted, no live samples are made available to it."