Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ASoC: SOF: core/Intel: Follow the pause_supported flag from topology to disable or keep the pause support enabled #5041

Open
wants to merge 2 commits into
base: topic/sof-dev
Choose a base branch
from

Conversation

ujfalusi
Copy link
Collaborator

@ujfalusi ujfalusi commented Jun 5, 2024

Hi,

new token set is added by thesofproject/sof#9202 to indicate that the pause operation is supported or not on the given PCM device.

Use this information in hda-pcm.c to disable the pause support or keep it enabled.

The PAUSE/RESUME support is not widely used and thus tested, we have corner cases that received no testing for years, like a recent report: #5035

The new flags can be treated as 'hints' for platforms to disable the PAUSE support.

Reasoning: End-users are using media servers to use audio (Pulseaudio,
Pipewire, CRAS, etc) and they never use pause operation. Media players
using ALSA directly have support for PCMs without pause support.

The pause/resume has never been tested in a same depth as other use cases
and we have corner cases that receives virtually no testing at all, like
suspending while the stream is paused.
We do not test the pause/resume in a more complex environment either and
it might have hidden issues.

@ujfalusi ujfalusi changed the title ASoC: SOF: core/Intel: FOllow the hint from topology to disable or keep the pause support enabeld ASoC: SOF: core/Intel: Follow the hint from topology to disable or keep the pause support enabeld Jun 5, 2024
@ujfalusi
Copy link
Collaborator Author

ujfalusi commented Jun 5, 2024

Yes, with this PR the PAUSE is going to be mostly disabled for Intel platforms regardless of IPC version.

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typoe in title "enabeld"

Patch looks good but I wonder if this is ok to change the behaviour for existing systems -- this will go all the way back to CML/CFL systems (and not just Intel).. It's possible somebody is happily using SOF on a system with apps that use pause, and this kernel PR could break their use-case when they upgrade the kernel.

Staged introduction (or disable_pause token in tplg) would be more incremental.

@plbossart
Copy link
Member

I am not aligned with the approach, sorry.

I agree that PAUSE is not a release blocker. but in practice it's the only way to trigger quick transitions and find bugs we wouldn't see otherwise.

It's also incorrect to say it's not been tested. What has not been tested is suspend while paused which is a second-order corner case.

The net effect of this topology token would be to COMPLETELY disable PAUSE, including in our CI tests. That's different to disabling it for releases, and making an educated decision to avoid spending cycles on problematic issues with no ROI.

The topology approach has zero flexibiliity, it's all or nothing.

I wouldn't mind if this topology-based approach was contingent on a 'default n' kernel Kconfig, which would only be used by developers and CI tests. I am not ok with completely disabling this PAUSE capability for tests.

@ujfalusi
Copy link
Collaborator Author

ujfalusi commented Jun 6, 2024

Typoe in title "enabeld"

Patch looks good but I wonder if this is ok to change the behaviour for existing systems -- this will go all the way back to CML/CFL systems (and not just Intel).

Right now the suspend while paused is causing the system to hard freeze. I have no idea when this regression happened, but it shows the extent to testing and use of pause/resume. Sure, this is a corner case, but I have a faint recollection that it worked.
The only solution so far for the freeze is my #5040, for both IPC3 and IPC4.

It's possible somebody is happily using SOF on a system with apps that use pause, and this kernel PR could break their use-case when they upgrade the kernel.

Yes, everything is a possibility, true.

Staged introduction (or disable_pause token in tplg) would be more incremental.

Initially I wanted to have opt-out for pause/resume but #5035 shows that this feature has not been tested by CI and users to gave any confidence that it is working.

@ujfalusi ujfalusi force-pushed the peter/sof/pr/pause-resume-token-01 branch from 91a1557 to c6e6b47 Compare June 7, 2024 07:49
@ujfalusi
Copy link
Collaborator Author

ujfalusi commented Jun 7, 2024

Changes since v1:

  • add force_pause_support module parameter (to hda-pcm.c) to ignore the pasue_supported flag from topology and to keep the pause enabled. Iow, if the force_pause_support is set to true then there is no change in behavior

@plbossart
Copy link
Member

@fredoh9 @marc-hb @ssavati can we add this to the CI configuration before merging this:

options snd_sof_intel_hda_common force_pause_support=true

plbossart
plbossart previously approved these changes Jun 7, 2024
Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, let's make sure CI is ready before merging this, otherwise we're going to get a slew of errors across the board.

@ujfalusi
Copy link
Collaborator Author

ujfalusi commented Jun 7, 2024

Changes since v2:

  • drop thehda_ prefix for the module parameter name (no idea how it keeps coming back - removed like four times while playing with it...)

plbossart
plbossart previously approved these changes Jun 7, 2024
ranj063
ranj063 previously approved these changes Jun 13, 2024
@marc-hb
Copy link
Collaborator

marc-hb commented Jun 13, 2024

but #5035 shows that this feature has not been tested by CI and users to gave any confidence that it is working.

#5035 is a pretty specific corner case. Simpler pause/resume cases have been regularly tested for a long time with very low failure rates.

@@ -240,6 +245,14 @@ int hda_dsp_pcm_open(struct snd_sof_dev *sdev,
if (hda_always_enable_dmi_l1 && direction == SNDRV_PCM_STREAM_CAPTURE)
runtime->hw.info &= ~SNDRV_PCM_INFO_PAUSE;

/*
* Follow the hint (unless a module parameter is set to ignore it) and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this called a "hint" when it's always followed? Because older kernel versions ignore it?

This could/should be clearer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag is documented in previous patch as 'hint' as the flag is defaulting to false for all platforms, it is only handled by Intel code and older topologies will have PAUSE disabled also.
This is intentional decision, but I rather call it as a hint in SOF context than a hard rule, I don't expect NXP/AMD/Mediatek to disable PAUSE support, but they can if they also set to follow this hint.

Copy link
Collaborator

@marc-hb marc-hb Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I should have spent more time reading the commit message. A commit message is not a substitute for documentation though, is there or will there be any? Somewhere in ALSA user space maybe?

A "hint" is indeed something that may or may not have an effect depending on other factors. However this particular "hint" will be 100% deterministic on every device! It will be 100% effective on some devices and 100% ignored on other devices. It is a "hint" only from the perspective of the developer (=you) looking at all devices at once. But it is not a hint from the perspective of any final user. I don't think this should be called a "hint", it's not necessary and quite misleading. It's just a parameter that is not supported in every configuration. Ignoring some (e.g. incompatible) parameter in some configurations is not so unusual and not called a "hint".

I don't expect NXP/AMD/Mediatek to disable PAUSE support, but they can if they also set to follow this hint.

What would be great is to print a warning in all configurations where the parameter is set and ignored, is that possible? Silently ignoring user input is extremely time consuming.

@ujfalusi ujfalusi dismissed stale reviews from ranj063 and plbossart via e5e16f5 July 12, 2024 16:24
@ujfalusi ujfalusi force-pushed the peter/sof/pr/pause-resume-token-01 branch from bb8730d to e5e16f5 Compare July 12, 2024 16:24
@ujfalusi
Copy link
Collaborator Author

Changes since v3:

  • Reword commit messages and comments to omit the 'hint' word.

Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of corrections needed @ujfalusi but no objections on the directions.

@@ -332,6 +332,14 @@ struct snd_sof_pcm_stream {
struct work_struct period_elapsed_work;
struct snd_soc_dapm_widget_list *list; /* list of connected DAPM widgets */
bool d0i3_compatible; /* DSP can be in D0I3 when this pcm is opened */
/*
* Flag for the core and platform code about PCM pause/resume support.
* Note: ENd-users are using audio via media servers (Pulseaudio, Pipewire,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray capital letter: End-users

* Flag for the core and platform code about PCM pause/resume support.
* Note: ENd-users are using audio via media servers (Pulseaudio, Pipewire,
* CRAS, etc) and media servers do not use pause/resume.
* Media players using the ALSA directly have means to handle PCM devices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using the ALSA API directly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or 'using ALSA directly' ?

include/uapi/sound/sof/tokens.h Show resolved Hide resolved
sound/soc/sof/intel/hda-pcm.c Show resolved Hide resolved
@ujfalusi ujfalusi force-pushed the peter/sof/pr/pause-resume-token-01 branch from e5e16f5 to 783785a Compare July 16, 2024 12:09
A new tokens are added to topology:
1202: SOF_TKN_STREAM_PLAYBACK_PAUSE_SUPPORTED
1203: SOF_TKN_STREAM_CAPTURE_PAUSE_SUPPORTED

The snd_sof_pcm_stream.pause_supported is updated based on the new token
for platform code that the PAUSE support should or should not be
advertised to user space.

If the token does not exist then the PAUSE is advised to be disabled.

Reasoning: End-users are using media servers for audio (Pulseaudio,
Pipewire, CRAS, etc) and they never use pause operation. Media players
using ALSA directly have support for PCMs without pause support.

The pause/resume has never been tested in a same depth as other use cases
and we have corner cases that receives virtually no testing at all, like
suspending while the stream is paused.
We do not test the pause/resume in a more complex environment either and
it might have hidden issues.

Note: it is up to the platform code to use this flag to disable or enable
the pause support.

Signed-off-by: Peter Ujfalusi <[email protected]>
…USE support

If the stream's pause_supported flag is false then mask out the PAUSE
support, so user space will be prevented to use it.

Introduce a module parameter to ignore the pause_supported flag, named as
force_pause_support to allow testing of the PAUSE feature.

Signed-off-by: Peter Ujfalusi <[email protected]>
@ujfalusi
Copy link
Collaborator Author

Changes since v4:

  • Fix the commit message and comment in the first patch

@marc-hb marc-hb changed the title ASoC: SOF: core/Intel: Follow the hint from topology to disable or keep the pause support enabeld ASoC: SOF: core/Intel: Follow the pause_supported flag from topology to disable or keep the pause support enabeld Jul 18, 2024
@marc-hb
Copy link
Collaborator

marc-hb commented Jul 19, 2024

Right now the suspend while paused is causing the system to hard freeze. I have no idea when this regression happened, but it shows the extent to testing and use of pause/resume. Sure, this is a corner case, but I have a faint recollection that it worked.

This does not "show" or prove anything: it's only a corner case that regular users are incredibly unlikely to hit - much less likely than regular "pause" which is already unused.

We actually have a test script for this corner case but it's never been working properly:

On the other hand, regular pause/release (without suspend) has been regularly tested since forever and it has been mostly passing. There were some frequent false positives caused by bugs in the test code but I fixed all that a few days ago in rewrite thesofproject/sof-test#1218 and pause/release testing has been rock solid since.

I agree that PAUSE is not a release blocker. but in practice it's the only way to trigger quick transitions and find bugs we wouldn't see otherwise.

Indeed: since my test code rewrite in 1218, only LNL has been failing consistently. Guess what: LNL is the only platform where we ALSO have other, persistent and hard to fix failures - but with much lower reproduction rates. Coincidence? I don't think so.

It's very common (and very useful) to stress test systems past supported use cases to merely increase reproduction rates of issues also happening in regular use cases and this has what pause/release has been successfully doing for a long time - and even better now.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 19, 2024

BTW what is the aplay/arecord user interface expectation after this is merged? I mean for regular users, not for CI or experts with force_pause_support=true or something.

https://sof-ci.01.org/linuxpr/PR5041/build4054/devicetest/index.html?model=GLK_BOB_DA7219-ipc3&testcase=check-pause-resume-playback-10 and others show this:

t=150 ms: aplay: (0/10) Found volume ### | xx%, recording for 178 ms
t=234 ms: aplay: (0/10) Found   === PAUSE ===  ,  pausing for 195 ms
t=404 ms: aplay: (1/10) Found volume ### | xx%, recording for 161 ms
t=404 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=575 ms: aplay: (1/10) Found   === PAUSE ===  ,  pausing for 128 ms
t=746 ms: aplay: (2/10) Found volume ### | xx%, recording for 151 ms
t=746 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=831 ms: aplay: (2/10) Found   === PAUSE ===  ,  pausing for 186 ms
t=1002 ms: aplay: (3/10) Found volume ### | xx%, recording for 179 ms
t=1002 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=1258 ms: aplay: (3/10) Found   === PAUSE ===  ,  pausing for 124 ms
t=1343 ms: aplay: (4/10) Found volume ### | xx%, recording for 150 ms
t=1343 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=1514 ms: aplay: (4/10) Found   === PAUSE ===  ,  pausing for 175 ms
t=1684 ms: aplay: (5/10) Found volume ### | xx%, recording for 111 ms
t=1684 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
...

This is really weird: why does aplay/arecord display PAUSE when there is no PAUSE support?

Maybe ALSA userspace should be involved in this too?

@ujfalusi
Copy link
Collaborator Author

ujfalusi commented Jul 19, 2024

@marc-hb, something is not quite right in the expect script in case-lib/apause.exp, not sure what to be honest.

It is pretty interesting what aplay/arecord does.
Pause is supported

  • if you run w/ -v or with one -v and pause is supported
    After the first SPACE it will print === PAUSED === with the cursor moved to column 71:
    image
    After releasing the pause the === PAUSED === remain, but the cursor appears to be at the start of the line:
    image
  • if you have -vv then the === PAUSED === replaces the volume bar
  • In every cases it is using the same line

Pause is not supported

  • if you run w/ -v or with one -v and pause is supported
    Each SPACE will result a print and a new line (PAUSE command ignored (no hw support))
    image
  • if you have -vv then it will still print lines and the volume meter is moving as the last line:
    image

But it certainly not printing === PAUSED ===, so the script is to blame for the false report.

@ujfalusi
Copy link
Collaborator Author

@marc-hb, I think I have found a plausible reason...
aplay prints out the === PAUSE === unconditionally before trying to pause. If pause is not supported then it will write PAUSE command ignored (no hw support)\n which will replace the previously printed === PAUSE ===
expect might be catching this.
https://github.com/alsa-project/alsa-utils/blob/master/aplay/aplay.c#L1645
https://github.com/alsa-project/alsa-utils/blob/master/aplay/aplay.c#L1610

So the expect script is OK.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 20, 2024

In every cases it is using the same line

Yes, aplay uses \r instead of \n to constantly overwrite its previous output and not scroll. The script should be ready for that. If you think it's not please file an new sof-test bug and I'll try to fix ASAP.

Generally speaking, Expect (or anything reading from a pipe) has no concept of "overwriting" or "scrolling"; only terminals do. Other programs reading from a pipe only see a stream of characters, some of them happening to be \r and \n.

aplay prints out the === PAUSE === unconditionally before trying to pause. If pause is not supported then it will write PAUSE command ignored (no hw support) which will replace the previously printed === PAUSE ===
expect might be catching this.

Thanks for the investigation!

So the expect script is OK.

The script is unfortunately not OK because it trusts the first, unconditional (and TBH: wrong) === PAUSE === message and changes its state machine accordingly. And then it gets very confused.

Not sure how to fix that. Is there some sort of more convenient API to detect the lack of pause support? And then abort/skip the test immediately instead of wasting a lot of time running pointlessly and returning misleading results. I mean more convenient than this very misleading aplay output.

@ujfalusi
Copy link
Collaborator Author

The script is unfortunately not OK because it trusts the first, unconditional (and TBH: wrong) === PAUSE === message and changes its state machine accordingly. And then it gets very confused.

Not sure how to fix that. Is there some sort of more convenient API to detect the lack of pause support? And then abort/skip the test immediately instead of wasting a lot of time running pointlessly and returning misleading results. I mean more convenient than this very misleading aplay output.

I'm not sure about other ways to check runtime info flag of the PCM, but that would be ideal

This might avoid confusing the script (might need to handle the not supported case and abort?): alsa-project/alsa-utils#271

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 22, 2024

I'm not sure about other ways to check runtime info flag of the PCM, but that would be ideal

Anyone?

This might avoid confusing the script (might need to handle the not supported case and abort?): alsa-project/alsa-utils#271

Very useful but it won't help systems and people who don't update ALSA user space regularly.

The script is unfortunately not OK because it trusts the first, unconditional (and TBH: wrong) === PAUSE === message and changes its state machine accordingly. And then it gets very confused. Not sure how to fix that.

Wait: it shouldn't be too hard to catch PAUSE command ignored and fail or skip even after entering a "confused" state. It only needs to happen before the confusion leads to another, earlier and misleading failure...

@plbossart
Copy link
Member

@ujfalusi @marc-hb I am not able to follow what the thread is about. Is our CI able to deal with this or not?

The agreement was that the pause support would be forced on all CI devices, so what is the issue? We don't really care about the case where the pause is NOT supported, because it's not really our problem is it?

What I don't get is whether all CI devices are ready for this PR to be merged, adding a kernel parameter requires some changes in the setup.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 24, 2024

We don't really care about the case where the pause is NOT supported, because it's not really our problem is it?

It's not our problem in theory. In practice, we still don't have consistent device configuration.

marc-hb added a commit to marc-hb/sof-test that referenced this pull request Aug 1, 2024
Fail fast and avoid timeouts.

This is especially important considering pause will soon be disabled by
default:
- thesofproject/linux#5041

Signed-off-by: Marc Herbert <[email protected]>
@marc-hb
Copy link
Collaborator

marc-hb commented Aug 1, 2024

Is our CI able to deal with this or not?

It will be after small sof-test fix thesofproject/sof-test#1226 is merged (DONE). That fix will let us tell the difference between misconfigured devices versus well configured devices actually failing the test. @ujfalusi please review.

marc-hb added a commit to thesofproject/sof-test that referenced this pull request Aug 1, 2024
Fail fast and avoid timeouts.

This is especially important considering pause will soon be disabled by
default:
- thesofproject/linux#5041

Signed-off-by: Marc Herbert <[email protected]>
@marc-hb
Copy link
Collaborator

marc-hb commented Aug 6, 2024

SOFCI TEST

@marc-hb marc-hb changed the title ASoC: SOF: core/Intel: Follow the pause_supported flag from topology to disable or keep the pause support enabeld ASoC: SOF: core/Intel: Follow the pause_supported flag from topology to disable or keep the pause support enabled Aug 9, 2024
Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi can you rewrite the commit message. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants