Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

soundwire: intel_bus_common: enable interrupts before exiting reset #5106

Conversation

plbossart
Copy link
Member

The existing code enables the Cadence IP interrupts after the bus reset sequence. The problem with this sequence is that it might be pre-empted, giving SoundWire devices time to sync and report as ATTACHED before the interrupts are enabled. In that case, the Cadence IP will not detect a state change and will not throw an interrupt to proceed with the enumeration of a Device0.

In our overnight stress tests, we observed that a slight sub-millisecond delay in enabling interrupts after the reset was correlated with detection failures. This problem is more prevalent on the LunarLake silicon, likely due to SOC integration changes, but it was observed on earlier generations as well.

This patch reverts the sequence, with the interrupts enabled before performing the bus reset. This removes the race condition and makes sure the Cadence IP is able to detect the presence of a Device0 in all cases.

The existing code enables the Cadence IP interrupts after the bus
reset sequence. The problem with this sequence is that it might be
pre-empted, giving SoundWire devices time to sync and report as
ATTACHED before the interrupts are enabled. In that case, the Cadence
IP will not detect a state change and will not throw an interrupt to
proceed with the enumeration of a Device0.

In our overnight stress tests, we observed that a slight
sub-millisecond delay in enabling interrupts after the reset was
correlated with detection failures. This problem is more prevalent on
the LunarLake silicon, likely due to SOC integration changes, but it
was observed on earlier generations as well.

This patch reverts the sequence, with the interrupts enabled before
performing the bus reset. This removes the race condition and makes
sure the Cadence IP is able to detect the presence of a Device0 in all
cases.

Signed-off-by: Pierre-Louis Bossart <[email protected]>
…Attached

This is for CI only, this will help detect cases where Device0 was not
enumerated with the interrupt in the delayed work.

Signed-off-by: Pierre-Louis Bossart <[email protected]>
@marc-hb
Copy link
Collaborator

marc-hb commented Jul 17, 2024

multiple-pause-resume failed on all LNL models in https://sof-ci.01.org/linuxpr/PR5106/build4026/devicetest/index.html

At least one of the failures looks like #5109 which I just filed.

Everything else is either passing or missing.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 17, 2024

SOFCI TEST

@plbossart
Copy link
Member Author

@marc-hb we can ignore the pause/release issues, this isn't related to this PR at all. What is changed here is the SoundWire bus startup, it will only be influenced by pm_runtime transitions.

@plbossart
Copy link
Member Author

@bardliao can you take this PR and let it run overnight on your device to see if this confirms my results that we don't see the enumeration issue?

@bardliao
Copy link
Collaborator

@bardliao can you take this PR and let it run overnight on your device to see if this confirms my results that we don't see the enumeration issue?

I can only run the test on a CI device as I work remotely this week. Currently all LNLM_SDW_AIOC devices are reserved. I will check if I can reserve one later.

@plbossart
Copy link
Member Author

don't worry about it @bardliao I'll run a test myself then. We can re-run more tests next week.

@bardliao
Copy link
Collaborator

I ran a test last night, it passed 853 iterations before my network disconnected. I am running another test now.

@bardliao
Copy link
Collaborator

I didn't get "Peripheral 0" error, but got NACK. It failed at the 19th iteration

===========================>>
[  453.786241] kernel: soundwire_intel soundwire_intel.link.0: Msg NACK received, cmd 2
[  453.786426] kernel: soundwire_intel soundwire_intel.link.0: Msg NACKed for Slave 0
[  453.786447] kernel: soundwire sdw-master-0-0: trf on Slave 0 failed:-5 read addr 50 count 6
[  453.786529] kernel: soundwire sdw-master-0-0: DEVID read fail:-5
<<===========================

log2.txt

@plbossart
Copy link
Member Author

NACK means the command isn't accepted by the Peripheral, that's usually a sign of low-level electrical issues.

But it's also possible that by enabling interrupts earlier, we might get some weird signals on the bus and then the interrupt and command channels might be disrupted.

So we should retry and see if this happens again... I was hoping to merge this quickly but that's a red flag.

@bardliao
Copy link
Collaborator

I retested and it passed more than 1500 iterations.

@plbossart
Copy link
Member Author

ok let's merge then and see what happens. We can always revert.

@plbossart
Copy link
Member Author

Humm let's redo the tests first, we seem to have a regression on CML? https://sof-ci.01.org/linuxpr/PR5106/build4064/devicetest/index.html?model=CML_RVP_SDW-ipc3&testcase=multiple-pipeline-all

@plbossart
Copy link
Member Author

SOFCI TEST

@plbossart
Copy link
Member Author

the CML problem didn't re-occur, let's merge then and see what the daily tests show

@plbossart plbossart merged commit ed7746e into thesofproject:topic/sof-dev Jul 18, 2024
11 of 15 checks passed
@marc-hb
Copy link
Collaborator

marc-hb commented Jul 19, 2024

Humm let's redo the tests first, we seem to have a regression on CML?

Only if you count a bug filed in 2021 as a "regression" :-D

@bardliao
Copy link
Collaborator

BTW, I am still running the test on a LNLM_SDW_AIOC device. It passed more than 1950 iterations so far. I will stop it before off work today.

@bardliao
Copy link
Collaborator

It passed more than 2500 iterations. I will stop here and release the device.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants