Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spurious failures in PCI tests caused by pci-tracing nvram option #491

Open
oohal opened this issue May 6, 2019 · 0 comments
Open

spurious failures in PCI tests caused by pci-tracing nvram option #491

oohal opened this issue May 6, 2019 · 0 comments
Assignees
Labels

Comments

@oohal
Copy link
Contributor

oohal commented May 6, 2019

Copied from an internal bug report:

Failure message:

======================================================================
FAIL: runTest (testcases.OpTestPCI.PcieLinkErrorsSkiroot)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/afs/rchland.ibm.com/projects/esw/dvt/git/apolloprep/op-test-framework/testcases/OpTestPCI.py", line 191, in runTest
self.pcie_link_errors()
File "/afs/rchland.ibm.com/projects/esw/dvt/git/apolloprep/op-test-framework/testcases/OpTestPCI.py", line 80, in pcie_link_errors
self.assertTrue( len(total_entries) == 0, "pcie link down/timeout Errors in OPAL log:\n%s" % msg)
AssertionError: pcie link down/timeout Errors in OPAL log:
[ 118.408432659,3] PHB#0030[8:0]: TRACE: Timeout waiting for link up.

Boot log (from a different boot, same failure failure mode):

# cat /sys/firmware/opal/msglog | grep 'PHB#0030'
[  113.328283004,7] PHB#0030:00:00.0 PCI: Registered PHB
[  113.328284700,6] PHB#0030[8:0]: Found /pciex@620c3c0000000 @0x620c3c0000000
[  113.328286204,6] PHB#0030[8:0]:   M32 [0x000620c000000000..0x000620c07fffffff]
[  113.328287413,6] PHB#0030[8:0]:   M64 [0x0006200000000000..0x0006203fffffffff]
[  113.328289896,7] PHB#0030[8:0]: Core revision 0xa40002
[  113.328291181,7] PHB#0030[8:0]: Found 512 max PEs and 4096 IRQs
[  113.328292319,6] PHB#0030[8:0]: Max link speed: GEN3
[  113.328293163,7] PHB#0030[8:0]: Override lane equalization settings:
[  113.328294156,7] PHB#0030[8:0]:   0x5454545454545454 0x5454545454545454
[  113.328295401,7] PHB#0030[8:0]:   0x5454545454545454 0x5454545454545454
[  113.328296521,7] PHB#0030[8:0]:   0x7777777777777777 0x7777777777777777
[  113.328412018,7] PHB#0030[8:0]: Initializing PHB4...
[  113.328413302,7] PHB#0030[8:0]: Default system config: 0x4410000040400000
[  113.328414715,7] PHB#0030[8:0]: New system config    : 0x4410000030400000
[  113.328415956,7] PHB#0030[8:0]: Initial PHB CRESET is 0xe000000000000000
[  113.328417754,7] PHB#0030[8:0]: Waiting for DLP PG reset to complete...
[  113.328931741,7] PHB#0030[8:0]: Reset state SRC_ID: 0ff8000000000000
[  113.329255294,7] PHB#0030[8:0]: Initialization complete
[  113.345615008,7] PHB#0030[8:0]: FRESET: Starts
[  113.345810088,7] PHB#0030[8:0]: FRESET: Prepare for link down
[  113.345824006,7] PHB#0030[8:0]: FRESET: Assert skipped
[  113.345835344,7] PHB#0030[8:0]: FRESET: Deassert
[  113.345852423,3] PHB#0030[8:0]: TRACE:0x0000001101000000  0ms          GEN1:x16:detect
[  116.345848015,3] PHB#0030[8:0]: TRACE: Timeout waiting for link up.
[  116.345949079,7] PHB#0030[8:0]: LINK: Start polling
[  117.346501203,7] PHB#0030[8:0]: LINK: No in-band presence
[  117.350916395,7] PHB#0030:00:00.0 Link down
[  117.351057838,7] PHB#0030:00:00.0 Scanning (upstream only)...
[  117.351118551,7] PHB#0030:00:00.0 Found VID:1014 DEV:04c1 TYP:4 MF- BR+ EX+
[  117.351343054,7] DT-SLOT: PHB#0030:00:00.0 Finding slot
[  117.351360918,7] DT-SLOT: PHB#0030:00:00.0 Slot found /ibm,pcie-slots/root-complex@8,0
[  117.413082227,5] PHB#0030:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:00..00 SLOT=SLOT2
[  122.326533258,7] PHB#0030[8:0]: Purging all IODA tables...

My diagnoses:

Clearing the pci-tracing nvram option is the fix.

If the trace ends due to a timeout we print the final status at PR_ERROR. The trace timeout error strings are similar to the strings printed when there's a link training failure so the test fails spuriously when pci-tracing is enabled.

For reference, the two messages printed are:

PHBERR(p, "TRACE: Timeout waiting for link up.\n");
PHBERR(p, "LINK: Timeout waiting for link up\n");

The TRACE: messages indicates that the pci-tracing ended with the link being down. This isn't really an error and op-test shouldn't flag it as such. The second is only printed if we detect an electrical link (i.e there's a card in the slot), but timeout waiting for the link finish training.

I'll bump the log level of the TRACE: message to PR_NOTICE (lvl 5), but op-test should be made a more robust against this sort of spurious failure too.

Looks like the test itself in testcases/OpTestPCI.py:322 is just scanning for the string "Timeout waiting for". This should be made more specific to the sort of errors the test is looking for.

@sammj sammj added the bug label May 6, 2019
sammj added a commit to sammj/op-test-framework that referenced this issue May 6, 2019
When a sufficient amount of debugging has been enabled[0] Skiboot will
produce messages of the form
	PHB#0030[8:0]: TRACE: Timeout waiting for link up
This is harmless and only represents an empty slot so filter it out on
all platforms.

Fixes open-power#491

[0] nvram -p ibm,skiboot --update-config pci-tracing=true

Signed-off-by: Samuel Mendoza-Jonas <[email protected]>
@sammj sammj self-assigned this May 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants