spurious failures in PCI tests caused by pci-tracing nvram option #491

oohal · 2019-05-06T03:46:25Z

Copied from an internal bug report:

Failure message:

======================================================================
FAIL: runTest (testcases.OpTestPCI.PcieLinkErrorsSkiroot)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/afs/rchland.ibm.com/projects/esw/dvt/git/apolloprep/op-test-framework/testcases/OpTestPCI.py", line 191, in runTest
self.pcie_link_errors()
File "/afs/rchland.ibm.com/projects/esw/dvt/git/apolloprep/op-test-framework/testcases/OpTestPCI.py", line 80, in pcie_link_errors
self.assertTrue( len(total_entries) == 0, "pcie link down/timeout Errors in OPAL log:\n%s" % msg)
AssertionError: pcie link down/timeout Errors in OPAL log:
[ 118.408432659,3] PHB#0030[8:0]: TRACE: Timeout waiting for link up.

Boot log (from a different boot, same failure failure mode):

# cat /sys/firmware/opal/msglog | grep 'PHB#0030'
[  113.328283004,7] PHB#0030:00:00.0 PCI: Registered PHB
[  113.328284700,6] PHB#0030[8:0]: Found /pciex@620c3c0000000 @0x620c3c0000000
[  113.328286204,6] PHB#0030[8:0]:   M32 [0x000620c000000000..0x000620c07fffffff]
[  113.328287413,6] PHB#0030[8:0]:   M64 [0x0006200000000000..0x0006203fffffffff]
[  113.328289896,7] PHB#0030[8:0]: Core revision 0xa40002
[  113.328291181,7] PHB#0030[8:0]: Found 512 max PEs and 4096 IRQs
[  113.328292319,6] PHB#0030[8:0]: Max link speed: GEN3
[  113.328293163,7] PHB#0030[8:0]: Override lane equalization settings:
[  113.328294156,7] PHB#0030[8:0]:   0x5454545454545454 0x5454545454545454
[  113.328295401,7] PHB#0030[8:0]:   0x5454545454545454 0x5454545454545454
[  113.328296521,7] PHB#0030[8:0]:   0x7777777777777777 0x7777777777777777
[  113.328412018,7] PHB#0030[8:0]: Initializing PHB4...
[  113.328413302,7] PHB#0030[8:0]: Default system config: 0x4410000040400000
[  113.328414715,7] PHB#0030[8:0]: New system config    : 0x4410000030400000
[  113.328415956,7] PHB#0030[8:0]: Initial PHB CRESET is 0xe000000000000000
[  113.328417754,7] PHB#0030[8:0]: Waiting for DLP PG reset to complete...
[  113.328931741,7] PHB#0030[8:0]: Reset state SRC_ID: 0ff8000000000000
[  113.329255294,7] PHB#0030[8:0]: Initialization complete
[  113.345615008,7] PHB#0030[8:0]: FRESET: Starts
[  113.345810088,7] PHB#0030[8:0]: FRESET: Prepare for link down
[  113.345824006,7] PHB#0030[8:0]: FRESET: Assert skipped
[  113.345835344,7] PHB#0030[8:0]: FRESET: Deassert
[  113.345852423,3] PHB#0030[8:0]: TRACE:0x0000001101000000  0ms          GEN1:x16:detect
[  116.345848015,3] PHB#0030[8:0]: TRACE: Timeout waiting for link up.
[  116.345949079,7] PHB#0030[8:0]: LINK: Start polling
[  117.346501203,7] PHB#0030[8:0]: LINK: No in-band presence
[  117.350916395,7] PHB#0030:00:00.0 Link down
[  117.351057838,7] PHB#0030:00:00.0 Scanning (upstream only)...
[  117.351118551,7] PHB#0030:00:00.0 Found VID:1014 DEV:04c1 TYP:4 MF- BR+ EX+
[  117.351343054,7] DT-SLOT: PHB#0030:00:00.0 Finding slot
[  117.351360918,7] DT-SLOT: PHB#0030:00:00.0 Slot found /ibm,pcie-slots/root-complex@8,0
[  117.413082227,5] PHB#0030:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:00..00 SLOT=SLOT2
[  122.326533258,7] PHB#0030[8:0]: Purging all IODA tables...

My diagnoses:

Clearing the pci-tracing nvram option is the fix.

If the trace ends due to a timeout we print the final status at PR_ERROR. The trace timeout error strings are similar to the strings printed when there's a link training failure so the test fails spuriously when pci-tracing is enabled.

For reference, the two messages printed are:

PHBERR(p, "TRACE: Timeout waiting for link up.\n");
PHBERR(p, "LINK: Timeout waiting for link up\n");

The TRACE: messages indicates that the pci-tracing ended with the link being down. This isn't really an error and op-test shouldn't flag it as such. The second is only printed if we detect an electrical link (i.e there's a card in the slot), but timeout waiting for the link finish training.

I'll bump the log level of the TRACE: message to PR_NOTICE (lvl 5), but op-test should be made a more robust against this sort of spurious failure too.

Looks like the test itself in testcases/OpTestPCI.py:322 is just scanning for the string "Timeout waiting for". This should be made more specific to the sort of errors the test is looking for.

The text was updated successfully, but these errors were encountered:

When a sufficient amount of debugging has been enabled[0] Skiboot will produce messages of the form PHB#0030[8:0]: TRACE: Timeout waiting for link up This is harmless and only represents an empty slot so filter it out on all platforms. Fixes open-power#491 [0] nvram -p ibm,skiboot --update-config pci-tracing=true Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

sammj added the bug label May 6, 2019

oohal mentioned this issue May 6, 2019

NVRAM option awareness in op-test #492

Closed

sammj mentioned this issue May 6, 2019

OpTestPCI: Ignore harmless timeout message. #493

Closed

sammj self-assigned this May 6, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spurious failures in PCI tests caused by pci-tracing nvram option #491

spurious failures in PCI tests caused by pci-tracing nvram option #491

oohal commented May 6, 2019 •

edited

Loading

spurious failures in PCI tests caused by pci-tracing nvram option #491

spurious failures in PCI tests caused by pci-tracing nvram option #491

Comments

oohal commented May 6, 2019 • edited Loading

oohal commented May 6, 2019 •

edited

Loading