run_tests.sh locks host machine #10

Closed
BoneGoat opened this issue Feb 1, 2021 · 9 comments

@BoneGoat

BoneGoat commented Feb 1, 2021

Hi,

I'm running the NiteFury on a RockPi4 (ARM). I had to recompile everything, but after that the kernel module loaded and everything seemed fine. When I ran run_tests.sh the whole machine locked up. The lights on the NiteFury are still blinking, but the machine is unresponsive. What might cause this?

lspci -vv

01:00.0 Serial controller: Xilinx Corporation Device 7024 (prog-if 01 [16450])
Subsystem: Xilinx Corporation Device 0007
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 234
Region 0: Memory at fa000000 (32-bit, non-prefetchable) [disabled] [size=1M]
Region 1: Memory at fa100000 (32-bit, non-prefetchable) [disabled] [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis-, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00
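
One detail worth checking in the listing above is the link status: LnkCap advertises 5GT/s x4 while LnkSta reports 2.5GT/s x4. That may simply reflect what the host's root port negotiates, and it is not necessarily related to the hang. A minimal Python sketch for pulling both fields out of `lspci -vv` text follows; the function name `link_fields` is hypothetical, and the sample lines are copied from the output above.

```python
import re


def link_fields(lspci_text):
    """Extract (speed in GT/s, lane width) for LnkCap and LnkSta from lspci -vv text."""
    fields = {}
    for name in ("LnkCap", "LnkSta"):
        # '.' does not cross newlines, so each match stays on its own line.
        match = re.search(name + r":.*?Speed ([\d.]+)GT/s.*?Width x(\d+)", lspci_text)
        if match:
            fields[name] = (float(match.group(1)), int(match.group(2)))
    return fields


# Sample lines copied from the lspci output in this post.
sample = (
    "LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited\n"
    "LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-\n"
)

fields = link_fields(sample)
cap_speed, cap_width = fields["LnkCap"]
sta_speed, sta_width = fields["LnkSta"]

if sta_speed < cap_speed:
    print(f"Link speed below device capability: {sta_speed} GT/s of {cap_speed} GT/s")
if sta_width < cap_width:
    print(f"Link width below device capability: x{sta_width} of x{cap_width}")
```

On a live system the text could come from `subprocess.run(["lspci", "-vv", "-s", "01:00.0"], capture_output=True, text=True).stdout`, assuming the same bus address as in this post.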

sudo ./load_driver.sh

Loading xdma driver...
The Kernel module installed correctly and the xmda devices were recognized.
DONE

dmesg

[ 7641.378840] xdma: loading out-of-tree module taints kernel.
[ 7641.382338] xdma:xdma_mod_init: Xilinx XDMA Reference Driver xdma v2017.1.47
[ 7641.382351] xdma:xdma_mod_init: desc_blen_max: 0xfffffff/268435455, sgdma_timeout: 10 sec.
[ 7641.383695] xdma:xdma_device_open: xdma device 0000:01:00.0, 0xffffffc0ed638800.
[ 7641.383731] xdma 0000:01:00.0: enabling device (0000 -> 0002)
[ 7641.383757] xdma:pci_check_extended_tag: 0xffffffc0ed638800 EXT_TAG disabled.
[ 7641.383765] xdma:pci_check_extended_tag: pdev 0xffffffc0ed638800, xdev 0xffffffc0e8568000, config bar UNKNOWN.
[ 7641.383931] xdma:map_single_bar: BAR0 at 0xfa000000 mapped at 0xffffff800e400000, length=1048576(/1048576)
[ 7641.383969] xdma:map_single_bar: BAR1 at 0xfa100000 mapped at 0xffffff800bf80000, length=65536(/65536)
[ 7641.383980] xdma:map_bars: config bar 1, pos 1.
[ 7641.383987] xdma:identify_bars: 2 BARs: config 1, user 0, bypass -1.
[ 7641.384198] xdma:probe_one: 0000:01:00.0 xdma0, pdev 0xffffffc0ed638800, xdev 0xffffffc0dd202000, 0xffffffc0e8568000, usr 16, ch 1,1.
[ 7641.409170] xdma:cdev_xvc_init: xcdev 0xffffffc0dd203b88, bar 0, offset 0x40000.

sudo ./run_test.sh

Info: Number of enabled h2c channels = 1
Info: Number of enabled c2h channels = 1
Info: The PCIe DMA core is memory mapped.
Info: Running PCIe DMA memory mapped write read test
transfer size: 1024
transfer count: 1
Info: Writing to h2c channel 0 at address offset 0.
Info: Wait for current transactions to complete.

After this the system becomes unresponsive.

@RHSResearchLLC
Owner

This usually has one of two causes:

  1. Old XDMA driver. Don't use the code that's part of this repo; I should probably delete it. Use the XDMA code from the Xilinx GitHub repo instead: https://github.com/Xilinx/dma_ip_drivers/tree/master/XDMA/linux-kernel

  2. Defective LiteFury. If the DDR isn't responding for some reason, it hangs everything, because there isn't an AXI timeout.

If you try the new driver and it doesn't solve the problem, let me know and I'll send you another LiteFury. The units are fully tested before they ship, but stuff happens.
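
To check which driver build is actually loaded, the version can be read from the `xdma_mod_init` line in dmesg. The following is a minimal sketch, assuming the banner format shown in the first post; `xdma_driver_version` is a hypothetical helper name, and `BUNDLED_VERSION` is simply the version that the first post's dmesg reports.

```python
import re


def xdma_driver_version(dmesg_text):
    """Return the XDMA driver version as a tuple of ints, or None if no banner is found."""
    match = re.search(
        r"Xilinx XDMA Reference Driver xdma v(\d+)\.(\d+)\.(\d+)", dmesg_text
    )
    return tuple(int(part) for part in match.groups()) if match else None


# Banner line copied from the dmesg output in the first post.
banner = "[ 7641.382338] xdma:xdma_mod_init: Xilinx XDMA Reference Driver xdma v2017.1.47"

# Version reported by the driver in the first post, where the hang occurred.
BUNDLED_VERSION = (2017, 1, 47)

version = xdma_driver_version(banner)
if version is None:
    print("No XDMA banner found; is the module loaded?")
elif version <= BUNDLED_VERSION:
    print("Old driver detected:", ".".join(map(str, version)))
else:
    print("Newer driver loaded:", ".".join(map(str, version)))
```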

@RHSResearchLLC
Owner

Also, the best way to get support is by emailing me ([email protected]).

@BoneGoat
Author

BoneGoat commented Feb 2, 2021

Thanks for your reply! I built the driver from the repo you posted and it no longer hangs when running run_tests.sh. It does, however, complete with errors. Should I be concerned?

Info: Number of enabled h2c channels = 1
Info: Number of enabled c2h channels = 1
Info: The PCIe DMA core is memory mapped.
Info: Running PCIe DMA memory mapped write read test
transfer size: 1024
transfer count: 1
Info: Writing to h2c channel 0 at address offset 0.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 5.287372
Info: Writing to h2c channel 0 at address offset 1024.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 3.657091
Info: Writing to h2c channel 0 at address offset 2048.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 3.771014
Info: Writing to h2c channel 0 at address offset 3072.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 3.803708
Info: Reading from c2h channel 0 at address offset 0.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 3.376251
Info: Reading from c2h channel 0 at address offset 1024.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 2.740687
Info: Reading from c2h channel 0 at address offset 2048.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 4.069499
Info: Reading from c2h channel 0 at address offset 3072.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 2.467690
Info: Checking data integrity.
data/output_datafile0_4K.bin data/datafile0_4K.bin differ: char 53, line 1
Error: The data written did not match the data that was read.
address range: 0 - 1024
write data file: data/datafile0_4K.bin
read data file: data/output_datafile0_4K.bin
data/output_datafile1_4K.bin data/datafile1_4K.bin differ: char 53, line 2
Error: The data written did not match the data that was read.
address range: 1024 - 2048
write data file: data/datafile1_4K.bin
read data file: data/output_datafile1_4K.bin
data/output_datafile2_4K.bin data/datafile2_4K.bin differ: char 54, line 2
Error: The data written did not match the data that was read.
address range: 2048 - 3072
write data file: data/datafile2_4K.bin
read data file: data/output_datafile2_4K.bin
data/output_datafile3_4K.bin data/datafile3_4K.bin differ: char 53, line 2
Error: The data written did not match the data that was read.
address range: 3072 - 4096
write data file: data/datafile3_4K.bin
read data file: data/output_datafile3_4K.bin
Error: Test completed with Errors.
Error: Test completed with Errors.

dmesg:
[ 870.983951] xdma:xdma_mod_init: Xilinx XDMA Reference Driver xdma v2020.1.8
[ 870.983965] xdma:xdma_mod_init: desc_blen_max: 0xfffffff/268435455, timeout: h2c 10 c2h 10 sec.
[ 870.985805] xdma:xdma_device_open: xdma device 0000:01:00.0, 0xffffffc0ed678800.
[ 870.985876] xdma 0000:01:00.0: enabling device (0000 -> 0002)
[ 870.986124] xdma:map_single_bar: BAR0 at 0xfa000000 mapped at 0xffffff800e400000, length=1048576(/1048576)
[ 870.986220] xdma:map_single_bar: BAR1 at 0xfa100000 mapped at 0xffffff800bec0000, length=65536(/65536)
[ 870.986241] xdma:map_bars: config bar 1, pos 1.
[ 870.986259] xdma:identify_bars: 2 BARs: config 1, user 0, bypass -1.
[ 870.986728] xdma:pci_keep_intx_enabled: 0000:01:00.0: clear INTX_DISABLE, 0x406 -> 0x6.
[ 870.986873] xdma:probe_one: 0000:01:00.0 xdma0, pdev 0xffffffc0ed678800, xdev 0xffffffc0eb376000, 0xffffffc0eb374000, usr 16, ch 1,1.
[ 870.997049] xdma:cdev_xvc_init: xcdev 0xffffffc0eb377b88, bar 0, offset 0x40000.
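
The `cmp` lines in the test output only report the first differing byte (numbered from 1, so "char 53" is offset 52). Listing every differing offset can show whether the corruption is a single byte, a repeating pattern, or the whole buffer. A minimal sketch follows, using synthetic data rather than the real files; `diff_offsets` and `diff_files` are hypothetical helper names, and the paths in the usage note are the ones from the log above.

```python
def diff_offsets(written, read_back):
    """Return the 0-based offsets at which two byte strings differ (over their common length)."""
    return [i for i, (a, b) in enumerate(zip(written, read_back)) if a != b]


def diff_files(path_written, path_read):
    """Same comparison, reading both buffers from disk."""
    with open(path_written, "rb") as f_written, open(path_read, "rb") as f_read:
        return diff_offsets(f_written.read(), f_read.read())


# Synthetic example: a 1024-byte buffer with one corrupted byte at offset 52,
# which cmp would report as "char 53". This is not data from the real test.
written = bytes(range(256)) * 4
corrupted = bytearray(written)
corrupted[52] ^= 0xFF
read_back = bytes(corrupted)

offsets = diff_offsets(written, read_back)
print(f"{len(offsets)} differing byte(s); first at offset {offsets[0]} (cmp char {offsets[0] + 1})")
```

Against the real test data this would be called as `diff_files("data/datafile0_4K.bin", "data/output_datafile0_4K.bin")`.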

@RHSResearchLLC
Owner

RHSResearchLLC commented Feb 2, 2021 via email

@BoneGoat
Author

BoneGoat commented Feb 8, 2021

Hi, I tried the dma-test-2.py with the following results:

#0
Sent in 2190.97900390625 milliseconds (490.07399070719003 MBPS)
Traceback (most recent call last):
  File "dma-test-2.py", line 81, in <module>
    main()
  File "dma-test-2.py", line 68, in main
    mem_test_random()
  File "dma-test-2.py", line 41, in mem_test_random
    rx_data.append(os.pread(fd_c2h, TRANSFER_SIZE, page * TRANSFER_SIZE))
OSError: [Errno 512] Unknown error 512

Running it again locks up the host.
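
Two numbers in that output can be decoded. First, the reported send time and rate are consistent with 1 GiB (2**30 bytes) having been written before the read failed, counting 1 MB as 10**6 bytes. Second, errno 512 is `ERESTARTSYS` in the Linux kernel headers, a kernel-internal restart code that user programs are not normally supposed to see, which is why Python can only print "Unknown error 512". The sketch below works through both; `describe_errno` is a hypothetical helper, and the 1 GiB total is inferred from the figures, not stated in the script output.

```python
import errno
import os

# Kernel-internal code from the Linux kernel's errno definitions.
ERESTARTSYS = 512

# Figures copied from the dma-test-2.py output above.
elapsed_ms = 2190.97900390625
reported_mbps = 490.07399070719003

# bytes = (MB/s * 10**6 bytes/MB) * seconds
implied_bytes = reported_mbps * 1e6 * (elapsed_ms / 1000.0)
print(f"implied transfer: {implied_bytes:.0f} bytes; 1 GiB is {1 << 30} bytes")


def describe_errno(code):
    """Name an errno value, including the kernel-internal code seen in this thread."""
    if code == ERESTARTSYS:
        return "ERESTARTSYS (kernel-internal restart code)"
    return errno.errorcode.get(code, os.strerror(code))


print(describe_errno(512))
```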

Error 512 sent me down a rabbit hole with the post here: https://forums.xilinx.com/t5/PCIe-and-CPM/debug-the-driver-of-IP-PCIE-with-DMA/m-p/1003427/highlight/true#M14300

I tested the suggested solution but the test failed anyway:

Info: Number of enabled h2c channels = 1
Info: Number of enabled c2h channels = 1
Info: The PCIe DMA core is memory mapped.
Info: Running PCIe DMA memory mapped write read test
transfer size: 1024
transfer count: 1
Info: Writing to h2c channel 0 at address offset 0. TransferSize: 1024 TransferCount: 1
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 3.382519
Info: Writing to h2c channel 0 at address offset 1024. TransferSize: 1024 TransferCount: 1
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 2.005331
Info: Writing to h2c channel 0 at address offset 2048. TransferSize: 1024 TransferCount: 1
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 3.795151
Info: Writing to h2c channel 0 at address offset 3072. TransferSize: 1024 TransferCount: 1
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 3.331175
Info: Reading from c2h channel 0 at address offset 0. TransferSize: 1024 TransferCount: 1
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 3.323952
Info: Reading from c2h channel 0 at address offset 1024. TransferSize: 1024 TransferCount: 1
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 3.227494
Info: Reading from c2h channel 0 at address offset 2048. TransferSize: 1024 TransferCount: 1
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 3.982716
Info: Reading from c2h channel 0 at address offset 3072. TransferSize: 1024 TransferCount: 1
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 2.669127
Info: Checking data integrity.
data/output_datafile0_4K.bin data/datafile0_4K.bin differ: char 53, line 1
Error: The data written did not match the data that was read.
address range: 0 - 1024
write data file: data/datafile0_4K.bin
read data file: data/output_datafile0_4K.bin
data/output_datafile1_4K.bin data/datafile1_4K.bin differ: char 53, line 2
Error: The data written did not match the data that was read.
address range: 1024 - 2048
write data file: data/datafile1_4K.bin
read data file: data/output_datafile1_4K.bin
data/output_datafile2_4K.bin data/datafile2_4K.bin differ: char 54, line 2
Error: The data written did not match the data that was read.
address range: 2048 - 3072
write data file: data/datafile2_4K.bin
read data file: data/output_datafile2_4K.bin
data/output_datafile3_4K.bin data/datafile3_4K.bin differ: char 53, line 2
Error: The data written did not match the data that was read.
address range: 3072 - 4096
write data file: data/datafile3_4K.bin
read data file: data/output_datafile3_4K.bin
Error: Test completed with Errors.
Error: Test completed with Errors.

I'm not sure if the board is faulty or just not compatible with the RockPi4.

@RHSResearchLLC
Owner

RHSResearchLLC commented Feb 8, 2021 via email

@BoneGoat
Author

BoneGoat commented Feb 9, 2021

Right, tested the following:

  1. TRANSFER_SIZE = 512 * 512 * 4
  2. TRANSFER_SIZE = 256 * 256 * 4
  3. TRANSFER_SIZE = 128 * 128 * 4

All resulted in:

#0
Segmentation fault

dmesg from loading the driver up to the segfault:

[ 169.797974] xdma: loading out-of-tree module taints kernel.
[ 169.801380] xdma:xdma_mod_init: Xilinx XDMA Reference Driver xdma v2020.1.8
[ 169.801395] xdma:xdma_mod_init: desc_blen_max: 0xfffffff/268435455, timeout: h2c 10 c2h 10 sec.
[ 169.802804] xdma:xdma_device_open: xdma device 0000:01:00.0, 0xffffffc0ed5e8000.
[ 169.802838] xdma 0000:01:00.0: enabling device (0000 -> 0002)
[ 169.803031] xdma:map_single_bar: BAR0 at 0xfa000000 mapped at 0xffffff800e400000, length=1048576(/1048576)
[ 169.803070] xdma:map_single_bar: BAR1 at 0xfa100000 mapped at 0xffffff800bec0000, length=65536(/65536)
[ 169.803081] xdma:map_bars: config bar 1, pos 1.
[ 169.803089] xdma:identify_bars: 2 BARs: config 1, user 0, bypass -1.
[ 169.803394] xdma:pci_keep_intx_enabled: 0000:01:00.0: clear INTX_DISABLE, 0x406 -> 0x6.
[ 169.803474] xdma:probe_one: 0000:01:00.0 xdma0, pdev 0xffffffc0ed5e8000, xdev 0xffffffc0ed3e4000, 0xffffffc0ed3e6000, usr 16, ch 1,1.
[ 169.809281] xdma:cdev_xvc_init: xcdev 0xffffffc0ed3e5b88, bar 0, offset 0x40000.
[ 337.868732] Bad mode in Error handler detected, code 0xbf000002 -- SError
[ 337.869330] Internal error: Oops - bad mode: 0 [#1] SMP
[ 337.869795] Modules linked in: xdma(O) bcmdhd ip_tables x_tables autofs4
[ 337.870437] CPU: 4 PID: 784 Comm: python3 Tainted: G O 4.4.154-59-rockchip-g5e70f14 #4
[ 337.871231] Hardware name: ROCK PI 4B (DT)
[ 337.871592] task: ffffffc0ed236200 task.stack: ffffffc0eb854000
[ 337.872113] PC is at 0x7f80e9a07c
[ 337.872404] LR is at 0x47dc78
[ 337.872673] pc : [<0000007f80e9a07c>] lr : [<000000000047dc78>] pstate: 40000000
[ 337.873318] sp : 0000007ff1d237c0
[ 337.873609] x29: 0000007ff1d237c0 x28: 000000003707bc80
[ 337.874098] x27: 0000007f80b5d8a0 x26: 000000000047dbd0
[ 337.874586] x25: 0000000000000003 x24: 0000000001e00000
[ 337.875073] x23: 00000000007c9408 x22: 000000000085e570
[ 337.875561] x21: 0000007f80a44480 x20: 0000000036c57ba0
[ 337.876048] x19: 0000000000000003 x18: 00000000000002eb
[ 337.876535] x17: 0000007f80e9a050 x16: 00000000007c6ad0
[ 337.877022] x15: 0000007f80ddce08 x14: 0000007f80dea308
[ 337.877510] x13: 000000006022e5a9 x12: 0000000000000018
[ 337.877997] x11: 0000000000000000 x10: 0000007ff1d237c0
[ 337.878484] x9 : 0000000000000001 x8 : 0000000000000044
[ 337.878971] x7 : 0000000000000001 x6 : 0000000000000000
[ 337.879458] x5 : 000000000000017f x4 : 0000007f80f31270
[ 337.879946] x3 : 0000000001e00000 x2 : 0000000000100000
[ 337.880433] x1 : 0000000038eab130 x0 : 0000000000100000
[ 337.881062] Process python3 (pid: 784, stack limit = 0xffffffc0eb854000)
[ 337.881696] ---[ end trace 3a0bab8d539be790 ]---
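
For anyone reading the register dump: the three TRANSFER_SIZE settings work out to 1 MiB, 256 KiB and 64 KiB. Registers x0 and x2 hold 0x100000, which equals the first setting, and x3 holds 0x1e00000, an exact multiple of it. That would be consistent with a 1 MiB transfer at `page * TRANSFER_SIZE` for page 30 (the offset expression from the earlier traceback), though the post does not say which run produced this dump, so this is only an observation. A short sketch of the arithmetic:

```python
# The three settings listed in this comment.
settings = {
    "512 * 512 * 4": 512 * 512 * 4,
    "256 * 256 * 4": 256 * 256 * 4,
    "128 * 128 * 4": 128 * 128 * 4,
}
for label, size in settings.items():
    print(f"TRANSFER_SIZE = {label} = {size} bytes = {hex(size)}")

# Register values copied from the dump above.
x2 = 0x0000000000100000
x3 = 0x0000000001E00000

# If x2 is a transfer size and x3 a byte offset, which page would that be?
page, remainder = divmod(x3, x2)
print(f"x3 / x2 = page {page}, remainder {remainder}")
```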

@RHSResearchLLC
Owner

RHSResearchLLC commented Feb 9, 2021

Doesn't look like enough information to really help. Probably the last thing you could try:

  • Build an XDMA sample project similar to the Xilinx sample project
  • Use just the Xilinx tools to test the link

I'm not familiar with the Xilinx project, not sure what offset they map memory to, and they probably just use BRAM as the memory. If the problem persists, you'll have a nice clean project that you can use to open a support request. If the problem doesn't persist, you can use MIG and replace BRAM with DDR and see if the problem comes back.

It's not likely to be the LiteFury itself; it's more likely a compatibility issue between the XDMA driver and the RockPi. But if you really think the LiteFury is defective, I can send you another. My recommendation is to use BRAM instead of DDR: if BRAM works and DDR doesn't, that would indicate a defective LiteFury.
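
The isolation steps above amount to a small decision table. A minimal sketch, with a hypothetical function name and wording that paraphrases this comment:

```python
def interpret(bram_ok, ddr_ok=None):
    """Map BRAM/DDR test outcomes to the conclusion suggested in this comment.

    bram_ok: did the transfer test pass against a BRAM-backed design?
    ddr_ok:  did it pass after swapping BRAM for DDR via MIG? (None if not tried)
    """
    if not bram_ok:
        # Problem persists without DDR involved.
        return "likely a driver/host compatibility issue; use the clean project for a support request"
    if ddr_ok is None:
        return "BRAM works; next, replace BRAM with DDR using MIG and retest"
    if not ddr_ok:
        return "BRAM works and DDR does not; this would indicate a defective LiteFury"
    return "both pass; the fault was not reproduced in this project"


for outcome in [(False, None), (True, None), (True, False), (True, True)]:
    print(outcome, "->", interpret(*outcome))
```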

@BoneGoat
Author

OK, thanks Dave for all your help!
