[Arista][T2] Kernel panic seen on supervisor during reboot tests #20901

Open
arista-nwolfe opened this issue Nov 22, 2024 · 1 comment
Labels
Chassis 🤖 Modular chassis support Issue for 202405

Comments

@arista-nwolfe
Contributor

As indicated in aristanetworks/sonic#109, a kernel panic can occur on the supervisor during reboot tests (the module API platform tests). It was introduced by the kernel upgrade to 6.1.94 (6.1.0-22-2) in #19885.

2024 Nov 14 23:21:43.961688 str2-7804-sup-1 INFO kernel: [ 1284.481956] br1: port 13(lc7.42) entered disabled state
2024 Nov 14 23:21:44.025742 str2-7804-sup-1 INFO lc-interface-config[52854]: remove interface lc7 slot_id=
2024 Nov 14 23:21:44.078064 str2-7804-sup-1 INFO kernel: [ 1284.597508] pcieport 0000:73:0d.0: pciehp: Timeout on hotplug command 0x1038 (issued 1183788 msec ago)
2024 Nov 14 23:21:44.693686 str2-7804-sup-1 ERR kernel: [ 1285.105921] pcieport 0000:73:02.0: Unable to change power state from D3hot to D0, device inaccessible
2024 Nov 14 23:21:46.309702 str2-7804-sup-1 INFO kernel: [ 1286.721489] pcieport 0000:73:0d.0: pciehp: Timeout on hotplug command 0x0000 (issued 2124 msec ago)
2024 Nov 14 23:21:46.309722 str2-7804-sup-1 ERR kernel: [ 1286.721728] pcieport 0000:73:02.0: Unable to change power state from D3cold to D0, device inaccessible
2024 Nov 14 23:21:46.345683 str2-7804-sup-1 INFO kernel: [ 1286.834051] pci_bus 0000:74: busn_res: [bus 74] is released
2024 Nov 14 23:21:46.345705 str2-7804-sup-1 INFO kernel: [ 1286.834570] pci 0000:73:02.0: Removing from iommu group 20
2024 Nov 14 23:21:46.345707 str2-7804-sup-1 INFO kernel: [ 1286.834649] pci 0000:75:00.0: Removing from iommu group 20
2024 Nov 14 23:21:46.345708 str2-7804-sup-1 WARNING kernel: [ 1286.839869] general protection fault, probably for non-canonical address 0x32b727d667b7999a: 0000 [#1] PREEMPT SMP PTI
2024 Nov 14 23:21:51.054518 str2-7804-sup-1 WARNING kernel: [ 1286.968107] CPU: 11 PID: 151 Comm: irq/46-pciehp Tainted: G           OE      6.1.0-22-2-amd64 #1  Debian 6.1.94-1
2024 Nov 14 23:21:51.054538 str2-7804-sup-1 WARNING kernel: [ 1287.092181] Hardware name: Intel Camelback Mountain CRB/Camelback Mountain CRB, BIOS Aboot-norcal7-7.1.4-14169220 11/09/2019
2024 Nov 14 23:21:51.054540 str2-7804-sup-1 WARNING kernel: [ 1287.226668] RIP: 0010:pcie_config_aspm_link+0x48/0x330
2024 Nov 14 23:21:51.054541 str2-7804-sup-1 WARNING kernel: [ 1287.288242] Code: 48 8b 04 25 28 00 00 00 48 89 44 24 30 31 c0 8b 47 30 4c 8b 47 08 83 e3 7f c1 e8 0e f7 d3 89 c2 83 e0 7f 21 c3 83 e2 7f 21 f3 <41> 8b b6 a0 00 00 00 89 d8 83 e0 87 f6 c3 04 0f 44 d8 0f b7 47 30
2024 Nov 14 23:21:51.054543 str2-7804-sup-1 WARNING kernel: [ 1287.513355] RSP: 0000:ffffa81a0053bcb8 EFLAGS: 00010246
2024 Nov 14 23:21:51.054544 str2-7804-sup-1 WARNING kernel: [ 1287.575967] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
2024 Nov 14 23:21:51.054545 str2-7804-sup-1 WARNING kernel: [ 1287.661493] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9a41c6c35480
2024 Nov 14 23:21:51.054546 str2-7804-sup-1 WARNING kernel: [ 1287.747022] RBP: ffff9a41c6c35480 R08: ffff9a424d08bf49 R09: ffffa81a0053bc6c
2024 Nov 14 23:21:51.054547 str2-7804-sup-1 WARNING kernel: [ 1287.832549] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9a41c1016000
2024 Nov 14 23:21:51.054548 str2-7804-sup-1 WARNING kernel: [ 1287.918078] R13: ffff9a41c5435028 R14: 32b727d667b798fa R15: ffff9a41c0ec3920
2024 Nov 14 23:21:51.054549 str2-7804-sup-1 WARNING kernel: [ 1288.003606] FS:  0000000000000000(0000) GS:ffff9a50ffcc0000(0000) knlGS:0000000000000000
2024 Nov 14 23:21:51.054550 str2-7804-sup-1 WARNING kernel: [ 1288.100593] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024 Nov 14 23:21:51.054550 str2-7804-sup-1 WARNING kernel: [ 1288.169454] CR2: 00007fb55fdf5030 CR3: 0000000101044001 CR4: 00000000003706e0
2024 Nov 14 23:21:51.054551 str2-7804-sup-1 WARNING kernel: [ 1288.254982] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024 Nov 14 23:21:51.054552 str2-7804-sup-1 WARNING kernel: [ 1288.340509] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2024 Nov 14 23:21:51.054553 str2-7804-sup-1 WARNING kernel: [ 1288.426039] Call Trace:
2024 Nov 14 23:21:51.054554 str2-7804-sup-1 WARNING kernel: [ 1288.455317]  <TASK>
2024 Nov 14 23:21:51.054555 str2-7804-sup-1 WARNING kernel: [ 1288.480430]  ? __die_body.cold+0x1a/0x1f
2024 Nov 14 23:21:51.054555 str2-7804-sup-1 WARNING kernel: [ 1288.527428]  ? die_addr+0x38/0x60
2024 Nov 14 23:21:51.054556 str2-7804-sup-1 WARNING kernel: [ 1288.567128]  ? exc_general_protection+0x221/0x4a0
2024 Nov 14 23:21:51.054557 str2-7804-sup-1 WARNING kernel: [ 1288.623496]  ? asm_exc_general_protection+0x22/0x30
2024 Nov 14 23:21:51.054558 str2-7804-sup-1 WARNING kernel: [ 1288.681954]  ? pcie_config_aspm_link+0x48/0x330
2024 Nov 14 23:21:51.054559 str2-7804-sup-1 WARNING kernel: [ 1288.736243]  pcie_aspm_exit_link_state+0xb9/0x120
2024 Nov 14 23:21:51.054559 str2-7804-sup-1 WARNING kernel: [ 1288.792612]  pci_remove_bus_device+0xc8/0x110
2024 Nov 14 23:21:51.054560 str2-7804-sup-1 WARNING kernel: [ 1288.844818]  pci_remove_bus_device+0x2e/0x110
2024 Nov 14 23:21:51.054561 str2-7804-sup-1 WARNING kernel: [ 1288.897026]  pci_remove_bus_device+0x3e/0x110
2024 Nov 14 23:21:51.054562 str2-7804-sup-1 WARNING kernel: [ 1288.949234]  pciehp_unconfigure_device+0x94/0x160
2024 Nov 14 23:21:51.054563 str2-7804-sup-1 WARNING kernel: [ 1289.005609]  pciehp_disable_slot+0x69/0x100
2024 Nov 14 23:21:51.054564 str2-7804-sup-1 WARNING kernel: [ 1289.055731]  pciehp_handle_presence_or_link_change+0x241/0x350
2024 Nov 14 23:21:51.054564 str2-7804-sup-1 WARNING kernel: [ 1289.125642]  pciehp_ist+0x164/0x170
2024 Nov 14 23:21:51.054575 str2-7804-sup-1 WARNING kernel: [ 1289.167433]  ? disable_irq_nosync+0x10/0x10
2024 Nov 14 23:21:51.054577 str2-7804-sup-1 WARNING kernel: [ 1289.217548]  irq_thread_fn+0x1f/0x60
2024 Nov 14 23:21:51.054578 str2-7804-sup-1 WARNING kernel: [ 1289.260374]  irq_thread+0xfa/0x1c0
2024 Nov 14 23:21:51.054578 str2-7804-sup-1 WARNING kernel: [ 1289.301116]  ? irq_thread_fn+0x60/0x60
2024 Nov 14 23:21:51.054579 str2-7804-sup-1 WARNING kernel: [ 1289.346024]  ? irq_thread_check_affinity+0xf0/0xf0
2024 Nov 14 23:21:51.054580 str2-7804-sup-1 WARNING kernel: [ 1289.403432]  kthread+0xda/0x100
2024 Nov 14 23:21:51.054584 str2-7804-sup-1 WARNING kernel: [ 1289.441043]  ? kthread_complete_and_exit+0x20/0x20
2024 Nov 14 23:21:51.054585 str2-7804-sup-1 WARNING kernel: [ 1289.498448]  ret_from_fork+0x22/0x30
2024 Nov 14 23:21:51.054585 str2-7804-sup-1 WARNING kernel: [ 1289.541273]  </TASK>
2024 Nov 14 23:21:51.054586 str2-7804-sup-1 WARNING kernel: [ 1289.567422] Modules linked in: nft_meta_bridge(E) 8021q(E) garp(E) mrp(E) lm75(E) linux_ngbde(OE) linux_knet_cb(OE) linux_bcm_knet(OE) psample(E) linux_user_bde(OE) linux_kernel_bde(OE) xt_hl(E) xt_tcpudp(E) ip6_tables(E) xt_conntrack(E) ebt_vlan(E) nft_compat(E) nf_tables(E) tmp468(OE) amax31790(OE) veth(E) pmbus(E) pmbus_core(E) nf_conntrack_netlink(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xfrm_user(E) i2c_mux_pca9541(E) i2c_mux(E) optoe(E) lm90(E) at24(E) regmap_i2c(E) scd_hwmon(OE) i2c_dev(E) eeprom(E) bridge(E) stp(E) llc(E) nvme_fabrics(E) binfmt_misc(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) sb_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) bonding(E) tls(E) irqbypass(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha512_generic(E) sha256_ssse3(E) sha1_ssse3(E) aesni_intel(E) crypto_simd(E) cryptd(E) rapl(E) intel_cstate(E) intel_uncore(E) iTCO_wdt(E) evdev(E)
2024 Nov 14 23:21:51.054588 str2-7804-sup-1 WARNING kernel: [ 1289.567494]  ofpart(E) intel_pmc_bxt(E) scd(OE) spi_nor(E) iTCO_vendor_support(E) pcspkr(E) mtd(E) intel_pch_thermal(E) uio(E) watchdog(E) sg(E) ioatdma(E) button(E) nfnetlink(E) fuse(E) efi_pstore(E) dm_mod(E) drm(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) loop(E) ext4(E) crc16(E) mbcache(E) jbd2(E) crc32c_generic(E) zstd(E) zstd_compress(E) nvme(E) nvme_core(E) nls_utf8(E) nls_cp437(E) nls_ascii(E) vfat(E) fat(E) overlay(E) squashfs(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) crc_t10dif(E) crct10dif_generic(E) ahci(E) libahci(E) ixgbe(E) xhci_pci(E) crct10dif_pclmul(E) spi_intel_platform(E) xfrm_algo(E) crct10dif_common(E) spi_intel(E) gpio_ich(E) libata(E) ehci_pci(E) dca(E) crc32_pclmul(E) xhci_hcd(E) ehci_hcd(E) mdio_devres(E) of_mdio(E) crc32c_intel(E) i2c_i801(E) scsi_mod(E) lpc_ich(E) fixed_phy(E) i2c_smbus(E) scsi_common(E) usbcore(E) tg3(E) fwnode_mdio(E) usb_common(E) libphy(E) mdio(E)
2024 Nov 14 23:21:51.054592 str2-7804-sup-1 WARNING kernel: [ 1291.578230] sched: RT throttling activated
2024 Nov 14 23:21:51.103876 str2-7804-sup-1 WARNING kernel: [ 1291.578551] ---[ end trace 0000000000000000 ]---
2024 Nov 14 23:21:51.220783 str2-7804-sup-1 WARNING kernel: [ 1291.682963] RIP: 0010:pcie_config_aspm_link+0x48/0x330
2024 Nov 14 23:21:51.220806 str2-7804-sup-1 WARNING kernel: [ 1291.744550] Code: 48 8b 04 25 28 00 00 00 48 89 44 24 30 31 c0 8b 47 30 4c 8b 47 08 83 e3 7f c1 e8 0e f7 d3 89 c2 83 e0 7f 21 c3 83 e2 7f 21 f3 <41> 8b b6 a0 00 00 00 89 d8 83 e0 87 f6 c3 04 0f 44 d8 0f b7 47 30
2024 Nov 14 23:21:51.508531 str2-7804-sup-1 WARNING kernel: [ 1291.969674] RSP: 0000:ffffa81a0053bcb8 EFLAGS: 00010246
2024 Nov 14 23:21:51.508552 str2-7804-sup-1 WARNING kernel: [ 1292.032297] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
2024 Nov 14 23:21:51.679604 str2-7804-sup-1 WARNING kernel: [ 1292.117829] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9a41c6c35480
2024 Nov 14 23:21:51.679626 str2-7804-sup-1 WARNING kernel: [ 1292.203366] RBP: ffff9a41c6c35480 R08: ffff9a424d08bf49 R09: ffffa81a0053bc6c
2024 Nov 14 23:21:51.850678 str2-7804-sup-1 WARNING kernel: [ 1292.288901] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9a41c1016000
2024 Nov 14 23:21:51.850701 str2-7804-sup-1 WARNING kernel: [ 1292.374438] R13: ffff9a41c5435028 R14: 32b727d667b798fa R15: ffff9a41c0ec3920
2024 Nov 14 23:21:52.033223 str2-7804-sup-1 WARNING kernel: [ 1292.459975] FS:  0000000000000000(0000) GS:ffff9a50ffcc0000(0000) knlGS:0000000000000000
2024 Nov 14 23:21:52.033244 str2-7804-sup-1 WARNING kernel: [ 1292.556981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024 Nov 14 23:21:52.108423 str2-7804-sup-1 WARNING pmon#chassisd: Unexpected: Module LINE-CARD4 (Slot 7) lost midplane connectivity
2024 Nov 14 23:21:52.187648 str2-7804-sup-1 WARNING kernel: [ 1292.625861] CR2: 00007fb55fdf5030 CR3: 0000000101044001 CR4: 00000000003706e0
2024 Nov 14 23:21:52.187669 str2-7804-sup-1 WARNING kernel: [ 1292.711408] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

Message from syslogd@str2-7804-sup-1 at Nov 14 23:21:52 ...
 kernel:[ 1292.882490] Kernel panic - not syncing: Fatal exception
2024 Nov 14 23:21:52.358732 str2-7804-sup-1 WARNING kernel: [ 1292.796949] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2024 Nov 14 23:21:52.358755 str2-7804-sup-1 EMERG kernel: [ 1292.882490] Kernel panic - not syncing: Fatal exception
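
The faulting address in the trace can be cross-checked against the register dump. The bytes at the marked position in the `Code:` line, `<41> 8b b6 a0 00 00 00`, decode to `mov 0xa0(%r14),%esi`, a 32-bit load from R14 plus 0xa0. Below is a minimal Python sketch of that arithmetic. The helper name `is_canonical` is illustrative, and 48-bit virtual addressing (4-level paging) is an assumption.

```python
# Values copied from the oops above.
R14 = 0x32B727D667B798FA             # R14 in the register dump
DISPLACEMENT = 0xA0                  # disp32 in "<41> 8b b6 a0 00 00 00"
REPORTED_FAULT = 0x32B727D667B7999A  # address in the "general protection fault" line


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits - 1) of addr are all equal.

    va_bits=48 is an assumption (4-level paging).
    """
    upper = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


# Effective address of the faulting load, wrapped to 64 bits.
effective = (R14 + DISPLACEMENT) & 0xFFFFFFFFFFFFFFFF

print(hex(effective))
print(effective == REPORTED_FAULT)
print(is_canonical(effective))
print(is_canonical(0xFFFF9A41C6C35480))  # RDI, for comparison
```

R14 plus 0xa0 reproduces the reported address exactly, and that value is non-canonical, whereas RDI looks like an ordinary kernel pointer. This is consistent with the load going through a garbage or stale pointer rather than a NULL one, though the trace alone does not show where the pointer came from.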

Upon further investigation, this specific change seems to have caused the kernel panic:
torvalds/linux@456d8aa

Comparing the previous version (6.1.38)
https://elixir.free-electrons.com/linux/v6.1.38/source/drivers/pci/pcie/aspm.c#L1003
with the newer version (6.1.94) shows that the commit is present in the newer one:
https://elixir.free-electrons.com/linux/v6.1.94/source/drivers/pci/pcie/aspm.c#L1018
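
When triaging a similar oops, the upstream kernel version can be read from the `CPU:` line of the trace and compared with the two versions named here. The following is a rough Python sketch. The function names and the classification strings are illustrative. Only 6.1.38 and 6.1.94 come from this issue, so any other version is reported as unknown rather than guessed.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int, int]

KNOWN_WITHOUT_CHANGE: Version = (6, 1, 38)  # previous version named in this issue
KNOWN_WITH_CHANGE: Version = (6, 1, 94)     # newer version named in this issue


def upstream_version(oops_line: str) -> Optional[Version]:
    """Extract X.Y.Z from a 'Debian X.Y.Z-N' tag, or return None if absent."""
    match = re.search(r"Debian (\d+)\.(\d+)\.(\d+)-", oops_line)
    if match is None:
        return None
    major, minor, patch = (int(group) for group in match.groups())
    return (major, minor, patch)


def classify(version: Version) -> str:
    """Compare a version with the two data points given in this issue."""
    if version == KNOWN_WITHOUT_CHANGE:
        return "without the change (per this issue)"
    if version == KNOWN_WITH_CHANGE:
        return "with the change (per this issue)"
    return "unknown (not covered by this issue)"


line = ("CPU: 11 PID: 151 Comm: irq/46-pciehp Tainted: G           OE      "
        "6.1.0-22-2-amd64 #1  Debian 6.1.94-1")
version = upstream_version(line)
print(version, classify(version))
```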

@arlakshm
Contributor

@saiarcot895, can you please help with this issue?
