Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZFS on Linux null pointer dereference #11679

Closed
segdy opened this issue Mar 2, 2021 · 178 comments
Closed

ZFS on Linux null pointer dereference #11679

segdy opened this issue Mar 2, 2021 · 178 comments
Labels
Component: Encryption "native encryption" feature Component: Send/Recv "zfs send/recv" feature Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@segdy
Copy link

segdy commented Mar 2, 2021

System information

Type Version/Name
Distribution Name Debian
Distribution Version 10 buster
Linux Kernel 4.19.0-14 (SMP Debian 4.19.171-2)
Architecture amd64
ZFS Version 2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Describe the problem you're observing

When I start sending raw ZFS snapshots to a different system, my Linux systen (4.19.0-14-amd64) starts to hang completely. I can ping it, I can start a very commands (such as dmesg) but most commands hang (incl zfs, zpool, htop, ps, ...). The entire systems hangs completely.

Dmesg shows the following entries at the time of the occurance:

[ 2293.134071] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 2293.149707] PGD 0 P4D 0
[ 2293.154752] Oops: 0000 [#1] SMP PTI
[ 2293.161701] CPU: 1 PID: 12576 Comm: receive_writer Tainted: P           OE     4.19.0-14-amd64 #1 Debian 4.19.171-2
[ 2293.182517] Hardware name: Supermicro X10SLL-F/X10SLL-F, BIOS 3.0a 12/21/2015
[ 2293.196819] RIP: 0010:abd_verify+0x5/0x60 [zfs]
[ 2293.205865] Code: 0f 1f 44 00 00 0f 1f 44 00 00 8b 07 c1 e8 05 83 e0 01 c3 66 90 0f 1f 44 00 00 8b 07 c1 e8 06 83 e0 01 c3 66 90 0f 1f 44 00 00 <8b> 07 a8 01 74 01 c3 a8 40 74 43 41 54 4c 8d 67 68 55 53 48 8b 47
[ 2293.243325] RSP: 0018:ffffb12e4b6d7a28 EFLAGS: 00010246
[ 2293.253741] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2293.267974] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[ 2293.282205] RBP: 0000000000004000 R08: ffff935ec10b70b0 R09: 0000000000000000
[ 2293.296434] R10: 0000000000007130 R11: ffff935d75f984e0 R12: 0000000000004000
[ 2293.310664] R13: 0000000000000000 R14: ffffffffc0fea550 R15: 0000000000000020
[ 2293.324900] FS:  0000000000000000(0000) GS:ffff935ecfb00000(0000) knlGS:0000000000000000
[ 2293.341053] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2293.352510] CR2: 0000000000000000 CR3: 000000001340a001 CR4: 00000000000606e0
[ 2293.366743] Call Trace:
[ 2293.371704]  abd_borrow_buf+0x12/0x40 [zfs]
[ 2293.380104]  abd_borrow_buf_copy+0x28/0x70 [zfs]
[ 2293.389377]  zio_crypt_copy_dnode_bonus+0x36/0x130 [zfs]
[ 2293.400041]  arc_buf_fill+0x3ff/0xb60 [zfs]
[ 2293.408449]  ? zfs_btree_add_idx+0xd0/0x200 [zfs]
[ 2293.417889]  arc_untransform+0x1c/0x70 [zfs]
[ 2293.426461]  dbuf_read_verify_dnode_crypt+0xec/0x160 [zfs]
[ 2293.437466]  dbuf_read_impl.constprop.29+0x4ad/0x6b0 [zfs]
[ 2293.448423]  ? kmem_cache_alloc+0x167/0x1d0
[ 2293.456776]  ? __cv_init+0x3d/0x60 [spl]
[ 2293.464671]  ? dbuf_cons+0xa7/0xc0 [zfs]
[ 2293.472497]  ? spl_kmem_cache_alloc+0x108/0x7a0 [spl]
[ 2293.482583]  ? _cond_resched+0x15/0x30
[ 2293.490071]  ? _cond_resched+0x15/0x30
[ 2293.497542]  ? mutex_lock+0xe/0x30
[ 2293.504402]  ? aggsum_add+0x17a/0x190 [zfs]
[ 2293.512810]  dbuf_read+0x1b2/0x520 [zfs]
[ 2293.520672]  ? dnode_hold_impl+0x350/0xc20 [zfs]
[ 2293.529904]  dmu_bonus_hold_by_dnode+0x126/0x1a0 [zfs]
[ 2293.540186]  receive_object+0x403/0xc70 [zfs]
[ 2293.548906]  ? receive_freeobjects.isra.10+0x9d/0x120 [zfs]
[ 2293.560049]  receive_writer_thread+0x279/0xa00 [zfs]
[ 2293.569962]  ? set_curr_task_fair+0x26/0x50
[ 2293.578319]  ? receive_process_write_record+0x190/0x190 [zfs]
[ 2293.589793]  ? __thread_exit+0x20/0x20 [spl]
[ 2293.598317]  ? thread_generic_wrapper+0x6f/0x80 [spl]
[ 2293.608410]  ? receive_process_write_record+0x190/0x190 [zfs]
[ 2293.619882]  thread_generic_wrapper+0x6f/0x80 [spl]
[ 2293.629609]  kthread+0x112/0x130
[ 2293.636053]  ? kthread_bind+0x30/0x30
[ 2293.643351]  ret_from_fork+0x35/0x40
[ 2293.650473] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter veth pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) nf_tables nfnetlink vboxdrv(OE) bridge binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel kvm irqbypass crct10dif_pclmul ib_iser joydev crc32_pclmul rdma_cm ghash_clmulni_intel iw_cm intel_cstate ib_cm intel_uncore ib_core intel_rapl_perf configfs ipmi_si sg ipmi_devintf iTCO_wdt iTCO_vendor_support pcc_cpufreq intel_pch_thermal iscsi_tcp ipmi_msghandler libiscsi_tcp libiscsi evdev scsi_transport_iscsi pcspkr tun nfsd auth_rpcgss nfs_acl lockd grace sunrpc lm85 dme1737 hwmon_vid iptable_nat ipt_MASQUERADE nf_nat_ipv4 nf_nat nf_conntrack
[ 2293.793008]  nf_defrag_ipv6 nf_defrag_ipv4 fuse loop 8021q garp stp mrp llc ecryptfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 raid10 uas usb_storage raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear hid_generic raid1 usbhid hid md_mod sd_mod ast ahci ttm libahci libata drm_kms_helper drm crc32c_intel igb i2c_i801 dca i2c_algo_bit scsi_mod lpc_ich mfd_core e1000e xhci_pci ehci_pci xhci_hcd ehci_hcd usbcore usb_common thermal fan video button
[ 2293.895677] CR2: 0000000000000000
[ 2293.902280] ---[ end trace 164c64ca87be80af ]---
[ 2294.020926] RIP: 0010:abd_verify+0x5/0x60 [zfs]
[ 2294.029975] Code: 0f 1f 44 00 00 0f 1f 44 00 00 8b 07 c1 e8 05 83 e0 01 c3 66 90 0f 1f 44 00 00 8b 07 c1 e8 06 83 e0 01 c3 66 90 0f 1f 44 00 00 <8b> 07 a8 01 74 01 c3 a8 40 74 43 41 54 4c 8d 67 68 55 53 48 8b 47
[ 2294.067433] RSP: 0018:ffffb12e4b6d7a28 EFLAGS: 00010246
[ 2294.077850] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2294.092082] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[ 2294.106312] RBP: 0000000000004000 R08: ffff935ec10b70b0 R09: 0000000000000000
[ 2294.120542] R10: 0000000000007130 R11: ffff935d75f984e0 R12: 0000000000004000
[ 2294.134774] R13: 0000000000000000 R14: ffffffffc0fea550 R15: 0000000000000020
[ 2294.149006] FS:  0000000000000000(0000) GS:ffff935ecfb00000(0000) knlGS:0000000000000000
[ 2294.165144] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2294.176600] CR2: 0000000000000000 CR3: 000000001340a001 CR4: 00000000000606e0

Interestingly, the transfer continues happily but just everything else in the system hangs.
The only way to recover is resetting the machine (since not even reboot works).

Describe how to reproduce the problem

It's a tough one. It seems to me that the issue might be load related in some sense since it only occurs if I have two zfs send's (via syncoid) running in parallel that have to do with encrypted datasets.

Transfer 1

The first one sends datasets from an unecrypted dataset into an encrypted one (I migrate to encryption).

I use syncoid and use the command:

syncoid -r --skip-parent --no-sync-snap zpradix1imain/sys/vz zpradix1imain/sys/vz_enc

This translates into

zfs send -I 'zpradix1imain/sys/vz/main'@'zfs-auto-snap_hourly-2021-03-02-1917' 'zpradix1imain/sys/vz/main'@'zfs-auto-snap_frequent-2021-03-02-1932' | mbuffer -q -s 128k -m 16M 2>/dev/null | pv -s 16392592 | zfs receive -s -F 'zpradix1imain/sys/vz_enc/main'

Transfer 2

I transfer data from an encrypted dataset raw to a secondary server.
The syncoid command is:

syncoid -r --skip-parent --no-sync-snap --sendoptions=w --exclude=zfs-auto-snap_hourly --exclude=zfs-auto-snap_frequent zpradix1imain/data [email protected]:zpzetta/radix/data

This translates into:

zfs send -w 'zpradix1imain/data/home'@'vicari-prev' | pv -s 179222507064 | lzop | mbuffer -q -s 128k -m 16M 2>/dev/null | ssh ...

In summary:

  • Both Transfer 1 and Transfer 2 have to be running in parallel
  • The issue only appears a few minutes/seconds after I started the transfer
@segdy segdy added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels Mar 2, 2021
@pquan
Copy link

pquan commented Mar 25, 2021

I am experiencing the same or similar issue. A hang-up during zfs send/receive of encrypted filesystem.
The two filesystems are encrypted with different keys.
I'm also using syncoid.
My stack trace is slightly different, but the null pointer dereference is in the same region of abd_borrow_buf_copy

My system is a standard proxmox install 6.3

Os: Debian Buster 10
Kernel: 5.4.78-2-pve
ZFS Module Version: v0.8.5-pve1
Pool: 4x2 two-drive mirrors of 2TB and 4TB vdev devices.
zpool scrub returns no errors.

[54105.343179] BUG: kernel NULL pointer dereference, address: 0000000000000000
[54105.343185] #PF: supervisor read access in kernel mode
[54105.343187] #PF: error_code(0x0000) - not-present page
[54105.343188] PGD 0 P4D 0
[54105.343190] Oops: 0000 [#1] SMP PTI
[54105.343193] CPU: 1 PID: 13831 Comm: receive_writer Tainted: P           O      5.4.78-2-pve #1
[54105.343195] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H81 Pro BTC R2.0, BIOS P1.40 03/09/2018
[54105.343253] RIP: 0010:abd_borrow_buf+0x13/0x70 [zfs]
[54105.343255] Code: 49 8b 44 24 18 41 5c 41 5d 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 49 89 f5 41 54 53 <44> 8b 27 48 89 fb 41 83 e4 01 75 30 44 8b 77 1c 48 8b 47 20 4d 85
[54105.343258] RSP: 0018:ffff96ae063a39d8 EFLAGS: 00010246
[54105.343260] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[54105.343261] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[54105.343263] RBP: ffff96ae063a39f8 R08: 00000000000000a5 R09: 0000000000000000
[54105.343264] R10: 0000000000000706 R11: 0000000000000000 R12: 0000000000000000
[54105.343265] R13: 0000000000004000 R14: 0000000000000000 R15: ffffffffc07ee470
[54105.343267] FS:  0000000000000000(0000) GS:ffff88c9af500000(0000) knlGS:0000000000000000
[54105.343269] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[54105.343271] CR2: 0000000000000000 CR3: 000000001980a004 CR4: 00000000000606e0
[54105.343272] Call Trace:
[54105.343292]  abd_borrow_buf_copy+0x2c/0x80 [zfs]
[54105.343323]  zio_crypt_copy_dnode_bonus+0x31/0x140 [zfs]
[54105.343327]  ? preempt_schedule_common+0xe/0x20
[54105.343345]  arc_buf_fill+0x402/0xb60 [zfs]
[54105.343364]  arc_untransform+0x21/0x90 [zfs]
[54105.343383]  dbuf_read_verify_dnode_crypt+0xee/0x160 [zfs]
[54105.343402]  dbuf_read+0x19d/0xb80 [zfs]
[54105.343425]  dmu_tx_check_ioerr+0x68/0xd0 [zfs]
[54105.343448]  dmu_tx_hold_free_impl+0x132/0x250 [zfs]
[54105.343470]  dmu_tx_hold_free+0x3c/0x50 [zfs]
[54105.343491]  dmu_free_long_range+0x234/0x4b0 [zfs]
[54105.343511]  dmu_free_long_object+0x26/0xc0 [zfs]
[54105.343533]  receive_freeobjects.isra.10+0x7a/0x100 [zfs]
[54105.343555]  receive_writer_thread+0x43c/0xb80 [zfs]
[54105.343558]  ? set_next_task_fair+0x73/0xb0
[54105.343565]  ? spl_kmem_free+0x33/0x40 [spl]
[54105.343586]  ? receive_read_prefetch+0x140/0x140 [zfs]
[54105.343591]  thread_generic_wrapper+0x74/0x90 [spl]
[54105.343613]  ? receive_read_prefetch+0x140/0x140 [zfs]
[54105.343617]  ? thread_generic_wrapper+0x74/0x90 [spl]
[54105.343620]  kthread+0x120/0x140
[54105.343624]  ? __thread_exit+0x20/0x20 [spl]
[54105.343626]  ? kthread_park+0x90/0x90
[54105.343629]  ret_from_fork+0x35/0x40
[54105.343631] Modules linked in: ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp sch_ingress iptable_filter bpfilter bonding 8021q garp mrp openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog nfnetlink_log nfnetlink intel_rapl_msr mei_hdcp ppdev intel_rapl_common x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cryptd rapl intel_cstate snd_hda_codec_hdmi pcspkr i915 drm_kms_helper snd_hda_codec_realtek mei_me snd_hda_codec_generic ledtrig_audio mei drm snd_hda_intel snd_intel_dspcfg snd_hda_codec i2c_algo_bit snd_hda_core fb_sys_fops syscopyarea sysfillrect snd_hwdep sysimgblt snd_pcm snd_timer snd soundcore parport_pc parport mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nct6775 hwmon_vid coretemp sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO)
[54105.343657]  zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c r8169 realtek e1000e i2c_i801 lpc_ich ahci libahci ehci_pci ehci_hcd xhci_pci xhci_hcd mpt3sas raid_class scsi_transport_sas video
[54105.343675] CR2: 0000000000000000
[54105.343677] ---[ end trace 61f4d52a93ce31c9 ]---
[54105.343695] RIP: 0010:abd_borrow_buf+0x13/0x70 [zfs]
[54105.343696] Code: 49 8b 44 24 18 41 5c 41 5d 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 49 89 f5 41 54 53 <44> 8b 27 48 89 fb 41 83 e4 01 75 30 44 8b 77 1c 48 8b 47 20 4d 85
[54105.343699] RSP: 0018:ffff96ae063a39d8 EFLAGS: 00010246
[54105.343700] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[54105.343702] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[54105.343703] RBP: ffff96ae063a39f8 R08: 00000000000000a5 R09: 0000000000000000
[54105.343705] R10: 0000000000000706 R11: 0000000000000000 R12: 0000000000000000
[54105.343706] R13: 0000000000004000 R14: 0000000000000000 R15: ffffffffc07ee470
[54105.343708] FS:  0000000000000000(0000) GS:ffff88c9af500000(0000) knlGS:0000000000000000
[54105.343709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[54105.343710] CR2: 0000000000000000 CR3: 000000001980a004 CR4: 00000000000606e0
[89614.646434] hrtimer: interrupt took 14015 ns

@cserem
Copy link

cserem commented May 20, 2021

Similar hung system, on the same distribution. The problem has been around ever since reinstalling the system and enabling encryption with ZFS 0.8.5, and persists with 5.4.114-1-pve / 2.0.4-pve1

dmesg
 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
 kernel: #PF: supervisor read access in kernel mode
 kernel: #PF: error_code(0x0000) - not-present page
 kernel: PGD 0 P4D 0
 kernel: Oops: 0000 [#1] SMP PTI
 kernel: CPU: 1 PID: 23147 Comm: receive_writer Tainted: P           O      5.4.114-1-pve #1
 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Q2900-ITX, BIOS P1.60 03/01/2018
 kernel: RIP: 0010:abd_verify+0x5/0x70 [zfs]
 kernel: Code: 00 00 00 66 90 0f 1f 44 00 00 55 8b 07 c1 e8 06 48 89 e5 83 e0 01 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 <8b> 07 a8 01 75 4f 55 48 89 e5 41 55 41 54 49 89 fc 53 a8 40 74 40
 kernel: RSP: 0018:ffffbd418d5b7930 EFLAGS: 00010246
 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
 kernel: RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
 kernel: RBP: ffffbd418d5b7948 R08: 0000000000000037 R09: 0000000000000000
 kernel: R10: 000000000000b1e5 R11: 0000000000000000 R12: 0000000000004000
 kernel: R13: 0000000000004000 R14: 0000000000000000 R15: ffffffffc06b19d0
 kernel: FS:  0000000000000000(0000) GS:ffff97e2efe80000(0000) knlGS:0000000000000000
 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel: CR2: 0000000000000000 CR3: 00000003150e4000 CR4: 00000000001026e0
 kernel: Call Trace:
 kernel:  ? abd_borrow_buf+0x17/0x40 [zfs]
 kernel:  abd_borrow_buf_copy+0x2c/0x80 [zfs]
 kernel:  zio_crypt_copy_dnode_bonus+0x31/0x140 [zfs]
 kernel:  ? preempt_schedule_common+0xe/0x20
 kernel:  arc_buf_fill+0x402/0xb40 [zfs]
 kernel:  arc_untransform+0x21/0x80 [zfs]
 kernel:  dbuf_read_verify_dnode_crypt+0xee/0x160 [zfs]
 kernel:  ? __cv_init+0x42/0x60 [spl]
 kernel:  dbuf_read_impl.constprop.33+0x10c/0x6e0 [zfs]
 kernel:  ? arc_space_consume+0x4f/0x120 [zfs]
 kernel:  ? dbuf_create+0x3e8/0x580 [zfs]
 kernel:  ? mutex_lock+0x12/0x30
 kernel:  ? dbuf_find+0x111/0x1d0 [zfs]
 kernel:  ? _cond_resched+0x19/0x30
 kernel:  ? down_read+0x12/0xa0
 kernel:  dbuf_read+0x1b2/0x510 [zfs]
 kernel:  dmu_tx_check_ioerr+0x68/0xd0 [zfs]
 kernel:  dmu_tx_hold_free_impl+0x132/0x250 [zfs]
 kernel:  dmu_tx_hold_free+0x3c/0x50 [zfs]
 kernel:  dmu_free_long_range+0x22a/0x4d0 [zfs]
 kernel:  dmu_free_long_object+0x26/0xc0 [zfs]
 kernel:  receive_freeobjects.isra.15+0x8d/0x120 [zfs]
 kernel:  receive_writer_thread+0x46a/0xa00 [zfs]
 kernel:  ? spl_kmem_free+0x33/0x40 [spl]
 kernel:  ? receive_process_write_record+0x180/0x180 [zfs]
 kernel:  thread_generic_wrapper+0x74/0x90 [spl]
 kernel:  ? receive_process_write_record+0x180/0x180 [zfs]
 kernel:  ? thread_generic_wrapper+0x74/0x90 [spl]
 kernel:  kthread+0x120/0x140
 kernel:  ? __thread_exit+0x20/0x20 [spl]
 kernel:  ? kthread_park+0x90/0x90
 kernel:  ret_from_fork+0x35/0x40
 kernel: Modules linked in: xt_multiport binfmt_misc veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter softdog nfnetlink_log nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_rapl_msr intel_rapl_common intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp i915 kvm_intel snd_hda_intel drm_kms_helper snd_intel_dspcfg kvm snd_hda_codec hci_uart irqbypass snd_hda_core punit_atom_debug btqca btrtl btbcm drm btintel crct10dif_pclmul i2c_algo_bit bluetooth crc32_pclmul fb_sys_fops syscopyarea sysfillrect ghash_clmulni_intel mei_hdcp snd_hwdep snd_pcm sysimgblt mei_txe ppdev ecdh_generic cryptd snd_timer ecc mei snd pcspkr parport_pc soundcore intel_cstate parport rfkill_gpio mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nct6775 hwmon_vid sunrpc coretemp ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O)
 kernel:  zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress hid_generic raid6_pq libcrc32c usbhid r8169 ahci xhci_pci libahci realtek i2c_i801 lpc_ich xhci_hcd video i2c_hid hid
 kernel: CR2: 0000000000000000
 kernel: ---[ end trace 737a4d6fa99905a7 ]---
 kernel: RIP: 0010:abd_verify+0x5/0x70 [zfs]
 kernel: Code: 00 00 00 66 90 0f 1f 44 00 00 55 8b 07 c1 e8 06 48 89 e5 83 e0 01 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 <8b> 07 a8 01 75 4f 55 48 89 e5 41 55 41 54 49 89 fc 53 a8 40 74 40
:
 kernel: RSP: 0018:ffffbd418d5b7930 EFLAGS: 00010246
 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
 kernel: RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
 kernel: RBP: ffffbd418d5b7948 R08: 0000000000000037 R09: 0000000000000000
 kernel: R10: 000000000000b1e5 R11: 0000000000000000 R12: 0000000000004000
 kernel: R13: 0000000000004000 R14: 0000000000000000 R15: ffffffffc06b19d0
 kernel: FS:  0000000000000000(0000) GS:ffff97e2efe80000(0000) knlGS:0000000000000000
 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel: CR2: 0000000000000000 CR3: 00000003150e4000 CR4: 00000000001026e0
I too was suspecting that paralell execution of receives was causing the problem, but it turned out, that with serialized execution the problem still occurs (less often i think, but kind of hard to match a pattern here)

My problem is triggered by receiving snapshots into an encrypted dataset, i am not sending data anywhere from this node.

This machine is a consumer of snapshots for multiple machines (all different versions of ZFS).
The kernel NULL pointer dereference always happens when a zfs receive runs, never when idle (most of the time there is a running receive, so this might not be valuable information).
All sources are non-encrypted, the destination dataset is encrypted and opened with a key.

The commands being run are something along:
zfs send -I dataset1@from dataset1@to |compression|buffering|ssh| zfs receive -s -F 'encrypted-dataset/backups/dataset1'

I have tried checking on /proc/spl/kstat/zfs/dbgmsg (zfs_flags = 256, turning it higher, and i can't manage the amout of data it produces) but nothing interesting showed up.

Perhaps someone could tell us, how to diagnose it further?

On a side note, it seems all three of us are using relatively old arcitectures, from the same era:

  • segdy: Supermicro X10SLL-F/X10SLL-F, BIOS 3.0a 12/21/2015 - C220 released Q2'13
  • pquan: H81 Pro BTC R2.0 - H81 released Q3'13
  • me: Pentium J2900, Bay Trail-D released Q4'13

@lheckemann
Copy link

lheckemann commented May 22, 2021

Same problem here, on NixOS, Linux 5.4.114, ZFS 2.0.4, also (probably) when receiving into an encrypted dataset.

Kernel log: https://gist.github.com/lheckemann/1c11ab6865c44a192c9d2597d17ba72b

This is the second time these symptoms have shown up, though I didn't think to get logs the first time around.

I'm on a Celeron G3900 (Skylake, Q4'15).

EDIT: I'm receiving non-raw sends.

@pquan
Copy link

pquan commented May 24, 2021

  • segdy: Supermicro X10SLL-F/X10SLL-F, BIOS 3.0a 12/21/2015 - C220 released Q2'13
  • pquan: H81 Pro BTC R2.0 - H81 released Q3'13
  • me: Pentium J2900, Bay Trail-D released Q4'13

I'm not sure if this is related to the age of the architecture. But add another of my systems: Intel(R) Xeon(R) CPU E3-1231 running on ASUS P9D-M board. This is relatively "new" when compared to the others.

Incidentally, I also have a X8SIL running an X3440 cpu and 32GB ram. That has not experienced the issue (yet). All these systems are, as you define them, consumers for ZFS snapshots (i.e. backup servers receiving the snapshots from production).
The lockups seem related to load too. The more you receive, the more likely the issue occurs. Parallel streams seem trigger the issue more easily .

@pquan
Copy link

pquan commented May 24, 2021

Today I had a corruption on the receiving system.
The load average was over 3000 when I had to hard-reset it. On boot, the zfs import took an unusual long time.
zfs status is reporting a permanent data error

errors: Permanent errors have been detected in the following files:

        <0x1e39d>:<0x0>

But I'm not sure what that file is.
If data corruption is involved, this would be a serious issue.

@rincebrain
Copy link
Contributor

Could you share the output of these two commands?
cat /sys/module/icp/parameters/icp_aes_impl
cat /sys/module/icp/parameters/icp_gcm_impl

I have a vague suspicion that all the CPUs mentioned here don't believe in aesni, and am wondering if there's something awry in the codepath they're taking.

(Note that I am not remotely an expert on this code, just a random person musing about what could be going so awry for you and yet not causing much more screaming by encryption users.)

(If you'd like to more aggressively experiment with this theory, I suspect the speed order for this goes aesni > x86_64 > generic, so if you are indeed using x86_64, you could try using the "generic" implementation and see if the problem still reproduces.)

@cserem
Copy link

cserem commented May 24, 2021

# cat /sys/module/icp/parameters/icp_aes_impl
cycle [fastest] generic x86_64 
# cat /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] generic pclmulqdq

I have switched both parameters to generic now. Will report back as things advance.

@rincebrain
Copy link
Contributor

One further thought - I don't see a way to validate which implementation it was currently using on "fastest", so it's also possible you were already using generic for one of both of those. So if it still NULL derefs, you could try explicitly setting x86_64 and pclmulqdq.

(I thought of this because if #12106 is indeed related to this, it definitely can't be using x86_64, so...)

@pquan
Copy link

pquan commented May 24, 2021

# cat /sys/module/icp/parameters/icp_aes_impl
cycle [fastest] generic x86_64 
# cat /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] generic pclmulqdq

I have switched both parameters to generic now. Will report back as things advance.

Same output here, on all CPU I use (aes-ni aware and in use not only for zfs but for many other things).

# cat /sys/module/icp/parameters/icp_aes_impl
cycle [fastest] generic x86_64 
# cat /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] generic pclmulqdq

Are we suggesting that the aes-ni implementation in the kernel is "buggy"? That is most certainly not the case.

Not sure if I pointed this out, but, in my case, I'm using different encryption keys on the sending and on the receiving side. Also, some of my senders are not encrypted, but the receiver is (via a parent fs which is encrypted).

@rincebrain
Copy link
Contributor

rincebrain commented May 24, 2021

No, I was suggesting precisely the opposite - that the reason people here might be seeing an issue and others aren't is the absence of use of aesni.

edit: Just for reference, on my (Broadwell) Xeon:

$ for i in aes gcm;do cat /sys/module/icp/parameters/icp_${i}_impl;echo "";done;
cycle [fastest] generic x86_64 aesni
cycle [fastest] avx generic pclmulqdq

So I was hypothesizing the absence of use of one or both of those could be related to this, and was more suspicious of aesni than avx.

@pquan
Copy link

pquan commented May 24, 2021

Understood.
Not sure how to test if zfs is using aes-ni, but a quick check on openssl with evp and non-evp api show a triple-fold increase with using evp, suggesting that aes-ni hardware is being exercised.

I'm going to turn explicitly on aesni and avx on one of the servers now. Just to test the opposite case of generic.

cycle fastest generic x86_64 [aesni]
cycle fastest [avx] generic pclmulqdq

@rincebrain
Copy link
Contributor

AFAICS from quickly skimming the code, there's no equivalent of vdev_raidz_bench for the crypto bits. Maybe I'll see how hard implementing one would be after this.

@rincebrain
Copy link
Contributor

Just to be clear I'm trying to reproduce the right thing, could somebody share which values of "encryption/keyformat" they're using on the dataset that things are being received under? (And of course, if they turn out to all be the same/different, that's another data point...)

@cserem
Copy link

cserem commented May 24, 2021

My consumer is:

rpool/backups  encryption                  aes-256-gcm                                    -
rpool/backups  keylocation                 file:///rpool/other-encrypted/zfs.KEY              local
rpool/backups  keyformat                   raw     

(After changing icp_[aes|gcm]_impl to "generic" a couple of hours ago, the next backup iteration caused the same null pointer/stack trace.

After restarting and re-applying the "generic" settings, has yet crashed again)

@rincebrain
Copy link
Contributor

rincebrain commented May 24, 2021

Well, unless generic was mysteriously calculated to be faster before and you were always using that, there goes that theory. Rats.

edit: Well, I guess it's technically not impossible for both the non-aesni/avx implementations to cause similar misbehavior, but I'm going to assume for now that it's not specific to those...somehow.

@AttilaFueloep
Copy link
Contributor

@rincebrain The fastest implementation is statically selected according to the available hardware. See here and here. To summarize:

GCM: generic < pclmulqdq < aes
AES: generic < x86_64 < aesni

I'm still working on supporting more instruction set extensions for GCM and am thinking of adding a bench as part of that work. Currently the static selection should always select the fastest one.

Further it's quite unlikely that the implementations differ in behavior since ztest uses the cycle implementation, which cycles through all avialable implementations per encryption call/block, and should've catched any discrepancies. I think it's more likely that the load encryption puts on the system triggers some race condition. Just a guess though.

@rincebrain
Copy link
Contributor

@rincebrain The fastest implementation is statically selected according to the available hardware. See here and here. To summarize:

GCM: generic < pclmulqdq < aes
AES: generic < x86_64 < aesni

I'm still working on supporting more instruction set extensions for GCM and am thinking of adding a bench as part of that work. Currently the static selection should always select the fastest one.

Further it's quite unlikely that the implementations differ in behavior since ztest uses the cycle implementation, which cycles through all avialable implementations per encryption call/block, and should've catched any discrepancies. I think it's more likely that the load encryption puts on the system triggers some race condition. Just a guess though.

Yeah, I'd buy that, the only reason I proposed the selection as an idea before was, as I said, that I noticed all of them happened to be using CPUs that did not believe in aesni, and the assumption that most people are not running into this regularly, so something about their setups is different.

I also, annoyingly, have not been able to reproduce this in a VM no matter how I've varied the amount of computational power or cores, so far. :/

@pquan
Copy link

pquan commented May 26, 2021

I must say, this is the first time someone talks about cpu beliefs.
This is a serious issue. It happens on multiple cpu generations, mainboard etc.
All my stuff is server grade using ecc memory and enterprise sata disks.
No errors are logged on the disks, memory, cpu anywhere.
Yet, now I'm starting to develop data errors on multiple servers.
These seem related to errors in snapshots, even on "source" snapshots: i.e. those that were not transferred to a backup server, but that are native to a server.
It started when I started using encrypted storage and OpenZFS v2.0.4 (specifially the proxmox kernel pve-kernel-5.4.114-1-pve)

It might be related to proxmox. I tought I've never seen it with the 0.8.5 zfs module, but turns out it was there. Now I have the same issue with 2.0.4. The thing that changed was encryption.

For now the data corruption regards only snapshots. I hope it won't start affecting live files.

It also seems there's some silent corruption goin on. On one of the servers with data errors, I never had this null pointer dereference, yet, there are corrupted snapshots, and may of them:

errors: Permanent errors have been detected in the following files:

        <0x16c01>:<0x0>
        <0x17506>:<0x1>
        <0x16220>:<0x0>
        <0x1742a>:<0x0>
        <0x1742a>:<0x1>
        <0x1742c>:<0x1>
        <0x1612f>:<0x0>
        <0x17435>:<0x0>
        <0x17439>:<0x0>
        <0x16151>:<0x0>
        <0x17451>:<0x0>
        <0x17451>:<0x1>
        <0x16d66>:<0x0>
        rpool/vm/vm-1234-disk-0@autosnap_2021-05-22_20:00:03_hourly:<0x0>
        <0x16280>:<0x0>
        <0x1678c>:<0x0>
        rpool/vm/vm-1234-disk-0@autosnap_2021-05-22_22:00:01_hourly:<0x0>
        <0x165ad>:<0x0>
        <0x165ad>:<0x1>
        <0x171bd>:<0x0>
        <0x15fcb>:<0x0>
        <0x165d8>:<0x0>
        <0x169f9>:<0x0>

Those hex ids are deleted snapshots (most likely). It can happen that a snapshot is created/deleted during the send/receive. They are asynchronous processes.

At the moment I'm running a scrub now on all servers. Will see if those errors go away or not (deleted snapshots). But the "vm-1234" snapshots above cannot be deleted. They're not visible when issuing zfs list -t snapshot, yet, they do have a name in the error list.

# zfs destroy rpool/vm/vm-1234-disk-0@autosnap_2021-05-22_22:00:01_hourly
could not find any snapshots to destroy; check snapshot names.

There's also more:

# zfs list -t snapshot rpool/vm/vm-1234-disk-0|grep 05-22
cannot iterate filesystems: I/O error
#

@pquan
Copy link

pquan commented May 27, 2021

Ok, I was waken up this morning by a (perhaps related) issue on one of the "sending" servers. The stack trace does not show the encryption part and seems more related to the arc. I haven't seen one of those for years, so being a very rare thing, I decided to post it here in hope it would help.

I was doing some things on this server tonight, namely:

  • transferring all filesystems from an encrypted parent to an unencrypted parent (i.e. from rpool/encrypted/filesystem{0..n} to rpool/non-encrypted/filesystem{0..n}
  • making hourly snapshots, some of which must have happened during the transfer, and deleting older expired snapshots.
  • making (very likely) other send transfers from the encrypted filesystems to a remote backup server.
  • scrubbing the rpool pool to verify for errors and hopefully remove those older snapshots which are now identified via a hex-id in the error table.
  • all this while running around 10 lxc containers and a couple of kvm virtual machines.

The server was completely locked up, load average around 6800 and the following stack trace:

[1214732.919868] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[1214732.919930] PANIC at arc.c:3791:arc_buf_destroy()
[1214732.919985] Showing stack for process 1114
[1214732.920040] CPU: 2 PID: 1114 Comm: z_rd_int Tainted: P           O      5.4.114-1-pve #1
[1214732.920114] Hardware name: Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 3.3.V1 02/04/2020
[1214732.920193] Call Trace:
[1214732.920251]  dump_stack+0x6d/0x8b
[1214732.920309]  spl_dumpstack+0x29/0x2b [spl]
[1214732.920365]  spl_panic+0xd3/0xfb [spl]
[1214732.920420]  ? __wake_up_common+0x78/0x140
[1214732.920521]  ? zei_add_range+0x140/0x140 [zfs]
[1214732.920576]  ? __wake_up+0x13/0x20
[1214732.920630]  ? __cv_broadcast+0x2a/0x40 [spl]
[1214732.920715]  ? zfs_zevent_post+0x223/0x280 [zfs]
[1214732.920792]  arc_buf_destroy+0x101/0x110 [zfs]
[1214732.920869]  arc_read_done+0x243/0x480 [zfs]
[1214732.920957]  zio_done+0x43f/0x1020 [zfs]
[1214732.921045]  zio_execute+0x99/0xf0 [zfs]
[1214732.921101]  taskq_thread+0x2f7/0x4e0 [spl]
[1214732.921157]  ? wake_up_q+0x80/0x80
[1214732.921243]  ? zio_taskq_member.isra.14.constprop.20+0x70/0x70 [zfs]
[1214732.921302]  kthread+0x120/0x140
[1214732.921357]  ? task_done+0xb0/0xb0 [spl]
[1214732.921410]  ? kthread_park+0x90/0x90
[1214732.921465]  ret_from_fork+0x35/0x40
[1214871.316671] INFO: task z_rd_int:1114 blocked for more than 120 seconds.
[1214871.316733]       Tainted: P           O      5.4.114-1-pve #1
[1214871.316790] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1214871.316864] z_rd_int        D    0  1114      2 0x80004000
[1214871.316921] Call Trace:
[1214871.316978]  __schedule+0x2e6/0x700
[1214871.317040]  schedule+0x33/0xa0
[1214871.317096]  spl_panic+0xf9/0xfb [spl]
[1214871.317195]  ? zei_add_range+0x140/0x140 [zfs]
[1214871.317252]  ? __wake_up+0x13/0x20
[1214871.317307]  ? __cv_broadcast+0x2a/0x40 [spl]
[1214871.317389]  ? zfs_zevent_post+0x223/0x280 [zfs]
[1214871.317466]  arc_buf_destroy+0x101/0x110 [zfs]
[1214871.317542]  arc_read_done+0x243/0x480 [zfs]
[1214871.317628]  zio_done+0x43f/0x1020 [zfs]
[1214871.317727]  zio_execute+0x99/0xf0 [zfs]
[1214871.317783]  taskq_thread+0x2f7/0x4e0 [spl]
[1214871.317839]  ? wake_up_q+0x80/0x80
[1214871.317925]  ? zio_taskq_member.isra.14.constprop.20+0x70/0x70 [zfs]
[1214871.317984]  kthread+0x120/0x140
[1214871.318050]  ? task_done+0xb0/0xb0 [spl]
[1214871.318104]  ? kthread_park+0x90/0x90
[1214871.318157]  ret_from_fork+0x35/0x40
[1214871.318215] INFO: task dp_sync_taskq:1157 blocked for more than 120 seconds.
[1214871.318287]       Tainted: P           O      5.4.114-1-pve #1
[1214871.318344] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1214871.318417] dp_sync_taskq   D    0  1157      2 0x80004000
[1214871.318473] Call Trace:
[1214871.318525]  __schedule+0x2e6/0x700
[1214871.318577]  schedule+0x33/0xa0
[1214871.318631]  cv_wait_common+0x104/0x130 [spl]
[1214871.318685]  ? wait_woken+0x80/0x80
[1214871.318751]  __cv_wait+0x15/0x20 [spl]
[1214871.318834]  dsl_scan_prefetch_thread+0xad/0x290 [zfs]
[1214871.318892]  taskq_thread+0x2f7/0x4e0 [spl]
[1214871.318947]  ? wake_up_q+0x80/0x80
[1214871.319001]  kthread+0x120/0x140
[1214871.319066]  ? task_done+0xb0/0xb0 [spl]
[1214871.319119]  ? kthread_park+0x90/0x90
[1214871.319173]  ret_from_fork+0x35/0x40
[1214871.319228] INFO: task txg_sync:1325 blocked for more than 120 seconds.
[1214871.319286]       Tainted: P           O      5.4.114-1-pve #1
[1214871.319342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1214871.319416] txg_sync        D    0  1325      2 0x80004000
[1214871.319472] Call Trace:
[1214871.319523]  __schedule+0x2e6/0x700
[1214871.319607]  ? vdev_mirror_io_start+0xb0/0x1a0 [zfs]
[1214871.319662]  schedule+0x33/0xa0
[1214871.319715]  schedule_timeout+0x152/0x330
[1214871.319778]  ? __next_timer_interrupt+0xd0/0xd0
[1214871.319833]  io_schedule_timeout+0x1e/0x50
[1214871.319888]  __cv_timedwait_common+0x138/0x170 [spl]
[1214871.319944]  ? wait_woken+0x80/0x80
[1214871.319998]  __cv_timedwait_io+0x19/0x20 [spl]
[1214871.320086]  zio_wait+0x139/0x280 [zfs]
[1214871.320162]  arc_read+0xd68/0x1210 [zfs]
[1214871.320249]  ? arc_hdr_set_compress+0x50/0x50 [zfs]
[1214871.320332]  dsl_scan_visitbp+0x77d/0xcd0 [zfs]
[1214871.320388]  ? __cv_broadcast+0x2a/0x40 [spl]
[1214871.320470]  ? dsl_scan_prefetch+0x222/0x260 [zfs]
[1214871.320552]  dsl_scan_visit_rootbp+0xe8/0x150 [zfs]
[1214871.320648]  dsl_scan_visitds+0x1ae/0x510 [zfs]
[1214871.320790]  ? dnode_rele_and_unlock+0xb6/0xe0 [zfs]
[1214871.320874]  ? dnode_rele+0x3b/0x40 [zfs]
[1214871.320951]  ? dbuf_rele_and_unlock+0x306/0x6a0 [zfs]
[1214871.321039]  ? dsl_dataset_hold_obj+0x68c/0x9e0 [zfs]
[1214871.321130]  ? dbuf_rele+0x3b/0x40 [zfs]
[1214871.321246]  dsl_scan_sync+0x90e/0x1320 [zfs]
[1214871.321337]  spa_sync+0x610/0xfe0 [zfs]
[1214871.321392]  ? mutex_lock+0x12/0x30
[1214871.321478]  ? spa_txg_history_init_io+0x104/0x110 [zfs]
[1214871.321568]  txg_sync_thread+0x2e1/0x4a0 [zfs]
[1214871.321656]  ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
[1214871.321714]  thread_generic_wrapper+0x74/0x90 [spl]
[1214871.321770]  kthread+0x120/0x140
[1214871.321825]  ? __thread_exit+0x20/0x20 [spl]
[1214871.321879]  ? kthread_park+0x90/0x90
[1214871.321933]  ret_from_fork+0x35/0x40
[1214992.151145] INFO: task z_rd_int:1114 blocked for more than 241 seconds.
[1214992.151211]       Tainted: P           O      5.4.114-1-pve #1
[1214992.151268] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1214992.151342] z_rd_int        D    0  1114      2 0x80004000
[1214992.151399] Call Trace:
[1214992.151456]  __schedule+0x2e6/0x700
[1214992.151518]  schedule+0x33/0xa0
[1214992.151584]  spl_panic+0xf9/0xfb [spl]
[1214992.151687]  ? zei_add_range+0x140/0x140 [zfs]
[1214992.151743]  ? __wake_up+0x13/0x20
[1214992.151798]  ? __cv_broadcast+0x2a/0x40 [spl]
[1214992.151882]  ? zfs_zevent_post+0x223/0x280 [zfs]
[1214992.151966]  arc_buf_destroy+0x101/0x110 [zfs]
[1214992.152042]  arc_read_done+0x243/0x480 [zfs]
[1214992.152128]  zio_done+0x43f/0x1020 [zfs]
[1214992.152213]  zio_execute+0x99/0xf0 [zfs]
[1214992.152281]  taskq_thread+0x2f7/0x4e0 [spl]
[1214992.152337]  ? wake_up_q+0x80/0x80
[1214992.152423]  ? zio_taskq_member.isra.14.constprop.20+0x70/0x70 [zfs]
[1214992.152482]  kthread+0x120/0x140
[1214992.152537]  ? task_done+0xb0/0xb0 [spl]
[1214992.152591]  ? kthread_park+0x90/0x90
[1214992.152645]  ret_from_fork+0x35/0x40
[1214992.152702] INFO: task dp_sync_taskq:1157 blocked for more than 241 seconds.
[1214992.152774]       Tainted: P           O      5.4.114-1-pve #1
[1214992.152831] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1214992.152905] dp_sync_taskq   D    0  1157      2 0x80004000
[1214992.152968] Call Trace:
[1214992.153019]  __schedule+0x2e6/0x700
[1214992.153072]  schedule+0x33/0xa0
[1214992.153126]  cv_wait_common+0x104/0x130 [spl]
[1214992.153180]  ? wait_woken+0x80/0x80
[1214992.153235]  __cv_wait+0x15/0x20 [spl]
[1214992.153315]  dsl_scan_prefetch_thread+0xad/0x290 [zfs]
[1214992.153374]  taskq_thread+0x2f7/0x4e0 [spl]
[1214992.153428]  ? wake_up_q+0x80/0x80
[1214992.153482]  kthread+0x120/0x140
[1214992.153536]  ? task_done+0xb0/0xb0 [spl]
[1214992.153589]  ? kthread_park+0x90/0x90
[1214992.153654]  ret_from_fork+0x35/0x40
[1214992.153709] INFO: task txg_sync:1325 blocked for more than 241 seconds.
[1214992.153767]       Tainted: P           O      5.4.114-1-pve #1
[1214992.153823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1214992.153897] txg_sync        D    0  1325      2 0x80004000
[1214992.153953] Call Trace:
[1214992.154016]  __schedule+0x2e6/0x700
[1214992.154099]  ? vdev_mirror_io_start+0xb0/0x1a0 [zfs]
[1214992.159404]  schedule+0x33/0xa0
[1214992.159463]  schedule_timeout+0x152/0x330
[1214992.159521]  ? __next_timer_interrupt+0xd0/0xd0
[1214992.159579]  io_schedule_timeout+0x1e/0x50
[1214992.159640]  __cv_timedwait_common+0x138/0x170 [spl]
[1214992.159699]  ? wait_woken+0x80/0x80
[1214992.159754]  __cv_timedwait_io+0x19/0x20 [spl]
[1214992.159842]  zio_wait+0x139/0x280 [zfs]
[1214992.159918]  arc_read+0xd68/0x1210 [zfs]
[1214992.159994]  ? arc_hdr_set_compress+0x50/0x50 [zfs]
[1214992.160079]  dsl_scan_visitbp+0x77d/0xcd0 [zfs]
[1214992.160136]  ? __cv_broadcast+0x2a/0x40 [spl]
[1214992.160219]  ? dsl_scan_prefetch+0x222/0x260 [zfs]
[1214992.160304]  dsl_scan_visit_rootbp+0xe8/0x150 [zfs]
[1214992.160388]  dsl_scan_visitds+0x1ae/0x510 [zfs]
[1214992.160469]  ? dnode_rele_and_unlock+0xb6/0xe0 [zfs]
[1214992.160551]  ? dnode_rele+0x3b/0x40 [zfs]
[1214992.160628]  ? dbuf_rele_and_unlock+0x306/0x6a0 [zfs]
[1214992.160711]  ? dsl_dataset_hold_obj+0x68c/0x9e0 [zfs]
[1214992.160791]  ? dbuf_rele+0x3b/0x40 [zfs]
[1214992.160874]  dsl_scan_sync+0x90e/0x1320 [zfs]
[1214992.160959]  spa_sync+0x610/0xfe0 [zfs]
[1214992.161013]  ? mutex_lock+0x12/0x30
[1214992.161098]  ? spa_txg_history_init_io+0x104/0x110 [zfs]
[1214992.161186]  txg_sync_thread+0x2e1/0x4a0 [zfs]
[1214992.161273]  ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
[1214992.161332]  thread_generic_wrapper+0x74/0x90 [spl]
[1214992.161388]  kthread+0x120/0x140
[1214992.161442]  ? __thread_exit+0x20/0x20 [spl]
[1214992.161497]  ? kthread_park+0x90/0x90
[1214992.161550]  ret_from_fork+0x35/0x40
[1215112.985599] INFO: task z_rd_int:1114 blocked for more than 362 seconds.
[1215112.985661]       Tainted: P           O      5.4.114-1-pve #1
[1215112.985718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1215112.985792] z_rd_int        D    0  1114      2 0x80004000
[1215112.985849] Call Trace:
[1215112.985905]  __schedule+0x2e6/0x700
[1215112.985960]  schedule+0x33/0xa0
[1215112.986017]  spl_panic+0xf9/0xfb [spl]
[1215112.986119]  ? zei_add_range+0x140/0x140 [zfs]
[1215112.986175]  ? __wake_up+0x13/0x20
[1215112.986229]  ? __cv_broadcast+0x2a/0x40 [spl]
[1215112.986314]  ? zfs_zevent_post+0x223/0x280 [zfs]
[1215112.986395]  arc_buf_destroy+0x101/0x110 [zfs]
[1215112.986473]  arc_read_done+0x243/0x480 [zfs]
[1215112.986561]  zio_done+0x43f/0x1020 [zfs]
[1215112.986648]  zio_execute+0x99/0xf0 [zfs]
[1215112.986705]  taskq_thread+0x2f7/0x4e0 [spl]
[1215112.986761]  ? wake_up_q+0x80/0x80
[1215112.986847]  ? zio_taskq_member.isra.14.constprop.20+0x70/0x70 [zfs]
[1215112.986906]  kthread+0x120/0x140
[1215112.986961]  ? task_done+0xb0/0xb0 [spl]
[1215112.987015]  ? kthread_park+0x90/0x90
[1215112.987069]  ret_from_fork+0x35/0x40
[1215112.987126] INFO: task dp_sync_taskq:1157 blocked for more than 362 seconds.
[1215112.987199]       Tainted: P           O      5.4.114-1-pve #1
[1215112.987255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1215112.987329] dp_sync_taskq   D    0  1157      2 0x80004000
[1215112.987385] Call Trace:
[1215112.987436]  __schedule+0x2e6/0x700
[1215112.987490]  schedule+0x33/0xa0
[1215112.987543]  cv_wait_common+0x104/0x130 [spl]
[1215112.987598]  ? wait_woken+0x80/0x80
[1215112.987653]  __cv_wait+0x15/0x20 [spl]
[1215112.987735]  dsl_scan_prefetch_thread+0xad/0x290 [zfs]
[1215112.987794]  taskq_thread+0x2f7/0x4e0 [spl]
[1215112.987849]  ? wake_up_q+0x80/0x80
[1215112.987902]  kthread+0x120/0x140
[1215112.987957]  ? task_done+0xb0/0xb0 [spl]
[1215112.988010]  ? kthread_park+0x90/0x90
[1215112.988064]  ret_from_fork+0x35/0x40
[1215233.820082] INFO: task z_rd_int:1114 blocked for more than 483 seconds.
[1215233.820144]       Tainted: P           O      5.4.114-1-pve #1
[1215233.820201] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1215233.820275] z_rd_int        D    0  1114      2 0x80004000
[1215233.820332] Call Trace:
[1215233.820389]  __schedule+0x2e6/0x700
[1215233.820443]  schedule+0x33/0xa0
[1215233.820501]  spl_panic+0xf9/0xfb [spl]
[1215233.820603]  ? zei_add_range+0x140/0x140 [zfs]
[1215233.820660]  ? __wake_up+0x13/0x20
[1215233.820714]  ? __cv_broadcast+0x2a/0x40 [spl]
[1215233.820799]  ? zfs_zevent_post+0x223/0x280 [zfs]
[1215233.820876]  arc_buf_destroy+0x101/0x110 [zfs]
[1215233.820953]  arc_read_done+0x243/0x480 [zfs]
[1215233.821041]  zio_done+0x43f/0x1020 [zfs]
[1215233.821129]  zio_execute+0x99/0xf0 [zfs]
[1215233.821185]  taskq_thread+0x2f7/0x4e0 [spl]
[1215233.821241]  ? wake_up_q+0x80/0x80
[1215233.821327]  ? zio_taskq_member.isra.14.constprop.20+0x70/0x70 [zfs]
[1215233.821386]  kthread+0x120/0x140
[1215233.821441]  ? task_done+0xb0/0xb0 [spl]
[1215233.821494]  ? kthread_park+0x90/0x90
[1215233.821549]  ret_from_fork+0x35/0x40
[1215233.821606] INFO: task dp_sync_taskq:1157 blocked for more than 483 seconds.
[1215233.821679]       Tainted: P           O      5.4.114-1-pve #1
[1215233.821735] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1215233.821809] dp_sync_taskq   D    0  1157      2 0x80004000
[1215233.821865] Call Trace:
[1215233.821917]  __schedule+0x2e6/0x700
[1215233.821970]  schedule+0x33/0xa0
[1215233.822024]  cv_wait_common+0x104/0x130 [spl]
[1215233.822078]  ? wait_woken+0x80/0x80
[1215233.822133]  __cv_wait+0x15/0x20 [spl]
[1215233.822216]  dsl_scan_prefetch_thread+0xad/0x290 [zfs]
[1215233.822274]  taskq_thread+0x2f7/0x4e0 [spl]
[1215233.822329]  ? wake_up_q+0x80/0x80
[1215233.822383]  kthread+0x120/0x140
[1215233.822437]  ? task_done+0xb0/0xb0 [spl]
[1215233.822491]  ? kthread_park+0x90/0x90
[1215233.822545]  ret_from_fork+0x35/0x40

This is an Intel(R) Xeon(R) CPU E5-1620 v2 (4 core, 8 threads) cpu, 32GB of ECC ram running on a Supermicro X9SRE-3F motherboard.

No other errors logged. The system was responsive due to a separated ext4 root unrelated to zfs. I was unable to unmount, mount, touch anything zfs-related. I had to issue a unmount/sync/reboot magic sysreq to reboot it.

Maybe the reason we're able to trigger this issue is because these are server systems doing heavy lifting on real parallel threads as opposed to "simulated" under a VM?

I hope this helps. Let me know if I can do anything to help diagnose further.

Edit: I'm reading he metrics on the server and can give some more details. The issue started around 2 o'clock in the morning, The total lockup of the server arrived around 4 o'clock, 2 hours later. I had the following symptoms:

  • iowait climbed from 10% to 100% steady, then at 4 o'clock iowait went to 50%, system time went from 0-5% to 50%
  • Disk io request went from 200/sec to 0 at 2 o'clock and staid like this until the reset.
  • swap usage was around 4GB the whole time (100%), with used swap climbing from 0 to ~512Mb starting at 2 o'clock
  • Swap io bytes were high. There was active swap in/out going on. This is not normal. Usually this server has 0% swapping going around.

All other parameters are ok. This last lock-up seems unrelated to the crypt path above. It looks more like the issues of ram starvation like we used to have back a few years ago.

@rincebrain
Copy link
Contributor

A data point:
I left my Xeon D-1540 (8c16t, Broadwell) (Debian buster, 4.19.0-16-amd64, 2.0.3-1~bpo10+1) running a loop for 24 hours of running 16 zfs receives into an encrypted parent of an unencrypted source send (no flags) of a ~39GiB dataset I have, with no VM, just in case it was indeed something that couldn't be triggered for that reason, then destroying them, and so on. No excitement ensued.

Can you share the output of zfs get all [one of these source datasets] and zpool get all [source pool], and the same for the destination pool and dataset? I'd like to eliminate as many possible differences as I can.

@cserem
Copy link

cserem commented May 29, 2021

I have randomly picked two sources with two datasets, and the destination: zfs get all output
It might be worth noting:

  • I have yet to encounter data corruption (scrubs are monthly, and the issue has been persistent around for about a year now (see target pool creation date)

  • The sends are incremental

  • Example command:
    /opt/sanoid-2.0.3/syncoid root@source-machine:rpool-ssd/data rpool/backups/source-machine/rpool-ssd/data --skip-parent -r --compress pigz-fast -o ServerAliveCountMax=3 -o ServerAliveInterval=30 --no-sync-snap --debug
    (there are machines where i make the sync-snap so the --no-sync-snap is omitted)

  • I have summarized the available crash data:

single thread, icp_aes_impl icp_gcm_impl on fastest
Apr 02 08:16:26 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Apr 14 09:05:41 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Apr 17 04:33:22 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
May 01 16:02:09 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
May 06 11:46:07 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
May 12 05:49:43 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
May 12 21:07:13 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000

(Since the last two crashes where so close to eachother, I've implemented a watchdog and automatic unlocking, and went full berserk, re-enabling concurrency)
4 thread, icp_aes_impl icp_gcm_impl on fastest:
May 21 23:25:21 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000 Uptime: 39:46:02
May 24 03:31:06 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000 Uptime: 52:05:45

4 threads icp_aes_impl icp_gcm_impl on generic:
May 24 14:00:40 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000 Uptime: 10:29:34
May 26 03:29:24 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000 Uptime: 37:28:44

4 threads icp_aes_impl icp_gcm_impl on x86_64 / pclmulqdq
No crash yet, uptime is 3 days, 7:47. (Not even close to the ~2 week best uptime with single thread, but with 4 threads it is the best so far)

@rincebrain
Copy link
Contributor

rincebrain commented May 30, 2021

This time, I left a loop of 4 sends and rollbacks running, after creating an incremental by (on the same source I used for my prior send loop:

  • truncating half the files to 0b and deleting the other half
  • setting src and dst to compression=gzip-9 since the examples posted by cserem all had that
  • rsync a --inplace all the data back over
  • take a new snapshot

Then zfs send -I snap1 snap2 > incsend
Then a loop of while true; for i in $(seq 1 4); do dd if=incsend bs=1M | zfs recv encroot/$i & done; wait; for i in $(seq 1 4);do zfs rollback encroot/$i@snap1;done; done; (boy, my server loved 4 gzip-9 streams at once)

Unfortunately, as you might guess from my not leading with a success statement, it did not reproduce after 24 hours of looping.

I'll probably explore including a zvol next, unless someone else comes up with a brighter idea.

@pquan
Copy link

pquan commented May 31, 2021

I think I stated before, my setup, but I'll try to explain my setup, in a deeper fashion:

  • I'm running encrypted to encrypted, recursive, incremental transfer. The keys on the two servers differ.
  • I'm using resumable transfers because of the size of each of them
  • The size of the dataset is easily 200-1TB each. Some of them are mail pools with millions of small files (in case this matters).
  • While doing the transfers, I'm also snapshotting these same datasets every hour, possibly having several snapshots occur during the transfer.
  • And, I'm deleting expired snapshots on both the source and destination server. Those expired snapshots are not interested in the transfer.
  • There are around 60 to 100 different snapshots in each dataset: 48 hourly, 30 daily, 12 monthly and an yearly snapshot, so some of the data is quite old.
  • Both servers use a 2x2 mirrors (2 stripes of two mirrors), spinning disks (2TB each, , and thus awfully slow, 512bytes physical but ashift is 12).

I do not lock easily. Once in a week/10 days.

Now, some meat. In the past days, I detected some corruption on one of the source servers. ZFS could mount the dataset but the data in it was inconsistent (missing most directories under the root). Also I could not unmount it once mounted. I had this issue on two related datasets: Same LXC root and data (1TB mail server data).

Weirdly enough, sending this corrupted dataset to another dataset on the same machine seemed to fix the issue on one of them. The other (the mail spool), I had to roll back to a known good state, then I had to resync all missing mails form an off-site server before putting it back in production (this is a multisite, geo-replicated mail spool setup, so I had some live data to sync in addition).

I was able to delete the corrupted datasets. A full scrub fixed the data errors in the already deleted snapshots.

I know all the above does not give any hints, other than anecdotal evidence of a possible problem. I'm only sure there was corruption to start with and it was silent. It probably got to break some algo somewhere. I'd start with a full scrub on all affected pools.

Here's some get all statistics. Please note, despite the name of the rpool (proxmox root pool) this is not an actual root pool containing the boot files. I just use the same name for historic reasons, but the pool is mounted inside an "ext4" root.

First: sending server pool:

# zpool get all rpool
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           3.38T                          -
rpool  capacity                       49%                            -
rpool  altroot                        -                              default
rpool  health                         ONLINE                         -
rpool  guid                           10861474846250979260           -
rpool  version                        -                              default
rpool  bootfs                         -                              default
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      -                              default
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupratio                     1.00x                          -
rpool  free                           1.71T                          -
rpool  allocated                      1.67T                          -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  35%                            -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  checkpoint                     -                              -
rpool  load_guid                      8733562770359337761            -
rpool  autotrim                       off                            default
rpool  feature@async_destroy          enabled                        local
rpool  feature@empty_bpobj            active                         local
rpool  feature@lz4_compress           active                         local
rpool  feature@multi_vdev_crash_dump  enabled                        local
rpool  feature@spacemap_histogram     active                         local
rpool  feature@enabled_txg            active                         local
rpool  feature@hole_birth             active                         local
rpool  feature@extensible_dataset     active                         local
rpool  feature@embedded_data          active                         local
rpool  feature@bookmarks              enabled                        local
rpool  feature@filesystem_limits      enabled                        local
rpool  feature@large_blocks           enabled                        local
rpool  feature@large_dnode            active                         local
rpool  feature@sha512                 enabled                        local
rpool  feature@skein                  enabled                        local
rpool  feature@edonr                  enabled                        local
rpool  feature@userobj_accounting     active                         local
rpool  feature@encryption             active                         local
rpool  feature@project_quota          active                         local
rpool  feature@device_removal         enabled                        local
rpool  feature@obsolete_counts        enabled                        local
rpool  feature@zpool_checkpoint       enabled                        local
rpool  feature@spacemap_v2            active                         local
rpool  feature@allocation_classes     enabled                        local
rpool  feature@resilver_defer         enabled                        local
rpool  feature@bookmark_v2            enabled                        local
rpool  feature@redaction_bookmarks    disabled                       local
rpool  feature@redacted_datasets      disabled                       local
rpool  feature@bookmark_written       disabled                       local
rpool  feature@log_spacemap           disabled                       local
rpool  feature@livelist               disabled                       local
rpool  feature@device_rebuild         disabled                       local
rpool  feature@zstd_compress          disabled                       local

Sending server dataset. This is an example. All my datasets are configured the same way (lz4, xattr=sa etc).

# zfs get all rpool/vm/subvol-4242-disk-0
NAME                         PROPERTY              VALUE                         SOURCE
rpool/vm/subvol-4242-disk-0  type                  filesystem                    -
rpool/vm/subvol-4242-disk-0  creation              Wed Mar 10  8:45 2021         -
rpool/vm/subvol-4242-disk-0  used                  11.7G                         -
rpool/vm/subvol-4242-disk-0  available             1.60T                         -
rpool/vm/subvol-4242-disk-0  referenced            5.54G                         -
rpool/vm/subvol-4242-disk-0  compressratio         2.30x                         -
rpool/vm/subvol-4242-disk-0  mounted               yes                           -
rpool/vm/subvol-4242-disk-0  quota                 none                          default
rpool/vm/subvol-4242-disk-0  reservation           none                          default
rpool/vm/subvol-4242-disk-0  recordsize            128K                          default
rpool/vm/subvol-4242-disk-0  mountpoint            /rpool/vm/subvol-4242-disk-0  default
rpool/vm/subvol-4242-disk-0  sharenfs              off                           default
rpool/vm/subvol-4242-disk-0  checksum              on                            default
rpool/vm/subvol-4242-disk-0  compression           lz4                           inherited from rpool
rpool/vm/subvol-4242-disk-0  atime                 off                           inherited from rpool
rpool/vm/subvol-4242-disk-0  devices               on                            default
rpool/vm/subvol-4242-disk-0  exec                  on                            default
rpool/vm/subvol-4242-disk-0  setuid                on                            default
rpool/vm/subvol-4242-disk-0  readonly              off                           default
rpool/vm/subvol-4242-disk-0  zoned                 off                           default
rpool/vm/subvol-4242-disk-0  snapdir               hidden                        default
rpool/vm/subvol-4242-disk-0  aclmode               discard                       default
rpool/vm/subvol-4242-disk-0  aclinherit            restricted                    default
rpool/vm/subvol-4242-disk-0  createtxg             3675833                       -
rpool/vm/subvol-4242-disk-0  canmount              on                            default
rpool/vm/subvol-4242-disk-0  xattr                 sa                            inherited from rpool
rpool/vm/subvol-4242-disk-0  copies                1                             default
rpool/vm/subvol-4242-disk-0  version               5                             -
rpool/vm/subvol-4242-disk-0  utf8only              off                           -
rpool/vm/subvol-4242-disk-0  normalization         none                          -
rpool/vm/subvol-4242-disk-0  casesensitivity       sensitive                     -
rpool/vm/subvol-4242-disk-0  vscan                 off                           default
rpool/vm/subvol-4242-disk-0  nbmand                off                           default
rpool/vm/subvol-4242-disk-0  sharesmb              off                           default
rpool/vm/subvol-4242-disk-0  refquota              none                          default
rpool/vm/subvol-4242-disk-0  refreservation        none                          default
rpool/vm/subvol-4242-disk-0  guid                  13076753325802082524          -
rpool/vm/subvol-4242-disk-0  primarycache          all                           default
rpool/vm/subvol-4242-disk-0  secondarycache        all                           default
rpool/vm/subvol-4242-disk-0  usedbysnapshots       6.13G                         -
rpool/vm/subvol-4242-disk-0  usedbydataset         5.54G                         -
rpool/vm/subvol-4242-disk-0  usedbychildren        0B                            -
rpool/vm/subvol-4242-disk-0  usedbyrefreservation  0B                            -
rpool/vm/subvol-4242-disk-0  logbias               latency                       default
rpool/vm/subvol-4242-disk-0  objsetid              53904                         -
rpool/vm/subvol-4242-disk-0  dedup                 off                           default
rpool/vm/subvol-4242-disk-0  mlslabel              none                          default
rpool/vm/subvol-4242-disk-0  sync                  standard                      inherited from rpool/vm
rpool/vm/subvol-4242-disk-0  dnodesize             auto                          inherited from rpool
rpool/vm/subvol-4242-disk-0  refcompressratio      2.06x                         -
rpool/vm/subvol-4242-disk-0  written               8K                            -
rpool/vm/subvol-4242-disk-0  logicalused           24.6G                         -
rpool/vm/subvol-4242-disk-0  logicalreferenced     10.8G                         -
rpool/vm/subvol-4242-disk-0  volmode               default                       default
rpool/vm/subvol-4242-disk-0  filesystem_limit      none                          default
rpool/vm/subvol-4242-disk-0  snapshot_limit        none                          default
rpool/vm/subvol-4242-disk-0  filesystem_count      none                          default
rpool/vm/subvol-4242-disk-0  snapshot_count        none                          default
rpool/vm/subvol-4242-disk-0  snapdev               hidden                        default
rpool/vm/subvol-4242-disk-0  acltype               posix                         inherited from rpool
rpool/vm/subvol-4242-disk-0  context               none                          default
rpool/vm/subvol-4242-disk-0  fscontext             none                          default
rpool/vm/subvol-4242-disk-0  defcontext            none                          default
rpool/vm/subvol-4242-disk-0  rootcontext           none                          default
rpool/vm/subvol-4242-disk-0  relatime              off                           default
rpool/vm/subvol-4242-disk-0  redundant_metadata    all                           default
rpool/vm/subvol-4242-disk-0  overlay               on                            default
rpool/vm/subvol-4242-disk-0  encryption            aes-256-gcm                   -
rpool/vm/subvol-4242-disk-0  keylocation           none                          default
rpool/vm/subvol-4242-disk-0  keyformat             passphrase                    -
rpool/vm/subvol-4242-disk-0  pbkdf2iters           350000                        -
rpool/vm/subvol-4242-disk-0  encryptionroot        rpool                         -
rpool/vm/subvol-4242-disk-0  keystatus             available                     -
rpool/vm/subvol-4242-disk-0  special_small_blocks  0                             default

Here's a receiver get all :

# zpool get all rpool
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           3.62T                          -
rpool  capacity                       70%                            -
rpool  altroot                        -                              default
rpool  health                         ONLINE                         -
rpool  guid                           16014196203982227614           -
rpool  version                        -                              default
rpool  bootfs                         rpool/ROOT/pve-1               local
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      -                              default
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupratio                     1.00x                          -
rpool  free                           1.07T                          -
rpool  allocated                      2.56T                          -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  29%                            -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  checkpoint                     -                              -
rpool  load_guid                      9824040886009751730            -
rpool  autotrim                       off                            default
rpool  feature@async_destroy          enabled                        local
rpool  feature@empty_bpobj            active                         local
rpool  feature@lz4_compress           active                         local
rpool  feature@multi_vdev_crash_dump  enabled                        local
rpool  feature@spacemap_histogram     active                         local
rpool  feature@enabled_txg            active                         local
rpool  feature@hole_birth             active                         local
rpool  feature@extensible_dataset     active                         local
rpool  feature@embedded_data          active                         local
rpool  feature@bookmarks              enabled                        local
rpool  feature@filesystem_limits      enabled                        local
rpool  feature@large_blocks           enabled                        local
rpool  feature@large_dnode            active                         local
rpool  feature@sha512                 enabled                        local
rpool  feature@skein                  enabled                        local
rpool  feature@edonr                  enabled                        local
rpool  feature@userobj_accounting     active                         local
rpool  feature@encryption             active                         local
rpool  feature@project_quota          active                         local
rpool  feature@device_removal         enabled                        local
rpool  feature@obsolete_counts        enabled                        local
rpool  feature@zpool_checkpoint       enabled                        local
rpool  feature@spacemap_v2            active                         local
rpool  feature@allocation_classes     enabled                        local
rpool  feature@resilver_defer         enabled                        local
rpool  feature@bookmark_v2            enabled                        local
rpool  feature@redaction_bookmarks    disabled                       local
rpool  feature@redacted_datasets      disabled                       local
rpool  feature@bookmark_written       disabled                       local
rpool  feature@log_spacemap           disabled                       local
rpool  feature@livelist               disabled                       local
rpool  feature@device_rebuild         disabled                       local
rpool  feature@zstd_compress          disabled                       local

And a dataset:

NAME                                       PROPERTY              VALUE                                       SOURCE
rpool/backup/vm/subvol-3140-disk-1  type                  filesystem                                  -
rpool/backup/vm/subvol-3140-disk-1  creation              Thu Apr  8 10:28 2021                       -
rpool/backup/vm/subvol-3140-disk-1  used                  20.6G                                       -
rpool/backup/vm/subvol-3140-disk-1  available             970G                                        -
rpool/backup/vm/subvol-3140-disk-1  referenced            7.66G                                       -
rpool/backup/vm/subvol-3140-disk-1  compressratio         2.13x                                       -
rpool/backup/vm/subvol-3140-disk-1  mounted               no                                          -
rpool/backup/vm/subvol-3140-disk-1  quota                 none                                        default
rpool/backup/vm/subvol-3140-disk-1  reservation           none                                        default
rpool/backup/vm/subvol-3140-disk-1  recordsize            128K                                        default
rpool/backup/vm/subvol-3140-disk-1  mountpoint            /rpool/backup/vm/subvol-3140-disk-1  default
rpool/backup/vm/subvol-3140-disk-1  sharenfs              off                                         default
rpool/backup/vm/subvol-3140-disk-1  checksum              on                                          default
rpool/backup/vm/subvol-3140-disk-1  compression           lz4                                         inherited from rpool
rpool/backup/vm/subvol-3140-disk-1  atime                 off                                         inherited from rpool
rpool/backup/vm/subvol-3140-disk-1  devices               on                                          default
rpool/backup/vm/subvol-3140-disk-1  exec                  on                                          default
rpool/backup/vm/subvol-3140-disk-1  setuid                on                                          default
rpool/backup/vm/subvol-3140-disk-1  readonly              off                                         default
rpool/backup/vm/subvol-3140-disk-1  zoned                 off                                         default
rpool/backup/vm/subvol-3140-disk-1  snapdir               hidden                                      default
rpool/backup/vm/subvol-3140-disk-1  aclmode               discard                                     default
rpool/backup/vm/subvol-3140-disk-1  aclinherit            restricted                                  default
rpool/backup/vm/subvol-3140-disk-1  createtxg             10724                                       -
rpool/backup/vm/subvol-3140-disk-1  canmount              on                                          default
rpool/backup/vm/subvol-3140-disk-1  xattr                 on                                          default
rpool/backup/vm/subvol-3140-disk-1  copies                1                                           default
rpool/backup/vm/subvol-3140-disk-1  version               5                                           -
rpool/backup/vm/subvol-3140-disk-1  utf8only              off                                         -
rpool/backup/vm/subvol-3140-disk-1  normalization         none                                        -
rpool/backup/vm/subvol-3140-disk-1  casesensitivity       sensitive                                   -
rpool/backup/vm/subvol-3140-disk-1  vscan                 off                                         default
rpool/backup/vm/subvol-3140-disk-1  nbmand                off                                         default
rpool/backup/vm/subvol-3140-disk-1  sharesmb              off                                         default
rpool/backup/vm/subvol-3140-disk-1  refquota              none                                        default
rpool/backup/vm/subvol-3140-disk-1  refreservation        none                                        default
rpool/backup/vm/subvol-3140-disk-1  guid                  5702550096622166170                         -
rpool/backup/vm/subvol-3140-disk-1  primarycache          all                                         default
rpool/backup/vm/subvol-3140-disk-1  secondarycache        all                                         default
rpool/backup/vm/subvol-3140-disk-1  usedbysnapshots       12.9G                                       -
rpool/backup/vm/subvol-3140-disk-1  usedbydataset         7.66G                                       -
rpool/backup/vm/subvol-3140-disk-1  usedbychildren        0B                                          -
rpool/backup/vm/subvol-3140-disk-1  usedbyrefreservation  0B                                          -
rpool/backup/vm/subvol-3140-disk-1  logbias               latency                                     default
rpool/backup/vm/subvol-3140-disk-1  objsetid              185                                         -
rpool/backup/vm/subvol-3140-disk-1  dedup                 off                                         default
rpool/backup/vm/subvol-3140-disk-1  mlslabel              none                                        default
rpool/backup/vm/subvol-3140-disk-1  sync                  standard                                    inherited from rpool
rpool/backup/vm/subvol-3140-disk-1  dnodesize             legacy                                      default
rpool/backup/vm/subvol-3140-disk-1  refcompressratio      1.19x                                       -
rpool/backup/vm/subvol-3140-disk-1  written               0                                           -
rpool/backup/vm/subvol-3140-disk-1  logicalused           43.2G                                       -
rpool/backup/vm/subvol-3140-disk-1  logicalreferenced     8.96G                                       -
rpool/backup/vm/subvol-3140-disk-1  volmode               default                                     default
rpool/backup/vm/subvol-3140-disk-1  filesystem_limit      none                                        default
rpool/backup/vm/subvol-3140-disk-1  snapshot_limit        none                                        default
rpool/backup/vm/subvol-3140-disk-1  filesystem_count      none                                        default
rpool/backup/vm/subvol-3140-disk-1  snapshot_count        none                                        default
rpool/backup/vm/subvol-3140-disk-1  snapdev               hidden                                      default
rpool/backup/vm/subvol-3140-disk-1  acltype               off                                         default
rpool/backup/vm/subvol-3140-disk-1  context               none                                        default
rpool/backup/vm/subvol-3140-disk-1  fscontext             none                                        default
rpool/backup/vm/subvol-3140-disk-1  defcontext            none                                        default
rpool/backup/vm/subvol-3140-disk-1  rootcontext           none                                        default
rpool/backup/vm/subvol-3140-disk-1  relatime              off                                         default
rpool/backup/vm/subvol-3140-disk-1  redundant_metadata    all                                         default
rpool/backup/vm/subvol-3140-disk-1  overlay               on                                          default
rpool/backup/vm/subvol-3140-disk-1  encryption            aes-256-gcm                                 -
rpool/backup/vm/subvol-3140-disk-1  keylocation           none                                        default
rpool/backup/vm/subvol-3140-disk-1  keyformat             passphrase                                  -
rpool/backup/vm/subvol-3140-disk-1  pbkdf2iters           350000                                      -
rpool/backup/vm/subvol-3140-disk-1  encryptionroot        rpool/backup                          -
rpool/backup/vm/subvol-3140-disk-1  keystatus             available                                   -
rpool/backup/vm/subvol-3140-disk-1  special_small_blocks  0                                           default

And a last thing I almost forgot:
On some of the systems I'm having a misaligned write error. In the past this seemed related to hardware problems with bad sata cables and/or controller issues.
I'm getting this issue on servers with onboard sata ports with very short cables. Again all this started after the upgrade to 0.8.x and most prominently I'm finding it on v2.0.4.
This happens on both SSD and SATA disks.
I'll reduce the speed on the sata disks to 1.5GBit as a check but doing so on the SSDs would not be feasible.

Sorry to drop all this meat on the table. I hope I'm not causing confusion. I've not been able to isolate the issue in a better way.

@cserem
Copy link

cserem commented May 31, 2021

Update:
4 threads icp_aes_impl icp_gcm_impl on x86_64 / pclmulqdq
crashed after: 5 days, 14:47
Trying again with the above parameters.

I was able to get the zfs related processes:

root     20454  0.0  0.0   2384   700 ?        S    18:25   0:00 sh -c ssh  -o ServerAliveCountMax=3 -o ServerAliveInterval=30   -S /tmp/syncoid-root@source1-1622478259 root@source1 ' zfs send  -I '"'"'rpool/rpool-ssd-backup/subvol-120-disk-2'"'"'@'"'"'2021-05-31-180000'"'"' '"'"'rpool/rpool-ssd-backup/subvol-120-disk-2'"'"'@'"'"'2021-05-31-181500'"'"' | pigz -3 | mbuffer  -q -s 128k -m 16M 2>/dev/null' | mbuffer  -q -s 128k -m 16M 2>/dev/null | pigz -dc | pv -p -t -e -r -b -s 52098832 |  zfs receive  -s -F 'rpool/backups/SOURCE1/rpool/rpool-ssd-backup/subvol-120-disk-2' 2>&1
root     20459  0.0  0.0  15708  4172 ?        S    18:25   0:00 zfs receive -s -F rpool/backups/SOURCE1/rpool/rpool-ssd-backup/subvol-120-disk-2
root     28671  0.0  0.0   2384   764 ?        S    18:25   0:00 sh -c ssh  -o ServerAliveCountMax=3 -o ServerAliveInterval=30   -S /tmp/syncoid-root@SOURCE2-1622478239 root@SOURCE2 ' zfs send  -I '"'"'rpool/data/subvol-117-disk-0'"'"'@'"'"'syncoid_pve_2021-05-31:18:12:10-GMT02:00'"'"' '"'"'rpool/data/subvol-117-disk-0'"'"'@'"'"'syncoid_pve_2021-05-31:18:25:51-GMT02:00'"'"' | pigz -3 | mbuffer  -q -s 128k -m 16M 2>/dev/null' | mbuffer  -q -s 128k -m 16M 2>/dev/null | pigz -dc | pv -p -t -e -r -b -s 495575408 |  zfs receive  -s -F 'rpool/backups/SOURCE2/rpool/data/subvol-117-disk-0' 2>&1
root     28680  0.3  0.0  15708  4424 ?        D    18:25   0:00 zfs receive -s -F rpool/backups/SOURCE2/rpool/data/subvol-117-disk-0
root     31831  0.0  0.0   2384   760 ?        S    18:26   0:00 sh -c zfs rollback -R 'rpool/backups/SOURCE3/rpool-ssd/data/subvol-104-disk-1'@'2021-05-31-180000'
root     31832  0.0  0.0  15704  4144 ?        D    18:26   0:00 zfs rollback -R rpool/backups/SOURCE3/rpool-ssd/data/subvol-104-disk-1@2021-05-31-180000
root     31863  0.0  0.0   2384   764 ?        S    18:26   0:00 sh -c zfs destroy 'rpool/backups/SOURCE4/rpool/data/subvol-106-disk-2'@syncoid_pve_2021-05-31:18:12:03-GMT02:00
root     31865  0.0  0.0  15708  4012 ?        S    18:26   0:00 zfs destroy rpool/backups/SOURCE4/rpool/data/subvol-106-disk-2@syncoid_pve_2021-05-31:18:12:03-GMT02:00

Since the destroys (thoese are some of the receives where I deliberately make a snapshot for syncing, and they are being cleaned up after the transfer) have the highest pid, perhaps the crash is related to that?
Then again the rollback is 'D'... I will hopefully have more ps outputs in the future.

@cserem
Copy link

cserem commented Jun 2, 2021

4 threads icp_aes_impl icp_gcm_impl on x86_64 / pclmulqdq
crashed after: 1 day, 6:44
I am switching back to default (fastest)

Unfortunately i couldn't capture a process list this time.

Here are some events from previous crashes up to the point the system is rebooted and the pool is imported:

zpool history
2021-05-24.03:30:46 zfs destroy rpool/backups/SOURCE4/rpool/data/subvol-112-disk-1@syncoid_pve_2021-05-24:03:15:33-GMT02:00
2021-05-24.03:30:51 zfs rollback -R rpool/backups/SOURCE8/rpool/data/subvol-106-disk-2@syncoid_pve_2021-05-24:03:15:50-GMT02:00
2021-05-24.03:30:52 zfs rollback -R rpool/backups/SOURCE4/rpool/data/subvol-115-disk-1@syncoid_pve_2021-05-24:03:15:46-GMT02:00
2021-05-24.03:30:55 zfs receive -s -F rpool/backups/SOURCE8/rpool/data/subvol-106-disk-2
2021-05-24.03:30:57 zfs destroy rpool/backups/SOURCE8/rpool/data/subvol-106-disk-2@syncoid_pve_2021-05-24:03:15:50-GMT02:00
2021-05-24.03:31:00 zfs receive -s -F rpool/backups/SOURCE4/rpool/data/subvol-115-disk-1
2021-05-24.03:31:02 zfs destroy rpool/backups/SOURCE4/rpool/data/subvol-115-disk-1@syncoid_pve_2021-05-24:03:15:46-GMT02:00
2021-05-24.03:31:03 zfs receive -s -F rpool/backups/SOURCE3/rpool/rpool-ssd-backup/subvol-120-disk-2
2021-05-24.03:31:04 zfs receive -s -F rpool/backups/SOURCE7/rpool-hdd/data/vm-102-disk-5
2021-05-24.03:31:04 zfs rollback -R rpool/backups/SOURCE2/rpool-ssd/ROOT/pve-1@2021-05-24-030000
>>>>May 24 03:31:06 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
2021-05-24.07:11:22 zpool import -N rpool
--


2021-05-24.14:00:33 zfs receive -s -F rpool/backups/SOURCE7/rpool-hdd/data/vm-101-disk-0
2021-05-24.14:00:34 zfs receive -F rpool/backups/SOURCE5/rpool/data/vm-100-disk-2
2021-05-24.14:00:34 zfs receive -s -F rpool/backups/SOURCE4/rpool/ROOT/pve-1
2021-05-24.14:00:35 zfs destroy rpool/backups/SOURCE5/rpool/data/vm-100-disk-2@syncoid_pve_2021-05-24:13:51:04-GMT02:00
2021-05-24.14:00:35 zfs rollback -R rpool/backups/SOURCE7/rpool-hdd/data/vm-102-disk-0@2021-05-24-134500
2021-05-24.14:00:36 zfs receive -s -F rpool/backups/SOURCE6/rpool/data/vm-101-disk-0
2021-05-24.14:00:37 zfs destroy rpool/backups/SOURCE6/rpool/data/vm-101-disk-0@syncoid_pve_2021-05-24:13:51:14-GMT02:00
2021-05-24.14:00:37 zfs destroy rpool/backups/SOURCE4/rpool/ROOT/pve-1@syncoid_pve_2021-05-24:13:50:54-GMT02:00
2021-05-24.14:00:38 zfs rollback -R rpool/backups/SOURCE6/rpool/data/vm-102-disk-0@syncoid_pve_2021-05-24:13:51:24-GMT02:00
2021-05-24.14:00:39 zfs rollback -R rpool/backups/SOURCE3/rpool/ROOT/pve-1@2021-05-24-134500
>>>>May 24 14:00:40 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
2021-05-24.16:03:19 zpool import -N rpool
--


2021-05-26.03:29:07 zfs receive -s -F rpool/backups/SOURCE4/rpool/data/subvol-101-disk-2
2021-05-26.03:29:10 zfs rollback -R rpool/backups/SOURCE3/rpool/rpool-ssd-backup/subvol-120-disk-1@2021-05-26-030000
2021-05-26.03:29:10 zfs destroy rpool/backups/SOURCE4/rpool/data/subvol-101-disk-2@syncoid_pve_2021-05-26:03:14:09-GMT02:00
2021-05-26.03:29:11 zfs receive -s -F rpool/backups/SOURCE3/rpool/rpool-ssd-backup/subvol-120-disk-1
2021-05-26.03:29:13 zfs rollback -R rpool/backups/SOURCE4/rpool/data/subvol-102-disk-1@syncoid_pve_2021-05-26:03:14:20-GMT02:00
2021-05-26.03:29:14 zfs rollback -R rpool/backups/SOURCE8/rpool/ROOT/pve-1@syncoid_pve_2021-05-26:03:14:38-GMT02:00
2021-05-26.03:29:16 zfs rollback -R rpool/backups/SOURCE3/rpool/rpool-ssd-backup/subvol-120-disk-2@2021-05-26-030000
2021-05-26.03:29:20 zfs receive -s -F rpool/backups/SOURCE4/rpool/data/subvol-102-disk-1
2021-05-26.03:29:22 zfs destroy rpool/backups/SOURCE4/rpool/data/subvol-102-disk-1@syncoid_pve_2021-05-26:03:14:20-GMT02:00
>>>>May 26 03:29:24 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
2021-05-26.03:29:26 zfs receive -s -F rpool/backups/SOURCE8/rpool/ROOT/pve-1
2021-05-26.03:40:17 zpool import -N rpool
--


2021-05-31.18:25:37 zfs rollback -R rpool/backups/SOURCE4/rpool/data/subvol-112-disk-1@syncoid_pve_2021-05-31:18:11:48-GMT02:00
2021-05-31.18:25:38 zfs rollback -R rpool/backups/SOURCE2/rpool-ssd/data/subvol-100-disk-1@2021-05-31-180000
2021-05-31.18:25:42 zfs receive -s -F rpool/backups/SOURCE4/rpool/data/subvol-112-disk-1
2021-05-31.18:25:42 zfs receive -s -F rpool/backups/SOURCE8/rpool/data/subvol-106-disk-0
2021-05-31.18:25:43 zfs receive -s -F rpool/backups/SOURCE2/rpool-ssd/data/subvol-100-disk-1
2021-05-31.18:25:43 zfs destroy rpool/backups/SOURCE4/rpool/data/subvol-112-disk-1@syncoid_pve_2021-05-31:18:11:48-GMT02:00
2021-05-31.18:25:44 zfs destroy rpool/backups/SOURCE8/rpool/data/subvol-106-disk-0@syncoid_pve_2021-05-31:18:11:39-GMT02:00
>>>>May 31 18:25:44 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
2021-05-31.18:25:45 zfs rollback -R rpool/backups/SOURCE2/rpool-ssd/data/subvol-100-disk-2@2021-05-31-180000
2021-05-31.18:25:47 zfs rollback -R rpool/backups/SOURCE4/rpool/data/subvol-115-disk-1@syncoid_pve_2021-05-31:18:12:00-GMT02:00
2021-05-31.18:25:47 zfs rollback -R rpool/backups/SOURCE8/rpool/data/subvol-106-disk-1@syncoid_pve_2021-05-31:18:11:51-GMT02:00
2021-05-31.18:25:48 zfs receive -s -F rpool/backups/SOURCE2/rpool-ssd/data/subvol-100-disk-2
2021-05-31.18:25:49 zfs rollback -R rpool/backups/SOURCE2/rpool-ssd/data/subvol-101-disk-0@2021-05-31-180000
2021-05-31.18:25:49 zfs receive -s -F rpool/backups/SOURCE4/rpool/data/subvol-115-disk-1
2021-05-31.18:25:50 zfs receive -s -F rpool/backups/SOURCE2/rpool-ssd/data/subvol-101-disk-0
2021-05-31.18:25:51 zfs destroy rpool/backups/SOURCE4/rpool/data/subvol-115-disk-1@syncoid_pve_2021-05-31:18:12:00-GMT02:00
2021-05-31.18:25:51 zfs rollback -R rpool/backups/SOURCE2/rpool-ssd/data/subvol-101-disk-1@2021-05-31-180000
2021-05-31.18:25:52 zfs receive -s -F rpool/backups/SOURCE2/rpool-ssd/data/subvol-101-disk-1
2021-05-31.18:25:54 zfs rollback -R rpool/backups/SOURCE4/rpool/data/subvol-117-disk-0@syncoid_pve_2021-05-31:18:12:10-GMT02:00
2021-05-31.18:25:54 zfs receive -s -F rpool/backups/SOURCE8/rpool/data/subvol-106-disk-1
2021-05-31.18:25:55 zfs rollback -R rpool/backups/SOURCE2/rpool-ssd/data/subvol-101-disk-2@2021-05-31-180000
2021-05-31.18:26:04 zfs destroy rpool/backups/SOURCE8/rpool/data/subvol-106-disk-1@syncoid_pve_2021-05-31:18:11:51-GMT02:00
2021-05-31.18:26:05 zfs rollback -R rpool/backups/SOURCE8/rpool/data/subvol-106-disk-2@syncoid_pve_2021-05-31:18:12:03-GMT02:00
2021-05-31.18:26:09 zfs receive -s -F rpool/backups/SOURCE2/rpool-ssd/data/subvol-101-disk-2
2021-05-31.18:26:11 zfs rollback -R rpool/backups/SOURCE2/rpool-ssd/data/subvol-104-disk-0@2021-05-31-180000
2021-05-31.18:26:12 zfs receive -s -F rpool/backups/SOURCE2/rpool-ssd/data/subvol-104-disk-0
2021-05-31.18:37:02 zpool import -N rpool

--


2021-06-02.01:07:35 zfs destroy rpool/backups/SOURCE4/rpool/ROOT@syncoid_pve_2021-06-02:00:39:31-GMT02:00
2021-06-02.01:07:37 zfs receive -F rpool/backups/SOURCE5/rpool/data/subvol-101-disk-1
2021-06-02.01:07:37 zfs destroy rpool/backups/SOURCE5/rpool/data/subvol-101-disk-1@syncoid_pve_2021-06-02:00:39:37-GMT02:00
2021-06-02.01:07:39 zfs rollback -R rpool/backups/SOURCE4/rpool/ROOT/pve-1@syncoid_pve_2021-06-02:00:39:44-GMT02:00
2021-06-02.01:07:41 zfs receive -s -F rpool/backups/SOURCE7/rpool-hdd/data/subvol-103-disk-0
2021-06-02.01:07:42 zfs rollback -R rpool/backups/SOURCE7/rpool-hdd/data/vm-101-disk-0@2021-06-02-003000
2021-06-02.01:07:45 zfs rollback -R rpool/backups/SOURCE6/rpool/data/vm-100-disk-0@syncoid_pve_2021-06-02:00:40:18-GMT02:00
2021-06-02.01:07:45 zfs rollback -R rpool/backups/SOURCE5/rpool/data/vm-100-disk-1@syncoid_pve_2021-06-02:00:39:46-GMT02:00
>>>>Jun 02 01:07:46 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
2021-06-02.01:07:48 zfs receive -s -F rpool/backups/SOURCE7/rpool-hdd/data/vm-101-disk-0
2021-06-02.01:07:49 zfs rollback -R rpool/backups/SOURCE7/rpool-hdd/data/vm-102-disk-0@2021-06-02-003000
2021-06-02.01:21:17 zpool import -N rpool

Aren't there some commands that I could run after the exception, to gather more useful data?

@pquan
Copy link

pquan commented Jun 4, 2021

It seems that "reflowing" some of the problematic datasets on different mount point in the same pool has had some beneficial effects in my case. I mean just moving datasets with syncoid from rpool/mount1/dataset[x] -> rpool/mount2/dataset[x] and deleting some older snapshots and scrubbing the pool.
I'm moving said datasets back to their original position now to see if the fix is permanent.
This is all empiric work, but it would support the theory of some type of corruption occurring undetected. It does not seem to affect data tough, just metadata (I hope).

BTW: zfs send/receive is so painfully slow. 2x2 mirrors spinning disk often manage only 20-30MB on average, with lots of time spent under the 5-10MB region.

@rincebrain
Copy link
Contributor

rincebrain commented Jun 4, 2021

You might benefit, if you're not using it, from zfs send -c, since you're using gzip-9, and I would find it difficult to overstate how slow it is.

edit: I'm terribly sorry, I was thinking of someone else in the thread who is using gzip-9. My suggestion remains as stated, but assuming its properties work for you, you might want to explore using -w to avoid paying the de+re-encryption overhead.

Because, for example, I get much faster throughput than that on my single-disk pools sans encryption.

You might also want to use e.g. perf top and see if it is indeed spending a whole bunch of CPU cycles in the encryption code, or somewhere else.

@amotin
Copy link
Member

amotin commented Apr 25, 2024

I expect this to be already fixed by #16104 in master. It is likely too fresh for upcoming 2.2.4, but should appear in following 2.2.5 release.

@vaclavskala
Copy link
Contributor

vaclavskala commented Jun 24, 2024

With zfs 2.2.4 with patch #16104 applied I still get kernel panics with this stacktrace:

[Mon Jun 24 08:48:45 2024] VERIFY0(dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
[Mon Jun 24 08:48:45 2024] PANIC at dmu_recv.c:2093:receive_object()
[Mon Jun 24 08:48:45 2024] Showing stack for process 3446320
[Mon Jun 24 08:48:45 2024] CPU: 35 PID: 3446320 Comm: receive_writer Not tainted 6.1.94-vsh0zfs224 #3
[Mon Jun 24 08:48:45 2024] Hardware name: Supermicro X10DRi/X10DRi, BIOS 3.4a 08/16/2021
[Mon Jun 24 08:48:45 2024] Call Trace:
[Mon Jun 24 08:48:45 2024]  <TASK>
[Mon Jun 24 08:48:45 2024]  dump_stack_lvl+0x45/0x5e
[Mon Jun 24 08:48:45 2024]  spl_panic+0xd1/0xe9 [spl]
[Mon Jun 24 08:48:45 2024]  ? spl_kmem_cache_free+0x127/0x1d0 [spl]
[Mon Jun 24 08:48:45 2024]  ? mutex_lock+0xe/0x30
[Mon Jun 24 08:48:45 2024]  ? aggsum_add+0x173/0x190 [zfs]
[Mon Jun 24 08:48:45 2024]  ? dnode_evict_bonus+0x7d/0xa0 [zfs]
[Mon Jun 24 08:48:45 2024]  ? dbuf_rele_and_unlock+0x312/0x4d0 [zfs]
[Mon Jun 24 08:48:45 2024]  ? dnode_rele_and_unlock+0x59/0xf0 [zfs]
[Mon Jun 24 08:48:45 2024]  receive_object+0xae1/0xce0 [zfs]
[Mon Jun 24 08:48:45 2024]  ? dmu_object_next+0xe7/0x160 [zfs]
[Mon Jun 24 08:48:45 2024]  ? receive_freeobjects+0xa8/0x110 [zfs]
[Mon Jun 24 08:48:45 2024]  receive_writer_thread+0x313/0xb20 [zfs]
[Mon Jun 24 08:48:45 2024]  ? __slab_free+0x9e/0x2b0
[Mon Jun 24 08:48:45 2024]  ? set_next_task_fair+0x2d/0xd0
[Mon Jun 24 08:48:45 2024]  ? receive_process_write_record+0x2d0/0x2d0 [zfs]
[Mon Jun 24 08:48:45 2024]  ? spl_taskq_fini+0x80/0x80 [spl]
[Mon Jun 24 08:48:45 2024]  ? thread_generic_wrapper+0x56/0x70 [spl]
[Mon Jun 24 08:48:45 2024]  thread_generic_wrapper+0x56/0x70 [spl]
[Mon Jun 24 08:48:45 2024]  kthread+0xd6/0x100
[Mon Jun 24 08:48:45 2024]  ? kthread_complete_and_exit+0x20/0x20
[Mon Jun 24 08:48:45 2024]  ret_from_fork+0x1f/0x30
[Mon Jun 24 08:48:45 2024]  </TASK>

@amano-kenji
Copy link

Is this fixed with 2.2.6?

@scratchings
Copy link

I updated to 2.2.6 a few days ago, so haven't had enough time to be confident about this. I'd been running a patched earlier release with significant success, but had also stopped daisy-chained sending to a third host, so it's not clear to me if this is the main reason for the vastly improved stability of the system.
I'm currently starting the process of migrating my backups to a new platform which will be heavily stressing send/receive so we'll see how it goes.

@clhedrick
Copy link

clhedrick commented Sep 18, 2024 via email

@scratchings
Copy link

Indeed, but my box was running 2.2.4 with the proposed patches, so never tested 2.2.5 as released.

@scratchings
Copy link

Sorry to report that the machine running 2.2.6 spontaneously rebooted overnight. I can only assume a kernel panic, but as nothing has been logged I can't be certain that ZFS was the cause.

@amano-kenji
Copy link

Usually a kernel panic freezes a system and requires a manual reboot.

@rincebrain
Copy link
Contributor

rincebrain commented Sep 20, 2024

You can set a tunable to trigger triple fault on panic, though I don't think any distros do by default, to my knowledge.

...well, sorry, that was imprecise and inaccurate.

A panic will generally reboot the machine outright by default on most distros (possibly with a round trip through kexec+kdump). Oops and BUG_ON don't, out of the box.

@scratchings
Copy link

Indeed. I was surprised to see it stuck at the LUKS (boot drive) passphrase prompt with a clearly faulty network connection as Clevis hadn’t auto-unlocked it. It then needed a further reboot before it came up, so any lingering previous boot info was long gone before I could interact with it.

The system logs just stop with nothing in them.

kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.panic_on_warn = 0
kernel.panic_print = 0
kernel.watchdog = 1
kernel.watchdog_thresh = 10

So I suspect the watchdog kicked in.

@rincebrain
Copy link
Contributor

I'd suggest configuring kdump if you want to figure out more, possibly over the network if a kernel core dump saved on local disk is fraught for you, though perhaps just logging the output of when it broke would be sufficient for you.

@scratchings
Copy link

It definitely panicked a few minutes ago.

Message from syslogd@backup01 at Sep 23 10:00:16 ...
kernel:VERIFY0(dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)

Message from syslogd@backup01 at Sep 23 10:00:16 ...
kernel:PANIC at dmu_recv.c:2093:receive_object()
[duncan@backup01 ~]$ client_loop: send disconnect: Broken pipe

It had managed to internally (old to new pool) send/receive over 6TB of data over the weekend, a process that had completed. Looking at logs there were a couple of remote receives launched at the time of the crash - looks like I need to re-instate my Syncoid concurrency prevention wrapper.

Section from /var/log/messages attached. Although kdump is configured it didn't write anything to /var/crash :-(

zfs_crash.log

@scratchings
Copy link

Today's errant behaviour is a stuck 'zfs receive' - 100% CPU utilisation, no obvious progress (dataset size is static) and receive is unkillable.

@amano-kenji
Copy link

amano-kenji commented Sep 30, 2024

Until zfs native encrypted backups become stable, you can use restic to make encrypted remote snapshots of ZFS snapshots. Restic snapshots can be encrypted and incremental, and you can delete any restic snapshot without losing data. Restic has its own deduplicated blocks.

You can use https://zfs.rent or https://www.rsync.net/ with restic right now.

For now, LUKS is faster than zfs native encryption. Thus, LUKS, zfs, and restic are the best options now.

@amano-kenji
Copy link

amano-kenji commented Oct 2, 2024

@scratchings If you remove dust in your computer case, does the issue go away? Sometimes, dust in computer case causes errors in RAM or GPU.

My GPU used to freeze until I removed dust in my computer case.

@scratchings
Copy link

My host is server grade, ECC RAM (with no reported bit corrections) in a clean server room, and only a few months old, so I very much doubt this.

@amotin
Copy link
Member

amotin commented Oct 2, 2024

As I have written above, I believe the original panic of the report and its later reproductions should be fixed in 2.2.5 and up. Following report seems to be different, so lets close this and open new issue if needed

@amotin amotin closed this as completed Oct 2, 2024
@amotin
Copy link
Member

amotin commented Oct 2, 2024

Looking at logs there were a couple of remote receives launched at the time of the crash - looks like I need to re-instate my Syncoid concurrency prevention wrapper.

Section from /var/log/messages attached.

@scratchings The panic definitely happened during receive. What is that "Syncoid concurrency prevention wrapper" and what does it prevent and why was it needed? Any evidences that concurrent receives are bad? I suppose it does not try to concurrently receive the same dataset or something like that?

@scratchings
Copy link

It's a python script that uses lock files to ensure that only one syncoid can run on the instigating host, and also checks that syncoid isn't running on the sending host (to prevent crashes on that host when it's being used in a daisy-chain fashion).

I've reached the point with this particular host (which is replacing the original host that first suffered this issue, and has always had ZFS related crashes, no matter which version of ZFS it used) that I'm going to give up and convert it to LUKS + plain ZFS as I was having to send/receive onto a new JBOD anyway.

As someone mentioned historic pool migration, the pool I'm migrating from is from the 0.8 RC era - old enough to have originally suffered from Errata #4 (but the dataset in question was send/received to a fresh dataset to clear this years ago).

We will still have a few hosts running native encryption, one of which is running 2.2.6, and those that take snapshots regularly experience #12014. At least these hosts have never crashed, I've only ever seen this on the receiving side.

Even post LUKSing I'll still be running encryption for the month+ it will take to transfer the >400TB of data, so if I get further crashes I'll report them.

Perhaps of interest... When I got the first zfs process hang, it was a receive on the old pool (aes-256-ccm, lz4), there was a recursive receive in progress from that pool to the new pool (aes-256-gcm + recompression with zstd), this continued to operate for several more days (10+TB transferred and multiple datasets) until it too got a stuck zfs process. I guess if this is a race condition, then the higher performance of the gcm vs ccm encryption may be helping somewhat.

@amotin
Copy link
Member

amotin commented Oct 3, 2024

@scratchings You still haven't really answered my questions. While working on #16104 I've noticed that ZFS dbuf layer is generally unable to handle both encrypted and non-encrypted data for a dataset same time. It should not happen normally, since dataset should not be accessible until encrypted receive is fully completed and TXG is synced, but if we guess there are some races, then we may see all kinds of problems and I want to know about it. That is why I want to know what sorts of concurrency creates you a problem and what sort of serialization fixes it. It would be huge help if we could reproduce it manually somehow, but we are rarely that lucky.

Speaking of mentioned "recompression" as part of encrypted dataset replication, that would be possible only if the replication is unencrypted. Is that what we are talking about in general, or it was that only specific case? Unencrypted replication of encrypted dataset? Because it might be very different case from encrypted replication I was thinking about with very different dragons.

@scratchings
Copy link

This is a host that backs up several (ca 10) hosts over both local 10Gbit and remote 1Gbit SSH connections. The most stable environment has been when the host pulls backups from remote hosts such that it can ensure that only one receive is ever happening across the whole pool. I wrote a py script to wrap syncoid calls using lock files such that the cron tasks can launch on schedule and then wait until nothing has a lock on this file. We also pushed backups to a remote location (slower network) and check for syncoids running there (it's doing some backups of hosts local to it). It's this latter process that seems to have the biggest impact on stability. Our remote push destination failed so regularly that the common snapshots were lost and thus sends stopped. At this point the stability of the primary backup location improved markedly.

I've now embarked on a replacement of this primary backup host (new server hardware and new JBOD). As part of this I briefly (based on the optimism that 2.2.5 fixed the crashes) switched back to client instigated receives (i.e. not central control of when this might happen, cron tasks on the client launch syncoid in push mode) with no exclusivity locks as this obviously has benefits for ensuring the client is in a 'good' state with respect to service backups and means that other backups aren't held up in the case of a large number of blocks needing to be transferred. This made things go bad.

As to the 'recompression', this is by overriding the compression setting on receive, e.g. for the internal transfer to the new pool:

syncoid -r --no-sync-snap --mbuffer-size=1G --compress=none --no-privilege-elevation --recvoptions="u o compression=zstd" old_pool/folder new_pool/folder

When we see 'corrupt'snapshots on the client end this is 'discovered' during the send process, syncoid will report:

warning: cannot send 'pool/ds@autosnap_2024-10-03_10:30:01_hourly': Invalid argument cannot receive incremental stream: most recent snapshot of pool/ds does not match incremental source mbuffer: error: outputThread: error writing to <stdout> at offset 0x501d1a000: Broken pipe mbuffer: warning: error during output to <stdout>: Broken pipe CRITICAL ERROR: ssh -c [email protected] -S /tmp/[email protected] ' zfs send -I '"'"'pool/ds'"'"'@'"'"'autosnap_2024-09-01_00:30:01_monthly'"'"' '"'"'pool/ds'"'"'@'"'"'autosnap_2024-10-03_11:30:02_hourly'"'"' | lzop | mbuffer -q -s 128k -m 16M' | mbuffer -q -s 128k -m 16M | lzop -dfc | pv -p -t -e -r -b -s 45331110800 | zfs receive -u -o compression=zstd -s -F 'old_pool/host-name/ds' failed: 256 at /usr/sbin/syncoid line 585.

This is Red Hat Enterprise 9.

Happy to clarify further as required.

@vaclavskala
Copy link
Contributor

This is not hw problem. I managed to reproduce it on VM server.
Test setup was:
unencrypted pool1 with postgresql (and running pg_bench to generate changes), replicated using snapshots to encrypted pool2 and pool2 was sending snapshots (unencrypted, no raw send) to /dev/null.

Few times I got this error message about dmu_recv, but i can not reproduce it at every run. But it looks like it is caused when sending unencrypted snapshot from encrypted dataset, this save unencrypted data in memory and then receive new snapshot.

In production backup server this same situation happens when source server send snapshot to backup server while there was recent send of this dataset from this backup server to another backup server (to ballance disk usage across servers).

@scratchings
Copy link

In case it's significant. My receiving pool is quite full - 94% (has been as high as 98%) and ca 20% fragmentation (it's a 650TB pool, so even at 94% there's ca 40TB of free space).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Encryption "native encryption" feature Component: Send/Recv "zfs send/recv" feature Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests