
some copied files are still corrupted (chunks replaced by zeros) #15933

Closed
thulle opened this issue Feb 26, 2024 · 73 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@thulle

thulle commented Feb 26, 2024

System information

Type Version/Name
Distribution Name Gentoo
Distribution Version (rolling)
Kernel Version 6.7.6-gentoo-x86_64
Architecture amd64
OpenZFS Version 2.2.3

Describe the problem you're observing

After seeing 2.2.3 released and #15526 closed, I enabled block cloning again and tried my unfailing reproducer: compiling go.
It still seems to trigger the bug, or some related one.

Worth noting is that I'm running on top of LUKS. Scrub reports no issues.

Describe how to reproduce the problem

# echo 1 > /sys/module/zfs/parameters/zfs_bclone_enabled 

# emerge --buildpkg=n --usepkg=n --quiet-build=y dev-lang/go 
>>> Completed (1 of 1) dev-lang/go-1.22.0::gentoo

# file /usr/lib/go/pkg/tool/linux_amd64/*|grep data$
/usr/lib/go/pkg/tool/linux_amd64/asm:       data
/usr/lib/go/pkg/tool/linux_amd64/cgo:       data
/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/usr/lib/go/pkg/tool/linux_amd64/link:      data
/usr/lib/go/pkg/tool/linux_amd64/vet:       data

# hexdump -C /usr/lib/go/pkg/tool/linux_amd64/asm
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fa0  00 00 00 00 00 00 00 00  00 00 00 00 53 43 51 58  |............SCQX|
00000fb0  63 56 34 74 49 75 66 65  59 67 64 4a 32 54 54 4e  |cV4tIufeYgdJ2TTN|
00000fc0  2f 51 78 63 67 68 50 6b  43 33 67 48 31 54 66 64  |/QxcghPkC3gH1Tfd|
00000fd0  70 59 48 39 33 2f 43 36  32 31 50 69 70 58 45 71  |pYH93/C621PipXEq|
00000fe0  30 6f 5f 39 4c 57 65 71  49 4d 2f 48 32 52 64 56  |0o_9LWeqIM/H2RdV|
00000ff0  50 64 46 45 4c 77 6d 77  4a 37 42 31 51 33 47 00  |PdFELwmwJ7B1Q3G.|
00001000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
002346c0  53 43 51 58 63 56 34 74  49 75 66 65 59 67 64 4a  |SCQXcV4tIufeYgdJ|
002346d0  32 54 54 4e 2f 51 78 63  67 68 50 6b 43 33 67 48  |2TTN/QxcghPkC3gH|
002346e0  31 54 66 64 70 59 48 39  33 2f 43 36 32 31 50 69  |1TfdpYH93/C621Pi|
002346f0  70 58 45 71 30 6f 5f 39  4c 57 65 71 49 4d 2f 48  |pXEq0o_9LWeqIM/H|
00234700  32 52 64 56 50 64 46 45  4c 77 6d 77 4a 37 42 31  |2RdVPdFELwmwJ7B1|
00234710  51 33 47 00 00 00 00 00  00 00 00 00 00 00 00 00  |Q3G.............|
00234720  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
003a88d0  00 00 00 00 00 00 00 00  00 00 00 00 00           |.............|
003a88dd

# zfs -V
zfs-2.2.3-r0-gentoo
zfs-kmod-2.2.3-r0-gentoo

@thulle thulle added the Type: Defect Incorrect behavior (e.g. crash, hang) label Feb 26, 2024
@thesamesam
Contributor

cc @robn

@thulle
Author

thulle commented Feb 26, 2024

<sam_> thulle: also, it matters as to when the corruption occurs. is it when cp calls stuff in src_install via Go (is it corrupted already in the image dir?) or is it OK in the image dir, but corrupted when it gets merged to the livefs?

Ran the steps manually first without issue, then made a script that passed the first loop, but stopped on the 2nd loop:

#!/bin/bash
function checkfiles {
	if [ "`file ${1}/*|grep data$`" != "" ];then
		echo Corrupted files in ${1}:
		file ${1}/*|grep data$
		exit
	fi
}

cd /usr/portage/dev-lang/go
while :;do
	ebuild go-1.22.0.ebuild clean
	ebuild go-1.22.0.ebuild compile
	checkfiles /var/tmp/portage/dev-lang/go-1.22.0/work/go/pkg/tool/linux_amd64
	ebuild go-1.22.0.ebuild install
	checkfiles /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64
	ebuild go-1.22.0.ebuild merge
	checkfiles /usr/lib/go/pkg/tool/linux_amd64
done

On the 2nd loop I got:

Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data

@thulle
Author

thulle commented Feb 26, 2024

<PMT> neat. the next question i'd have, is if you have brt enabled, and set zfs_bclone_wait_dirty=1, does the issue also go away?

# echo 1 > /sys/module/zfs/parameters/zfs_bclone_wait_dirty

Started the script again, but it stalled and something seemed stuck writing at ~10MB/s to the nvme. Killed the script and saw this in syslog:

Feb 26 15:35:27 soma kernel: INFO: task kworker/u65:1:22874 blocked for more than 122 seconds.
Feb 26 15:35:27 soma kernel:       Tainted: P        W  OE   T  6.7.6-gentoo-x86_64 #2
Feb 26 15:35:27 soma kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 26 15:35:27 soma kernel: task:kworker/u65:1   state:D stack:0     pid:22874 tgid:22874 ppid:2      flags:0x00004000
Feb 26 15:35:27 soma kernel: Workqueue: writeback wb_workfn (flush-zfs-5)
Feb 26 15:35:27 soma kernel: Call Trace:
Feb 26 15:35:27 soma kernel:  <TASK>
Feb 26 15:35:27 soma kernel:  __schedule+0x277/0x680
Feb 26 15:35:27 soma kernel:  schedule+0x31/0xe0
Feb 26 15:35:27 soma kernel:  __cv_broadcast+0x154/0x190 [spl]
Feb 26 15:35:27 soma kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
Feb 26 15:35:27 soma kernel:  zfs_rangelock_exit+0x815/0x930 [zfs]
Feb 26 15:35:27 soma kernel:  zfs_putpage+0x128/0x7f0 [zfs]
Feb 26 15:35:27 soma kernel:  zio_do_crypt_abd+0x21fd/0x38f0 [zfs]
Feb 26 15:35:27 soma kernel:  write_cache_pages+0x13e/0x3b0
Feb 26 15:35:27 soma kernel:  ? zio_do_crypt_abd+0x21c0/0x38f0 [zfs]
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  zio_do_crypt_abd+0x1bb1/0x38f0 [zfs]
Feb 26 15:35:27 soma kernel:  do_writepages+0xce/0x1f0
Feb 26 15:35:27 soma kernel:  __writeback_single_inode+0x3d/0x290
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  writeback_sb_inodes+0x206/0x4e0
Feb 26 15:35:27 soma kernel:  __writeback_inodes_wb+0x4c/0xf0
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  wb_writeback+0x29e/0x310
Feb 26 15:35:27 soma kernel:  wb_do_writeback+0x238/0x2c0
Feb 26 15:35:27 soma kernel:  wb_workfn+0x5b/0x270
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  ? queue_delayed_work_on+0x8d/0xa0
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  process_one_work+0x173/0x340
Feb 26 15:35:27 soma kernel:  worker_thread+0x28b/0x3b0
Feb 26 15:35:27 soma kernel:  ? preempt_count_add+0x7a/0xb0
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  ? __pfx_worker_thread+0x10/0x10
Feb 26 15:35:27 soma kernel:  kthread+0xf6/0x130
Feb 26 15:35:27 soma kernel:  ? __pfx_kthread+0x10/0x10
Feb 26 15:35:27 soma kernel:  ret_from_fork+0x30/0x50
Feb 26 15:35:27 soma kernel:  ? __pfx_kthread+0x10/0x10
Feb 26 15:35:27 soma kernel:  ret_from_fork_asm+0x1b/0x30
Feb 26 15:35:27 soma kernel:  </TASK>

Checking the process list, I suspect this straggler, the only remaining child of the ebuild python process, is the culprit:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      9953 33.3  0.0  10136  2008 ?        RN   15:31  12:42 cp -R api bin doc lib pkg misc src test /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go

edit: As soon as I disabled zfs_bclone_wait_dirty, the write load disappeared and the cp process died.

@thulle
Author

thulle commented Feb 26, 2024

Rebooted and tried the same thing again, but not killing the script this time, instead just disabling zfs_bclone_wait_dirty.
Waited for the trace first:

Feb 26 16:37:51 soma kernel: INFO: task kworker/u66:2:2121 blocked for more than 122 seconds.
Feb 26 16:37:51 soma kernel:       Tainted: P        W  OE   T  6.7.6-gentoo-x86_64 #2
Feb 26 16:37:51 soma kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 26 16:37:51 soma kernel: task:kworker/u66:2   state:D stack:0     pid:2121  tgid:2121  ppid:2      flags:0x00004000
Feb 26 16:37:51 soma kernel: Workqueue: writeback wb_workfn (flush-zfs-5)
Feb 26 16:37:51 soma kernel: Call Trace:
Feb 26 16:37:51 soma kernel:  <TASK>
Feb 26 16:37:51 soma kernel:  __schedule+0x277/0x680
Feb 26 16:37:51 soma kernel:  schedule+0x31/0xe0
Feb 26 16:37:51 soma kernel:  __cv_broadcast+0x154/0x190 [spl]
Feb 26 16:37:51 soma kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
Feb 26 16:37:51 soma kernel:  zfs_rangelock_exit+0x815/0x930 [zfs]
Feb 26 16:37:51 soma kernel:  zfs_putpage+0x128/0x7f0 [zfs]
Feb 26 16:37:51 soma kernel:  zio_do_crypt_abd+0x21fd/0x38f0 [zfs]
Feb 26 16:37:51 soma kernel:  write_cache_pages+0x13e/0x3b0
Feb 26 16:37:51 soma kernel:  ? zio_do_crypt_abd+0x21c0/0x38f0 [zfs]
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  zio_do_crypt_abd+0x1bb1/0x38f0 [zfs]
Feb 26 16:37:51 soma kernel:  do_writepages+0xce/0x1f0
Feb 26 16:37:51 soma kernel:  __writeback_single_inode+0x3d/0x290
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  writeback_sb_inodes+0x206/0x4e0
Feb 26 16:37:51 soma kernel:  __writeback_inodes_wb+0x4c/0xf0
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  wb_writeback+0x29e/0x310
Feb 26 16:37:51 soma kernel:  wb_do_writeback+0x238/0x2c0
Feb 26 16:37:51 soma kernel:  wb_workfn+0x5b/0x270
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  ? try_to_wake_up+0x9d/0x690
Feb 26 16:37:51 soma kernel:  process_one_work+0x173/0x340
Feb 26 16:37:51 soma kernel:  worker_thread+0x28b/0x3b0
Feb 26 16:37:51 soma kernel:  ? __pfx_worker_thread+0x10/0x10
Feb 26 16:37:51 soma kernel:  kthread+0xf6/0x130
Feb 26 16:37:51 soma kernel:  ? __pfx_kthread+0x10/0x10
Feb 26 16:37:51 soma kernel:  ret_from_fork+0x30/0x50
Feb 26 16:37:51 soma kernel:  ? __pfx_kthread+0x10/0x10
Feb 26 16:37:51 soma kernel:  ret_from_fork_asm+0x1b/0x30
Feb 26 16:37:51 soma kernel:  </TASK>

Without killing the script, just disabling zfs_bclone_wait_dirty, it completes instead, with no corruption detected.
I let it loop like this another 6 times, surprisingly without issue, until I realized my CPU was limited to 3 GHz - the default limit my frequency script applies when it doesn't detect any of the listed processes (ones needing more speed, like portage) running.
After uncapping the CPU speed, it fails in the next loop:

Loop 8 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data

Two theories:

<thulle> tiny window that's only hit w higher frequencies?
<PMT> possible. might be something more indirect like the speed change causing things to briefly stall and skewing timing that way.

Stabilizing the core speeds a bit:

# cpupower frequency-set -g performance -d 4000000 -u 5200000

Loop 3 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data

Letting it vary more:

# cpupower frequency-set -g schedutil -d 550000 -u 5200000

Loop 2 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data

@thulle
Author

thulle commented Feb 26, 2024

<PMT> Please be sure to include the output of "zfs get all [DATASET THIS HAPPENS ON]"

NAME                     PROPERTY              VALUE                  SOURCE
kc3000/EPHEMERAL/usrtmp  type                  filesystem             -
kc3000/EPHEMERAL/usrtmp  creation              fre nov 24 16:51 2023  -
kc3000/EPHEMERAL/usrtmp  used                  7.31G                  -
kc3000/EPHEMERAL/usrtmp  available             449G                   -
kc3000/EPHEMERAL/usrtmp  referenced            7.31G                  -
kc3000/EPHEMERAL/usrtmp  compressratio         2.40x                  -
kc3000/EPHEMERAL/usrtmp  mounted               yes                    -
kc3000/EPHEMERAL/usrtmp  quota                 none                   default
kc3000/EPHEMERAL/usrtmp  reservation           none                   default
kc3000/EPHEMERAL/usrtmp  recordsize            128K                   default
kc3000/EPHEMERAL/usrtmp  mountpoint            /usr/tmp               local
kc3000/EPHEMERAL/usrtmp  sharenfs              off                    default
kc3000/EPHEMERAL/usrtmp  checksum              blake3                 inherited from kc3000
kc3000/EPHEMERAL/usrtmp  compression           zstd-3                 inherited from kc3000
kc3000/EPHEMERAL/usrtmp  atime                 on                     default
kc3000/EPHEMERAL/usrtmp  devices               on                     default
kc3000/EPHEMERAL/usrtmp  exec                  on                     default
kc3000/EPHEMERAL/usrtmp  setuid                on                     default
kc3000/EPHEMERAL/usrtmp  readonly              off                    default
kc3000/EPHEMERAL/usrtmp  zoned                 off                    default
kc3000/EPHEMERAL/usrtmp  snapdir               hidden                 default
kc3000/EPHEMERAL/usrtmp  aclmode               discard                default
kc3000/EPHEMERAL/usrtmp  aclinherit            restricted             default
kc3000/EPHEMERAL/usrtmp  createtxg             15                     -
kc3000/EPHEMERAL/usrtmp  canmount              on                     default
kc3000/EPHEMERAL/usrtmp  xattr                 sa                     inherited from kc3000
kc3000/EPHEMERAL/usrtmp  copies                1                      default
kc3000/EPHEMERAL/usrtmp  version               5                      -
kc3000/EPHEMERAL/usrtmp  utf8only              on                     -
kc3000/EPHEMERAL/usrtmp  normalization         formD                  -
kc3000/EPHEMERAL/usrtmp  casesensitivity       sensitive              -
kc3000/EPHEMERAL/usrtmp  vscan                 off                    default
kc3000/EPHEMERAL/usrtmp  nbmand                off                    default
kc3000/EPHEMERAL/usrtmp  sharesmb              off                    default
kc3000/EPHEMERAL/usrtmp  refquota              none                   default
kc3000/EPHEMERAL/usrtmp  refreservation        none                   default
kc3000/EPHEMERAL/usrtmp  guid                  10378968521923463382   -
kc3000/EPHEMERAL/usrtmp  primarycache          all                    default
kc3000/EPHEMERAL/usrtmp  secondarycache        all                    default
kc3000/EPHEMERAL/usrtmp  usedbysnapshots       0B                     -
kc3000/EPHEMERAL/usrtmp  usedbydataset         7.31G                  -
kc3000/EPHEMERAL/usrtmp  usedbychildren        0B                     -
kc3000/EPHEMERAL/usrtmp  usedbyrefreservation  0B                     -
kc3000/EPHEMERAL/usrtmp  logbias               latency                default
kc3000/EPHEMERAL/usrtmp  objsetid              398                    -
kc3000/EPHEMERAL/usrtmp  dedup                 off                    default
kc3000/EPHEMERAL/usrtmp  mlslabel              none                   default
kc3000/EPHEMERAL/usrtmp  sync                  standard               default
kc3000/EPHEMERAL/usrtmp  dnodesize             legacy                 default
kc3000/EPHEMERAL/usrtmp  refcompressratio      2.40x                  -
kc3000/EPHEMERAL/usrtmp  written               7.31G                  -
kc3000/EPHEMERAL/usrtmp  logicalused           13.9G                  -
kc3000/EPHEMERAL/usrtmp  logicalreferenced     13.9G                  -
kc3000/EPHEMERAL/usrtmp  volmode               default                default
kc3000/EPHEMERAL/usrtmp  filesystem_limit      none                   default
kc3000/EPHEMERAL/usrtmp  snapshot_limit        none                   default
kc3000/EPHEMERAL/usrtmp  filesystem_count      none                   default
kc3000/EPHEMERAL/usrtmp  snapshot_count        none                   default
kc3000/EPHEMERAL/usrtmp  snapdev               hidden                 default
kc3000/EPHEMERAL/usrtmp  acltype               posix                  inherited from kc3000
kc3000/EPHEMERAL/usrtmp  context               none                   default
kc3000/EPHEMERAL/usrtmp  fscontext             none                   default
kc3000/EPHEMERAL/usrtmp  defcontext            none                   default
kc3000/EPHEMERAL/usrtmp  rootcontext           none                   default
kc3000/EPHEMERAL/usrtmp  relatime              on                     inherited from kc3000
kc3000/EPHEMERAL/usrtmp  redundant_metadata    all                    default
kc3000/EPHEMERAL/usrtmp  overlay               on                     default
kc3000/EPHEMERAL/usrtmp  encryption            off                    default
kc3000/EPHEMERAL/usrtmp  keylocation           none                   default
kc3000/EPHEMERAL/usrtmp  keyformat             none                   default
kc3000/EPHEMERAL/usrtmp  pbkdf2iters           0                      default
kc3000/EPHEMERAL/usrtmp  special_small_blocks  0                      default

@thulle
Author

thulle commented Feb 26, 2024

<PMT> the one other experiment would be to try zfs_dmu_offset_next_sync
<PMT> it defaults to 1, if you change it to 0, does this still mangle your bits
<PMT> in a case where it would have mangled your bits otherwise, e.g. changing clock speeds

# echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

Loop 2 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data

@rincebrain
Contributor

Yeah this reproduces really easily on 6.7.6 and 2.2.3 with a dataset with all default settings and the tunable flipped to enable BRT.

Gross.

@AllKind
Contributor

AllKind commented Feb 26, 2024

How about #15588 - I wonder if that changes the outcome?

@rincebrain
Contributor

I tried, and no.

@AllKind
Contributor

AllKind commented Feb 26, 2024

Thanks for trying - and damn!

@DocJr90

DocJr90 commented Mar 13, 2024

Yeah this reproduces really easily on 6.7.6 and 2.2.3 with a dataset with all default settings and the tunable flipped to enable BRT.

Gross.

"Really easily" as in "just as easily as before"?

@numinit
Contributor

numinit commented Mar 13, 2024

Rebased and running this all night... numinit/nixpkgs@c0712aa

Kept the block size at 16k (rather than increasing it to 256k) and added more logging:

zfs_2_2_3_stock # [zhammer::1266] Uname: Linux zfs223stock 6.7.9 #1-NixOS SMP PREEMPT_DYNAMIC Wed Mar  6 14:54:01 UTC 2024 x86_64 GNU/Linux
zfs_2_2_3_stock # [zhammer::1266] Cmdline: initrd=\EFI\nixos\zi31gj6mb1lmp3w92wxl0n93s36kszq9-initrd-linux-6.7.9-initrd.efi init=/nix/store/gs2xw1yis46qx221h1jy0z4jv7z8w5sd-nixos-syst
em-zfs_2_2_3_stock-test/init console=ttyS0 panic=1 boot.panic_on_fail clock=acpi_pm zfs.zfs_bclone_enabled=1 nohibernate loglevel=7 net.ifnames=0
zfs_2_2_3_stock # [zhammer::1266]   - zfs_bclone_enabled: 1
zfs_2_2_3_stock # [zhammer::1266]   - zfs_bclone_wait_dirty: 0
zfs_2_2_3_stock # [zhammer::1266]   - zfs_dmu_offset_next_sync: 1
zfs_2_2_3_stock # [zhammer::1266] ZFS userspace: zfs-2.2.3-1
zfs_2_2_3_stock # [zhammer::1266] ZFS kernel: zfs-kmod-2.2.3-1
zfs_2_2_3_stock # [zhammer::1266] Module: /run/booted-system/kernel-modules/lib/modules/6.7.9/extra/zfs.ko.xz
zfs_2_2_3_stock # [zhammer::1266] Srcversion: 62EA6EC3953A8524022765E
zfs_2_2_3_stock # [zhammer::1266] SHA256: 9f3dc6469b41a73ab0cf0c6c76c7ddce6e9ca0b1bc4da64f0e09206ee5d70f2b
zfs_2_2_3_stock # [zhammer::1266] ===
zfs_2_2_3_stock # [zhammer::1266] Work dir: /test
zfs_2_2_3_stock # [zhammer::1266] Count: 10000000 files
zfs_2_2_3_stock # [zhammer::1266] Block size: 16k
zfs_2_2_3_stock # [zhammer::1266] Check every: 5000 files
zfs_2_2_3_stock # [zhammer::1266] writing 5000 files at iteration 0

I'll update in case anything happens, but it didn't repro immediately, which is a good sign.

@rrevans
Contributor

rrevans commented Mar 13, 2024

@thulle Are you able to reproduce this with zfs_bclone_enabled=0?

@rrevans
Contributor

rrevans commented Mar 14, 2024

See #15994 for a possible explanation for cp stuckness when zfs_bclone_wait_dirty=1.

@numinit
Contributor

numinit commented Mar 14, 2024

10,000,000 iterations of 16k blocks on each of 8 cores, and... no trouble.

(finished: must succeed: parallel --lb --halt-on-error now,fail=1 zhammer /test 10000000 16k 5000 ::: $(seq $(nproc)), in 81147.21 seconds)
(finished: run the VM test script, in 81156.74 seconds)
test script finished in 81156.79s
cleanup
kill machine (pid 9)
qemu-kvm: terminating on signal 15 from pid 6 (/nix/store/3v2ch16fkl50i85n05h5ckss8pxx6836-python3-3.11.8/bin/python3.11)
(finished: cleanup, in 0.08 seconds)
kill vlan (pid 7)
/nix/store/vxxi7mfvynzqpyavjszgqmjd5ni0l1xh-vm-test-run-zfs-zfs_2_2_3_stock

Is this another problem?

@rincebrain
Contributor

Yes.

@numinit
Contributor

numinit commented Mar 14, 2024

So, not racing holes; probably very specifically pointing toward something else related to block cloning, or at least a second-order issue thereof, then. Maybe something ARC related?

@rrevans
Contributor

rrevans commented Mar 15, 2024

I think this is zpl_clone_file_range_impl not flushing mapped pages before attempting to perform the clone.

The file is miscopied if it has been synced out once, pages are dirty, and the kernel has not flushed any of those pages yet. (See zfs_holey_common for another code path that needs to flush dirty pages before examining the file.)

Here is a trivial reproducer:

#!/bin/bash
echo 1 >/sys/module/zfs/parameters/zfs_bclone_enabled
dd if=/dev/zero of=x bs=1M count=1 status=none
zpool sync
python -c "
from mmap import mmap
with open('x', 'r+b') as fh:
	with mmap(fh.fileno(), 0) as m:
		m[4000:5000] = b'z' * 1000
"
cp --reflink=always x y
diff -q x y

Output:

Files x and y differ

What's happening here?

  1. The zpool sync flushes out the file and its block pointers
  2. Some pages get written with mmap
  3. Reflink-copy clones the zero blocks from (1) not the pages written in (2)

This happens because none of the pages have been written back yet, so the file looks clean to the clone code. But the contents are the last-synced rather than the latest mmap'ed state. (Also note that merely reading the file without using mmap or clone will flush dirty pages hiding the bug.)

The golang build is a reproducer because it builds the toolchain multiple times in the work directory and overwrites the contents with mmap. This reliably triggers the sequence above.

The script above succeeds with no differences for zfs_bclone_enabled=0 or cp --reflink=never.

@rincebrain
Contributor

rincebrain commented Mar 15, 2024

That tracks with what I privately found, which was that turning the conditional check for cached pages triggering a flush in zfs_holey_common into an unconditional flush made this bug vanish.

My theory when I last looked at this was that because we didn't hold the range lock until after we did this check, there was a nasty window for us to have passed that check, not trigger the flush, and then go bang. But attempting to move that check inside the range lock (in addition to causing fun with nested locks sometimes) seemed to still break where unconditional flushing did not, suggesting that there was more to it.

Then life got busy for the past few days and I haven't looked again since.

Another interesting question is whether this reproduces on FreeBSD. My theory for the issue I was observing (in my testing, at least, everything past a certain point ended up zeroed where it shouldn't have been) was that the file size was changing between when we read it and used it for the check, or between the check and when we actually did the probe, resulting in comedy. But since FreeBSD's mapping of the check doesn't actually use the range arguments, it might not have that problem, if that was indeed the problem.

Anyway, if someone wants to just slap a FIXME on this, I imagine you could do something like turning

        /* Flush any mmap()'d data to disk */
        if (zn_has_cached_data(zp, 0, file_sz - 1))
                zn_flush_cached_data(zp, B_FALSE);

        lr = zfs_rangelock_enter(&zp->z_rangelock, 0, UINT64_MAX, RL_READER);
        error = dmu_offset_next(ZTOZSB(zp)->z_os, zp->z_id, hole, &noff);
        zfs_rangelock_exit(lr);

into

        /* Flush any mmap()'d data to disk */
        if (zn_has_cached_data(zp, 0, file_sz - 1) || (zfs_brt_force_flush))
                zn_flush_cached_data(zp, B_FALSE);

        lr = zfs_rangelock_enter(&zp->z_rangelock, 0, UINT64_MAX, RL_READER);
        error = dmu_offset_next(ZTOZSB(zp)->z_os, zp->z_id, hole, &noff);
        zfs_rangelock_exit(lr);

with a new tunable plumbed in, and call it a day for now. But this feels like the real problem is that a clone should trigger that flush before it attempts the clone, rather than relying on a flush happening before anything that might reference the data touches it later. (cf. #15772).

So similar to how zfs_clone_range added:

                error = dmu_brt_clone(outos, outzp->z_id, outoff, size, tx,
                    bps, nbps);
                if (error != 0) {
                        dmu_tx_commit(tx);
                        break;
                }

                if (zn_has_cached_data(outzp, outoff, outoff + size - 1)) {
                        update_pages(outzp, outoff, size, outos);
                }

after the clone, we probably want something like (untested):

                if (zn_has_cached_data(inzp, inoff, inoff + size - 1)) {
                        update_pages(inzp, inoff, size, inos);
                }

                nbps = maxblocks;
                last_synced_txg = spa_last_synced_txg(dmu_objset_spa(inos));
                error = dmu_read_l0_bps(inos, inzp->z_id, inoff, size, bps,
                    &nbps);


@rincebrain
Contributor

$ for i in `seq 1 50`;do sudo bash repro.sh;done;
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
^C
$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_brt_fix_this_mess
1
$ for i in `seq 1 50`;do sudo bash repro.sh;done;
$

Where that tunable is an if guard around the above patch.

@rincebrain
Contributor

rincebrain commented Mar 15, 2024

Don't try it, probably.

This ameliorates the reproducer script listed, but trying the original report, of emerge go on Gentoo, produces mangled results more often with that proposed change, fascinatingly.

Wonder if what I'm seeing is this fix turning this test case into #15994.

@rincebrain
Contributor

rincebrain commented Mar 15, 2024

Nah, just me rushing in trying things.

What I actually wanted was:

                if (zn_has_cached_data(inzp, inoff, inoff + size - 1)) {
                        zn_flush_cached_data(inzp, B_FALSE);
                }

                nbps = maxblocks;
                last_synced_txg = spa_last_synced_txg(dmu_objset_spa(inos));
                error = dmu_read_l0_bps(inos, inzp->z_id, inoff, size, bps,
                    &nbps);

But I didn't think carefully about what update_pages did. I wanted to flush cached pages, not rewrite them with the buffer's contents, that's the bad place.

...and if I do that in there, I deadlock on the rangelock. So let's move that around...

@rrevans
Contributor

rrevans commented Mar 15, 2024

rrevans/zfs@ae5efbe fixes both emerge go (using @thulle's step-at-a-time testing script) and the reproducer script for me. I've run both in a loop for 30 minutes with no failures so far.

The dirty page flush happens before the rangelock in zfs_holey_common, and I'm just copying that approach.

I am not sure this is correct or sufficient or if there are other bclone/mmap gaps. (And, it flushes the whole file when it could maybe flush only the cloned range.) That said, it does pass ZTS sanity.run and bclone tests for me.
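
For anyone following along, the shape of that change is roughly the following (a hypothetical sketch rather than the literal commit; the names mirror the zfs_holey_common fragment quoted earlier in this thread):

        /*
         * Sketch only: before zfs_clone_range() takes its range locks,
         * push any mmap()'d pages of the source file into the DMU so
         * dmu_read_l0_bps() sees the latest page-cache contents instead
         * of the last-synced block pointers. Flushing the whole file
         * (0..z_size) is the simple-but-coarse option mentioned above.
         */
        if (zn_has_cached_data(inzp, 0, inzp->z_size - 1))
                zn_flush_cached_data(inzp, B_FALSE);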

@rincebrain
Contributor

That doesn't fix it, for me, as I already said above.

@rincebrain
Contributor

rincebrain commented Mar 15, 2024

Sorry, to be less terse, I wound up with rincebrain@e55d3b0 because you can't flush that under the rangelock without deadlocking sometimes, so there's going to be a window where it might change between the flush and the lock again.

So I added a case for the check inside the lock as well, told it to retry once to hopefully catch more of the trivial cases of it being dirtied with inconvenient timing infrequently, and then fall back to not reflinking in the event that it's still dirty.

I confirmed that it sometimes did still reflink with that code (since I was slightly worried I might be missing some detail of how it worked that implied the check would always be true), and then pushed it to see whether the CI blew up.

It can probably be improved to avoid that window, but similarly, I could not produce incorrect behavior after hammering it for some time, and anything simpler still failed some of the time.

(Not, admittedly, very often, but "one in 50 runs" is still "not zero times".)

(Also, to be clear, I'm redoing my tests with your patch to see if my memory is faulty of having tried that, because I would certainly prefer a simpler solution...)

@amotin
Member

amotin commented Mar 15, 2024

I'll need to look closer at the area, but just a first thought: if it is impossible to reliably flush the caches, we could abort cloning in the zn_has_cached_data() case. I.e. we could flush the caches before taking the lock, but if somebody is still writing the mmap()'ed region when we get the lock, then give up completely. Though since we'd need to commit the txg before we'd be able to clone the file, we might just give up from the beginning and not bother even flushing the cache, leaving that to the fall-back copy routine.
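
For illustration, the "give up" variant could look something like this inside zfs_clone_range() (hypothetical sketch; the errno chosen to mean "fall back to a regular copy" is an assumption, not something decided in this thread):

        /*
         * Sketch only: if the source range still has cached mmap() pages
         * once the locks are held, refuse to clone and let the caller fall
         * back to the ordinary copy path, which reads through the page
         * cache and therefore sees the mmap'ed data.
         */
        if (zn_has_cached_data(inzp, inoff, inoff + size - 1)) {
                error = SET_ERROR(EAGAIN);      /* assumed fallback errno */
                break;
        }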

@rrevans
Contributor

rrevans commented Mar 15, 2024

@amotin +1

Looking more at this, I'm suspecting background writeback is a confounding factor (and then in turn also system load).

Staring at zn_flush_cached_data and zfs_putpage, if the pages are already in writeback by the time this is called, then it won't wait if sync == B_FALSE. Then it's possible it returns before any pages get copied to the DMU/ARC state, and the file state looks clean. (@rincebrain's version of the patch uses sync == B_TRUE which is more reliable vs. background WB.)

But note the fallback copy routine might seek holes and end up calling zn_flush_cached_data with sync == FALSE which could in turn have the same problem? Maybe this explains why unconditional flushing in zfs_holey_common has some effect on the outcome?

p.s. I don't think it's correct to use rangelocks to coordinate with mmap writes since page locking is entirely separate and mmap writes are inherently racy. IOW if there are racing page writes there's no way to guarantee any outcome anyway, right? This is true even for filesystems that use the pagecache directly. That said, copying should always reliably observe page writes guaranteed to be dirtied before clone starts (e.g. the build tool process wait is done before clone begins).
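
For reference, my understanding of what those macros expand to on the Linux side (paraphrased from memory, so treat it as an assumption rather than a quote of the tree):

/* Assumed Linux definitions, paraphrased: */
#define zn_has_cached_data(zp, start, end) \
        filemap_range_has_page(ZTOI(zp)->i_mapping, start, end)
#define zn_flush_cached_data(zp, sync)  write_inode_now(ZTOI(zp), sync)

/*
 * write_inode_now() only uses WB_SYNC_ALL and waits for pages already
 * under writeback when sync is true; with sync == B_FALSE it issues
 * WB_SYNC_NONE writeback and can return before those pages reach the
 * DMU, which is the gap described above.
 */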

@rrevans
Contributor

rrevans commented Mar 17, 2024

@rincebrain As I currently understand, the problem for non-BRT is that sometimes mmap page dirtiness doesn't reach the DMU before dnode_is_dirty. IIRC generic_file_llseek implements "the whole file is data" based on the inode size which so far seems correct. Based on the fact ZFS has no special handling for SEEK_END it should work to disable hole seeks.

That said, that approach is a regression for sparse copies without mmap, and I'm not sure we need to break that case.

I'm thoroughly convinced that the original fix works for the original problem and its root cause, though it's still got theoretical data races which I think are unreachable in practice. I've been studying this space and all the code ever since in pursuit of an implementation that doesn't need to force txg syncs because I think ZFS shouldn't have limits in this regard (and I have working code that indeed does that and passes ZTS, ztest, and my own stress tests).

While #16000 is super conservative I'd vote for adding code that keeps llseek functional EXCEPT for files with mappings (dirty or not) which is an easy test to add. Maybe leave in the option to fully disable as cheap advice for users in case this analysis is wrong somehow...?
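
As a sketch of that "easy test", something like the following near the top of zfs_holey_common (hypothetical and Linux-flavored; mapping_mapped() is my stand-in for "the file has mappings, dirty or not"):

        /*
         * Sketch only: if the file is mmap()'d at all, skip fancy hole
         * reporting and give the same answer generic_file_llseek would,
         * i.e. treat the whole file as data.
         */
        if (mapping_mapped(ZTOI(zp)->i_mapping)) {
                if (cmd == F_SEEK_HOLE)
                        *off = file_sz;
                return (0);
        }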

@rincebrain
Contributor

Given the history, I'm really in favor of "super conservative" at the moment. I'm not opposed to other improvements, obviously, but given that distros tend to ship releases and take forever to cherrypick fixes from the future if they won't update (if they do it at all), I'm really in favor of safety models that don't require assuming people run a newer version if we're wrong about it being safe.

Also, I have a similar document detailing the history, with commits, so it's interesting to read someone else's deep dive, thank you. A little curious why you prepared it, but.

@rrevans
Contributor

rrevans commented Mar 17, 2024

@Gendra13 can you please try with 20 or 200 files instead of 2000 on 2.1.5 / 5.15.0 without bclone but with zfs_dmu_offset_next_sync=0?

Occasionally when developing the repro I saw smaller numbers of files being more effective for some cases including this one if memory serves.

@rincebrain
Contributor

rincebrain commented Mar 17, 2024

Breaks on 5.10.0-27 with zfs_dmu_offset_next_sync=0 and bclone_enabled=0.

Files x.1109 and x.1109.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1116 and x.1116.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1839 and x.1839.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1874 and x.1874.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1615 and x.1615.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1961 and x.1961.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)

e: Were you actually asking about 2.1.5, or no? Because bclone only went in on 2.2, so that setting wouldn't exist there...

@rincebrain
Contributor

Same applies to 2.1.5 stock with either dmu_offset_next_sync.
1:

Files x.1816 and x.1816.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)

0:

Files x.1958 and x.1958.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.639 and x.639.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1261 and x.1261.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1885 and x.1885.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)

@Gendra13

@Gendra13 can you please try with 20 or 200 files instead of 2000 on 2.1.5 / 5.15.0 without bclone but with zfs_dmu_offset_next_sync=0?

Occasionally when developing the repro I saw smaller numbers of files being more effective for some cases including this one if memory serves.

That was weird. When reducing the files down to 200 (or 20) I couldn't trigger the bug at all, even with zfs_dmu_offset_next_sync=1.
I tested different numbers with zfs_dmu_offset_next_sync=1 and found 400 files to be the sweet spot where the script failed immediately.

Then I repeated the test with zfs_dmu_offset_next_sync=0 and indeed, it also failed but took a few minutes of running.
ZFS-Version: zfs-kmod-2.1.15-1
Uname -r : 5.15.0-100-generic
cp --version: cp (GNU coreutils) 9.4

Files x.346 and x.346.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.341 and x.341.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)

@rrevans
Contributor

rrevans commented Mar 18, 2024

Well, falling back to generic_file_llseek still breaks in the BRT case, as @rrevans said above, but we appear to no longer break in the BRT-disabled case for that, suggesting at least two issues at hand, unless it's just so rare now that I've not hit it yet.

I don't think the problems at hand are possible without fancy llseek or block clone.

If we attempt a direct read, then the cached mapping pages are used as the source of truth of the file state instead of the dirty buffers in the DMU/ARC. This happens as long as cp is forced to actually read the file or have copy_file_range do it indirectly (which also DTRT - eventually the kernel calls splice_read which becomes zpl_iter_read which ultimately calls zfs_read; that in turn does a read of the pagecache for mapped ranges).

Thus, assuming those paths work, skipping fancy llseek and reflink is always going to be correct and avoid any dependence on writeback or flushing dirty pages to the DMU.

@rrevans what do you think about shorting out to that for the SEEK_DATA case, assuming Linux's llseek has correct visibility into state, and encouraging people to leave bclone disabled for now? Can you think of any cases that doesn't cover?

No, I can't think of any other cases. Fancy llseek getting it wrong is the proximate cause of all known problems pre-BRT. It's the same problem as #15526 but triggered by mmap instead of racing syncs. Edit: I'm also convinced Linux has the correct state already because the other seek flavors work, especially SEEK_END (see also O_APPEND maybe).

In short, while at least some older releases are affected, I'm convinced that triggering the writeback flavor of this bug for zfs_bclone_enabled=0 requires approximately all of:

  1. recent cp tool that is sparse aware (plain reads do not trigger this)
  2. a sparse file (heuristics will likely skip fancy llseek in dense files)
  3. mmap used to make some changes to that file (plain writes do not trigger this)
  4. copying the file soon after changing it (before normal writeback happens)

Note that for golang it seems to rewrite the outputs during build, so the files are briefly sparse during the build process but this is enough to create a window for (2): stat reports on-disk block usage from the prior moment the file was sparse if it happens to be flushed as part of a normal txg sync during the window where it is sparse. (I think this explains why golang repros are really hard pre-BRT: in addition to hitting a writeback race the system also needs to hit a txg flush in the small window that the output file is sparse.)

@rrevans
Contributor

rrevans commented Mar 19, 2024

Okay... kernel tracing to the rescue. TL;DR zfs_putpage fails on pushback from dsl_dir_tempreserve_space, and nothing in Linux retries the writeback after this happens (in part because the error is not propagated to the kernel).

I ran my repro with the zn_flush_cached_data(B_TRUE) patch in both zfs_holey_common and zfs_clone_range. Then I installed bpftrace probes focused on a specific DMU object to try and figure out why mmap flush is broken.

On flush, the kernel calls zpl_writepages which calls write_cache_pages twice and that in turn calls zfs_putpage for each dirty page. I added a trace for each page and found that some of the blocks that were mismatching were ones being written in the second write_cache_pages call.

After a bit more poking, I gathered stacks for end_page_writeback and noticed that the traces included stacks with write_cache_pages which shouldn't happen as the normal call happens during zil_commit callbacks:

@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zfs_putpage+698
    zpl_putpage+86
    zpl_putfolio+14
    write_cache_pages+322
    bpf_trampoline_6442532424+141
    write_cache_pages+9
    zpl_writepages+274
    do_writepages+207
    __writeback_single_inode+61
    writeback_single_inode+175
    write_inode_now+118
    zfs_clone_range+1247
    zpl_clone_file_range_impl+250
    zpl_remap_file_range+93
    do_clone_file_range+261
    vfs_clone_file_range+62
    ioctl_file_clone+73
    do_vfs_ioctl+1752
    __x64_sys_ioctl+114
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+110
]: 4
@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zfs_putpage+698
    zpl_putpage+86
    zpl_putfolio+14
    write_cache_pages+322
    zpl_writepages+149
    do_writepages+207
    __writeback_single_inode+61
    writeback_single_inode+175
    write_inode_now+118
    zfs_clone_range+1247
    zpl_clone_file_range_impl+250
    zpl_remap_file_range+93
    do_clone_file_range+261
    vfs_clone_file_range+62
    ioctl_file_clone+73
    do_vfs_ioctl+1752
    __x64_sys_ioctl+114
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+110
]: 5
@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zfs_putpage+698
    zpl_putpage+86
    zpl_putfolio+14
    write_cache_pages+322
    bpf_trampoline_6442532424+141
    write_cache_pages+9
    zpl_writepages+274
    do_writepages+207
    __writeback_single_inode+61
    writeback_single_inode+175
    write_inode_now+118
    icp_init+45
    write_inode_now+9
    zfs_clone_range+1247
    zpl_clone_file_range_impl+250
    zpl_remap_file_range+93
    do_clone_file_range+261
    vfs_clone_file_range+62
    ioctl_file_clone+73
    do_vfs_ioctl+1752
    __x64_sys_ioctl+114
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+110
]: 13
@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zfs_putpage+698
    zpl_putpage+86
    zpl_putfolio+14
    write_cache_pages+322
    bpf_trampoline_6442532424+141
    write_cache_pages+9
    zpl_writepages+149
    do_writepages+207
    __writeback_single_inode+61
    writeback_single_inode+175
    write_inode_now+118
    icp_init+45
    write_inode_now+9
    zfs_clone_range+1247
    zpl_clone_file_range_impl+250
    zpl_remap_file_range+93
    do_clone_file_range+261
    vfs_clone_file_range+62
    ioctl_file_clone+73
    do_vfs_ioctl+1752
    __x64_sys_ioctl+114
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+110
]: 22
@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zil_itx_destroy+174
    zil_lwb_flush_vdevs_done+212
    zio_done+975
    zio_execute+243
    taskq_thread+510
    kthread+232
    ret_from_fork+52
    ret_from_fork_asm+27
]: 992
@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zil_itx_destroy+174
    zil_itxg_clean+149
    taskq_thread+510
    kthread+232
    ret_from_fork+52
    ret_from_fork_asm+27
]: 2660

generated from

kfunc:end_page_writeback
{
	if (args.page->mapping->host->i_ino == 3337) {
		printf("end %d\n", args.page->index);
		@epw[kstack()] = count();
	}
}

invoked as (from ZFS src root):

# bpftrace --include zfs_config.h -I include/ -I include/spl -I include/os/linux/spl -I include/os/linux/zfs -I include/os/linux/kernel trace_mmap.bt

The only place that zfs_putpage ends writeback early is when it hands the page back due to dmu_tx_assign failing.

Sure enough, if I trace the dmu_tx_abort call in that branch of zfs_putpage it is indeed getting hit:

kprobe:zfs_putpage
{
        $ip = (struct inode*)arg0;
        if ($ip->i_ino == 3337) {
                @put[tid] = 1;
        }
}
kretprobe:zfs_putpage /@put[tid] != 0/
{
        delete(@put[tid]);
}
kprobe:dmu_tx_abort /@put[tid] != 0/
{
        $tx = (struct dmu_tx*)arg0;
        printf("abort %d %d %d\n", $tx->tx_err, $tx->tx_needassign_txh, $tx->tx_wait_dirty);
        print(kstack());
}

And a bit more poking I find that dsl_dir_tempreserve_space is returning ERESTART which is checked at the bottom of dmu_tx_assign. Tracing also confirms zfs_putpage is returning ERESTART.

kretprobe:dsl_dir_tempreserve_space /@put[tid] != 0 && retval != 0/
{
        printf("reserve %d\n", retval);
}

The final failing sequence prints something like:

writepages 3337 1

        bpf_prog_6097e4b9567b95c5_kfunc_vmlinux_write_cache_pages+405
        bpf_prog_6097e4b9567b95c5_kfunc_vmlinux_write_cache_pages+405
        bpf_trampoline_6442532424+98
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

put 112
end 112
put 260
end 260
put 522
reserve 85
abort 0 0 0

        dmu_tx_abort+5
        zfs_putpage+647
        zpl_putpage+86
        zpl_putfolio+14
        write_cache_pages+322
        bpf_trampoline_6442532424+141
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

end 522
put 523
end 523
put 524
reserve 85
abort 0 0 0

        dmu_tx_abort+5
        zfs_putpage+647
        zpl_putpage+86
        zpl_putfolio+14
        write_cache_pages+322
        bpf_trampoline_6442532424+141
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

end 524
put 586
end 586
put 992
end 992
put 993
reserve 85
abort 0 0 0

        dmu_tx_abort+5
        zfs_putpage+647
        zpl_putpage+86
        zpl_putfolio+14
        write_cache_pages+322
        bpf_trampoline_6442532424+141
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

end 993
put 994
reserve 85
abort 0 0 0

        dmu_tx_abort+5
        zfs_putpage+647
        zpl_putpage+86
        zpl_putfolio+14
        write_cache_pages+322
        bpf_trampoline_6442532424+141
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

end 994
put 995
reserve 85
abort 0 0 0

        dmu_tx_abort+5
        zfs_putpage+647
        zpl_putpage+86
        zpl_putfolio+14
        write_cache_pages+322
        bpf_trampoline_6442532424+141
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

end 995
put 996
end 996
ret 0
filemap_fdatawait
done write_inode_now 0

My repro has new instrumentation to dump out blocks that don't match, and I'm running that with set -x and piping thru to grep x.364 to filter messages for the focus object.

This outputs something like this immediately after the traces above:

+ write x.364
+ truncate --size=4M x.364
' x.364
++ cp --debug x.364 x.364.2
+ cpout=''\''x.364'\'' -> '\''x.364.2'\''
++ diff -q x.364 x.364.2
+ diff='Files x.364 and x.364.2 differ'
++ echo ''\''x.364'\'' -> '\''x.364.2'\''
+ echo 'Files x.364 and x.364.2 differ' '(copy offload: unknown, reflink: yes, sparse detection: unknown)'
Files x.364 and x.364.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
' x.364 x.364.2
x.364 block 993 differs
x.364 block 994 differs
x.364 block 995 differs

All of blocks 993, 994, and 995 failed putpage in the trace. Some others did too, but they were probably zeros that didn't change from the last write to the mapping in the repro loop.

The new instrumentation:

# write, copy, compare, and print an error on mismatch
write() {
	truncate --size=4M "${f?}"
	python3 -c "${tool}" "${f?}" &>/dev/null
	cpout=$(cp --debug "${f?}" "${f?}.2" 2>&1)
	diff=$(diff -q "${f?}" "${f?}.2")
	if (($?)); then
		echo "$diff" "($(echo "$cpout" | tail -1))"
		python -c "
import sys
def blocks(f):
	while True:
		b = f.read(1<<12)
		if b:
			yield b
		else:
			break
with open(sys.argv[1], 'rb') as f1, open(sys.argv[2], 'rb') as f2:
	for n, (b1, b2) in enumerate(zip(blocks(f1), blocks(f2))):
		if b1 != b2:
			print(sys.argv[1], 'block', n, 'differs')
" "${f?}" "${f?}.2"
	fi
}

TL;DR With more tracing, I confirmed all this by comparing the mismatched blocks reported in the repro to those that failed to write during the second call to write_cache_pages (the one which uses WB_SYNC_ALL, which is the one that matters). Skipping the writeback due to tx pressure causes the data to be missed later.

Meanwhile, Linux only retries failed writeback pages for wbc->sync_mode == WB_SYNC_ALL && err == -ENOMEM, which is implemented in do_writepages. Other errors break out of the loop (though all pages are written once), and the error is returned to the caller. Not that this error path does anything as written: ZFS doesn't return errors from zpl_writefolio or zpl_writepage, nor does ZFS check that write_inode_now actually succeeds in either code path.

Note this is happening block-by-block, with the first write_cache_pages call writing most of the data and the second call writing out the rest. When both calls hit tx pressure the data goes unwritten and bclone fails. The general pattern I'm seeing in traces is that a few individual blocks fail and then dmu_tx_wait waits out the pressure before more are attempted. As a result, only individual blocks get skipped, and this probably explains why llseek is immune when sync=TRUE: every DMU record is >1 page, so the next page succeeds with high probability, some block in the record is thus dirtied, and fancy llseek will notice the dirty data and DTRT because the record is dirty even though the DMU is missing some writebacks.

A simple fix, I guess, is to plumb the error up and loop on flush until it succeeds if it returns ERESTART.
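
As a rough illustration of that idea (hypothetical and simplified - it assumes the zfs_putpage error actually gets plumbed back out of the writeback call, which as noted above is not the case today):

        /*
         * Sketch only: keep flushing until writeback stops failing with
         * ERESTART-style tx pressure, so no dirty page is silently skipped
         * before the clone (or hole seek) inspects the DMU state.
         */
        for (;;) {
                error = zn_flush_cached_data(inzp, B_TRUE);
                if (error != ERESTART)
                        break;
                /* wait for the current txg to sync out the pressure */
                txg_wait_synced(dmu_objset_pool(inos), 0);
        }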

Edit: See trace_mmap.bt for my full bpftrace tooling.

@rrevans
Contributor

rrevans commented Mar 20, 2024

rrevans/zfs@c7100fc is my attempt at a simple retry loop to fix writeback gaps.

I'm a few minutes into testing, and so far I am unable to trigger failures with any of:

  • zfs_bclone_enabled = 0, zfs_dmu_offset_next_sync = 1
  • zfs_bclone_enabled = 0, zfs_dmu_offset_next_sync = 0
  • zfs_bclone_enabled = 1, zfs_bclone_wait_dirty = 0, zfs_dmu_offset_next_sync = 1
  • zfs_bclone_enabled = 1, zfs_bclone_wait_dirty = 1, zfs_dmu_offset_next_sync = 1

Write throughput is lower due to the synchronous ZIL commits. zil_commit_count from /proc/spl/kstat/zfs/test/objset-0x36 shows 5k ZIL commits per second with bclone and 2k files.

I would appreciate any feedback or attempts at reproducing with this patch (especially any glaring problems caused by synchronous flushing).

EDIT: I've run the repro loop and the golang emerge test script for 6 hours without reproducing any corruption. During this time I had a tuning script randomly setting the above module parameters to different values every 90 seconds. ZTS also passes running locally.

@IvanVolosyuk

@rrevans Stupid question, will the fix still work with sync=disabled?

@rincebrain
Contributor

sync=disabled ignores you requesting a sync, not syncing in general

@rrevans
Contributor

rrevans commented Mar 20, 2024

@IvanVolosyuk Yes, and I've confirmed it works.

@ixhamza
Contributor

ixhamza commented Mar 20, 2024

@rrevans - Kudos on your outstanding investigation. While I haven't been able to reproduce the corruption running your script overnight, it's worth noting that I got several instances of the following stack trace. Not sure if these are related to your patch though. I ran with zfs_bclone_enabled = 1, zfs_bclone_wait_dirty = 0, zfs_dmu_offset_next_sync = 1:

[ 3648.422271] INFO: task sync:72175 blocked for more than 120 seconds.
[ 3648.422340]       Tainted: P           OE      6.5.0-26-generic #26-Ubuntu
[ 3648.422366] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3648.422390] task:sync            state:D stack:0     pid:72175 ppid:62726  flags:0x00000002
[ 3648.422431] Call Trace:
[ 3648.422438]  <TASK>
[ 3648.422448]  __schedule+0x2cb/0x760
[ 3648.422575]  ? __pfx_sync_inodes_one_sb+0x10/0x10
[ 3648.422589]  schedule+0x63/0x110
[ 3648.422597]  wb_wait_for_completion+0x89/0xc0
[ 3648.422608]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 3648.422620]  sync_inodes_sb+0xd6/0x2c0
[ 3648.422632]  ? __pfx_sync_inodes_one_sb+0x10/0x10
[ 3648.422638]  sync_inodes_one_sb+0x1b/0x30
[ 3648.422644]  iterate_supers+0x7a/0x100
[ 3648.422726]  ksys_sync+0x42/0xb0
[ 3648.422736]  __do_sys_sync+0xe/0x20
[ 3648.422743]  do_syscall_64+0x59/0x90
[ 3648.422752]  ? irqentry_exit_to_user_mode+0x17/0x20
[ 3648.422764]  ? irqentry_exit+0x43/0x50
[ 3648.422775]  ? exc_page_fault+0x94/0x1b0
[ 3648.422787]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 3648.422800] RIP: 0033:0x7d3e85325c8b
[ 3648.422840] RSP: 002b:00007ffd3ba8efb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[ 3648.422849] RAX: ffffffffffffffda RBX: 00007ffd3ba8f1a8 RCX: 00007d3e85325c8b
[ 3648.422853] RDX: 00007d3e85405a01 RSI: 00007ffd3ba90ee8 RDI: 00007d3e853c1b94
[ 3648.422958] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 3648.422964] R10: 000062f1a1d5d19e R11: 0000000000000246 R12: 00007ffd3ba90681
[ 3648.422967] R13: 00007d3e853fe428 R14: 000062f1a1d5eb80 R15: 000062f1a1d5d034
[ 3648.422975]  </TASK>

@rrevans
Copy link
Contributor

rrevans commented Mar 20, 2024

@ixhamza Thanks I appreciate the feedback

This message means the userland thread that called the sync system call is taking a long time to finish. The kernel notices after a while and flags it.

It's somewhat expected as the repro script is spamming a lot of load, but I suspect that there may be a separate bug causing sync to livelock (block on ongoing writes vs. triggering a single pass of writeback work).

If you stop the script, does the sync process eventually terminate? It would be a real concern if the thread had become stuck forever. Note that lots and lots of I/O may have to flush first - you can look at the Dirty and Writeback page counts in /proc/meminfo. These should decrease as writeback proceeds after interrupting the script.
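
(For example, a trivial helper that prints just those two counters - nothing more than grepping /proc/meminfo by hand:)

#include <stdio.h>
#include <string.h>

/* Print the Dirty: and Writeback: lines from /proc/meminfo. */
int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (f == NULL)
                return 1;
        while (fgets(line, sizeof (line), f) != NULL) {
                if (strncmp(line, "Dirty:", 6) == 0 ||
                    strncmp(line, "Writeback:", 10) == 0)
                        fputs(line, stdout);
        }
        fclose(f);
        return 0;
}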

Investigation of the sync call is on my to-do list. I've seen other unusual behaviors, like sync(3) blocking on the background txg sync instead of triggering one, and I don't understand enough there yet to know what is going on.

@ixhamza
Copy link
Contributor

ixhamza commented Mar 20, 2024

@rrevans - I just ran the script again for about 12 minutes and noticed that only one sync process was able to spawn; it wasn't able to finish during the script's lifetime. After interrupting the script with Ctrl + C, it took around 20 seconds for the script to terminate along with the sync process. The Dirty and Writeback pages kept dropping back to zero during the run, and both reset to zero again after termination.

@amotin
Copy link
Member

amotin commented Mar 20, 2024

rrevans/zfs@c7100fc is my attempt at a simple retry loop to fix writeback gaps.

@rrevans Addition of error handling looks right to me, unless the kernel expects errors to be silently ignored and is supposed to silently reissue the writes later by itself.

What makes me really wonder, though, is why we even use TXG_NOWAIT with dmu_tx_assign() in the Linux zfs_putpage(). I see Illumos switched to TXG_WAIT 10 years ago: https://www.illumos.org/issues/4347 , FreeBSD did it 7 years ago: freebsd/freebsd-src@16b46572fa997 , so why does Linux still use TXG_NOWAIT? My first guess was that the context where it is called is not sleepable, but I see there:

        err = dmu_tx_assign(tx, TXG_NOWAIT);
        if (err != 0) {
                if (err == ERESTART)
                        dmu_tx_wait(tx);

, where dmu_tx_wait() does sleep, even though only once and without retrying. It makes no sense to me. Could we switch it to TXG_WAIT like FreeBSD and call it a day?
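
For comparison, the TXG_WAIT form I'm suggesting would look roughly like this (a sketch of the idea, not an actual patch):

        err = dmu_tx_assign(tx, TXG_WAIT);
        if (err != 0) {
                /*
                 * With TXG_WAIT, dmu_tx_assign() itself sleeps in
                 * dmu_tx_wait() and retries on ERESTART, so only hard
                 * errors (e.g. ENOSPC or EDQUOT) reach this point.
                 */
                dmu_tx_abort(tx);
                /* redirty the page and report the error upward */
        }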

@rrevans
Copy link
Contributor

rrevans commented Mar 20, 2024

@amotin I agree it makes no sense, and TXG_WAIT is better. See rrevans/zfs@5270f15. I am running the repro tests and it looks OK so far.

As for how we got here: 3c0e5c0f45 removed the loop but left the wait behind. I don't know why TXG_WAIT was not used at that time, since the code still sleeps either way.

@behlendorf
Copy link
Contributor

@rrevans nice work running this down!

Could we switch it to TXG_WAIT similar to FreeBSD and call it a day?

For the most part I think so. It looks like when the original Illumos change was ported, zfs_putpage() was intentionally not updated to use TXG_WAIT. The comment indicates the concern was that this function is reachable via direct memory reclaim, so we'd prefer not to get stuck waiting on a sync. It'd potentially be better to redirty the page, allow direct reclaim to free memory elsewhere, and let the VFS try again a few moments later to write out the page when conditions may have improved.

    zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version.  This
    case may be a contention point just like zfs_write(), however it is not
    safe to block here since it may be called during memory reclaim.

But as @rrevans pointed out here's where things went wrong:

Meanwhile, Linux only retries failed writeback pages when wbc->sync_mode == WB_SYNC_ALL && err == -ENOMEM; this is implemented in do_writepages.

Over the years the kernel has significantly changed exactly how this page writeback is done and how retries are handled. Unfortunately, not so much that it obviously broke things. When this change was originally ported from Illumos, the Linux 3.10 kernel was hot off the presses.

I'd also suggest we switch to TXG_WAIT to help keep things simple. Adding the missing error propagation also looks like a good idea. Looking at some very old kernels it seems they largely ignored the return value, but that's no longer the case.
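
To make the propagation piece concrete, it is mostly a matter of not dropping the result on the floor - something along these lines (illustrative shapes only, not the exact OpenZFS signatures):

static int
zpl_putpage_sketch(struct page *pp, struct writeback_control *wbc, void *data)
{
        int err;

        /* zfs_putpage() writes or schedules the page; keep its result. */
        err = zfs_putpage(pp->mapping->host, pp, wbc);

        /*
         * Return the failure to the ->writepage()/->writepages() callers
         * (errno sign conversion for the VFS elided in this sketch).
         */
        return (err);
}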

rrevans added a commit to rrevans/zfs that referenced this issue Mar 22, 2024
1) Make mmap flushes synchronous. Linux may skip flushing dirty pages
   already in writeback unless data-integrity sync is requested.

2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be
   skipped due to DMU pushing back on TX assign.

3) Add missing mmap flush when doing block cloning.

4) While here, pass errors from putpage to writepage/writepages.

This change fixes corruption edge cases, but unfortunately adds
synchronous ZIL flushes for dirty mmap pages to llseek and bclone
operations. It may be possible to avoid these sync writes later
but would need more tricky refactoring of the writeback code.

Fixes: openzfs#15933

Signed-off-by: Robert Evans <[email protected]>
@rrevans
Copy link
Contributor

rrevans commented Mar 22, 2024

#16019 fixes this with synchronous writeback and TXG_WAIT (same as my previous test patches).

I have some ideas how to improve the writeback code to avoid synchronous writes, but they need more invasive refactoring of how writeback works.

rrevans added a commit to rrevans/zfs that referenced this issue Mar 28, 2024
1) Make mmap flushes synchronous. Linux may skip flushing dirty pages
   already in writeback unless data-integrity sync is requested.

2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be
   skipped due to DMU pushing back on TX assign.

3) Add missing mmap flush when doing block cloning.

4) While here, pass errors from putpage to writepage/writepages.

This change fixes corruption edge cases, but unfortunately adds
synchronous ZIL flushes for dirty mmap pages to llseek and bclone
operations. It may be possible to avoid these sync writes later
but would need more tricky refactoring of the writeback code.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Robert Evans <[email protected]>
Closes openzfs#15933
Closes openzfs#16019
behlendorf pushed a commit that referenced this issue Mar 30, 2024
1) Make mmap flushes synchronous. Linux may skip flushing dirty pages
   already in writeback unless data-integrity sync is requested.

2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be
   skipped due to DMU pushing back on TX assign.

3) Add missing mmap flush when doing block cloning.

4) While here, pass errors from putpage to writepage/writepages.

This change fixes corruption edge cases, but unfortunately adds
synchronous ZIL flushes for dirty mmap pages to llseek and bclone
operations. It may be possible to avoid these sync writes later
but would need more tricky refactoring of the writeback code.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Robert Evans <[email protected]>
Closes #15933 
Closes #16019
gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue May 3, 2024
gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue May 3, 2024
@behlendorf
Copy link
Contributor

@thulle if possible would you mind confirming your compiling go reproducer is unable to reproduce this with 2.2.4 when zfs_bclone_enabled=1. All of the reported BRT fixes have been included in 2.2.4 but it sure wouldn't hurt to get some more testing before we consider enabling it by default.

@lopiuh
Copy link

lopiuh commented Jul 8, 2024

@thulle if possible would you mind confirming your compiling go reproducer is unable to reproduce this with 2.2.4 when zfs_bclone_enabled=1. All of the reported BRT fixes have been included in 2.2.4 but it sure wouldn't hurt to get some more testing before we consider enabling it by default.

Is that still open or done?

Yours lopiuh

lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Sep 4, 2024
1) Make mmap flushes synchronous. Linux may skip flushing dirty pages
   already in writeback unless data-integrity sync is requested.

2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be
   skipped due to DMU pushing back on TX assign.

3) Add missing mmap flush when doing block cloning.

4) While here, pass errors from putpage to writepage/writepages.

This change fixes corruption edge cases, but unfortunately adds
synchronous ZIL flushes for dirty mmap pages to llseek and bclone
operations. It may be possible to avoid these sync writes later
but would need more tricky refactoring of the writeback code.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Robert Evans <[email protected]>
Closes openzfs#15933 
Closes openzfs#16019