Describe the problem you're observing

I've been able to reproduce a data corruption failure in functional/raidz/raidz_expand_001_pos on very slow systems. The test basically does:

1. Create a raidz[1,2,3] pool with 5 disks.
2. Add a 6th disk to test raidz expansion.
3. Artificially pause the raidz expansion operation.
4. Zero out the maximum number of disks the raidz group can lose without data loss (1, 2, or 3, matching the parity level).
5. Scrub the pool.
6. Check that there are no errors.
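For reference, here is a minimal command-level sketch of those steps for the parity=2 case, using file-backed vdevs. The paths, sizes, and tunable value are illustrative, not the exact ones the test script uses:

```sh
# Back the pool with sparse files (paths/sizes illustrative).
truncate -s 256M /var/tmp/dev-{0..5}

# Step 1: create a raidz2 pool with 5 disks (the test also writes
# test data into several filesystems at this point).
zpool create testpool raidz2 /var/tmp/dev-{0..4}

# Step 3 (set before the attach so the expansion pauses almost
# immediately): cap how many bytes the reflow may copy.
echo 8388608 > /sys/module/zfs/parameters/raidz_expand_max_reflow_bytes

# Step 2: attach a 6th disk to the raidz vdev to start the expansion.
zpool attach testpool raidz2-0 /var/tmp/dev-5

# Step 4: zero out as many disks as the parity allows (2 for raidz2),
# skipping 4M at the head and tail so the vdev labels survive.
for d in /var/tmp/dev-0 /var/tmp/dev-1; do
	dd if=/dev/zero of="$d" bs=1M seek=4 count=248 conv=notrunc
done

# Steps 5 and 6: scrub, then verify everything was reconstructed.
zpool scrub -w testpool
zpool status -v testpool
```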
The test is failing at step 6 because there are data errors:
```
12:20:09.47  ERROR: check_pool_status testpool errors No known data errors exited 1
12:20:10.98    pool: testpool
12:20:10.98   state: ONLINE
12:20:10.98  status: One or more devices has experienced an error resulting in data
12:20:10.98          corruption. Applications may be affected.
12:20:10.98  action: Restore the file in question if possible. Otherwise restore the
12:20:10.98          entire pool from backup.
12:20:10.98     see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
12:20:10.98    scan: scrub repaired 37.7M in 00:00:38 with 1 errors on Wed Aug 7 12:20:09 2024
12:20:10.98  expand: expansion of raidz2-0 in progress since Wed Aug 7 12:18:13 2024
12:20:10.98          62.9M / 129M copied at 550K/s, 48.80% done, 00:02:02 to go
12:20:10.98  config:
12:20:10.98
12:20:10.98          NAME                STATE     READ WRITE CKSUM
12:20:10.98          testpool            ONLINE       0     0     0
12:20:10.98            raidz2-0          ONLINE       0     0     0
12:20:10.98              /var/tmp/dev-0  ONLINE       0     0     8
12:20:10.98              /var/tmp/dev-1  ONLINE       0     0     6
12:20:10.98              /var/tmp/dev-2  ONLINE       0     0 38.0K
12:20:10.98              /var/tmp/dev-3  ONLINE       0     0 37.4K
12:20:10.98              /var/tmp/dev-4  ONLINE       0     0     8
12:20:10.98              /var/tmp/dev-5  ONLINE       0     0     8
12:20:10.98              /var/tmp/dev-6  ONLINE       0     0     6
12:20:10.98
12:20:10.98  errors: 1 data errors, use '-v' for a list    <<<<< SHOULD NOT SEE ERRORS HERE
```
Here's the dbgmsg snippet where it can't reconstruct; vdev_raidz_combrec() simulates every combination of up to parity-many failed columns, and none of them reconstructs to a valid checksum:
```
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2053:vdev_raidz_reconstruct_row(): reconstruct(rm=ffff91cfdd9f3000 col=4 devid=6 offset=30d3c00 error=0)
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2053:vdev_raidz_reconstruct_row(): reconstruct(rm=ffff91cfdd9f3000 col=5 devid=0 offset=30d3e00 error=0)
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2912:raidz_reconstruct(): raidz_reconstruct_expanded(row=3)
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 0 devidx 4
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 2 devidx 0
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 5 devidx 3
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2970:raidz_reconstruct(): reconstruction not possible; too many failures
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:3182:vdev_raidz_combrec(): reconstruction failed for all num_failures
12:20:09.87 1723058408 ffff91cf62aed200 dsl_scan.c:4535:dsl_scan_sync(): scan issued 9150 blocks for testpool (14 segs) in 35014ms (avg_block_size = 15269, avg_seg_size = 9995190)
12:20:09.87 1723058408 ffff91cf62aed200 dsl_scan.c:822:dsl_scan_sync_state(): finish scan checkpoint for testpool
12:20:09.87 1723058408 ffff91cf62aed200 metaslab.c:3984:metaslab_flush(): flushing: txg 119, spa testpool, vdev_id 0, ms_id 2, unflushed_allocs 58368, unflushed_frees 53760, appended 112 bytes
12:20:09.87 1723058409 ffff91cf62aed200 dsl_scan.c:4427:dsl_scan_sync(): finish scan clearing for testpool
12:20:09.87 1723058409 ffff91cf62aed200 dsl_scan.c:4546:dsl_scan_sync(): scan issuing complete txg 120 for testpool
12:20:09.87 1723058409 ffff91cf62aed200 spa_history.c:306:spa_history_log_sync(): txg 120 scan done errors=2
```
Another example, with zpool status -vs:
```
13:11:11.15    pool: testpool
13:11:11.15   state: ONLINE
13:11:11.15  status: One or more devices has experienced an error resulting in data
13:11:11.15          corruption. Applications may be affected.
13:11:11.15  action: Restore the file in question if possible. Otherwise restore the
13:11:11.15          entire pool from backup.
13:11:11.15     see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
13:11:11.15    scan: scrub repaired 37.2M in 00:00:38 with 1 errors on Wed Aug 7 13:11:09 2024
13:11:11.15  expand: expansion of raidz2-0 in progress since Wed Aug 7 13:09:05 2024
13:11:11.15          73.0M / 129M copied at 593K/s, 56.57% done, 00:01:36 to go
13:11:11.15  config:
13:11:11.15
13:11:11.15          NAME                STATE     READ WRITE CKSUM  SLOW
13:11:11.15          testpool            ONLINE       0     0     0     -
13:11:11.15            raidz2-0          ONLINE       0     0     0     -
13:11:11.15              /var/tmp/dev-0  ONLINE       0     0     6     0
13:11:11.15              /var/tmp/dev-1  ONLINE       0     0     6     0
13:11:11.15              /var/tmp/dev-2  ONLINE       0     0 37.5K     0
13:11:11.15              /var/tmp/dev-3  ONLINE       0     0 37.0K     0
13:11:11.15              /var/tmp/dev-4  ONLINE       0     0     8     0
13:11:11.15              /var/tmp/dev-5  ONLINE       0     0     6     0
13:11:11.15              /var/tmp/dev-6  ONLINE       0     0     6     0
13:11:11.15
13:11:11.15  errors: Permanent errors have been detected in the following files:
13:11:11.15
13:11:11.15          /testpool/fs3/1/30
```
Describe how to reproduce the problem
We were hitting this pretty frequently while testing the new "run ZTS on a QEMU VM on a GitHub runner" PR (#15838). Those runners have fewer CPU resources than the buildbot ones, and that slowness seems to be what triggers the bug. We were able to hit it there on basically every distro: AlmaLinux 8-9, Fedora 39-40, Ubuntu 20, 22, and 23.
I was able to reproduce this on a local Fedora 40 VM with 2 vCPUs and 8GB of RAM. On the host, I throttled the QEMU process down to about 10% CPU with:
```
sudo cpulimit -l 10 -p [QEMU PID HERE]
```
Even then it didn't hit it every time. Just to give you an idea:
```
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:55] [FAIL]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [01:32] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:18] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [04:58] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [04:09] [FAIL]
```
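To iterate on it, I just ran the one test in a loop, along the lines of the sketch below (this assumes an in-tree build; the -t path may need adjusting for your checkout):

```sh
# Keep re-running the single test until it fails.
while ./scripts/zfs-tests.sh -v \
    -t tests/functional/raidz/raidz_expand_001_pos; do
	echo "passed, trying again"
done
```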
This raidz_expand_001_pos failure is our last major blocker before merging #15838.
Note that the test randomly chooses the parity level for each run, so the runs that fail could have used a different parity than the ones that succeed. parity=2 looks common to the failures I have observed.
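If you want to rule that variable out while reproducing, one option is to pin the parity inside the test script instead of letting it pick randomly; hypothetically, something like this (the actual variable name in raidz_expand_001_pos.ksh may differ):

```sh
# In raidz_expand_001_pos.ksh, replace the random parity selection
# (hypothetical variable name) with a fixed value:
#   nparity=$((RANDOM % 3 + 1))
nparity=2
```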