ZTS: raidz_expand_001_pos data errors on slow systems #16421

Open
tonyhutter opened this issue Aug 7, 2024 · 2 comments
Labels: Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

tonyhutter commented Aug 7, 2024

System information

Type                  Version/Name
Distribution Name     Fedora
Distribution Version  40
Kernel Version        6.9.9-200
Architecture          x86_64
OpenZFS Version       master (bd949b1)

Describe the problem you're observing

I've been able to reproduce a data corruption failure in functional/raidz/raidz_expand_001_pos on very slow systems. The test basically does the following (a rough sketch of the equivalent commands follows the list):

  1. Create a raidz[1,2,3] pool with 5 disks.
  2. Add a 6th disk to test raidz expansion.
  3. Artificially pause the raidz expansion operation.
  4. Zero out the maximum number of disks possible (1, 2, or 3, matching the parity level) without data loss.
  5. Scrub the pool.
  6. Check that there are no errors.
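
For illustration, here's roughly what that flow looks like as commands. This is a hedged sketch, not the actual test script: the 192M file size, the 4M label-skip offset, and the use of the raidz_expand_max_reflow_bytes module tunable to pause the reflow are my assumptions about how the test behaves (parity 2 shown).

    # Create a raidz2 pool from five file-backed vdevs (the test picks
    # parity 1, 2, or 3 at random; raidz2 shown here).
    truncate -s 192M /var/tmp/dev-{0..5}
    zpool create testpool raidz2 /var/tmp/dev-0 /var/tmp/dev-1 \
        /var/tmp/dev-2 /var/tmp/dev-3 /var/tmp/dev-4

    # ...write test data into the pool...

    # Assumed tunable: cap how many bytes the reflow may copy, so the
    # expansion pauses almost immediately after it starts.
    echo 8388608 > /sys/module/zfs/parameters/raidz_expand_max_reflow_bytes

    # Attach a 6th disk to the raidz vdev to kick off the expansion.
    zpool attach testpool raidz2-0 /var/tmp/dev-5

    # Zero out parity-many disks (two for raidz2), skipping the first and
    # last 4M of each file so the vdev labels survive.
    dd if=/dev/zero of=/var/tmp/dev-0 bs=1M seek=4 count=184 conv=notrunc
    dd if=/dev/zero of=/var/tmp/dev-1 bs=1M seek=4 count=184 conv=notrunc

    # Scrub, wait for it to finish, then verify there are no data errors.
    zpool scrub -w testpool
    zpool status -v testpool | grep "No known data errors"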

The test is failing at step 6 because there are data errors:

12:20:09.47 ERROR: check_pool_status testpool errors No known data errors exited 1
12:20:10.98   pool: testpool                                                     
12:20:10.98  state: ONLINE                                                       
12:20:10.98 status: One or more devices has experienced an error resulting in data
12:20:10.98   corruption.  Applications may be affected.                         
12:20:10.98 action: Restore the file in question if possible.  Otherwise restore the
12:20:10.98   entire pool from backup.                                           
12:20:10.98    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A       
12:20:10.98   scan: scrub repaired 37.7M in 00:00:38 with 1 errors on Wed Aug  7 12:20:09 2024
12:20:10.98 expand: expansion of raidz2-0 in progress since Wed Aug  7 12:18:13 2024
12:20:10.98   62.9M / 129M copied at 550K/s, 48.80% done, 00:02:02 to go         
12:20:10.98 config:                                                              
12:20:10.98                                                                      
12:20:10.98   NAME                STATE     READ WRITE CKSUM                     
12:20:10.98   testpool            ONLINE       0     0     0                     
12:20:10.98     raidz2-0          ONLINE       0     0     0                     
12:20:10.98       /var/tmp/dev-0  ONLINE       0     0     8                     
12:20:10.98       /var/tmp/dev-1  ONLINE       0     0     6                     
12:20:10.98       /var/tmp/dev-2  ONLINE       0     0 38.0K                     
12:20:10.98       /var/tmp/dev-3  ONLINE       0     0 37.4K                     
12:20:10.98       /var/tmp/dev-4  ONLINE       0     0     8                     
12:20:10.98       /var/tmp/dev-5  ONLINE       0     0     8                     
12:20:10.98       /var/tmp/dev-6  ONLINE       0     0     6                     
12:20:10.98                                                                      
12:20:10.98 errors: 1 data errors, use '-v' for a list   <<<<< SHOULD NOT SEE ERRORS HERE

Here's the dbgmsg snippet from where it can't reconstruct:

12:20:09.87 1723058408   ffff91cf003c8000 vdev_raidz.c:2053:vdev_raidz_reconstruct_row(): reconstruct(rm=ffff91cfdd9f3000 col=4 devid=6 offset=30d3c00 error=0)
12:20:09.87 1723058408   ffff91cf003c8000 vdev_raidz.c:2053:vdev_raidz_reconstruct_row(): reconstruct(rm=ffff91cfdd9f3000 col=5 devid=0 offset=30d3e00 error=0)
12:20:09.87 1723058408   ffff91cf003c8000 vdev_raidz.c:2912:raidz_reconstruct(): raidz_reconstruct_expanded(row=3)
12:20:09.87 1723058408   ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 0 devidx 4
12:20:09.87 1723058408   ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 2 devidx 0
12:20:09.87 1723058408   ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 5 devidx 3
12:20:09.87 1723058408   ffff91cf003c8000 vdev_raidz.c:2970:raidz_reconstruct(): reconstruction not possible; too many failures
12:20:09.87 1723058408   ffff91cf003c8000 vdev_raidz.c:3182:vdev_raidz_combrec(): reconstruction failed for all num_failures
12:20:09.87 1723058408   ffff91cf62aed200 dsl_scan.c:4535:dsl_scan_sync(): scan issued 9150 blocks for testpool (14 segs) in 35014ms (avg_block_size = 15269, avg_seg_size = 9995190)
12:20:09.87 1723058408   ffff91cf62aed200 dsl_scan.c:822:dsl_scan_sync_state(): finish scan checkpoint for testpool
12:20:09.87 1723058408   ffff91cf62aed200 metaslab.c:3984:metaslab_flush(): flushing: txg 119, spa testpool, vdev_id 0, ms_id 2, unflushed_allocs 58368, unflushed_frees 53760, appended 112 bytes
12:20:09.87 1723058409   ffff91cf62aed200 dsl_scan.c:4427:dsl_scan_sync(): finish scan clearing for testpool
12:20:09.87 1723058409   ffff91cf62aed200 dsl_scan.c:4546:dsl_scan_sync(): scan issuing complete txg 120 for testpool
12:20:09.87 1723058409   ffff91cf62aed200 spa_history.c:306:spa_history_log_sync(): txg 120 scan done errors=2
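
For anyone trying to pull the same log: on Linux the dbgmsg buffer is exposed through procfs, and logging is gated by a module parameter (these are the standard ZFS on Linux paths):

    echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
    cat /proc/spl/kstat/zfs/dbgmsg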

Another example, this time with zpool status -vs:


13:11:11.15   pool: testpool                                                     
13:11:11.15  state: ONLINE                                                       
13:11:11.15 status: One or more devices has experienced an error resulting in data
13:11:11.15   corruption.  Applications may be affected.                         
13:11:11.15 action: Restore the file in question if possible.  Otherwise restore the
13:11:11.15   entire pool from backup.                                           
13:11:11.15    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A       
13:11:11.15   scan: scrub repaired 37.2M in 00:00:38 with 1 errors on Wed Aug  7 13:11:09 2024
13:11:11.15 expand: expansion of raidz2-0 in progress since Wed Aug  7 13:09:05 2024
13:11:11.15   73.0M / 129M copied at 593K/s, 56.57% done, 00:01:36 to go         
13:11:11.15 config:                                                              
13:11:11.15                                                                      
13:11:11.15   NAME                STATE     READ WRITE CKSUM  SLOW               
13:11:11.15   testpool            ONLINE       0     0     0     -               
13:11:11.15     raidz2-0          ONLINE       0     0     0     -               
13:11:11.15       /var/tmp/dev-0  ONLINE       0     0     6     0               
13:11:11.15       /var/tmp/dev-1  ONLINE       0     0     6     0               
13:11:11.15       /var/tmp/dev-2  ONLINE       0     0 37.5K     0               
13:11:11.15       /var/tmp/dev-3  ONLINE       0     0 37.0K     0               
13:11:11.15       /var/tmp/dev-4  ONLINE       0     0     8     0               
13:11:11.15       /var/tmp/dev-5  ONLINE       0     0     6     0               
13:11:11.15       /var/tmp/dev-6  ONLINE       0     0     6     0               
13:11:11.15                                                                      
13:11:11.15 errors: Permanent errors have been detected in the following files:  
13:11:11.15                                                                      
13:11:11.15         /testpool/fs3/1/30

Describe how to reproduce the problem

We were hitting this pretty frequently while testing out the new "run ZTS in a QEMU VM on a GitHub runner" PR (#15838). Those tests run with fewer CPU resources than the buildbot ones, and it's that slowness that seems to trigger the bug. We were able to hit it there on basically every distro: AlmaLinux 8-9, Fedora 39-40, Ubuntu 20/22/23.

I was able to reproduce this on a local VM with:

  1. Fedora 40 VM with 2 vCPUs and 8GB RAM.
  2. On the host, throttling the QEMU process down to about 10% CPU (a sketch of the full loop follows the list):
	sudo cpulimit -l 10 -p [QEMU PID HERE]
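
Putting that together, a hypothetical reproduction loop; the QEMU process name, the exact -t path syntax, and the assumption that zfs-tests.sh exits nonzero on a test failure are mine:

    # On the host: throttle the guest's QEMU process to ~10% of one CPU.
    sudo cpulimit -l 10 -p "$(pgrep -f qemu-system | head -n1)" &

    # In the guest: run just this test over and over until it fails.
    while ./scripts/zfs-tests.sh -t tests/functional/raidz/raidz_expand_001_pos; do
        echo "passed, trying again"
    done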

Even then it didn't hit it every time. Just to give you an idea:

Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:55] [FAIL]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [01:32] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:18] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [04:58] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [04:09] [FAIL]

This raidz_expand_001_pos failure is our last major blocker before merging #15838.

tonyhutter added the Type: Defect label on Aug 7, 2024
don-brady (Contributor) commented:
Note that the test randomly chooses the parity for each run, so the runs that fail could have a different parity than the ones that succeed. parity=2 looks to be common to the failures I have observed.

don-brady (Contributor) commented:
It seems to always be one 8K block that has unexpected zeros inside it on 512-byte boundaries.
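
For reference, one crude way to check for that pattern: copy the affected file out and scan it for all-zero 512-byte sectors. The file path below comes from the zpool status -v output above; the rest is illustrative only, not how I actually found it:

    # Illustrative sketch: report every all-zero 512-byte sector in a file.
    file=/testpool/fs3/1/30        # affected file from 'zpool status -v'
    zero=$(mktemp); sec=$(mktemp)
    dd if=/dev/zero of="$zero" bs=512 count=1 2>/dev/null
    n=0
    while dd if="$file" of="$sec" bs=512 skip=$n count=1 2>/dev/null; do
        [ -s "$sec" ] || break     # dd wrote zero bytes: end of file
        cmp -s "$sec" "$zero" && echo "all-zero sector at byte offset $((n * 512))"
        n=$((n + 1))
    done
    rm -f "$zero" "$sec"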
