Running ZFS 2.2.0 test suite causes kernel to hang (general protection fault) #15477
Comments
Something very similar was recently hit by the CI:
After looking at Brian's log I continued testing differently.

Summary: The results seem to indicate a hard-to-catch but serious issue.

Testbed: To speed things up I switched to the upcoming Ubuntu 24.04 (development) release, since it comes with ZFS built in (currently 2.2.0rc3). The test suite runs on files and loop devices, so it actually tests the ZFS code while stressing the file system underneath. Because the log file collected by the test suite doesn't contain kernel hang/oops messages, I ran the tests with the options vxK and collected the syslog. So, I ran the test suite:
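A minimal sketch of such a run, assuming the standard zfs-tests.sh wrapper from the source tree; the iteration loop and log file names are my additions, only the vxK options come from the text above:

    # run the suite with the options mentioned above (-v verbose, -x remove
    # stale pools/devices from earlier runs, -K log test names to the kernel
    # log) and keep the kernel messages of each iteration for later correlation
    cd zfs
    for run in 1 2 3; do
        ./scripts/zfs-tests.sh -vxK 2>&1 | tee /var/tmp/zfs-tests-run-$run.log
        dmesg > /var/tmp/dmesg-run-$run.txt    # or filter the syslog instead
    done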
Results:
- on ext4 root filesystem
- on ext4 root filesystem, but /var/tmp as zpool (default options; setup sketched at the end of this comment)
- on ext4 root filesystem, but /var/tmp as zpool (-o version=28)
- on ZFS root filesystem
- on my laptop with Ubuntu 20.04
- on ext4 root, but with ZFS bwatkinson directIO master from 11/10

The files: (the dmesg files are actually taken from a filtered syslog, don't be confused ;-)
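For the two /var/tmp-as-zpool variants in the results above, a minimal sketch of the pool setup, assuming a pool mounted over /var/tmp; the pool name and backing device are my assumptions, only the -o version=28 option comes from the list:

    # pool mounted at /var/tmp so the suite's files and loop devices sit on ZFS
    zpool create -m /var/tmp tanktmp /dev/nvme1n1            # default options
    # downgraded-pool variant from the list above
    zpool create -o version=28 -m /var/tmp tanktmp /dev/nvme1n1
    # note: mounting over a non-empty /var/tmp may need the directory cleared first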
Wow, that's a lot. I hope that this will be helpful to the devs. Good job!
Update: since the problem is hard to reproduce, I focused on making it easier to catch.

1. test on real hard disks
20 iterations of the test suite did not show a problem! Test environment: IBM x3650, Ubuntu 24.04, Ubuntu kernel 6.5, root on ext4, tests on ZFS (mirrored stripe of 4 300 GB SAS disks on an LSI 2008 IT controller). Promising :-)

2. test different underlying file systems
Test environment: Tyan S8036GM2NE, Ubuntu 24.04, Ubuntu kernel 6.5, root on ext4, test suite ran on /var/zfs (setup sketched at the end of this comment).
/var/zfs on ext4 - general protection fault in the second run during
/var/zfs on tmpfs - general protection fault in the eighth run during
I suspected the trim code being a difference between SSDs and disks, and so I modified the ZFS code (

3. test new release 2.2.1
Due to the data corruption bug that kept most of you busy, I switched to release 2.2.1 using kernel 6.1.63 and applied the patch (#15571) manually. (Test environment: Tyan S8036GM2NE, Ubuntu 24.04, vanilla kernel 6.1.63, root on ext4, test suite ran on /var/zfs backed by ZFS [mirrored stripe of 4 NVMe SSDs
The test suite repeatedly got stuck in the first iteration during
No further ZFS operations are possible; when shutting down, the system can't unmount the file systems and the shutdown is forced by systemd (zfs_2.2.1_run1.txt).

5. test release 2.2.2
After release 2.2.2 was out today, I tested whether the problem is still around. I cut down the test suite to just the ones of

My summary/suspicion:
While data corruption is a truly disturbing problem in a file system, stability problems are almost as concerning if you want to run production workloads.
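A minimal sketch of the underlying-filesystem variants from step 2 of the previous comment; device names, mount options, and the use of zfs-tests.sh's -d flag to point the suite at /var/zfs are my assumptions, not commands quoted from the report:

    mkdir -p /var/zfs

    # variant A: /var/zfs on ext4
    mkfs.ext4 /dev/sdb1 && mount /dev/sdb1 /var/zfs

    # variant B: /var/zfs on tmpfs
    mount -t tmpfs -o size=16G tmpfs /var/zfs

    # point the suite at that world-writable directory for its files and loop devices
    chmod 1777 /var/zfs
    ./scripts/zfs-tests.sh -vxK -d /var/zfs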
I'm closing this issue. I tested release 2.2.3 of OpenZFS and could not reproduce the issues reported here. While the problem is probably still present in the older ZFS releases, I have now found a stable release I can trust for production, so I no longer need a fix for release 2.2.0.
System information
Describe the problem you're observing
I had no problem building the Debian packages from the 2.2.0 release code. I installed the dkms and userland packages after removing the Ubuntu-distributed ones. With both kernel modules and userland on version 2.2.0, I ran the test suite as documented on a Tyan Epyc system (using the Ubuntu 5.15.0-71-generic kernel). The first run succeeded with a few unexpected failures. I decided to test again. The second run did not finish, but hung after causing a general protection fault.
In order to remove Ubuntu specifics from the equation, I built a vanilla kernel 6.1.59 using the Ubuntu config and installed ZFS 2.2.0.
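A minimal sketch of such a kernel build, assuming the vanilla 6.1.59 sources and the running Ubuntu kernel's config; paths and make targets are standard kernel-build steps, not commands quoted from the report:

    cd linux-6.1.59
    cp /boot/config-$(uname -r) .config    # reuse the Ubuntu config
    make olddefconfig                      # accept defaults for options new in 6.1.59
    make -j$(nproc) bindeb-pkg             # build installable .deb kernel packages
    dpkg -i ../linux-image-6.1.59*.deb ../linux-headers-6.1.59*.deb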
Now it took three runs of the test suite (Epyc_6.1.59-20231023_zfs-2.2.0-run-1.txt, Epyc_6.1.59-20231023_zfs-2.2.0-run-2.txt, Epyc_6.1.59-20231023_zfs-2.2.0-run-3.txt) to hang the system with a general protection fault. Comparing the results of the first and the second run, the number of FAIL tests increased (huh).
Since this is a new platform, I repeated the tests on well-hung hardware in my lab. I could reproduce the behavior on AMD Opteron and Intel Westmere platforms. During the first run of the test suite I observed kernel block info messages in the vdev_autotrim task (see Opteron_trimoff+nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-1.txt). Following the advice of a similar bug filed here, I decided to change SSDs and to modify/disable TRIM and NCQ using kernel options and udev rules (see the sketch below). This did not make a huge difference; only the point at which the kernel got stuck varied. However, the second or third run of the test suite often hung at the add_prop_ashift test. I also noticed that after successfully finishing the first run of the test suite, the system had a load >2 even though it was completely idle.
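A minimal sketch of the kernel-option part of that change, matching the noncq/noncqtrim suffixes in the file names below; applying the options globally via GRUB is my assumption, not a setting quoted from the report:

    # /etc/default/grub: disable queued TRIM (noncqtrim) or NCQ entirely (noncq)
    # on libata-attached SSDs, then run update-grub and reboot
    GRUB_CMDLINE_LINUX_DEFAULT="libata.force=noncqtrim"
    # or, to switch off NCQ completely:
    GRUB_CMDLINE_LINUX_DEFAULT="libata.force=noncq"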
Out of curiosity I decided to run the test suite against the kernel module of ZFS 2.1.5 as supplied by Canonical (the userland was still on version 2.2.0). I expected more failed and skipped tests, but the test suite hung with a null pointer dereference in the first run (Epyc_5.15.0-71-generic_zfs-2.1.5-run-1.txt).
I'm pretty sure the test suite works on the other systems you have tested; it failed for me, however. We have been using ZFS in production for years, but I had never run the test suite before, so this comes as a surprise ...
PS: in order to disable TRIM, I set /sys/block/sdX/queue/discard_max_bytes to 0.
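A minimal sketch of that setting, both as a one-off command and as a persistent udev rule; the rule file name and match pattern are my assumptions, only the discard_max_bytes = 0 value comes from the PS above:

    # one-off, lost on reboot
    echo 0 > /sys/block/sdX/queue/discard_max_bytes

    # persistent variant, e.g. in /etc/udev/rules.d/66-no-discard.rules
    ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/discard_max_bytes}="0"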
Describe how to reproduce the problem
Have Ubuntu 22.04 installed on ZFS (mirrored rpool & bpool). Build and install a vanilla LTS kernel 6.1.59 using the config of the generic Ubuntu kernel. Build Debian packages from the 2.2.0 release source and install them using dkms. Run the test suite that comes with the source as documented (./scripts/zfs-tests.sh -vx) more than once.
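A minimal sketch of those steps, assuming the deb-utils/deb-dkms targets of the 2.2.0 autotools build are available; the target names, repeat count, and install command are my assumptions, the ./scripts/zfs-tests.sh -vx call is the one given above:

    # build and install the packages (assumed targets; adjust to the build docs)
    cd zfs-2.2.0
    ./configure --enable-systemd
    make -j1 deb-utils deb-dkms
    apt install ./*.deb
    # then run the suite repeatedly until the system hangs
    for run in 1 2 3; do ./scripts/zfs-tests.sh -vx; done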
Include any warning/errors/backtraces from the system logs
The stack trace of the general protection fault looks like this:
The files:
Epyc_6.1.59-20231023_zfs-2.2.0-run-1.txt
Epyc_6.1.59-20231023_zfs-2.2.0-run-2.txt
Epyc_6.1.59-20231023_zfs-2.2.0-run-3.txt
Opteron_6.1.59-20231023_zfs-2.2.0-run-1.txt
Opteron_6.1.59-20231023_zfs-2.2.0-run-2.txt
Opteron_noncqtrim_6.1.59-20231023_zfs-2.2.0-run-1.txt
Opteron_noncqtrim_6.1.59-20231023_zfs-2.2.0-run-2.txt
Opteron_trimoff+nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-1.txt
Opteron_trimoff+nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-2.txt
Opteron_trimoff_6.1.59-20231023_zfs-2.2.0-run-1.txt
Opteron_trimoff_6.1.59-20231023_zfs-2.2.0-run-2.txt
Westmere_nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-1.txt
Westmere_nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-2.txt
Westmere_nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-3.txt
ZFS 2.1.5 modules and 2.2.0 userland (see above):
Epyc_5.15.0-71-generic_zfs-2.1.5-run-1.txt
Westmere_5.15.0-71-generic_zfs-2.1.5-run-1.txt