zfs fails to detect ZFS-8000-8A corruption: Reading file causes ZFS-8000-8A, scrub claims OK, repeat #16520

Open
haraldrudell opened this issue Sep 9, 2024 · 9 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@haraldrudell

haraldrudell commented Sep 9, 2024

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  22.04.4 LTS jammy
Kernel Version        6.5.0-45-generic
Architecture          x86_64
OpenZFS Version       zfs-2.1.5-1ubuntu6~22.04.4
                      zfs-kmod-2.2.0-0ubuntu1~23.10.3

Describe the problem you're observing

  1. Every time a particular file is read, the read fails with an I/O error
  2. zpool status -xv reports https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A with all error counters at zero, and lists that file as a permanent error
  3. Running two scrubs back to back, date --rfc-3339=second && zpool scrub -w z2023 && date --rfc-3339=second && zpool scrub -w z2023 && date --rfc-3339=second, ends with a claim that everything is fixed
  4. Back to step 1

.

BUG: scrub and zfs should not claim everything is fine when it isn’t
BUG: there is no way to have zfs admit that there is corruption

.

QUESTION: is zfs-2.1.5 OK when paired with zfs-kmod-2.2.0? The version numbers differ.
Fresh installs have matching versions; another host also shows the same mismatch.

.

  • The disk is a good SSD with little use
  • reading the disk surface is error free
  • smartctl reports no errors, ever

Describe how to reproduce the problem

cp -avn /mnt/w/2024/Media/filename .
'/mnt/w/2024/Media/filename' -> './filename'
cp: error reading '/mnt/w/2024/Media/filename': Input/output error
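
The same read failure can be checked without copying the file anywhere. This is a minimal sketch, not part of the original report: check_readable is a hypothetical helper name, and it only reads the file end to end and discards the data.

```sh
# Read a file from start to finish, discarding the data.
# Prints a one-line verdict and returns non-zero if the read fails
# (for example with "Input/output error").
check_readable() {
    if cat -- "$1" > /dev/null 2>&1; then
        echo "readable: $1"
    else
        echo "read failed: $1"
        return 1
    fi
}
```

Calling check_readable /mnt/w/2024/Media/filename before and after each scrub would show whether the scrub changed anything for this file.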

The software that wrote this file:
  • first wrote the file, verifying no errors
  • then read the file back, verifying no errors, and validated the checksum
Meaning: immediately after writing, the file could be read.
The first bookmark event (I/O error) was 6 days later.
Nothing in particular happened to the host or the disk during that time: no reboots or the like.
This particular host has operated this pool for two years.

zfs came up with this error all by itself: it can’t read what it wrote to disk, and
it can’t figure out ahead of time that the data is unreadable.
Of course, hundreds of other files written around the same time read fine.

syslog:

Sep  9 05:11:58 c68z zed: eid=6871 class=authentication pool='z2023' bookmark=3243:4600:0:450

There is no zed logging between the time this file was written (September 2) and this event, nor any syslog entry from when the file was written.
zfs came up with this issue all by itself; there was no power outage or tripping over cables.

Every time a scrub completes and the I/O error occurs again, the bookmark log statement is printed.
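
The bookmark in the zed line can be split into its fields with a short sketch like the one below. The field names objset, object, level and blkid are an assumption about how zed formats the bookmark, and parse_bookmark is a hypothetical helper name.

```sh
# Split a zed bookmark such as 3243:4600:0:450 into four fields.
# ASSUMPTION: the order is objset:object:level:blkid.
parse_bookmark() {
    IFS=: read -r objset object level blkid <<EOF
$1
EOF
    echo "objset=$objset object=$object level=$level blkid=$blkid"
}
```

For the log line above, parse_bookmark 3243:4600:0:450 yields object=4600. If the object number matches the file's inode number, as is usually the case on a ZFS filesystem, then ls -i on the failing file should show the same number. That is a way to cross-check the event against the file, not a diagnosis.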

Include any warning/errors/backtraces from the system logs

@haraldrudell haraldrudell added the Type: Defect Incorrect behavior (e.g. crash, hang) label Sep 9, 2024
@haraldrudell
Author

haraldrudell commented Sep 11, 2024

  1. Apparently Ubuntu 22.04.3 with the HWE kernel ends up with mismatched zfs versions
  2. zfs does not support running different versions of the kernel module and the utilities, as displayed by zfs version
  3. the zfs 2.2.0 that Ubuntu ended up providing is buggy and corrupts pools
  4. Ubuntu will not port a newer, non-corrupting zfs version to 22.04.3
  5. Ubuntu has not fixed the version conflict for several months
  6. easiest fix appears to be upgrade to 24.04

@AllKind
Contributor

AllKind commented Sep 11, 2024

Personally I'd go one step further. I'd not trust the Ubuntu packages at all.
I'd download the source (not the tarball, as it's broken) and follow the directions to build 2.2.6 native debian packages.
Just make sure to remove all the Ubuntu zfs (including the libs) before.

@pimlie

pimlie commented Sep 11, 2024

  6. easiest fix appears to be upgrade to 24.04

Or use this PPA https://launchpad.net/~patrickdk/+archive/ubuntu/zfs/+packages instead (if you trust the maintainer).

@clhedrick

I'm not so sure Ubuntu's 2.2.0 corrupts pools. If you're referring to the bug I think you are, they backported a fix, but didn't upgrade the version number. Also, the 22.04 HWE kernel is now 6.8, with 2.2.2. The change from 6.5 to 6.8 is recent. They still use mismatched kernel and user in HWE, however. A number of bug reports have been ignored.

However for myself, I run 2.2.5 (haven't rebooted since 2.2.6) on Ubuntu. Discussions on this topic have indicated that they don't think openzfs does enough testing of new releases to meet their standards, so they stick with versions that have gone through a full Ubuntu release beta cycle. My own evaluation is different, both for zfs and kernel. (I'm running 6.6.44 on my file servers.)

@haraldrudell
Author

haraldrudell commented Sep 11, 2024

  1. The lesson is that Ubuntu 22.04.4 with the HWE kernel (the kernel with better hardware support) ends up with a faulty zfs all by itself
  2. The indicator is that mismatched versions are displayed, or that version 2.2.0 is present:
zfs version
zfs-2.1.5-1ubuntu6~22.04.4
zfs-kmod-2.2.0-0ubuntu1~23.10.3
  3. This has been a problem for several months, from maybe February through September 2024, causing a small number of zfs errors or zfs pools entering a failed state fixable by scrub
  4. The easiest fix is to upgrade to Ubuntu 24.04.1 LTS noble, kernel 6.8.0-44-generic, directly from the failed state
zfs version
zfs-2.2.2-0ubuntu9
zfs-kmod-2.2.2-0ubuntu9
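
The version check described in this list can be scripted. The following is a rough sketch under stated assumptions: zfs_versions_match is a hypothetical helper, it expects the two-line zfs version output shown above on stdin, and it compares only the upstream part of each version (for example 2.1.5), ignoring the packaging suffix.

```sh
# Compare the userland and kernel-module versions from `zfs version` output.
# Reads the lines on stdin; prints a verdict and returns non-zero on mismatch.
zfs_versions_match() {
    user="" kmod=""
    while IFS= read -r line; do
        case "$line" in
            zfs-kmod-*) kmod=${line#zfs-kmod-} ;;   # must be tested before zfs-*
            zfs-*)      user=${line#zfs-} ;;
        esac
    done
    # Keep only the upstream version, dropping everything from the first "-".
    user=${user%%-*}
    kmod=${kmod%%-*}
    if [ -n "$user" ] && [ "$user" = "$kmod" ]; then
        echo "match: $user"
    else
        echo "mismatch: userland=$user kmod=$kmod"
        return 1
    fi
}
```

Usage would be zfs version | zfs_versions_match. A match only shows that the two components come from the same upstream release; it says nothing about whether a given pool error is caused by a mismatch.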

@rincebrain
Contributor

I would be curious to see pointers of these discussions, as well as the specific bugs that you're advocating upgrading to 2.2.2 to resolve.

@clhedrick

I believe some people suggested 2.2.2 because of three serious problems with 2.2.0: BRT bugs, which led to 2.2.1 turning the feature off by default; CVE-2023-49298, a potentially serious corruption problem, fixed in 2.2.2; and #15526, which Ubuntu's changelog treats as separate from the CVE. I'm not sure whether this is right.

I note, however, that the fix to CVE-2023-49298 was cherry-picked into Ubuntu's ZFS 2.2.0, along with a fix to #15526. I believe disabling BRT was as well, though I can't verify it. The first two were also cherry-picked into their ZFS 2.1.5.

Ubuntu prefers to freeze ZFS and cherry-pick only CVEs and other very serious fixes. I disagree, which is why I'm using 2.2.5. (We haven't rebooted since 2.2.6 was released.)

@rincebrain
Contributor

The CVE is also wrong, for most useful definitions, as you might expect for a CVE generated by some random person opening it. As the end of #15526 says, that wasn't fully resolved until later, with (among others) #16019. Separate from any bugs found in BRT itself, it exposed a bunch of existing, difficult-to-hit races: a metadata-only copy is inherently going to be a faster operation, so some races that have existed for a long time, and were otherwise impossible or impractical to hit, turned up.

There were also fixes in BRT, like #15842, though once the killswitch PR is cherrypicked that becomes less urgent unless people override that.

In particular, though, none of the flaws around BRT that I'm aware of should be producing checksum errors, since they were all around logical data handling, so none of those should be germane for this bug.

@clhedrick

Sure. I doubt that most of the discussion here has anything to do with what's causing the user's problem.
