Input/output error in recent snapshot; three times on same host now #15474
Per your suggestion in #14911 (comment) I compiled zdb from git and ran it here.
If I look at the diff between the
Same with the diff between 0500Z (broken) and 0700Z:
Nor with a diff between two good snapshots 0400Z and 0700Z:
And the difference between the dataset and the snapshot (sans
zdb-K-db6-dataset-vs-0500.diff.gz
I'd provide the full files if you think there is useful info in there. I just have to know I'm not leaking anything, so I'd have to do some more redacting.
I'd try running a version that's current, like 2.1.13 or 2.2.0, and seeing if it still produces EIO, rather than hoping whatever Ubuntu is shipping isn't on fire.
Good news for openzfs/zfs#15474 -- after a scrub, reading all files, and a second scrub, the data errors are gone with 2.1.13. Also, the sync of the encrypted snapshot to elsewhere succeeded.
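For reference, the recovery sequence described here as a rough sketch; the pool name (tank) and mountpoint are placeholders:

```sh
# First scrub, then force a read of every file, then scrub again.
zpool scrub tank
zpool wait -t scrub tank            # or poll 'zpool status tank'
find /tank -type f -exec cat {} + > /dev/null
zpool scrub tank
zpool wait -t scrub tank
zpool status -v tank                # error list should be empty afterwards
```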
Okay. I've spent some time getting proper deb packages which I can manage in an apt repository. (This meant getting reproducible builds right, because repository managers mandate that two identically named files must be identical.) https://github.com/ossobv/zfs-kmod-build-deb This appears to be sufficient to get things up and running.
That makes me a happy camper. With regards to self-compiled builds, there are a few items that would be nice to get fixed here:
See https://github.com/ossobv/zfs-kmod-build-deb/blob/3ddbe5fb2f09fda9724b232e6f90903d524abed8/Dockerfile#L60-L81 and https://github.com/ossobv/zfs-kmod-build-deb/blob/3ddbe5fb2f09fda9724b232e6f90903d524abed8/dpkg-repackage-reproducible . Best would probably be to create proper Debian packaging infra. Then I could also ensure that e.g. libzpool5 conflicts with libzpool5linux, which is likely going to be my next hurdle. Thinking out loud... In any case: this particular issue is fixed with 2.1.13. Thanks! Walter Doekes
Bad news. The error reappeared in snapshot 2023-11-13 22:00Z. Same system, 2.1.13 ZFS module now.
Versions
We're still running the Ubuntu-provided userspace tools (2.1.5), but we run the 2.1.13 kernel module: https://github.com/ossobv/zfs-kmod-build-deb/releases/tag/zfs-2.1.13-1osso1
Errors
Error seen in zpool status:
Followup
I'll try and replace the userland tools with the 2.1.13 builds and see if anything changes. @rincebrain: if you have other (zdb?) tips for debugging in the meantime, I'm interested.
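As an aside, a quick way to confirm which userland and kernel-module versions are actually in use (standard OpenZFS commands; the commented output is illustrative, not captured from this host):

```sh
zfs version
# zfs-2.1.5-...        <- userland tools
# zfs-kmod-2.1.13-...  <- loaded kernel module
cat /sys/module/zfs/version
```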
But the files were readable in the snapshot -- at least the ones I tried. Error state:
Removing the snapshot:
Doing a scrub:
... running 2.1.13 userland tools now. We'll see if the error reappears.
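Roughly the sequence described above, with hypothetical pool and snapshot names standing in for the redacted ones:

```sh
# Drop the snapshot that throws EIO, then scrub and re-check the error list.
zfs destroy tank/mysql/redacted/data@planb-20231113T2200Z
zpool scrub tank
zpool status -v tank
```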
And now, with userland at 2.1.13 and module 2.1.13, after a reboot, there is a new data error.
Interestingly, just after checking (cause?) with
dmesg has nothing (no disk/IO errors or anything). The storage seems fine too:
(I do notice that ashift is 9 and the sector size is 512, which is not optimal for this device.) (edit: added uname -r)
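For reference, a hedged way to compare the pool's ashift with what the device reports; the pool and device names are placeholders:

```sh
# ashift actually used by the vdevs, from the cached pool configuration
zdb -C tank | grep ashift
# logical/physical sector size the SSD advertises
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/nvme0n1
```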
(SSDs usually hide the performance impact of the actual size of IOs to them well enough that you should weigh the observed performance against the space efficiency that results, IMO.)

So, I would guess this is decryption errors, not actual r/w/c errors, since they're not showing up there. My guess would be there's some additional stupid caveat around the MAC caveats that the patch you were originally applying to Ubuntu's tree is a workaround for - the caveat is, that patch just works around it when it encounters it; a different change is what stops it from screwing up in the first place. (cc #11294 #13709 #14161 63a2645 e257bd4 for prior art.) Of course, those are all just for "I cannot unlock the dataset at all because of that error", not files within it erroring...

If it's giving stat errors, I would guess that there could be a similar issue with not just the objset accounting data? I'm not really sure what we can do about that, though, because the reason we can just ignore it in the objset accounting data case is that we can just regenerate it.

If this is a backup target, what's the source running, in terms of ZFS version? What's the automation that's doing the send/recv? 2.1.x has useful counters you can see in

I would suggest that you look at the zpool events output for those errors, and see what they have to say - in particular, you might also find it useful to look at the (git, so it can look at encrypted objects) zdb output for objects 0 and 0x20 (so 32) on that dataset that's erroring, and see if it reports anything interesting. (0 could actually mean 0 or somewhere in the metadata without more precision; 0x20 appears to be the system attribute node on most of the more recent datasets I've made and a normal directory object on some others, so it'd be good to find out what it is here.)
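For reference, a hedged sketch of that kind of inspection; the pool, dataset, and snapshot names below are placeholders, and only standard zpool/zdb invocations are used:

```sh
# Show recent error events in full detail; checksum/authentication events
# include the dataset and object numbers involved.
zpool events -v tank | less

# Dump objects 0 and 0x20 (32) of the affected dataset with -dddd (detailed
# object dump); a zdb built from git can also descend into encrypted objects.
zdb -dddd tank/mysql/redacted/data 0 32
# The same objects in the broken snapshot, for comparison:
zdb -dddd tank/mysql/redacted/data@planb-20231113T2200Z 0 32
```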
It also, thinking about it, wouldn't completely astonish me if those decryption errors were transient and went away after it decided to regenerate the accounting information, but that's just wild speculation. So if you scrub and they go away and they don't come back, great - that's a nice cleanup that should probably be applied to how that's reported, but it's better than getting "???" for files. If you're still seeing the ??? for some files, though, then we should probably dig even more into it.
Thanks for the replies :)
This is a backup source. The target is running 2.1.5, but we're not actually getting any data there (when it fails), so I don't think that can be an issue. The automation (roughly sketched below) consists of: zfs snapshot, zfs set planb:owner=..., zfs send -I. The data is sent unencrypted (not
So far this only went away by removing the snapshot and doing a scrub afterwards.
Weird. The files weren't there. And then they were, but were unreadable. Still exactly 2 errors at this point.
But this caused a third error to appear. (And later a fourth.)
Here are the counters:
Not sure if these say they are decryption errors or not. I'll see if I can get some
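A hypothetical sketch of the automation described above; the dataset, snapshot names, property value, and receiving host are all placeholders:

```sh
# Take the new snapshot and tag it.
zfs snapshot tank/mysql/redacted/data@planb-20231114T0500Z
zfs set planb:owner=backup tank/mysql/redacted/data@planb-20231114T0500Z

# Incremental, non-raw (unencrypted) send of everything since the previous snapshot.
zfs send -I tank/mysql/redacted/data@planb-20231114T0400Z \
            tank/mysql/redacted/data@planb-20231114T0500Z |
    ssh backup-host zfs receive -u tank/backups/redacted/data
```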
Types of the aforementioned objects:
So,
Very few differences between a good and a bad snapshot:
My guess remains that it's failing to decrypt the SA and thus refusing you access to the information on the object, but it's not obvious to me why it would be doing that. (zdb, of course, doesn't care about that anyway.)
What's the server hardware that this is in? I don't immediately recall any bugs with a race in the native encryption code, but, who knows.
Supermicro X11SCE-F, |
If you have suggestions on debug/tracing code to add to 2.1.13, I can certainly add patches. There was a second dataset snapshot broken today. I removed both snapshots so the backups are running again. So there's nothing to poke at until the next error pops up.
Checking in. In November I've had corrupt snapshots on:
In December (2023-12-01 10:03:02), I updated to ZFS 2.1.14-1osso0:
No idea if it could be related to the fixes in 2.1.14. I do hope it is, and that the snapshots remain working from now on.
Okay. Well, that theory was fun while it lasted :)
For different reasons I'll be disabling hourly snapshots on this host. Let me know if you have anything I can work with, then I'll gladly restart something to try and reproduce. (
Maybe relevant: when no snapshots were made in between, I (still) had to scrub twice to get rid of the error after deleting the failing snapshot.
That's expected - the way the error list in zpool status works, it keeps the previous list after a scrub and only removes entries after they haven't cropped up for two scrubs. I think the original idea was that if you have something flaky, it's useful to remember where it was wrong before, even if it read fine this time.
Today I saw https://www.phoronix.com/news/OpenZFS-Encrypt-Corrupt and this looks very similar to #12014:
System information
Problem observation
Input/output error
(Times below without a Z are in CET, i.e. 06:00 is 05:00Z.)
I have snapshots:
But the (now) second to last one, I cannot access.
Some files have stat issues (see the ? above). None of the files can be read, except for the empty file. (Reading the 0-byte file went fine.)
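For illustration, the kind of access that fails; this assumes the dataset is mounted at /tank/mysql/redacted/data, and the path and file name are placeholders:

```sh
# Listing the broken snapshot: some entries show '?' because stat() fails.
ls -la /tank/mysql/redacted/data/.zfs/snapshot/planb-20231101T0500Z/
# Reading any non-empty file fails with an Input/output error;
# only the 0-byte file reads fine.
cat /tank/mysql/redacted/data/.zfs/snapshot/planb-20231101T0500Z/somefile > /dev/null
```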
Checking zpool status:
Most of those 941 data errors were from the battle I had with the previous snapshot that went bad. I deleted that bad snapshot yesterday, in the hope that it was just a fluke. That snapshot concerned a different fileset by the way:
tank/mysql/redacted2/data
How to reproduce the problem
I don't know yet. This is the first such issue I have encountered:
planb-20231101T0500Z and planb-20231101T0700Z; the 0700Z one appears to be readable.
@planb-20231101T0500Z, but none that can be synced, because every sync starting from 0400Z tries to include 0500Z as well, and that fails.
planb-20231030T0500Z, so also at 05:00, but an even earlier failure was in tank/mysql/redacted3/data@planb-20231027T1800Z.
Include any warning/errors/backtraces from the system logs
dmesg
Empty. Nothing after boot (Oct 17).
/proc/spl/kstat/zfs/dbgmsg
I got this from the debug log, which isn't much:
This is strange, because at 2023-11-01 that (redacted2) snapshot should've been gone already. I was now debugging the redacted one.
After clearing the log (echo 0 > /proc/spl/kstat/zfs/dbgmsg) and enabling (random) extra stuff (echo 1 > /sys/module/zfs/parameters/spa_load_print_vdev_tree) I now have this:
This means nothing to me, and could just be debug info from the extra spa_load_print_vdev_tree I enabled.
I enabled.zed
"When"
First failure (at 05:01:04Z), seen on the receiving end:
Second failure (at 07:00:56Z), seen on the receiving end:
The other side?
The other side has no snapshots after tank/mysql/redacted/data@planb-20231101T0400Z. The snapshots that do exist are likely fine (including 0400Z).
How to fix?
As I mentioned, yesterday I had the same issue with a bigger snapshot (redacted2). There I destroyed the snapshot, and resumed with a snapshot which I still had on both sides (the 0400Z one).
I started a scrub then too, which finished before the new issues started:
Interestingly, the mentioned problems in zpool status -v were still there.
Questions
What other debugging can I do/enable/run/check?
I can probably leave this in this state for a short while, hopefully not too long, while I break out the debugging tools.
Can you help with some zfs/zdb debugging foo so we can get to the bottom of this?
Thanks!
Walter Doekes
OSSO B.V.