Sometimes raw send on encrypted datasets does not work when copying snapshots back #12594
This happens because, when raw sending encrypted datasets, the userspace accounting present in the originating dataset gets in the way of receiving the snapshot back. Edit: if you have lost critical data due to this case, I could help you recover it. |
I am able to reproduce this. At a high level I wanted to send an unencrypted dataset to a new pool with encryption enabled, wipe the old pool, and raw send this encrypted dataset back to a fresh pool. But snapshots end up being sent back the other direction at times since it's on an active system. In this process I discovered this bug for myself. I've made a repro script here which just makes a file-based pool and puts it into a broken state:
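For readers without the script handy, a minimal sketch of that kind of file-backed reproducer (placeholder pool names and a throwaway passphrase; not the original script, and it may take extra round trips or particular pool settings to actually trigger the error):

```sh
# Sketch only: two file-backed pools, one encrypted dataset, and a raw
# send/receive round trip back to the originating pool.
truncate -s 1G /tmp/pool_a.img /tmp/pool_b.img
zpool create pool_a /tmp/pool_a.img
zpool create pool_b /tmp/pool_b.img

echo "throwaway-passphrase" > /tmp/key
zfs create -o encryption=on -o keyformat=passphrase \
    -o keylocation=file:///tmp/key pool_a/enc

# Raw send to the second pool, then snapshot again there.
zfs snapshot pool_a/enc@1
zfs send -w pool_a/enc@1 | zfs receive pool_b/enc
zfs snapshot pool_b/enc@2

# Raw send the new snapshot back to the originating pool and try to mount it.
zfs umount pool_a/enc
zfs send -w -i @1 pool_b/enc@2 | zfs receive -F pool_a/enc
zfs mount pool_a/enc   # on affected versions this can fail with an Input/output error
```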
Worse, I originally poked around testing this on my personal pool, and it faulted the pool in such a way that even destroying the affected datasets didn't help me clear the errors: |
@putnam Hey! Thanks a ton for getting a local reproducer working! However, I cannot get this to bug out on my test platform, an Intel laptop (where I have unfortunately never managed to reproduce the problem). I don't have time right now, but I will try this on my production machine (which does have the problem). I therefore think there may be a hardware component to this bug (or these bugs). In the meantime, can you check this on 0.8.6? (I'm low-key hoping you'll be willing to bisect this.) |
Ran @putnam's script on my own setup (system info at the top) and it did not result in any errors. I just added a few more rounds of sending data back and forth and that made it break for me. @aerusso I suspect that if you pass snapshots back and forth a few more times (even if it varies per hardware) it will break eventually. I was thinking it would also be easy to write a bit of a fuzzing script that randomly sends raw snapshots back and forth, unmounting, remounting, etc., which should be able to generate many failure cases; not sure if that would be of help. My mods to the script are below:
@putnam For me, two scrubs got rid of those errors, in case you want to try that to fix your personal pool. |
Thanks @aerusso and @digitalsignalperson for the feedback and updates. I wonder what is different between our setups. For anyone running that script, please post your kernel and ZFS versions at the time you ran it (uname -a; cat /sys/module/zfs/version). My kernel at the time of test: Debian 5.14.0-1-amd64. I also found that someone else wrote up a similar script (#11983) to attempt a reliable repro. This bug has been reported in several places and probably needs consolidation. It's also clear that some efforts have already been made and maybe the root cause is already well understood; see #11300, which has not been updated in ~4 months. The situation seems kind of bad. I don't know all the possible use cases where it might occur (probably many), but my situation is: |
@digitalsignalperson I will do two scrubs (this is a large pool so it'll take ~3 days) and report back if it fixes the pool error. Thanks! |
@putnam I have the same problem. Did you find any solution? |
Thanks! The modified version "works" (breaks) reliably on my test platform. |
Confirming that two back-to-back scrubs cleared the corruption error. Not sure the technical reason why it took two scrubs, but glad it's cleared. For what it's worth, my system is an Epyc 7402P with 128GB of ECC RAM. |
Not sure either about the two scrubs, but I saw it suggested/reported in some of the other similar encryption issues. |
ZFS also remembers the errors from the last completed scrub, which is why it takes two scrubs with the errors gone before they disappear, AIUI. |
this bug has likely existed since the introduction of the encryption feature. |
Having the same issue on this end... I had no idea that my encrypted backups were getting hosed until I went to restore some of my datasets from my backup. I do encrypted sends with syncoid (--sendoptions="w") to the backup pool. The only problem is that I tried the double scrub but I'm still getting the same input/output error. Is there any other hope of recovering the data from the backup pool? It sounded from other comments like you need to first get the error to go away from zpool status, and then do two scrubs... is my thinking correct? I'm going to spool up a bunch of sequential scrubs as a last hail mary, but any other pointers would be useful. |
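For reference, a raw-send backup like the one described here would look roughly like the following (hypothetical pool and host names; --sendoptions="w" just makes syncoid pass -w to zfs send, so the stream stays encrypted and no key is needed on the target):

```sh
# Hypothetical names: tank/data is the source, backup/data the target.
syncoid --sendoptions="w" tank/data root@backuphost:backup/data

# Roughly equivalent one-shot with plain zfs tools:
zfs snapshot tank/data@backup-1
zfs send -w tank/data@backup-1 | ssh backuphost zfs receive backup/data
```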
Usually, one would remove whatever is causing the errors, scrub twice, and then zpool status would no longer list those errors. |
Yeah, that makes sense... the confusing thing is that doing the scrub "cleared the error", at least as far as ZFS was concerned. After the two sequential scrubs ZFS reports no errors, so I would have thought mounting the dataset after two error-free scrubs would work, but I'm still having issues. Does clearing the error in this case mean doing more than an initial scrub to make ZFS think the error went away? Also, side note, is there any plausible way to forensically recover the dataset by manually decrypting it? I unfortunately know too little about what goes on under the hood to know if this is even possible... or how one would go about it. |
I believe the reason for the counterintuitive behavior is that the error is coming from trying to decrypt things, which zpool scrub very notably does not do. From my understanding of the problem based on @gamanakis's patch and replies, I would assume it would be possible to write a patch to just ignore the failing bits and let you extract your data. (The existing reverted fix for this might even do that, I'm not sure.) |
Ok, well yeah, that does make a bit more sense in terms of scrub being unaware. I'll see if @gamanakis responds here with any helpful info... also trying to figure out if zdb can be useful in getting the data out without patching zfs |
I'd be interested to hear any solution. I wouldn't mind starting to use raw encrypted sends for offsite backup if there was a hacky workable recovery method. |
@marker5a You could cherry-pick the commit here: gamanakis@c379a3c on top of zfs-2.1.0, zfs-2.1.1, or zfs-2.1.2. That should resolve your problem. That commit just introduces a flag that marks the user accounting metadata as invalid when being received, which forces its recalculation upon first mounting of the received dataset and avoids the error otherwise encountered. |
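If it helps, the cherry-pick would look roughly like this (a sketch only; adjust the tag to whichever release you run and build/install the way you normally do for your distro):

```sh
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.1.2                 # or zfs-2.1.0 / zfs-2.1.1
# make the commit from gamanakis's fork available locally, then apply it
git remote add gamanakis https://github.com/gamanakis/zfs.git
git fetch gamanakis
git cherry-pick c379a3c
# standard OpenZFS build steps
sh autogen.sh && ./configure && make -s -j"$(nproc)"
```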
What's the reason not to try and get that approach merged in general? |
@gamanakis Thanks for the speedy reply!!! I was about to go down that route but decided to sit on my hands and wait, lol. I'll give that a try and report back my findings... thanks! |
The justification for the reversion in 6217656 was that the original fix broke mounting of encrypted datasets created in between releases on git master.
Seems odd; it reads like it's choosing to break one thing (failure to mount raw encrypted sends in general) over another (failure to mount encrypted datasets created in between releases using git master?). If we stick to releases, is there any harm from the original patch? |
Because we can't always 100% trust ZFS, I'm trying to create an intelligent zfs-compare tool that will compare the latest common snapshots in two pools. It will shasum the actual zvols and files, instead of relying on ZFS. It will also transfer a remote dataset that's encrypted and only has a local key, so that the encryption key isn't needed remotely. Does anyone want this tool as well? |
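The core of the file-comparison idea can be sketched in a few lines of shell (placeholder mountpoints and snapshot name, and not @psy0rz's actual tool; zvols would need to be hashed separately from their /dev/zvol nodes):

```sh
# Hash every file under the same snapshot on both pools and diff the results.
snap=common-snap
hash_tree() {
  (cd "$1/.zfs/snapshot/$snap" &&
     find . -type f -print0 | sort -z | xargs -0 sha256sum)
}
hash_tree /poolA/data > /tmp/a.sums
hash_tree /poolB/data > /tmp/b.sums
diff /tmp/a.sums /tmp/b.sums && echo "snapshot contents match"
```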
@psy0rz I'd be curious to see how it works. |
Seems to be the same issue as #10523. Might it make sense to somehow merge all those issues into one and get rid of some of the duplicates? There appear to be a number of issues open on this subject, and so far no fix for more than a year. |
@rincebrain The person who wrote the encryption (@tcaputi) suggested in PR #11300 that, instead of introducing the new flag, we should zero out the dnodes holding user accounting on the receiving side. However, in the absence of a loaded key I fail to see how this is possible: if those dnodes are freed on the receiving side, it starts searching for the key. I have pinged Tom again in this regard. |
@gamanakis Looking at the pull request, there are a couple of things happening that apparently break when you do this. Wouldn't one solution be to just flag the raw received dataset so that this is done on first mount, when the key has already been loaded? That way you don't run into missing-key issues and avoid a failing mount at the same time. From a cursory glance at the issue and pull request it seems that the data is all there and intact and it's just some metadata that is corrupted. I guess a perfect solution would be to also send the metadata correctly; if that isn't possible, flagging it for cleanup on next mount seems to be the next most sensible thing to do. |
Right, that was my initial approach (draft gamanakis/zfs@c379a3c).
This is the latter approach, which I am having trouble implementing. |
I think this sort of flag is needed in any case, and it could be cleared by checking whether the metadata is valid if future releases can send/receive valid metadata. The problem is that, looking at forwards/backwards compatibility, there are already versions with raw send out there that will send wrong metadata. This will have to be handled anyway, unless you want a major version break for this. So preemptively flagging the received datasets and then deciding on first mount whether the data you received was good or needs to be fixed is, I think, the best way forward. |
There is no problem with older versions. The flags are stored in a uint64_t, and the placeholder for the new flag defaults to 0, i.e. it is inactive by default. |
I think I wasn't clear about what I meant. In the case where the sending side has to be upgraded to fix the metadata issue, we need to add the Dirty flag anyway so that we can check a received dataset, because we can't reliably know if it's from a new enough version, right? I'm unfortunately not very knowledgeable about ZFS internals. |
I misunderstood what you said; you are right, I think. I will do some cleanup and open a new PR introducing the flag.
I wonder how pathological it could be to add a case for "if I fail to decrypt specifically the userobj_accounting metadata, just throw it out", possibly guarded by a tunable. (That, or a zhack command to go forcibly set the "clear on next open" flag.) It'd be a shame for people to have to throw out pre-patch recvs. |
Both examples (1 and 2) in OP from @digitalsignalperson and the reproducer from @putnam complete without any errors with #12981 applied. |
Raw receiving a snapshot back to the originating dataset is currently impossible because of user accounting being present in the originating dataset. One solution would be resetting user accounting when raw receiving on the receiving dataset. However, to recalculate it we would have to dirty all dnodes, which may not be preferable on big datasets. Instead, we rely on the os_phys flag OBJSET_FLAG_USERACCOUNTING_COMPLETE to indicate that user accounting is incomplete when raw receiving. Thus, on the next mount of the receiving dataset the local mac protecting user accounting is zeroed out. The flag is then cleared when user accounting of the raw received snapshot is calculated.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #12981
Closes #10523
Closes #11221
Closes #11294
Closes #12594
Issue #11300
Awesome thanks @gamanakis! I tested here in my vagrant box and seems to work. Looking forward to using raw sends |
Thanks @gamanakis !! I hope after this, #12720 can get some attention. I am still not able to use raw sends as the generated send stream contains odd/damaged objects that break the receive. |
I experienced this last night reformatting a desktop, with the expectation that I could simply receive my encrypted datasets back afterwards. It's fun. The NAS that these encrypted datasets were sent to could mount them just fine, given the nature of the bug, but so could a laptop that I also raw received them onto. At this point my first guess is that the version of ZFS your zpool was created on plays a part, but I'll have some more fun playing with it in a VM today now that I'm back online. At least with the NAS mounting the data, I was able to rsync it to the desktop as a one-off. I had a thread on reddit/zfs here, but for now I've worked around the issue for myself. Thank you for the PR @gamanakis. |
Just adding my own tests, which are consistent with the above comments about ashift=9 vs ashift=12, where using 12 causes the problem. The same Arch Linux USB stick was able to zfs recv the "broken encrypted dataset" and mount it perfectly fine using default zpool settings. It was only when I used ashift=12 in the zpool creation that the "Input/output error" issue became apparent. Testing conditions: Linux 5.15.5 and zfs 2.1.1, then zfs 2.1.2
Then I did those tests again, but for step 3 I included ashift=12 in the zpool creation. Setting ashift=12 on zfs 2.1.1 and 2.1.2 causes the issue for me in a VM where /dev/vda was presented as a device using 512b sectors to the VM. Seems consistent enough. My laptop was able to read my desktop's encrypted root dataset because its zpool still used ashift=9, like the encrypted root dataset it received and could mount. |
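For anyone wanting to repeat that comparison, the pool-creation step would look roughly like this (file-backed pool with placeholder names; only the -o ashift value changes between runs):

```sh
truncate -s 2G /tmp/ashift-test.img
# ashift=12 forces 4K alignment even on a device reporting 512-byte sectors;
# use ashift=9 (or omit -o ashift) for the control run.
zpool create -o ashift=12 testpool /tmp/ashift-test.img
zdb -C testpool | grep ashift    # one way to confirm what the pool actually got
```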
In my tests it also happens with ashift=9 (default, checked with zdb) when raw sending back to the originating pool. |
I am curious, was your pool originally ashift=12 when it sent the snapshot away initially and is receiving it back as ashift=9 to experience the issue? |
Without PR 12981 I cannot raw receive in the originating pool, regardless of the ashift; it throws an Input/output error when mounting. Let me try your case with 12981 applied. |
Ok, this seems to be a different issue. Raw sending from pool1/encrypted with ashift=9 to pool2/encrypted with ashift=12 results in failure when mounting pool2/encrypted (Input/output error). I think you should open a new issue. I am not sure raw sending between pools with different ashift is possible; I will take a look. |
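A quick way to reproduce that cross-ashift case with throwaway file-backed pools might look like this (a sketch with placeholder names and a throwaway key, not the exact commands used above):

```sh
truncate -s 1G /tmp/p1.img /tmp/p2.img
zpool create -o ashift=9  pool1 /tmp/p1.img
zpool create -o ashift=12 pool2 /tmp/p2.img

echo "throwaway-passphrase" > /tmp/key
zfs create -o encryption=on -o keyformat=passphrase \
    -o keylocation=file:///tmp/key pool1/encrypted

zfs snapshot pool1/encrypted@s1
zfs send -w pool1/encrypted@s1 | zfs receive pool2/encrypted

zfs load-key -L file:///tmp/key pool2/encrypted
zfs mount pool2/encrypted   # the Input/output error reportedly shows up here
```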
I opened #13067 for this matter, did some debugging there too. |
System information
Describe the problem you're observing
I am able to send raw encrypted snapshots (incremental and replication streams) back and forth between file systems a limited number of times before getting a
cannot mount 'rpool/mydataset': Input/output error
and errors in zpool status. I have tried many sequences of sends/receives with raw encrypted snapshots; sometimes I can pass them back and forth only once, other times more. Below I will share two repeatable examples.
This seems like a manifestation of the issue in "Raw send on encrypted datasets does not work when copying snapshots back" (#10523), which was previously resolved.
Describe how to reproduce the problem
Example 1 - fails on first send back
Example 2 - more convoluted, but fails after a few passes back and forth
At this point the output of zpool status -v includes an error entry for the dataset in question. If I rollback the last snapshot in question and then scrub once, the status still shows the error, but if I scrub a second time I end up with no reported errors, and if I repeat the last operation in question it will get the same IO error again.
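The clear-the-error sequence described above would look roughly like this (placeholder snapshot name; the dataset name matches the error message earlier):

```sh
# Roll back past the snapshot that triggered the error, then scrub twice --
# zpool status keeps reporting errors from the previous completed scrub,
# so a single clean scrub is not enough.
zfs rollback -r rpool/mydataset@last-good
zpool scrub rpool && zpool wait -t scrub rpool
zpool status -v rpool   # still lists the old error at this point
zpool scrub rpool && zpool wait -t scrub rpool
zpool status -v rpool   # now reports no known data errors
```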
The steps are repeatable for me. I don't know if every step matters (e.g. extraneous load-key when I don't mount). I also have some other examples that fail at different points, but I figured these were simple enough to share.