ARC changes to fix memory related crashes with encryption #15538
module/zfs/arc.c
Outdated
HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr),
(u_longlong_t)zb->zb_objset,
(u_longlong_t)zb->zb_object);
}
Return error here? Otherwise it's going to immediately panic when abd_copy_to_buf derefs b_pabd.
Oh never mind, maybe your purpose is to just report more when it does crash?
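The alternative raised in this review thread, returning an error instead of letting a NULL buffer be dereferenced, can be sketched in isolation. This is only an illustration under stated assumptions: `mock_fill_hdr_t` and `mock_fill()` are hypothetical stand-ins, not the real ARC header type or the real `arc_buf_fill()` / `abd_copy_to_buf` code paths.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an ARC header; not the real arc_buf_hdr_t. */
typedef struct mock_fill_hdr {
	const char *pabd;	/* physical data; NULL if already released */
	size_t psize;		/* number of valid bytes behind pabd */
} mock_fill_hdr_t;

/*
 * Copy the header's physical data into dst.
 * Returns 0 on success, or EIO when the source buffer is missing or
 * dst is too small, rather than dereferencing a NULL pointer.
 */
static int
mock_fill(const mock_fill_hdr_t *hdr, char *dst, size_t dstlen)
{
	if (hdr->pabd == NULL || dstlen < hdr->psize)
		return (EIO);
	memcpy(dst, hdr->pabd, hdr->psize);
	return (0);
}
```

With a guard like this the caller sees an I/O error and can fail the read, which is the trade-off the two comments above are weighing against a panic that preserves the crash state.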
module/zfs/sa.c
Outdated
@@ -360,7 +360,7 @@ sa_attr_op(sa_handle_t *hdl, sa_bulk_attr_t *bulk, int count,
}
}
if (error && error != ENOENT) {
return ((error == ECKSUM) ? EIO : error);
I like these two but are they related or snuck into the wrong commit?
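For readers skimming the hunk, the quoted return statement translates a checksum error into a generic I/O error and passes every other error through unchanged. A minimal standalone sketch of that mapping follows; `MOCK_ECKSUM` and `mock_map_error()` are hypothetical names, and the numeric value is a placeholder that does not match any real `ECKSUM` definition.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder value for illustration only; not the real ECKSUM. */
#define	MOCK_ECKSUM	1000

/* Translate a checksum error into EIO; pass all other errors through. */
static int
mock_map_error(int error)
{
	return ((error == MOCK_ECKSUM) ? EIO : error);
}
```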
I'll give this a shot on my exotic system which somehow reliably crashes in this way and report back.
I am sad to report
If you're not expecting this to have fixed that, ignore me, but I was assuming the changes in arc_write would have been expected to give this a miss.
e: Sorry, disregard, apparently I had loaded the wrong clone's modules. Trying again...
e2: It didn't panic so far, but it did fail a test run of
edit 3: It seems like, so far, about a 50% rate of the tests failing with dbgmsg logs like the above, but no panics yet, so that's nice.
I am sorry, but how is this a valid fix for anything? You are merely burying the issues, whatever they are, deeper, unless you are just collecting data for later debugging. I don't think we should pollute the code this way.
module/zfs/arc.c
Outdated
if (HDR_PROTECTED(hdr) == B_FALSE) {
zfs_dbgmsg("allocating rabd on a not-encrypted HDR");
}
ASSERT(HDR_PROTECTED(hdr));
This is already asserted with the IMPLY() above.
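As a rough illustration of why the extra `ASSERT` is redundant: an implication check of the form `IMPLY(A, B)` holds exactly when `!(A) || (B)` holds. The sketch below models that with hypothetical names (`mock_implies()`, `MOCK_IMPLY`, `mock_alloc_path_ok()`, and the `alloc_rdata` parameter); it is not the real macro definition or the real allocation function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: logical implication, "a implies b". */
static bool
mock_implies(bool a, bool b)
{
	return (!a || b);
}

/* Hypothetical macro: abort if A holds while B does not. */
#define	MOCK_IMPLY(A, B)	assert(mock_implies((A), (B)))

/*
 * Model of the reviewed branch. Once the implication has been checked,
 * reaching the raw-data allocation path (alloc_rdata == true) already
 * guarantees hdr_protected, so asserting it again adds nothing.
 */
static bool
mock_alloc_path_ok(bool alloc_rdata, bool hdr_protected)
{
	MOCK_IMPLY(alloc_rdata, hdr_protected);
	if (alloc_rdata)
		return (hdr_protected);	/* always true at this point */
	return (true);
}
```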
@rincebrain If your test case is so reproducible, could somebody look at it more closely to find out what is going on? A reliably faulting assertion is a very good starting point for investigation.
It's extremely reproducible on this one system, which is a very strange little sparc single core single thread box. I've been begging people to look into it for years at this point.
@rincebrain what is the workload/test? I'd like to attempt to reproduce (I don't have a sparc but could try a single core VM).
#11679 You can make it go bang on this very strange box a little over 50% of the time by doing...really any of the native encryption send/recvs, but `zfs_receive_raw` is what I usually run in a loop to make it go bang. I've not found anything else that reproduces this quite so well, VM or baremetal.
Confirming that you are referring to
I am, yes. The other ones in zfs_receive that use --raw will also work, I think, but that's the one I use. For whatever strange reason, the little sparc I have, that panics over 50% of the time on baseline, and while it doesn't panic with this PR, it does log the "this shouldn't happen" dbgmsg and fail the test.
Thanks for confirming. Could you provide details on your sparc box? (processor gen, RAM, CPUs, 64bit?)
It's a little Netra T1 105, running Debian sid sparc64. Kernel version doesn't seem to matter, I've been complaining about this since Linux 3.x, I think, and it's now running 5.10. Doesn't seem to be all sparcs that break quite so reliably, I have a much beefier sparc64 (a T2) that doesn't reproduce this nearly as reliably, but does have the nice property of compiling things for the little 1c1t much faster than it itself does. :) I've got a whole disk image of it from years ago that works but doesn't seem to reliably repro this, but I don't know if that's qemu being very unreliable at sparc emulation or what.
- Bail early in arc_buf_fill() when hdr->b_crypt_hdr.b_rabd is NULL
- In arc_write(), avoid arc_hdr_free_abd() when HDR_IO_IN_PROGRESS indicates it's still in use.

Sponsored-By: Odoo SA
Sponsored-By: Klara Inc.
Signed-off-by: Don Brady <[email protected]>
3090be2 to 38f7449
@rincebrain -- I rebased to latest master branch and I removed all the unnecessary debug messages.
Current git apparently just doesn't compile on my system after 60e389c. Since I've built git since that commit without a problem, I'm not really sure why it's only upset now, but that pragma didn't exist in gcc 10, it appears.
I'm going to guess the check in 3c1e193 is just broken and assumes CC=KERNEL_CC incorrectly, and I just broke it recently by installing an additional userland compiler that's new enough to support that. I've hacked around it with a hardcoded
e: that test should probably be trying to build a dummy kernel module with that flag, not userland, to avoid trying to manually extract the kernel compiler if it's not explicitly given...
Unpatched (except to work around the above) git:
With it, sometimes the test
and nothing in dbgmsg now with the debug prints removed.
After spending a week in this area I can say that arc_untransform() is one big mess. ARC is simply unable to properly lock it, since its hash locks do not protect anonymous buffers. Protection of those should be handled at higher levels. The best idea I've got so far is #16104. I hope it will be enough for now. In general the ZIO and ARC layers seem to be perfectly fine with mixing plain-text/compressed/encrypted buffers, while the DBUF layer was just not designed for that.
Replaced by #16104.
Motivation and Context
Odoo has reported various crashes in ZFS when using native encryption on the ZFS 2.1.x branches. On many occasions, it was a null pointer dereference, as reported in #12775
Description
The following changes have prevented reoccurrence of the issues.
- `pabd` in `arc_buf_fill()`: Bail early in `arc_buf_fill()` when `hdr->b_crypt_hdr.b_rabd` is NULL.
- In `arc_write()`, avoid `arc_hdr_free_abd()` when `HDR_IO_IN_PROGRESS` indicates it's still in use.

Sponsored-By: Odoo SA
Sponsored-By: Klara Inc.
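
The `arc_write()` change described above amounts to a "do not free while in use" guard. A minimal sketch of that pattern, assuming a hypothetical header type and helper (`mock_write_hdr_t`, `mock_hdr_try_free_abd()`) rather than the real ARC structures, and using a plain boolean in place of the `HDR_IO_IN_PROGRESS` flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an ARC header; not the real arc_buf_hdr_t. */
typedef struct mock_write_hdr {
	void *abd;		/* data buffer; NULL once released */
	bool io_in_progress;	/* models the HDR_IO_IN_PROGRESS flag */
} mock_write_hdr_t;

/*
 * Release the header's buffer unless an I/O may still reference it.
 * Returns true only if the buffer was actually freed.
 */
static bool
mock_hdr_try_free_abd(mock_write_hdr_t *hdr)
{
	if (hdr->abd == NULL || hdr->io_in_progress)
		return (false);
	free(hdr->abd);
	hdr->abd = NULL;
	return (true);
}
```

In this sketch the buffer simply stays attached until the in-progress flag clears; whatever completes the I/O would be responsible for retrying the release.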
How Has This Been Tested?
- Tested with `ztest` runs with KSAN enabled and encryption forced for every dataset
- Tested with ZTS `functional/cli_root` and `functional/rsend`, both will exercise zfs encryption paths
- Original patches tested on a system that was exhibiting the memory related errors to confirm it addressed the issue.