-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zvol: ensure device minors are properly cleaned up #16364
Conversation
@robn great detective work on this! I will take a look |
Hopefully we can get one more reviewer to approve this one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with this revision of the code and will need a deeper look, but I have feeling that both existing and new design are sub-optimal for FreeBSD. Unlike Linux FreeBSD has a way to destroy open device without waiting indefinitely for application to close the device. Application may stay with open device as long as it desire, but any request other than close will fail fast within devfs layer and never reach the ZVOL code, which can safely destroy all it wants. Its being a while since I've done it last time, but whether it is a geom of dev volmode, zvols should be able to signal wanted destruction before waiting for the close at the OS-independent code.
ZVOL_DUMPIFIED is a vestigial Solaris leftover, and not used anywhere. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
SET_ERROR is our facility for tracking errors internally. The negation is to match the what the kernel expects from us. Thus, the negation should happen outside of the SET_ERROR. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Currently, if a minor is in use when we try to remove it, we'll skip it and never come back to it again. Since the zvol state is hung off the minor in the kernel, this can get us into weird situations if something tries to use it after the removal fails. It's even worse at pool export, as there's now a vestigial zvol state with no pool under it. It's weirder again if the pool is subsequently reimported, as the zvol code (reasonably) assumes the zvol state has been properly setup, when it's actually left over from the previous import of the pool. This commit attempts to tackle that by setting a flag on the zvol if its minor can't be removed, and then checking that flag when a request is made and rejecting it, thus stopping new work coming in. The flag also causes a condvar to be signaled when the last client finishes. For the case where a single minor is being removed (eg changing volmode), it will wait for this signal before proceeding. Meanwhile, when removing all minors, a background task is created for each minor that couldn't be removed on the spot, and those tasks then wake and clean up. Since any new tasks are queued on to the pool's spa_zvol_taskq, spa_export_common() will continue to wait at export until all minors are removed. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Last commit should fix the underlying problem, so these should be passing reliably again. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
@amotin thanks, I'll look into it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robn thanks for running this down and the clear description of the issue. Oh and I agree, there's definitely lots of room for improvement here if somebody has the time.
SET_ERROR is our facility for tracking errors internally. The negation is to match the what the kernel expects from us. Thus, the negation should happen outside of the SET_ERROR. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Closes #16364
Currently, if a minor is in use when we try to remove it, we'll skip it and never come back to it again. Since the zvol state is hung off the minor in the kernel, this can get us into weird situations if something tries to use it after the removal fails. It's even worse at pool export, as there's now a vestigial zvol state with no pool under it. It's weirder again if the pool is subsequently reimported, as the zvol code (reasonably) assumes the zvol state has been properly setup, when it's actually left over from the previous import of the pool. This commit attempts to tackle that by setting a flag on the zvol if its minor can't be removed, and then checking that flag when a request is made and rejecting it, thus stopping new work coming in. The flag also causes a condvar to be signaled when the last client finishes. For the case where a single minor is being removed (eg changing volmode), it will wait for this signal before proceeding. Meanwhile, when removing all minors, a background task is created for each minor that couldn't be removed on the spot, and those tasks then wake and clean up. Since any new tasks are queued on to the pool's spa_zvol_taskq, spa_export_common() will continue to wait at export until all minors are removed. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Closes #14872 Closes #16364
Last commit should fix the underlying problem, so these should be passing reliably again. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Closes #16364
ZVOL_DUMPIFIED is a vestigial Solaris leftover, and not used anywhere. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16364
SET_ERROR is our facility for tracking errors internally. The negation is to match the what the kernel expects from us. Thus, the negation should happen outside of the SET_ERROR. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16364
Currently, if a minor is in use when we try to remove it, we'll skip it and never come back to it again. Since the zvol state is hung off the minor in the kernel, this can get us into weird situations if something tries to use it after the removal fails. It's even worse at pool export, as there's now a vestigial zvol state with no pool under it. It's weirder again if the pool is subsequently reimported, as the zvol code (reasonably) assumes the zvol state has been properly setup, when it's actually left over from the previous import of the pool. This commit attempts to tackle that by setting a flag on the zvol if its minor can't be removed, and then checking that flag when a request is made and rejecting it, thus stopping new work coming in. The flag also causes a condvar to be signaled when the last client finishes. For the case where a single minor is being removed (eg changing volmode), it will wait for this signal before proceeding. Meanwhile, when removing all minors, a background task is created for each minor that couldn't be removed on the spot, and those tasks then wake and clean up. Since any new tasks are queued on to the pool's spa_zvol_taskq, spa_export_common() will continue to wait at export until all minors are removed. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Closes openzfs#14872 Closes openzfs#16364
Last commit should fix the underlying problem, so these should be passing reliably again. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16364
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
While running the test suite to validate a ZIL change, I kept tripping #14872. Since it was important to me to ensure that I wasn't inadvertently making a bad situation worse, I decided I had to understand that issue first, and ideally fix it.
#14872 is not actually related to the ZIL; the assertion is just thing that noticed a problem at the end of a sequence of "impossible" events. This PR is attempting to tackle the root cause.
I should say that I am pretty unfamiliar with the zvol code, and since this was definitely not on my main quest line, I have not been as diligent with my research as I usually am. So I'll explain what I think is happening first, then talk about the patches and how they attempt to sort it out.
Fixes #14872.
May also fix #12621, #12733.
So as far as ZFS is concerned, a device node in
/dev
is for zvols what a mount is for filesystems: some connection point that makes a certain type of dataset available to the rest of the system. That is, creating or destroying a device node (what the code calls a "minor") is roughly analogous to mounting or unmounting a filesystem.The equivalent to “unmounting” is ultimately handled in
zvol_remove_minors_impl()
andzvol_remove_minor_impl()
, which are effectively identical; one just loops over all the zvols, the other just for a single zvol. They remove the node from/dev
, remove the associatedzvol_state_t
fromzvol_global_list
, free up the block device resources in the kernel and release any other memory.These functions contain this check:
As the comment says, if the zvol is in use (someone has called
open()
) or has been suspended (incoming requests held, used during rollback and recv), then it is skipped.The key problem here is that nothing ever comes back to actually remove the minor, so there’s a live node in
/dev
at a time when ZFS believes it’s gone.Happily, there’s not many cases where this matters in practice. Minors are removed when changing the
volmode
andsnapdev
properties, and when renaming a volume. Since the effect ofopen()
'ing a zvol device node is a name->objset_t
lookup, the worst case is that you get a handle on a block device that used to be available by that name, or you get nothing because it no longer exists.At pool export, it’s a different matter. At the top of
spa_export_common()
, we see:The
zvol_remove_minors()
call creates a task that callszvol_remove_minors_impl()
, and then waits until the the taskq it knows it will be dispatched on becomes empty. As above, if the zvol is in use at this moment, it is skipped, and ZFS continues with the export with the device node still visible place, and, crucially, azvol_state_t
for that zvol on the globalzvol_state_list
.There is a check a little further down in
spa_export_common()
to prevent export if the pool is still in use:However, it’s trivial to closing the last reference between after minor removal is skipped but before this check; there’s real work and possibly a whole transaction sync in that time.
So at this point there’s a device node in
/dev
, with the “old”zvol_state_t
behind it. Remarkably, an attempt to open it at this point will make it as far asdmu_objset_own()
. The pool hold will fail, because the pool isn’t there, soENOENT
will come back up and the open will be rejected.And if the pool is reimported? Well. The named objset probably actually exists now, so we now have this extraordinarily strange situation where the
zvol_state_t
from the previous import of the pool is being used for the current import. A very vague description of the result is that the zvol state and the pool state look more-or-less right to each other, so they mostly accept each other, but they are not what they seem, and may perform strangely.#14872 is merely a symptom of all this; something noticing the strange behaviour. It goes like this.
zvol_misc_trim
andzvol_misc_fua
both arrange for a write (a TRIM counts as a write for these purposes) to be issued very late in the pool’s lifecycle. In this case the kernel itself is actually the client; the filesystem on top of the zvol has been written to, and the kernel is flushing it.At export, the minor is removed, but the kernel has the device open, so it gets skipped, and the kernel can continue. It’s issuing sync writes, which of course go to the ZIL. If the timing is just right, these writes can continue after the call to
txg_wait_synced()
inspa_export_common()
. Since the next txg doesn’t pass, these writes are only on the ZIL attached to the zvol dataset.At import, the ZIL blocks are found on the zvol dataset, and the claim is issued, setting
zh_claim_txg
. The ZIL now must be replayed before the dataset can be used. The assertionASSERT(zh->zh_claim_txg == 0)
inzil_create()
ensures this, as the only thing that can zerozh_claim_txg
iszil_sync()
withzl_destroy_txg
set, and that is only done byzil_replay()
.ZIL replay is done at “mount” time, which for a zvol is when the minor is created. This happens in
zvol_os_create_minor()
, which is called fromzvol_create_minors_recursive()
during import. Early on, this callszvol_find_by_name_hash()
to see if thezvol_state_t
already exists for the name. It finds the one left over from the previous pool run, and so just returns, because the minor already exists. Because of that, it never gets to callingzil_replay()
.Soon, a write happens.
zvol_write()
callszil_commit()
, which callszil_create()
, and the unprocessed claim is discovered, and the assertion tripped.Description
First things first: the first two commits are trivial cleanup. They could easily stand alone, and I will break them into a separate PR if wanted. Meanwhile, the final commit is just re-enabling the disabled
zvol_misc
tests that this should fix.The real meat is in the third commit.
If the minor is not in use at the moment we check it, nothing changes - it just gets cleaned up on the spot.
If it is in use however, the behaviour changes. Instead of skipping the zvol, a new flag
ZVOL_REMOVING
gets set on thezvol_state_t
, and then arrangements are made for something to wait on a new per-zvol condvarzv_removing_cv
. When the last user finishes, if the flag is set, this condvar will be signaled and the waiter will go on to do the removal.Since we know we’re about to wait, we want to throw off any new clients. So
ZVOL_REMOVING
also causes new open or IO requests to returnENXIO
immediately.For
zvol_remove_minor_impl()
, the calling thread immediately starts waiting. This is a utility function that implements the minor removal forsnapdev
andvolmode
changes and volume rename, which are all admin actions, so blocking the thread won’t affect anything else.For
zvol_remove_minors_impl()
this is a little more complicated. This is always called on a task running fromspa_zvol_taskq
, and it runs through all the matching zvols on a loop. Rather than wait on the spot, we setZVOL_REMOVING
immediately and then dispatched a new task on that queue that immediately goes to sleep waiting for its zvol to finish.The
zvol_remove_minors_impl()
task is dispatched fromzvol_remove_minors()
, which is always called with itsasync
flag set, and only one of its callers actually cares about waiting for the minors to be removed. That caller is the snippet ofspa_export_common()
shown earlier, and it waits forspa_zvol_taskq
to be empty, and so will end up waiting until all minors are gone before completing the export.Notes
Having spent some time in this code, there’s a lot I don’t love about about how it does its async work. However, as noted above, I really don’t have time for anything more here at the moment, so what I’ve done is kind of the first step towards the improvements I’d like to do, while still solving the immediate problem.
void
returns. I’d want to check everything and make sure failures are sensible. This seems particularly important when there’s sequenceszvol_remove_minors()
andzvol_rename_minors()
both have an “async” flag option that is always set. This is pointless. This also means that I’ve technically broken synczvol_remove_minors()
, because it only waits for the initial task to complete, not any followups. I decided for this that I didn’t care; the one place that might benefit from it (spa_export_common
) is already doing its own thing to sync up. I could have just done adelay_list
inzvol_remove_minors_impl()
like I did inzvol_remove_minor_impl()
, but I suspect in the long run a taskq is more likely to be a better model.EBUSY
, or maybe there could be a more forcible detach.As with everything, I’ll get to it someday. For now, hopefully what’s here is enough to solve the problem at hand. If someone else wants a project, please take it on with my blessing 😬
How Has This Been Tested?
All testing done on Linux.
zvol_misc
test tag has been run a few hundred times on kernel 6.1.94. Previously it would trip the assert described in #14872 afterzvol_misc_trim
and/orzvol_misc_fua
maybe four out of five runs. Since then, no hits.zvol
test tag has been run a few dozen times, all passes. Full ZTS run twice, also all passes (modulo normal test suite flakiness).Meanwhile on FreeBSD, this has been compile checked only. The OS-specific changes are very small and conceptually identical to the Linux one, so I just followed the local style. Obviously needs a test, which I will try to do myself, but I’d appreciate it if someone could take a closer look.
Types of changes
Checklist:
Signed-off-by
.