hang in call to zil_lwb_flush_wait_all, blocking txg_sync #14982

prakashsurya · 2023-06-15T17:20:12Z

We've seen zil_lwb_flush_wait_all hang on a few systems, the stack looks like this:

Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257040] INFO: task txg_sync:2620216 blocked for more than 120 seconds.
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257183]       Tainted: P           OE     5.4.0-137-dx2023022419-17ff49f96-generic #154
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257509] txg_sync        D    0 2620216      2 0x80004080
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257511] Call Trace:
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257517]  __schedule+0x2de/0x710
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257521]  ? __wake_up_common_lock+0x8a/0xc0
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257523]  schedule+0x42/0xb0
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257531]  cv_wait_common+0x11e/0x160 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257532]  ? __wake_up_pollfree+0x40/0x40
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257535]  __cv_wait+0x15/0x20 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257719]  zil_lwb_flush_wait_all+0x51/0x90 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257831]  zil_sync+0x89/0x4c0 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257837]  ? spl_kmem_alloc_impl+0xdd/0x100 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257901]  ? dmu_objset_projectquota_enabled+0x50/0x50 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257905]  ? taskq_wait_check+0x31/0x40 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257909]  ? taskq_wait+0x2f/0xa0 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257976]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258035]  dmu_objset_sync+0x339/0x460 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258099]  dsl_dataset_sync+0x5e/0x1e0 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258168]  dsl_pool_sync+0xa0/0x410 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258254]  ? spa_errlog_sync+0x26c/0x290 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258330]  spa_sync_iterate_to_convergence+0xf1/0x250 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258410]  spa_sync+0x31c/0x6c0 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258483]  txg_sync_thread+0x22d/0x2d0 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258550]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258555]  thread_generic_wrapper+0x83/0xa0 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258558]  kthread+0x104/0x140
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258563]  ? kasan_check_write.constprop.0+0x10/0x10 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258564]  ? kthread_park+0x90/0x90
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258567]  ret_from_fork+0x1f/0x40

As a result of txg_sync being blocked, all TXGs halt, and thus all IO to this pool effectively stops.. we've resolved this via a reboot.. and we haven't yet been able to capture a kernel crash dump when this situation occurs.

So far, this situation seems relatively rare.. only been reported 2 times.. but since it causes all IO to the pool to hang, it's relatively impactful when it does happen, as it requires an admin to detect the situation and manually reboot the system to recover.

The text was updated successfully, but these errors were encountered:

amotin · 2023-06-15T20:39:37Z

@prakashsurya What exactly ZFS version (commit) are we talking about?

prakashsurya · 2023-06-15T20:45:17Z

Good question, sorry for not including that.. looking at the most recent report, it was running this commit d816bc5 .. merged with our Delphix changes, of course, but I don't think we have much/any downstream changes in the ZIL code paths.

prakashsurya · 2023-06-15T20:47:41Z

In what looks like another instance of this hang, the system was running an older Delphix release.. and the openzfs commit is a582d52

amotin · 2023-06-15T20:52:50Z

The first looks to be from February 23, long before my recent ZIL commits from May, that I was worrying about. So hopefully not me. ;) But it may be a consequence, not the cause. It just tells that some of ZIL writes of flushes got stuck didn't complete and zil_sync() is waiting for them. It may be hardware, or who knows what else in software.

prakashsurya · 2023-06-15T20:56:06Z

Right.. and since we don't have a kernel crash dump, I think it'll be difficult to make forward progress toward a root cause.. In both cases that I've looked at, the system was rebooted without collecting a crash dump, unfortunately.

But yea, I agree, based on the dates, likely unrelated to any of those recent changes.

grwilson · 2023-07-18T02:02:01Z

Here's the analysis from another instance of this hang:

This thread holds the rangelock as reader:

0xffff8c80d7ef4b00 UNINTERRUPTIBLE       1
                  __schedule+0x2cd
                  __schedule+0x2cd
                  schedule+0x69
                  io_schedule+0x16
                  cv_wait_common+0xab
                  __cv_wait_io+0x18
                  txg_wait_synced_impl+0x9b
                  txg_wait_synced+0x10
                  dmu_offset_next+0x11f
                  zfs_holey_common+0xa7
                  zfs_holey+0x50
                  zpl_llseek+0x89
                  ksys_lseek+0x69
                  ksys_lseek+0x69
                  __x64_sys_lseek+0x1a
                  __x64_sys_lseek+0x1a
                  __x64_sys_lseek+0x1a
                  do_syscall_64+0x59
                  do_syscall_64+0x59
                  entry_SYSCALL_64+0x99

The txg is stuck here waiting for inflight ZIL I/O to complete:

0xffff8c7c5d316400 UNINTERRUPTIBLE       1
                  __schedule+0x2cd
                  __schedule+0x2cd
                  schedule+0x69
                  cv_wait_common+0xf8
                  __cv_wait+0x15
                  zil_lwb_flush_wait_all+0x51
                  zil_sync+0x8d
                  dmu_objset_sync+0x338
                  dsl_dataset_sync+0x5e
                  dsl_pool_sync+0xa0
                  spa_sync_iterate_to_convergence+0xee
                  spa_sync+0x327
                  txg_sync_thread+0x1f9
                  thread_generic_wrapper+0x61
                  kthread+0x127
                  ret_from_fork+0x1f
                  0x0+0x0

But why aren't the ZIL writes completing? Looking at other ZIL thread we see that this thread wants the rangelock as writer:

0xffff8c80ddbf0000 UNINTERRUPTIBLE       1
                  __schedule+0x2cd
                  __schedule+0x2cd
                  schedule+0x69
                  cv_wait_common+0xf8
                  __cv_wait+0x15
                  zfs_rangelock_enter_writer+0x46
                  zfs_rangelock_enter_impl+0xf9
                  zfs_rangelock_enter+0x11
                  zfs_write+0x21c
                  zpl_iter_write+0xe7
                  new_sync_write+0x114
                  new_sync_write+0x114
                  vfs_write+0x189
                  vfs_write+0x189
                  ksys_write+0x67
                  __x64_sys_write+0x1a
                  __x64_sys_write+0x1a
                  __x64_sys_write+0x1a
                  do_syscall_64+0x59
                  do_syscall_64+0x59
                  entry_SYSCALL_64+0x99

This means that any threads which may want the rangelock as reader will now block giving priority to the writer. Here's one such thread:

0xffff8c7ee4569900 UNINTERRUPTIBLE       1
                  __schedule+0x2cd
                  __schedule+0x2cd
                  schedule+0x69
                  cv_wait_common+0xf8
                  __cv_wait+0x15
                  zfs_rangelock_enter_reader+0xa1
                  zfs_rangelock_enter_impl+0x11d
                  zfs_rangelock_enter+0x11
                  zfs_get_data+0xf1
                  zil_lwb_commit+0x25f
                  zil_lwb_write_issue+0x30e
                  zil_commit_waiter_timeout+0x155
                  zil_commit_waiter+0x109
                  zil_commit_impl+0x6d
                  zil_commit+0x5c
                  zpl_writepages+0xf0
                  do_writepages+0xc2
                  __writeback_single_inode+0x44
                  writeback_sb_inodes+0x236
                  wb_writeback+0xbb
                  wb_workfn+0xd8
                  wb_workfn+0xd8
                  process_one_work+0x228
                  worker_thread+0x4d
                  kthread+0x127
                  ret_from_fork+0x1f
                  0x0+0x0

Unfortunately this thread is in the middle of issuing ZIL I/O which is why zil_lwb_flush_wait_all waiting and spa_sync is not making any progress.

And there we have the deadlock:

thread 1 holds does a txg_wait_synced while holding the rangelock as reader
thread 2 (i.e. sync thread) is waiting for inflight ZIL I/O to finish
thread 3 wants the rangelock as writer causing future readers to block
thread 4 is in the middle of issuing ZIL I/O but needs the rangelock as reader

This commit is made a change which now ties ZIL completion with spa_sync:

commit 152d6fda54e61042a70059c95c44b364aea0bbd8
Author: Kevin Jin <[email protected]>
Date:   Thu May 26 12:36:14 2022 -0400

    Fix inflated quiesce time caused by lwb_tx during zil_commit()

    In current zil_commit() process, transaction lwb_tx is assigned in
    zil_lwb_write_issue(), and is committed in zil_lwb_flush_vdevs_done().
    Thus, during lwb write out process, the txg is held in open or quiesing
    state, until zil_lwb_flush_vdevs_done() is called. If the zil's zio
    latency is high, it will cause txg_sync_thread() to starve.

    The goal here is to defer waiting for zil_lwb_flush_vdevs_done to the
    'syncing' txg state. That is, in zil_sync().

    In this patch, it achieves the goal without holding transaction.
    A new function zil_lwb_flush_wait_all() is introduced. It waits for
    the completion of all the zil_lwb_flush_vdevs_done() by given txg.

    Reviewed-by: Alexander Motin <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Prakash Surya <[email protected]>
    Signed-off-by: jxdking <[email protected]>
    Closes #12321

Now consumers of txg_wait_synced could cause deadlocks as seen above. @jxdking can you take a look and comment?

amotin · 2023-07-18T19:45:10Z

@grwilson Thank you for your analysis, but I think it is not @jxdking's 152d6fd, but mine f63811f. The problem is that postponing zl_get_data() call to move it out of zl_issuer_lock also moved it past zl_lwb_inflight increment in zil_lwb_write_close() under the lock. As result, the LWB is already counted as inflight, that blocks the TXG sync, but can't be issued before we get the data for it, which data we can not get due to range lock being held.

I see two possible solutions: either use zfs_rangelock_tryenter() and report error if we can't get it, in which case ZIL code will simulate write error and fall back to txg_wait_synced(); or we could implement high-priority read range lock, similar to what spa_config_enter_mmp(), that would bypass writers. The first may create additional delays during extra syncs, while the second won't help if the range lock is originally held by writer. I am open for any better ideas.

amotin · 2023-07-19T14:00:39Z

In other words, there should be nothing that can block TXG commit after we've done next block allocation, else we either dead deadlock or a space leak in case of crash. I have chosen pretty big windmills to tilt this time. Need to think more.

grwilson · 2023-07-19T21:06:15Z

I am concerned that syncing context is now waiting for open context I/Os which seems counterintuitive.

amotin · 2023-07-19T21:14:24Z

I am concerned that syncing context is now waiting for open context I/Os which seems counterintuitive.

@grwilson There always was a wait. Just previously the I/Os were blocking TXG transition from quiescing to syncing by holding TXs open during I/O. That is what @jxdking was trying to relax.

grwilson · 2023-07-28T12:08:48Z

@amotin any further thoughts on how to address this?

amotin · 2023-07-28T13:08:40Z

@grwilson Yes, I think so. I have one more ZIL refactoring patch that just passed first basic tests yesterday. It supposed to fix all of those kinds of deadlocks by postponing block allocation until after all copies are done and so allowing unbound sleeps anywhere, since TXG sync no longer need to wait for it. I should publish it nearest days if all go well.

The previous patch openzfs#14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue openzfs#14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc.

The previous patch openzfs#14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue openzfs#14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between write, I've found that null zios do not wait for logical children ready status in zio_ready(), that makes parent write to proceed prematurely, producing incorrect log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc.

amotin · 2023-08-01T23:17:28Z

@grwilson #15122 passes by tests. Now is time for yours.

The previous patch openzfs#14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue openzfs#14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between write, I've found that null zios do not wait for logical children ready status in zio_ready(), that makes parent write to proceed prematurely, producing incorrect log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc.

The previous patch #14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue #14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between write, I've found that null zios do not wait for logical children ready status in zio_ready(), that makes parent write to proceed prematurely, producing incorrect log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15122

The previous patch openzfs#14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue openzfs#14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between write, I've found that null zios do not wait for logical children ready status in zio_ready(), that makes parent write to proceed prematurely, producing incorrect log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15122

The previous patch #14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue #14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between write, I've found that null zios do not wait for logical children ready status in zio_ready(), that makes parent write to proceed prematurely, producing incorrect log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15122

The previous patch openzfs#14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue openzfs#14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between write, I've found that null zios do not wait for logical children ready status in zio_ready(), that makes parent write to proceed prematurely, producing incorrect log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Co-authored-by: Alexander Motin <[email protected]>

The previous patch openzfs#14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue openzfs#14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between write, I've found that null zios do not wait for logical children ready status in zio_ready(), that makes parent write to proceed prematurely, producing incorrect log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15122

scineram · 2024-01-03T11:02:01Z

Is this an issue after #15122?

prakashsurya added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jun 15, 2023

amotin mentioned this issue Jul 19, 2023

ZIL: Fix config lock deadlock. #15080

Merged

13 tasks

amotin mentioned this issue Jul 30, 2023

ZIL: Second attempt to reduce scope of zl_issuer_lock. #15122

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hang in call to zil_lwb_flush_wait_all, blocking txg_sync #14982

hang in call to zil_lwb_flush_wait_all, blocking txg_sync #14982

prakashsurya commented Jun 15, 2023 •

edited

Loading

amotin commented Jun 15, 2023

prakashsurya commented Jun 15, 2023

prakashsurya commented Jun 15, 2023

amotin commented Jun 15, 2023 •

edited

Loading

prakashsurya commented Jun 15, 2023

grwilson commented Jul 18, 2023

amotin commented Jul 18, 2023

amotin commented Jul 19, 2023

grwilson commented Jul 19, 2023

amotin commented Jul 19, 2023

grwilson commented Jul 28, 2023

amotin commented Jul 28, 2023

amotin commented Aug 1, 2023

scineram commented Jan 3, 2024

hang in call to zil_lwb_flush_wait_all, blocking txg_sync #14982

hang in call to zil_lwb_flush_wait_all, blocking txg_sync #14982

Comments

prakashsurya commented Jun 15, 2023 • edited Loading

amotin commented Jun 15, 2023

prakashsurya commented Jun 15, 2023

prakashsurya commented Jun 15, 2023

amotin commented Jun 15, 2023 • edited Loading

prakashsurya commented Jun 15, 2023

grwilson commented Jul 18, 2023

amotin commented Jul 18, 2023

amotin commented Jul 19, 2023

grwilson commented Jul 19, 2023

amotin commented Jul 19, 2023

grwilson commented Jul 28, 2023

amotin commented Jul 28, 2023

amotin commented Aug 1, 2023

scineram commented Jan 3, 2024

prakashsurya commented Jun 15, 2023 •

edited

Loading

amotin commented Jun 15, 2023 •

edited

Loading