ZFS send hangs sometimes #16731

Open
rkojedzinszky opened this issue Nov 7, 2024 · 10 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@rkojedzinszky
Contributor

rkojedzinszky commented Nov 7, 2024

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | TrueNAS-CORE/FreeBSD |
| Distribution Version | 13.3 |
| Kernel Version | 13.3-RELEASE-p7 |
| Architecture | amd64 |
| OpenZFS Version | 2.2.6 + truenas patches |

Describe the problem you're observing

TrueNAS uses zettarepl to replicate ZFS datasets to remote sites. Occasionally during a replication cycle, zfs send hangs: it sits idle and writes nothing to its output. As a workaround, I've added a simple pipe command that reads the output of zfs send and passes the data through. This command reports when no output has been received from zfs send for minutes, and then kills zfs send. It also reports that zfs send has usually sent only a few thousand bytes by that point. Killing zfs send resolves the problem: on the next cycle it usually sends the snapshots completely, without errors.
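For anyone who wants to try the same workaround, here is a minimal sketch of such a pass-through watchdog. It is not the code from my middleware change; the function name `relay_with_watchdog`, the timeout value, and the snapshot name in the usage comment are assumptions made up for illustration. It relies on `select`, so it is POSIX-only.

```python
import os
import select
import subprocess
import sys


def relay_with_watchdog(cmd, out_fd, idle_timeout):
    """Run cmd, copy its stdout to out_fd, and kill it after
    idle_timeout seconds without any output.

    Returns (bytes_relayed, timed_out).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    src_fd = proc.stdout.fileno()
    total = 0
    timed_out = False
    try:
        while True:
            ready, _, _ = select.select([src_fd], [], [], idle_timeout)
            if not ready:
                # Nothing arrived within idle_timeout: treat the sender as hung.
                timed_out = True
                proc.kill()
                break
            chunk = os.read(src_fd, 1 << 16)
            if not chunk:
                break  # EOF: the sender finished on its own
            view = memoryview(chunk)
            while view:
                # os.write may write less than requested, so loop.
                written = os.write(out_fd, view)
                view = view[written:]
            total += len(chunk)
    finally:
        proc.stdout.close()
        proc.wait()
    return total, timed_out


# Hypothetical usage (the snapshot name is a placeholder):
#   sent, hung = relay_with_watchdog(
#       ["zfs", "send", "pool/data@snap"], sys.stdout.fileno(), 300)
#   if hung:
#       print(f"zfs send idle, killed after {sent} bytes", file=sys.stderr)
```

The byte counter is what lets the wrapper report how little data was sent before the hang.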

I must note here that the zfs used by TrueNAS contains this PR. I suspect that commit 6bdc725 may be the source of my issue.

I usually get send errors once every week or two. I cannot reproduce the hang on demand, but I will now try running without this patch and see whether it makes a difference.

Describe how to reproduce the problem

Unfortunately, cannot reproduce.

Include any warning/errors/backtraces from the system logs

@rkojedzinszky rkojedzinszky added the Type: Defect Incorrect behavior (e.g. crash, hang) label Nov 7, 2024
rkojedzinszky added a commit to dravanet/truenas-middleware that referenced this issue Nov 7, 2024
@vedadkajtaz

I'm facing a somewhat similar issue.

It occurs only on one out of 7 servers that all run the same OS (FreeBSD 14.1, OpenZFS 2.2.4), and use the same (home-made) zfs replication software.

It is basically a series of zfs send | ssh 'zfs receive'.
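Roughly, each step looks like the sketch below. This is not the actual replication software; `run_pipeline` and the command lists in the usage comment (host, dataset, and snapshot names) are placeholders for illustration.

```python
import subprocess


def run_pipeline(send_cmd, recv_cmd):
    """Run `send_cmd | recv_cmd` and return (send_rc, recv_rc)."""
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    receiver = subprocess.Popen(recv_cmd, stdin=sender.stdout)
    # Drop our copy of the pipe's read end. If the receiver exits early,
    # the sender only notices (SIGPIPE/EPIPE) on its next write.
    sender.stdout.close()
    recv_rc = receiver.wait()
    send_rc = sender.wait()
    return send_rc, recv_rc


# Hypothetical usage (all names are placeholders):
#   run_pipeline(
#       ["zfs", "send", "-i", "pool/data@prev", "pool/data@now"],
#       ["ssh", "backup-host", "zfs receive pool/backup/data"])
```

Note that the sender learns about a dead receiver only when it writes, which matters for a sender that has stopped writing.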

The zfs send seldom hangs (roughly once a week), on different datasets being sent.

Killing the piped ssh process does nothing either (which is expected, zfs send doesn't get an EPIPE since it doesn't send anything). Killing the zfs send works, and further iterations are okay, usually for a few days only.

There are no related dmesg messages, the pool is healthy (frequently scrub'd).

I'd be glad to help investigate the issue, but I don't know where to look.

@rkojedzinszky
Contributor Author

@vedadkajtaz thanks!

I have to correct myself, zfs-2.2.4 also contains the suspected patch. If you would be able to test openzfs without commit 6bdc725, that would help. I am running 3 boxes now with that reverted, only for a few days, without hanging zfs send. But, I definitely would need more time to declare this as a possible cause.

@rkojedzinszky
Contributor Author

@vedadkajtaz did you have a chance to build zfs userspace with 6bdc725 reverted? Then, that will need some time, but according to your experiments, in a week you'll be able to report some results.

@vedadkajtaz

> @vedadkajtaz did you have a chance to build zfs userspace with 6bdc725 reverted? Then, that will need some time, but according to your experiments, in a week you'll be able to report some results.

Hi, I haven't done anything yet regarding this, possibly/likely next week, sorry.

@rkojedzinszky
Contributor Author

@nabijaczleweli I can report that reverting the mentioned commit caused no zfs send issues on 3 FreeBSD based NAS servers for more than a week now. Can you have a look at the commit?

@nabijaczleweli
Contributor

It looked sound back then so it looks sound now. No-one seems to have posted a strace (or backtrace) that would indicate where these hang, that commit basically doesn't touch the actually-sending-stuff thread at all, and all the setup is deterministic AFAICT. This bug hasn't left "oh i see this sometimes". I can't evaluate data you're withholding.

@vedadkajtaz

I have a hung process (with stock binary, FreeBSD 14.1, OpenZFS 2.2.4) right now.

There is no strace on FreeBSD. truss was not helpful, no system calls whatsoever.
Here's the backtrace from gdb:

(gdb) bt
#0  0x000009ad512dae2c in ?? () from /lib/libthr.so.3
#1  0x000009ad512dfa2e in ?? () from /lib/libthr.so.3
#2  0x000009ad4dbcec83 in ?? () from /lib/libzfs.so.4
#3  0x000009ad4dbd0fbe in ?? () from /lib/libzfs.so.4
#4  0x000009ad4dbbf22a in zfs_iter_snapshots_sorted_v2 () from /lib/libzfs.so.4
#5  0x000009ad4dbd0876 in ?? () from /lib/libzfs.so.4
#6  0x000009ad4dbcc684 in ?? () from /lib/libzfs.so.4
#7  0x000009ad4dbcc1b6 in zfs_send () from /lib/libzfs.so.4
#8  0x000009a527a181f0 in ?? ()
#9  0x000009a527a12010 in ?? ()
#10 0x000009ad4f7b0a6a in __libc_start1 () from /lib/libc.so.7
#11 0x000009a527a1108d in ?? ()

Not super helpful without debugging symbols, but it's obviously stuck in libthr, which seems to indicate @rkojedzinszky is likely right.
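To make the "stuck in libthr" reading explicit, here is a small sketch that pulls the frame number, function name, and shared object out of gdb `bt` lines in the format shown above. The helper name `frame_objects` is made up, and the regex only covers this symbol-less layout; frames printed with source locations would need a different pattern.

```python
import re

# Matches lines such as:
#   #0  0x000009ad512dae2c in ?? () from /lib/libthr.so.3
#   #8  0x000009a527a181f0 in ?? ()
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\S+) \(\)(?: from (\S+))?")


def frame_objects(backtrace):
    """Return a list of (frame_number, function, shared_object_or_None)."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames
```

The innermost entries (frames 0 and 1) are where the thread is currently blocked, and both resolve to /lib/libthr.so.3 here.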

@nabijaczleweli
Contributor

nabijaczleweli commented Nov 17, 2024

Would it be a terrible bother to take a backtrace, with symbols, of all the threads, so we don't have to guess what's happening? Attaching the strace-equivalent should in general be easier and tell you in which syscall each thread is stuck, but I don't really know if FreeBSD possesses this ability.
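One way to capture that is sketched below, assuming gdb is installed on the affected box and the binaries have symbols. The helper names are made up; `-p`, `-batch`, `-ex`, and `thread apply all bt` are standard gdb usage. On FreeBSD, `procstat -kk <pid>` should additionally print the kernel stack of each thread, which is probably the closest thing to seeing which syscall each one is blocked in.

```python
import subprocess


def gdb_all_threads_argv(pid):
    """Build a gdb command line that attaches to pid, prints a
    backtrace for every thread, and then exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_all_threads(pid):
    """Run gdb against the hung process and return its output as text."""
    result = subprocess.run(
        gdb_all_threads_argv(pid),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero if attaching fails; keep the output anyway
    )
    return result.stdout + result.stderr
```

Running `dump_all_threads(pid)` against the hung zfs send and pasting the output here would cover every thread, not just the main one.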

@vedadkajtaz

I'll rebuild the stock (i.e. releng/14.1) binaries with debugging symbols over the next few days, and wait for the next hang.

@rkojedzinszky
Contributor Author

@nabijaczleweli unfortunately, I can only add that since I started running my servers with the mentioned patch reverted, I have not seen any hung zfs processes.


3 participants