Add module parameter to disable prefetch in zfs_readdir #16318
base: master
Conversation
Would it make sense to change this from a boolean to a scalar threshold value on the directory link count to determine whether or not to prefetch? |
Yes, it probably does make sense. I'm willing to try. (I didn't actually write the code, but I can probably figure that out.) However, before doing that it's worth asking whether we can solve the underlying problem. The problem occurs because NFS wants to know the name that the directory has in its parent directory. There's a function to get the name of an inode. It's specific to the file system type. ZFS doesn't have it, so the generic fallback is used. The fallback ends up using zfs_readdir to search the directory looking for the inode. So the obvious question is: can we create a zfs_getname that does it better? Is there somewhere in the ZFS data structures where we could cache the filename? That way when zfs_readdir (or our replacement) looks through the parent directory, it can cache the name of every inode as it goes by. Then when we come to look up that node, the name will already be there, and zfs_getname won't need to do a search at all. Since I'm not a ZFS developer, I don't know the data structures. Is there a place the name can be put, and obviously cleared if there's a rename or delete? With a normal file, it can be in more than one directory, so a single entry wouldn't work. But for a directory, there's only one name that it has in its parent (i.e. ".."), so we don't have that ambiguity. If someone could point me to what needs to be done, I'm willing to code it and test it. I do have experience doing kernel code. (I was one of the original Linux developers when Linus was first writing Linux.) |
For clarity, the actual function is int get_name(const struct path *path, char *name, struct dentry *child);
The problem occurs when data is in cache on the client, but the dentry for the child directory isn't in the dentry cache anymore. So the data can't be stored in the dentry, since it may not exist for some of the directories whose names we'd like to cache. |
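For reference, a heavily simplified sketch of what that generic fallback does (the real code is in fs/exportfs/expfs.c; details like name-length checks, locking, and file handling are omitted here): it walks the parent directory with the normal readdir machinery and matches entries by inode number, which is why every reconnect ends up going through zfs_readdir.

```c
/* Simplified sketch of the generic exportfs get_name() fallback. */
struct getdents_callback {
	struct dir_context ctx;
	char *name;	/* buffer to fill with the child's name */
	u64 ino;	/* inode number of the child being searched for */
	int found;
};

static bool filldir_one(struct dir_context *ctx, const char *name, int len,
    loff_t pos, u64 ino, unsigned int d_type)
{
	struct getdents_callback *buf =
	    container_of(ctx, struct getdents_callback, ctx);

	if (ino == buf->ino) {		/* match on inode number only */
		memcpy(buf->name, name, len);
		buf->name[len] = '\0';
		buf->found = 1;
		return false;		/* stop iterating */
	}
	return true;			/* keep scanning the directory */
}

/* get_name() opens the parent and calls iterate_dir() with this callback,
 * which for ZFS lands in zfs_readdir() and scans every entry, prefetching
 * each entry's dnode along the way. */
```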
have a very large number of subdirectories.
A reasonable value would be 20000. |
To be pedantic, please consider specifying what the units are.
I've changed it to have a limit rather than a boolean. If non-zero, any directory larger than the limit disables prefetch. Note that it fails checkstyle. Checkstyle demands that I use a POSIX type, replacing ulong with ulong_t, but the macro doesn't support ulong_t. There are examples in other parts of the code using ulong. I have tested this with 0, a value smaller than my directory size, and a value larger than my directory size. The performance is sufficiently different that I can tell it works as intended. |
There's no unit; it's a count: the directory size. Unlike other file systems, in ZFS the size of a directory is a count of the files in it. |
Perhaps too pedantic, but in general directories have a size in bytes as well as a size in link count. However, it looks like ZFS sets the size to equal the link count. |
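A minimal sketch of the check being discussed, assuming the directory znode is available as zp in zfs_readdir() and that z_size holds the entry count (the names come from this thread, not from the actual patch):

```c
/* 0 (the default) keeps prefetching for every directory; a non-zero value
 * disables it for directories with more entries than the limit.  The
 * "unit" is an entry count, since a ZFS directory's size is its number
 * of entries. */
static unsigned long zfs_readdir_dnode_prefetch_limit = 0;

/* in zfs_readdir(), before walking the entries: */
	boolean_t prefetch = B_TRUE;

	if (zfs_readdir_dnode_prefetch_limit != 0 &&
	    zp->z_size > zfs_readdir_dnode_prefetch_limit)
		prefetch = B_FALSE;	/* large directory: skip dnode prefetch */
```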
I looked a bit further into writing an alternative zfs_get_name. It looks like one could use zap_value_search. However the basic functions for iterating through the zap seem to be the same. I'm sure it would be faster, but it's not clear whether it would be enough to justify the extra code. |
I can take a look at this. Would you mind squashing the commits and taking a look at the checkstyle errors? |
How do I squash?
|
The complaint about the non-POSIX definition was hard to understand. All the POSIX alternatives generate compile errors. And I found plenty of uses of ulong in other files. Here's my attempt to use ulong_t: "param_ops_ulong_t" undeclared here. Here's another file with the same type of declaration not using POSIX: ./module/os/linux/zfs/zvol_os.c. I was unable to figure out what to do. |
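For background (this comes from how the kernel's moduleparam.h works, not from this thread): the type argument to module_param() is token-pasted into a symbol name, so only spellings the kernel defines parameter ops for will build.

```c
/* module_param(name, type, perm) references &param_ops_##type (see
 * include/linux/moduleparam.h).  This compiles because the kernel
 * provides param_ops_ulong: */
module_param(zfs_readdir_dnode_prefetch_limit, ulong, 0644);
/* Writing "ulong_t" instead would expand to &param_ops_ulong_t, a symbol
 * the kernel never defines, which is exactly the
 * "param_ops_ulong_t undeclared here" error mentioned above. */
```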
I see what you mean, that's bizarre. It's probably a bug in our cstyle.pl. I guess just make it a ulong_t for now in module_param() as a workaround. To squash commits run git rebase -i and squash all your commits into your first commit. After you've successfully squashed them, you can do a force push (git push --force) to update this PR with the single commit. |
ulong is the one that compiles but fails style. I haven't been able to figure out why mine fails but it works in other files. I suspect there's an include file I need. I think I've squashed it correctly. |
The original version, which allows one to disable all prefetch, is running in production. I tried the current version on a test server. I'll retry on test tomorrow to make sure nothing has broken in the update, though I don't see how that could have happened, since the only thing that changed was spacing.
The first chance to reboot production is mid-August.
|
@clhedrick looks like a /* CSTYLED */ comment will fix it:
diff --git a/module/os/linux/zfs/zfs_vnops_os.c b/module/os/linux/zfs/zfs_vnops_os.c
index 11f665d22..efe2371fa 100644
--- a/module/os/linux/zfs/zfs_vnops_os.c
+++ b/module/os/linux/zfs/zfs_vnops_os.c
@@ -4257,6 +4257,7 @@ EXPORT_SYMBOL(zfs_map);
module_param(zfs_delete_blocks, ulong, 0644);
MODULE_PARM_DESC(zfs_delete_blocks, "Delete files larger than N blocks async");
+/* CSTYLED */
module_param(zfs_readdir_dnode_prefetch_limit, ulong, 0644);
MODULE_PARM_DESC(zfs_readdir_dnode_prefetch_limit,
"No zfs_readdir prefetch if non-zero and size > this"); Would you mind adding that in? Then we should be pretty much good to go. |
OK. I just tested on a 6.8 kernel with the limit 0, 20000 and 400000, with a directory of size 200000. I got the expected results. 4 min with 0 and 400000, 20 sec with 20000. |
This is tricky to test. The problem occurs when files are cached on the client, but metadata is not, and the dnode is not cached on the server. This is realistic for us, because our clients have very large memory. Data can stay cached for days if not weeks, but the attribute cache times out after something like 90 sec. I believe the problem occurs when the client does a getattr to see if the data is still fresh. The getattr uses what is in effect an inode number. That triggers the reconnect code in the NFS server, which searches the parent directory for the directory, in order to find its name. So to duplicate the problem, clear cache on the server but not the client, and wait at least 120 sec on the client for the attribute cache to time out. Clearing cache on the client would cause it to do full lookups on all files, which would be fast. |
Just so I'm clear, is this the intended behavior?:
diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..588bf9339 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1781,9 +1781,11 @@ When readdir searches a directory, it normally prefetches metadata for
all objects in the directory it checks, even if it's just
looking for a single object.
Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with more entries than that value. For example, if set to 1000, then stop
+prefetching file metadata after 1000 file's metadata have been prefetched for
+the directory.
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
A reasonable value would be 20000.
.
.It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int |
The documentation is correct. I compare the directory size with the limit. What you describe could be implemented. I would use a counter in the loop. It would take a lot of work on lots of different situations to know whether this is better or not. It's not obvious that it is. A more complex fix would be to implement a zfs-specific get_name. That would basically disable prefetch only in this specific NFS operation. Whether that's worth doing or not depends partly upon whether the prefetch is a good idea in the first place for very large directories. I'm not so sure it is. Consider "ls -l". If information on all files is already in the ARC, the prefetches are wasted. If not, it will queue up a bunch of prefetches, in the order that they occur in the directory traversal. The best case seems to be "ls -l". But ls will sort the files alphabetically. I think it's just as likely that the first file will have to wait for a bunch of prefetches to finish before it gets its data. Prefetch won't be any faster than a demand fetch. It's useful only if the demand comes later, so that by the time it comes the data is in memory, where it would have had to wait otherwise. But it probably doesn't matter for typical file access. The "ls -l" case seems like the best hope, but I'm skeptical because of the ordering issue. You'd need some pretty careful testing in quite a bunch of situations. But I'm skeptical about the usefulness of the prefetch. Certainly not for us, where metadata is always in NVMe. If you can suggest a way to get a directory's name without doing a traversal of the parent at all, that would be worth doing. I'd be willing to code it. But I don't see any obvious way to do that given the data structures. |
OK, I misunderstood. So it literally is the 'size' from stat() on the directory. (https://kb.netapp.com/on-prem/ontap/Ontap_OS/OS-KBs/What_is_directory_size) May I suggest the following change?:
diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..083ffeaa4 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1776,15 +1776,18 @@ that ends up not being needed, so it can't hurt performance.
.
.It Sy zfs_readdir_dnode_prefetch_limit Ns = Ns Sy 0 Pq u64
Disable prefetches in readdir for large directories.
-(Normally off)
When readdir searches a directory, it normally prefetches metadata for
all objects in the directory it checks, even if it's just
looking for a single object.
Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with a greater size than that value. Directory size in this case is the
+size returned from calling
+.Sy stat
+on the directory (stat.st_size).
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
A reasonable value would be 20000.
+A zero value (the default) means no limit on directory metadata prefetching.
.
.It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
Disable QAT hardware acceleration for SHA256 checksums. |
To my knowledge, that field for a directory is in fact the number of entries in the directory. Is that wrong? It's not the amount of space used by the directory structure itself, as it is for something like ext4. Obviously I don't actually call stat, although I'm prepared to believe that this is the number that stat uses. I just checked a few directories, and it seems close, off by one maybe. I found code incrementing or decrementing it for actions with a single file. |
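A quick user-space way to see the value being compared (a sketch, not part of the patch): on ZFS a directory's st_size tracks its entry count, while on ext4 it is the on-disk size of the directory structure.

```c
/* Print st_size for a directory given on the command line. */
#include <stdio.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return (1);
	}
	if (stat(argv[1], &st) != 0) {
		perror("stat");
		return (1);
	}
	printf("%s: st_size = %lld\n", argv[1], (long long)st.st_size);
	return (0);
}
```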
I double checked and that appears to be correct. On ZFS it's basically the number of files/subdirs in the directory, but on other filesystems it's like the metadata size needed for the directory. So it's not strictly defined 🤷♂️ How about:
diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..a6a00e107 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1776,15 +1776,20 @@ that ends up not being needed, so it can't hurt performance.
.
.It Sy zfs_readdir_dnode_prefetch_limit Ns = Ns Sy 0 Pq u64
Disable prefetches in readdir for large directories.
-(Normally off)
When readdir searches a directory, it normally prefetches metadata for
all objects in the directory it checks, even if it's just
looking for a single object.
Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with a greater size than that value.
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
+Directory size in this case is the size returned from calling
+.Sy stat
+on the directory (stat.st_size).
+On ZFS, this directory size value is approximately the number of files
+and subdirectories in the directory.
A reasonable value would be 20000.
+A zero value (the default) means no limit on directory metadata prefetching.
.
.It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
Disable QAT hardware acceleration for SHA256 checksums. |
Can you edit the commit message to fix the commitcheck error?
Other than that, it should be good to go. |
Add parameter zfs_readdir_dnode_prefetch_limit, defaulting to 0, to control whether zfs_readdir prefetches metadata for objects it looks at when reading a directory. If zero, metadata is prefetched for all directory entries. If non-zero, metadata is prefetched only if the directory has fewer entries than this. Setting it to non-zero can be important for NFS servers with directories containing many subdirectories.
Signed-off-by: Charles Hedrick <[email protected]>
Co-authored-by: Chris Siebenmann <[email protected]>
OK. I also noted in the documentation that it only applies to Linux. I have no reason to think the BSD needs it. It's needed because of details of the NFS implementation. I would hope that BSD hasn't chosen the same approach. At any rate, the BSD readdir doesn't prefetch. |
What's up with this? Test failures can't be due to this patch, as it obviously has no effect if not enabled. Is this going to be merged? |
@clhedrick as you mentioned originally the issue here is really the lack of a proper get_name callback for ZFS.
The implementation for this should be similar to the existing export operations. If you're game to take a crack at this you'll want to register a new get_name callback in zpl_export_operations. |
It would be lovely to have this as a per-dataset property, just saying :) |
Assuming implementing |
Maybe. So get_name is (dentry *parent, char *namebuff, dentry *child). Is any locking needed around it? Is the returned dnode locked or held or anything? |
What I find odd about zpl_lookup is that it seems to pass a name to zfs_lookup. But I don't have the name. Given the data structures I don't see how you could do a lookup without the name or searching. |
Indeed, sorry my initial reading on this was a bit too hasty. You're right, the only way we have to do this lookup is by iterating over the entire directory. That said, I still think it may be worthwhile to add our own get_name. |
I think the place to look is zap_value_search. But there's another thing to think about: would it make sense to leave the prefetch, but when something is prefetched, if it's a directory, fill in the name and mark it connected? That would eliminate almost all of the searching, since a search happens only if a connected dnode can't be found. |
This should work. The directory ZAP stores the dataset object number as the first integer, and we use this same object number when setting the inode number. The virtual
If we've gone through all the expense of prefetching the directory it makes sense to me to populate it and mark it connected. |
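A rough sketch of what such a ZFS-side get_name might look like (an illustration of the idea under discussion, not code from this PR: zpl_get_name is a hypothetical name, error-sign conversion and unmount/teardown protection are glossed over, and zap_value_search() still walks the directory ZAP internally, just without the per-entry dnode prefetching):

```c
/* Hypothetical export operation: resolve a child's name by searching the
 * parent's directory ZAP for the child's object number.  ZFS_DIRENT_OBJ()
 * masks off the type bits stored alongside each entry's object number. */
static int
zpl_get_name(struct dentry *parent, char *name, struct dentry *child)
{
	znode_t *dzp = ITOZ(parent->d_inode);
	znode_t *zp = ITOZ(child->d_inode);
	objset_t *os = ITOZSB(parent->d_inode)->z_os;

	return (zap_value_search(os, dzp->z_id, zp->z_id,
	    ZFS_DIRENT_OBJ(-1ULL), name));
}

/* Registered alongside the existing callbacks in zpl_export.c: */
const struct export_operations zpl_export_operations = {
	.encode_fh		= zpl_encode_fh,
	.fh_to_dentry		= zpl_fh_to_dentry,
	.get_parent		= zpl_get_parent,
	.get_name		= zpl_get_name,	/* hypothetical addition */
	.commit_metadata	= zpl_commit_metadata,
};
```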
My feeling is that if we make prefetch connect the dnode, performance of this loop isn't so critical, so we don't need our own get_name. As long as the directory dnode is connected, none of this code will be called. This is for realistic cases. In principle, I think the limit to prefetch is needed independent of the NFS issue described here. ZFS claims to support huge directories. But if you actually tried to use them, even if you aren't using NFS and thus don't have the issue described in this request, the first time you do an "ls" of a directory with a zillion files the prefetches will thrash the ARC and possibly fill task queues and other things. I can't say whether this would ever happen in reality. I note that the BSD code for readdir doesn't prefetch. Anyway, connecting the dnode during prefetch is getting beyond what I think I can do. For the moment we'll stick with this patch. It's hard to know whether anyone else is seeing this issue, since it wouldn't show any symptom other than more CPU usage than you'd like. A lot of our folks do AI; the training datasets seem to have a structure that triggers this. I doubt we're the only place doing this. |
It's actually not so clear that we need the prefetch. There may be enough info in the inode. But I don't understand what is going on in the reconnect code, so I can't tell what would be needed to actually do what I propose, and whether it would be expensive enough to cause other issues. |
As far as I can see, the prefetch code is identical for Linux and FreeBSD. To address the last case of unneeded prefetches, when the data are already in the ARC, what comes to mind is a mechanism I implemented some time ago for the speculative prefetcher. |
We have this issue on our server. In 60 seconds, osq_lock() appeared in stack traces 84231614 times. The nfsd processes were 90%+ busy. |
What revision is needed? I'm using the patch with 2.2.6. The diff doesn't apply automatically to master, but it's obvious what is needed. If it's going to be accepted, I'll fix it to work with master, but so far it doesn't seem like people are ready to take it. Obviously we'll continue to use it, and it sounds like at least one other site needs it. Others might as well but not realize it. My assessment is that the obvious alternative will be only marginally better. I'd rather rewrite things so a directory traversal isn't needed at all, but I'm not convinced that can be done without changing non-ZFS kernel code. |
@clhedrick I don't know what Brian meant with that, maybe just the merge conflict. But on my side I am not happy about just disabling prefetch because it is not needed in some cases. I don't like users turning random knobs. I'd prefer some automatic detection, similar to what should be there already, maybe with some improvements; see my later comment. |
Anything that's worth doing is beyond what I can do. One reasonable approach would be to prefetch each item only once. But the prefetch is async, and I don't see any place to tell whether one has already been scheduled. Our cache hit rates for metadata are so high that I doubt there's any advantage to prefetching for us. For myself, I'd prefer just to comment out the prefetch.
If someone else has this issue, that might be easiest. (I haven't found a way to get a diff for this change from GitHub.) |
Our directories do not have many entries, so the locking contention during the prefetch was the problem. |
Add parameter zfs_readdir_dnode_prefetch_enabled, defaulting to 1, to control whether zfs_readdir prefetches metadata for all objects it looks at when reading a directory.
Setting it to 0 can be important for NFS servers with directories containing many subdirectories.
Motivation and Context
This improves #16254, although a more radical redesign is possible. We have NFS servers with directories containing 200,000 to 500,000 subdirectories. These arise in AI training scenarios. Our NFS servers can have 20 or more threads running at 100% for hours due to this.
The problem has been found to be an interaction between the NFS code and zfs_readdir. If information about a directory is cached on the client but not the server, when the server tries to validate the NFS file ID, it will add an entry in the dnode cache for the directory. Because the new dnode is not connected to the rest of the directory hierarchy, the NFS (actually exportfs) code goes up the tree trying to connect it. This ends up calling zfs_readdir to find the directory in its parent. Because zfs_readdir reads the parent looking for the directory, the amount of time it takes is proportional to the size of the parent. If every directory is processed, the amount of time will be N**2 in the size of the parent. Reading a directory would have only moderate overhead except that zfs_readdir does a prefetch on every item as it looks at it, even if the item is already in the ARC. Of course if it's in the ARC, no disk I/O will be done. But there's still a substantial amount of code.
We've found a factor of 10 to 20 speedup by skipping the prefetches. This is enough to move us from a painful situation to one we can live with.
Of course a linear speedup in an N**2 problem isn't a full solution, but it's enough to survive the size directories we see in practice. A full solution would almost certainly require help from the kernel developers, which is going to be hard to get, since directory searching is fast enough for other Linux file systems that the problem only comes up with ZFS.
Description
This change adds a module parameter, zfs_readdir_dnode_prefetch_enabled, normally 1, which can be changed to 0 to disable the prefetch that happens in zfs_readdir.
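A minimal sketch of the change this describes (the parameter name is from this PR; the exact spot inside zfs_readdir() and the name of the prefetch call, which differs between releases, are assumptions):

```c
/* New module parameter; the default of 1 keeps the existing behavior. */
static int zfs_readdir_dnode_prefetch_enabled = 1;
module_param(zfs_readdir_dnode_prefetch_enabled, int, 0644);
MODULE_PARM_DESC(zfs_readdir_dnode_prefetch_enabled,
	"Prefetch dnodes for every directory entry zfs_readdir scans");

/* In the zfs_readdir() entry loop, the existing per-entry prefetch
 * (dmu_prefetch_dnode() on recent code, dmu_prefetch() on older releases)
 * becomes conditional: */
	if (zfs_readdir_dnode_prefetch_enabled)
		dmu_prefetch_dnode(os, objnum, ZIO_PRIORITY_SYNC_READ);
```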
How Has This Been Tested?
We have done testing with a directory containing 200,000 subdirectories, with the results discussed above. It has been running in production for several weeks.
Types of changes
Checklist:
Signed-off-by