Low zfs list performance #8898
Replies: 9 comments
-
IIRC, this is one of the reasons …
-
Regarding the workaround of using a channel program: as that runs as an atomic operation in the txg sync context, I would expect it to block the txg sync, which would be an even worse outcome, even if it performed an order of magnitude better. I also repeated the test after setting zfs_compressed_arc_enabled=0 and re-importing the pool: ~4 seconds less total runtime, so no real difference from running with compressed ARC. Unless setting the module parameter and exporting/importing the pool is somehow turned into a NOP, repeated decompression of ARC contents doesn't seem to be the bottleneck.
-
I've been seeing the same strange behaviour (I think from the beginning, which was zfs-0.6.5.9).
-
If you run … The reason channel programs can be more efficient is that they can run through all datasets in a single pass, rather than issuing an ioctl per dataset (as the iterator does).
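For illustration, a minimal sketch of that pattern using the `zfs.list.children`/`zfs.list.snapshots` iterators from zfs-program(8) (illustrative only; the starting dataset is passed as `argv[1]`):

```lua
-- Sketch: one channel-program invocation (one call into the kernel)
-- walks datasets and their snapshots with in-kernel iterators,
-- instead of userland issuing an ioctl per dataset.
args = ...
argv = args["argv"]

count = 0
for child in zfs.list.children(argv[1]) do    -- direct child filesystems/volumes
    count = count + 1
    for snap in zfs.list.snapshots(child) do  -- their snapshots, still in-kernel
        count = count + 1
    end
end
return count
```

The entire walk happens inside the single `zfs program` call; nothing crosses back to userland per dataset.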
-
@richardelling I wonder what would happen when using a channel program to replace the `zfs list` iteration. Wouldn't it (as a ZCP runs atomically in one TXG) block the TXG sync and thereby stall everything else accessing the pool (or worse)?
-
@GregorKopka good question. We know that running from userland through the iterator is slow (50 minutes). We don't know what that looks like when the work is all in kernel. Try it :-)
-
Here is a simple program that does something along those lines.
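For reference, a program of that shape, built on the `zfs.list` iterators from zfs-program(8), can look roughly like this (an illustrative sketch, not necessarily the exact listing that was timed; `argv[1]` is the dataset to start from):

```lua
-- Sketch: collect the names of every dataset and snapshot below argv[1]
-- in a single channel-program run and return them as one result.
args = ...
argv = args["argv"]

results = {}

function walk(ds)
    results[#results + 1] = ds               -- the dataset itself
    for snap in zfs.list.snapshots(ds) do    -- its snapshots
        results[#results + 1] = snap
    end
    for child in zfs.list.children(ds) do    -- recurse into child datasets
        walk(child)
    end
end

walk(argv[1])
return results
```

Invoked with something like `zfs program <pool> list_all.lua <pool>` (the script name is made up); returning ~10k names may need the memory limit raised via `-m`, and newer versions also accept `-n` for read-only programs.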
I've tried this out on an idle zpool comprising a single spindle; I export and reimport the pool to flush any caches. It has ~100 filesystems and ~10k snapshots. Results: …
The cache seems to make little difference to the channel program in this case. If I try it on another pool (4 spindles, 100 fs, 5000 snaps), …
-
Whoops, that was an unfair comparison; I should have used …
Similarly, on my 4-spindle pool a …
-
System information
Describe the problem you're observing
Low performance when running `zfs list -o name -H -r -tall $pool`.
According to my tests, having everything already cached in ARC (so it can be served from RAM) is only about one order of magnitude faster than having to read everything directly from HDD in the first place.
Granted, this pool has ~238k snapshots in 621 datasets, but nevertheless... ZFS shouldn't need >5 minutes to list these when all the data needed is already in ARC.
Plus the output of arcstat.py while testing this doesn't make any sense (see below).
Describe how to reproduce the problem
On an otherwise completely idle system
Linux 4.9.95-gentoo #2 SMP Wed Feb 20 11:21:13 -00 2019 x86_64 Intel(R) Xeon(R) CPU E31245 @ 3.30GHz GenuineIntel GNU/Linux
With all zfs/spl parameters at default and non-default pool properties of …
I import the pool and run the command above; the first run works out to ~14.36 ms per dataset listed (~70 datasets/s).
This isn't great but, as the pool is HDD-based and `zfs list` reads through the metadata sequentially (the next read needs data from the current one) and is therefore limited to the IOPS of a single disk, it's somewhat expected. On that first run I see a `zpool iostat 10` output of … while `arcstat.py 10` outputs strange numbers: …
Directly afterwards I repeat the operation, now with the ARC having cached everything.
The repeat works out to ~1.84 ms/dataset (~543 datasets/s): less than 10 times faster than having to read from disk, and with ~238k snapshots plus 621 datasets that still adds up to over 7 minutes of wall time.
On the cached run I see a representative `top` output of … and, according to `zpool iostat`, no IO to the drives (= fully cached). The output of `arcstat.py 10` continues to show very strange numbers: …
Repeating this with the addition of `-s name` reduces the runtime a little, but it's still taking ~1.4 ms/dataset (only ~715 datasets/s), with the same strange numbers in the `arcstat.py` output as on the cached run above.
Include any warning/errors/backtraces from the system logs
Nothing in the logs.