
BRT: Cloning sparse files should take time proportional to allocated size #16014

Open

rrevans opened this issue Mar 20, 2024 · 5 comments

Labels: BRT (BRT tracking), Type: Feature (Feature request or new feature)

rrevans (Contributor) commented Mar 20, 2024

Describe the feature you would like to see added to OpenZFS

Copying large sparse files with BRT should take time proportional to the number of allocated blocks rather than time proportional to the number of records in the file (i.e. apparent size).

On my system a 128G file with one 4k block of data takes about 4 seconds to clone:

$ dd if=/dev/random of=x seek=32M count=1 bs=4k
$ zpool sync
$ time cp --reflink=always x x.2

real	0m3.851s
user	0m0.000s
sys	0m3.823s

And a 512G file takes about 4x longer:

$ dd if=/dev/random of=x seek=128M count=1 bs=4k
$ zpool sync
$ time cp --reflink=always x x.2

real	0m15.071s
user	0m0.001s
sys	0m15.036s

Lastly, a 2T file takes about 4x longer again:

$ dd if=/dev/random of=x seek=512M count=1 bs=4k
$ zpool sync
$ time cp --reflink=always x x.2

real	1m2.359s
user	0m0.000s
sys	1m2.337s

This is Fedora 39 6.6.8-200.fc39.x86_64 running 8f2f6cd non-DEBUG with a default-created pool with 128k recordsize on a single consumer NVMe vdev.

(I'm filing this as a feature because the performance is reasonable enough, but maybe some would argue that this is a regression vs. the existing behavior of copying with sparse-aware tools.)

How will this feature improve OpenZFS?

While block cloning sparse files works reasonably well, ZFS can easily create petabyte-sized sparse files, and recent cp can copy them efficiently with SEEK_DATA. Ideally block cloning should have similar performance.

This would help users with very large, sparsely allocated files such as VM images or large sparse array files. Also, users trying to reflink extremely large, mostly empty sparse files might otherwise be surprised by linear copy performance compared to how cp behaved before reflinking was added to ZFS.
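For reference, the SEEK_DATA/SEEK_HOLE interface that sparse-aware cp relies on can be sketched from userspace as follows. This is a minimal Python illustration, not OpenZFS code; on filesystems without hole reporting, lseek falls back to treating the whole file as a single data extent.

```python
import os

def data_extents(path):
    """Yield (offset, length) for each allocated (data) extent,
    skipping holes via SEEK_DATA/SEEK_HOLE (Linux)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            try:
                # Jump to the next byte that is backed by data.
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:
                break  # ENXIO: no data past pos
            # Find where this data run ends (the next hole or EOF).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            yield (start, end - start)
            pos = end
    finally:
        os.close(fd)
```

The loop does work proportional to the number of allocated extents, not the apparent file size, which is the behavior this issue asks BRT cloning to match.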

Additional context

See dnode_next_offset for a possible implementation, though that function doesn't correctly handle traversing files with dirty blocks. It might be possible to query dbuf state for the range of in-memory dbufs to see if the cloned range is dirty before attempting cloning, or to use something like dnode_is_dirty as a workaround.

@rrevans rrevans added the Type: Feature Feature request or new feature label Mar 20, 2024
amotin (Member) commented Mar 20, 2024

Skipping holes sounds like a good idea. I guess cloning of holes should already be faster than cloning of data, since holes are not counted in BRT, but maybe we could save some more work for the sync thread without too much redesign if we just skip cloning holes over holes.

Unrelated to this approach, but a couple of weeks ago I produced 7 PRs trying to optimize block cloning: #15941, #15950, #15951, #15954, #15955, #15967, #15976. I would appreciate some review.

rrevans (Contributor, Author) commented Mar 20, 2024

I will take a look - at first glance these look like good optimizations and cleanup. Do you have any getting-started docs, talks, or other resources beyond brt.c? (Which is pretty good already!)

I'm also considering looking into what it would take to make bclone work on objects with dirty records before sync, so it would be nice to learn the internals. I'm against the idea that the outcome of any userspace operation should differ between dirty and non-dirty object state, or that a userspace operation should need to force a txg sync to accomplish an ordinary task.

rincebrain (Contributor) commented
This and this are probably a good start, depending on what precisely you were looking for.

JuniorJPDJ commented
Shouldn't this be closed after the #16007 merge? Can someone retest this? :D

amotin (Member) commented Oct 22, 2024

Shouldn't this be closed after the #16007 merge? Can someone retest this? :D

No. It really was a bug fix, just as written. Some of my optimizations in the area may improve performance in this scenario, but not to the point where a huge file of holes would be cloned instantly. The cloning code is still pretty sequential; it has no special optimizations for sparse files.
