Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BRT: Fix FICLONE/FICLONERANGE shortened copy #15842

Merged
merged 2 commits into from
Feb 6, 2024

Conversation

behlendorf
Copy link
Contributor

Motivation and Context

Issue #15728 describes an issue with the Linux integration of BRT which needs to be resolved.

Description

On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned.

Furthermore, the file rangelock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state.

A test case has been added which exercises this functionality by verifying that cp --reflink=never|auto|always works correctly.

How Has This Been Tested?

A test case has been added which is modeled after the original reproducer from #15728. Without this change applied the test fails, with the test passes.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@robn
Copy link
Member

robn commented Feb 1, 2024

Any reason for cp --reflink vs clonefile? Not that I'm precious about it; its just there to ensure we get exactly what we want, rather than whatever the local cp of the day does.

@behlendorf
Copy link
Contributor Author

Any reason for cp --reflink vs clonefile? Not that I'm precious about it; its just there to ensure we get exactly what we want, rather than whatever the local cp of the day does.

That's a good point. At the time I was concerned about cp working as expected so I wanted to test that. But you're right, clonefile would be more precise and portable. Let me play with that a bit.

@behlendorf
Copy link
Contributor Author

After giving this some more thought I've updated the PR with the following changes.

  1. Added a new zfs_bclone_wait_dirty kmod tunable to control the behavior where there are dirty blocks. Mainly I added this to provide an easy way to compare the performance between waiting on dirty blocks and instead returning and error and making the copy.
  2. Switched the default behavior to not wait on dirty blocks, zfs_bclone_wait_dirty=0. EINVAL is returned when FICLONE or FICLONERANGE can't clone the entire range due to dirty blocks.
  3. Updated the test case to verify both the zfs_bclone_wait_dirty=0 and zfs_bclone_wait_dirty=1 behavior. The test still uses cp --reflink, we should probably consider adding another test which used clonefile.
  4. Restructured how the wait_dirty flag is passed. Really we don't need to pass it at all, but could simply check zfs_bclone_wait_dirty in zfs_clone_range if there's no good real to let the caller control this. For now I left it to get feedback.

Copy link
Contributor

@rrevans rrevans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I patched this and confirmed it fixes the original reproducer case.

@behlendorf
Copy link
Contributor Author

Refreshed. The updated version addresses all of the outstanding feedback. I also simplified the patch to no longer allow the caller to pass zfs_bclone_wait_dirty. If at some point in the future we decide this is needed we can always add it.

@rrevans
Copy link
Contributor

rrevans commented Feb 5, 2024

The updated version addresses all of the outstanding feedback.

Thanks the patch looks very good to me.

On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file rangelock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Signed-off-by: Brian D Behlendorf <[email protected]>
Issue openzfs#15728
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Feb 5, 2024
Relocate declaration of zfs_bclone_enabled and zfs_bclone_wait_dirty
to the platform independant code.  Add some additional documention
to these tunables at the same time.

Signed-off-by: Brian Behlendorf <[email protected]>
module/zfs/zfs_vnops.c Show resolved Hide resolved
@behlendorf behlendorf merged commit 6dccdf5 into openzfs:master Feb 6, 2024
23 of 25 checks passed
@Alexandero89
Copy link

@behlendorf could you maybe check the github actions. There seems something went wrong.

  • buildbot/CentOS 8 x86_64 (TEST) is stated as "Build started" but when you click on Details it has already finished long ago
  • Cleanup action run out of space in the last run (see here)

tonyhutter added a commit that referenced this pull request Feb 22, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #15728
Closes #15842
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 13, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#15728
Closes openzfs#15842
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 13, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#15728
Closes openzfs#15842
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 13, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#15728
Closes openzfs#15842
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants