A proposal for how to implement copy offloading in the Linux kernel

bvanassche/linux-kernel-copy-offload

Implementing Copy Offloading in the Linux Kernel Block Layer

Introduction

Efforts to support copy offloading in the Linux kernel block layer started a considerable time ago. Despite this, copy offloading support is not yet upstream.

This document is about how to implement copy offloading.

Block Layer Core

The following functionality of the block layer needs to be considered when implementing copy offloading:

  • Request queueing.
  • I/O scheduling.
  • Request splitting. bio_split_to_limits() splits a bio if splitting is necessary to meet the request queue limits.
  • Request merging. blk_mq_sched_try_merge() attempts to merge a bio into an existing request.
  • Request cloning.
  • Request plugging.
  • Tracking I/O statistics.
  • Timeout handling.
  • Block driver stacking.

A possible approach is as follows:

  • Fall back to a non-offloaded copy operation if necessary, e.g. if copy offloading is not supported, if the data is encrypted and the ciphertext depends on the LBA, or if the copy request would have to be split. The following code may be a good starting point for a non-offloaded copy operation: drivers/md/dm-kcopyd.c.
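The fallback conditions listed above can be sketched as a single predicate. This is only an illustration in plain user-space C: struct copy_req, struct copy_limits, their fields and can_offload_copy() are hypothetical names invented for this sketch and are not kernel structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a copy request; not a kernel structure. */
struct copy_req {
	uint64_t nr_sectors;       /* length of the copy in 512-byte sectors */
	bool lba_dependent_crypto; /* ciphertext depends on the LBA */
};

/* Hypothetical per-device limits; not the kernel's struct queue_limits. */
struct copy_limits {
	bool offload_supported;    /* device implements a copy command */
	uint64_t max_copy_sectors; /* largest copy accepted in one command */
};

/* Return true if the copy may be offloaded, false if it must be onloaded. */
static bool can_offload_copy(const struct copy_req *req,
			     const struct copy_limits *lim)
{
	assert(req && lim);
	if (!lim->offload_supported)
		return false;
	if (req->lba_dependent_crypto)
		return false;
	/* A request above the limit would have to be split: fall back. */
	if (req->nr_sectors > lim->max_copy_sectors)
		return false;
	return true;
}
```

A caller would onload the copy, for instance via a dm-kcopyd style read and write loop, whenever the predicate returns false.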

A mechanism is needed to pass copy offload requests from file systems to block drivers. So far the following has been considered:

  • Implement copy offloading as a single operation in the block layer, e.g. REQ_OP_COPY.
  • Implement copy offloading as two operations, e.g. REQ_OP_COPY_SRC and REQ_OP_COPY_DST.

These two approaches compare as follows:

| Single operation | Two operations |
| --- | --- |
| No deadlock risk. | Block drivers must complete the REQ_OP_COPY_SRC operation before the REQ_OP_COPY_DST operation has completed or there is a risk of a deadlock. Hence, this approach is slower and requires all block drivers to track state information about ongoing copy operations. |
| Two data ranges have to be specified in the bio payload. A data buffer will have to be attached to the bio (bi_io_vec) with a custom data format. | One data range per bio. This data range can be specified in bi_iter.bi_sector and bi_iter.bi_size. |
| Drivers like dm-linear will have to be modified to support the new bio payload format. | No device mapper drivers have to be modified. |
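To make the "custom data format" of the single-operation approach more concrete, here is a minimal user-space sketch of what such a payload might contain and what a dm-linear style remap would have to do with it. The structures and function names are hypothetical; no such payload format exists in the kernel, and the remap assumes that both ranges live on the same underlying device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical range descriptor, in 512-byte sectors. */
struct copy_range {
	uint64_t sector;
	uint64_t nr_sectors;
};

/* Hypothetical payload carried in the bio data buffer. */
struct copy_payload {
	struct copy_range src;
	struct copy_range dst;
};

/* Both ranges must be non-empty, equally long and must not wrap around. */
static bool copy_payload_valid(const struct copy_payload *p)
{
	assert(p);
	if (p->src.nr_sectors == 0 || p->src.nr_sectors != p->dst.nr_sectors)
		return false;
	if (p->src.sector + p->src.nr_sectors < p->src.sector ||
	    p->dst.sector + p->dst.nr_sectors < p->dst.sector)
		return false;
	return true;
}

/*
 * A linear target shifts sectors by a fixed offset. With two ranges in the
 * payload, the driver has to shift both instead of only bi_iter.bi_sector.
 */
static void copy_payload_remap(struct copy_payload *p, uint64_t start_offset)
{
	assert(p);
	p->src.sector += start_offset;
	p->dst.sector += start_offset;
}
```

With the two-operation approach none of this is needed, because each bio carries a single range in bi_iter and existing remapping code applies unchanged.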

The following code needs to be modified no matter how copy offloading is implemented:

  • Request cloning. The code for checking the limits before requests are cloned compares blk_rq_sectors() with max_sectors. This is inappropriate for REQ_COPY_* requests.
  • Request splitting. bio_split() assumes that bi_iter.bi_size represents the number of bytes affected on the medium.
  • Code related to partially completing a request, e.g. blk_update_request().
  • The code for merging block layer requests.
  • blk_mq_end_request() since it calls blk_update_request().
  • The plugging code because of the following test in the plugging code: blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE.
  • The I/O accounting code (task_io_account_read()) since that code uses bio_has_data() and hence skips discard, secure erase and write zeroes requests:
static inline bool bio_has_data(struct bio *bio)
{
	return bio && bio->bi_iter.bi_size &&
	    bio_op(bio) != REQ_OP_DISCARD &&
	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
	    bio_op(bio) != REQ_OP_WRITE_ZEROES;
}

Block drivers will need to use the special_vec member of struct request to pass the copy offload parameters to the storage device. That member is used e.g. when a REQ_OP_DISCARD operation is submitted to an NVMe driver. The SCSI sd driver uses special_vec while processing an UNMAP or WRITE SAME command.

Device Mapper

The device mapper may have to split a request. As an example, LVM is based on the dm-linear driver. A request that is submitted to an LVM volume has to be split if it affects multiple block devices. Copy offload requests that affect multiple block devices should be split or should be onloaded.
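The decision whether a copy range crosses a target boundary can be illustrated with a small user-space model of a linear mapping table. This is a sketch only: struct target and both helper functions are hypothetical and do not correspond to the device mapper's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one contiguous piece of the logical device. */
struct target {
	uint64_t begin; /* first logical sector covered by this target */
	uint64_t len;   /* number of sectors covered */
};

/*
 * Return the number of sectors between 'sector' and the end of the target
 * that contains it, or 0 if no target maps that sector.
 */
static uint64_t sectors_to_target_end(const struct target *t, size_t n,
				      uint64_t sector)
{
	assert(t);
	for (size_t i = 0; i < n; i++)
		if (sector >= t[i].begin && sector - t[i].begin < t[i].len)
			return t[i].len - (sector - t[i].begin);
	return 0;
}

/* True if [sector, sector + nr) lies entirely within a single target. */
static bool range_in_one_target(const struct target *t, size_t n,
				uint64_t sector, uint64_t nr)
{
	return nr != 0 && nr <= sectors_to_target_end(t, n, sector);
}
```

Under this model, a copy request would be passed down unmodified only if both its source range and its destination range pass range_in_one_target(); otherwise it would be split or onloaded.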

The call chain for bio-based dm drivers is as follows:

dm_submit_bio(bio)
-> __split_and_process_bio(md, map, bio)
  -> __split_and_process_non_flush(clone_info)
    -> __clone_and_map_data_bio(clone_info, target_info, sector, len)
      -> clone_bio(dm_target_io, bio, sector, len)
      -> __map_bio(dm_target_io)
        -> ti->type->map(dm_target_io, clone)

NVMe

Process copy offload commands by translating REQ_COPY_OUT requests into simple copy commands.

SCSI

From inside sd_revalidate_disk(), query the third-party copy VPD page. Extract the following parameters (see also SPC-6):

  • MAXIMUM CSCD DESCRIPTOR COUNT
  • MAXIMUM SEGMENT DESCRIPTOR COUNT
  • MAXIMUM DESCRIPTOR LIST LENGTH
  • Supported third-party copy commands.
  • SUPPORTED CSCD DESCRIPTOR ID (0 or more)
  • ROD type descriptor (0 or more)
  • TOTAL CONCURRENT COPIES
  • MAXIMUM IDENTIFIED CONCURRENT COPIES
  • MAXIMUM SEGMENT LENGTH
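Once extracted, some of these limits could be cached and used to decide whether a copy fits into a single command. The sketch below is a user-space illustration under several assumptions: struct tpc_limits and the helper names are hypothetical, the field widths are chosen for convenience and do not reflect the VPD page layout, MAXIMUM SEGMENT LENGTH is assumed to be expressed in bytes, and a value of zero is conservatively treated as "unknown".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache of a few third-party copy limits. */
struct tpc_limits {
	uint32_t max_segment_desc_count; /* MAXIMUM SEGMENT DESCRIPTOR COUNT */
	uint32_t max_segment_length;     /* MAXIMUM SEGMENT LENGTH (bytes assumed) */
};

/* Number of segment descriptors needed to cover nr_bytes (rounded up). */
static uint64_t segments_needed(const struct tpc_limits *lim, uint64_t nr_bytes)
{
	assert(lim && lim->max_segment_length != 0);
	return nr_bytes / lim->max_segment_length +
	       (nr_bytes % lim->max_segment_length != 0);
}

/* True if a copy of nr_bytes fits within the descriptor count limit. */
static bool fits_in_one_command(const struct tpc_limits *lim, uint64_t nr_bytes)
{
	assert(lim);
	if (lim->max_segment_length == 0 || nr_bytes == 0)
		return false;
	return segments_needed(lim, nr_bytes) <= lim->max_segment_desc_count;
}
```

A copy that does not fit would have to be issued as several commands or onloaded.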

From inside sd_init_command(), translate REQ_COPY_OUT into either EXTENDED COPY or POPULATE TOKEN + WRITE USING TOKEN.

Set the parameters in the copy offload commands as follows:

  • We may have to set the STR bit. From SPC-6: "A sequential striped (STR) bit set to one specifies to the copy manager that the majority of the block device references in the parameter list represent sequential access of several block devices that are striped. This may be used by the copy manager to perform reads from a copy source block device at any time and in any order during processing of an EXTENDED COPY command as described in 6.6.5.3. A STR bit set to zero specifies to the copy manager that disk references, if any, may not be sequential."
  • Set the LIST ID USAGE field to 3 and the LIST ID to 0. This means that neither "held data" nor the RECEIVE COPY STATUS command are supported. This improves security because the data that is being copied cannot be accessed via the LIST ID.
  • We may have to set the G_SENSE (good with sense data) bit. From SPC-6: "If the G_SENSE bit is set to one and the copy manager completes the EXTENDED COPY command with GOOD status, then the copy manager shall include sense data with the GOOD status in which the sense key is set to COMPLETED, the additional sense code is set to EXTENDED COPY INFORMATION AVAILABLE, and the COMMAND-SPECIFIC INFORMATION field is set to the number of segment descriptors the copy manager has processed."
  • Clear the IMMED bit.
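The choices above can be collected in one place before the parameter list is built. The following sketch uses a hypothetical struct xcopy_settings with named fields; it deliberately does not encode the on-the-wire layout of the EXTENDED COPY parameter list, which has to be taken from SPC-6.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, layout-independent view of the settings discussed above. */
struct xcopy_settings {
	bool str;              /* sequential striped hint, still undecided */
	bool g_sense;          /* request sense data with GOOD status */
	bool immed;            /* return before the copy has finished */
	uint8_t list_id_usage; /* LIST ID USAGE field */
	uint8_t list_id;       /* LIST ID */
};

/*
 * Fixed choices: LIST ID USAGE 3, LIST ID 0, IMMED cleared.
 * STR and G_SENSE are left to the caller because the text above
 * only says that they may have to be set.
 */
static struct xcopy_settings xcopy_default_settings(bool str, bool g_sense)
{
	struct xcopy_settings s = {
		.str = str,
		.g_sense = g_sense,
		.immed = false,
		.list_id_usage = 3,
		.list_id = 0,
	};
	assert(s.list_id_usage <= 3); /* assumed to be a two-bit field */
	return s;
}
```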

System Call Interface

To submit copy offload requests from user space, we need:

  • A system call for passing these requests, e.g. copy_file_range() or io_uring.
  • A description of the copy offload parameter format in the user space ABI. The parameters include source device, source ranges, destination device and destination ranges.
  • A flag that indicates whether or not it is acceptable to fall back to onloading the copy operation.
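As an illustration of the first and last points, the sketch below calls the existing copy_file_range() system call and, depending on a caller-supplied boolean, either reports failure or falls back to a read and write loop. The function name copy_range() and the allow_onload parameter are hypothetical stand-ins for the proposed fallback flag; they are not part of any existing ABI. Note also that copy_file_range() itself does not guarantee offloading: the kernel may perform an ordinary in-kernel copy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy len bytes from fd_in to fd_out at their current file offsets.
 * Returns 0 on success and -1 on error or premature end of file.
 */
static int copy_range(int fd_in, int fd_out, size_t len, bool allow_onload)
{
	char buf[65536];

	assert(fd_in >= 0 && fd_out >= 0);
	while (len > 0) {
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);

		if (n > 0) {
			len -= (size_t)n;
			continue;
		}
		if (n == 0)
			return -1; /* end of input before len bytes were copied */
		if (!allow_onload)
			return -1; /* caller did not permit a fallback */
		break;             /* onload the remainder below */
	}
	while (len > 0) {
		size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t r = read(fd_in, buf, chunk);

		if (r <= 0)
			return -1;
		for (ssize_t off = 0; off < r; ) {
			ssize_t w = write(fd_out, buf + off, (size_t)(r - off));

			if (w <= 0)
				return -1;
			off += w;
		}
		len -= (size_t)r;
	}
	return 0;
}
```

The flags argument of copy_file_range() currently has to be zero, so a "fallback allowed" indication would need either a new flag value or a different interface such as io_uring.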

Sysfs Interface

To do: define which aspects of copy offloading should be configurable through new sysfs parameters under /sys/block/*/queue/.
