Efforts to support copy offloading in the Linux kernel block layer started a considerable time ago. Despite this, copy offloading support is not yet upstream.
This document is about how to implement copy offloading.
The following functionality of the block layer needs to be considered when implementing copy offloading:
- Request queueing.
- I/O scheduling.
- Request splitting. bio_split_to_limits() splits a bio if splitting is necessary to meet the request queue limits.
- Request merging. blk_mq_sched_try_merge() attempts to merge a bio into an existing request.
- Request cloning.
- Request plugging.
- Tracking I/O statistics.
- Timeout handling.
- Block driver stacking.
A possible approach is as follows:
- Fall back to a non-offloaded copy operation if necessary, e.g. if copy offloading is not supported, if data is encrypted and the ciphertext depends on the LBA, or if the copy request would have to be split. The following code may be a good starting point for a non-offloaded copy operation: drivers/md/dm-kcopyd.c.
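The fallback conditions listed above can be expressed as a single predicate. The following is a minimal user-space sketch, not kernel code: `struct copy_req_info` and `copy_must_fall_back()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one copy request; not a kernel structure. */
struct copy_req_info {
	bool dev_supports_offload; /* device advertises a copy command */
	bool lba_dependent_crypto; /* ciphertext depends on the LBA */
	bool needs_split;          /* request would have to be split */
};

/* Return true if the copy has to be performed by the host (onloaded). */
static bool copy_must_fall_back(const struct copy_req_info *info)
{
	assert(info != NULL);
	return !info->dev_supports_offload ||
	       info->lba_dependent_crypto ||
	       info->needs_split;
}
```

A caller would evaluate this predicate before building an offloaded request and take the non-offloaded path, e.g. something modeled on dm-kcopyd, when it returns true.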
A mechanism is needed to pass copy offload requests from file systems to block drivers. So far the following has been considered:
- Implement copy offloading as a single operation in the block layer, e.g. REQ_OP_COPY.
- Implement copy offloading as two operations, e.g. REQ_OP_COPY_SRC and REQ_OP_COPY_DST.
These two approaches compare as follows:
Single operation | Two operations |
---|---|
No deadlock risk. | Block drivers must complete the REQ_OP_COPY_SRC operation before the REQ_OP_COPY_DST operation has completed or there is a risk of a deadlock. Hence, this approach is slower and requires all block drivers to track state information about ongoing copy operations. |
Two data ranges have to be specified in the bio payload. A data buffer with a custom data format will have to be attached to the bio (bi_io_vec). | One data range per bio. This data range can be specified in bi_iter.bi_sector and bi_iter.bi_size. |
Drivers like dm-linear will have to be modified to support the new bio payload format. | No device mapper drivers have to be modified. |
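To make the difference in payload shape concrete, here is a small user-space sketch. It assumes a hypothetical `struct copy_payload` for the single-operation approach and uses `struct range_iter` as a stand-in for the `bi_sector` and `bi_size` members of `bi_iter`; neither structure exists in the kernel under these names.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t; /* 512-byte units, as in the kernel */

/* Hypothetical custom payload for the single-operation approach. */
struct copy_payload {
	sector_t src_sector;
	sector_t dst_sector;
	uint32_t nr_bytes;
};

/* Stand-in for the bi_iter fields that describe one data range. */
struct range_iter {
	sector_t bi_sector;
	uint32_t bi_size; /* bytes */
};

/*
 * Two-operation approach: one copy becomes a source range and a
 * destination range, each of which fits in a plain iterator.
 */
static void copy_to_src_dst(const struct copy_payload *p,
			    struct range_iter *src, struct range_iter *dst)
{
	assert(p && src && dst);
	src->bi_sector = p->src_sector;
	src->bi_size = p->nr_bytes;
	dst->bi_sector = p->dst_sector;
	dst->bi_size = p->nr_bytes;
}
```

With one range per bio, a remapping driver only has to adjust `bi_sector`, which is why the two-operation approach leaves drivers such as dm-linear unchanged.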
The following code needs to be modified no matter how copy offloading is implemented:
- Request cloning. The code for checking the limits before requests are cloned compares blk_rq_sectors() with max_sectors. This is inappropriate for REQ_COPY_* requests.
- Request splitting. bio_split() assumes that bi_iter.bi_size represents the number of bytes affected on the medium.
- Code related to partially completing a request, e.g. blk_update_request().
- The code for merging block layer requests.
- blk_mq_end_request() since it calls blk_update_request().
- The plugging code because of the following test in the plugging code: blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE.
- The I/O accounting code (task_io_account_read()) since that code uses bio_has_data() and hence skips discard, secure erase and write zeroes requests:
static inline bool bio_has_data(struct bio *bio)
{
return bio && bio->bi_iter.bi_size &&
bio_op(bio) != REQ_OP_DISCARD &&
bio_op(bio) != REQ_OP_SECURE_ERASE &&
bio_op(bio) != REQ_OP_WRITE_ZEROES;
}
Block drivers will need to use the special_vec member of struct request to pass the copy offload parameters to the storage device. That member is used e.g. when a REQ_OP_DISCARD operation is submitted to an NVMe driver. The SCSI sd driver uses special_vec while processing an UNMAP or WRITE SAME command.
The device mapper may have to split a request. As an example, LVM is based on the dm-linear driver. A request that is submitted to an LVM volume has to be split if it affects multiple block devices. Copy offload requests that affect multiple block devices should be split or should be onloaded.
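The splitting rule described above can be sketched in user space as follows. The sketch assumes a linear volume described by a table of `struct lin_target` entries (a hypothetical name) and counts the pieces a copy has to be split into so that no piece crosses a target boundary on either the source or the destination side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical table entry: volume sectors [start, start + len). */
struct lin_target {
	sector_t start;
	sector_t len;
};

/* Sectors left in the target containing 'sector'; 0 if unmapped. */
static sector_t sectors_to_target_end(const struct lin_target *t, size_t n,
				      sector_t sector)
{
	for (size_t i = 0; i < n; i++)
		if (sector >= t[i].start && sector - t[i].start < t[i].len)
			return t[i].len - (sector - t[i].start);
	return 0;
}

/*
 * Number of pieces needed so that each piece stays inside one target
 * for both source and destination. Returns 0 if a range is unmapped.
 */
static size_t copy_split_count(const struct lin_target *t, size_t n,
			       sector_t src, sector_t dst, sector_t len)
{
	size_t pieces = 0;

	assert(t != NULL);
	while (len > 0) {
		sector_t s = sectors_to_target_end(t, n, src);
		sector_t d = sectors_to_target_end(t, n, dst);
		sector_t chunk = len;

		if (s == 0 || d == 0)
			return 0;
		if (chunk > s)
			chunk = s;
		if (chunk > d)
			chunk = d;
		src += chunk;
		dst += chunk;
		len -= chunk;
		pieces++;
	}
	return pieces;
}
```

A piece whose source and destination map to different underlying devices cannot be handled by a single device's copy command, so such pieces are candidates for onloading.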
The call chain for bio-based dm drivers is as follows:
dm_submit_bio(bio)
-> __split_and_process_bio(md, map, bio)
-> __split_and_process_non_flush(clone_info)
-> __clone_and_map_data_bio(clone_info, target_info, sector, len)
-> clone_bio(dm_target_io, bio, sector, len)
-> __map_bio(dm_target_io)
-> ti->type->map(dm_target_io, clone)
Process copy offload commands by translating REQ_COPY_OUT requests into NVMe Simple Copy commands.
From inside sd_revalidate_disk(), query the third-party copy VPD page. Extract the following parameters (see also SPC-6):
- MAXIMUM CSCD DESCRIPTOR COUNT
- MAXIMUM SEGMENT DESCRIPTOR COUNT
- MAXIMUM DESCRIPTOR LIST LENGTH
- Supported third-party copy commands.
- SUPPORTED CSCD DESCRIPTOR ID (0 or more)
- ROD type descriptor (0 or more)
- TOTAL CONCURRENT COPIES
- MAXIMUM IDENTIFIED CONCURRENT COPIES
- MAXIMUM SEGMENT LENGTH
From inside sd_init_command(), translate REQ_COPY_OUT into either EXTENDED COPY or POPULATE TOKEN + WRITE USING TOKEN.
Set the parameters in the copy offload commands as follows:
- We may have to set the STR bit. From SPC-6: "A sequential striped (STR) bit set to one specifies to the copy manager that the majority of the block device references in the parameter list represent sequential access of several block devices that are striped. This may be used by the copy manager to perform reads from a copy source block device at any time and in any order during processing of an EXTENDED COPY command as described in 6.6.5.3. A STR bit set to zero specifies to the copy manager that disk references, if any, may not be sequential."
- Set the LIST ID USAGE field to 3 and the LIST ID to 0. This means that neither "held data" nor the RECEIVE COPY STATUS command are supported. This improves security because the data that is being copied cannot be accessed via the LIST ID.
- We may have to set the G_SENSE (good with sense data) bit. From SPC-6: "If the G_SENSE bit is set to one and the copy manager completes the EXTENDED COPY command with GOOD status, then the copy manager shall include sense data with the GOOD status in which the sense key is set to COMPLETED, the additional sense code is set to EXTENDED COPY INFORMATION AVAILABLE, and the COMMAND-SPECIFIC INFORMATION field is set to the number of segment descriptors the copy manager has processed."
- Clear the IMMED bit.
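The parameter choices above can be collected in one place. The following sketch uses a hypothetical in-memory structure, `struct xcopy_hdr_opts`; it deliberately does not attempt to reproduce the on-wire layout of the EXTENDED COPY parameter list, which has to be taken from SPC-6.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the header fields; not the wire format. */
struct xcopy_hdr_opts {
	bool str;              /* sequential striped hint */
	uint8_t list_id_usage; /* LIST ID USAGE field */
	uint32_t list_id;      /* LIST ID field */
	bool g_sense;          /* request sense data with GOOD status */
	bool immed;            /* return before the copy has finished */
};

/* Fill in the fixed choices; STR and G_SENSE are left to the caller. */
static void xcopy_opts_init(struct xcopy_hdr_opts *o, bool striped,
			    bool want_sense)
{
	assert(o != NULL);
	o->str = striped;
	o->list_id_usage = 3; /* no held data, no RECEIVE COPY STATUS */
	o->list_id = 0;
	o->g_sense = want_sense;
	o->immed = false;     /* complete only when the copy is done */
}
```

Keeping STR and G_SENSE as arguments reflects that the text leaves both open ("we may have to set").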
To submit copy offload requests from user space, we need:
- A system call for passing these requests, e.g. copy_file_range() or an io_uring operation.
- A copy offload parameter format description in the user space ABI. The parameters include source device, source ranges, destination device and destination ranges.
- A flag that indicates whether or not it is acceptable to fall back to onloading the copy operation.
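As an illustration of the fallback flag, the sketch below wraps the existing copy_file_range() system call. The `allow_fallback` parameter and the `copy_range()` helper are hypothetical, and the fallback is implemented here in user space with read() and write(), whereas the flag discussed above would be interpreted by the kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy len bytes from fd_in to fd_out at the current file offsets.
 * If copy_file_range() fails and allow_fallback is set, the rest is
 * copied through a user-space buffer. Returns 0 on success, -1 on error.
 * EINTR is not retried in this sketch.
 */
static int copy_range(int fd_in, int fd_out, size_t len, bool allow_fallback)
{
	char buf[4096];

	assert(fd_in >= 0 && fd_out >= 0);
	while (len > 0) {
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);

		if (n > 0) {
			len -= (size_t)n;
			continue;
		}
		if (n == 0 || !allow_fallback)
			return -1; /* unexpected EOF, or fallback not allowed */
		break;             /* error: switch to the buffered copy */
	}
	while (len > 0) {
		size_t want = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t n = read(fd_in, buf, want);

		if (n <= 0)
			return -1;
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(fd_out, buf + off, (size_t)(n - off));

			if (w <= 0)
				return -1;
			off += w;
		}
		len -= (size_t)n;
	}
	return 0;
}
```

A caller that must not consume host bandwidth would pass `false` and handle the error itself; a caller that only cares about the result would pass `true`.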
To do: define which aspects of copy offloading should be configurable through new sysfs parameters under /sys/block/*/queue/.