Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.

O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other filesystems, O_DIRECT does not imply O_SYNC.
While data is written directly to VDEV disks, metadata will not be
synced until the associated TXG is synced.

For both O_DIRECT read and write requests, the offset and request size
must, at a minimum, be PAGE_SIZE aligned. In the event they are not,
EINVAL is returned unless the direct property is set to always (see
below).
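
To illustrate the alignment rules, here is a minimal user-space sketch
of a page-aligned O_DIRECT write (plain POSIX; the file path is a
placeholder, and on Linux `_GNU_SOURCE` is needed for O_DIRECT):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            long pagesize = sysconf(_SC_PAGESIZE);
            void *buf;
            int fd;

            /* Placeholder path on a ZFS dataset. */
            fd = open("/tank/fs/file", O_WRONLY | O_CREAT | O_DIRECT, 0644);
            if (fd == -1) {
                    perror("open");
                    return (1);
            }

            /*
             * Both the buffer address and the request size are
             * PAGE_SIZE aligned.
             */
            if (posix_memalign(&buf, pagesize, pagesize) != 0) {
                    close(fd);
                    return (1);
            }
            memset(buf, 'a', pagesize);

            /*
             * A misaligned offset or size would instead fail with
             * EINVAL (unless direct=always, see below).
             */
            if (pwrite(fd, buf, pagesize, 0) == -1)
                    perror("pwrite");

            free(buf);
            close(fd);
            return (0);
    }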

For O_DIRECT writes:
The request must also be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
the request is block aligned and a cached copy of the buffer exists in
the ARC, the buffer will be discarded from the ARC, forcing all further
reads to retrieve the data from disk.
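
One way for a caller to size such writes is st_blksize, which ZFS
reports as the file's block size (equal to the dataset recordsize once
a file has grown past a single block); a sketch under that assumption:

    #include <stddef.h>
    #include <sys/stat.h>

    /*
     * Sketch: choose a Direct I/O write size for an open fd, assuming
     * st_blksize reflects the dataset's block size (recordsize).
     */
    static size_t
    dio_write_size(int fd)
    {
            struct stat st;

            if (fstat(fd, &st) != 0)
                    return (0);
            return ((size_t)st.st_blksize);
    }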

For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. In the event
that the requested data is buffered (in the ARC), it will simply be
copied from the ARC into the user buffer.

For both O_DIRECT writes and reads, the O_DIRECT flag will be ignored
in the event that the file's contents are mmap'ed. In this case, all
requests that are at least PAGE_SIZE aligned will just fall back to the
buffered paths. If, however, the request is not PAGE_SIZE aligned,
EINVAL is returned as always, regardless of whether the file's contents
are mmap'ed.

Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Dedup
Encryption
Erasure Coding

There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.

FreeBSD - FreeBSD is able to place user pages under write protection,
          so any data in the user buffers written directly down to the
          VDEV disks is guaranteed not to change. There is no data
          integrity concern with O_DIRECT writes.

Linux - Linux is not able to place anonymous user pages under write
        protection. Because of this, if the user decides to manipulate
        the page contents while the write operation is occurring, data
        integrity can not be guaranteed. However, there is a module
        parameter `zfs_vdev_direct_write_verify_pct` that controls the
        percentage of O_DIRECT writes to a top-level VDEV for which a
        checksum verification is run before the contents of the user
        buffers are committed to disk. In the event of a checksum
        verification failure, the write will be redirected through the
        ARC. The default value of `zfs_vdev_direct_write_verify_pct`
        is 2 percent of Direct I/O writes to a top-level VDEV. The
        number of O_DIRECT write checksum verification errors can be
        observed by running `zpool status -d`, which will list all
        verification errors that have occurred on a top-level VDEV.
        Along with `zpool status`, a ZED event will be issued as
        `dio_verify` when a checksum verification error occurs.
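
On Linux the tunable is exposed through the standard module-parameter
path under /sys/module; a small sketch to read it back (the sysfs
layout assumed here is the stock one, nothing specific to this change):

    #include <stdio.h>

    /* Sketch: print the current Direct I/O write-verify percentage. */
    int
    main(void)
    {
            FILE *f = fopen("/sys/module/zfs/parameters/"
                "zfs_vdev_direct_write_verify_pct", "r");
            int pct;

            if (f == NULL)
                    return (1);
            if (fscanf(f, "%d", &pct) == 1)
                    printf("zfs_vdev_direct_write_verify_pct=%d\n", pct);
            fclose(f);
            return (0);
    }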

A new dataset property `direct` has been added with the following 3
allowable values:

disabled - Accepts the O_DIRECT flag, but silently ignores it and
           treats the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
           read/write IO requests when the O_DIRECT flag is used.
always   - Treats every read/write IO request as though it passed
           O_DIRECT, and performs Direct I/O when the alignment
           restrictions are met; otherwise the request is redirected
           through the ARC. This setting will never cause a request
           to fail.
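
From the shell this is simply `zfs set direct=always <dataset>`. As a
programmatic sketch using the public libzfs handle API (the dataset
name is a placeholder; link with -lzfs):

    #include <libzfs.h>

    /* Sketch: set the new direct property on a dataset via libzfs. */
    int
    main(void)
    {
            libzfs_handle_t *g = libzfs_init();
            zfs_handle_t *zhp;
            int err = 1;

            if (g == NULL)
                    return (1);
            zhp = zfs_open(g, "tank/fs", ZFS_TYPE_FILESYSTEM);
            if (zhp != NULL) {
                    err = zfs_prop_set(zhp, "direct", "always");
                    zfs_close(zhp);
            }
            libzfs_fini(g);
            return (err != 0);
    }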

Signed-off-by: Brian Atkinson <[email protected]>
Co-authored-by: Mark Maybee <[email protected]>
Co-authored-by: Matt Macy <[email protected]>
Co-authored-by: Brian Behlendorf <[email protected]>
4 people committed Apr 22, 2024
1 parent f4f1561 commit e4f9167
Showing 110 changed files with 9,712 additions and 7,316 deletions.
30 changes: 25 additions & 5 deletions cmd/zpool/zpool_main.c
@@ -420,7 +420,7 @@ get_usage(zpool_help_t idx)
"[<device> ...]\n"));
case HELP_STATUS:
return (gettext("\tstatus [--power] [-c [script1,script2,...]] "
"[-DegiLpPstvx] [-T d|u] [pool] ...\n"
"[-dDegiLpPstvx] [-T d|u] [pool] ...\n"
"\t [interval [count]]\n"));
case HELP_UPGRADE:
return (gettext("\tupgrade\n"
@@ -2207,6 +2207,7 @@ typedef struct status_cbdata {
boolean_t cb_print_unhealthy;
boolean_t cb_print_status;
boolean_t cb_print_slow_ios;
boolean_t cb_print_dio_verify;
boolean_t cb_print_vdev_init;
boolean_t cb_print_vdev_trim;
vdev_cmd_data_list_t *vcdl;
@@ -2439,7 +2440,7 @@ print_status_config(zpool_handle_t *zhp, status_cbdata_t *cb, const char *name,
uint_t c, i, vsc, children;
pool_scan_stat_t *ps = NULL;
vdev_stat_t *vs;
char rbuf[6], wbuf[6], cbuf[6];
char rbuf[6], wbuf[6], cbuf[6], dbuf[6];
char *vname;
uint64_t notpresent;
spare_cbdata_t spare_cb;
@@ -2557,6 +2558,17 @@ print_status_config(zpool_handle_t *zhp, status_cbdata_t *cb, const char *name,
printf(" %5s", "-");
}
}
if (VDEV_STAT_VALID(vs_dio_verify_errors, vsc) &&
cb->cb_print_dio_verify) {
zfs_nicenum(vs->vs_dio_verify_errors, dbuf,
sizeof (dbuf));

if (cb->cb_literal)
printf(" %5llu",
(u_longlong_t)vs->vs_dio_verify_errors);
else
printf(" %5s", dbuf);
}
}

if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_NOT_PRESENT,
@@ -9169,6 +9181,10 @@ status_callback(zpool_handle_t *zhp, void *data)
printf_color(ANSI_BOLD, " %5s", gettext("POWER"));
}

if (cbp->cb_print_dio_verify) {
printf_color(ANSI_BOLD, " %5s", gettext("DIO"));
}

if (cbp->vcdl != NULL)
print_cmd_columns(cbp->vcdl, 0);

@@ -9217,10 +9233,11 @@ status_callback(zpool_handle_t *zhp, void *data)
}

/*
* zpool status [-c [script1,script2,...]] [-DegiLpPstvx] [--power] [-T d|u] ...
* [pool] [interval [count]]
* zpool status [-c [script1,script2,...]] [-dDegiLpPstvx] [--power] ...
* [-T d|u] [pool] [interval [count]]
*
* -c CMD For each vdev, run command CMD
* -d Display Direct I/O write verify errors
* -D Display dedup status (undocumented)
* -e Display only unhealthy vdevs
* -g Display guid for individual vdev name.
@@ -9253,7 +9270,7 @@ zpool_do_status(int argc, char **argv)
};

/* check options */
while ((c = getopt_long(argc, argv, "c:DegiLpPstT:vx", long_options,
while ((c = getopt_long(argc, argv, "c:dDegiLpPstT:vx", long_options,
NULL)) != -1) {
switch (c) {
case 'c':
@@ -9280,6 +9297,9 @@
}
cmd = optarg;
break;
case 'd':
cb.cb_print_dio_verify = B_TRUE;
break;
case 'D':
cb.cb_dedup_stats = B_TRUE;
break;
46 changes: 37 additions & 9 deletions cmd/ztest.c
@@ -2264,6 +2264,13 @@ ztest_replay_write(void *arg1, void *arg2, boolean_t byteswap)
if (ztest_random(4) != 0) {
int prefetch = ztest_random(2) ?
DMU_READ_PREFETCH : DMU_READ_NO_PREFETCH;

/*
* We will randomly set when to do O_DIRECT on a read.
*/
if (ztest_random(4) == 0)
prefetch |= DMU_DIRECTIO;

ztest_block_tag_t rbt;

VERIFY(dmu_read(os, lr->lr_foid, offset,
@@ -2815,6 +2822,13 @@ ztest_io(ztest_ds_t *zd, uint64_t object, uint64_t offset)
enum ztest_io_type io_type;
uint64_t blocksize;
void *data;
uint32_t dmu_read_flags = DMU_READ_NO_PREFETCH;

/*
* We will randomly set when to do O_DIRECT on a read.
*/
if (ztest_random(4) == 0)
dmu_read_flags |= DMU_DIRECTIO;

VERIFY0(dmu_object_info(zd->zd_os, object, &doi));
blocksize = doi.doi_data_block_size;
@@ -2880,7 +2894,7 @@ ztest_io(ztest_ds_t *zd, uint64_t object, uint64_t offset)
(void) pthread_rwlock_unlock(&ztest_name_lock);

VERIFY0(dmu_read(zd->zd_os, object, offset, blocksize, data,
DMU_READ_NO_PREFETCH));
dmu_read_flags));

(void) ztest_write(zd, object, offset, blocksize, data);
break;
@@ -5045,6 +5059,13 @@ ztest_dmu_read_write(ztest_ds_t *zd, uint64_t id)
uint64_t stride = 123456789ULL;
uint64_t width = 40;
int free_percent = 5;
uint32_t dmu_read_flags = DMU_READ_PREFETCH;

/*
* We will randomly set when to do O_DIRECT on a read.
*/
if (ztest_random(4) == 0)
dmu_read_flags |= DMU_DIRECTIO;

/*
* This test uses two objects, packobj and bigobj, that are always
@@ -5123,10 +5144,10 @@ ztest_dmu_read_write(ztest_ds_t *zd, uint64_t id)
* Read the current contents of our objects.
*/
error = dmu_read(os, packobj, packoff, packsize, packbuf,
DMU_READ_PREFETCH);
dmu_read_flags);
ASSERT0(error);
error = dmu_read(os, bigobj, bigoff, bigsize, bigbuf,
DMU_READ_PREFETCH);
dmu_read_flags);
ASSERT0(error);

/*
@@ -5244,9 +5265,9 @@ ztest_dmu_read_write(ztest_ds_t *zd, uint64_t id)
void *bigcheck = umem_alloc(bigsize, UMEM_NOFAIL);

VERIFY0(dmu_read(os, packobj, packoff,
packsize, packcheck, DMU_READ_PREFETCH));
packsize, packcheck, dmu_read_flags));
VERIFY0(dmu_read(os, bigobj, bigoff,
bigsize, bigcheck, DMU_READ_PREFETCH));
bigsize, bigcheck, dmu_read_flags));

ASSERT0(memcmp(packbuf, packcheck, packsize));
ASSERT0(memcmp(bigbuf, bigcheck, bigsize));
@@ -5336,6 +5357,13 @@ ztest_dmu_read_write_zcopy(ztest_ds_t *zd, uint64_t id)
dmu_buf_t *bonus_db;
arc_buf_t **bigbuf_arcbufs;
dmu_object_info_t doi;
uint32_t dmu_read_flags = DMU_READ_PREFETCH;

/*
* We will randomly set when to do O_DIRECT on a read.
*/
if (ztest_random(4) == 0)
dmu_read_flags |= DMU_DIRECTIO;

size = sizeof (ztest_od_t) * OD_ARRAY_SIZE;
od = umem_alloc(size, UMEM_NOFAIL);
@@ -5466,10 +5494,10 @@ ztest_dmu_read_write_zcopy(ztest_ds_t *zd, uint64_t id)
*/
if (i != 0 || ztest_random(2) != 0) {
error = dmu_read(os, packobj, packoff,
packsize, packbuf, DMU_READ_PREFETCH);
packsize, packbuf, dmu_read_flags);
ASSERT0(error);
error = dmu_read(os, bigobj, bigoff, bigsize,
bigbuf, DMU_READ_PREFETCH);
bigbuf, dmu_read_flags);
ASSERT0(error);
}
compare_and_update_pbbufs(s, packbuf, bigbuf, bigsize,
@@ -5529,9 +5557,9 @@ ztest_dmu_read_write_zcopy(ztest_ds_t *zd, uint64_t id)
void *bigcheck = umem_alloc(bigsize, UMEM_NOFAIL);

VERIFY0(dmu_read(os, packobj, packoff,
packsize, packcheck, DMU_READ_PREFETCH));
packsize, packcheck, dmu_read_flags));
VERIFY0(dmu_read(os, bigobj, bigoff,
bigsize, bigcheck, DMU_READ_PREFETCH));
bigsize, bigcheck, dmu_read_flags));

ASSERT0(memcmp(packbuf, packcheck, packsize));
ASSERT0(memcmp(bigbuf, bigcheck, bigsize));
179 changes: 179 additions & 0 deletions config/kernel-get-user-pages.m4
@@ -0,0 +1,179 @@
dnl #
dnl # The get_user_pages_unlocked() function was not available until 4.0.
dnl # In earlier kernels (< 4.0), get_user_pages() is available.
dnl #
dnl # 4.0 API change,
dnl # long get_user_pages_unlocked(struct task_struct *tsk,
dnl # struct mm_struct *mm, unsigned long start, unsigned long nr_pages,
dnl # int write, int force, struct page **pages)
dnl #
dnl # 4.8 API change,
dnl # long get_user_pages_unlocked(unsigned long start,
dnl # unsigned long nr_pages, int write, int force, struct page **page)
dnl #
dnl # 4.9 API change,
dnl # long get_user_pages_unlocked(unsigned long start, int nr_pages,
dnl # struct page **pages, unsigned int gup_flags)
dnl #

dnl #
dnl # Check available get_user_pages/_unlocked interfaces.
dnl #
AC_DEFUN([ZFS_AC_KERNEL_SRC_GET_USER_PAGES], [
ZFS_LINUX_TEST_SRC([get_user_pages_unlocked_gup_flags], [
#include <linux/mm.h>
], [
unsigned long start = 0;
unsigned long nr_pages = 1;
unsigned int gup_flags = 0;
struct page **pages = NULL;
long ret __attribute__ ((unused));
ret = get_user_pages_unlocked(start, nr_pages, pages,
gup_flags);
])
ZFS_LINUX_TEST_SRC([get_user_pages_unlocked_write_flag], [
#include <linux/mm.h>
], [
unsigned long start = 0;
unsigned long nr_pages = 1;
int write = 0;
int force = 0;
long ret __attribute__ ((unused));
struct page **pages = NULL;
ret = get_user_pages_unlocked(start, nr_pages, write, force,
pages);
])
ZFS_LINUX_TEST_SRC([get_user_pages_unlocked_task_struct], [
#include <linux/mm.h>
], [
struct task_struct *tsk = NULL;
struct mm_struct *mm = NULL;
unsigned long start = 0;
unsigned long nr_pages = 1;
int write = 0;
int force = 0;
struct page **pages = NULL;
long ret __attribute__ ((unused));
ret = get_user_pages_unlocked(tsk, mm, start, nr_pages, write,
force, pages);
])
ZFS_LINUX_TEST_SRC([get_user_pages_unlocked_task_struct_gup_flags], [
#include <linux/mm.h>
], [
struct task_struct *tsk = NULL;
struct mm_struct *mm = NULL;
unsigned long start = 0;
unsigned long nr_pages = 1;
struct page **pages = NULL;
unsigned int gup_flags = 0;
long ret __attribute__ ((unused));
ret = get_user_pages_unlocked(tsk, mm, start, nr_pages,
pages, gup_flags);
])
ZFS_LINUX_TEST_SRC([get_user_pages_task_struct], [
#include <linux/mm.h>
], [
struct task_struct *tsk = NULL;
struct mm_struct *mm = NULL;
struct vm_area_struct **vmas = NULL;
unsigned long start = 0;
unsigned long nr_pages = 1;
int write = 0;
int force = 0;
struct page **pages = NULL;
int ret __attribute__ ((unused));
ret = get_user_pages(tsk, mm, start, nr_pages, write,
force, pages, vmas);
])
])

dnl #
dnl # Supported get_user_pages/_unlocked interfaces checked newest to oldest.
dnl # We first check for get_user_pages_unlocked as that is available in
dnl # newer kernels.
dnl #
AC_DEFUN([ZFS_AC_KERNEL_GET_USER_PAGES], [
dnl #
dnl # Current API (as of 4.9) of get_user_pages_unlocked
dnl #
AC_MSG_CHECKING([whether get_user_pages_unlocked() takes gup flags])
ZFS_LINUX_TEST_RESULT([get_user_pages_unlocked_gup_flags], [
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_GET_USER_PAGES_UNLOCKED_GUP_FLAGS, 1,
[get_user_pages_unlocked() takes gup flags])
], [
AC_MSG_RESULT(no)
dnl #
dnl # 4.8 API change, get_user_pages_unlocked
dnl #
AC_MSG_CHECKING(
[whether get_user_pages_unlocked() takes write flag])
ZFS_LINUX_TEST_RESULT([get_user_pages_unlocked_write_flag], [
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_GET_USER_PAGES_UNLOCKED_WRITE_FLAG, 1,
[get_user_pages_unlocked() takes write flag])
], [
AC_MSG_RESULT(no)
dnl #
dnl # 4.0-4.3, 4.5-4.7 API, get_user_pages_unlocked
dnl #
AC_MSG_CHECKING(
[whether get_user_pages_unlocked() takes task_struct])
ZFS_LINUX_TEST_RESULT(
[get_user_pages_unlocked_task_struct], [
AC_MSG_RESULT(yes)
AC_DEFINE(
HAVE_GET_USER_PAGES_UNLOCKED_TASK_STRUCT, 1,
[get_user_pages_unlocked() takes task_struct])
], [
AC_MSG_RESULT(no)
dnl #
dnl # 4.4 API, get_user_pages_unlocked
dnl #
AC_MSG_CHECKING(
[whether get_user_pages_unlocked() takes task_struct, gup_flags])
ZFS_LINUX_TEST_RESULT(
[get_user_pages_unlocked_task_struct_gup_flags], [
AC_MSG_RESULT(yes)
AC_DEFINE(
HAVE_GET_USER_PAGES_UNLOCKED_TASK_STRUCT_GUP_FLAGS, 1,
[get_user_pages_unlocked() takes task_struct, gup_flags])
], [
AC_MSG_RESULT(no)
dnl #
dnl # get_user_pages
dnl #
AC_MSG_CHECKING(
[whether get_user_pages() takes struct task_struct])
ZFS_LINUX_TEST_RESULT(
[get_user_pages_task_struct], [
AC_MSG_RESULT(yes)
AC_DEFINE(
HAVE_GET_USER_PAGES_TASK_STRUCT, 1,
[get_user_pages() takes task_struct])
], [
dnl #
dnl # If we cannot map the user's
dnl # pages in then we cannot do
dnl # Direct I/O
dnl #
ZFS_LINUX_TEST_ERROR([Direct I/O])
])
])
])
])
])
])
4 changes: 2 additions & 2 deletions config/kernel-vfs-direct_IO.m4
@@ -1,5 +1,5 @@
dnl #
dnl # Check for direct IO interfaces.
dnl # Check for Direct I/O interfaces.
dnl #
AC_DEFUN([ZFS_AC_KERNEL_SRC_VFS_DIRECT_IO], [
ZFS_LINUX_TEST_SRC([direct_io_iter], [
@@ -100,7 +100,7 @@ AC_DEFUN([ZFS_AC_KERNEL_VFS_DIRECT_IO], [
AC_DEFINE(HAVE_VFS_DIRECT_IO_IOVEC, 1,
[aops->direct_IO() uses iovec])
],[
ZFS_LINUX_TEST_ERROR([direct IO])
ZFS_LINUX_TEST_ERROR([Direct I/O])
AC_MSG_RESULT([no])
])
])