End-to-end checksums are a key feature of ZFS and an important differentiator for ZFS over other RAID implementations and filesystems. Advantages of end-to-end checksums include:

detects data corruption upon reading from media

blocks that are detected as corrupt are automatically repaired if possible, by using the RAID protection in suitably configured pools, or redundant copies (see the zfs copies property)

periodic scrubs can check data to detect and repair latent media degradation (bit rot) and corruption from other sources

checksums on ZFS replication streams, zfs send and zfs receive, ensure the data received is not corrupted by intervening storage or transport mechanisms
The checksum algorithms in ZFS can be changed for datasets (filesystems +or volumes). The checksum algorithm used for each block is stored in the +block pointer (metadata). The block checksum is calculated when the +block is written, so changing the algorithm only affects writes +occurring after the change.
The checksum algorithm for a dataset can be changed by setting the checksum property:

zfs set checksum=sha256 pool_name/dataset_name
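To confirm the value that new writes will use, the property can be read back with zfs get (same placeholder pool and dataset names as above):

zfs get checksum pool_name/dataset_name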
Checksum | Ok for dedup and nopwrite? | Compatible with other ZFS implementations? | Notes
---|---|---|---
on | see notes | yes | uses the default algorithm (fletcher4); deduped datasets use sha256
off | no | yes | Do not use off
fletcher2 | no | yes | Deprecated implementation of the Fletcher checksum; use fletcher4 instead
fletcher4 | no | yes | Fletcher algorithm, also used for zfs send streams
sha256 | yes | yes | Default for deduped datasets
noparity | no | yes | Do not use noparity
sha512 | yes | requires pool feature org.illumos:sha512 | salted
skein | yes | requires pool feature org.illumos:skein | salted
edonr | see notes | requires pool feature org.illumos:edonr | salted. In an abundance of caution, Edon-R requires verification when used with dedup, so it will automatically use verify
blake3 | yes | requires pool feature org.openzfs:blake3 | salted
ZFS has the ability to offload checksum operations to Intel QuickAssist Technology (QAT) adapters.
+Some ZFS features use microbenchmarks when the zfs.ko
kernel module
+is loaded to determine the optimal algorithm for checksums. The results
+of the microbenchmarks are observable in the /proc/spl/kstat/zfs
+directory. The winning algorithm is reported as the “fastest” and
+becomes the default. The default can be overridden by setting zfs module
+parameters.
Checksum | Results Filename | Module Parameter
---|---|---
Fletcher4 | /proc/spl/kstat/zfs/fletcher_4_bench | zfs_fletcher_4_impl
all-other | /proc/spl/kstat/zfs/chksum_bench | zfs_blake3_impl, zfs_sha256_impl, zfs_sha512_impl
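For example, the benchmark results can be read from the files listed above, and an implementation can be selected by hand through the matching module parameter; fastest simply re-selects the benchmark winner, and the full list of accepted values depends on the CPU features and the ZFS build:

cat /proc/spl/kstat/zfs/fletcher_4_bench
cat /proc/spl/kstat/zfs/chksum_bench
echo fastest | sudo tee /sys/module/zfs/parameters/zfs_fletcher_4_impl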
While it may be tempting to disable checksums to improve CPU performance, it is widely considered by the ZFS community to be an extraordinarily bad idea. Don't disable checksums.
+ZFS on-disk formats were originally versioned with a single number, +which increased whenever the format changed. The numbered approach was +suitable when development of ZFS was driven by a single organisation.
+For distributed development of OpenZFS, version numbering was +unsuitable. Any change to the number would have required agreement, +across all implementations, of each change to the on-disk format.
+OpenZFS feature flags – an alternative to traditional version numbering +– allow a uniquely named pool property for each change to the on-disk +format. This approach supports:
+format changes that are independent
format changes that depend on each other.
Where all features that are used by a pool are supported by multiple +implementations of OpenZFS, the on-disk format is portable across those +implementations.
+Features that are exclusive when enabled should be periodically ported +to all distributions.
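As a quick check, the feature flags known to a pool and their state (disabled, enabled, or active) can be listed through the feature@ pool properties, and an individual feature can be enabled by name; tank is a placeholder pool name:

zpool get all tank | grep feature@
zpool set feature@hole_birth=enabled tank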
+ZFS Feature Flags +(Christopher Siden, 2012-01, in the Internet +Archive Wayback Machine) in particular: “… Legacy version numbers still +exist for pool versions 1-28 …”.
+zpool-features(7) man page - OpenZFS
+zpool-features (5) – illumos
+Feature Flag | Read-Only Compatible | OpenZFS (Linux, FreeBSD 13+) | FreeBSD pre OpenZFS | Illumos | Joyent | NetBSD | Nexenta | OmniOS CE | OpenZFS on OS X | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.6.5.11 | 0.7.13 | 0.8.6 | 2.0.7 | 2.1.14 | 2.2.2 | master | 12.1.0 | 12.2.0 | master | master | 9.3 | main | 4.0.5-FP | master | r151046 | r151048 | master | 2.1.6 | 2.2.0 | 2.2.2 | main | ||
org.zfsonlinux:allocation_classes | yes | no | no | yes | yes | yes | yes | yes | no | yes | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
com.delphix:async_destroy | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
org.openzfs:blake3 | no | no | no | no | no | no | yes | yes | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes | yes |
com.fudosecurity:block_cloning | yes | no | no | no | no | no | yes | yes | no | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes |
com.datto:bookmark_v2 | no | no | no | yes | yes | yes | yes | yes | no | no | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
com.delphix:bookmark_written | no | no | no | no | yes | yes | yes | yes | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes | yes |
com.delphix:bookmarks | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
com.nexenta:class_of_storage | yes | no | no | no | no | no | no | no | no | no | no | no | no | no | yes | yes | no | no | no | no | no | no | no |
org.openzfs:device_rebuild | yes | no | no | no | yes | yes | yes | yes | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes | yes |
com.delphix:device_removal | no | no | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | no | no | no | yes | yes | yes | yes | yes | yes | yes | yes |
org.openzfs:draid | no | no | no | no | no | yes | yes | yes | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes | yes |
org.illumos:edonr | no | yes1 | yes1 | yes1 | yes1 | yes1 | yes1 | yes | no | no | yes | yes | no | no | no | yes | yes | yes | yes | yes | yes | yes | yes |
com.delphix:embedded_data | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | no | yes | yes | yes | yes | yes | yes | yes | yes |
com.delphix:empty_bpobj | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
com.delphix:enabled_txg | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
com.datto:encryption | no | no | no | yes | yes | yes | yes | yes | no | no | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
com.delphix:extensible_dataset | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
com.joyent:filesystem_limits | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
com.delphix:head_errlog | no | no | no | no | no | no | yes | yes | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes | yes |
com.delphix:hole_birth | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
org.open-zfs:large_blocks | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | no | yes | yes | yes | yes | yes | yes | yes | yes |
org.zfsonlinux:large_dnode | no | no | yes | yes | yes | yes | yes | yes | no | yes | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
com.delphix:livelist | yes | no | no | no | yes | yes | yes | yes | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes | yes |
com.delphix:log_spacemap | yes | no | no | no | yes | yes | yes | yes | no | no | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
org.illumos:lz4_compress | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
com.nexenta:meta_devices | yes | no | no | no | no | no | no | no | no | no | no | no | no | no | yes | yes | no | no | no | no | no | no | no |
com.joyent:multi_vdev_crash_dump | no | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
com.delphix:obsolete_counts | yes | no | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | no | no | no | yes | yes | yes | yes | yes | yes | yes | yes |
org.zfsonlinux:project_quota | yes | no | no | yes | yes | yes | yes | yes | no | no | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
org.openzfs:raidz_expansion | no | no | no | no | no | no | no | yes | no | no | no | no | no | no | no | no | no | no | no | no | no | yes | yes |
com.delphix:redacted_datasets | no | no | no | no | yes | yes | yes | yes | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes | yes |
com.delphix:redaction_bookmarks | no | no | no | no | yes | yes | yes | yes | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes | yes |
com.delphix:redaction_list_spill | no | no | no | no | no | no | no | yes | no | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes |
com.datto:resilver_defer | yes | no | no | yes | yes | yes | yes | yes | no | no | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
org.illumos:sha512 | no | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | no | yes | yes | yes | yes | yes | yes | yes | yes |
org.illumos:skein | no | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | no | yes | yes | yes | yes | yes | yes | yes | yes |
com.delphix:spacemap_histogram | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
com.delphix:spacemap_v2 | yes | no | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
org.zfsonlinux:userobj_accounting | yes | no | yes | yes | yes | yes | yes | yes | no | no | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
com.nexenta:vdev_properties | yes | no | no | no | no | no | no | no | no | no | no | no | no | no | yes | yes | no | no | no | no | no | no | no |
com.klarasystems:vdev_zaps_v2 | no | no | no | no | no | no | yes | yes | no | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes |
com.nexenta:wbc | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | yes | no | no | no | no | no | no | no |
org.openzfs:zilsaxattr | yes | no | no | no | no | no | yes | yes | no | no | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
com.delphix:zpool_checkpoint | yes | no | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | no | no | no | no | yes | yes | yes | yes | yes | yes | yes |
org.freebsd:zstd_compress | no | no | no | no | yes | yes | yes | yes | no | no | no | no | no | no | no | no | no | no | no | yes | yes | yes | yes |
The table is generated by parsing man pages for feature flags, and is entirely dependent on good, accurate documentation.
Last updated on 2023-12-25T19:17:15.361178Z using compatibility_matrix.py.
tl;dr: RAIDZ is effective for large block sizes and sequential workloads.
RAIDZ is a variation on RAID-5 that allows for better distribution of parity and eliminates the RAID-5 "write hole" (in which data and parity become inconsistent after a power loss). Data and parity are striped across all disks within a raidz group.

A raidz group can have single, double, or triple parity, meaning that the raidz group can sustain one, two, or three failures, respectively, without losing any data. The raidz1 vdev type specifies a single-parity raidz group; the raidz2 vdev type specifies a double-parity raidz group; and the raidz3 vdev type specifies a triple-parity raidz group. The raidz vdev type is an alias for raidz1.
A raidz group of N disks of size X with P parity disks can hold +approximately (N-P)*X bytes and can withstand P devices failing without +losing data. The minimum number of devices in a raidz group is one more +than the number of parity disks. The recommended number is between 3 and 9 +to help increase performance.
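For example, under this approximation a raidz2 group (P = 2) of N = 6 disks of 10 TB each holds roughly (6 - 2) * 10 TB = 40 TB of data and keeps working through the failure of any two of its disks.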
Actual used space for a block in RAIDZ depends on several factors:

the minimum write size is the disk sector size (which can be set via the ashift vdev parameter)

stripe width in RAIDZ is dynamic: a stripe starts with at least one part of a data block and can hold up to (number of disks minus number of parity disks) parts of the data block

a data block of recordsize bytes is split into sector-size parts and written across the stripes of the RAIDZ vdev, so each stripe holds a part of the block

in addition to the data, one, two, or three parity sectors are written per stripe, one per parity disk; so, for a raidz2 of 5 disks each stripe has 3 blocks of data and 2 blocks of parity
Due to these inputs, if recordsize is less than or equal to the sector size, RAIDZ's parity overhead is effectively the same as that of a mirror with the same redundancy. For example, for a raidz1 of 3 disks with ashift=12 and recordsize=4K we will allocate on disk:

one 4K block of data

one 4K parity block

and the usable space ratio will be 50%, the same as with a double (two-way) mirror.
Another example, for ashift=12 and recordsize=128K on a raidz1 of 3 disks:

the total stripe width is 3

one stripe can hold up to 2 data parts of 4K each, because of the 1 parity block

we will have 128K / 8K = 16 stripes, each with 8K of data and 4K of parity

16 stripes of 12K each means we write 192K to store 128K

so the usable space ratio in this case is 66%.
+The more disks RAIDZ has, the wider the stripe, the greater the space +efficiency.
+You can find actual parity cost per RAIDZ size here:
+(source)
Because a block spans the full stripe width, a single block write touches every disk in the group. As a result, in the worst case a RAIDZ vdev delivers the write IOPS of its single slowest disk.
+Todo
+This page is a draft.
+This page contains tips for troubleshooting ZFS on Linux and what info +developers might want for bug triage.
+Log files can be very useful for troubleshooting. In some cases, +interesting information is stored in multiple log files that are +correlated to system events.
+Pro tip: logging infrastructure tools like elasticsearch, fluentd, +influxdb, or splunk can simplify log analysis and event correlation.
Typically, Linux kernel log messages are available from dmesg -T, /var/log/syslog, or wherever kernel log messages are sent (e.g. by rsyslogd).
The ZFS kernel modules use an internal log buffer for detailed logging information. This log information is available in the pseudo file /proc/spl/kstat/zfs/dbgmsg for ZFS builds where the ZFS module parameter zfs_dbgmsg_enable is set to 1.
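For example, the parameter can be turned on at runtime and the buffer read back (both require root):

echo 1 | sudo tee /sys/module/zfs/parameters/zfs_dbgmsg_enable
sudo cat /proc/spl/kstat/zfs/dbgmsg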
Symptom: a zfs or zpool command appears hung, does not return, and is not killable
Likely cause: kernel thread hung or panic
+Log files of interest: Generic Kernel Log, +ZFS Kernel Module Debug Messages
+Important information: if a kernel thread is stuck, then a backtrace of +the stuck thread can be in the logs. In some cases, the stuck thread is +not logged until the deadman timer expires. See also debug +tunables
+ZFS uses an event-based messaging interface for communication of
+important events to other consumers running on the system. The ZFS Event
+Daemon (zed) is a userland daemon that listens for these events and
+processes them. zed is extensible so you can write shell scripts or
+other programs that subscribe to events and take action. For example,
+the script usually installed at /etc/zfs/zed.d/all-syslog.sh
writes
+a formatted event message to syslog
. See the man page for zed(8)
+for more information.
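A quick way to check that the daemon is running and to see which zedlets are installed (the service name may vary slightly between distributions):

systemctl status zfs-zed
ls -l /etc/zfs/zed.d/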
A history of events is also available via the zpool events
command.
+This history begins at ZFS kernel module load and includes events from
+any pool. These events are stored in RAM and limited in count to a value
+determined by the kernel tunable
+zfs_event_len_max.
+zed
has an internal throttling mechanism to prevent overconsumption
+of system resources processing ZFS events.
More detailed information about events is observable using zpool events -v. The contents of the verbose events are subject to change, based on the event and information available at the time of the event.
Each event has a class identifier used for filtering event types.
+Commonly seen events are those related to pool management with class
+sysevent.fs.zfs.*
including import, export, configuration updates,
+and zpool history
updates.
Events related to errors are reported with class ereport.*. These can be invaluable for troubleshooting. Some faults can cause multiple ereports as various layers of the software deal with the fault. For example, on a simple pool without parity protection, a faulty disk could cause an ereport.io during a read from the disk that results in an ereport.fs.zfs.checksum at the pool level. These events are also reflected by the error counters observed in zpool status. If you see checksum or read/write errors in zpool status, then there should be one or more corresponding ereports in the zpool events output.
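For example:

zpool events       # one-line summary per event
zpool events -v    # full detail for each event
zpool events -f    # follow new events as they arrive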
Note
+This page describes functionality which has been added for the +OpenZFS 2.1.0 release, it is not in the OpenZFS 2.0.0 release.
+dRAID is a variant of raidz that provides integrated distributed hot +spares which allows for faster resilvering while retaining the benefits +of raidz. A dRAID vdev is constructed from multiple internal raidz +groups, each with D data devices and P parity devices. These groups +are distributed over all of the children in order to fully utilize the +available disk performance. This is known as parity declustering and +it has been an active area of research. The image below is simplified, +but it helps illustrate this key difference between dRAID and raidz.
+ +Additionally, a dRAID vdev must shuffle its child vdevs in such a way +that regardless of which drive has failed, the rebuild IO (both read +and write) will distribute evenly among all surviving drives. This +is accomplished by using carefully chosen precomputed permutation +maps. This has the advantage of both keeping pool creation fast and +making it impossible for the mapping to be damaged or lost.
Another way dRAID differs from raidz is that it uses a fixed stripe width (padding as necessary with zeros). This allows a dRAID vdev to be sequentially resilvered; however, the fixed stripe width significantly affects both usable capacity and IOPS. For example, with the default D=8 and 4k disk sectors the minimum allocation size is 32k. If using compression, this relatively large allocation size can reduce the effective compression ratio. When using ZFS volumes and dRAID the default volblocksize property is increased to account for the allocation size. If a dRAID pool will hold a significant amount of small blocks, it is recommended to also add a mirrored special vdev to store those blocks.
In terms of IOPS, performance is similar to raidz, since for any read all D data disks must be accessed. Delivered random IOPS can be reasonably approximated as floor((N-S)/(D+P)) * <single-drive-IOPS>.
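Plugging in some illustrative numbers: with N=90 children, S=2 distributed spares, D=8 and P=2, the approximation gives floor((90-2)/(8+2)) = 8, i.e. the vdev delivers roughly 8 times the random IOPS of a single drive.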
+In summary dRAID can provide the same level of redundancy and +performance as raidz, while also providing a fast integrated distributed +spare.
+A dRAID vdev is created like any other by using the zpool create
+command and enumerating the disks which should be used.
# zpool create <pool> draid[1,2,3] <vdevs...>
+
Like raidz, the parity level is specified immediately after the draid vdev type. However, unlike raidz, additional colon-separated options can be specified. The most important of these is the :<spares>s option, which controls the number of distributed hot spares to create. By default, no spares are created. The :<data>d option can be specified to set the number of data devices to use in each RAID stripe (D+P). When unspecified, reasonable defaults are chosen.
# zpool create <pool> draid[<parity>][:<data>d][:<children>c][:<spares>s] <vdevs...>
+
parity - The parity level (1-3). Defaults to one.
data - The number of data devices per redundancy group. In general +a smaller value of D will increase IOPS, improve the compression ratio, +and speed up resilvering at the expense of total usable capacity. +Defaults to 8, unless N-P-S is less than 8.
children - The expected number of children. Useful as a cross-check +when listing a large number of devices. An error is returned when the +provided number of children differs.
spares - The number of distributed hot spares. Defaults to zero.
For example, to create an 11 disk dRAID pool with 4+1 redundancy and a +single distributed spare the command would be:
+# zpool create tank draid:4d:1s:11c /dev/sd[a-k]
+# zpool status tank
+
+ pool: tank
+ state: ONLINE
+config:
+
+ NAME STATE READ WRITE CKSUM
+ tank ONLINE 0 0 0
+ draid1:4d:11c:1s-0 ONLINE 0 0 0
+ sda ONLINE 0 0 0
+ sdb ONLINE 0 0 0
+ sdc ONLINE 0 0 0
+ sdd ONLINE 0 0 0
+ sde ONLINE 0 0 0
+ sdf ONLINE 0 0 0
+ sdg ONLINE 0 0 0
+ sdh ONLINE 0 0 0
+ sdi ONLINE 0 0 0
+ sdj ONLINE 0 0 0
+ sdk ONLINE 0 0 0
+ spares
+ draid1-0-0 AVAIL
+
Note that the dRAID vdev name, draid1:4d:11c:1s, fully describes the configuration, and all of the disks which are part of the dRAID are listed. Furthermore, the logical distributed hot spare is shown as an available spare disk.
One of the major advantages of dRAID is that it supports both sequential +and traditional healing resilvers. When performing a sequential resilver +to a distributed hot spare the performance scales with the number of disks +divided by the stripe width (D+P). This can greatly reduce resilver times +and restore full redundancy in a fraction of the usual time. For example, +the following graph shows the observed sequential resilver time in hours +for a 90 HDD based dRAID filled to 90% capacity.
+ +When using dRAID and a distributed spare, the process for handling a +failed disk is almost identical to raidz with a traditional hot spare. +When a disk failure is detected the ZFS Event Daemon (ZED) will start +rebuilding to a spare if one is available. The only difference is that +for dRAID a sequential resilver is started, while a healing resilver must +be used for raidz.
+# echo offline >/sys/block/sdg/device/state
+# zpool replace -s tank sdg draid1-0-0
+# zpool status
+
+ pool: tank
+ state: DEGRADED
+status: One or more devices is currently being resilvered. The pool will
+ continue to function, possibly in a degraded state.
+action: Wait for the resilver to complete.
+ scan: resilver (draid1:4d:11c:1s-0) in progress since Tue Nov 24 14:34:25 2020
+ 3.51T scanned at 13.4G/s, 1.59T issued 6.07G/s, 6.13T total
+ 326G resilvered, 57.17% done, 00:03:21 to go
+config:
+
+ NAME STATE READ WRITE CKSUM
+ tank DEGRADED 0 0 0
+ draid1:4d:11c:1s-0 DEGRADED 0 0 0
+ sda ONLINE 0 0 0 (resilvering)
+ sdb ONLINE 0 0 0 (resilvering)
+ sdc ONLINE 0 0 0 (resilvering)
+ sdd ONLINE 0 0 0 (resilvering)
+ sde ONLINE 0 0 0 (resilvering)
+ sdf ONLINE 0 0 0 (resilvering)
+ spare-6 DEGRADED 0 0 0
+ sdg UNAVAIL 0 0 0
+ draid1-0-0 ONLINE 0 0 0 (resilvering)
+ sdh ONLINE 0 0 0 (resilvering)
+ sdi ONLINE 0 0 0 (resilvering)
+ sdj ONLINE 0 0 0 (resilvering)
+ sdk ONLINE 0 0 0 (resilvering)
+ spares
+ draid1-0-0 INUSE currently in use
+
While both types of resilvering achieve the same goal it’s worth taking +a moment to summarize the key differences.
+A traditional healing resilver scans the entire block tree. This +means the checksum for each block is available while it’s being +repaired and can be immediately verified. The downside is this +creates a random read workload which is not ideal for performance.
A sequential resilver instead scans the space maps in order to determine what space is allocated and what must be repaired. This rebuild process is not limited to block boundaries and can sequentially read from the disks and make repairs using larger I/Os. The price to pay for this performance improvement is that the block checksums cannot be verified while resilvering. Therefore, a scrub is started to verify the checksums after the sequential resilver completes.
For a more in depth explanation of the differences between sequential +and healing resilvering check out these sequential resilver slides +which were presented at the OpenZFS Developer Summit.
+Distributed spare space can be made available again by simply replacing +any failed drive with a new drive. This process is called rebalancing +and is essentially a resilver. When performing rebalancing a healing +resilver is recommended since the pool is no longer degraded. This +ensures all checksums are verified when rebuilding to the new disk +and eliminates the need to perform a subsequent scrub of the pool.
+# zpool replace tank sdg sdl
+# zpool status
+
+ pool: tank
+ state: DEGRADED
+status: One or more devices is currently being resilvered. The pool will
+ continue to function, possibly in a degraded state.
+action: Wait for the resilver to complete.
+ scan: resilver in progress since Tue Nov 24 14:45:16 2020
+ 6.13T scanned at 7.82G/s, 6.10T issued at 7.78G/s, 6.13T total
+ 565G resilvered, 99.44% done, 00:00:04 to go
+config:
+
+ NAME STATE READ WRITE CKSUM
+ tank DEGRADED 0 0 0
+ draid1:4d:11c:1s-0 DEGRADED 0 0 0
+ sda ONLINE 0 0 0 (resilvering)
+ sdb ONLINE 0 0 0 (resilvering)
+ sdc ONLINE 0 0 0 (resilvering)
+ sdd ONLINE 0 0 0 (resilvering)
+ sde ONLINE 0 0 0 (resilvering)
+ sdf ONLINE 0 0 0 (resilvering)
+ spare-6 DEGRADED 0 0 0
+ replacing-0 DEGRADED 0 0 0
+ sdg UNAVAIL 0 0 0
+ sdl ONLINE 0 0 0 (resilvering)
+ draid1-0-0 ONLINE 0 0 0 (resilvering)
+ sdh ONLINE 0 0 0 (resilvering)
+ sdi ONLINE 0 0 0 (resilvering)
+ sdj ONLINE 0 0 0 (resilvering)
+ sdk ONLINE 0 0 0 (resilvering)
+ spares
+ draid1-0-0 INUSE currently in use
+
After the resilvering completes the distributed hot spare is once again +available for use and the pool has been restored to its normal healthy +state.
+There are a number of ways to control the ZFS Buildbot at a commit +level. This page provides a summary of various options that the ZFS +Buildbot supports and how it impacts testing. More detailed information +regarding its implementation can be found at the ZFS Buildbot Github +page.
+By default, all commits in your ZFS pull request are compiled by the
+BUILD builders. Additionally, the top commit of your ZFS pull request is
+tested by TEST builders. However, there is the option to override which
+types of builder should be used on a per commit basis. In this case, you
+can add
+Requires-builders: <none|all|style|build|arch|distro|test|perf|coverage|unstable>
+to your commit message. A comma separated list of options can be
+provided. Supported options are:
all
: This commit should be built by all available builders
none
: This commit should not be built by any builders
style
: This commit should be built by STYLE builders
build
: This commit should be built by all BUILD builders
arch
: This commit should be built by BUILD builders tagged as
+‘Architectures’
distro
: This commit should be built by BUILD builders tagged as
+‘Distributions’
test
: This commit should be built and tested by the TEST builders
+(excluding the Coverage TEST builders)
perf
: This commit should be built and tested by the PERF builders
coverage
: This commit should be built and tested by the Coverage
+TEST builders
unstable
: This commit should be built and tested by the Unstable
+TEST builders (currently only the Fedora Rawhide TEST builder)
A couple of examples on how to use Requires-builders:
in commit
+messages can be found below.
This is a commit message
+
+This text is part of the commit message body.
+
+Signed-off-by: Contributor <contributor@email.com>
+Requires-builders: none
+
This is a commit message
+
+This text is part of the commit message body.
+
+Signed-off-by: Contributor <contributor@email.com>
+Requires-builders: style test
+
Currently, the ZFS Buildbot attempts to choose the correct SPL branch to
+build based on a pull request’s base branch. In the cases where a
+specific SPL version needs to be built, the ZFS buildbot supports
+specifying an SPL version for pull request testing. By opening a pull
+request against ZFS and adding Requires-spl:
in a commit message,
+you can instruct the buildbot to use a specific SPL version. Below are
examples of commit messages that specify the SPL version.
This is a commit message
+
+This text is part of the commit message body.
+
+Signed-off-by: Contributor <contributor@email.com>
+Requires-spl: refs/pull/123/head
+
spl-branch-name
from zfsonlinux/spl
repositoryThis is a commit message
+
+This text is part of the commit message body.
+
+Signed-off-by: Contributor <contributor@email.com>
+Requires-spl: spl-branch-name
+
Currently, Kernel.org builders will clone and build the master branch of
+Linux. In cases where a specific version of the Linux kernel needs to be
+built, the ZFS buildbot supports specifying the Linux kernel to be built
+via commit message. By opening a pull request against ZFS and adding
+Requires-kernel:
in a commit message, you can instruct the buildbot
+to use a specific Linux kernel. Below is an example commit message that
+specifies a specific Linux kernel tag.
This is a commit message
+
+This text is part of the commit message body.
+
+Signed-off-by: Contributor <contributor@email.com>
+Requires-kernel: v4.14
+
Each builder will execute or skip build steps based on its default +preferences. In some scenarios, it might be possible to skip various +build steps. The ZFS buildbot supports overriding the defaults of all +builders in a commit message. The list of available overrides are:
+Build-linux: <Yes|No>
: All builders should build Linux for this
+commit
Build-lustre: <Yes|No>
: All builders should build Lustre for this
+commit
Build-spl: <Yes|No>
: All builders should build the SPL for this
+commit
Build-zfs: <Yes|No>
: All builders should build ZFS for this
+commit
Built-in: <Yes|No>
: All Linux builds should build in SPL and ZFS
Check-lint: <Yes|No>
: All builders should perform lint checks for
+this commit
Configure-lustre: <options>
: Provide <options>
as configure
+flags when building Lustre
Configure-spl: <options>
: Provide <options>
as configure
+flags when building the SPL
Configure-zfs: <options>
: Provide <options>
as configure
+flags when building ZFS
A couple of examples on how to use overrides in commit messages can be +found below.
+This is a commit message
+
+This text is part of the commit message body.
+
+Signed-off-by: Contributor <contributor@email.com>
+Build-lustre: Yes
+Configure-lustre: --disable-ldiskfs
+Build-spl: No
+
This is a commit message
+
+This text is part of the commit message body.
+
+Signed-off-by: Contributor <contributor@email.com>
+Build-lustre: No
+Build-spl: No
+
At the top level of the ZFS source tree, there is the TEST +file which +contains variables that control if and how a specific test should run. +Below is a list of each variable and a brief description of what each +variable controls.
+TEST_PREPARE_WATCHDOG
- Enables the Linux kernel watchdog
TEST_PREPARE_SHARES
- Start NFS and Samba servers
TEST_SPLAT_SKIP
- Determines if splat
testing is skipped
TEST_SPLAT_OPTIONS
- Command line options to provide to splat
TEST_ZTEST_SKIP
- Determines if ztest
testing is skipped
TEST_ZTEST_TIMEOUT
- The length of time ztest
should run
TEST_ZTEST_DIR
- Directory where ztest
will create vdevs
TEST_ZTEST_OPTIONS
- Options to pass to ztest
TEST_ZTEST_CORE_DIR
- Directory for ztest
to store core dumps
TEST_ZIMPORT_SKIP
- Determines if zimport
testing is skipped
TEST_ZIMPORT_DIR
- Directory used during zimport
TEST_ZIMPORT_VERSIONS
- Source versions to test
TEST_ZIMPORT_POOLS
- Names of the pools for zimport
to use
+for testing
TEST_ZIMPORT_OPTIONS
- Command line options to provide to
+zimport
TEST_XFSTESTS_SKIP
- Determines if xfstest
testing is skipped
TEST_XFSTESTS_URL
- URL to download xfstest
from
TEST_XFSTESTS_VER
- Name of the tarball to download from
+TEST_XFSTESTS_URL
TEST_XFSTESTS_POOL
- Name of pool to create and used by
+xfstest
TEST_XFSTESTS_FS
- Name of dataset for use by xfstest
TEST_XFSTESTS_VDEV
- Name of the vdev used by xfstest
TEST_XFSTESTS_OPTIONS
- Command line options to provide to
+xfstest
TEST_ZFSTESTS_SKIP
- Determines if zfs-tests
testing is
+skipped
TEST_ZFSTESTS_DIR
- Directory to store files and loopback devices
TEST_ZFSTESTS_DISKS
- Space delimited list of disks that
+zfs-tests
is allowed to use
TEST_ZFSTESTS_DISKSIZE
- File size of file based vdevs used by
+zfs-tests
TEST_ZFSTESTS_ITERS
- Number of times test-runner
should
+execute its set of tests
TEST_ZFSTESTS_OPTIONS
- Options to provide zfs-tests
TEST_ZFSTESTS_RUNFILE
- The runfile to use when running
+zfs-tests
TEST_ZFSTESTS_TAGS
- List of tags to provide to test-runner
TEST_ZFSSTRESS_SKIP
- Determines if zfsstress
testing is
+skipped
TEST_ZFSSTRESS_URL
- URL to download zfsstress
from
TEST_ZFSSTRESS_VER
- Name of the tarball to download from
+TEST_ZFSSTRESS_URL
TEST_ZFSSTRESS_RUNTIME
- Duration to run runstress.sh
TEST_ZFSSTRESS_POOL
- Name of pool to create and use for
+zfsstress
testing
TEST_ZFSSTRESS_FS
- Name of dataset for use during zfsstress
+tests
TEST_ZFSSTRESS_FSOPT
- File system options to provide to
+zfsstress
TEST_ZFSSTRESS_VDEV
- Directory to store vdevs for use during
+zfsstress
tests
TEST_ZFSSTRESS_OPTIONS
- Command line options to provide to
+runstress.sh
The official source for OpenZFS is maintained at GitHub by the +openzfs organization. The primary +git repository for the project is the zfs repository.
+There are two main components in this repository:
+code which has been adapted and extended for Linux and FreeBSD. The +vast majority of the core OpenZFS code is self-contained and can be +used without modification.
+implementing the fundamental interfaces required by OpenZFS. It’s
+this layer which allows OpenZFS to be used across multiple
+platforms. SPL used to be maintained in a separate repository, but
+was merged into the zfs
+repository in the 0.8
major release.
The first thing you’ll need to do is prepare your environment by +installing a full development tool chain. In addition, development +headers for both the kernel and the following packages must be +available. It is important to note that if the development kernel +headers for the currently running kernel aren’t installed, the modules +won’t compile properly.
+The following dependencies should be installed to build the latest ZFS +2.1 release.
+RHEL/CentOS 7:
sudo yum install epel-release gcc make autoconf automake libtool rpm-build libtirpc-devel libblkid-devel libuuid-devel libudev-devel openssl-devel zlib-devel libaio-devel libattr-devel elfutils-libelf-devel kernel-devel-$(uname -r) python python2-devel python-setuptools python-cffi libffi-devel git ncompress libcurl-devel
+sudo yum install --enablerepo=epel python-packaging dkms
+
RHEL/CentOS 8, Fedora:
sudo dnf install --skip-broken epel-release gcc make autoconf automake libtool rpm-build libtirpc-devel libblkid-devel libuuid-devel libudev-devel openssl-devel zlib-devel libaio-devel libattr-devel elfutils-libelf-devel kernel-devel-$(uname -r) python3 python3-devel python3-setuptools python3-cffi libffi-devel git ncompress libcurl-devel
+sudo dnf install --skip-broken --enablerepo=epel --enablerepo=powertools python3-packaging dkms
+
Debian, Ubuntu:
sudo apt install build-essential autoconf automake libtool gawk alien fakeroot dkms libblkid-dev uuid-dev libudev-dev libssl-dev zlib1g-dev libaio-dev libattr1-dev libelf-dev linux-headers-generic python3 python3-dev python3-setuptools python3-cffi libffi-dev python3-packaging git libcurl4-openssl-dev debhelper-compat dh-python po-debconf python3-all-dev python3-sphinx
+
FreeBSD:
pkg install autoconf automake autotools git gmake python devel/py-sysctl sudo
+
There are two options for building OpenZFS; the correct one largely +depends on your requirements.
+Packages: Often it can be useful to build custom packages from +git which can be installed on a system. This is the best way to +perform integration testing with systemd, dracut, and udev. The +downside to using packages it is greatly increases the time required +to build, install, and test a change.
tree. This speeds up development by allowing developers to rapidly +iterate on a patch. When working in-tree developers can leverage +incremental builds, load/unload kernel modules, execute utilities, +and verify all their changes with the ZFS Test Suite.
+The remainder of this page focuses on the in-tree option which is +the recommended method of development for the majority of changes. See +the custom packages page for additional +information on building custom packages.
+Start by cloning the ZFS repository from GitHub. The repository has a +master branch for development and a series of *-release +branches for tagged releases. After checking out the repository your +clone will default to the master branch. Tagged releases may be built +by checking out zfs-x.y.z tags with matching version numbers or +matching release branches.
+git clone https://github.com/openzfs/zfs
+
For developers working on a change always create a new topic branch +based off of master. This will make it easy to open a pull request with +your change latter. The master branch is kept stable with extensive +regression testing of every pull +request before and after it’s merged. Every effort is made to catch +defects as early as possible and to keep them out of the tree. +Developers should be comfortable frequently rebasing their work against +the latest master branch.
+In this example we’ll use the master branch and walk through a stock +in-tree build. Start by checking out the desired branch then build +the ZFS and SPL source in the traditional autotools fashion.
+cd ./zfs
+git checkout master
+sh autogen.sh
+./configure
+make -s -j$(nproc)
+
--with-linux=PATH
and --with-linux-obj=PATH
can be
+passed to configure to specify a kernel installed in a non-default
+location.--enable-debug
can be passed to configure to enable all ASSERTs and
+additional correctness tests.Optional Build packages
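For example, both options can be given in a single configure invocation; the paths below are only illustrative and should point at wherever your kernel headers and build tree actually live:

./configure --with-linux=/usr/src/kernels/$(uname -r) \
            --with-linux-obj=/usr/src/kernels/$(uname -r) \
            --enable-debug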
+make rpm #Builds RPM packages for CentOS/Fedora
+make deb #Builds RPM converted DEB packages for Debian/Ubuntu
+make native-deb #Builds native DEB packages for Debian/Ubuntu
+
KVERS
, KSRC
and KOBJ
+environment variables can be exported to specify the kernel installed
+in non-default location.Note
+Support for native Debian packaging will be available starting from +openzfs-2.2 release.
+You can run zfs-tests.sh
without installing ZFS, see below. If you
+have reason to install ZFS after building it, pay attention to how your
+distribution handles kernel modules. On Ubuntu, for example, the modules
+from this repository install in the extra
kernel module path, which
+is not in the standard depmod
search path. Therefore, for the
+duration of your testing, edit /etc/depmod.d/ubuntu.conf
and add
+extra
to the beginning of the search path.
You may then install using
+sudo make install; sudo ldconfig; sudo depmod
. You’d uninstall with
+sudo make uninstall; sudo ldconfig; sudo depmod
.
If you wish to run the ZFS Test Suite (ZTS), then ksh
and a few
+additional utilities must be installed.
RHEL/CentOS 7:
sudo yum install ksh bc bzip2 fio acl sysstat mdadm lsscsi parted attr nfs-utils samba rng-tools pax perf
+sudo yum install --enablerepo=epel dbench
+
RHEL/CentOS 8, Fedora:
sudo dnf install --skip-broken ksh bc bzip2 fio acl sysstat mdadm lsscsi parted attr nfs-utils samba rng-tools pax perf
+sudo dnf install --skip-broken --enablerepo=epel dbench
+
Debian:
sudo apt install ksh bc bzip2 fio acl sysstat mdadm lsscsi parted attr dbench nfs-kernel-server samba rng-tools pax linux-perf selinux-utils quota
+
Ubuntu:
sudo apt install ksh bc bzip2 fio acl sysstat mdadm lsscsi parted attr dbench nfs-kernel-server samba rng-tools pax linux-tools-common selinux-utils quota
+
FreeBSD:
pkg install base64 bash checkbashisms fio hs-ShellCheck ksh93 pamtester devel/py-flake8 sudo
+
There are a few helper scripts provided in the top-level scripts +directory designed to aid developers working with in-tree builds.
+zfs-helper.sh: Certain functionality (i.e. /dev/zvol/) depends on +the ZFS provided udev helper scripts being installed on the system. +This script can be used to create symlinks on the system from the +installation location to the in-tree helper. These links must be in +place to successfully run the ZFS Test Suite. The -i and -r +options can be used to install and remove the symlinks.
sudo ./scripts/zfs-helpers.sh -i
+
zfs.sh: The freshly built kernel modules can be loaded using
+zfs.sh
. This script can later be used to unload the kernel
+modules with the -u option.
sudo ./scripts/zfs.sh
+
zloop.sh: A wrapper to run ztest repeatedly with randomized +arguments. The ztest command is a user space stress test designed to +detect correctness issues by concurrently running a random set of +test cases. If a crash is encountered, the ztest logs, any associated +vdev files, and core file (if one exists) are collected and moved to +the output directory for analysis.
sudo ./scripts/zloop.sh
+
zfs-tests.sh: A wrapper which can be used to launch the ZFS Test
+Suite. Three loopback devices are created on top of sparse files
+located in /var/tmp/
and used for the regression test. Detailed
+directions for the ZFS Test Suite can be found in the
+README
+located in the top-level tests directory.
./scripts/zfs-tests.sh -vx
+
tip: The delegate tests will be skipped unless group read +permission is set on the zfs directory and its parents.
+The following instructions assume you are building from an official +release tarball +(version 0.8.0 or newer) or directly from the git +repository. Most users should not +need to do this and should preferentially use the distribution packages. +As a general rule the distribution packages will be more tightly +integrated, widely tested, and better supported. However, if your +distribution of choice doesn’t provide packages, or you’re a developer +and want to roll your own, here’s how to do it.
+The first thing to be aware of is that the build system is capable of +generating several different types of packages. Which type of package +you choose depends on what’s supported on your platform and exactly what +your needs are.
+DKMS packages contain only the source code and scripts for +rebuilding the kernel modules. When the DKMS package is installed +kernel modules will be built for all available kernels. Additionally, +when the kernel is upgraded new kernel modules will be automatically +built for that kernel. This is particularly convenient for desktop +systems which receive frequent kernel updates. The downside is that +because the DKMS packages build the kernel modules from source a full +development environment is required which may not be appropriate for +large deployments.
kmods packages are binary kernel modules which are compiled +against a specific version of the kernel. This means that if you +update the kernel you must compile and install a new kmod package. If +you don’t frequently update your kernel, or if you’re managing a +large number of systems, then kmod packages are a good choice.
kABI-tracking kmod Packages are similar to standard binary kmods +and may be used with Enterprise Linux distributions like Red Hat and +CentOS. These distributions provide a stable kABI (Kernel Application +Binary Interface) which allows the same binary modules to be used +with new versions of the distribution provided kernel.
By default the build system will generate user packages and both DKMS +and kmod style kernel packages if possible. The user packages can be +used with either set of kernel packages and do not need to be rebuilt +when the kernel is updated. You can also streamline the build process by +building only the DKMS or kmod packages as shown below.
+Be aware that when building directly from a git repository you must +first run the autogen.sh script to create the configure script. This +will require installing the GNU autotools packages for your +distribution. To perform any of the builds, you must install all the +necessary development tools and headers for your distribution.
+It is important to note that if the development kernel headers for the +currently running kernel aren’t installed, the modules won’t compile +properly.
+ +Make sure that the required packages are installed to build the latest +ZFS 2.1 release:
+RHEL/CentOS 7:
sudo yum install epel-release gcc make autoconf automake libtool rpm-build libtirpc-devel libblkid-devel libuuid-devel libudev-devel openssl-devel zlib-devel libaio-devel libattr-devel elfutils-libelf-devel kernel-devel-$(uname -r) python python2-devel python-setuptools python-cffi libffi-devel ncompress
+sudo yum install --enablerepo=epel dkms python-packaging
+
RHEL/CentOS 8, Fedora:
sudo dnf install --skip-broken epel-release gcc make autoconf automake libtool rpm-build kernel-rpm-macros libtirpc-devel libblkid-devel libuuid-devel libudev-devel openssl-devel zlib-devel libaio-devel libattr-devel elfutils-libelf-devel kernel-devel-$(uname -r) kernel-abi-stablelists-$(uname -r | sed 's/\.[^.]\+$//') python3 python3-devel python3-setuptools python3-cffi libffi-devel ncompress
+sudo dnf install --skip-broken --enablerepo=epel --enablerepo=powertools python3-packaging dkms
+
RHEL/CentOS 9:
sudo dnf config-manager --set-enabled crb
+sudo dnf install --skip-broken epel-release gcc make autoconf automake libtool rpm-build kernel-rpm-macros libtirpc-devel libblkid-devel libuuid-devel libudev-devel openssl-devel zlib-devel libaio-devel libattr-devel elfutils-libelf-devel kernel-devel-$(uname -r) kernel-abi-stablelists-$(uname -r | sed 's/\.[^.]\+$//') python3 python3-devel python3-setuptools python3-cffi libffi-devel
+sudo dnf install --skip-broken --enablerepo=epel python3-packaging dkms
+
Building rpm-based DKMS and user packages can be done as follows:
+$ cd zfs
+$ ./configure
+$ make -j1 rpm-utils rpm-dkms
+$ sudo yum localinstall *.$(uname -p).rpm *.noarch.rpm
+
The key thing to know when building a kmod package is that a specific +Linux kernel must be specified. At configure time the build system will +make an educated guess as to which kernel you want to build against. +However, if configure is unable to locate your kernel development +headers, or you want to build against a different kernel, you must +specify the exact path with the –with-linux and –with-linux-obj +options.
+$ cd zfs
+$ ./configure
+$ make -j1 rpm-utils rpm-kmod
+$ sudo yum localinstall *.$(uname -p).rpm
+
The process for building kABI-tracking kmods is almost identical to for +building normal kmods. However, it will only produce binaries which can +be used by multiple kernels if the distribution supports a stable kABI. +In order to request kABI-tracking package the –with-spec=redhat +option must be passed to configure.
+NOTE: This type of package is not available for Fedora.
+$ cd zfs
+$ ./configure --with-spec=redhat
+$ make -j1 rpm-utils rpm-kmod
+$ sudo yum localinstall *.$(uname -p).rpm
+
Make sure that the required packages are installed:
+sudo apt install build-essential autoconf automake libtool gawk alien fakeroot dkms libblkid-dev uuid-dev libudev-dev libssl-dev zlib1g-dev libaio-dev libattr1-dev libelf-dev linux-headers-generic python3 python3-dev python3-setuptools python3-cffi libffi-dev python3-packaging debhelper-compat dh-python po-debconf python3-all-dev python3-sphinx
+
The key thing to know when building a kmod package is that a specific +Linux kernel must be specified. At configure time the build system will +make an educated guess as to which kernel you want to build against. +However, if configure is unable to locate your kernel development +headers, or you want to build against a different kernel, you must +specify the exact path with the –with-linux and –with-linux-obj +options.
+To build RPM converted Debian packages:
+$ cd zfs
+$ ./configure --enable-systemd
+$ make -j1 deb-utils deb-kmod
+$ sudo apt-get install --fix-missing ./*.deb
+
Starting from openzfs-2.2 release, native Debian packages can be built +as follows:
+$ cd zfs
+$ ./configure
+$ make native-deb-utils native-deb-kmod
+$ rm ../openzfs-zfs-dkms_*.deb
+$ sudo apt-get install --fix-missing ../*.deb
+
Native Debian packages build with pre-configured paths for Debian and
+Ubuntu. It’s best not to override the paths during configure.
+KVERS
, KSRC
and KOBJ
environment variables can be exported
+to specify the kernel installed in non-default location.
Building RPM converted deb-based DKMS and user packages can be done as +follows:
+$ cd zfs
+$ ./configure --enable-systemd
+$ make -j1 deb-utils deb-dkms
+$ sudo apt-get install --fix-missing ./*.deb
+
Starting from openzfs-2.2 release, native deb-based DKMS and user +packages can be built as follows:
+$ sudo apt-get install dh-dkms
+$ cd zfs
+$ ./configure
+$ make native-deb-utils
+$ sudo apt-get install --fix-missing ../*.deb
+
The released tarball contains the latest fully tested and released +version of ZFS. This is the preferred source code location for use in +production systems. If you want to use the official released tarballs, +then use the following commands to fetch and prepare the source.
+$ wget http://archive.zfsonlinux.org/downloads/zfsonlinux/zfs/zfs-x.y.z.tar.gz
+$ tar -xzf zfs-x.y.z.tar.gz
+
The Git master branch contains the latest version of the software, and +will probably contain fixes that, for some reason, weren’t included in +the released tarball. This is the preferred source code location for +developers who intend to modify ZFS. If you would like to use the git +version, you can clone it from Github and prepare the source like this.
+$ git clone https://github.com/zfsonlinux/zfs.git
+$ cd zfs
+$ ./autogen.sh
+
Once the source has been prepared you’ll need to decide what kind of +packages you’re building and jump the to appropriate section above. Note +that not all package types are supported for all platforms.
+This is a very basic rundown of how to use Git and GitHub to make +changes.
+Recommended reading: ZFS on Linux +CONTRIBUTING.md
+If you’ve never used Git before, you’ll need a little setup to start +things off.
+git config --global user.name "My Name"
+git config --global user.email myemail@noreply.non
+
The easiest way to get started is to click the fork icon at the top of +the main repository page. From there you need to download a copy of the +forked repository to your computer:
+git clone https://github.com/<your-account-name>/zfs.git
+
This sets the “origin” repository to your fork. This will come in handy +when creating pull requests. To make pulling from the “upstream” +repository as changes are made, it is very useful to establish the +upstream repository as another remote (man git-remote):
+cd zfs
+git remote add upstream https://github.com/zfsonlinux/zfs.git
+
In order to make changes it is recommended to make a branch, this lets +you work on several unrelated changes at once. It is also not +recommended to make changes to the master branch unless you own the +repository.
+git checkout -b my-new-branch
+
From here you can make your changes and move on to the next step.
+Recommended reading: C Style and Coding Standards for +SunOS, +ZFS on Linux Developer +Resources, +OpenZFS Developer +Resources
+Before committing and pushing, you may want to test your patches. There
+are several tests you can run against your branch such as style
+checking, and functional tests. All pull requests go through these tests
+before being pushed to the main repository, however testing locally
+takes the load off the build/test servers. This step is optional but
+highly recommended, however the test suite should be run on a virtual
+machine or a host that currently does not use ZFS. You may need to
+install shellcheck
and flake8
to run the checkstyle
+correctly.
sh autogen.sh
+./configure
+make checkstyle
+
Recommended reading: Building +ZFS, ZFS Test +Suite +README
+When you are done making changes to your branch there are a few more +steps before you can make a pull request.
+git commit --all --signoff
+
This command opens an editor and adds all unstaged files from your +branch. Here you need to describe your change and add a few things:
+# Please enter the commit message for your changes. Lines starting
+# with '#' will be ignored, and an empty message aborts the commit.
+# On branch my-new-branch
+# Changes to be committed:
+# (use "git reset HEAD <file>..." to unstage)
+#
+# modified: hello.c
+#
+
The first thing we need to add is the commit message. This is what is +displayed on the git log, and should be a short description of the +change. By style guidelines, this has to be less than 72 characters in +length.
+Underneath the commit message you can add a more descriptive text to +your commit. The lines in this section have to be less than 72 +characters.
+When you are done, the commit should look like this:
+Add hello command
+
+This is a test commit with a descriptive commit message.
+This message can be more than one line as shown here.
+
+Signed-off-by: My Name <myemail@noreply.non>
+Closes #9998
+Issue #9999
+# Please enter the commit message for your changes. Lines starting
+# with '#' will be ignored, and an empty message aborts the commit.
+# On branch my-new-branch
+# Changes to be committed:
+# (use "git reset HEAD <file>..." to unstage)
+#
+# modified: hello.c
+#
+
You can also reference issues and pull requests if you are filing a pull +request for an existing issue as shown above. Save and exit the editor +when you are done.
+Home stretch. You’ve made your change and made the commit. Now it’s time +to push it.
+git push --set-upstream origin my-new-branch
+
This should ask you for your github credentials and upload your changes +to your repository.
+The last step is to either go to your repository or the upstream +repository on GitHub and you should see a button for making a new pull +request for your recently committed branch.
+Sometimes things don’t always go as planned and you may need to update
+your pull request with a correction to either your commit message, or
+your changes. This can be accomplished by re-pushing your branch. If you
+need to make code changes or git add
a file, you can do those now,
+along with the following:
git commit --amend
+git push --force
+
This will return you to the commit editor screen, and push your changes +over top of the old ones. Do note that this will restart the process of +any build/test servers currently running and excessively pushing can +cause delays in processing of all pull requests.
+When you wish to make changes in the future you will want to have an +up-to-date copy of the upstream repository to make your changes on. Here +is how you keep updated:
+git checkout master
+git pull upstream master
+git push origin master
+
This will make sure you are on the master branch of the repository, grab +the changes from upstream, then push them back to your repository.
+This is a very basic introduction to Git and GitHub, but should get you
+on your way to contributing to many open source projects. Not all
+projects have style requirements and some may have different processes
+to getting changes committed so please refer to their documentation to
+see if you need to do anything different. One topic we have not touched
+on is the git rebase
command which is a little more advanced for
+this wiki article.
Additional resources: Github Help, +Atlassian Git Tutorials
+Commit exceptions used to explicitly reference a given Linux commit. +These exceptions are useful for a variety of reasons.
+This page is used to generate +OpenZFS Tracking +page.
+<openzfs issue>|-|<comment>
- The OpenZFS commit isn’t applicable
+to Linux, or the OpenZFS -> ZFS on Linux commit matching is unable to
+associate the related commits due to lack of information (denoted by
+a -).
<openzfs issue>|<commit>|<comment>
- The fix was merged to Linux prior to there being an OpenZFS issue.
<openzfs issue>|!|<comment>
- The commit is applicable but not
+applied for the reason described in the comment.
OpenZFS issue id |
+status/ZFS commit |
+comment |
+
---|---|---|
11453 |
+! |
+check_disk() on illumos +isn’t available on ZoL / +OpenZFS 2.0 |
+
11276 |
+da68988 |
++ |
11052 |
+2efea7c |
++ |
11051 |
+3b61ca3 |
++ |
10853 |
+8dc2197 |
++ |
10844 |
+61c3391 |
++ |
10842 |
+d10b2f1 |
++ |
10841 |
+944a372 |
++ |
10809 |
+ee36c70 |
++ |
10808 |
+2ef0f8c |
++ |
10701 |
+0091d66 |
++ |
10601 |
+cc99f27 |
++ |
10573 |
+48d3eb4 |
++ |
10572 |
+edc1e71 |
++ |
10566 |
+ab7615d |
++ |
10554 |
+bec1067 |
++ |
10500 |
+03916905 |
++ |
10449 |
+379ca9c |
++ |
10406 |
+da2feb4 |
++ |
10154 |
+
|
+Not applicable to Linux |
+
10067 |
+
|
+The only ZFS change was to +zfs remap, which was +removed on Linux. |
+
9884 |
+
|
+Not applicable to Linux |
+
9851 |
+
|
+Not applicable to Linux |
+
9691 |
+d9b4bf0 |
++ |
9683 |
+
|
+Not applicable to Linux due +to devids not being used |
+
9680 |
+
|
+Applied and rolled back in +OpenZFS, additional changes +needed. |
+
9672 |
+29445fe3 |
++ |
9647 |
+a448a25 |
++ |
9626 |
+59e6e7ca |
++ |
9635 |
+
|
+Not applicable to Linux |
+
9623 |
+22448f08 |
++ |
9621 |
+305bc4b3 |
++ |
9539 |
+5228cf01 |
++ |
9512 |
+b4555c77 |
++ |
9487 |
+48fbb9dd |
++ |
9466 |
+272b5d73 |
++ |
9440 |
+f664f1e |
Illumos ticket 9440 never landed in openzfs/openzfs, but it did land in ZoL / OpenZFS 2.0 |
+
9433 |
+0873bb63 |
++ |
9421 |
+64c1dcef |
++ |
9237 |
+
|
+Introduced by 8567 which +was never applied to Linux |
+
9194 |
+
|
Not applicable; the ‘-o ashift=value’ option is provided on Linux |
+
9077 |
+
|
+Not applicable to Linux |
+
9027 |
+4a5d7f82 |
++ |
9018 |
+3ec34e55 |
++ |
8984 |
+! |
+WIP to support NFSv4 ACLs |
+
8969 |
+
|
+Not applicable to Linux |
+
8942 |
+650258d7 |
++ |
8941 |
+390d679a |
++ |
8862 |
+3b9edd7 |
++ |
8858 |
+
|
+Not applicable to Linux |
+
8856 |
+
|
+Not applicable to Linux due +to Encryption (b525630) |
+
8809 |
+! |
+Adding libfakekernel needs +to be done by refactoring +existing code. |
+
8727 |
+b525630 |
++ |
8713 |
+871e0732 |
++ |
8661 |
+1ce23dca |
++ |
8648 |
+f763c3d1 |
++ |
8602 |
+a032ac4 |
++ |
8601 |
+d99a015 |
+Equivalent fix included in +initial commit |
+
8590 |
+935e2c2 |
++ |
8569 |
+
|
+This change isn’t relevant +for Linux. |
+
8567 |
+
|
+An alternate fix was +applied for Linux. |
+
8552 |
+935e2c2 |
++ |
8521 |
+ee6370a7 |
++ |
8502 |
+! |
+Apply when porting OpenZFS +7955 |
+
9485 |
+1258bd7 |
++ |
8477 |
+92e43c1 |
++ |
8454 |
+
|
+An alternate fix was +applied for Linux. |
+
8423 |
+50c957f |
++ |
8408 |
+5f1346c |
++ |
8379 |
+
|
+This change isn’t relevant +for Linux. |
+
8376 |
+
|
+This change isn’t relevant +for Linux. |
+
8311 |
+! |
+Need to assess +applicability to Linux. |
+
8304 |
+
|
+This change isn’t relevant +for Linux. |
+
8300 |
+44f09cd |
++ |
8265 |
+
|
+The large_dnode feature has +been implemented for Linux. |
+
8168 |
+78d95ea |
++ |
8138 |
+44f09cd |
+The spelling fix to the zfs +man page came in with the +mdoc conversion. |
+
8108 |
+
|
+An equivalent Linux +specific fix was made. |
+
8068 |
+a1d477c24c |
+merged with zfs device +evacuation/removal |
+
8064 |
+
|
+This change isn’t relevant +for Linux. |
+
8022 |
+e55ebf6 |
++ |
8021 |
+7657def |
++ |
8013 |
+
|
+The change is illumos +specific and not applicable +for Linux. |
+
7982 |
+
|
+The change is illumos +specific and not applicable +for Linux. |
+
7970 |
+c30e58c |
++ |
7956 |
+cda0317 |
++ |
7955 |
+! |
+Need to assess +applicability to Linux. If +porting, apply 8502. |
+
7869 |
+df7eecc |
++ |
7816 |
+
|
+The change is illumos +specific and not applicable +for Linux. |
+
7803 |
+
|
+This functionality is
+provided by
+ |
+
7801 |
+0eef1bd |
+Commit f25efb3 in +openzfs/master has a small +change for linting which is +being ported. |
+
7779 |
+
|
+The change isn’t relevant,
+ |
+
7740 |
+32d41fb |
++ |
7739 |
+582cc014 |
++ |
7730 |
+e24e62a |
++ |
7710 |
+
|
+None of the illumos build +system is used under Linux. |
+
7602 |
+44f09cd |
++ |
7591 |
+541a090 |
++ |
7586 |
+c443487 |
++ |
7570 |
+
|
+Due to differences in the +block layer all discards +are handled asynchronously +under Linux. This +functionality could be +ported but it’s unclear to +what purpose. |
+
7542 |
+
|
+The Linux libshare code +differs significantly from +the upstream OpenZFS code. +Since this change doesn’t +address a Linux specific +issue it doesn’t need to be +ported. The eventual plan +is to retire all of the +existing libshare code and +use the ZED to more +flexibly control filesystem +sharing. |
+
7512 |
+
|
+None of the illumos build +system is used under Linux. |
+
7497 |
+
|
DTrace isn’t readily available under Linux. |
+
7446 |
+! |
+Need to assess +applicability to Linux. |
+
7430 |
+68cbd56 |
++ |
7402 |
+690fe64 |
++ |
7345 |
+058ac9b |
++ |
7278 |
+
|
+Dynamic ARC tuning is +handled slightly +differently under Linux and +this case is covered by +arc_tuning_update() |
+
7238 |
+
|
+zvol_swap test already +disabled in ZoL |
+
7194 |
+d7958b4 |
++ |
7164 |
+b1b85c87 |
++ |
7041 |
+33c0819 |
++ |
7016 |
+d3c2ae1 |
++ |
6914 |
+
|
+Under Linux the +arc_meta_limit can be tuned +with the +zfs_arc_meta_limit_percent +module option. |
+
6875 |
+! |
+WIP to support NFSv4 ACLs |
+
6843 |
+f5f087e |
++ |
6841 |
+4254acb |
++ |
6781 |
+15313c5 |
++ |
6765 |
+! |
+WIP to support NFSv4 ACLs |
+
6764 |
+! |
+WIP to support NFSv4 ACLs |
+
6763 |
+! |
+WIP to support NFSv4 ACLs |
+
6762 |
+! |
+WIP to support NFSv4 ACLs |
+
6648 |
+6bb24f4 |
++ |
6578 |
+6bb24f4 |
++ |
6577 |
+6bb24f4 |
++ |
6575 |
+6bb24f4 |
++ |
6568 |
+6bb24f4 |
++ |
6528 |
+6bb24f4 |
++ |
6494 |
+
|
+The |
+
6468 |
+6bb24f4 |
++ |
6465 |
+6bb24f4 |
++ |
6434 |
+472e7c6 |
++ |
6421 |
+ca0bf58 |
++ |
6418 |
+131cc95 |
++ |
6391 |
+ee06391 |
++ |
6390 |
+85802aa |
++ |
6388 |
+0de7c55 |
++ |
6386 |
+485c581 |
++ |
6385 |
+f3ad9cd |
++ |
6369 |
+6bb24f4 |
++ |
6368 |
+2024041 |
++ |
6346 |
+058ac9b |
++ |
6334 |
+1a04bab |
++ |
6290 |
+017da6 |
++ |
6250 |
+
|
+Linux handles crash dumps +in a fundamentally +different way than Illumos. +The proposed changes are +not needed. |
+
6249 |
+6bb24f4 |
++ |
6248 |
+6bb24f4 |
++ |
6220 |
+
|
+The b_thawed debug code was +unused under Linux and +removed. |
+
6209 |
+
|
The Linux user space mutex implementation is based on pthread primitives. |
+
6095 |
+f866a4ea |
++ |
6091 |
+c11f100 |
++ |
6037 |
+a8bd6dc |
++ |
5984 |
+480f626 |
++ |
5966 |
+6bb24f4 |
++ |
5961 |
+22872ff |
++ |
5882 |
+83e9986 |
++ |
5815 |
+
|
This patch could be adapted if needed to use equivalent Linux functionality. |
+
5770 |
+c3275b5 |
++ |
5769 |
+dd26aa5 |
++ |
5768 |
+
|
+The change isn’t relevant,
+ |
+
5766 |
+4dd1893 |
++ |
5693 |
+0f7d2a4 |
++ |
5692 |
+! |
+This functionality should
+be ported in such a way
+that it can be integrated
+with |
+
5684 |
+6bb24f4 |
++ |
5503 |
+0f676dc |
+Proposed patch in 5503 +never upstreamed, +alternative fix deployed +with OpenZFS 7072 |
+
5502 |
+f0ed6c7 |
+Proposed patch in 5502 +never upstreamed, +alternative fix deployed +in ZoL with commit f0ed6c7 |
+
5410 |
+0bf8501 |
++ |
5409 |
+b23d543 |
++ |
5379 |
+
|
+This particular issue never +impacted Linux due to the +need for a modified +zfs_putpage() +implementation. |
+
5316 |
+
|
+The illumos idmap facility +isn’t available under +Linux. This patch could +still be applied to +minimize code delta or all +HAVE_IDMAP chunks could be +removed on Linux for better +readability. |
+
5313 |
+ec8501e |
++ |
5312 |
+! |
+This change should be made +but the ideal time to do it +is when the spl repository +is folded in to the zfs +repository (planned for +0.8). At this time we’ll +want to cleanup many of the +includes. |
+
5219 |
+ef56b07 |
++ |
5179 |
+3f4058c |
++ |
5154 |
+9a49d3f |
+Illumos ticket 5154 never +landed in openzfs/openzfs, +alternative fix deployed +in ZoL with commit 9a49d3f |
+
5149 |
+
|
+Equivalent Linux
+functionality is provided
+by the
+ |
+
5148 |
+
|
+Discards are handled +differently under Linux, +there is no DKIOCFREE +ioctl. |
+
5136 |
+e8b96c6 |
++ |
4752 |
+aa9af22 |
++ |
4745 |
+411bf20 |
++ |
4698 |
+4fcc437 |
++ |
4620 |
+6bb24f4 |
++ |
4573 |
+10b7549 |
++ |
4571 |
+6e1b9d0 |
++ |
4570 |
+b1d13a6 |
++ |
4391 |
+78e2739 |
++ |
4465 |
+cda0317 |
++ |
4263 |
+6bb24f4 |
++ |
4242 |
+
|
Neither vnodes nor their associated events exist under Linux. |
+
4206 |
+2820bc4 |
++ |
4188 |
+2e7b765 |
++ |
4181 |
+44f09cd |
++ |
4161 |
+
|
The Linux user space reader/writer implementation is based on pthread primitives. |
+
4128 |
+! |
+The +ldi_ev_register_callbacks() +interface doesn’t exist +under Linux. It may be +possible to receive similar +notifications via the scsi +error handlers or possibly +a different interface. |
+
4072 |
+
|
+None of the illumos build +system is used under Linux. |
+
3998 |
+417104bd |
+Illumos ticket 3998 never +landed in openzfs/openzfs, +alternative fix deployed +in ZoL. |
+
3947 |
+7f9d994 |
++ |
3928 |
+
|
Neither vnodes nor their associated events exist under Linux. |
+
3871 |
+d1d7e268 |
++ |
3747 |
+090ff09 |
++ |
3705 |
+
|
+The Linux implementation +uses the lz4 workspace kmem +cache to resolve the stack +issue. |
+
3606 |
+c5b247f |
++ |
3580 |
+
|
Linux provides generic ioctl handlers to get/set block device information. |
+
3543 |
+8dca0a9 |
++ |
3512 |
+67629d0 |
++ |
3507 |
+43a696e |
++ |
3444 |
+6bb24f4 |
++ |
3371 |
+44f09cd |
++ |
3311 |
+6bb24f4 |
++ |
3301 |
+
|
+The Linux implementation of
+ |
+
3258 |
+9d81146 |
++ |
3254 |
+! |
+WIP to support NFSv4 ACLs |
+
3246 |
+cc92e9d |
++ |
2933 |
+
|
+None of the illumos build +system is used under Linux. |
+
2897 |
+fb82700 |
++ |
2665 |
+32a9872 |
++ |
2130 |
+460a021 |
++ |
1974 |
+
|
+This change was entirely +replaced in the ARC +restructuring. |
+
1898 |
+
|
+The zfs_putpage() function +was rewritten to properly +integrate with the Linux +VM. |
+
1700 |
+
|
+Not applicable to Linux, +the discard implementation +is entirely different. |
+
1618 |
+ca67b33 |
++ |
1337 |
+2402458 |
++ |
1126 |
+e43b290 |
++ |
763 |
+3cee226 |
++ |
742 |
+! |
+WIP to support NFSv4 ACLs |
+
701 |
+460a021 |
++ |
348 |
+
|
+The Linux implementation of
+ |
+
243 |
+
|
+Manual updates have been +made separately for Linux. |
+
184 |
+
|
+The zfs_putpage() function +was rewritten to properly +integrate with the Linux +VM. |
+
The ZFS on Linux project is an adaptation of the upstream OpenZFS repository designed to work in a Linux environment. This upstream repository acts as a location where new features, bug fixes, and performance improvements from all the OpenZFS platforms can be integrated. Each platform is responsible for tracking the OpenZFS repository and merging the relevant improvements back into its release.
+For the ZFS on Linux project this tracking is managed through an +OpenZFS tracking +page. The page is updated regularly and shows a list of OpenZFS commits +and their status in regard to the ZFS on Linux master branch.
+This page describes the process of applying outstanding OpenZFS commits +to ZFS on Linux and submitting those changes for inclusion. As a +developer this is a great way to familiarize yourself with ZFS on Linux +and to begin quickly making a valuable contribution to the project. The +following guide assumes you have a github +account, +are familiar with git, and are used to developing in a Linux +environment.
+Clone the source. Start by making a local clone of the +spl and +zfs repositories.
+$ git clone -o zfsonlinux https://github.com/zfsonlinux/spl.git
+$ git clone -o zfsonlinux https://github.com/zfsonlinux/zfs.git
+
Add remote repositories. Using the GitHub web interface +fork the +zfs repository in to your +personal GitHub account. Add your new zfs fork and the +openzfs repository as remotes +and then fetch both repositories. The OpenZFS repository is large and +the initial fetch may take some time over a slow connection.
+$ cd zfs
+$ git remote add <your-github-account> git@github.com:<your-github-account>/zfs.git
+$ git remote add openzfs https://github.com/openzfs/openzfs.git
+$ git fetch --all
+
Build the source. Compile the spl and zfs master branches. These branches are always kept stable, and this is a useful verification that you have a full build environment installed and all the required dependencies are available. It may also speed up compile times later for small patches where incremental builds are an option.
+$ cd ../spl
+$ sh autogen.sh && ./configure --enable-debug && make -s -j$(nproc)
+$
+$ cd ../zfs
+$ sh autogen.sh && ./configure --enable-debug && make -s -j$(nproc)
+
Consult the OpenZFS +tracking page and +select a patch which has not yet been applied. For your first patch you +will want to select a small patch to familiarize yourself with the +process.
+There are 2 methods:
+ +Please read about manual merge first to learn the +whole process.
You can cherry-pick on your own, but we have made a special script, which tries to cherry-pick the patch automatically and generates the description.
+Prepare environment:
Mandatory git settings (add to ~/.gitconfig
):
[merge]
+ renameLimit = 999999
+[user]
+ email = mail@yourmail.com
+ name = Your Name
+
Download the script:
+wget https://raw.githubusercontent.com/zfsonlinux/zfs-buildbot/master/scripts/openzfs-merge.sh
+
Run:
./openzfs-merge.sh -d path_to_zfs_folder -c openzfs_commit_hash
+
This command will fetch all repositories, create a new branch
+autoport-ozXXXX
(XXXX - OpenZFS issue number), try to cherry-pick,
+compile and check cstyle on success.
If it succeeds without any merge conflicts - go to autoport-ozXXXX
+branch, it will have ready to pull commit. Congratulations, you can go
+to step 7!
Otherwise you should go to step 2.
Resolve all merge conflicts manually. An easy method is to install Meld or any other diff tool and run git mergetool.
Check all compile and cstyle errors (See Testing a +patch).
Commit your changes with any description.
Update commit description (last commit will be changed):
./openzfs-merge.sh -d path_to_zfs_folder -g openzfs_commit_hash
+
Add any porting notes (if you have modified something):
+git commit --amend
Push your commit to github:
+git push <your-github-account> autoport-ozXXXX
Create a pull request to ZoL master branch.
Go to Testing a patch section.
Create a new branch. It is important to create a new branch for +every commit you port to ZFS on Linux. This will allow you to easily +submit your work as a GitHub pull request and it makes it possible to +work on multiple OpenZFS changes concurrently. All development branches +need to be based off of the ZFS master branch and it’s helpful to name +the branches after the issue number you’re working on.
+$ git checkout -b openzfs-<issue-nr> master
+
Generate a patch. One of the first things you’ll notice about the ZFS on Linux repository is that it is laid out differently than the OpenZFS repository. Organizationally it is much flatter; this is possible because it contains only the code for OpenZFS, not an entire OS. That means that in order to apply a patch from OpenZFS, the path names in the patch must be changed. A script called zfs2zol-patch.sed has been provided to perform this translation. Use the git format-patch command and this script to generate a patch.
$ git format-patch --stdout <commit-hash>^..<commit-hash> | \
+ ./scripts/zfs2zol-patch.sed >openzfs-<issue-nr>.diff
+
Apply the patch. In many cases the generated patch will apply cleanly to the repository. However, it’s important to keep in mind that the zfs2zol-patch.sed script only translates the paths. There are often additional reasons why a patch might not apply. In some cases hunks of the patch may not be applicable to Linux and should be dropped. In other cases a patch may depend on other changes which must be applied first. The changes may also conflict with Linux-specific modifications. In all of these cases the patch will need to be manually modified to apply cleanly while preserving its original intent.
+$ git am ./openzfs-<commit-nr>.diff
+
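If git am stops because a hunk fails to apply, recovery uses standard git commands; a minimal sketch (shown only as an illustration) is:
# after git am reports a conflict:
git status            # lists the files that failed to apply
# edit the affected files so the change applies as intended, then:
git add <fixed-files>
git am --continue
# or abandon the attempt entirely:
git am --abort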
Update the commit message. By using git format-patch to generate the patch and then git am to apply it, the original comment and authorship will be preserved. However, due to the formatting of the OpenZFS commit, you will likely find that the entire commit comment has been squashed into the subject line. Use git commit --amend to clean up the comment and be careful to follow these standard guidelines.
The summary line of an OpenZFS commit is often very long and you should truncate it to 50 characters. This is useful because it preserves the correct formatting of the git log --pretty=oneline command. Make sure to leave a blank line between the summary and body of the commit. Then include the full OpenZFS commit message, wrapping any lines which exceed 72 characters. Finally, add a Ported-by tag with your contact information and both an OpenZFS-issue and an OpenZFS-commit tag with appropriate links. You’ll want to verify your commit contains all of the following information:
The subject line from the original OpenZFS patch in the form: +“OpenZFS <issue-nr> - short description”.
The original patch authorship should be preserved.
The OpenZFS commit message.
The following tags:
+Authored by: Original patch author
Reviewed by: All OpenZFS reviewers from the original patch.
Approved by: All OpenZFS reviewers from the original patch.
Ported-by: Your name and email address.
OpenZFS-issue: https://www.illumos.org/issues/issue
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/hash
Porting Notes: An optional section describing any changes +required when porting.
For example, OpenZFS issue 6873 was applied to +Linux from this +upstream OpenZFS +commit.
+OpenZFS 6873 - zfs_destroy_snaps_nvl leaks errlist
+
+Authored by: Chris Williamson <chris.williamson@delphix.com>
+Reviewed by: Matthew Ahrens <mahrens@delphix.com>
+Reviewed by: Paul Dagnelie <pcd@delphix.com>
+Ported-by: Denys Rtveliashvili <denys@rtveliashvili.name>
+
+lzc_destroy_snaps() returns an nvlist in errlist.
+zfs_destroy_snaps_nvl() should nvlist_free() it before returning.
+
+OpenZFS-issue: https://www.illumos.org/issues/6873
+OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ee06391
+
Build the source. Verify the patched source compiles without errors +and all warnings are resolved.
+$ make -s -j$(nproc)
+
Run the style checker. Verify the patched source passes the style +checker, the command should return without printing any output.
+$ make cstyle
+
Open a Pull Request. When your patch builds cleanly and passes the +style checks open a new pull +request. +The pull request will be queued for automated +testing. As part of the +testing the change is built for a wide range of Linux distributions and +a battery of functional and stress tests are run to detect regressions.
+$ git push <your-github-account> openzfs-<issue-nr>
+
Fix any issues. Testing takes approximately 2 hours to fully complete and the results are posted in the GitHub pull request. All the tests are expected to pass, and you should investigate and resolve any test failures. The test scripts are all available and designed to run locally in order to reproduce an issue. Once you’ve resolved the issue, force update the pull request to trigger a new round of testing. Iterate until all the tests are passing.
+# Fix issue, amend commit, force update branch.
+$ git commit --amend
+$ git push --force <your-github-account> openzfs-<issue-nr>
+
Review. Lastly one of the ZFS on Linux maintainers will make a final +review of the patch and may request additional changes. Once the +maintainer is happy with the final version of the patch they will add +their signed-off-by, merge it to the master branch, mark it complete on +the tracking page, and thank you for your contribution to the project!
+Often an issue will be first fixed in ZFS on Linux or a new feature +developed. Changes which are not Linux specific should be submitted +upstream to the OpenZFS GitHub repository for review. The process for +this is described in the OpenZFS +README.
+ZFSBootMenu
+This tutorial is based on the GRUB bootloader. Due to its independent +implementation of a read-only ZFS driver, GRUB only supports a subset +of ZFS features on the boot pool. [In general, bootloader treat disks +as read-only to minimize the risk of damaging on-disk data.]
+ZFSBootMenu is an alternative bootloader +free of such limitations and has support for boot environments. Do not +follow instructions on this page if you plan to use ZBM, +as the layouts are not compatible. Refer +to their site for installation details.
+Customization
+Unless stated otherwise, it is not recommended to customize system +configuration before reboot.
+Only use well-tested pool features
+You should only use well-tested pool features. Avoid using new features if data integrity is paramount. See, for example, this comment.
+Disable Secure Boot. ZFS modules can not be loaded if Secure Boot is enabled.
Download latest extended variant of Alpine Linux +live image, +verify checksum +and boot from it.
+gpg --auto-key-retrieve --keyserver hkps://keyserver.ubuntu.com --verify alpine-extended-*.asc
+
+dd if=input-file of=output-file bs=1M
+
Login as root user. There is no password.
Configure Internet
+setup-interfaces -r
+# You must use "-r" option to start networking services properly
+# example:
+network interface: wlan0
+WiFi name: <ssid>
+ip address: dhcp
+<enter done to finish network config>
+manual netconfig: n
+
If you are using wireless network and it is not shown, see Alpine
+Linux wiki for
+further details. wpa_supplicant
can be installed with apk
+add wpa_supplicant
without internet connection.
Configure SSH server
+setup-sshd
+# example:
+ssh server: openssh
+allow root: "prohibit-password" or "yes"
+ssh key: "none" or "<public key>"
+
Configurations set here will be copied verbatim to the installed system.
+Set root password or /root/.ssh/authorized_keys
.
Choose a strong root password, as it will be copied to the
+installed system. However, authorized_keys
is not copied.
Connect from another computer
+ssh root@192.168.1.91
+
Configure NTP client for time synchronization
+setup-ntp busybox
+
Set up apk-repo. A list of available mirrors is shown. +Press space bar to continue
+setup-apkrepos
+
Throughout this guide, we use predictable disk names generated by +udev
+apk update
+apk add eudev
+setup-devd udev
+
It can be removed after reboot with setup-devd mdev && apk del eudev
.
Target disk
+List available disks with
+find /dev/disk/by-id/
+
If virtio is used as disk bus, power off the VM and set serial numbers for disk.
+For QEMU, use -drive format=raw,file=disk2.img,serial=AaBb
.
+For libvirt, edit domain XML. See this page for examples.
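For example, a QEMU invocation attaching two disk images with distinct serial numbers might look like the following hypothetical sketch (adjust image paths, memory, and other machine options to your setup):
qemu-system-x86_64 -m 4G \
  -drive format=raw,file=disk1.img,serial=AaBb \
  -drive format=raw,file=disk2.img,serial=CcDd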
Declare disk array
+DISK='/dev/disk/by-id/ata-FOO /dev/disk/by-id/nvme-BAR'
+
For single disk installation, use
+DISK='/dev/disk/by-id/disk1'
+
Set a mount point
+MNT=$(mktemp -d)
+
Set partition size:
Set the swap size in GB; set it to 1 if you don’t want swap to take up too much space
+SWAPSIZE=4
+
Set how much space should be left at the end of the disk, minimum 1GB
+RESERVE=1
+
Install ZFS support from live media:
+apk add zfs
+
Install bootloader programs and partition tool
+apk add grub-bios grub-efi parted e2fsprogs cryptsetup util-linux
+
Partition the disks.
+Note: you must clear all existing partition tables and data structures from target disks.
+For flash-based storage, this can be done by the blkdiscard command below:
+partition_disk () {
+ local disk="${1}"
+ blkdiscard -f "${disk}" || true
+
+ parted --script --align=optimal "${disk}" -- \
+ mklabel gpt \
+ mkpart EFI 2MiB 1GiB \
+ mkpart bpool 1GiB 5GiB \
+ mkpart rpool 5GiB -$((SWAPSIZE + RESERVE))GiB \
+ mkpart swap -$((SWAPSIZE + RESERVE))GiB -"${RESERVE}"GiB \
+ mkpart BIOS 1MiB 2MiB \
+ set 1 esp on \
+ set 5 bios_grub on \
+ set 5 legacy_boot on
+
+ partprobe "${disk}"
+}
+
+for i in ${DISK}; do
+ partition_disk "${i}"
+done
+
Setup encrypted swap. This is useful if the available memory is +small:
+for i in ${DISK}; do
+ cryptsetup open --type plain --key-file /dev/random "${i}"-part4 "${i##*/}"-part4
+ mkswap /dev/mapper/"${i##*/}"-part4
+ swapon /dev/mapper/"${i##*/}"-part4
+done
+
Load ZFS kernel module
+modprobe zfs
+
Create boot pool
+# shellcheck disable=SC2046
+zpool create -o compatibility=legacy \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O devices=off \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/boot \
+ -R "${MNT}" \
+ bpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '%s ' "${i}-part2";
+ done)
+
If not using a multi-disk setup, remove mirror
.
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c.
+This step creates a separate boot pool for /boot
with the features
+limited to only those that GRUB supports, allowing the root pool to use
+any/all features.
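If you are curious which features the compatibility=legacy setting actually enabled on the boot pool, you can optionally inspect its feature properties (an informational check only, not part of the installation):
zpool get all bpool | grep feature@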
Create root pool
+# shellcheck disable=SC2046
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -R "${MNT}" \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O compression=zstd \
+ -O dnodesize=auto \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/ \
+ rpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '%s ' "${i}-part3";
+ done)
+
If not using a multi-disk setup, remove mirror
.
Create root system container:
+Unencrypted
+zfs create \
+ -o canmount=off \
+ -o mountpoint=none \
+rpool/alpinelinux
+
Encrypted:
Avoid ZFS send/recv when using native encryption; see a ZFS developer's comment on this issue and this spreadsheet of bugs. A LUKS-based guide has yet to be written. Once the key is compromised, changing the password will not keep your data safe. See zfs-change-key(8) for more info.
zfs create \
+ -o canmount=off \
+ -o mountpoint=none \
+ -o encryption=on \
+ -o keylocation=prompt \
+ -o keyformat=passphrase \
+rpool/alpinelinux
+
You can automate this step (insecure) with: echo POOLPASS | zfs create ...
.
Create system datasets,
+manage mountpoints with mountpoint=legacy
zfs create -o canmount=noauto -o mountpoint=/ rpool/alpinelinux/root
+zfs mount rpool/alpinelinux/root
+zfs create -o mountpoint=legacy rpool/alpinelinux/home
+mkdir "${MNT}"/home
+mount -t zfs rpool/alpinelinux/home "${MNT}"/home
+zfs create -o mountpoint=legacy rpool/alpinelinux/var
+zfs create -o mountpoint=legacy rpool/alpinelinux/var/lib
+zfs create -o mountpoint=legacy rpool/alpinelinux/var/log
+zfs create -o mountpoint=none bpool/alpinelinux
+zfs create -o mountpoint=legacy bpool/alpinelinux/root
+mkdir "${MNT}"/boot
+mount -t zfs bpool/alpinelinux/root "${MNT}"/boot
+mkdir -p "${MNT}"/var/log
+mkdir -p "${MNT}"/var/lib
+mount -t zfs rpool/alpinelinux/var/lib "${MNT}"/var/lib
+mount -t zfs rpool/alpinelinux/var/log "${MNT}"/var/log
+
Format and mount ESP
+for i in ${DISK}; do
+ mkfs.vfat -n EFI "${i}"-part1
+ mkdir -p "${MNT}"/boot/efis/"${i##*/}"-part1
+ mount -t vfat -o iocharset=iso8859-1 "${i}"-part1 "${MNT}"/boot/efis/"${i##*/}"-part1
+done
+
+mkdir -p "${MNT}"/boot/efi
+mount -t vfat -o iocharset=iso8859-1 "$(echo "${DISK}" | sed "s|^ *||" | cut -f1 -d' '|| true)"-part1 "${MNT}"/boot/efi
+
Workaround for GRUB to recognize predictable disk names:
+export ZPOOL_VDEV_NAME_PATH=YES
+
Install system to disk
+BOOTLOADER=grub setup-disk -k lts -v "${MNT}"
+
GRUB installation will fail here; GRUB will be reinstalled later. The error message about the ZFS kernel module can be ignored.
+Allow EFI system partition to fail at boot:
+sed -i "s|vfat.*rw|vfat rw,nofail|" "${MNT}"/etc/fstab
+
Chroot
+for i in /dev /proc /sys; do mkdir -p "${MNT}"/"${i}"; mount --rbind "${i}" "${MNT}"/"${i}"; done
+chroot "${MNT}" /usr/bin/env DISK="${DISK}" sh
+
Apply GRUB workaround
+echo 'export ZPOOL_VDEV_NAME_PATH=YES' >> /etc/profile.d/zpool_vdev_name_path.sh
+# shellcheck disable=SC1091
+. /etc/profile.d/zpool_vdev_name_path.sh
+
+# GRUB fails to detect rpool name, hard code as "rpool"
+sed -i "s|rpool=.*|rpool=rpool|" /etc/grub.d/10_linux
+
+# BusyBox stat does not recognize zfs, replace fs detection with ZFS
+sed -i 's|stat -f -c %T /|echo zfs|' /usr/sbin/grub-mkconfig
+
+# grub-probe fails to identify fs mounted at /boot
+BOOT_DEVICE=$(zpool status -P bpool | grep -- -part2 | head -n1 | sed "s|.*/dev*|/dev|" | sed "s|part2.*|part2|")
+sed -i "s|GRUB_DEVICE_BOOT=.*|GRUB_DEVICE_BOOT=${BOOT_DEVICE}|" /usr/sbin/grub-mkconfig
+
The sed
workaround for grub-mkconfig
needs to be applied
+for every GRUB update, as the update will overwrite the changes.
Install GRUB:
+mkdir -p /boot/efi/alpine/grub-bootdir/i386-pc/
+mkdir -p /boot/efi/alpine/grub-bootdir/x86_64-efi/
+for i in ${DISK}; do
+ grub-install --target=i386-pc --boot-directory \
+ /boot/efi/alpine/grub-bootdir/i386-pc/ "${i}"
+done
+grub-install --target x86_64-efi --boot-directory \
+ /boot/efi/alpine/grub-bootdir/x86_64-efi/ --efi-directory \
+ /boot/efi --bootloader-id alpine --removable
+if test -d /sys/firmware/efi/efivars/; then
+ apk add efibootmgr
+ grub-install --target x86_64-efi --boot-directory \
+ /boot/efi/alpine/grub-bootdir/x86_64-efi/ --efi-directory \
+ /boot/efi --bootloader-id alpine
+fi
+
Generate GRUB menu:
+mkdir -p /boot/grub
+grub-mkconfig -o /boot/grub/grub.cfg
+cp /boot/grub/grub.cfg \
+ /boot/efi/alpine/grub-bootdir/x86_64-efi/grub/grub.cfg
+cp /boot/grub/grub.cfg \
+ /boot/efi/alpine/grub-bootdir/i386-pc/grub/grub.cfg
+
For both legacy and EFI booting: mirror ESP content:
+espdir=$(mktemp -d)
+find /boot/efi/ -maxdepth 1 -mindepth 1 -type d -print0 \
+| xargs -t -0I '{}' cp -r '{}' "${espdir}"
+find "${espdir}" -maxdepth 1 -mindepth 1 -type d -print0 \
+| xargs -t -0I '{}' sh -vxc "find /boot/efis/ -maxdepth 1 -mindepth 1 -type d -print0 | xargs -t -0I '[]' cp -r '{}' '[]'"
+
Exit chroot
+exit
+
Unmount filesystems and create an initial system snapshot. You can later create a boot environment from this snapshot. See the Root on ZFS maintenance page.
+umount -Rl "${MNT}"
+zfs snapshot -r rpool@initial-installation
+zfs snapshot -r bpool@initial-installation
+zpool export -a
+
Reboot
+reboot
+
ZFSBootMenu
This tutorial is based on the GRUB bootloader. Due to its independent implementation of a read-only ZFS driver, GRUB only supports a subset of ZFS features on the boot pool. [In general, bootloaders treat disks as read-only to minimize the risk of damaging on-disk data.]
+ZFSBootMenu is an alternative bootloader +free of such limitations and has support for boot environments. Do not +follow instructions on this page if you plan to use ZBM, +as the layouts are not compatible. Refer +to their site for installation details.
+Customization
+Unless stated otherwise, it is not recommended to customize system +configuration before reboot.
+Only use well-tested pool features
+You should only use well-tested pool features. Avoid using new features if data integrity is paramount. See, for example, this comment.
+Disable Secure Boot. ZFS modules can not be loaded if Secure Boot is enabled.
Because the kernel of the latest Live CD might be incompatible with ZFS, we will use Alpine Linux Extended, which ships with ZFS by default.
+Download latest extended variant of Alpine Linux +live image, +verify checksum +and boot from it.
+gpg --auto-key-retrieve --keyserver hkps://keyserver.ubuntu.com --verify alpine-extended-*.asc
+
+dd if=input-file of=output-file bs=1M
+
Login as root user. There is no password.
Configure Internet
+setup-interfaces -r
+# You must use "-r" option to start networking services properly
+# example:
+network interface: wlan0
+WiFi name: <ssid>
+ip address: dhcp
+<enter done to finish network config>
+manual netconfig: n
+
If you are using wireless network and it is not shown, see Alpine
+Linux wiki for
+further details. wpa_supplicant
can be installed with apk
+add wpa_supplicant
without internet connection.
Configure SSH server
+setup-sshd
+# example:
+ssh server: openssh
+allow root: "prohibit-password" or "yes"
+ssh key: "none" or "<public key>"
+
Set root password or /root/.ssh/authorized_keys
.
Connect from another computer
+ssh root@192.168.1.91
+
Configure NTP client for time synchronization
+setup-ntp busybox
+
Set up apk-repo. A list of available mirrors is shown. +Press space bar to continue
+setup-apkrepos
+
Throughout this guide, we use predictable disk names generated by +udev
+apk update
+apk add eudev
+setup-devd udev
+
Target disk
+List available disks with
+find /dev/disk/by-id/
+
If virtio is used as disk bus, power off the VM and set serial numbers for disk.
+For QEMU, use -drive format=raw,file=disk2.img,serial=AaBb
.
+For libvirt, edit domain XML. See this page for examples.
Declare disk array
+DISK='/dev/disk/by-id/ata-FOO /dev/disk/by-id/nvme-BAR'
+
For single disk installation, use
+DISK='/dev/disk/by-id/disk1'
+
Set a mount point
+MNT=$(mktemp -d)
+
Set partition size:
Set the swap size in GB; set it to 1 if you don’t want swap to take up too much space
+SWAPSIZE=4
+
Set how much space should be left at the end of the disk, minimum 1GB
+RESERVE=1
+
Install ZFS support from live media:
+apk add zfs
+
Install partition tool
+apk add parted e2fsprogs cryptsetup util-linux
+
Partition the disks.
+Note: you must clear all existing partition tables and data structures from target disks.
+For flash-based storage, this can be done by the blkdiscard command below:
+partition_disk () {
+ local disk="${1}"
+ blkdiscard -f "${disk}" || true
+
+ parted --script --align=optimal "${disk}" -- \
+ mklabel gpt \
+ mkpart EFI 2MiB 1GiB \
+ mkpart bpool 1GiB 5GiB \
+ mkpart rpool 5GiB -$((SWAPSIZE + RESERVE))GiB \
+ mkpart swap -$((SWAPSIZE + RESERVE))GiB -"${RESERVE}"GiB \
+ mkpart BIOS 1MiB 2MiB \
+ set 1 esp on \
+ set 5 bios_grub on \
+ set 5 legacy_boot on
+
+ partprobe "${disk}"
+}
+
+for i in ${DISK}; do
+ partition_disk "${i}"
+done
+
Setup encrypted swap. This is useful if the available memory is +small:
+for i in ${DISK}; do
+ cryptsetup open --type plain --key-file /dev/random "${i}"-part4 "${i##*/}"-part4
+ mkswap /dev/mapper/"${i##*/}"-part4
+ swapon /dev/mapper/"${i##*/}"-part4
+done
+
Load ZFS kernel module
+modprobe zfs
+
Create boot pool
+# shellcheck disable=SC2046
+zpool create -o compatibility=legacy \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O devices=off \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/boot \
+ -R "${MNT}" \
+ bpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '%s ' "${i}-part2";
+ done)
+
If not using a multi-disk setup, remove mirror
.
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c.
+This step creates a separate boot pool for /boot
with the features
+limited to only those that GRUB supports, allowing the root pool to use
+any/all features.
Create root pool
+# shellcheck disable=SC2046
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -R "${MNT}" \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O compression=zstd \
+ -O dnodesize=auto \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/ \
+ rpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '%s ' "${i}-part3";
+ done)
+
If not using a multi-disk setup, remove mirror
.
Create root system container:
+Unencrypted
+zfs create \
+ -o canmount=off \
+ -o mountpoint=none \
+rpool/archlinux
+
Encrypted:
Avoid ZFS send/recv when using native encryption; see a ZFS developer's comment on this issue and this spreadsheet of bugs. A LUKS-based guide has yet to be written. Once the key is compromised, changing the password will not keep your data safe. See zfs-change-key(8) for more info.
zfs create \
+ -o canmount=off \
+ -o mountpoint=none \
+ -o encryption=on \
+ -o keylocation=prompt \
+ -o keyformat=passphrase \
+rpool/archlinux
+
You can automate this step (insecure) with: echo POOLPASS | zfs create ...
.
Create system datasets,
+manage mountpoints with mountpoint=legacy
zfs create -o canmount=noauto -o mountpoint=/ rpool/archlinux/root
+zfs mount rpool/archlinux/root
+zfs create -o mountpoint=legacy rpool/archlinux/home
+mkdir "${MNT}"/home
+mount -t zfs rpool/archlinux/home "${MNT}"/home
+zfs create -o mountpoint=legacy rpool/archlinux/var
+zfs create -o mountpoint=legacy rpool/archlinux/var/lib
+zfs create -o mountpoint=legacy rpool/archlinux/var/log
+zfs create -o mountpoint=none bpool/archlinux
+zfs create -o mountpoint=legacy bpool/archlinux/root
+mkdir "${MNT}"/boot
+mount -t zfs bpool/archlinux/root "${MNT}"/boot
+mkdir -p "${MNT}"/var/log
+mkdir -p "${MNT}"/var/lib
+mount -t zfs rpool/archlinux/var/lib "${MNT}"/var/lib
+mount -t zfs rpool/archlinux/var/log "${MNT}"/var/log
+
Format and mount ESP
+for i in ${DISK}; do
+ mkfs.vfat -n EFI "${i}"-part1
+ mkdir -p "${MNT}"/boot/efis/"${i##*/}"-part1
+ mount -t vfat -o iocharset=iso8859-1 "${i}"-part1 "${MNT}"/boot/efis/"${i##*/}"-part1
+done
+
+mkdir -p "${MNT}"/boot/efi
+mount -t vfat -o iocharset=iso8859-1 "$(echo "${DISK}" | sed "s|^ *||" | cut -f1 -d' '|| true)"-part1 "${MNT}"/boot/efi
+
Download and extract minimal Arch Linux root filesystem:
+apk add curl
+
+curl --fail-early --fail -L \
+https://america.archive.pkgbuild.com/iso/2023.09.01/archlinux-bootstrap-x86_64.tar.gz \
+-o rootfs.tar.gz
+curl --fail-early --fail -L \
+https://america.archive.pkgbuild.com/iso/2023.09.01/archlinux-bootstrap-x86_64.tar.gz.sig \
+-o rootfs.tar.gz.sig
+
+apk add gnupg
+gpg --auto-key-retrieve --keyserver hkps://keyserver.ubuntu.com --verify rootfs.tar.gz.sig
+
+ln -s "${MNT}" "${MNT}"/root.x86_64
+tar x -C "${MNT}" -af rootfs.tar.gz root.x86_64
+
Enable community repo
+sed -i '/edge/d' /etc/apk/repositories
+sed -i -E 's/#(.*)community/\1community/' /etc/apk/repositories
+
Generate fstab:
+apk add arch-install-scripts
+genfstab -t PARTUUID "${MNT}" \
+| grep -v swap \
+| sed "s|vfat.*rw|vfat rw,x-systemd.idle-timeout=1min,x-systemd.automount,noauto,nofail|" \
+> "${MNT}"/etc/fstab
+
Chroot
+cp /etc/resolv.conf "${MNT}"/etc/resolv.conf
+for i in /dev /proc /sys; do mkdir -p "${MNT}"/"${i}"; mount --rbind "${i}" "${MNT}"/"${i}"; done
+chroot "${MNT}" /usr/bin/env DISK="${DISK}" bash
+
Add archzfs repo to pacman config
+pacman-key --init
+pacman-key --refresh-keys
+pacman-key --populate
+
+curl --fail-early --fail -L https://archzfs.com/archzfs.gpg \
+| pacman-key -a - --gpgdir /etc/pacman.d/gnupg
+
+pacman-key \
+--lsign-key \
+--gpgdir /etc/pacman.d/gnupg \
+DDF7DB817396A49B2A2723F7403BD972F75D9D76
+
+tee -a /etc/pacman.d/mirrorlist-archzfs <<- 'EOF'
+## See https://github.com/archzfs/archzfs/wiki
+## France
+#,Server = https://archzfs.com/$repo/$arch
+
+## Germany
+#,Server = https://mirror.sum7.eu/archlinux/archzfs/$repo/$arch
+#,Server = https://mirror.biocrafting.net/archlinux/archzfs/$repo/$arch
+
+## India
+#,Server = https://mirror.in.themindsmaze.com/archzfs/$repo/$arch
+
+## United States
+#,Server = https://zxcvfdsa.com/archzfs/$repo/$arch
+EOF
+
+tee -a /etc/pacman.conf <<- 'EOF'
+
+#[archzfs-testing]
+#Include = /etc/pacman.d/mirrorlist-archzfs
+
+#,[archzfs]
+#,Include = /etc/pacman.d/mirrorlist-archzfs
+EOF
+
+# this #, prefix is a workaround for ci/cd tests
+# remove them
+sed -i 's|#,||' /etc/pacman.d/mirrorlist-archzfs
+sed -i 's|#,||' /etc/pacman.conf
+sed -i 's|^#||' /etc/pacman.d/mirrorlist
+
Install base packages:
+pacman -Sy
+pacman -S --noconfirm mg mandoc grub efibootmgr mkinitcpio
+
+kernel_compatible_with_zfs="$(pacman -Si zfs-linux \
+| grep 'Depends On' \
+| sed "s|.*linux=||" \
+| awk '{ print $1 }')"
+pacman -U --noconfirm https://america.archive.pkgbuild.com/packages/l/linux/linux-"${kernel_compatible_with_zfs}"-x86_64.pkg.tar.zst
+
Install zfs packages:
+pacman -S --noconfirm zfs-linux zfs-utils
+
Configure mkinitcpio:
+sed -i 's|filesystems|zfs filesystems|' /etc/mkinitcpio.conf
+mkinitcpio -P
+
For physical machine, install firmware
+pacman -S linux-firmware intel-ucode amd-ucode
+
Enable internet time synchronisation:
+systemctl enable systemd-timesyncd
+
Generate host id:
+zgenhostid -f -o /etc/hostid
+
Generate locales:
+echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen
+locale-gen
+
Set locale, keymap, timezone, hostname
+rm -f /etc/localtime
+systemd-firstboot \
+--force \
+--locale=en_US.UTF-8 \
+--timezone=Etc/UTC \
+--hostname=testhost \
+--keymap=us
+
Set root passwd
+printf 'root:yourpassword' | chpasswd
+
Apply GRUB workaround
+echo 'export ZPOOL_VDEV_NAME_PATH=YES' >> /etc/profile.d/zpool_vdev_name_path.sh
+# shellcheck disable=SC1091
+. /etc/profile.d/zpool_vdev_name_path.sh
+
+# GRUB fails to detect rpool name, hard code as "rpool"
+sed -i "s|rpool=.*|rpool=rpool|" /etc/grub.d/10_linux
+
This workaround needs to be applied for every GRUB update, as the +update will overwrite the changes.
+Install GRUB:
+mkdir -p /boot/efi/archlinux/grub-bootdir/i386-pc/
+mkdir -p /boot/efi/archlinux/grub-bootdir/x86_64-efi/
+for i in ${DISK}; do
+ grub-install --target=i386-pc --boot-directory \
+ /boot/efi/archlinux/grub-bootdir/i386-pc/ "${i}"
+done
+grub-install --target x86_64-efi --boot-directory \
+ /boot/efi/archlinux/grub-bootdir/x86_64-efi/ --efi-directory \
+ /boot/efi --bootloader-id archlinux --removable
+if test -d /sys/firmware/efi/efivars/; then
+ grub-install --target x86_64-efi --boot-directory \
+ /boot/efi/archlinux/grub-bootdir/x86_64-efi/ --efi-directory \
+ /boot/efi --bootloader-id archlinux
+fi
+
Import both bpool and rpool at boot:
+echo 'GRUB_CMDLINE_LINUX="zfs_import_dir=/dev/"' >> /etc/default/grub
+
Generate GRUB menu:
+mkdir -p /boot/grub
+grub-mkconfig -o /boot/grub/grub.cfg
+cp /boot/grub/grub.cfg \
+ /boot/efi/archlinux/grub-bootdir/x86_64-efi/grub/grub.cfg
+cp /boot/grub/grub.cfg \
+ /boot/efi/archlinux/grub-bootdir/i386-pc/grub/grub.cfg
+
For both legacy and EFI booting: mirror ESP content:
+espdir=$(mktemp -d)
+find /boot/efi/ -maxdepth 1 -mindepth 1 -type d -print0 \
+| xargs -t -0I '{}' cp -r '{}' "${espdir}"
+find "${espdir}" -maxdepth 1 -mindepth 1 -type d -print0 \
+| xargs -t -0I '{}' sh -vxc "find /boot/efis/ -maxdepth 1 -mindepth 1 -type d -print0 | xargs -t -0I '[]' cp -r '{}' '[]'"
+
Exit chroot
+exit
+
Unmount filesystems and create an initial system snapshot. You can later create a boot environment from this snapshot. See the Root on ZFS maintenance page.
+umount -Rl "${MNT}"
+zfs snapshot -r rpool@initial-installation
+zfs snapshot -r bpool@initial-installation
+
Export all pools
+zpool export -a
+
Reboot
+reboot
+
Reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat.
+If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @ne9z.
+Due to license incompatibility, +ZFS is not available in Arch Linux official repo.
+ZFS support is provided by third-party archzfs repo.
+See Archlinux Wiki.
+ZFS can be used as root file system for Arch Linux. +An installation guide is available.
+Fork and clone this repo.
Install the tools:
+sudo pacman -S --needed python-pip make
+
+pip3 install -r docs/requirements.txt
+
+# Add ~/.local/bin to your "${PATH}", e.g. by adding this to ~/.bashrc:
+[ -d "${HOME}"/.local/bin ] && export PATH="${HOME}"/.local/bin:"${PATH}"
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request. Mention @ne9z.
This HOWTO uses a whole physical disk.
Do not use these instructions for dual-booting.
Back up your data. Any existing data will be lost.
64-bit Debian GNU/Linux Bookworm Live CD w/ GUI (e.g. gnome iso)
Installing on a drive which presents 4 KiB logical sectors (a “4Kn” drive) only works with UEFI booting. This is not unique to ZFS. GRUB does not and will not work on 4Kn with legacy (BIOS) booting.
Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of memory +is recommended for normal performance in basic workloads. If you wish to use +deduplication, you will need massive amounts of RAM. Enabling +deduplication is a permanent change that cannot be easily reverted.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @rlaager.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo apt install python3-pip
+
+pip3 install -r docs/requirements.txt
+
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request. Mention @rlaager.
This guide supports three different encryption options: unencrypted, ZFS +native encryption, and LUKS. With any option, all ZFS features are fully +available.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+ZFS native encryption encrypts the data and most metadata in the root
+pool. It does not encrypt dataset or snapshot names or properties. The
+boot pool is not encrypted at all, but it only contains the bootloader,
+kernel, and initrd. (Unless you put a password in /etc/fstab
, the
+initrd is unlikely to contain sensitive data.) The system cannot boot
+without the passphrase being entered at the console. Performance is
+good. As the encryption happens in ZFS, even if multiple disks (mirror
+or raidz topologies) are used, the data only has to be encrypted once.
LUKS encrypts almost everything. The only unencrypted data is the bootloader, +kernel, and initrd. The system cannot boot without the passphrase being +entered at the console. Performance is good, but LUKS sits underneath ZFS, so +if multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+Boot the Debian GNU/Linux Live CD. If prompted, login with the username
+user
and password live
. Connect your system to the Internet as
+appropriate (e.g. join your WiFi network). Open a terminal.
Setup and update the repositories:
+sudo vi /etc/apt/sources.list
+
deb http://deb.debian.org/debian bookworm main contrib non-free-firmware
+
sudo apt update
+
Optional: Install and start the OpenSSH server in the Live CD environment:
+If you have a second system, using SSH to access the target system can be +convenient:
+sudo apt install --yes openssh-server
+
+sudo systemctl restart ssh
+
Hint: You can find your IP address with
+ip addr show scope global | grep inet
. Then, from your main machine,
+connect with ssh user@IP
.
Disable automounting:
+If the disk has been used before (with partitions at the same offsets), +previous filesystems (e.g. the ESP) will automount if not disabled:
+gsettings set org.gnome.desktop.media-handling automount false
+
Become root:
+sudo -i
+
Install ZFS in the Live CD environment:
+apt install --yes debootstrap gdisk zfsutils-linux
+
Set a variable with the disk name:
+DISK=/dev/disk/by-id/scsi-SATA_disk1
+
Always use the long /dev/disk/by-id/*
aliases with ZFS. Using the
+/dev/sd*
device nodes directly can cause sporadic import failures,
+especially on systems that have more than one storage pool.
Hints:
+ls -la /dev/disk/by-id
will list the aliases.
Are you doing this in a virtual machine? If your virtual disk is missing
+from /dev/disk/by-id
, use /dev/vda
if you are using KVM with
+virtio. Also when using /dev/vda, the partitions used later will be named
+differently. Otherwise, read the troubleshooting
+section.
For a mirror or raidz topology, use DISK1
, DISK2
, etc.
When choosing a boot pool size, consider how you will use the space. A +kernel and initrd may consume around 100M. If you have multiple kernels +and take snapshots, you may find yourself low on boot pool space, +especially if you need to regenerate your initramfs images, which may be +around 85M each. Size your boot pool appropriately for your needs.
If you are re-using a disk, clear it as necessary:
+Ensure swap partitions are not in use:
+swapoff --all
+
If the disk was previously used in an MD array:
+apt install --yes mdadm
+
+# See if one or more MD arrays are active:
+cat /proc/mdstat
+# If so, stop them (replace ``md0`` as required):
+mdadm --stop /dev/md0
+
+# For an array using the whole disk:
+mdadm --zero-superblock --force $DISK
+# For an array using a partition:
+mdadm --zero-superblock --force ${DISK}-part2
+
If the disk was previously used with zfs:
+wipefs -a $DISK
+
For flash-based storage, if the disk was previously used, you may wish to +do a full-disk discard (TRIM/UNMAP), which can improve performance:
+blkdiscard -f $DISK
+
Clear the partition table:
+sgdisk --zap-all $DISK
+
If you get a message about the kernel still using the old partition table, +reboot and start over (except that you can skip this step).
+Partition your disk(s):
+Run this if you need legacy (BIOS) booting:
+sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK
+
Run this for UEFI booting (for use now or in the future):
+sgdisk -n2:1M:+512M -t2:EF00 $DISK
+
Run this for the boot pool:
+sgdisk -n3:0:+1G -t3:BF01 $DISK
+
Choose one of the following options:
+Unencrypted or ZFS native encryption:
+sgdisk -n4:0:0 -t4:BF00 $DISK
+
LUKS:
+sgdisk -n4:0:0 -t4:8309 $DISK
+
If you are creating a mirror or raidz topology, repeat the partitioning +commands for all the disks which will be part of the pool.
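As an illustration only, with two disks declared as DISK1 and DISK2 (hypothetical variable names following the hint earlier) and the unencrypted/ZFS-native-encryption type for the root partition, the repetition could be scripted as:
for d in $DISK1 $DISK2; do
  sgdisk --zap-all $d                    # clear any old partition table
  sgdisk -a1 -n1:24K:+1000K -t1:EF02 $d  # legacy (BIOS) boot area, if needed
  sgdisk -n2:1M:+512M -t2:EF00 $d        # EFI system partition
  sgdisk -n3:0:+1G -t3:BF01 $d           # boot pool partition
  sgdisk -n4:0:0 -t4:BF00 $d             # root pool partition
done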
+Create the boot pool:
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -o compatibility=grub2 \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -O devices=off \
+ -O acltype=posixacl -O xattr=sa \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/boot -R /mnt \
+ bpool ${DISK}-part3
+
Note: GRUB does not support all zpool features (see
+spa_feature_names
in
+grub-core/fs/zfs/zfs.c).
+We create a separate zpool for /boot
here, specifying the
+-o compatibility=grub2
property which restricts the pool to only those
+features that GRUB supports, allowing the root pool to use any/all features.
See the section on Compatibility feature sets
in the zpool-features
+man page for more information.
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ bpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part3 \
+ /dev/disk/by-id/scsi-SATA_disk2-part3
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
The pool name is arbitrary. If changed, the new name must be used
+consistently. The bpool
convention originated in this HOWTO.
Create the root pool:
+Choose one of the following options:
+Unencrypted:
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
ZFS native encryption:
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
+ -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
LUKS:
+apt install --yes cryptsetup
+
+cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK}-part4
+cryptsetup luksOpen ${DISK}-part4 luks1
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/ -R /mnt \
+ rpool /dev/mapper/luks1
+
Notes:
+The use of ashift=12
is recommended here because many drives
+today have 4 KiB (or larger) physical sectors, even though they
+present 512 B logical sectors. Also, a future replacement drive may
+have 4 KiB physical sectors (in which case ashift=12
is desirable)
+or 4 KiB logical sectors (in which case ashift=12
is required).
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires ACLs.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain controller.
+Note that xattr=sa
is Linux-specific. If you move your
+xattr=sa
pool to another OpenZFS implementation besides ZFS-on-Linux,
+extended attributes will not be readable (though your data will be). If
+portability of extended attributes is important to you, omit the
+-O xattr=sa
above. Even if you do not want xattr=sa
for the whole
+pool, it is probably fine to use it for /var/log
.
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only filenames.
recordsize
is unset (leaving it at the default of 128 KiB). If you
+want to tune it (e.g. -O recordsize=1M
), see these various blog
+posts.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s documentation
+for further information.
Make sure to include the -part4
portion of the drive path. If you
+forget that, you are specifying the whole disk, which ZFS will then
+re-partition, and you will lose the bootloader partition(s).
ZFS native encryption now
+defaults to aes-256-gcm
.
For LUKS, the key size chosen is 512 bits. However, XTS mode requires two
+keys, so the LUKS key is split in half. Thus, -s 512
means AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup FAQ +for guidance.
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ rpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ /dev/disk/by-id/scsi-SATA_disk2-part4
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
When using LUKS with mirror or raidz topologies, use
+/dev/mapper/luks1
, /dev/mapper/luks2
, etc., which you will have
+to create using cryptsetup
.
The pool name is arbitrary. If changed, the new name must be used
+consistently. On systems that can automatically install to ZFS, the root
+pool is named rpool
by default.
Create filesystem datasets to act as containers:
+zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+zfs create -o canmount=off -o mountpoint=none bpool/BOOT
+
On Solaris systems, the root filesystem is cloned and the suffix is
+incremented for major system changes through pkg image-update
or
+beadm
. Similar functionality was implemented in Ubuntu with the
+zsys
tool, though its dataset layout is more complicated, and zsys
+is on life support. Even
+without such a tool, the rpool/ROOT and bpool/BOOT containers can still
+be used for manually created clones. That said, this HOWTO assumes a single
+filesystem for /boot
for simplicity.
Create filesystem datasets for the root and boot filesystems:
+zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian
+zfs mount rpool/ROOT/debian
+
+zfs create -o mountpoint=/boot bpool/BOOT/debian
+
With ZFS, it is not normally necessary to use a mount command (either
+mount
or zfs mount
). This situation is an exception because of
+canmount=noauto
.
Create datasets:
+zfs create rpool/home
+zfs create -o mountpoint=/root rpool/home/root
+chmod 700 /mnt/root
+zfs create -o canmount=off rpool/var
+zfs create -o canmount=off rpool/var/lib
+zfs create rpool/var/log
+zfs create rpool/var/spool
+
The datasets below are optional, depending on your preferences and/or +software choices.
+If you wish to separate these to exclude them from snapshots:
+zfs create -o com.sun:auto-snapshot=false rpool/var/cache
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/nfs
+zfs create -o com.sun:auto-snapshot=false rpool/var/tmp
+chmod 1777 /mnt/var/tmp
+
If you use /srv on this system:
+zfs create rpool/srv
+
If you use /usr/local on this system:
+zfs create -o canmount=off rpool/usr
+zfs create rpool/usr/local
+
If this system will have games installed:
+zfs create rpool/var/games
+
If this system will have a GUI:
+zfs create rpool/var/lib/AccountsService
+zfs create rpool/var/lib/NetworkManager
+
If this system will use Docker (which manages its own datasets & +snapshots):
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker
+
If this system will store local email in /var/mail:
+zfs create rpool/var/mail
+
If this system will use Snap packages:
+zfs create rpool/var/snap
+
If you use /var/www on this system:
+zfs create rpool/var/www
+
A tmpfs is recommended later, but if you want a separate dataset for
+/tmp
:
zfs create -o com.sun:auto-snapshot=false rpool/tmp
+chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data.
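As a sketch of what that separation buys you (assuming the @install snapshot created later in this guide, or any other snapshot that is the most recent one on the root filesystem):

# Roll back only the OS; rpool/home and the other user datasets are untouched.
# (zfs rollback without -r requires the snapshot to be the most recent one.)
zfs rollback rpool/ROOT/debian@install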
+If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for /tmp
,
+as shown above. This keeps the /tmp
data out of snapshots of your root
+filesystem. It also allows you to set a quota on rpool/tmp
, if you want
+to limit the maximum space used. Otherwise, you can use a tmpfs (RAM
+filesystem) later.
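For example (the 8 GiB limit is arbitrary):

# Cap the optional /tmp dataset at 8 GiB
zfs set quota=8G rpool/tmp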
Note: If you separate a directory required for booting (e.g. /etc
)
+into its own dataset, you must add it to
+ZFS_INITRD_ADDITIONAL_DATASETS
in /etc/default/zfs
. Datasets
+with canmount=off
(like rpool/usr
above) do not matter for this.
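For illustration only, assuming you had split /etc into a hypothetical rpool/etc dataset, the setting in /etc/default/zfs would look like:

# /etc/default/zfs (illustrative; rpool/etc is a hypothetical dataset)
ZFS_INITRD_ADDITIONAL_DATASETS="rpool/etc"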
Mount a tmpfs at /run:
+mkdir /mnt/run
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+
Install the minimal system:
+debootstrap bookworm /mnt
+
The debootstrap
command leaves the new system in an unconfigured state.
+An alternative to using debootstrap
is to copy the entirety of a
+working system into the new ZFS root.
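As a hedged sketch of that alternative (the exclude list is illustrative and assumes the usual virtual and temporary filesystems; adjust it for your layout):

# Copy a running system into the new ZFS root instead of using debootstrap
rsync -aAXH --info=progress2 \
    --exclude="/dev/*" --exclude="/proc/*" --exclude="/sys/*" \
    --exclude="/run/*" --exclude="/tmp/*" --exclude="/mnt/*" \
    / /mnt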
Copy in zpool.cache:
+mkdir /mnt/etc/zfs
+cp /etc/zfs/zpool.cache /mnt/etc/zfs/
+
Configure the hostname:
+Replace HOSTNAME
with the desired hostname:
hostname HOSTNAME
+hostname > /mnt/etc/hostname
+vi /mnt/etc/hosts
+
Add a line:
+127.0.1.1 HOSTNAME
+or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
Configure the network interface:
+Find the interface name:
+ip addr show
+
Adjust NAME
below to match your interface name:
vi /mnt/etc/network/interfaces.d/NAME
+
auto NAME
+iface NAME inet dhcp
+
Customize this file if the system is not a DHCP client.
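For example, a static configuration might look like the following (the interface name and addresses are placeholders):

# Illustrative static configuration in /mnt/etc/network/interfaces.d/ens3
auto ens3
iface ens3 inet static
    address 192.168.1.100
    netmask 255.255.255.0
    gateway 192.168.1.1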
+Configure the package sources:
+vi /mnt/etc/apt/sources.list
+
deb http://deb.debian.org/debian bookworm main contrib non-free-firmware
+deb-src http://deb.debian.org/debian bookworm main contrib non-free-firmware
+
+deb http://deb.debian.org/debian-security bookworm-security main contrib non-free-firmware
+deb-src http://deb.debian.org/debian-security bookworm-security main contrib non-free-firmware
+
+deb http://deb.debian.org/debian bookworm-updates main contrib non-free-firmware
+deb-src http://deb.debian.org/debian bookworm-updates main contrib non-free-firmware
+
Bind the virtual filesystems from the LiveCD environment to the new
+system and chroot
into it:
mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+chroot /mnt /usr/bin/env DISK=$DISK bash --login
+
Note: This is using --rbind
, not --bind
.
Configure a basic system environment:
+apt update
+
+apt install --yes console-setup locales
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available:
dpkg-reconfigure locales tzdata keyboard-configuration console-setup
+
Install ZFS in the chroot environment for the new system:
+apt install --yes dpkg-dev linux-headers-generic linux-image-generic
+
+apt install --yes zfs-initramfs
+
+echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf
+
Note: Ignore any error messages saying ERROR: Couldn't resolve
+device
and WARNING: Couldn't determine root device
. cryptsetup does
+not support ZFS.
For LUKS installs only, setup /etc/crypttab:
apt install --yes cryptsetup cryptsetup-initramfs
+
+echo luks1 /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part4) \
+ none luks,discard,initramfs > /etc/crypttab
+
The use of initramfs is a work-around for the fact that cryptsetup does not support ZFS.
Hint: If you are creating a mirror or raidz topology, repeat the
+/etc/crypttab
entries for luks2
, etc. adjusting for each disk.
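For example, assuming a second disk referenced by a DISK2 variable (as suggested in the disk-naming hints) that has already been LUKS-formatted and opened as luks2, the additional entry could be appended with:

echo luks2 /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK2}-part4) \
    none luks,discard,initramfs >> /etc/crypttab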
Install an NTP service to synchronize time. +This step is specific to Bookworm which does not install the package during +bootstrap. +Although this step is not necessary for ZFS, it is useful for internet +browsing where local clock drift can cause login failures:
+apt install systemd-timesyncd
+
Install GRUB
+Choose one of the following options:
+Install GRUB for legacy (BIOS) booting:
+apt install --yes grub-pc
+
Install GRUB for UEFI booting:
+apt install dosfstools
+
+mkdosfs -F 32 -s 1 -n EFI ${DISK}-part2
+mkdir /boot/efi
+echo /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part2) \
+ /boot/efi vfat defaults 0 0 >> /etc/fstab
+mount /boot/efi
+apt install --yes grub-efi-amd64 shim-signed
+
Notes:
+The -s 1
for mkdosfs
is only necessary for drives which present
+4 KiB logical sectors (“4Kn” drives) to meet the minimum cluster size
+(given the partition size of 512 MiB) for FAT32. It also works fine on
+drives which present 512 B sectors.
For a mirror or raidz topology, this step only installs GRUB on the +first disk. The other disk(s) will be handled later.
Optional: Remove os-prober:
+apt purge --yes os-prober
+
This avoids error messages from update-grub. os-prober is only +necessary in dual-boot configurations.
+Set a root password:
+passwd
+
Enable importing bpool
+This ensures that bpool
is always imported, regardless of whether
+/etc/zfs/zpool.cache
exists, whether it is in the cachefile or not,
+or whether zfs-import-scan.service
is enabled.
vi /etc/systemd/system/zfs-import-bpool.service
+
[Unit]
+DefaultDependencies=no
+Before=zfs-import-scan.service
+Before=zfs-import-cache.service
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStart=/sbin/zpool import -N -o cachefile=none bpool
+# Work-around to preserve zpool cache:
+ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
+ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache
+
+[Install]
+WantedBy=zfs-import.target
+
systemctl enable zfs-import-bpool.service
+
Note: For some disk configurations (NVMe?), this service may fail with an error
+indicating that the bpool
cannot be found. If this happens, add
+-d DISK-part3
(replace DISK
with the correct device path) to the
+zpool import
command.
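For illustration, with the example disk used in this guide, the modified line in the unit file would then read (substitute your own device path):

# In /etc/systemd/system/zfs-import-bpool.service (illustrative device path):
ExecStart=/sbin/zpool import -N -o cachefile=none -d /dev/disk/by-id/scsi-SATA_disk1-part3 bpool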
Optional (but recommended): Mount a tmpfs to /tmp
If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+systemctl enable tmp.mount
+
Optional: Install SSH:
+apt install --yes openssh-server
+
+vi /etc/ssh/sshd_config
+# Set: PermitRootLogin yes
+
Optional: For ZFS native encryption or LUKS, configure Dropbear for remote +unlocking:
+apt install --yes --no-install-recommends dropbear-initramfs
+mkdir -p /etc/dropbear/initramfs
+
+# Optional: Convert OpenSSH server keys for Dropbear
+for type in ecdsa ed25519 rsa ; do
+ cp /etc/ssh/ssh_host_${type}_key /tmp/openssh.key
+ ssh-keygen -p -N "" -m PEM -f /tmp/openssh.key
+ dropbearconvert openssh dropbear \
+ /tmp/openssh.key \
+ /etc/dropbear/initramfs/dropbear_${type}_host_key
+done
+rm /tmp/openssh.key
+
+# Add user keys in the same format as ~/.ssh/authorized_keys
+vi /etc/dropbear/initramfs/authorized_keys
+
+# If using a static IP, set it for the initramfs environment:
+vi /etc/initramfs-tools/initramfs.conf
+# The syntax is: IP=ADDRESS::GATEWAY:MASK:HOSTNAME:NIC
+# For example:
+# IP=192.168.1.100::192.168.1.1:255.255.255.0:myhostname:ens3
+# HOSTNAME and NIC are optional.
+
+# Rebuild the initramfs (required when changing any of the above):
+update-initramfs -u -k all
+
Notes:
+Converting the server keys makes Dropbear use the same keys as OpenSSH,
+avoiding host key mismatch warnings. Currently, dropbearconvert doesn’t
+understand the new OpenSSH private key format, so the
+keys need to be converted to the old PEM format first using
+ssh-keygen
. The downside of using the same keys for both OpenSSH and
+Dropbear is that the OpenSSH keys are then available on-disk, unencrypted
+in the initramfs.
Later, to use this functionality, SSH to the system (as root) while it is
+prompting for the passphrase during the boot process. For ZFS native
+encryption, run zfsunlock
. For LUKS, run cryptroot-unlock
.
You can optionally add command="/usr/bin/zfsunlock"
or
+command="/bin/cryptroot-unlock"
in front of the authorized_keys
+line to force the unlock command. This way, the unlock command runs
+automatically and is all that can be run.
Optional (but kindly requested): Install popcon
+The popularity-contest
package reports the list of packages install
+on your system. Showing that ZFS is popular may be helpful in terms of
+long-term attention from the distro.
apt install --yes popularity-contest
+
Choose Yes at the prompt.
+Verify that the ZFS boot filesystem is recognized:
+grub-probe /boot
+
Refresh the initrd files:
+update-initramfs -c -k all
+
Note: Ignore any error messages saying ERROR: Couldn't resolve
+device
and WARNING: Couldn't determine root device
. cryptsetup
+does not support ZFS.
Workaround GRUB’s missing zpool-features support:
+vi /etc/default/grub
+# Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian"
+
Optional (but highly recommended): Make debugging GRUB easier:
+vi /etc/default/grub
+# Remove quiet from: GRUB_CMDLINE_LINUX_DEFAULT
+# Uncomment: GRUB_TERMINAL=console
+# Save and quit.
+
Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired.
+Update the boot configuration:
+update-grub
+
Note: Ignore errors from osprober
, if present.
Install the boot loader:
+For legacy (BIOS) booting, install GRUB to the MBR:
+grub-install $DISK
+
Note that you are installing GRUB to the whole disk, not a partition.
+If you are creating a mirror or raidz topology, repeat the grub-install
+command for each disk in the pool.
For UEFI booting, install GRUB to the ESP:
+grub-install --target=x86_64-efi --efi-directory=/boot/efi \
+ --bootloader-id=debian --recheck --no-floppy
+
It is not necessary to specify the disk here. If you are creating a +mirror or raidz topology, the additional disks will be handled later.
+Fix filesystem mount ordering:
+We need to activate zfs-mount-generator
. This makes systemd aware of
+the separate mountpoints, which is important for things like /var/log
+and /var/tmp
. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
feature
+of systemd automatically use After=var-tmp.mount
.
mkdir /etc/zfs/zfs-list.cache
+touch /etc/zfs/zfs-list.cache/bpool
+touch /etc/zfs/zfs-list.cache/rpool
+zed -F &
+
Verify that zed
updated the cache by making sure these are not empty:
cat /etc/zfs/zfs-list.cache/bpool
+cat /etc/zfs/zfs-list.cache/rpool
+
If either is empty, force a cache update and check again:
+zfs set canmount=on bpool/BOOT/debian
+zfs set canmount=noauto rpool/ROOT/debian
+
If they are still empty, stop zed (as below), start zed (as above) and try +again.
+Once the files have data, stop zed
:
fg
+Press Ctrl-C.
+
Fix the paths to eliminate /mnt
:
sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
+
Optional: Snapshot the initial installation:
+zfs snapshot bpool/BOOT/debian@install
+zfs snapshot rpool/ROOT/debian@install
+
In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space.
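For example (the snapshot names and date suffix are arbitrary):

# Before an upgrade:
zfs snapshot bpool/BOOT/debian@pre-upgrade-2024-01-01
zfs snapshot rpool/ROOT/debian@pre-upgrade-2024-01-01
# Later, remove snapshots you no longer need:
zfs destroy rpool/ROOT/debian@pre-upgrade-2024-01-01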
+Exit from the chroot
environment back to the LiveCD environment:
exit
+
Run these commands in the LiveCD environment to unmount all +filesystems:
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+
If this fails for rpool, mounting it on boot will fail and you will need to
+zpool import -f rpool
, then exit
in the initramfs prompt.
Reboot:
+reboot
+
Wait for the newly installed system to boot normally. Login as root.
+Create a user account:
+Replace YOUR_USERNAME
with your desired username:
username=YOUR_USERNAME
+
+zfs create rpool/home/$username
+adduser $username
+
+cp -a /etc/skel/. /home/$username
+chown -R $username:$username /home/$username
+usermod -a -G audio,cdrom,dip,floppy,netdev,plugdev,sudo,video $username
+
Mirror GRUB
+If you installed to multiple disks, install GRUB on the additional +disks.
+For legacy (BIOS) booting:
+dpkg-reconfigure grub-pc
+
Hit enter until you get to the device selection screen. +Select (using the space bar) all of the disks (not partitions) in your pool.
+For UEFI booting:
+umount /boot/efi
+
For the second and subsequent disks (increment debian-2 to -3, etc.):
+dd if=/dev/disk/by-id/scsi-SATA_disk1-part2 \
+ of=/dev/disk/by-id/scsi-SATA_disk2-part2
+efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \
+ -p 2 -L "debian-2" -l '\EFI\debian\grubx64.efi'
+
+mount /boot/efi
+
Caution: On systems with extremely high memory pressure, using a +zvol for swap can result in lockup, regardless of how much swap is still +available. There is a bug report upstream.
+Create a volume dataset (zvol) for use as a swap device:
+zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \
+ -o logbias=throughput -o sync=always \
+ -o primarycache=metadata -o secondarycache=none \
+ -o com.sun:auto-snapshot=false rpool/swap
+
You can adjust the size (the 4G
part) to your needs.
The compression algorithm is set to zle
because it is the cheapest
+available algorithm. As this guide recommends ashift=12
(4 kiB
+blocks on disk), the common case of a 4 kiB page size means that no
+compression algorithm can reduce I/O. The exception is all-zero pages,
+which are dropped by ZFS; but some form of compression has to be enabled
+to get this behavior.
Configure the swap device:
+Caution: Always use long /dev/zvol
aliases in configuration
+files. Never use a short /dev/zdX
device name.
mkswap -f /dev/zvol/rpool/swap
+echo /dev/zvol/rpool/swap none swap discard 0 0 >> /etc/fstab
+echo RESUME=none > /etc/initramfs-tools/conf.d/resume
+
The RESUME=none
is necessary to disable resuming from hibernation.
+This does not work, as the zvol is not present (because the pool has not
+yet been imported) at the time the resume script runs. If it is not
+disabled, the boot process hangs for 30 seconds waiting for the swap
+zvol to appear.
Enable the swap device:
+swapon -av
+
Upgrade the minimal system:
+apt dist-upgrade --yes
+
Install a regular set of software:
+tasksel --new-install
+
Note: This will check “Debian desktop environment” and “print server” +by default. If you want a server installation, unselect those.
+Optional: Disable log compression:
+As /var/log
is already compressed by ZFS, logrotate’s compression is
+going to burn CPU and disk I/O for (in most cases) very little gain. Also,
+if you are making snapshots of /var/log
, logrotate’s compression will
+actually waste space, as the uncompressed data will live on in the
+snapshot. You can edit the files in /etc/logrotate.d
by hand to comment
+out compress
, or use this loop (copy-and-paste highly recommended):
for file in /etc/logrotate.d/* ; do
+ if grep -Eq "(^|[^#y])compress" "$file" ; then
+ sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file"
+ fi
+done
+
Reboot:
+reboot
+
Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
Optional: Delete the snapshots of the initial installation:
+sudo zfs destroy bpool/BOOT/debian@install
+sudo zfs destroy rpool/ROOT/debian@install
+
Optional: Disable the root password:
+sudo usermod -p '*' root
+
Optional (but highly recommended): Disable root SSH logins:
+If you installed SSH earlier, revert the temporary change:
+sudo vi /etc/ssh/sshd_config
+# Remove: PermitRootLogin yes
+
+sudo systemctl restart ssh
+
Optional: Re-enable the graphical boot process:
+If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer.
+sudo vi /etc/default/grub
+# Add quiet to GRUB_CMDLINE_LINUX_DEFAULT
+# Comment out GRUB_TERMINAL=console
+# Save and quit.
+
+sudo update-grub
+
Note: Ignore errors from osprober
, if present.
Optional: For LUKS installs only, backup the LUKS header:
+sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected by +your LUKS passphrase, but you may wish to use additional encryption.
+Hint: If you created a mirror or raidz topology, repeat this for each
+LUKS volume (luks2
, etc.).
Go through Step 1: Prepare The Install Environment.
+For LUKS, first unlock the disk(s):
+apt install --yes cryptsetup
+
+cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1
+# Repeat for additional disks, if this is a mirror or raidz topology.
+
Mount everything correctly:
+zpool export -a
+zpool import -N -R /mnt rpool
+zpool import -N -R /mnt bpool
+zfs load-key -a
+zfs mount rpool/ROOT/debian
+zfs mount -a
+
If needed, you can chroot into your installed environment:
+mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+chroot /mnt /bin/bash --login
+mount /boot/efi
+mount -a
+
Do whatever you need to do to fix your system.
+When done, cleanup:
+exit
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+reboot
+
Systems that require the arcsas
blob driver should add it to the
+/etc/initramfs-tools/modules
file and run update-initramfs -c -k all
.
Upgrade or downgrade the Areca driver if something like
+RIP: 0010:[<ffffffff8101b316>] [<ffffffff8101b316>] native_read_tsc+0x6/0x20
+appears anywhere in the kernel log. ZoL is unstable on systems that emit this
+error message.
Most problem reports for this tutorial involve mpt2sas
hardware that does
+slow asynchronous drive initialization, like some IBM M1015 or OEM-branded
+cards that have been flashed to the reference LSI firmware.
The basic problem is that disks on these controllers are not visible to the +Linux kernel until after the regular system is started, and ZoL does not +hotplug pool members. See https://github.com/zfsonlinux/zfs/issues/330.
+Most LSI cards are perfectly compatible with ZoL. If your card has this
+glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X
in
+/etc/default/zfs
. The system will wait X
seconds for all drives to
+appear before importing the pool.
Set a unique serial number on each virtual disk using libvirt or qemu
+(e.g. -drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890
).
To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host:
+sudo apt install ovmf
+sudo vi /etc/libvirt/qemu.conf
+
Uncomment these lines:
+nvram = [
+ "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/OVMF/OVMF_CODE.secboot.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF32_CODE.fd:/usr/share/AAVMF/AAVMF32_VARS.fd"
+]
+
sudo systemctl restart libvirtd.service
+
Set disk.EnableUUID = "TRUE"
in the vmx file or vsphere configuration.
+Doing this ensures that /dev/disk
aliases are created in the guest.
See Debian Bookworm Root on ZFS for +new installs. This guide is no longer receiving most updates. It continues +to exist for reference for existing installs that followed it.
This HOWTO uses a whole physical disk.
Do not use these instructions for dual-booting.
Backup your data. Any existing data will be lost.
64-bit Debian GNU/Linux Bullseye Live CD w/ GUI (e.g. gnome iso)
Installing on a drive which presents 4 KiB logical sectors (a “4Kn” drive) +only works with UEFI booting. This not unique to ZFS. GRUB does not and +will not work on 4Kn with legacy (BIOS) booting.
Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of memory +is recommended for normal performance in basic workloads. If you wish to use +deduplication, you will need massive amounts of RAM. Enabling +deduplication is a permanent change that cannot be easily reverted.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @rlaager.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo apt install python3-pip
+
+pip3 install -r docs/requirements.txt
+
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request. Mention @rlaager.
This guide supports three different encryption options: unencrypted, ZFS +native encryption, and LUKS. With any option, all ZFS features are fully +available.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+ZFS native encryption encrypts the data and most metadata in the root
+pool. It does not encrypt dataset or snapshot names or properties. The
+boot pool is not encrypted at all, but it only contains the bootloader,
+kernel, and initrd. (Unless you put a password in /etc/fstab
, the
+initrd is unlikely to contain sensitive data.) The system cannot boot
+without the passphrase being entered at the console. Performance is
+good. As the encryption happens in ZFS, even if multiple disks (mirror
+or raidz topologies) are used, the data only has to be encrypted once.
LUKS encrypts almost everything. The only unencrypted data is the bootloader, +kernel, and initrd. The system cannot boot without the passphrase being +entered at the console. Performance is good, but LUKS sits underneath ZFS, so +if multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+Boot the Debian GNU/Linux Live CD. If prompted, login with the username
+user
and password live
. Connect your system to the Internet as
+appropriate (e.g. join your WiFi network). Open a terminal.
Setup and update the repositories:
+sudo vi /etc/apt/sources.list
+
deb http://deb.debian.org/debian bullseye main contrib
+
sudo apt update
+
Optional: Install and start the OpenSSH server in the Live CD environment:
+If you have a second system, using SSH to access the target system can be +convenient:
+sudo apt install --yes openssh-server
+
+sudo systemctl restart ssh
+
Hint: You can find your IP address with
+ip addr show scope global | grep inet
. Then, from your main machine,
+connect with ssh user@IP
.
Disable automounting:
+If the disk has been used before (with partitions at the same offsets), +previous filesystems (e.g. the ESP) will automount if not disabled:
+gsettings set org.gnome.desktop.media-handling automount false
+
Become root:
+sudo -i
+
Install ZFS in the Live CD environment:
+apt install --yes debootstrap gdisk zfsutils-linux
+
Set a variable with the disk name:
+DISK=/dev/disk/by-id/scsi-SATA_disk1
+
Always use the long /dev/disk/by-id/*
aliases with ZFS. Using the
+/dev/sd*
device nodes directly can cause sporadic import failures,
+especially on systems that have more than one storage pool.
Hints:
+ls -la /dev/disk/by-id
will list the aliases.
Are you doing this in a virtual machine? If your virtual disk is missing
+from /dev/disk/by-id
, use /dev/vda
if you are using KVM with
+virtio; otherwise, read the troubleshooting
+section.
For a mirror or raidz topology, use DISK1
, DISK2
, etc.
When choosing a boot pool size, consider how you will use the space. A +kernel and initrd may consume around 100M. If you have multiple kernels +and take snapshots, you may find yourself low on boot pool space, +especially if you need to regenerate your initramfs images, which may be +around 85M each. Size your boot pool appropriately for your needs.
If you are re-using a disk, clear it as necessary:
+Ensure swap partitions are not in use:
+swapoff --all
+
If the disk was previously used in an MD array:
+apt install --yes mdadm
+
+# See if one or more MD arrays are active:
+cat /proc/mdstat
+# If so, stop them (replace ``md0`` as required):
+mdadm --stop /dev/md0
+
+# For an array using the whole disk:
+mdadm --zero-superblock --force $DISK
+# For an array using a partition:
+mdadm --zero-superblock --force ${DISK}-part2
+
If the disk was previously used with zfs:
+wipefs -a $DISK
+
For flash-based storage, if the disk was previously used, you may wish to +do a full-disk discard (TRIM/UNMAP), which can improve performance:
+blkdiscard -f $DISK
+
Clear the partition table:
+sgdisk --zap-all $DISK
+
If you get a message about the kernel still using the old partition table, +reboot and start over (except that you can skip this step).
+Partition your disk(s):
+Run this if you need legacy (BIOS) booting:
+sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK
+
Run this for UEFI booting (for use now or in the future):
+sgdisk -n2:1M:+512M -t2:EF00 $DISK
+
Run this for the boot pool:
+sgdisk -n3:0:+1G -t3:BF01 $DISK
+
Choose one of the following options:
+Unencrypted or ZFS native encryption:
+sgdisk -n4:0:0 -t4:BF00 $DISK
+
LUKS:
+sgdisk -n4:0:0 -t4:8309 $DISK
+
If you are creating a mirror or raidz topology, repeat the partitioning +commands for all the disks which will be part of the pool.
+Create the boot pool:
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on -d \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o feature@async_destroy=enabled \
+ -o feature@bookmarks=enabled \
+ -o feature@embedded_data=enabled \
+ -o feature@empty_bpobj=enabled \
+ -o feature@enabled_txg=enabled \
+ -o feature@extensible_dataset=enabled \
+ -o feature@filesystem_limits=enabled \
+ -o feature@hole_birth=enabled \
+ -o feature@large_blocks=enabled \
+ -o feature@livelist=enabled \
+ -o feature@lz4_compress=enabled \
+ -o feature@spacemap_histogram=enabled \
+ -o feature@zpool_checkpoint=enabled \
+ -O devices=off \
+ -O acltype=posixacl -O xattr=sa \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/boot -R /mnt \
+ bpool ${DISK}-part3
+
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c.
+This step creates a separate boot pool for /boot
with the features
+limited to only those that GRUB supports, allowing the root pool to use
+any/all features. Note that GRUB opens the pool read-only, so all
+read-only compatible features are “supported” by GRUB.
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ bpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part3 \
+ /dev/disk/by-id/scsi-SATA_disk2-part3
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
The pool name is arbitrary. If changed, the new name must be used
+consistently. The bpool
convention originated in this HOWTO.
Feature Notes:
+The allocation_classes
feature should be safe to use. However, unless
+one is using it (i.e. a special
vdev), there is no point to enabling
+it. It is extremely unlikely that someone would use this feature for a
+boot pool. If one cares about speeding up the boot pool, it would make
+more sense to put the whole pool on the faster disk rather than using it
+as a special
vdev.
The device_rebuild
feature should be safe to use (except on raidz,
+which it is incompatible with), but the boot pool is small, so this does
+not matter in practice.
The log_spacemap
and spacemap_v2
features have been tested and
+are safe to use. The boot pool is small, so these do not matter in
+practice.
The project_quota
feature has been tested and is safe to use. This
+feature is extremely unlikely to matter for the boot pool.
The resilver_defer
should be safe but the boot pool is small enough
+that it is unlikely to be necessary.
As a read-only compatible feature, the userobj_accounting
feature
+should be compatible in theory, but in practice, GRUB can fail with an
+“invalid dnode type” error. This feature does not matter for /boot
+anyway.
Create the root pool:
+Choose one of the following options:
+Unencrypted:
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
ZFS native encryption:
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
+ -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
LUKS:
+apt install --yes cryptsetup
+
+cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK}-part4
+cryptsetup luksOpen ${DISK}-part4 luks1
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/ -R /mnt \
+ rpool /dev/mapper/luks1
+
Notes:
+The use of ashift=12
is recommended here because many drives
+today have 4 KiB (or larger) physical sectors, even though they
+present 512 B logical sectors. Also, a future replacement drive may
+have 4 KiB physical sectors (in which case ashift=12
is desirable)
+or 4 KiB logical sectors (in which case ashift=12
is required).
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires ACLs.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain controller.
+Note that xattr=sa
is Linux-specific. If you move your
+xattr=sa
pool to another OpenZFS implementation besides ZFS-on-Linux,
+extended attributes will not be readable (though your data will be). If
+portability of extended attributes is important to you, omit the
+-O xattr=sa
above. Even if you do not want xattr=sa
for the whole
+pool, it is probably fine to use it for /var/log
.
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only filenames.
recordsize
is unset (leaving it at the default of 128 KiB). If you
+want to tune it (e.g. -O recordsize=1M
), see these various blog
+posts.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s documentation
+for further information.
Make sure to include the -part4
portion of the drive path. If you
+forget that, you are specifying the whole disk, which ZFS will then
+re-partition, and you will lose the bootloader partition(s).
ZFS native encryption now
+defaults to aes-256-gcm
.
For LUKS, the key size chosen is 512 bits. However, XTS mode requires two
+keys, so the LUKS key is split in half. Thus, -s 512
means AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup FAQ +for guidance.
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ rpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ /dev/disk/by-id/scsi-SATA_disk2-part4
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
When using LUKS with mirror or raidz topologies, use
+/dev/mapper/luks1
, /dev/mapper/luks2
, etc., which you will have
+to create using cryptsetup
.
The pool name is arbitrary. If changed, the new name must be used
+consistently. On systems that can automatically install to ZFS, the root
+pool is named rpool
by default.
Create filesystem datasets to act as containers:
+zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+zfs create -o canmount=off -o mountpoint=none bpool/BOOT
+
On Solaris systems, the root filesystem is cloned and the suffix is
+incremented for major system changes through pkg image-update
or
+beadm
. Similar functionality was implemented in Ubuntu with the
+zsys
tool, though its dataset layout is more complicated, and zsys
+is on life support. Even
+without such a tool, the rpool/ROOT and bpool/BOOT containers can still
+be used for manually created clones. That said, this HOWTO assumes a single
+filesystem for /boot
for simplicity.
Create filesystem datasets for the root and boot filesystems:
+zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian
+zfs mount rpool/ROOT/debian
+
+zfs create -o mountpoint=/boot bpool/BOOT/debian
+
With ZFS, it is not normally necessary to use a mount command (either
+mount
or zfs mount
). This situation is an exception because of
+canmount=noauto
.
Create datasets:
+zfs create rpool/home
+zfs create -o mountpoint=/root rpool/home/root
+chmod 700 /mnt/root
+zfs create -o canmount=off rpool/var
+zfs create -o canmount=off rpool/var/lib
+zfs create rpool/var/log
+zfs create rpool/var/spool
+
The datasets below are optional, depending on your preferences and/or +software choices.
+If you wish to separate these to exclude them from snapshots:
+zfs create -o com.sun:auto-snapshot=false rpool/var/cache
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/nfs
+zfs create -o com.sun:auto-snapshot=false rpool/var/tmp
+chmod 1777 /mnt/var/tmp
+
If you use /srv on this system:
+zfs create rpool/srv
+
If you use /usr/local on this system:
+zfs create -o canmount=off rpool/usr
+zfs create rpool/usr/local
+
If this system will have games installed:
+zfs create rpool/var/games
+
If this system will have a GUI:
+zfs create rpool/var/lib/AccountsService
+zfs create rpool/var/lib/NetworkManager
+
If this system will use Docker (which manages its own datasets & +snapshots):
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker
+
If this system will store local email in /var/mail:
+zfs create rpool/var/mail
+
If this system will use Snap packages:
+zfs create rpool/var/snap
+
If you use /var/www on this system:
+zfs create rpool/var/www
+
A tmpfs is recommended later, but if you want a separate dataset for
+/tmp
:
zfs create -o com.sun:auto-snapshot=false rpool/tmp
+chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data.
+If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for /tmp
,
+as shown above. This keeps the /tmp
data out of snapshots of your root
+filesystem. It also allows you to set a quota on rpool/tmp
, if you want
+to limit the maximum space used. Otherwise, you can use a tmpfs (RAM
+filesystem) later.
Note: If you separate a directory required for booting (e.g. /etc
)
+into its own dataset, you must add it to
+ZFS_INITRD_ADDITIONAL_DATASETS
in /etc/default/zfs
. Datasets
+with canmount=off
(like rpool/usr
above) do not matter for this.
Mount a tmpfs at /run:
+mkdir /mnt/run
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+
Install the minimal system:
+debootstrap bullseye /mnt
+
The debootstrap
command leaves the new system in an unconfigured state.
+An alternative to using debootstrap
is to copy the entirety of a
+working system into the new ZFS root.
Copy in zpool.cache:
+mkdir /mnt/etc/zfs
+cp /etc/zfs/zpool.cache /mnt/etc/zfs/
+
Configure the hostname:
+Replace HOSTNAME
with the desired hostname:
hostname HOSTNAME
+hostname > /mnt/etc/hostname
+vi /mnt/etc/hosts
+
Add a line:
+127.0.1.1 HOSTNAME
+or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
Configure the network interface:
+Find the interface name:
+ip addr show
+
Adjust NAME
below to match your interface name:
vi /mnt/etc/network/interfaces.d/NAME
+
auto NAME
+iface NAME inet dhcp
+
Customize this file if the system is not a DHCP client.
+Configure the package sources:
+vi /mnt/etc/apt/sources.list
+
deb http://deb.debian.org/debian bullseye main contrib
+deb-src http://deb.debian.org/debian bullseye main contrib
+
+deb http://deb.debian.org/debian-security bullseye-security main contrib
+deb-src http://deb.debian.org/debian-security bullseye-security main contrib
+
+deb http://deb.debian.org/debian bullseye-updates main contrib
+deb-src http://deb.debian.org/debian bullseye-updates main contrib
+
Bind the virtual filesystems from the LiveCD environment to the new
+system and chroot
into it:
mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+chroot /mnt /usr/bin/env DISK=$DISK bash --login
+
Note: This is using --rbind
, not --bind
.
Configure a basic system environment:
+ln -s /proc/self/mounts /etc/mtab
+apt update
+
+apt install --yes console-setup locales
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available:
dpkg-reconfigure locales tzdata keyboard-configuration console-setup
+
Install ZFS in the chroot environment for the new system:
+apt install --yes dpkg-dev linux-headers-generic linux-image-generic
+
+apt install --yes zfs-initramfs
+
+echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf
+
Note: Ignore any error messages saying ERROR: Couldn't resolve
+device
and WARNING: Couldn't determine root device
. cryptsetup does
+not support ZFS.
For LUKS installs only, setup /etc/crypttab
:
apt install --yes cryptsetup cryptsetup-initramfs
+
+echo luks1 /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part4) \
+ none luks,discard,initramfs > /etc/crypttab
+
The use of initramfs is a work-around for the fact that cryptsetup does not support ZFS.
Hint: If you are creating a mirror or raidz topology, repeat the
+/etc/crypttab
entries for luks2
, etc. adjusting for each disk.
Install an NTP service to synchronize time. +This step is specific to Bullseye which does not install the package during +bootstrap. +Although this step is not necessary for ZFS, it is useful for internet +browsing where local clock drift can cause login failures:
+apt install systemd-timesyncd
+timedatectl
+
You should now see “NTP service: active” in the above timedatectl
+output.
Install GRUB
+Choose one of the following options:
+Install GRUB for legacy (BIOS) booting:
+apt install --yes grub-pc
+
Select (using the space bar) all of the disks (not partitions) in your +pool.
+Install GRUB for UEFI booting:
+apt install dosfstools
+
+mkdosfs -F 32 -s 1 -n EFI ${DISK}-part2
+mkdir /boot/efi
+echo /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part2) \
+ /boot/efi vfat defaults 0 0 >> /etc/fstab
+mount /boot/efi
+apt install --yes grub-efi-amd64 shim-signed
+
Notes:
+The -s 1
for mkdosfs
is only necessary for drives which present
+4 KiB logical sectors (“4Kn” drives) to meet the minimum cluster size
+(given the partition size of 512 MiB) for FAT32. It also works fine on
+drives which present 512 B sectors.
For a mirror or raidz topology, this step only installs GRUB on the +first disk. The other disk(s) will be handled later.
Optional: Remove os-prober:
+apt purge --yes os-prober
+
This avoids error messages from update-grub. os-prober is only +necessary in dual-boot configurations.
+Set a root password:
+passwd
+
Enable importing bpool
+This ensures that bpool
is always imported, regardless of whether
+/etc/zfs/zpool.cache
exists, whether it is in the cachefile or not,
+or whether zfs-import-scan.service
is enabled.
vi /etc/systemd/system/zfs-import-bpool.service
+
[Unit]
+DefaultDependencies=no
+Before=zfs-import-scan.service
+Before=zfs-import-cache.service
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStart=/sbin/zpool import -N -o cachefile=none bpool
+# Work-around to preserve zpool cache:
+ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
+ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache
+
+[Install]
+WantedBy=zfs-import.target
+
systemctl enable zfs-import-bpool.service
+
Note: For some disk configurations (NVMe?), this service may fail with an error
+indicating that the bpool
cannot be found. If this happens, add
+-d DISK-part3
(replace DISK
with the correct device path) to the
+zpool import
command.
Optional (but recommended): Mount a tmpfs to /tmp
If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+systemctl enable tmp.mount
+
Optional: Install SSH:
+apt install --yes openssh-server
+
+vi /etc/ssh/sshd_config
+# Set: PermitRootLogin yes
+
Optional: For ZFS native encryption or LUKS, configure Dropbear for remote +unlocking:
+apt install --yes --no-install-recommends dropbear-initramfs
+mkdir -p /etc/dropbear-initramfs
+
+# Optional: Convert OpenSSH server keys for Dropbear
+for type in ecdsa ed25519 rsa ; do
+ cp /etc/ssh/ssh_host_${type}_key /tmp/openssh.key
+ ssh-keygen -p -N "" -m PEM -f /tmp/openssh.key
+ dropbearconvert openssh dropbear \
+ /tmp/openssh.key \
+ /etc/dropbear-initramfs/dropbear_${type}_host_key
+done
+rm /tmp/openssh.key
+
+# Add user keys in the same format as ~/.ssh/authorized_keys
+vi /etc/dropbear-initramfs/authorized_keys
+
+# If using a static IP, set it for the initramfs environment:
+vi /etc/initramfs-tools/initramfs.conf
+# The syntax is: IP=ADDRESS::GATEWAY:MASK:HOSTNAME:NIC
+# For example:
+# IP=192.168.1.100::192.168.1.1:255.255.255.0:myhostname:ens3
+# HOSTNAME and NIC are optional.
+
+# Rebuild the initramfs (required when changing any of the above):
+update-initramfs -u -k all
+
Notes:
+Converting the server keys makes Dropbear use the same keys as OpenSSH,
+avoiding host key mismatch warnings. Currently, dropbearconvert doesn’t
+understand the new OpenSSH private key format, so the
+keys need to be converted to the old PEM format first using
+ssh-keygen
. The downside of using the same keys for both OpenSSH and
+Dropbear is that the OpenSSH keys are then available on-disk, unencrypted
+in the initramfs.
Later, to use this functionality, SSH to the system (as root) while it is
+prompting for the passphrase during the boot process. For ZFS native
+encryption, run zfsunlock
. For LUKS, run cryptroot-unlock
.
You can optionally add command="/usr/bin/zfsunlock"
or
+command="/bin/cryptroot-unlock"
in front of the authorized_keys
+line to force the unlock command. This way, the unlock command runs
+automatically and is all that can be run.
Optional (but kindly requested): Install popcon
+The popularity-contest
package reports the list of packages install
+on your system. Showing that ZFS is popular may be helpful in terms of
+long-term attention from the distro.
apt install --yes popularity-contest
+
Choose Yes at the prompt.
+Verify that the ZFS boot filesystem is recognized:
+grub-probe /boot
+
Refresh the initrd files:
+update-initramfs -c -k all
+
Note: Ignore any error messages saying ERROR: Couldn't resolve
+device
and WARNING: Couldn't determine root device
. cryptsetup
+does not support ZFS.
Workaround GRUB’s missing zpool-features support:
+vi /etc/default/grub
+# Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian"
+
Optional (but highly recommended): Make debugging GRUB easier:
+vi /etc/default/grub
+# Remove quiet from: GRUB_CMDLINE_LINUX_DEFAULT
+# Uncomment: GRUB_TERMINAL=console
+# Save and quit.
+
Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired.
+Update the boot configuration:
+update-grub
+
Note: Ignore errors from osprober
, if present.
Install the boot loader:
+For legacy (BIOS) booting, install GRUB to the MBR:
+grub-install $DISK
+
Note that you are installing GRUB to the whole disk, not a partition.
+If you are creating a mirror or raidz topology, repeat the grub-install
+command for each disk in the pool.
For UEFI booting, install GRUB to the ESP:
+grub-install --target=x86_64-efi --efi-directory=/boot/efi \
+ --bootloader-id=debian --recheck --no-floppy
+
It is not necessary to specify the disk here. If you are creating a +mirror or raidz topology, the additional disks will be handled later.
+Fix filesystem mount ordering:
+We need to activate zfs-mount-generator
. This makes systemd aware of
+the separate mountpoints, which is important for things like /var/log
+and /var/tmp
. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
feature
+of systemd automatically use After=var-tmp.mount
.
mkdir /etc/zfs/zfs-list.cache
+touch /etc/zfs/zfs-list.cache/bpool
+touch /etc/zfs/zfs-list.cache/rpool
+zed -F &
+
Verify that zed
updated the cache by making sure these are not empty:
cat /etc/zfs/zfs-list.cache/bpool
+cat /etc/zfs/zfs-list.cache/rpool
+
If either is empty, force a cache update and check again:
+zfs set canmount=on bpool/BOOT/debian
+zfs set canmount=noauto rpool/ROOT/debian
+
If they are still empty, stop zed (as below), start zed (as above) and try +again.
+Once the files have data, stop zed
:
fg
+Press Ctrl-C.
+
Fix the paths to eliminate /mnt
:
sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
+
Optional: Snapshot the initial installation:
+zfs snapshot bpool/BOOT/debian@install
+zfs snapshot rpool/ROOT/debian@install
+
In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space.
+Exit from the chroot
environment back to the LiveCD environment:
exit
+
Run these commands in the LiveCD environment to unmount all +filesystems:
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+
If this fails for rpool, mounting it on boot will fail and you will need to
+zpool import -f rpool
, then exit
in the initramfs prompt.
Reboot:
+reboot
+
Wait for the newly installed system to boot normally. Login as root.
+Create a user account:
+Replace YOUR_USERNAME
with your desired username:
username=YOUR_USERNAME
+
+zfs create rpool/home/$username
+adduser $username
+
+cp -a /etc/skel/. /home/$username
+chown -R $username:$username /home/$username
+usermod -a -G audio,cdrom,dip,floppy,netdev,plugdev,sudo,video $username
+
Mirror GRUB
+If you installed to multiple disks, install GRUB on the additional +disks.
+For legacy (BIOS) booting:
+dpkg-reconfigure grub-pc
+
Hit enter until you get to the device selection screen. +Select (using the space bar) all of the disks (not partitions) in your pool.
+For UEFI booting:
+umount /boot/efi
+
For the second and subsequent disks (increment debian-2 to -3, etc.):
+dd if=/dev/disk/by-id/scsi-SATA_disk1-part2 \
+ of=/dev/disk/by-id/scsi-SATA_disk2-part2
+efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \
+ -p 2 -L "debian-2" -l '\EFI\debian\grubx64.efi'
+
+mount /boot/efi
+
Caution: On systems with extremely high memory pressure, using a +zvol for swap can result in lockup, regardless of how much swap is still +available. There is a bug report upstream.
+Create a volume dataset (zvol) for use as a swap device:
+zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \
+ -o logbias=throughput -o sync=always \
+ -o primarycache=metadata -o secondarycache=none \
+ -o com.sun:auto-snapshot=false rpool/swap
+
You can adjust the size (the 4G
part) to your needs.
The compression algorithm is set to zle
because it is the cheapest
+available algorithm. As this guide recommends ashift=12
(4 kiB
+blocks on disk), the common case of a 4 kiB page size means that no
+compression algorithm can reduce I/O. The exception is all-zero pages,
+which are dropped by ZFS; but some form of compression has to be enabled
+to get this behavior.
Configure the swap device:
+Caution: Always use long /dev/zvol
aliases in configuration
+files. Never use a short /dev/zdX
device name.
mkswap -f /dev/zvol/rpool/swap
+echo /dev/zvol/rpool/swap none swap discard 0 0 >> /etc/fstab
+echo RESUME=none > /etc/initramfs-tools/conf.d/resume
+
The RESUME=none
is necessary to disable resuming from hibernation.
+This does not work, as the zvol is not present (because the pool has not
+yet been imported) at the time the resume script runs. If it is not
+disabled, the boot process hangs for 30 seconds waiting for the swap
+zvol to appear.
Enable the swap device:
+swapon -av
+
Upgrade the minimal system:
+apt dist-upgrade --yes
+
Install a regular set of software:
+tasksel --new-install
+
Note: This will check “Debian desktop environment” and “print server” +by default. If you want a server installation, unselect those.
+Optional: Disable log compression:
+As /var/log
is already compressed by ZFS, logrotate’s compression is
+going to burn CPU and disk I/O for (in most cases) very little gain. Also,
+if you are making snapshots of /var/log
, logrotate’s compression will
+actually waste space, as the uncompressed data will live on in the
+snapshot. You can edit the files in /etc/logrotate.d
by hand to comment
+out compress
, or use this loop (copy-and-paste highly recommended):
for file in /etc/logrotate.d/* ; do
+ if grep -Eq "(^|[^#y])compress" "$file" ; then
+ sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file"
+ fi
+done
+
Reboot:
+reboot
+
Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
Optional: Delete the snapshots of the initial installation:
+sudo zfs destroy bpool/BOOT/debian@install
+sudo zfs destroy rpool/ROOT/debian@install
+
Optional: Disable the root password:
+sudo usermod -p '*' root
+
Optional (but highly recommended): Disable root SSH logins:
+If you installed SSH earlier, revert the temporary change:
+sudo vi /etc/ssh/sshd_config
+# Remove: PermitRootLogin yes
+
+sudo systemctl restart ssh
+
Optional: Re-enable the graphical boot process:
+If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer.
+sudo vi /etc/default/grub
+# Add quiet to GRUB_CMDLINE_LINUX_DEFAULT
+# Comment out GRUB_TERMINAL=console
+# Save and quit.
+
+sudo update-grub
+
Note: Ignore errors from osprober
, if present.
Optional: For LUKS installs only, backup the LUKS header:
+sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected by +your LUKS passphrase, but you may wish to use additional encryption.
+Hint: If you created a mirror or raidz topology, repeat this for each
+LUKS volume (luks2
, etc.).
Go through Step 1: Prepare The Install Environment.
+For LUKS, first unlock the disk(s):
+apt install --yes cryptsetup
+
+cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1
+# Repeat for additional disks, if this is a mirror or raidz topology.
+
Mount everything correctly:
+zpool export -a
+zpool import -N -R /mnt rpool
+zpool import -N -R /mnt bpool
+zfs load-key -a
+zfs mount rpool/ROOT/debian
+zfs mount -a
+
If needed, you can chroot into your installed environment:
+mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+chroot /mnt /bin/bash --login
+mount /boot/efi
+mount -a
+
Do whatever you need to do to fix your system.
+When done, cleanup:
+exit
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+reboot
+
Systems that require the arcsas
blob driver should add it to the
+/etc/initramfs-tools/modules
file and run update-initramfs -c -k all
.
Upgrade or downgrade the Areca driver if something like
+RIP: 0010:[<ffffffff8101b316>] [<ffffffff8101b316>] native_read_tsc+0x6/0x20
+appears anywhere in the kernel log. ZoL is unstable on systems that emit this
+error message.
Most problem reports for this tutorial involve mpt2sas
hardware that does
+slow asynchronous drive initialization, like some IBM M1015 or OEM-branded
+cards that have been flashed to the reference LSI firmware.
The basic problem is that disks on these controllers are not visible to the +Linux kernel until after the regular system is started, and ZoL does not +hotplug pool members. See https://github.com/zfsonlinux/zfs/issues/330.
+Most LSI cards are perfectly compatible with ZoL. If your card has this
+glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X
in
+/etc/default/zfs
. The system will wait X
seconds for all drives to
+appear before importing the pool.
Set a unique serial number on each virtual disk using libvirt or qemu
+(e.g. -drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890
).
To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host:
+sudo apt install ovmf
+sudo vi /etc/libvirt/qemu.conf
+
Uncomment these lines:
+nvram = [
+ "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/OVMF/OVMF_CODE.secboot.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF32_CODE.fd:/usr/share/AAVMF/AAVMF32_VARS.fd"
+]
+
sudo systemctl restart libvirtd.service
+
Set disk.EnableUUID = "TRUE"
in the vmx file or vsphere configuration.
+Doing this ensures that /dev/disk
aliases are created in the guest.
See Debian Bullseye Root on ZFS for +new installs. This guide is no longer receiving most updates. It continues +to exist for reference for existing installs that followed it.
This HOWTO uses a whole physical disk.
Do not use these instructions for dual-booting.
Backup your data. Any existing data will be lost.
64-bit Debian GNU/Linux Buster Live CD w/ GUI (e.g. gnome iso)
Installing on a drive which presents 4 KiB logical sectors (a “4Kn” drive) +only works with UEFI booting. This not unique to ZFS. GRUB does not and +will not work on 4Kn with legacy (BIOS) booting.
Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of memory +is recommended for normal performance in basic workloads. If you wish to use +deduplication, you will need massive amounts of RAM. Enabling +deduplication is a permanent change that cannot be easily reverted.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @rlaager.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo apt install python3-pip
+
+pip3 install -r docs/requirements.txt
+
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request. Mention @rlaager.
This guide supports three different encryption options: unencrypted, ZFS +native encryption, and LUKS. With any option, all ZFS features are fully +available.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+ZFS native encryption encrypts the data and most metadata in the root
+pool. It does not encrypt dataset or snapshot names or properties. The
+boot pool is not encrypted at all, but it only contains the bootloader,
+kernel, and initrd. (Unless you put a password in /etc/fstab
, the
+initrd is unlikely to contain sensitive data.) The system cannot boot
+without the passphrase being entered at the console. Performance is
+good. As the encryption happens in ZFS, even if multiple disks (mirror
+or raidz topologies) are used, the data only has to be encrypted once.
LUKS encrypts almost everything. The only unencrypted data is the bootloader, +kernel, and initrd. The system cannot boot without the passphrase being +entered at the console. Performance is good, but LUKS sits underneath ZFS, so +if multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+Boot the Debian GNU/Linux Live CD. If prompted, login with the username
+user
and password live
. Connect your system to the Internet as
+appropriate (e.g. join your WiFi network). Open a terminal.
Setup and update the repositories:
+sudo vi /etc/apt/sources.list
+
deb http://deb.debian.org/debian buster main contrib
+deb http://deb.debian.org/debian buster-backports main contrib
+
sudo apt update
+
Optional: Install and start the OpenSSH server in the Live CD environment:
+If you have a second system, using SSH to access the target system can be +convenient:
+sudo apt install --yes openssh-server
+
+sudo systemctl restart ssh
+
Hint: You can find your IP address with
+ip addr show scope global | grep inet
. Then, from your main machine,
+connect with ssh user@IP
.
Disable automounting:
+If the disk has been used before (with partitions at the same offsets), +previous filesystems (e.g. the ESP) will automount if not disabled:
+gsettings set org.gnome.desktop.media-handling automount false
+
Become root:
+sudo -i
+
Install ZFS in the Live CD environment:
+apt install --yes debootstrap gdisk dkms dpkg-dev linux-headers-amd64
+
+apt install --yes -t buster-backports --no-install-recommends zfs-dkms
+
+modprobe zfs
+apt install --yes -t buster-backports zfsutils-linux
+
The dkms dependency is installed manually just so it comes from buster +and not buster-backports. This is not critical.
We need to get the module built and loaded before installing +zfsutils-linux, or zfs-mount.service will fail to start.
Set a variable with the disk name:
+DISK=/dev/disk/by-id/scsi-SATA_disk1
+
Always use the long /dev/disk/by-id/*
aliases with ZFS. Using the
+/dev/sd*
device nodes directly can cause sporadic import failures,
+especially on systems that have more than one storage pool.
Hints:
+ls -la /dev/disk/by-id
will list the aliases.
Are you doing this in a virtual machine? If your virtual disk is missing
+from /dev/disk/by-id
, use /dev/vda
if you are using KVM with
+virtio; otherwise, read the troubleshooting
+section.
For a mirror or raidz topology, use DISK1
, DISK2
, etc.
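For example (hypothetical device names; adjust to your hardware), set one variable
per disk and substitute them in the later commands:
DISK1=/dev/disk/by-id/scsi-SATA_disk1
+DISK2=/dev/disk/by-id/scsi-SATA_disk2
+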
When choosing a boot pool size, consider how you will use the space. A +kernel and initrd may consume around 100M. If you have multiple kernels +and take snapshots, you may find yourself low on boot pool space, +especially if you need to regenerate your initramfs images, which may be +around 85M each. Size your boot pool appropriately for your needs.
If you are re-using a disk, clear it as necessary:
+Ensure swap partitions are not in use:
+swapoff --all
+
If the disk was previously used in an MD array:
+apt install --yes mdadm
+
+# See if one or more MD arrays are active:
+cat /proc/mdstat
+# If so, stop them (replace ``md0`` as required):
+mdadm --stop /dev/md0
+
+# For an array using the whole disk:
+mdadm --zero-superblock --force $DISK
+# For an array using a partition:
+mdadm --zero-superblock --force ${DISK}-part2
+
Clear the partition table:
+sgdisk --zap-all $DISK
+
If you get a message about the kernel still using the old partition table, +reboot and start over (except that you can skip this step).
+Partition your disk(s):
+Run this if you need legacy (BIOS) booting:
+sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK
+
Run this for UEFI booting (for use now or in the future):
+sgdisk -n2:1M:+512M -t2:EF00 $DISK
+
Run this for the boot pool:
+sgdisk -n3:0:+1G -t3:BF01 $DISK
+
Choose one of the following options:
+Unencrypted or ZFS native encryption:
+sgdisk -n4:0:0 -t4:BF00 $DISK
+
LUKS:
+sgdisk -n4:0:0 -t4:8309 $DISK
+
If you are creating a mirror or raidz topology, repeat the partitioning +commands for all the disks which will be part of the pool.
+Create the boot pool:
+zpool create \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o ashift=12 -d \
+ -o feature@async_destroy=enabled \
+ -o feature@bookmarks=enabled \
+ -o feature@embedded_data=enabled \
+ -o feature@empty_bpobj=enabled \
+ -o feature@enabled_txg=enabled \
+ -o feature@extensible_dataset=enabled \
+ -o feature@filesystem_limits=enabled \
+ -o feature@hole_birth=enabled \
+ -o feature@large_blocks=enabled \
+ -o feature@lz4_compress=enabled \
+ -o feature@spacemap_histogram=enabled \
+ -o feature@zpool_checkpoint=enabled \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
+ -O mountpoint=/boot -R /mnt \
+ bpool ${DISK}-part3
+
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c.
+This step creates a separate boot pool for /boot
with the features
+limited to only those that GRUB supports, allowing the root pool to use
+any/all features. Note that GRUB opens the pool read-only, so all
+read-only compatible features are “supported” by GRUB.
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ bpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part3 \
+ /dev/disk/by-id/scsi-SATA_disk2-part3
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
The pool name is arbitrary. If changed, the new name must be used
+consistently. The bpool
convention originated in this HOWTO.
Feature Notes:
+The allocation_classes
feature should be safe to use. However, unless
+one is using it (i.e. a special
vdev), there is no point to enabling
+it. It is extremely unlikely that someone would use this feature for a
+boot pool. If one cares about speeding up the boot pool, it would make
+more sense to put the whole pool on the faster disk rather than using it
+as a special
vdev.
The project_quota
feature has been tested and is safe to use. This
+feature is extremely unlikely to matter for the boot pool.
The resilver_defer
should be safe but the boot pool is small enough
+that it is unlikely to be necessary.
The spacemap_v2
feature has been tested and is safe to use. The boot
+pool is small, so this does not matter in practice.
As a read-only compatible feature, the userobj_accounting
feature
+should be compatible in theory, but in practice, GRUB can fail with an
+“invalid dnode type” error. This feature does not matter for /boot
+anyway.
Create the root pool:
+Choose one of the following options:
+Unencrypted:
+zpool create \
+ -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
ZFS native encryption:
+zpool create \
+ -o ashift=12 \
+ -O encryption=on \
+ -O keylocation=prompt -O keyformat=passphrase \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
LUKS:
+apt install --yes cryptsetup
+
+cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK}-part4
+cryptsetup luksOpen ${DISK}-part4 luks1
+zpool create \
+ -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool /dev/mapper/luks1
+
Notes:
The use of ashift=12
is recommended here because many drives
+today have 4 KiB (or larger) physical sectors, even though they
+present 512 B logical sectors. Also, a future replacement drive may
+have 4 KiB physical sectors (in which case ashift=12
is desirable)
+or 4 KiB logical sectors (in which case ashift=12
is required). A quick way to check the sector sizes your drives
+report is sketched after these notes.
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires ACLs
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only filenames.
recordsize
is unset (leaving it at the default of 128 KiB). If you
+want to tune it (e.g. -O recordsize=1M
), see these various blog
+posts.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s documentation
+for further information.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain controller.
+Note that xattr=sa
is Linux-specific. If you move your
+xattr=sa
pool to another OpenZFS implementation besides ZFS-on-Linux,
+extended attributes will not be readable (though your data will be). If
+portability of extended attributes is important to you, omit the
+-O xattr=sa
above. Even if you do not want xattr=sa
for the whole
+pool, it is probably fine to use it for /var/log
.
Make sure to include the -part4
portion of the drive path. If you
+forget that, you are specifying the whole disk, which ZFS will then
+re-partition, and you will lose the bootloader partition(s).
ZFS native encryption now
+defaults to aes-256-gcm
.
For LUKS, the key size chosen is 512 bits. However, XTS mode requires two
+keys, so the LUKS key is split in half. Thus, -s 512
means AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup FAQ +for guidance.
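As referenced in the ashift=12 note above, you can check the sector sizes your
drives report before relying on automatic detection (a minimal sketch; the
PHY-SEC/LOG-SEC columns come from lsblk in util-linux):
lsblk -o NAME,PHY-SEC,LOG-SEC $DISK
+# A PHY-SEC of 4096 indicates 4 KiB physical sectors, i.e. ashift=12.
+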
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ rpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ /dev/disk/by-id/scsi-SATA_disk2-part4
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
When using LUKS with mirror or raidz topologies, use
+/dev/mapper/luks1
, /dev/mapper/luks2
, etc., which you will have
+to create using cryptsetup
.
The pool name is arbitrary. If changed, the new name must be used
+consistently. On systems that can automatically install to ZFS, the root
+pool is named rpool
by default.
Create filesystem datasets to act as containers:
+zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+zfs create -o canmount=off -o mountpoint=none bpool/BOOT
+
On Solaris systems, the root filesystem is cloned and the suffix is
+incremented for major system changes through pkg image-update
or
+beadm
. Similar functionality was implemented in Ubuntu with the
+zsys
tool, though its dataset layout is more complicated, and zsys
+is on life support. Even
+without such a tool, the rpool/ROOT and bpool/BOOT containers can still
+be used for manually created clones. That said, this HOWTO assumes a single
+filesystem for /boot
for simplicity.
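As a sketch of such a manually created clone (the @pre-upgrade snapshot name and
the debian-pre-upgrade dataset name are only illustrative), a simple boot
environment can be made like this:
zfs snapshot rpool/ROOT/debian@pre-upgrade
+zfs clone -o canmount=noauto -o mountpoint=/ \
+    rpool/ROOT/debian@pre-upgrade rpool/ROOT/debian-pre-upgrade
+
To boot the clone, point root=ZFS= at it on the kernel command line.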
Create filesystem datasets for the root and boot filesystems:
+zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian
+zfs mount rpool/ROOT/debian
+
+zfs create -o mountpoint=/boot bpool/BOOT/debian
+
With ZFS, it is not normally necessary to use a mount command (either
+mount
or zfs mount
). This situation is an exception because of
+canmount=noauto
.
Create datasets:
+zfs create rpool/home
+zfs create -o mountpoint=/root rpool/home/root
+chmod 700 /mnt/root
+zfs create -o canmount=off rpool/var
+zfs create -o canmount=off rpool/var/lib
+zfs create rpool/var/log
+zfs create rpool/var/spool
+
The datasets below are optional, depending on your preferences and/or +software choices.
+If you wish to exclude these from snapshots:
+zfs create -o com.sun:auto-snapshot=false rpool/var/cache
+zfs create -o com.sun:auto-snapshot=false rpool/var/tmp
+chmod 1777 /mnt/var/tmp
+
If you use /opt on this system:
+zfs create rpool/opt
+
If you use /srv on this system:
+zfs create rpool/srv
+
If you use /usr/local on this system:
+zfs create -o canmount=off rpool/usr
+zfs create rpool/usr/local
+
If this system will have games installed:
+zfs create rpool/var/games
+
If this system will store local email in /var/mail:
+zfs create rpool/var/mail
+
If this system will use Snap packages:
+zfs create rpool/var/snap
+
If you use /var/www on this system:
+zfs create rpool/var/www
+
If this system will use GNOME:
+zfs create rpool/var/lib/AccountsService
+
If this system will use Docker (which manages its own datasets & +snapshots):
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker
+
If this system will use NFS (locking):
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/nfs
+
Mount a tmpfs at /run:
+mkdir /mnt/run
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+
A tmpfs is recommended later, but if you want a separate dataset for
+/tmp
:
zfs create -o com.sun:auto-snapshot=false rpool/tmp
+chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data.
+If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for /tmp
,
+as shown above. This keeps the /tmp
data out of snapshots of your root
+filesystem. It also allows you to set a quota on rpool/tmp
, if you want
+to limit the maximum space used. Otherwise, you can use a tmpfs (RAM
+filesystem) later.
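For example, if you created rpool/tmp and want to cap it (the 4 GiB figure is
arbitrary):
zfs set quota=4G rpool/tmp
+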
Install the minimal system:
+debootstrap buster /mnt
+
The debootstrap
command leaves the new system in an unconfigured state.
+An alternative to using debootstrap
is to copy the entirety of a
+working system into the new ZFS root.
Copy in zpool.cache:
+mkdir /mnt/etc/zfs
+cp /etc/zfs/zpool.cache /mnt/etc/zfs/
+
Configure the hostname:
+Replace HOSTNAME
with the desired hostname:
hostname HOSTNAME
+hostname > /mnt/etc/hostname
+vi /mnt/etc/hosts
+
Add a line:
+127.0.1.1 HOSTNAME
+or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
Configure the network interface:
+Find the interface name:
+ip addr show
+
Adjust NAME
below to match your interface name:
vi /mnt/etc/network/interfaces.d/NAME
+
auto NAME
+iface NAME inet dhcp
+
Customize this file if the system is not a DHCP client.
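For example, a static configuration might look like this (a sketch; the addresses
below are placeholders from the documentation range, so substitute your own):
auto NAME
+iface NAME inet static
+    address 192.0.2.10/24
+    gateway 192.0.2.1
+    # dns-nameservers needs the resolvconf package; otherwise edit /etc/resolv.conf
+    dns-nameservers 192.0.2.1
+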
+Configure the package sources:
+vi /mnt/etc/apt/sources.list
+
deb http://deb.debian.org/debian buster main contrib
+deb-src http://deb.debian.org/debian buster main contrib
+
+deb http://security.debian.org/debian-security buster/updates main contrib
+deb-src http://security.debian.org/debian-security buster/updates main contrib
+
+deb http://deb.debian.org/debian buster-updates main contrib
+deb-src http://deb.debian.org/debian buster-updates main contrib
+
vi /mnt/etc/apt/sources.list.d/buster-backports.list
+
deb http://deb.debian.org/debian buster-backports main contrib
+deb-src http://deb.debian.org/debian buster-backports main contrib
+
vi /mnt/etc/apt/preferences.d/90_zfs
+
Package: src:zfs-linux
+Pin: release n=buster-backports
+Pin-Priority: 990
+
Bind the virtual filesystems from the LiveCD environment to the new
+system and chroot
into it:
mount --rbind /dev /mnt/dev
+mount --rbind /proc /mnt/proc
+mount --rbind /sys /mnt/sys
+chroot /mnt /usr/bin/env DISK=$DISK bash --login
+
Note: This is using --rbind
, not --bind
.
Configure a basic system environment:
+ln -s /proc/self/mounts /etc/mtab
+apt update
+
+apt install --yes console-setup locales
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available:
dpkg-reconfigure locales tzdata keyboard-configuration console-setup
+
Install ZFS in the chroot environment for the new system:
+apt install --yes dpkg-dev linux-headers-amd64 linux-image-amd64
+
+apt install --yes zfs-initramfs
+
+echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf
+
Note: Ignore any error messages saying ERROR: Couldn't resolve
+device
and WARNING: Couldn't determine root device
. cryptsetup does
+not support ZFS.
For LUKS installs only, setup /etc/crypttab
:
apt install --yes cryptsetup
+
+echo luks1 /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part4) \
+ none luks,discard,initramfs > /etc/crypttab
+
The use of initramfs
is a work-around because cryptsetup does not support
+ZFS.
Hint: If you are creating a mirror or raidz topology, repeat the
+/etc/crypttab
entries for luks2
, etc. adjusting for each disk.
Install GRUB
+Choose one of the following options:
+Install GRUB for legacy (BIOS) booting:
+apt install --yes grub-pc
+
Select (using the space bar) all of the disks (not partitions) in your +pool.
+Install GRUB for UEFI booting:
+apt install dosfstools
+
+mkdosfs -F 32 -s 1 -n EFI ${DISK}-part2
+mkdir /boot/efi
+echo /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part2) \
+ /boot/efi vfat defaults 0 0 >> /etc/fstab
+mount /boot/efi
+apt install --yes grub-efi-amd64 shim-signed
+
Notes:
+The -s 1
for mkdosfs
is only necessary for drives which present
+4 KiB logical sectors (“4Kn” drives) to meet the minimum cluster size
+(given the partition size of 512 MiB) for FAT32. It also works fine on
+drives which present 512 B sectors.
For a mirror or raidz topology, this step only installs GRUB on the +first disk. The other disk(s) will be handled later.
Optional: Remove os-prober:
+apt purge --yes os-prober
+
This avoids error messages from update-grub. os-prober is only +necessary in dual-boot configurations.
+Set a root password:
+passwd
+
Enable importing bpool
+This ensures that bpool
is always imported, regardless of whether
+/etc/zfs/zpool.cache
exists, whether it is in the cachefile or not,
+or whether zfs-import-scan.service
is enabled.
vi /etc/systemd/system/zfs-import-bpool.service
+
[Unit]
+DefaultDependencies=no
+Before=zfs-import-scan.service
+Before=zfs-import-cache.service
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStart=/sbin/zpool import -N -o cachefile=none bpool
+# Work-around to preserve zpool cache:
+ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
+ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache
+
+[Install]
+WantedBy=zfs-import.target
+
systemctl enable zfs-import-bpool.service
+
Optional (but recommended): Mount a tmpfs to /tmp
If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+systemctl enable tmp.mount
+
Optional: Install SSH:
+apt install --yes openssh-server
+
+vi /etc/ssh/sshd_config
+# Set: PermitRootLogin yes
+
Optional (but kindly requested): Install popcon
+The popularity-contest
package reports the list of packages install
+on your system. Showing that ZFS is popular may be helpful in terms of
+long-term attention from the distro.
apt install --yes popularity-contest
+
Choose Yes at the prompt.
+Verify that the ZFS boot filesystem is recognized:
+grub-probe /boot
+
Refresh the initrd files:
+update-initramfs -c -k all
+
Note: Ignore any error messages saying ERROR: Couldn't resolve
+device
and WARNING: Couldn't determine root device
. cryptsetup
+does not support ZFS.
Workaround GRUB’s missing zpool-features support:
+vi /etc/default/grub
+# Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian"
+
Optional (but highly recommended): Make debugging GRUB easier:
+vi /etc/default/grub
+# Remove quiet from: GRUB_CMDLINE_LINUX_DEFAULT
+# Uncomment: GRUB_TERMINAL=console
+# Save and quit.
+
Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired.
+Update the boot configuration:
+update-grub
+
Note: Ignore errors from osprober
, if present.
Install the boot loader:
+For legacy (BIOS) booting, install GRUB to the MBR:
+grub-install $DISK
+
Note that you are installing GRUB to the whole disk, not a partition.
+If you are creating a mirror or raidz topology, repeat the grub-install
+command for each disk in the pool.
For UEFI booting, install GRUB to the ESP:
+grub-install --target=x86_64-efi --efi-directory=/boot/efi \
+ --bootloader-id=debian --recheck --no-floppy
+
It is not necessary to specify the disk here. If you are creating a +mirror or raidz topology, the additional disks will be handled later.
+Fix filesystem mount ordering:
+We need to activate zfs-mount-generator
. This makes systemd aware of
+the separate mountpoints, which is important for things like /var/log
+and /var/tmp
. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
feature
+of systemd automatically use After=var-tmp.mount
.
mkdir /etc/zfs/zfs-list.cache
+touch /etc/zfs/zfs-list.cache/bpool
+touch /etc/zfs/zfs-list.cache/rpool
+zed -F &
+
Verify that zed
updated the cache by making sure these are not empty:
cat /etc/zfs/zfs-list.cache/bpool
+cat /etc/zfs/zfs-list.cache/rpool
+
If either is empty, force a cache update and check again:
+zfs set canmount=on bpool/BOOT/debian
+zfs set canmount=noauto rpool/ROOT/debian
+
If they are still empty, stop zed (as below), start zed (as above) and try +again.
+Once the files have data, stop zed
:
fg
+Press Ctrl-C.
+
Fix the paths to eliminate /mnt
:
sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
+
Optional: Snapshot the initial installation:
+zfs snapshot bpool/BOOT/debian@install
+zfs snapshot rpool/ROOT/debian@install
+
In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space.
+Exit from the chroot
environment back to the LiveCD environment:
exit
+
Run these commands in the LiveCD environment to unmount all +filesystems:
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+
Reboot:
+reboot
+
Wait for the newly installed system to boot normally. Login as root.
+Create a user account:
+Replace YOUR_USERNAME
with your desired username:
username=YOUR_USERNAME
+
+zfs create rpool/home/$username
+adduser $username
+
+cp -a /etc/skel/. /home/$username
+chown -R $username:$username /home/$username
+usermod -a -G audio,cdrom,dip,floppy,netdev,plugdev,sudo,video $username
+
Mirror GRUB
+If you installed to multiple disks, install GRUB on the additional +disks.
+For legacy (BIOS) booting:
+dpkg-reconfigure grub-pc
+
Hit enter until you get to the device selection screen. +Select (using the space bar) all of the disks (not partitions) in your pool.
+For UEFI booting:
+umount /boot/efi
+
For the second and subsequent disks (increment debian-2 to -3, etc.):
+dd if=/dev/disk/by-id/scsi-SATA_disk1-part2 \
+ of=/dev/disk/by-id/scsi-SATA_disk2-part2
+efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \
+ -p 2 -L "debian-2" -l '\EFI\debian\grubx64.efi'
+
+mount /boot/efi
+
Caution: On systems with extremely high memory pressure, using a +zvol for swap can result in lockup, regardless of how much swap is still +available. There is a bug report upstream.
+Create a volume dataset (zvol) for use as a swap device:
+zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \
+ -o logbias=throughput -o sync=always \
+ -o primarycache=metadata -o secondarycache=none \
+ -o com.sun:auto-snapshot=false rpool/swap
+
You can adjust the size (the 4G
part) to your needs.
The compression algorithm is set to zle
because it is the cheapest
+available algorithm. As this guide recommends ashift=12
(4 kiB
+blocks on disk), the common case of a 4 kiB page size means that no
+compression algorithm can reduce I/O. The exception is all-zero pages,
+which are dropped by ZFS; but some form of compression has to be enabled
+to get this behavior.
Configure the swap device:
+Caution: Always use long /dev/zvol
aliases in configuration
+files. Never use a short /dev/zdX
device name.
mkswap -f /dev/zvol/rpool/swap
+echo /dev/zvol/rpool/swap none swap discard 0 0 >> /etc/fstab
+echo RESUME=none > /etc/initramfs-tools/conf.d/resume
+
The RESUME=none
is necessary to disable resuming from hibernation.
+This does not work, as the zvol is not present (because the pool has not
+yet been imported) at the time the resume script runs. If it is not
+disabled, the boot process hangs for 30 seconds waiting for the swap
+zvol to appear.
Enable the swap device:
+swapon -av
+
Upgrade the minimal system:
+apt dist-upgrade --yes
+
Install a regular set of software:
+tasksel --new-install
+
Note: This will check “Debian desktop environment” and “print server” +by default. If you want a server installation, unselect those.
+Optional: Disable log compression:
+As /var/log
is already compressed by ZFS, logrotate’s compression is
+going to burn CPU and disk I/O for (in most cases) very little gain. Also,
+if you are making snapshots of /var/log
, logrotate’s compression will
+actually waste space, as the uncompressed data will live on in the
+snapshot. You can edit the files in /etc/logrotate.d
by hand to comment
+out compress
, or use this loop (copy-and-paste highly recommended):
for file in /etc/logrotate.d/* ; do
+ if grep -Eq "(^|[^#y])compress" "$file" ; then
+ sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file"
+ fi
+done
+
Reboot:
+reboot
+
Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
Optional: Delete the snapshots of the initial installation:
+sudo zfs destroy bpool/BOOT/debian@install
+sudo zfs destroy rpool/ROOT/debian@install
+
Optional: Disable the root password:
+sudo usermod -p '*' root
+
Optional (but highly recommended): Disable root SSH logins:
+If you installed SSH earlier, revert the temporary change:
+sudo vi /etc/ssh/sshd_config
+# Remove: PermitRootLogin yes
+
+sudo systemctl restart ssh
+
Optional: Re-enable the graphical boot process:
+If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer.
+sudo vi /etc/default/grub
+# Add quiet to GRUB_CMDLINE_LINUX_DEFAULT
+# Comment out GRUB_TERMINAL=console
+# Save and quit.
+
+sudo update-grub
+
Note: Ignore errors from osprober
, if present.
Optional: For LUKS installs only, backup the LUKS header:
+sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected by +your LUKS passphrase, but you may wish to use additional encryption.
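For example, one option (a sketch; any encryption tool you trust works) is to wrap
the header backup with symmetric GPG before uploading it:
gpg --symmetric luks1-header.dat
+# Creates luks1-header.dat.gpg; keep the GPG passphrase somewhere separate.
+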
+Hint: If you created a mirror or raidz topology, repeat this for each
+LUKS volume (luks2
, etc.).
Go through Step 1: Prepare The Install Environment.
+For LUKS, first unlock the disk(s):
+apt install --yes cryptsetup
+
+cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1
+# Repeat for additional disks, if this is a mirror or raidz topology.
+
Mount everything correctly:
+zpool export -a
+zpool import -N -R /mnt rpool
+zpool import -N -R /mnt bpool
+zfs load-key -a
+zfs mount rpool/ROOT/debian
+zfs mount -a
+
If needed, you can chroot into your installed environment:
+mount --rbind /dev /mnt/dev
+mount --rbind /proc /mnt/proc
+mount --rbind /sys /mnt/sys
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+chroot /mnt /bin/bash --login
+mount /boot/efi
+mount -a
+
Do whatever you need to do to fix your system.
+When done, cleanup:
+exit
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+reboot
+
Systems that require the arcsas
blob driver should add it to the
+/etc/initramfs-tools/modules
file and run update-initramfs -c -k all
.
Upgrade or downgrade the Areca driver if something like
+RIP: 0010:[<ffffffff8101b316>] [<ffffffff8101b316>] native_read_tsc+0x6/0x20
+appears anywhere in kernel log. ZoL is unstable on systems that emit this
+error message.
Most problem reports for this tutorial involve mpt2sas
hardware that does
+slow asynchronous drive initialization, like some IBM M1015 or OEM-branded
+cards that have been flashed to the reference LSI firmware.
The basic problem is that disks on these controllers are not visible to the +Linux kernel until after the regular system is started, and ZoL does not +hotplug pool members. See https://github.com/zfsonlinux/zfs/issues/330.
+Most LSI cards are perfectly compatible with ZoL. If your card has this
+glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X
in
+/etc/default/zfs
. The system will wait X
seconds for all drives to
+appear before importing the pool.
Set a unique serial number on each virtual disk using libvirt or qemu
+(e.g. -drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890
).
To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host:
+sudo apt install ovmf
+sudo vi /etc/libvirt/qemu.conf
+
Uncomment these lines:
+nvram = [
+ "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/OVMF/OVMF_CODE.secboot.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF32_CODE.fd:/usr/share/AAVMF/AAVMF32_VARS.fd"
+]
+
sudo systemctl restart libvirtd.service
+
Set disk.EnableUUID = "TRUE"
in the vmx file or vsphere configuration.
+Doing this ensures that /dev/disk
aliases are created in the guest.
rollback=<on|yes|1> Do a rollback of the specified snapshot.
zfs_debug=<on|yes|1> Debug the initrd script
zfs_force=<on|yes|1> Force importing the pool. Should not be +necessary.
zfs=<off|no|0> Don’t try to import ANY pool, mount ANY filesystem or +even load the module.
rpool=<pool> Use this pool for root pool.
bootfs=<pool>/<dataset> Use this dataset for root filesystem.
root=<pool>/<dataset> Use this dataset for root filesystem.
root=ZFS=<pool>/<dataset> Use this dataset for root filesystem.
root=zfs:<pool>/<dataset> Use this dataset for root filesystem.
root=zfs:AUTO Try to detect both pool and rootfs
In all these cases, <dataset> could also be <dataset>@<snapshot>.
+The reason there are so many supported boot options to get the root +filesystem is that there are a lot of different ways to boot ZFS out +there, and I wanted to make sure I supported them all.
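For example, extra parameters from the list above can simply be appended to the
kernel command line (a sketch using the same convention as the GRUB configuration
earlier in this guide):
vi /etc/default/grub
+# Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian zfs_debug=1"
+# Then run update-grub.
+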
+The initrd will, if the variable USE_DISK_BY_ID is set in the file +/etc/default/zfs, import using the /dev/disk/by-* links. It will try +to import in this order:
+/dev/disk/by-vdev
/dev/disk/by-*
/dev
If all of these imports fail (or if USE_DISK_BY_ID is unset), it will +then try to import using the cache file.
+If that ALSO fails, it will try one more time, without any -d or -c +options.
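Roughly speaking, the sequence corresponds to the following manual commands (a
sketch only, not the literal initrd code; by-id stands in for the various
/dev/disk/by-* directories, and rpool is the example pool name):
# 1. Try the by-* directories:
+zpool import -N -d /dev/disk/by-vdev rpool
+zpool import -N -d /dev/disk/by-id rpool
+zpool import -N -d /dev rpool
+# 2. Fall back to the cache file:
+zpool import -N -c /etc/zfs/zpool.cache rpool
+# 3. Last attempt, with no -d or -c:
+zpool import -N rpool
+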
+Enter the snapshot for the root= parameter like in this example:
+linux /BOOT/debian@/boot/vmlinuz-5.10.0-9-amd64 root=ZFS=rpool/ROOT/debian@some_snapshot ro
+
+This will clone the snapshot rpool/ROOT/debian@some_snapshot into the +filesystem rpool/ROOT/debian_some_snapshot and use that as root +filesystem. The original filesystem and snapshot are left alone in this +case.
+BEWARE that it will first blindly destroy the +rpool/ROOT/debian_some_snapshot filesystem before trying to clone the +snapshot into it again. So if you’ve booted from the same snapshot +previously and made some changes in that root filesystem, they will be +undone by the destruction of the filesystem.
+From version 0.6.4-1-3 it is now also possible to specify rollback=1 to +do a rollback of the snapshot instead of cloning it. BEWARE that +this will destroy all snapshots done after the specified snapshot!
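For example, reusing the GRUB entry shown above, a rollback boot would look like:
linux /BOOT/debian@/boot/vmlinuz-5.10.0-9-amd64 root=ZFS=rpool/ROOT/debian@some_snapshot rollback=1 ro
+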
+From version 0.6.4-1-3 it is now also possible to specify a NULL +snapshot name (such as root=rpool/ROOT/debian@) and if so, the initrd +script will discover all snapshots below that filesystem (sans the at), +and output a list of snapshots for the user to choose from.
+Although there is currently no support for native encryption in ZFS On +Linux, there is a patch floating around ‘out there’ and the initrd +supports loading keys and unlocking such encrypted filesystems.
+If there are separate filesystems (for example a separate dataset for +/usr), the snapshot boot code will try to find the snapshot under each +filesystem and clone (or roll back) them.
+Example:
+rpool/ROOT/debian@some_snapshot
+rpool/ROOT/debian/usr@some_snapshot
+
These will create the following filesystems respectively (if not doing a +rollback):
+rpool/ROOT/debian_some_snapshot
+rpool/ROOT/debian/usr_some_snapshot
+
The initrd code will use the mountpoint option (if any) in the original +(without the snapshot part) dataset to find where it should mount the +dataset. Or it will use the name of the dataset below the root +filesystem (rpool/ROOT/debian in this example) for the mount point.
+See Debian Buster Root on ZFS for new +installs.
This HOWTO uses a whole physical disk.
Do not use these instructions for dual-booting.
Backup your data. Any existing data will be lost.
Installing on a drive which presents 4KiB logical sectors (a “4Kn” +drive) only works with UEFI booting. This is not unique to ZFS. GRUB +does not and will not work on 4Kn with legacy (BIOS) +booting.
Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of +memory is recommended for normal performance in basic workloads. If you +wish to use deduplication, you will need massive amounts of +RAM. Enabling +deduplication is a permanent change that cannot be easily reverted.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @rlaager.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo apt install python3-pip
+
+pip3 install -r docs/requirements.txt
+
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request. Mention @rlaager.
This guide supports two different encryption options: unencrypted and +LUKS (full-disk encryption). ZFS native encryption has not yet been +released. With either option, all ZFS features are fully available.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+LUKS encrypts almost everything: the OS, swap, home directories, and +anything else. The only unencrypted data is the bootloader, kernel, and +initrd. The system cannot boot without the passphrase being entered at +the console. Performance is good, but LUKS sits underneath ZFS, so if +multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+1.1 Boot the Debian GNU/Linux Live CD. If prompted, login with the
+username user
and password live
. Connect your system to the
+Internet as appropriate (e.g. join your WiFi network).
1.2 Optional: Install and start the OpenSSH server in the Live CD +environment:
+If you have a second system, using SSH to access the target system can +be convenient.
+$ sudo apt update
+$ sudo apt install --yes openssh-server
+$ sudo systemctl restart ssh
+
Hint: You can find your IP address with
+ip addr show scope global | grep inet
. Then, from your main machine,
+connect with ssh user@IP
.
1.3 Become root:
+$ sudo -i
+
1.4 Setup and update the repositories:
+# echo deb http://deb.debian.org/debian stretch contrib >> /etc/apt/sources.list
+# echo deb http://deb.debian.org/debian stretch-backports main contrib >> /etc/apt/sources.list
+# apt update
+
1.5 Install ZFS in the Live CD environment:
+# apt install --yes debootstrap gdisk dkms dpkg-dev linux-headers-amd64
+# apt install --yes -t stretch-backports zfs-dkms
+# modprobe zfs
+
The dkms dependency is installed manually just so it comes from +stretch and not stretch-backports. This is not critical.
2.1 If you are re-using a disk, clear it as necessary:
+If the disk was previously used in an MD array, zero the superblock:
+# apt install --yes mdadm
+# mdadm --zero-superblock --force /dev/disk/by-id/scsi-SATA_disk1
+
+Clear the partition table:
+# sgdisk --zap-all /dev/disk/by-id/scsi-SATA_disk1
+
2.2 Partition your disk(s):
+Run this if you need legacy (BIOS) booting:
+# sgdisk -a1 -n1:24K:+1000K -t1:EF02 /dev/disk/by-id/scsi-SATA_disk1
+
+Run this for UEFI booting (for use now or in the future):
+# sgdisk -n2:1M:+512M -t2:EF00 /dev/disk/by-id/scsi-SATA_disk1
+
+Run this for the boot pool:
+# sgdisk -n3:0:+1G -t3:BF01 /dev/disk/by-id/scsi-SATA_disk1
+
Choose one of the following options:
+2.2a Unencrypted:
+# sgdisk -n4:0:0 -t4:BF01 /dev/disk/by-id/scsi-SATA_disk1
+
2.2b LUKS:
+# sgdisk -n4:0:0 -t4:8300 /dev/disk/by-id/scsi-SATA_disk1
+
Always use the long /dev/disk/by-id/*
aliases with ZFS. Using the
+/dev/sd*
device nodes directly can cause sporadic import failures,
+especially on systems that have more than one storage pool.
Hints:
+ls -la /dev/disk/by-id
will list the aliases.
Are you doing this in a virtual machine? If your virtual disk is
+missing from /dev/disk/by-id
, use /dev/vda
if you are using
+KVM with virtio; otherwise, read the
+troubleshooting section.
If you are creating a mirror or raidz topology, repeat the +partitioning commands for all the disks which will be part of the +pool.
2.3 Create the boot pool:
+# zpool create -o ashift=12 -d \
+ -o feature@async_destroy=enabled \
+ -o feature@bookmarks=enabled \
+ -o feature@embedded_data=enabled \
+ -o feature@empty_bpobj=enabled \
+ -o feature@enabled_txg=enabled \
+ -o feature@extensible_dataset=enabled \
+ -o feature@filesystem_limits=enabled \
+ -o feature@hole_birth=enabled \
+ -o feature@large_blocks=enabled \
+ -o feature@lz4_compress=enabled \
+ -o feature@spacemap_histogram=enabled \
+ -o feature@userobj_accounting=enabled \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 -O devices=off \
+ -O normalization=formD -O relatime=on -O xattr=sa \
+ -O mountpoint=/ -R /mnt \
+ bpool /dev/disk/by-id/scsi-SATA_disk1-part3
+
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See
+spa_feature_names
in
+grub-core/fs/zfs/zfs.c.
+This step creates a separate boot pool for /boot
with the features
+limited to only those that GRUB supports, allowing the root pool to use
+any/all features. Note that GRUB opens the pool read-only, so all
+read-only compatible features are “supported” by GRUB.
Hints:
+If you are creating a mirror or raidz topology, create the pool using
+zpool create ... bpool mirror /dev/disk/by-id/scsi-SATA_disk1-part3 /dev/disk/by-id/scsi-SATA_disk2-part3
+(or replace mirror
with raidz
, raidz2
, or raidz3
and
+list the partitions from additional disks).
The pool name is arbitrary. If changed, the new name must be used
+consistently. The bpool
convention originated in this HOWTO.
2.4 Create the root pool:
+Choose one of the following options:
+2.4a Unencrypted:
+# zpool create -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \
+ -O mountpoint=/ -R /mnt \
+ rpool /dev/disk/by-id/scsi-SATA_disk1-part4
+
2.4b LUKS:
+# apt install --yes cryptsetup
+# cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 \
+ /dev/disk/by-id/scsi-SATA_disk1-part4
+# cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1
+# zpool create -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \
+ -O mountpoint=/ -R /mnt \
+ rpool /dev/mapper/luks1
+
The use of ashift=12
is recommended here because many drives
+today have 4KiB (or larger) physical sectors, even though they
+present 512B logical sectors. Also, a future replacement drive may
+have 4KiB physical sectors (in which case ashift=12
is desirable)
+or 4KiB logical sectors (in which case ashift=12
is required).
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires
+ACLs
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only
+filenames.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s
+documentation
+for further information.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI
+applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain
+controller.
+Note that `xattr=sa
is
+Linux-specific. <https://openzfs.org/wiki/Platform_code_differences>`__
+If you move your xattr=sa
pool to another OpenZFS implementation
+besides ZFS-on-Linux, extended attributes will not be readable
+(though your data will be). If portability of extended attributes is
+important to you, omit the -O xattr=sa
above. Even if you do not
+want xattr=sa
for the whole pool, it is probably fine to use it
+for /var/log
.
Make sure to include the -part4
portion of the drive path. If you
+forget that, you are specifying the whole disk, which ZFS will then
+re-partition, and you will lose the bootloader partition(s).
For LUKS, the key size chosen is 512 bits. However, XTS mode requires
+two keys, so the LUKS key is split in half. Thus, -s 512
means
+AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup +FAQ +for guidance.
Hints:
+If you are creating a mirror or raidz topology, create the pool using
+zpool create ... rpool mirror /dev/disk/by-id/scsi-SATA_disk1-part4 /dev/disk/by-id/scsi-SATA_disk2-part4
+(or replace mirror
with raidz
, raidz2
, or raidz3
and
+list the partitions from additional disks). For LUKS, use
+/dev/mapper/luks1
, /dev/mapper/luks2
, etc., which you will
+have to create using cryptsetup
.
The pool name is arbitrary. If changed, the new name must be used
+consistently. On systems that can automatically install to ZFS, the
+root pool is named rpool
by default.
3.1 Create filesystem datasets to act as containers:
+# zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+# zfs create -o canmount=off -o mountpoint=none bpool/BOOT
+
On Solaris systems, the root filesystem is cloned and the suffix is
+incremented for major system changes through pkg image-update
or
+beadm
. Similar functionality for APT is possible but currently
+unimplemented. Even without such a tool, the rpool/ROOT and bpool/BOOT
+containers can still be used for manually created clones.
3.2 Create filesystem datasets for the root and boot filesystems:
+# zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian
+# zfs mount rpool/ROOT/debian
+
+# zfs create -o canmount=noauto -o mountpoint=/boot bpool/BOOT/debian
+# zfs mount bpool/BOOT/debian
+
With ZFS, it is not normally necessary to use a mount command (either
+mount
or zfs mount
). This situation is an exception because of
+canmount=noauto
.
3.3 Create datasets:
+# zfs create rpool/home
+# zfs create -o mountpoint=/root rpool/home/root
+# zfs create -o canmount=off rpool/var
+# zfs create -o canmount=off rpool/var/lib
+# zfs create rpool/var/log
+# zfs create rpool/var/spool
+
+The datasets below are optional, depending on your preferences and/or
+software choices:
+
+If you wish to exclude these from snapshots:
+# zfs create -o com.sun:auto-snapshot=false rpool/var/cache
+# zfs create -o com.sun:auto-snapshot=false rpool/var/tmp
+# chmod 1777 /mnt/var/tmp
+
+If you use /opt on this system:
+# zfs create rpool/opt
+
+If you use /srv on this system:
+# zfs create rpool/srv
+
+If you use /usr/local on this system:
+# zfs create -o canmount=off rpool/usr
+# zfs create rpool/usr/local
+
+If this system will have games installed:
+# zfs create rpool/var/games
+
+If this system will store local email in /var/mail:
+# zfs create rpool/var/mail
+
+If this system will use Snap packages:
+# zfs create rpool/var/snap
+
+If you use /var/www on this system:
+# zfs create rpool/var/www
+
+If this system will use GNOME:
+# zfs create rpool/var/lib/AccountsService
+
+If this system will use Docker (which manages its own datasets & snapshots):
+# zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker
+
+If this system will use NFS (locking):
+# zfs create -o com.sun:auto-snapshot=false rpool/var/lib/nfs
+
+A tmpfs is recommended later, but if you want a separate dataset for /tmp:
+# zfs create -o com.sun:auto-snapshot=false rpool/tmp
+# chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user
+data. This allows the root filesystem to be rolled back without rolling
+back user data such as logs (in /var/log
). This will be especially
+important if/when a beadm
or similar utility is integrated. The
+com.sun.auto-snapshot
setting is used by some ZFS snapshot utilities
+to exclude transient data.
If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for
+/tmp
, as shown above. This keeps the /tmp
data out of snapshots
+of your root filesystem. It also allows you to set a quota on
+rpool/tmp
, if you want to limit the maximum space used. Otherwise,
+you can use a tmpfs (RAM filesystem) later.
3.4 Install the minimal system:
+# debootstrap stretch /mnt
+# zfs set devices=off rpool
+
The debootstrap
command leaves the new system in an unconfigured
+state. An alternative to using debootstrap
is to copy the entirety
+of a working system into the new ZFS root.
4.1 Configure the hostname (change HOSTNAME
to the desired
+hostname).
# echo HOSTNAME > /mnt/etc/hostname
+
+# vi /mnt/etc/hosts
+Add a line:
+127.0.1.1 HOSTNAME
+or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
4.2 Configure the network interface:
+Find the interface name:
+# ip addr show
+
+# vi /mnt/etc/network/interfaces.d/NAME
+auto NAME
+iface NAME inet dhcp
+
Customize this file if the system is not a DHCP client.
+4.3 Configure the package sources:
+# vi /mnt/etc/apt/sources.list
+deb http://deb.debian.org/debian stretch main contrib
+deb-src http://deb.debian.org/debian stretch main contrib
+deb http://security.debian.org/debian-security stretch/updates main contrib
+deb-src http://security.debian.org/debian-security stretch/updates main contrib
+deb http://deb.debian.org/debian stretch-updates main contrib
+deb-src http://deb.debian.org/debian stretch-updates main contrib
+
+# vi /mnt/etc/apt/sources.list.d/stretch-backports.list
+deb http://deb.debian.org/debian stretch-backports main contrib
+deb-src http://deb.debian.org/debian stretch-backports main contrib
+
+# vi /mnt/etc/apt/preferences.d/90_zfs
+Package: src:zfs-linux
+Pin: release n=stretch-backports
+Pin-Priority: 990
+
4.4 Bind the virtual filesystems from the LiveCD environment to the new
+system and chroot
into it:
# mount --rbind /dev /mnt/dev
+# mount --rbind /proc /mnt/proc
+# mount --rbind /sys /mnt/sys
+# chroot /mnt /bin/bash --login
+
Note: This is using --rbind
, not --bind
.
4.5 Configure a basic system environment:
+# ln -s /proc/self/mounts /etc/mtab
+# apt update
+
+# apt install --yes locales
+# dpkg-reconfigure locales
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available.
# dpkg-reconfigure tzdata
+
4.6 Install ZFS in the chroot environment for the new system:
+# apt install --yes dpkg-dev linux-headers-amd64 linux-image-amd64
+# apt install --yes zfs-initramfs
+
4.7 For LUKS installs only, setup crypttab:
+# apt install --yes cryptsetup
+
+# echo luks1 UUID=$(blkid -s UUID -o value \
+ /dev/disk/by-id/scsi-SATA_disk1-part4) none \
+ luks,discard,initramfs > /etc/crypttab
+
The use of initramfs
is a work-around because cryptsetup does not
+support
+ZFS.
Hint: If you are creating a mirror or raidz topology, repeat the
+/etc/crypttab
entries for luks2
, etc. adjusting for each disk.
4.8 Install GRUB
+Choose one of the following options:
+4.8a Install GRUB for legacy (BIOS) booting
+# apt install --yes grub-pc
+
Install GRUB to the disk(s), not the partition(s).
+4.8b Install GRUB for UEFI booting
+# apt install dosfstools
+# mkdosfs -F 32 -s 1 -n EFI /dev/disk/by-id/scsi-SATA_disk1-part2
+# mkdir /boot/efi
+# echo PARTUUID=$(blkid -s PARTUUID -o value \
+ /dev/disk/by-id/scsi-SATA_disk1-part2) \
+ /boot/efi vfat nofail,x-systemd.device-timeout=1 0 1 >> /etc/fstab
+# mount /boot/efi
+# apt install --yes grub-efi-amd64 shim
+
The -s 1
for mkdosfs
is only necessary for drives which
+present 4 KiB logical sectors (“4Kn” drives) to meet the minimum
+cluster size (given the partition size of 512 MiB) for FAT32. It also
+works fine on drives which present 512 B sectors.
Note: If you are creating a mirror or raidz topology, this step only +installs GRUB on the first disk. The other disk(s) will be handled +later.
+4.9 Set a root password
+# passwd
+
4.10 Enable importing bpool
+This ensures that bpool
is always imported, regardless of whether
+/etc/zfs/zpool.cache
exists, whether it is in the cachefile or not,
+or whether zfs-import-scan.service
is enabled.
# vi /etc/systemd/system/zfs-import-bpool.service
+[Unit]
+DefaultDependencies=no
+Before=zfs-import-scan.service
+Before=zfs-import-cache.service
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStart=/sbin/zpool import -N -o cachefile=none bpool
+
+[Install]
+WantedBy=zfs-import.target
+
+# systemctl enable zfs-import-bpool.service
+
4.11 Optional (but recommended): Mount a tmpfs to /tmp
+If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
# cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+# systemctl enable tmp.mount
+
4.12 Optional (but kindly requested): Install popcon
+The popularity-contest
package reports the list of packages install
+on your system. Showing that ZFS is popular may be helpful in terms of
+long-term attention from the distro.
# apt install --yes popularity-contest
+
Choose Yes at the prompt.
+5.1 Verify that the ZFS boot filesystem is recognized:
+# grub-probe /boot
+zfs
+
5.2 Refresh the initrd files:
+# update-initramfs -u -k all
+update-initramfs: Generating /boot/initrd.img-4.9.0-8-amd64
+
Note: When using LUKS, this will print “WARNING could not determine +root device from /etc/fstab”. This is because cryptsetup does not +support +ZFS.
+5.3 Workaround GRUB’s missing zpool-features support:
+# vi /etc/default/grub
+Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian"
+
5.4 Optional (but highly recommended): Make debugging GRUB easier:
+# vi /etc/default/grub
+Remove quiet from: GRUB_CMDLINE_LINUX_DEFAULT
+Uncomment: GRUB_TERMINAL=console
+Save and quit.
+
Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired.
+5.5 Update the boot configuration:
+# update-grub
+Generating grub configuration file ...
+Found linux image: /boot/vmlinuz-4.9.0-8-amd64
+Found initrd image: /boot/initrd.img-4.9.0-8-amd64
+done
+
Note: Ignore errors from osprober
, if present.
5.6 Install the boot loader
+5.6a For legacy (BIOS) booting, install GRUB to the MBR:
+# grub-install /dev/disk/by-id/scsi-SATA_disk1
+Installing for i386-pc platform.
+Installation finished. No error reported.
+
Do not reboot the computer until you get exactly that result message. +Note that you are installing GRUB to the whole disk, not a partition.
+If you are creating a mirror or raidz topology, repeat the
+grub-install
command for each disk in the pool.
5.6b For UEFI booting, install GRUB:
+# grub-install --target=x86_64-efi --efi-directory=/boot/efi \
+ --bootloader-id=debian --recheck --no-floppy
+
5.7 Verify that the ZFS module is installed:
+# ls /boot/grub/*/zfs.mod
+
5.8 Fix filesystem mount ordering
+Until ZFS gains a systemd mount
+generator, there are
+races between mounting filesystems and starting certain daemons. In
+practice, the issues (e.g.
+#5754) seem to be
+with certain filesystems in /var
, specifically /var/log
and
+/var/tmp
. Setting these to use legacy
mounting, and listing them
+in /etc/fstab
makes systemd aware that these are separate
+mountpoints. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
+feature of systemd automatically use After=var-tmp.mount
.
Until there is support for mounting /boot
in the initramfs, we also
+need to mount that, because it was marked canmount=noauto
. Also,
+with UEFI, we need to ensure it is mounted before its child filesystem
+/boot/efi
.
rpool
is guaranteed to be imported by the initramfs, so there is no
+point in adding x-systemd.requires=zfs-import.target
to those
+filesystems.
For UEFI booting, unmount /boot/efi first:
+# umount /boot/efi
+
+Everything else applies to both BIOS and UEFI booting:
+
+# zfs set mountpoint=legacy bpool/BOOT/debian
+# echo bpool/BOOT/debian /boot zfs \
+ nodev,relatime,x-systemd.requires=zfs-import-bpool.service 0 0 >> /etc/fstab
+
+# zfs set mountpoint=legacy rpool/var/log
+# echo rpool/var/log /var/log zfs nodev,relatime 0 0 >> /etc/fstab
+
+# zfs set mountpoint=legacy rpool/var/spool
+# echo rpool/var/spool /var/spool zfs nodev,relatime 0 0 >> /etc/fstab
+
+If you created a /var/tmp dataset:
+# zfs set mountpoint=legacy rpool/var/tmp
+# echo rpool/var/tmp /var/tmp zfs nodev,relatime 0 0 >> /etc/fstab
+
+If you created a /tmp dataset:
+# zfs set mountpoint=legacy rpool/tmp
+# echo rpool/tmp /tmp zfs nodev,relatime 0 0 >> /etc/fstab
+
6.1 Snapshot the initial installation:
+# zfs snapshot bpool/BOOT/debian@install
+# zfs snapshot rpool/ROOT/debian@install
+
In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space.
+6.2 Exit from the chroot
environment back to the LiveCD environment:
# exit
+
6.3 Run these commands in the LiveCD environment to unmount all +filesystems:
+# mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {}
+# zpool export -a
+
6.4 Reboot:
+# reboot
+
6.5 Wait for the newly installed system to boot normally. Login as root.
+6.6 Create a user account:
+# zfs create rpool/home/YOURUSERNAME
+# adduser YOURUSERNAME
+# cp -a /etc/skel/.[!.]* /home/YOURUSERNAME
+# chown -R YOURUSERNAME:YOURUSERNAME /home/YOURUSERNAME
+
6.7 Add your user account to the default set of groups for an +administrator:
+# usermod -a -G audio,cdrom,dip,floppy,netdev,plugdev,sudo,video YOURUSERNAME
+
6.8 Mirror GRUB
+If you installed to multiple disks, install GRUB on the additional +disks:
+6.8a For legacy (BIOS) booting:
+# dpkg-reconfigure grub-pc
+Hit enter until you get to the device selection screen.
+Select (using the space bar) all of the disks (not partitions) in your pool.
+
6.8b UEFI
+# umount /boot/efi
+
+For the second and subsequent disks (increment debian-2 to -3, etc.):
+# dd if=/dev/disk/by-id/scsi-SATA_disk1-part2 \
+ of=/dev/disk/by-id/scsi-SATA_disk2-part2
+# efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \
+ -p 2 -L "debian-2" -l '\EFI\debian\grubx64.efi'
+
+# mount /boot/efi
+
Caution: On systems with extremely high memory pressure, using a +zvol for swap can result in lockup, regardless of how much swap is still +available. This issue is currently being investigated in: +https://github.com/zfsonlinux/zfs/issues/7734
+7.1 Create a volume dataset (zvol) for use as a swap device:
+# zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \
+ -o logbias=throughput -o sync=always \
+ -o primarycache=metadata -o secondarycache=none \
+ -o com.sun:auto-snapshot=false rpool/swap
+
You can adjust the size (the 4G part) to your needs.
The compression algorithm is set to zle because it is the cheapest
+available algorithm. As this guide recommends ashift=12 (4 KiB blocks
+on disk), the common case of a 4 KiB page size means that no
+compression algorithm can reduce I/O. The exception is all-zero pages,
+which are dropped by ZFS; but some form of compression has to be
+enabled to get this behavior.
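+If you are curious, you can inspect the resulting volume's properties, including the (small) effect of compression, with a quick optional check:
+# zfs get volsize,volblocksize,compression,compressratio,sync rpool/swap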
7.2 Configure the swap device:
+Caution: Always use long /dev/zvol aliases in configuration files.
+Never use a short /dev/zdX device name.
# mkswap -f /dev/zvol/rpool/swap
+# echo /dev/zvol/rpool/swap none swap discard 0 0 >> /etc/fstab
+# echo RESUME=none > /etc/initramfs-tools/conf.d/resume
+
The RESUME=none setting is necessary to disable resuming from hibernation.
+Resuming does not work here, as the zvol is not present (because the
+pool has not yet been imported) at the time the resume script runs. If
+it is not disabled, the boot process hangs for 30 seconds waiting for
+the swap zvol to appear.
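+If you add or change this file after the initramfs has already been generated, regenerate it so the setting actually lands in the initramfs (a precaution, not an extra step required by this guide):
+# update-initramfs -u -k all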
7.3 Enable the swap device:
+# swapon -av
+
8.1 Upgrade the minimal system:
+# apt dist-upgrade --yes
+
8.2 Install a regular set of software:
+# tasksel
+
Note: This will check “Debian desktop environment” and “print server” +by default. If you want a server installation, unselect those.
+8.3 Optional: Disable log compression:
+As /var/log is already compressed by ZFS, logrotate’s compression is
+going to burn CPU and disk I/O for (in most cases) very little gain.
+Also, if you are making snapshots of /var/log, logrotate’s compression
+will actually waste space, as the uncompressed data will live on in
+the snapshot. You can edit the files in /etc/logrotate.d by hand to
+comment out compress, or use this loop (copy-and-paste highly
+recommended):
# for file in /etc/logrotate.d/* ; do
+ if grep -Eq "(^|[^#y])compress" "$file" ; then
+ sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file"
+ fi
+done
+
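+To double-check that no logrotate configuration still enables compression, an optional verification along these lines can be used:
+# grep -E '(^|[^#y])compress' /etc/logrotate.d/* || echo 'compression disabled everywhere'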
8.4 Reboot:
+# reboot
+
9.1 Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
+9.2 Optional: Delete the snapshots of the initial installation:
+$ sudo zfs destroy bpool/BOOT/debian@install
+$ sudo zfs destroy rpool/ROOT/debian@install
+
9.3 Optional: Disable the root password
+$ sudo usermod -p '*' root
+
9.4 Optional: Re-enable the graphical boot process:
+If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer.
+$ sudo vi /etc/default/grub
+Add quiet to GRUB_CMDLINE_LINUX_DEFAULT
+Comment out GRUB_TERMINAL=console
+Save and quit.
+
+$ sudo update-grub
+
Note: Ignore errors from osprober, if present.
9.5 Optional: For LUKS installs only, backup the LUKS header:
+$ sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected +by your LUKS passphrase, but you may wish to use additional encryption.
Hint: If you created a mirror or raidz topology, repeat this for
+each LUKS volume (luks2, etc.).
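+Should you ever need it, the header can be restored from that file with cryptsetup; this is only a sketch, so double-check the target device before running it:
+$ sudo cryptsetup luksHeaderRestore /dev/disk/by-id/scsi-SATA_disk1-part4 \
+    --header-backup-file luks1-header.dat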
Go through Step 1: Prepare The Install +Environment.
+This will automatically import your pool. Export it and re-import it to +get the mounts right:
+For LUKS, first unlock the disk(s):
+# apt install --yes cryptsetup
+# cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1
+Repeat for additional disks, if this is a mirror or raidz topology.
+
+# zpool export -a
+# zpool import -N -R /mnt rpool
+# zpool import -N -R /mnt bpool
+# zfs mount rpool/ROOT/debian
+# zfs mount -a
+
If needed, you can chroot into your installed environment:
+# mount --rbind /dev /mnt/dev
+# mount --rbind /proc /mnt/proc
+# mount --rbind /sys /mnt/sys
+# chroot /mnt /bin/bash --login
+# mount /boot/efi
+# mount -a
+
Do whatever you need to do to fix your system.
+When done, cleanup:
+# exit
+# mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {}
+# zpool export -a
+# reboot
+
Most problem reports for this tutorial involve mpt2sas hardware that
+does slow asynchronous drive initialization, like some IBM M1015 or
+OEM-branded cards that have been flashed to the reference LSI firmware.
The basic problem is that disks on these controllers are not visible to +the Linux kernel until after the regular system is started, and ZoL does +not hotplug pool members. See +https://github.com/zfsonlinux/zfs/issues/330.
+Most LSI cards are perfectly compatible with ZoL. If your card has this +glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X in +/etc/default/zfs. The system will wait X seconds for all drives to +appear before importing the pool.
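+For example (a sketch; the 15-second value is arbitrary, and the initramfs must be regenerated so the setting is picked up at boot):
+# echo 'ZFS_INITRD_PRE_MOUNTROOT_SLEEP=15' >> /etc/default/zfs
+# update-initramfs -u -k all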
+Systems that require the arcsas blob driver should add it to the
+/etc/initramfs-tools/modules file and run update-initramfs -u -k all.
Upgrade or downgrade the Areca driver if something like
+RIP: 0010:[<ffffffff8101b316>] [<ffffffff8101b316>] native_read_tsc+0x6/0x20
+appears anywhere in the kernel log. ZoL is unstable on systems that emit
+this error message.
Set disk.EnableUUID = "TRUE" in the vmx file or vSphere
+configuration. Doing this ensures that /dev/disk aliases are
+created in the guest.
Set a unique serial number on each virtual disk using libvirt or qemu
+(e.g. -drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890).
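+For example, a QEMU invocation with per-disk serial numbers might look like this (hypothetical image names and serial values):
+$ qemu-system-x86_64 -enable-kvm -m 4096 \
+    -drive if=virtio,format=qcow2,file=disk1.qcow2,serial=1122334455 \
+    -drive if=virtio,format=qcow2,file=disk2.qcow2,serial=5544332211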
To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host:
+$ sudo apt install ovmf
+$ sudo vi /etc/libvirt/qemu.conf
+Uncomment these lines:
+nvram = [
+ "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd"
+]
+$ sudo service libvirt-bin restart
+
If you want to use ZFS as your root filesystem, see the Root on ZFS +links below instead.
+ZFS packages are included in the contrib repository. The +backports repository +often provides newer releases of ZFS. You can use it as follows.
+Add the backports repository:
+vi /etc/apt/sources.list.d/bookworm-backports.list
+
deb http://deb.debian.org/debian bookworm-backports main contrib
+deb-src http://deb.debian.org/debian bookworm-backports main contrib
+
vi /etc/apt/preferences.d/90_zfs
+
Package: src:zfs-linux
+Pin: release n=bookworm-backports
+Pin-Priority: 990
+
Install the packages:
+apt update
+apt install dpkg-dev linux-headers-generic linux-image-generic
+apt install zfs-dkms zfsutils-linux
+
Caution: If you are in a poorly configured environment (e.g. certain VM or container consoles), when apt attempts to pop up a message on first install, it may fail to notice a real console is unavailable, and instead appear to hang indefinitely. To circumvent this, you can prefix the apt install commands with DEBIAN_FRONTEND=noninteractive, like this:
DEBIAN_FRONTEND=noninteractive apt install zfs-dkms zfsutils-linux
+
ZFSBootMenu
This tutorial is based on the GRUB bootloader. Due to its independent +implementation of a read-only ZFS driver, GRUB only supports a subset +of ZFS features on the boot pool. [In general, bootloaders treat disks +as read-only to minimize the risk of damaging on-disk data.]
+ZFSBootMenu is an alternative bootloader +free of such limitations and has support for boot environments. Do not +follow instructions on this page if you plan to use ZBM, +as the layouts are not compatible. Refer +to their site for installation details.
+Customization
+Unless stated otherwise, it is not recommended to customize system +configuration before reboot.
+Only use well-tested pool features
+You should only use well-tested pool features. Avoid using new features if data integrity is paramount. See, for example, this comment.
+Disable Secure Boot. ZFS modules can not be loaded if Secure Boot is enabled.
Because the kernel of latest Live CD might be incompatible with +ZFS, we will use Alpine Linux Extended, which ships with ZFS by +default.
+Download latest extended variant of Alpine Linux +live image, +verify checksum +and boot from it.
+gpg --auto-key-retrieve --keyserver hkps://keyserver.ubuntu.com --verify alpine-extended-*.asc
+
+dd if=input-file of=output-file bs=1M
+
Login as root user. There is no password.
Configure Internet
+setup-interfaces -r
+# You must use "-r" option to start networking services properly
+# example:
+network interface: wlan0
+WiFi name: <ssid>
+ip address: dhcp
+<enter done to finish network config>
+manual netconfig: n
+
If you are using a wireless network and it is not shown, see the Alpine
+Linux wiki for further details. wpa_supplicant can be installed with
+apk add wpa_supplicant without an internet connection.
Configure SSH server
+setup-sshd
+# example:
+ssh server: openssh
+allow root: "prohibit-password" or "yes"
+ssh key: "none" or "<public key>"
+
Set root password or /root/.ssh/authorized_keys.
Connect from another computer
+ssh root@192.168.1.91
+
Configure NTP client for time synchronization
+setup-ntp busybox
+
Set up apk-repo. A list of available mirrors is shown. +Press space bar to continue
+setup-apkrepos
+
Throughout this guide, we use predictable disk names generated by +udev
+apk update
+apk add eudev
+setup-devd udev
+
Target disk
+List available disks with
+find /dev/disk/by-id/
+
If virtio is used as the disk bus, power off the VM and set serial numbers for the disks.
+For QEMU, use -drive format=raw,file=disk2.img,serial=AaBb.
+For libvirt, edit the domain XML. See this page for examples.
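+After powering the VM back on, the disks should appear under /dev/disk/by-id/ with names derived from those serial numbers (for virtio, typically virtio-<serial>); a quick optional check:
+find /dev/disk/by-id/ -name 'virtio-*'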
Declare disk array
+DISK='/dev/disk/by-id/ata-FOO /dev/disk/by-id/nvme-BAR'
+
For single disk installation, use
+DISK='/dev/disk/by-id/disk1'
+
Set a mount point
+MNT=$(mktemp -d)
+
Set partition size:
+Set swap size in GB, set to 1 if you don’t want swap to +take up too much space
+SWAPSIZE=4
+
Set how much space should be left at the end of the disk, minimum 1GB
+RESERVE=1
+
Install ZFS support from live media:
+apk add zfs
+
Install partition tool
+apk add parted e2fsprogs cryptsetup util-linux
+
Partition the disks.
+Note: you must clear all existing partition tables and data structures from target disks.
+For flash-based storage, this can be done by the blkdiscard command below:
+partition_disk () {
+ local disk="${1}"
+ blkdiscard -f "${disk}" || true
+
+ parted --script --align=optimal "${disk}" -- \
+ mklabel gpt \
+ mkpart EFI 2MiB 1GiB \
+ mkpart bpool 1GiB 5GiB \
+ mkpart rpool 5GiB -$((SWAPSIZE + RESERVE))GiB \
+ mkpart swap -$((SWAPSIZE + RESERVE))GiB -"${RESERVE}"GiB \
+ mkpart BIOS 1MiB 2MiB \
+ set 1 esp on \
+ set 5 bios_grub on \
+ set 5 legacy_boot on
+
+ partprobe "${disk}"
+}
+
+for i in ${DISK}; do
+ partition_disk "${i}"
+done
+
Setup encrypted swap. This is useful if the available memory is +small:
+for i in ${DISK}; do
+ cryptsetup open --type plain --key-file /dev/random "${i}"-part4 "${i##*/}"-part4
+ mkswap /dev/mapper/"${i##*/}"-part4
+ swapon /dev/mapper/"${i##*/}"-part4
+done
+
Load ZFS kernel module
+modprobe zfs
+
Create boot pool
+# shellcheck disable=SC2046
+zpool create -o compatibility=legacy \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O devices=off \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/boot \
+ -R "${MNT}" \
+ bpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '%s ' "${i}-part2";
+ done)
+
If not using a multi-disk setup, remove mirror.
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c. This step creates a separate boot pool for
+/boot with the features limited to only those that GRUB supports,
+allowing the root pool to use any/all features.
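+To see what that means in practice, you can list the feature flags of the new pool; only a GRUB-compatible subset should be enabled or active (optional check):
+zpool get all bpool | grep feature@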
Create root pool
+# shellcheck disable=SC2046
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -R "${MNT}" \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O compression=zstd \
+ -O dnodesize=auto \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/ \
+ rpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '%s ' "${i}-part3";
+ done)
+
If not using a multi-disk setup, remove mirror.
Create root system container:
+Unencrypted
+zfs create \
+ -o canmount=off \
+ -o mountpoint=none \
+rpool/fedora
+
Encrypted:
+Avoid ZFS send/recv when using native encryption; see a ZFS developer's
+comment on this issue and this spreadsheet of bugs. A LUKS-based guide
+has yet to be written. Once the passphrase is compromised, changing it
+will not keep your data safe. See zfs-change-key(8) for more info.
zfs create \
+ -o canmount=off \
+ -o mountpoint=none \
+ -o encryption=on \
+ -o keylocation=prompt \
+ -o keyformat=passphrase \
+rpool/fedora
+
You can automate this step (insecure) with: echo POOLPASS | zfs create ...
Create system datasets; manage mountpoints with mountpoint=legacy:
zfs create -o canmount=noauto -o mountpoint=/ rpool/fedora/root
+zfs mount rpool/fedora/root
+zfs create -o mountpoint=legacy rpool/fedora/home
+mkdir "${MNT}"/home
+mount -t zfs rpool/fedora/home "${MNT}"/home
+zfs create -o mountpoint=legacy rpool/fedora/var
+zfs create -o mountpoint=legacy rpool/fedora/var/lib
+zfs create -o mountpoint=legacy rpool/fedora/var/log
+zfs create -o mountpoint=none bpool/fedora
+zfs create -o mountpoint=legacy bpool/fedora/root
+mkdir "${MNT}"/boot
+mount -t zfs bpool/fedora/root "${MNT}"/boot
+mkdir -p "${MNT}"/var/log
+mkdir -p "${MNT}"/var/lib
+mount -t zfs rpool/fedora/var/lib "${MNT}"/var/lib
+mount -t zfs rpool/fedora/var/log "${MNT}"/var/log
+
Format and mount ESP
+for i in ${DISK}; do
+ mkfs.vfat -n EFI "${i}"-part1
+ mkdir -p "${MNT}"/boot/efis/"${i##*/}"-part1
+ mount -t vfat -o iocharset=iso8859-1 "${i}"-part1 "${MNT}"/boot/efis/"${i##*/}"-part1
+done
+
+mkdir -p "${MNT}"/boot/efi
+mount -t vfat -o iocharset=iso8859-1 "$(echo "${DISK}" | sed "s|^ *||" | cut -f1 -d' '|| true)"-part1 "${MNT}"/boot/efi
+
Download and extract minimal Fedora root filesystem:
+apk add curl
+curl --fail-early --fail -L \
+https://dl.fedoraproject.org/pub/fedora/linux/releases/38/Container/x86_64/images/Fedora-Container-Base-38-1.6.x86_64.tar.xz \
+-o rootfs.tar.gz
+curl --fail-early --fail -L \
+https://dl.fedoraproject.org/pub/fedora/linux/releases/38/Container/x86_64/images/Fedora-Container-38-1.6-x86_64-CHECKSUM \
+-o checksum
+
+# BusyBox sha256sum treats all lines in the checksum file
+# as checksums and requires two spaces " "
+# between filename and checksum
+
+grep 'Container-Base' checksum \
+| grep '^SHA256' \
+| sed -E 's|.*= ([a-z0-9]*)$|\1 rootfs.tar.gz|' > ./sha256checksum
+
+sha256sum -c ./sha256checksum
+
+rootfs_tar=$(tar t -af rootfs.tar.gz | grep layer.tar)
+rootfs_tar_dir=$(dirname "${rootfs_tar}")
+tar x -af rootfs.tar.gz "${rootfs_tar}"
+ln -s "${MNT}" "${MNT}"/"${rootfs_tar_dir}"
+tar x -C "${MNT}" -af "${rootfs_tar}"
+unlink "${MNT}"/"${rootfs_tar_dir}"
+
Enable community repo
+sed -i '/edge/d' /etc/apk/repositories
+sed -i -E 's/#(.*)community/\1community/' /etc/apk/repositories
+
Generate fstab:
+apk add arch-install-scripts
+genfstab -t PARTUUID "${MNT}" \
+| grep -v swap \
+| sed "s|vfat.*rw|vfat rw,x-systemd.idle-timeout=1min,x-systemd.automount,noauto,nofail|" \
+> "${MNT}"/etc/fstab
+
Chroot
+cp /etc/resolv.conf "${MNT}"/etc/resolv.conf
+for i in /dev /proc /sys; do mkdir -p "${MNT}"/"${i}"; mount --rbind "${i}" "${MNT}"/"${i}"; done
+chroot "${MNT}" /usr/bin/env DISK="${DISK}" bash
+
Unset all shell aliases, which can interfere with installation:
+unalias -a
+
Install base packages
+dnf -y install @core grub2-efi-x64 \
+grub2-pc grub2-pc-modules grub2-efi-x64-modules shim-x64 \
+efibootmgr kernel kernel-devel
+
Install ZFS packages
+dnf -y install \
+https://zfsonlinux.org/fedora/zfs-release-2-3"$(rpm --eval "%{dist}"||true)".noarch.rpm
+
+dnf -y install zfs zfs-dracut
+
Check whether ZFS modules are successfully built
+tail -n10 /var/lib/dkms/zfs/**/build/make.log
+
+# ERROR: modpost: GPL-incompatible module zfs.ko uses GPL-only symbol 'bio_start_io_acct'
+# ERROR: modpost: GPL-incompatible module zfs.ko uses GPL-only symbol 'bio_end_io_acct_remapped'
+# make[4]: [scripts/Makefile.modpost:138: /var/lib/dkms/zfs/2.1.9/build/module/Module.symvers] Error 1
+# make[3]: [Makefile:1977: modpost] Error 2
+# make[3]: Leaving directory '/usr/src/kernels/6.2.9-100.fc36.x86_64'
+# make[2]: [Makefile:55: modules-Linux] Error 2
+# make[2]: Leaving directory '/var/lib/dkms/zfs/2.1.9/build/module'
+# make[1]: [Makefile:933: all-recursive] Error 1
+# make[1]: Leaving directory '/var/lib/dkms/zfs/2.1.9/build'
+# make: [Makefile:794: all] Error 2
+
If the build failed, you need to install a Long Term Support +kernel and its headers, then rebuild the ZFS module:
+# this is a third-party repo!
+# you have been warned.
+#
+# select a kernel from
+# https://copr.fedorainfracloud.org/coprs/kwizart/
+
+dnf copr enable -y kwizart/kernel-longterm-VERSION
+dnf install -y kernel-longterm kernel-longterm-devel
+dnf remove -y kernel-core
+
ZFS modules will be built as part of the kernel installation.
+Check the build log again with the tail command.
Add zfs modules to dracut
+echo 'add_dracutmodules+=" zfs "' >> /etc/dracut.conf.d/zfs.conf
+echo 'force_drivers+=" zfs "' >> /etc/dracut.conf.d/zfs.conf
+
Add other drivers to dracut:
+if grep mpt3sas /proc/modules; then
+ echo 'force_drivers+=" mpt3sas "' >> /etc/dracut.conf.d/zfs.conf
+fi
+if grep virtio_blk /proc/modules; then
+ echo 'filesystems+=" virtio_blk "' >> /etc/dracut.conf.d/fs.conf
+fi
+
Build initrd
+find -D exec /lib/modules -maxdepth 1 \
+-mindepth 1 -type d \
+-exec sh -vxc \
+'if test -e "$1"/modules.dep;
+ then kernel=$(basename "$1");
+ dracut --verbose --force --kver "${kernel}";
+ fi' sh {} \;
+
For SELinux, relabel filesystem on reboot:
+fixfiles -F onboot
+
Enable internet time synchronisation:
+systemctl enable systemd-timesyncd
+
Generate host id
+zgenhostid -f -o /etc/hostid
+
Install locale package, example for English locale:
+dnf install -y glibc-minimal-langpack glibc-langpack-en
+
Set locale, keymap, timezone, hostname
+rm -f /etc/localtime
+rm -f /etc/hostname
+systemd-firstboot \
+--force \
+--locale=en_US.UTF-8 \
+--timezone=Etc/UTC \
+--hostname=testhost \
+--keymap=us || true
+
Set root passwd
+printf 'root:yourpassword' | chpasswd
+
Apply GRUB workaround
+echo 'export ZPOOL_VDEV_NAME_PATH=YES' >> /etc/profile.d/zpool_vdev_name_path.sh
+# shellcheck disable=SC1091
+. /etc/profile.d/zpool_vdev_name_path.sh
+
+# GRUB fails to detect rpool name, hard code as "rpool"
+sed -i "s|rpool=.*|rpool=rpool|" /etc/grub.d/10_linux
+
This workaround needs to be applied for every GRUB update, as the +update will overwrite the changes.
Fedora and RHEL use the Boot Loader Specification module for GRUB, +which does not support ZFS. Disable it:
+echo 'GRUB_ENABLE_BLSCFG=false' >> /etc/default/grub
+
This means that you need to regenerate the GRUB menu and mirror it +after every kernel update; otherwise the computer will still boot the old +kernel on reboot.
+Install GRUB:
+mkdir -p /boot/efi/fedora/grub-bootdir/i386-pc/
+for i in ${DISK}; do
+ grub2-install --target=i386-pc --boot-directory \
+ /boot/efi/fedora/grub-bootdir/i386-pc/ "${i}"
+done
+dnf reinstall -y grub2-efi-x64 shim-x64
+cp -r /usr/lib/grub/x86_64-efi/ /boot/efi/EFI/fedora/
+
Generate GRUB menu
+mkdir -p /boot/grub2
+grub2-mkconfig -o /boot/grub2/grub.cfg
+cp /boot/grub2/grub.cfg \
+ /boot/efi/efi/fedora/grub.cfg
+cp /boot/grub2/grub.cfg \
+ /boot/efi/fedora/grub-bootdir/i386-pc/grub2/grub.cfg
+
For both legacy and EFI booting: mirror ESP content:
+espdir=$(mktemp -d)
+find /boot/efi/ -maxdepth 1 -mindepth 1 -type d -print0 \
+| xargs -t -0I '{}' cp -r '{}' "${espdir}"
+find "${espdir}" -maxdepth 1 -mindepth 1 -type d -print0 \
+| xargs -t -0I '{}' sh -vxc "find /boot/efis/ -maxdepth 1 -mindepth 1 -type d -print0 | xargs -t -0I '[]' cp -r '{}' '[]'"
+
Exit chroot
+exit
+
Unmount filesystems and create initial system snapshot +You can later create a boot environment from this snapshot. +See Root on ZFS maintenance page.
+umount -Rl "${MNT}"
+zfs snapshot -r rpool@initial-installation
+zfs snapshot -r bpool@initial-installation
+
Export all pools
+zpool export -a
+
Reboot
+reboot
+
For BIOS-legacy boot users only: the GRUB bootloader installed +might be unusable. In this case, see Bootloader Recovery section +in Root on ZFS maintenance page.
+This issue is not related to Alpine Linux chroot, as Arch Linux +installed with this method does not have this issue.
+UEFI bootloader is not affected by this issue.
On first reboot, SELinux policies will be applied, albeit
+incompletely. The computer will then reboot with incomplete
+policies and fail to mount /run, resulting in a boot failure.
The workaround is to append enforcing=0 to the kernel command line in
+the GRUB menu, as many times as necessary, until the system
+completes one successful boot. The author of this guide has not
+found a way to solve this issue during installation. Help is
+appreciated.
Install package groups
+dnf group list --hidden -v # query package groups
+dnf group install gnome-desktop
+
Add new user, configure swap.
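+For example, a minimal sketch of creating an administrative user (the name alice is a placeholder; for swap, reuse the encrypted-swap commands from the partitioning step):
+useradd -m -G wheel alice
+passwd alice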
Note: this is for installing ZFS on an existing Fedora +installation. To use ZFS as root file system, +see below.
If zfs-fuse from the official Fedora repo is installed,
+remove it first. It is not maintained and should not be used
+under any circumstance:
rpm -e --nodeps zfs-fuse
+
Add ZFS repo:
+dnf install -y https://zfsonlinux.org/fedora/zfs-release-2-4$(rpm --eval "%{dist}").noarch.rpm
+
List of repos is available here.
+Install kernel headers:
+dnf install -y kernel-devel
+
The kernel-devel package must be installed before the zfs package.
Install ZFS packages:
+dnf install -y zfs
+
Load kernel module:
+modprobe zfs
+
If the kernel module cannot be loaded, your kernel version +might not yet be supported by OpenZFS.
An option is to install an LTS kernel from COPR, provided by a third party. +Use it at your own risk:
+# this is a third-party repo!
+# you have been warned.
+#
+# select a kernel from
+# https://copr.fedorainfracloud.org/coprs/kwizart/
+
+dnf copr enable -y kwizart/kernel-longterm-VERSION
+dnf install -y kernel-longterm kernel-longterm-devel
+
Reboot to new LTS kernel, then load kernel module:
+modprobe zfs
+
By default ZFS kernel modules are loaded upon detecting a pool. +To always load the modules at boot:
+echo zfs > /etc/modules-load.d/zfs.conf
+
By default ZFS may be removed by kernel package updates. +To lock the kernel version to only ones supported by ZFS to prevent this:
+echo 'zfs' > /etc/dnf/protected.d/zfs.conf
+
dnf update --exclude=kernel*
The testing repository, which is disabled by default, contains +the latest version of OpenZFS, which is under active development. +These packages should not be used on production systems.
+dnf config-manager --enable zfs-testing
+dnf install zfs
+
OpenZFS is available pre-packaged as:
+the zfs-2.0-release branch, in the FreeBSD base system from FreeBSD 13.0-CURRENT forward
the master branch, in the FreeBSD ports tree as sysutils/openzfs and sysutils/openzfs-kmod from FreeBSD 12.1 forward
The rest of this document describes the use of OpenZFS either from ports/pkg or built manually from sources for development.
+The ZFS utilities will be installed in /usr/local/sbin/, so make sure +your PATH gets adjusted accordingly.
+To load the module at boot, put openzfs_load="YES"
in
+/boot/loader.conf, and remove zfs_load="YES"
if migrating a ZFS
+install.
Beware that the FreeBSD boot loader does not allow booting from root +pools with encryption active (even if it is not in use), so do not try +encryption on a pool you boot from.
+The following dependencies are required to build OpenZFS on FreeBSD:
+FreeBSD sources in /usr/src or elsewhere specified by SYSDIR in env. +If you don’t have the sources installed you can install them with +git.
+Install source For FreeBSD 12:
+git clone -b stable/12 https://git.FreeBSD.org/src.git /usr/src
+
Install source for FreeBSD Current:
+git clone https://git.FreeBSD.org/src.git /usr/src
+
Packages for build:
+pkg install \
+ autoconf \
+ automake \
+ autotools \
+ git \
+ gmake
+
Optional packages for build:
+pkg install python
+pkg install devel/py-sysctl # needed for arcstat, arc_summary, dbufstat
+
Packages for checks and tests:
+pkg install \
+ base64 \
+ bash \
+ checkbashisms \
+ fio \
+ hs-ShellCheck \
+ ksh93 \
+ pamtester \
+ devel/py-flake8 \
+ sudo
+
Your preferred python version may be substituted. The user for +running tests must have NOPASSWD sudo permission.
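+A minimal sketch of granting that permission, assuming a test user named tester and sudo installed from packages (which reads /usr/local/etc/sudoers.d/ on FreeBSD):
+echo 'tester ALL=(ALL) NOPASSWD: ALL' > /usr/local/etc/sudoers.d/tester
+chmod 440 /usr/local/etc/sudoers.d/tester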
+To build and install:
+# as user
+git clone https://github.com/openzfs/zfs
+cd zfs
+./autogen.sh
+env MAKE=gmake ./configure
+gmake -j`sysctl -n hw.ncpu`
+# as root
+gmake install
+
To use the OpenZFS kernel module when FreeBSD starts, edit /boot/loader.conf
:
Replace the line:
+zfs_load="YES"
+
with:
+openzfs_load="YES"
+
The stock FreeBSD ZFS binaries are installed in /sbin. OpenZFS binaries are installed to /usr/local/sbin when installed from ports/pkg or manually from source. To use the OpenZFS binaries, adjust your PATH so /usr/local/sbin is listed before /sbin; otherwise the native ZFS binaries will be used.
For example, change ~/.profile, ~/.bashrc, or ~/.cshrc from this:
+PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin:~/bin
+
To this:
+PATH=/usr/local/sbin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:~/bin
+
For rapid development it can be convenient to do a UFS install instead +of ZFS when setting up the work environment. That way the module can be +unloaded and loaded without rebooting.
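+For instance, after a rebuild the module can be swapped roughly like this (a sketch; it assumes no pools are imported and that the locally built module is installed as openzfs.ko):
+zpool export -a
+kldunload openzfs
+kldload openzfs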
+reboot
+
Though not required, WITHOUT_ZFS is a useful build option in FreeBSD
+to avoid building and installing the legacy zfs tools and kmod - see
+src.conf(5).
Some tests require fdescfs to be mounted on /dev/fd. This can be done +temporarily with:
+mount -t fdescfs fdescfs /dev/fd
+
or an entry can be added to /etc/fstab.
+fdescfs /dev/fd fdescfs rw 0 0
+
Note for arm64:
+Currently there is a bug with the GRUB installation script. See here for details.
+Note for Immutable Root:
+Immutable root can be enabled or disabled by setting the
+zfs-root.boot.immutable option inside the per-host configuration.
Customization
+Unless stated otherwise, it is not recommended to customize system +configuration before reboot.
+Only use well-tested pool features
+You should only use well-tested pool features. Avoid using new features if data integrity is paramount. See, for example, this comment.
+Disable Secure Boot. ZFS modules can not be loaded if Secure Boot is enabled.
Download NixOS Live Image and boot from it.
+sha256sum -c ./nixos-*.sha256
+
+dd if=input-file of=output-file bs=1M
+
Connect to the Internet.
Set root password or /root/.ssh/authorized_keys.
Start SSH server
+systemctl restart sshd
+
Connect from another computer
+ssh root@192.168.1.91
+
Target disk
+List available disks with
+find /dev/disk/by-id/
+
If virtio is used as the disk bus, power off the VM and set serial numbers for the disks.
+For QEMU, use -drive format=raw,file=disk2.img,serial=AaBb.
+For libvirt, edit the domain XML. See this page for examples.
Declare disk array
+DISK='/dev/disk/by-id/ata-FOO /dev/disk/by-id/nvme-BAR'
+
For single disk installation, use
+DISK='/dev/disk/by-id/disk1'
+
Set a mount point
+MNT=$(mktemp -d)
+
Set partition size:
+Set swap size in GB, set to 1 if you don’t want swap to +take up too much space
+SWAPSIZE=4
+
Set how much space should be left at the end of the disk, minimum 1GB
+RESERVE=1
+
Enable Nix Flakes functionality
+mkdir -p ~/.config/nix
+echo "experimental-features = nix-command flakes" >> ~/.config/nix/nix.conf
+
Install programs needed for system installation
+if ! command -v git; then nix-env -f '<nixpkgs>' -iA git; fi
+if ! command -v partprobe; then nix-env -f '<nixpkgs>' -iA parted; fi
+
Partition the disks.
+Note: you must clear all existing partition tables and data structures from target disks.
+For flash-based storage, this can be done by the blkdiscard command below:
+partition_disk () {
+ local disk="${1}"
+ blkdiscard -f "${disk}" || true
+
+ parted --script --align=optimal "${disk}" -- \
+ mklabel gpt \
+ mkpart EFI 2MiB 1GiB \
+ mkpart bpool 1GiB 5GiB \
+ mkpart rpool 5GiB -$((SWAPSIZE + RESERVE))GiB \
+ mkpart swap -$((SWAPSIZE + RESERVE))GiB -"${RESERVE}"GiB \
+ mkpart BIOS 1MiB 2MiB \
+ set 1 esp on \
+ set 5 bios_grub on \
+ set 5 legacy_boot on
+
+ partprobe "${disk}"
+ udevadm settle
+}
+
+for i in ${DISK}; do
+ partition_disk "${i}"
+done
+
Setup encrypted swap. This is useful if the available memory is +small:
+for i in ${DISK}; do
+ cryptsetup open --type plain --key-file /dev/random "${i}"-part4 "${i##*/}"-part4
+ mkswap /dev/mapper/"${i##*/}"-part4
+ swapon /dev/mapper/"${i##*/}"-part4
+done
+
LUKS only: Setup encrypted LUKS container for root pool:
+for i in ${DISK}; do
+ # see PASSPHRASE PROCESSING section in cryptsetup(8)
+ printf "YOUR_PASSWD" | cryptsetup luksFormat --type luks2 "${i}"-part3 -
+ printf "YOUR_PASSWD" | cryptsetup luksOpen "${i}"-part3 luks-rpool-"${i##*/}"-part3 -
+done
+
Create boot pool
+# shellcheck disable=SC2046
+zpool create -o compatibility=legacy \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O devices=off \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/boot \
+ -R "${MNT}" \
+ bpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '%s ' "${i}-part2";
+ done)
+
If not using a multi-disk setup, remove mirror.
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c. This step creates a separate boot pool for
+/boot with the features limited to only those that GRUB supports,
+allowing the root pool to use any/all features.
Features enabled with -o compatibility=grub2 can be seen here.
Create root pool
+Unencrypted
+# shellcheck disable=SC2046
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -R "${MNT}" \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O compression=zstd \
+ -O dnodesize=auto \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/ \
+ rpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '%s ' "${i}-part3";
+ done)
+
LUKS encrypted
+# shellcheck disable=SC2046
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -R "${MNT}" \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O compression=zstd \
+ -O dnodesize=auto \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/ \
+ rpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '/dev/mapper/luks-rpool-%s ' "${i##*/}-part3";
+ done)
+
If not using a multi-disk setup, remove mirror.
Create root system container:
+Unencrypted
+zfs create \
+ -o canmount=off \
+ -o mountpoint=none \
+rpool/nixos
+
Encrypted:
+Avoid ZFS send/recv when using native encryption; see a ZFS developer's comment on this issue and this spreadsheet of bugs. In short, if you care about your data, don’t use native encryption. This section has been removed; use LUKS encryption instead.
Create system datasets; manage mountpoints with mountpoint=legacy:
zfs create -o mountpoint=legacy rpool/nixos/root
+mount -t zfs rpool/nixos/root "${MNT}"/
+zfs create -o mountpoint=legacy rpool/nixos/home
+mkdir "${MNT}"/home
+mount -t zfs rpool/nixos/home "${MNT}"/home
+zfs create -o mountpoint=none rpool/nixos/var
+zfs create -o mountpoint=legacy rpool/nixos/var/lib
+zfs create -o mountpoint=legacy rpool/nixos/var/log
+zfs create -o mountpoint=none bpool/nixos
+zfs create -o mountpoint=legacy bpool/nixos/root
+mkdir "${MNT}"/boot
+mount -t zfs bpool/nixos/root "${MNT}"/boot
+mkdir -p "${MNT}"/var/log
+mkdir -p "${MNT}"/var/lib
+mount -t zfs rpool/nixos/var/lib "${MNT}"/var/lib
+mount -t zfs rpool/nixos/var/log "${MNT}"/var/log
+zfs create -o mountpoint=legacy rpool/nixos/empty
+zfs snapshot rpool/nixos/empty@start
+
Format and mount ESP
+for i in ${DISK}; do
+ mkfs.vfat -n EFI "${i}"-part1
+ mkdir -p "${MNT}"/boot/efis/"${i##*/}"-part1
+ mount -t vfat -o iocharset=iso8859-1 "${i}"-part1 "${MNT}"/boot/efis/"${i##*/}"-part1
+done
+
Clone template flake configuration
+mkdir -p "${MNT}"/etc
+git clone --depth 1 --branch openzfs-guide \
+ https://github.com/ne9z/dotfiles-flake.git "${MNT}"/etc/nixos
+
From now on, the complete configuration of the system will be +tracked by git. Set a user name and email address to continue:
+rm -rf "${MNT}"/etc/nixos/.git
+git -C "${MNT}"/etc/nixos/ init -b main
+git -C "${MNT}"/etc/nixos/ add "${MNT}"/etc/nixos/
+git -C "${MNT}"/etc/nixos config user.email "you@example.com"
+git -C "${MNT}"/etc/nixos config user.name "Alice Q. Nixer"
+git -C "${MNT}"/etc/nixos commit -asm 'initial commit'
+
Customize configuration to your hardware
+for i in ${DISK}; do
+ sed -i \
+ "s|/dev/disk/by-id/|${i%/*}/|" \
+ "${MNT}"/etc/nixos/hosts/exampleHost/default.nix
+ break
+done
+
+diskNames=""
+for i in ${DISK}; do
+ diskNames="${diskNames} \"${i##*/}\""
+done
+
+sed -i "s|\"bootDevices_placeholder\"|${diskNames}|g" \
+ "${MNT}"/etc/nixos/hosts/exampleHost/default.nix
+
+sed -i "s|\"abcd1234\"|\"$(head -c4 /dev/urandom | od -A none -t x4| sed 's| ||g' || true)\"|g" \
+ "${MNT}"/etc/nixos/hosts/exampleHost/default.nix
+
+sed -i "s|\"x86_64-linux\"|\"$(uname -m || true)-linux\"|g" \
+ "${MNT}"/etc/nixos/flake.nix
+
LUKS only: Enable LUKS support:
+sed -i 's|luks.enable = false|luks.enable = true|' "${MNT}"/etc/nixos/hosts/exampleHost/default.nix
+
Detect kernel modules needed for boot
+cp "$(command -v nixos-generate-config || true)" ./nixos-generate-config
+
+chmod a+rw ./nixos-generate-config
+
+# shellcheck disable=SC2016
+echo 'print STDOUT $initrdAvailableKernelModules' >> ./nixos-generate-config
+
+kernelModules="$(./nixos-generate-config --show-hardware-config --no-filesystems | tail -n1 || true)"
+
+sed -i "s|\"kernelModules_placeholder\"|${kernelModules}|g" \
+ "${MNT}"/etc/nixos/hosts/exampleHost/default.nix
+
Set root password
+rootPwd=$(mkpasswd -m SHA-512)
+
Declare password in configuration
+sed -i \
+"s|rootHash_placeholder|${rootPwd}|" \
+"${MNT}"/etc/nixos/configuration.nix
+
You can enable NetworkManager for wireless networks and the GNOME
+desktop environment in configuration.nix.
Commit changes to local repo
+git -C "${MNT}"/etc/nixos commit -asm 'initial installation'
+
Update flake lock file to track latest system version
+nix flake update --commit-lock-file \
+ "git+file://${MNT}/etc/nixos"
+
Install system and apply configuration
+nixos-install \
+--root "${MNT}" \
+--no-root-passwd \
+--flake "git+file://${MNT}/etc/nixos#exampleHost"
+
Unmount filesystems
+umount -Rl "${MNT}"
+zpool export -a
+
Reboot
+reboot
+
For instructions on maintenance tasks, see Root on ZFS maintenance +page.
Reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat.
+If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @ne9z.
+Note: this is for installing ZFS on an existing +NixOS installation. To use ZFS as root file system, +see below.
+NixOS live image ships with ZFS support by default.
+Note that you need to apply these settings even if you don’t need +to boot from ZFS. The kernel module ‘zfs.ko’ will not be available +to modprobe until you make these changes and reboot.
+Edit /etc/nixos/configuration.nix
and add the following
+options:
boot.supportedFilesystems = [ "zfs" ];
+boot.zfs.forceImportRoot = false;
+networking.hostId = "yourHostId";
+
Where hostID can be generated with:
+head -c4 /dev/urandom | od -A none -t x4
+
Apply configuration changes:
+nixos-rebuild boot
+
Reboot:
+reboot
+
You can contribute to this documentation. Fork this repo, edit the +documentation, then open a pull request.
+To test your changes locally, use the devShell in this repo:
+git clone https://github.com/ne9z/nixos-live openzfs-docs-dev
+cd openzfs-docs-dev
+nix develop ./openzfs-docs-dev/#docs
+
Inside the openzfs-docs repo, build pages:
+make html
+
Look for errors and warnings in the make output. If there are no +errors:
+xdg-open _build/html/index.html
+
git commit --signoff to a branch, git push, and create a
+pull request. Mention @ne9z.
This page has been moved to RHEL-based distro.
+ZFSBootMenu
This tutorial is based on the GRUB bootloader. Due to its independent +implementation of a read-only ZFS driver, GRUB only supports a subset +of ZFS features on the boot pool. [In general, bootloaders treat disks +as read-only to minimize the risk of damaging on-disk data.]
+ZFSBootMenu is an alternative bootloader +free of such limitations and has support for boot environments. Do not +follow instructions on this page if you plan to use ZBM, +as the layouts are not compatible. Refer +to their site for installation details.
+Customization
+Unless stated otherwise, it is not recommended to customize system +configuration before reboot.
+Only use well-tested pool features
+You should only use well-tested pool features. Avoid using new features if data integrity is paramount. See, for example, this comment.
+Disable Secure Boot. ZFS modules can not be loaded if Secure Boot is enabled.
Because the kernel of latest Live CD might be incompatible with +ZFS, we will use Alpine Linux Extended, which ships with ZFS by +default.
+Download latest extended variant of Alpine Linux +live image, +verify checksum +and boot from it.
+gpg --auto-key-retrieve --keyserver hkps://keyserver.ubuntu.com --verify alpine-extended-*.asc
+
+dd if=input-file of=output-file bs=1M
+
Login as root user. There is no password.
Configure Internet
+setup-interfaces -r
+# You must use "-r" option to start networking services properly
+# example:
+network interface: wlan0
+WiFi name: <ssid>
+ip address: dhcp
+<enter done to finish network config>
+manual netconfig: n
+
If you are using a wireless network and it is not shown, see the Alpine
+Linux wiki for further details. wpa_supplicant can be installed with
+apk add wpa_supplicant without an internet connection.
Configure SSH server
+setup-sshd
+# example:
+ssh server: openssh
+allow root: "prohibit-password" or "yes"
+ssh key: "none" or "<public key>"
+
Set root password or /root/.ssh/authorized_keys.
Connect from another computer
+ssh root@192.168.1.91
+
Configure NTP client for time synchronization
+setup-ntp busybox
+
Set up apk-repo. A list of available mirrors is shown. +Press space bar to continue
+setup-apkrepos
+
Throughout this guide, we use predictable disk names generated by +udev
+apk update
+apk add eudev
+setup-devd udev
+
Target disk
+List available disks with
+find /dev/disk/by-id/
+
If virtio is used as the disk bus, power off the VM and set serial numbers for the disks.
+For QEMU, use -drive format=raw,file=disk2.img,serial=AaBb.
+For libvirt, edit the domain XML. See this page for examples.
Declare disk array
+DISK='/dev/disk/by-id/ata-FOO /dev/disk/by-id/nvme-BAR'
+
For single disk installation, use
+DISK='/dev/disk/by-id/disk1'
+
Set a mount point
+MNT=$(mktemp -d)
+
Set partition size:
+Set swap size in GB, set to 1 if you don’t want swap to +take up too much space
+SWAPSIZE=4
+
Set how much space should be left at the end of the disk, minimum 1GB
+RESERVE=1
+
Install ZFS support from live media:
+apk add zfs
+
Install partition tool
+apk add parted e2fsprogs cryptsetup util-linux
+
Partition the disks.
+Note: you must clear all existing partition tables and data structures from target disks.
+For flash-based storage, this can be done by the blkdiscard command below:
+partition_disk () {
+ local disk="${1}"
+ blkdiscard -f "${disk}" || true
+
+ parted --script --align=optimal "${disk}" -- \
+ mklabel gpt \
+ mkpart EFI 2MiB 1GiB \
+ mkpart bpool 1GiB 5GiB \
+ mkpart rpool 5GiB -$((SWAPSIZE + RESERVE))GiB \
+ mkpart swap -$((SWAPSIZE + RESERVE))GiB -"${RESERVE}"GiB \
+ mkpart BIOS 1MiB 2MiB \
+ set 1 esp on \
+ set 5 bios_grub on \
+ set 5 legacy_boot on
+
+ partprobe "${disk}"
+}
+
+for i in ${DISK}; do
+ partition_disk "${i}"
+done
+
Setup encrypted swap. This is useful if the available memory is +small:
+for i in ${DISK}; do
+ cryptsetup open --type plain --key-file /dev/random "${i}"-part4 "${i##*/}"-part4
+ mkswap /dev/mapper/"${i##*/}"-part4
+ swapon /dev/mapper/"${i##*/}"-part4
+done
+
Load ZFS kernel module
+modprobe zfs
+
Create boot pool
+# shellcheck disable=SC2046
+zpool create -o compatibility=legacy \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O devices=off \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/boot \
+ -R "${MNT}" \
+ bpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '%s ' "${i}-part2";
+ done)
+
If not using a multi-disk setup, remove mirror.
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c. This step creates a separate boot pool for
+/boot with the features limited to only those that GRUB supports,
+allowing the root pool to use any/all features.
Create root pool
+# shellcheck disable=SC2046
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -R "${MNT}" \
+ -O acltype=posixacl \
+ -O canmount=off \
+ -O compression=zstd \
+ -O dnodesize=auto \
+ -O normalization=formD \
+ -O relatime=on \
+ -O xattr=sa \
+ -O mountpoint=/ \
+ rpool \
+ mirror \
+ $(for i in ${DISK}; do
+ printf '%s ' "${i}-part3";
+ done)
+
If not using a multi-disk setup, remove mirror.
Create root system container:
+Unencrypted
+zfs create \
+ -o canmount=off \
+ -o mountpoint=none \
+rpool/rhel
+
Encrypted:
+Avoid ZFS send/recv when using native encryption; see a ZFS developer's
+comment on this issue and this spreadsheet of bugs. A LUKS-based guide
+has yet to be written. Once the passphrase is compromised, changing it
+will not keep your data safe. See zfs-change-key(8) for more info.
zfs create \
+ -o canmount=off \
+ -o mountpoint=none \
+ -o encryption=on \
+ -o keylocation=prompt \
+ -o keyformat=passphrase \
+rpool/rhel
+
You can automate this step (insecure) with: echo POOLPASS | zfs create ...
Create system datasets; manage mountpoints with mountpoint=legacy:
zfs create -o canmount=noauto -o mountpoint=/ rpool/rhel/root
+zfs mount rpool/rhel/root
+zfs create -o mountpoint=legacy rpool/rhel/home
+mkdir "${MNT}"/home
+mount -t zfs rpool/rhel/home "${MNT}"/home
+zfs create -o mountpoint=legacy rpool/rhel/var
+zfs create -o mountpoint=legacy rpool/rhel/var/lib
+zfs create -o mountpoint=legacy rpool/rhel/var/log
+zfs create -o mountpoint=none bpool/rhel
+zfs create -o mountpoint=legacy bpool/rhel/root
+mkdir "${MNT}"/boot
+mount -t zfs bpool/rhel/root "${MNT}"/boot
+mkdir -p "${MNT}"/var/log
+mkdir -p "${MNT}"/var/lib
+mount -t zfs rpool/rhel/var/lib "${MNT}"/var/lib
+mount -t zfs rpool/rhel/var/log "${MNT}"/var/log
+
Format and mount ESP
+for i in ${DISK}; do
+ mkfs.vfat -n EFI "${i}"-part1
+ mkdir -p "${MNT}"/boot/efis/"${i##*/}"-part1
+ mount -t vfat -o iocharset=iso8859-1 "${i}"-part1 "${MNT}"/boot/efis/"${i##*/}"-part1
+done
+
+mkdir -p "${MNT}"/boot/efi
+mount -t vfat -o iocharset=iso8859-1 "$(echo "${DISK}" | sed "s|^ *||" | cut -f1 -d' '|| true)"-part1 "${MNT}"/boot/efi
+
Download and extract minimal Rocky Linux (RHEL-compatible) root filesystem:
+apk add curl
+curl --fail-early --fail -L \
+https://dl.rockylinux.org/pub/rocky/9.2/images/x86_64/Rocky-9-Container-Base-9.2-20230513.0.x86_64.tar.xz \
+-o rootfs.tar.gz
+curl --fail-early --fail -L \
+https://dl.rockylinux.org/pub/rocky/9.2/images/x86_64/Rocky-9-Container-Base-9.2-20230513.0.x86_64.tar.xz.CHECKSUM \
+-o checksum
+
+# BusyBox sha256sum treats all lines in the checksum file
+# as checksums and requires two spaces " "
+# between filename and checksum
+
+grep 'Container-Base' checksum \
+| grep '^SHA256' \
+| sed -E 's|.*= ([a-z0-9]*)$|\1 rootfs.tar.gz|' > ./sha256checksum
+
+sha256sum -c ./sha256checksum
+
+tar x -C "${MNT}" -af rootfs.tar.gz
+
Enable community repo
+sed -i '/edge/d' /etc/apk/repositories
+sed -i -E 's/#(.*)community/\1community/' /etc/apk/repositories
+
Generate fstab:
+apk add arch-install-scripts
+genfstab -t PARTUUID "${MNT}" \
+| grep -v swap \
+| sed "s|vfat.*rw|vfat rw,x-systemd.idle-timeout=1min,x-systemd.automount,noauto,nofail|" \
+> "${MNT}"/etc/fstab
+
Chroot
+cp /etc/resolv.conf "${MNT}"/etc/resolv.conf
+for i in /dev /proc /sys; do mkdir -p "${MNT}"/"${i}"; mount --rbind "${i}" "${MNT}"/"${i}"; done
+chroot "${MNT}" /usr/bin/env DISK="${DISK}" bash
+
Unset all shell aliases, which can interfere with installation:
+unalias -a
+
Install base packages
+dnf -y install --allowerasing @core grub2-efi-x64 \
+grub2-pc grub2-pc-modules grub2-efi-x64-modules shim-x64 \
+efibootmgr kernel-core
+
Install ZFS packages:
+dnf install -y https://zfsonlinux.org/epel/zfs-release-2-3"$(rpm --eval "%{dist}"|| true)".noarch.rpm
+dnf config-manager --disable zfs
+dnf config-manager --enable zfs-kmod
+dnf install -y zfs zfs-dracut
+
Add zfs modules to dracut:
+echo 'add_dracutmodules+=" zfs "' >> /etc/dracut.conf.d/zfs.conf
+echo 'force_drivers+=" zfs "' >> /etc/dracut.conf.d/zfs.conf
+
Add other drivers to dracut:
+if grep mpt3sas /proc/modules; then
+ echo 'force_drivers+=" mpt3sas "' >> /etc/dracut.conf.d/zfs.conf
+fi
+if grep virtio_blk /proc/modules; then
+ echo 'filesystems+=" virtio_blk "' >> /etc/dracut.conf.d/fs.conf
+fi
+
Build initrd:
+find -D exec /lib/modules -maxdepth 1 \
+-mindepth 1 -type d \
+-exec sh -vxc \
+'if test -e "$1"/modules.dep;
+ then kernel=$(basename "$1");
+ dracut --verbose --force --kver "${kernel}";
+ fi' sh {} \;
+
For SELinux, relabel filesystem on reboot:
+fixfiles -F onboot
+
Generate host id:
+zgenhostid -f -o /etc/hostid
+
Install locale package, example for English locale:
+dnf install -y glibc-minimal-langpack glibc-langpack-en
+
Set locale, keymap, timezone, hostname
+rm -f /etc/localtime
+systemd-firstboot \
+--force \
+--locale=en_US.UTF-8 \
+--timezone=Etc/UTC \
+--hostname=testhost \
+--keymap=us
+
Set root passwd
+printf 'root:yourpassword' | chpasswd
+
Apply GRUB workaround
+echo 'export ZPOOL_VDEV_NAME_PATH=YES' >> /etc/profile.d/zpool_vdev_name_path.sh
+# shellcheck disable=SC1091
+. /etc/profile.d/zpool_vdev_name_path.sh
+
+# GRUB fails to detect rpool name, hard code as "rpool"
+sed -i "s|rpool=.*|rpool=rpool|" /etc/grub.d/10_linux
+
This workaround needs to be applied for every GRUB update, as the +update will overwrite the changes.
+RHEL uses Boot Loader Specification module for GRUB, +which does not support ZFS. Disable it:
+echo 'GRUB_ENABLE_BLSCFG=false' >> /etc/default/grub
+
This means that you need to regenerate the GRUB menu and mirror it +after every kernel update; otherwise the computer will still boot the old +kernel on reboot.
+Install GRUB:
+mkdir -p /boot/efi/rocky/grub-bootdir/i386-pc/
+for i in ${DISK}; do
+ grub2-install --target=i386-pc --boot-directory \
+ /boot/efi/rocky/grub-bootdir/i386-pc/ "${i}"
+done
+dnf reinstall -y grub2-efi-x64 shim-x64
+cp -r /usr/lib/grub/x86_64-efi/ /boot/efi/EFI/rocky/
+
Generate GRUB menu:
+mkdir -p /boot/grub2
+grub2-mkconfig -o /boot/grub2/grub.cfg
+cp /boot/grub2/grub.cfg \
+ /boot/efi/efi/rocky/grub.cfg
+cp /boot/grub2/grub.cfg \
+ /boot/efi/rocky/grub-bootdir/i386-pc/grub2/grub.cfg
+
For both legacy and EFI booting: mirror ESP content:
+espdir=$(mktemp -d)
+find /boot/efi/ -maxdepth 1 -mindepth 1 -type d -print0 \
+| xargs -t -0I '{}' cp -r '{}' "${espdir}"
+find "${espdir}" -maxdepth 1 -mindepth 1 -type d -print0 \
+| xargs -t -0I '{}' sh -vxc "find /boot/efis/ -maxdepth 1 -mindepth 1 -type d -print0 | xargs -t -0I '[]' cp -r '{}' '[]'"
+
Exit chroot
+exit
+
Unmount filesystems and create initial system snapshot +You can later create a boot environment from this snapshot. +See Root on ZFS maintenance page.
+umount -Rl "${MNT}"
+zfs snapshot -r rpool@initial-installation
+zfs snapshot -r bpool@initial-installation
+
Export all pools
+zpool export -a
+
Reboot
+reboot
+
For BIOS-legacy boot users only: the GRUB bootloader installed +might be unusable. In this case, see Bootloader Recovery section +in Root on ZFS maintenance page.
+This issue is not related to Alpine Linux chroot, as Arch Linux +installed with this method does not have this issue.
+UEFI bootloader is not affected by this issue.
+Install package groups
+dnf group list --hidden -v # query package groups
+dnf group install gnome-desktop
+
Add new user, configure swap.
DKMS and kABI-tracking kmod style packages are provided for x86_64 RHEL- +and CentOS-based distributions from the OpenZFS repository. These packages +are updated as new versions are released. Only the repository for the current +minor version of each current major release is updated with new packages.
To simplify installation, a zfs-release package is provided which includes +a zfs.repo configuration file and public signing key. All official OpenZFS +packages are signed using this key, and by default yum or dnf will verify a +package’s signature before allowing it to be installed. Users are strongly +encouraged to verify the authenticity of the OpenZFS public key using +the fingerprint listed here.
+For EL7 run:
+yum install https://zfsonlinux.org/epel/zfs-release-2-3$(rpm --eval "%{dist}").noarch.rpm
+
and for EL8 and 9:
+dnf install https://zfsonlinux.org/epel/zfs-release-2-3$(rpm --eval "%{dist}").noarch.rpm
+
After installing the zfs-release package and verifying the public key +users can opt to install either the DKMS or kABI-tracking kmod style packages. +DKMS packages are recommended for users running a non-distribution kernel or +for users who wish to apply local customizations to OpenZFS. For most users +the kABI-tracking kmod packages are recommended in order to avoid needing to +rebuild OpenZFS for every kernel update.
+To install DKMS style packages issue the following commands. First add the +EPEL repository which provides DKMS by installing the epel-release +package, then the kernel-devel and zfs packages. Note that it is +important to make sure that the matching kernel-devel package is installed +for the running kernel since DKMS requires it to build OpenZFS.
+For EL6 and 7, separately run:
+yum install -y epel-release
+yum install -y kernel-devel
+yum install -y zfs
+
And for EL8 and newer, separately run:
+dnf install -y epel-release
+dnf install -y kernel-devel
+dnf install -y zfs
+
Note
+When switching from DKMS to kABI-tracking kmods first uninstall the +existing DKMS packages. This should remove the kernel modules for all +installed kernels, then the kABI-tracking kmods can be installed as +described in the section below.
+By default the zfs-release package is configured to install DKMS style +packages so they will work with a wide range of kernels. In order to +install the kABI-tracking kmods the default repository must be switched +from zfs to zfs-kmod. Keep in mind that the kABI-tracking kmods are +only verified to work with the distribution-provided, non-Stream kernel.
+For EL6 and 7 run:
+yum-config-manager --disable zfs
+yum-config-manager --enable zfs-kmod
+yum install zfs
+
And for EL8 and newer:
+dnf config-manager --disable zfs
+dnf config-manager --enable zfs-kmod
+dnf install zfs
+
By default the OpenZFS kernel modules are automatically loaded when a ZFS
+pool is detected. If you would prefer to always load the modules at boot
+time, you can create such a configuration in /etc/modules-load.d:
echo zfs >/etc/modules-load.d/zfs.conf
+
Note
+When updating to a new EL minor release the existing kmod +packages may not work due to upstream kABI changes in the kernel. +The configuration of the current release package may have already made an +updated package available, but the package manager may not know to install +that package if the version number isn’t newer. When upgrading, users +should verify that the kmod-zfs package is providing suitable kernel +modules, reinstalling the kmod-zfs package if necessary.
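+One way to sanity-check this after an update is to compare the running kernel with what the installed kmod was built against, and reinstall if they disagree (a rough sketch, not an official procedure):
+uname -r
+modinfo zfs | grep -E '^(vermagic|version)'
+dnf reinstall -y kmod-zfs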
+The current release package uses “${releasever}” rather than specify a particular +minor release as previous release packages did. Typically “${releasever}” will +resolve to just the major version (e.g. 8), and the resulting repository URL +will be aliased to the current minor version (e.g. 8.7), but you can specify +–releasever to use previous repositories.
+[vagrant@localhost ~]$ dnf list available --showduplicates kmod-zfs
+Last metadata expiration check: 0:00:08 ago on tor 31 jan 2023 17:50:05 UTC.
+Available Packages
+kmod-zfs.x86_64 2.1.6-1.el8 zfs-kmod
+kmod-zfs.x86_64 2.1.7-1.el8 zfs-kmod
+kmod-zfs.x86_64 2.1.8-1.el8 zfs-kmod
+kmod-zfs.x86_64 2.1.9-1.el8 zfs-kmod
+[vagrant@localhost ~]$ dnf list available --showduplicates --releasever=8.6 kmod-zfs
+Last metadata expiration check: 0:16:13 ago on tor 31 jan 2023 17:34:10 UTC.
+Available Packages
+kmod-zfs.x86_64 2.1.4-1.el8 zfs-kmod
+kmod-zfs.x86_64 2.1.5-1.el8 zfs-kmod
+kmod-zfs.x86_64 2.1.5-2.el8 zfs-kmod
+kmod-zfs.x86_64 2.1.6-1.el8 zfs-kmod
+[vagrant@localhost ~]$
+
In the above example, the former packages were built for EL8.7, and the latter for EL8.6.
+In addition to the primary zfs repository a zfs-testing repository +is available. This repository, which is disabled by default, contains +the latest version of OpenZFS which is under active development. These +packages are made available in order to get feedback from users regarding +the functionality and stability of upcoming releases. These packages +should not be used on production systems. Packages from the testing +repository can be installed as follows.
+For EL6 and 7 run:
+yum-config-manager --enable zfs-testing
+yum install kernel-devel zfs
+
And for EL8 and newer:
+dnf config-manager --enable zfs-testing
+dnf install kernel-devel zfs
+
Note
+Use zfs-testing for DKMS packages and zfs-testing-kmod +for kABI-tracking kmod packages.
+See Ubuntu 20.04 Root on ZFS for new +installs. This guide is no longer receiving most updates. It continues +to exist for reference for existing installs that followed it.
This HOWTO uses a whole physical disk.
Do not use these instructions for dual-booting.
Backup your data. Any existing data will be lost.
Ubuntu 18.04.3 (“Bionic”) Desktop +CD +(not any server images)
Installing on a drive which presents 4 KiB logical sectors (a “4Kn” +drive) only works with UEFI booting. This not unique to ZFS. GRUB +does not and will not work on 4Kn with legacy (BIOS) +booting.
Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of +memory is recommended for normal performance in basic workloads. If you +wish to use deduplication, you will need massive amounts of +RAM. Enabling +deduplication is a permanent change that cannot be easily reverted.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @rlaager.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo apt install python3-pip
+
+pip3 install -r docs/requirements.txt
+
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff to a branch, git push, and create a pull
+request. Mention @rlaager.
This guide supports two different encryption options: unencrypted and +LUKS (full-disk encryption). With either option, all ZFS features are fully +available. ZFS native encryption is not available in Ubuntu 18.04.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+LUKS encrypts almost everything. The only unencrypted data is the bootloader, +kernel, and initrd. The system cannot boot without the passphrase being +entered at the console. Performance is good, but LUKS sits underneath ZFS, so +if multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+1.1 Boot the Ubuntu Live CD. Select Try Ubuntu. Connect your system to +the Internet as appropriate (e.g. join your WiFi network). Open a +terminal (press Ctrl-Alt-T).
+1.2 Setup and update the repositories:
+sudo apt-add-repository universe
+sudo apt update
+
1.3 Optional: Install and start the OpenSSH server in the Live CD +environment:
+If you have a second system, using SSH to access the target system can +be convenient:
+passwd
+# There is no current password; hit enter at that prompt.
+sudo apt install --yes openssh-server
+
Hint: You can find your IP address with
+ip addr show scope global | grep inet. Then, from your main machine,
+connect with ssh ubuntu@IP.
1.4 Become root:
+sudo -i
+
1.5 Install ZFS in the Live CD environment:
+apt install --yes debootstrap gdisk zfs-initramfs
+
2.1 Set a variable with the disk name:
+DISK=/dev/disk/by-id/scsi-SATA_disk1
+
Always use the long /dev/disk/by-id/*
aliases with ZFS. Using the
+/dev/sd*
device nodes directly can cause sporadic import failures,
+especially on systems that have more than one storage pool.
Hints:
+ls -la /dev/disk/by-id
will list the aliases.
Are you doing this in a virtual machine? If your virtual disk is
+missing from /dev/disk/by-id
, use /dev/vda
if you are using
+KVM with virtio; otherwise, read the
+troubleshooting section.
For a mirror or raidz topology, use DISK1
, DISK2
, etc.
When choosing a boot pool size, consider how you will use the space. A kernel +and initrd may consume around 100M. If you have multiple kernels and take +snapshots, you may find yourself low on boot pool space, especially if you +need to regenerate your initramfs images, which may be around 85M each. Size +your boot pool appropriately for your needs.
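Once the system is installed, a minimal sketch of keeping an eye on boot pool usage and reclaiming space from old kernels (the commands shown are illustrative):

zfs list -o name,used,avail bpool
ls -lh /boot/vmlinuz-* /boot/initrd.img-*
# Remove kernels that are no longer needed (and their initrds):
apt autoremove --purge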
2.2 If you are re-using a disk, clear it as necessary:
+If the disk was previously used in an MD array, zero the superblock:
+apt install --yes mdadm
+mdadm --zero-superblock --force $DISK
+
Clear the partition table:
+sgdisk --zap-all $DISK
+
2.3 Partition your disk(s):
+Run this if you need legacy (BIOS) booting:
+sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK
+
Run this for UEFI booting (for use now or in the future):
+sgdisk -n2:1M:+512M -t2:EF00 $DISK
+
Run this for the boot pool:
+sgdisk -n3:0:+1G -t3:BF01 $DISK
+
Choose one of the following options:
+2.3a Unencrypted:
+sgdisk -n4:0:0 -t4:BF01 $DISK
+
2.3b LUKS:
+sgdisk -n4:0:0 -t4:8300 $DISK
+
If you are creating a mirror or raidz topology, repeat the partitioning +commands for all the disks which will be part of the pool.
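For example, a sketch of repeating the partitioning for a two-disk mirror, assuming DISK1 and DISK2 hold the /dev/disk/by-id paths of the disks:

for D in $DISK1 $DISK2 ; do
    sgdisk -a1 -n1:24K:+1000K -t1:EF02 $D   # legacy (BIOS) boot partition, if needed
    sgdisk -n2:1M:+512M -t2:EF00 $D         # EFI system partition
    sgdisk -n3:0:+1G -t3:BF01 $D            # boot pool partition
    sgdisk -n4:0:0 -t4:BF01 $D              # root pool partition (use -t4:8300 for LUKS)
done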
+2.4 Create the boot pool:
+zpool create -o ashift=12 -d \
+ -o feature@async_destroy=enabled \
+ -o feature@bookmarks=enabled \
+ -o feature@embedded_data=enabled \
+ -o feature@empty_bpobj=enabled \
+ -o feature@enabled_txg=enabled \
+ -o feature@extensible_dataset=enabled \
+ -o feature@filesystem_limits=enabled \
+ -o feature@hole_birth=enabled \
+ -o feature@large_blocks=enabled \
+ -o feature@lz4_compress=enabled \
+ -o feature@spacemap_histogram=enabled \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 -O devices=off \
+ -O normalization=formD -O relatime=on -O xattr=sa \
+ -O mountpoint=/ -R /mnt bpool ${DISK}-part3
+
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See
+spa_feature_names
in
+grub-core/fs/zfs/zfs.c.
+This step creates a separate boot pool for /boot
with the features
+limited to only those that GRUB supports, allowing the root pool to use
+any/all features. Note that GRUB opens the pool read-only, so all
+read-only compatible features are “supported” by GRUB.
Hints:
+If you are creating a mirror or raidz topology, create the pool using
+zpool create ... bpool mirror /dev/disk/by-id/scsi-SATA_disk1-part3 /dev/disk/by-id/scsi-SATA_disk2-part3
+(or replace mirror
with raidz
, raidz2
, or raidz3
and
+list the partitions from additional disks).
The pool name is arbitrary. If changed, the new name must be used
+consistently. The bpool
convention originated in this HOWTO.
Feature Notes:
+As a read-only compatible feature, the userobj_accounting
feature should
+be compatible in theory, but in practice, GRUB can fail with an “invalid
+dnode type” error. This feature does not matter for /boot
anyway.
2.5 Create the root pool:
+Choose one of the following options:
+2.5a Unencrypted:
+zpool create -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \
+ -O mountpoint=/ -R /mnt rpool ${DISK}-part4
+
2.5b LUKS:
+cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK}-part4
+cryptsetup luksOpen ${DISK}-part4 luks1
+zpool create -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \
+ -O mountpoint=/ -R /mnt rpool /dev/mapper/luks1
+
Notes:
+The use of ashift=12
is recommended here because many drives
+today have 4 KiB (or larger) physical sectors, even though they
+present 512 B logical sectors. Also, a future replacement drive may
+have 4 KiB physical sectors (in which case ashift=12
is desirable)
+or 4 KiB logical sectors (in which case ashift=12
is required).
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires
+ACLs
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only
+filenames.
recordsize
is unset (leaving it at the default of 128 KiB). If you want to
+tune it (e.g. -O recordsize=1M
), see these various blog
+posts.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s
+documentation
+for further information.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI
+applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain
+controller.
+Note that xattr=sa
is
+Linux-specific.
+If you move your xattr=sa
pool to another OpenZFS implementation
+besides ZFS-on-Linux, extended attributes will not be readable
+(though your data will be). If portability of extended attributes is
+important to you, omit the -O xattr=sa
above. Even if you do not
+want xattr=sa
for the whole pool, it is probably fine to use it
+for /var/log
.
Make sure to include the -part4
portion of the drive path. If you
+forget that, you are specifying the whole disk, which ZFS will then
+re-partition, and you will lose the bootloader partition(s).
For LUKS, the key size chosen is 512 bits. However, XTS mode requires
+two keys, so the LUKS key is split in half. Thus, -s 512
means
+AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup +FAQ +for guidance.
Hints:
+If you are creating a mirror or raidz topology, create the pool using
+zpool create ... rpool mirror /dev/disk/by-id/scsi-SATA_disk1-part4 /dev/disk/by-id/scsi-SATA_disk2-part4
+(or replace mirror
with raidz
, raidz2
, or raidz3
and
+list the partitions from additional disks). For LUKS, use
+/dev/mapper/luks1
, /dev/mapper/luks2
, etc., which you will
+have to create using cryptsetup
.
The pool name is arbitrary. If changed, the new name must be used
+consistently. On systems that can automatically install to ZFS, the
+root pool is named rpool
by default.
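For the LUKS hint above, a minimal sketch of preparing a second disk for a mirror, assuming DISK2 holds its /dev/disk/by-id path:

cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK2}-part4
cryptsetup luksOpen ${DISK2}-part4 luks2
# Then create the pool on the mapper devices, e.g.:
# zpool create ... rpool mirror /dev/mapper/luks1 /dev/mapper/luks2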
3.1 Create filesystem datasets to act as containers:
+zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+zfs create -o canmount=off -o mountpoint=none bpool/BOOT
+
On Solaris systems, the root filesystem is cloned and the suffix is
+incremented for major system changes through pkg image-update
or
+beadm
. Similar functionality has been implemented in Ubuntu 20.04 with the
+zsys
tool, though its dataset layout is more complicated. Even without
+such a tool, the rpool/ROOT and bpool/BOOT containers can still be used
+for manually created clones.
3.2 Create filesystem datasets for the root and boot filesystems:
+zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/ubuntu
+zfs mount rpool/ROOT/ubuntu
+
+zfs create -o canmount=noauto -o mountpoint=/boot bpool/BOOT/ubuntu
+zfs mount bpool/BOOT/ubuntu
+
With ZFS, it is not normally necessary to use a mount command (either
+mount
or zfs mount
). This situation is an exception because of
+canmount=noauto
.
3.3 Create datasets:
+zfs create rpool/home
+zfs create -o mountpoint=/root rpool/home/root
+zfs create -o canmount=off rpool/var
+zfs create -o canmount=off rpool/var/lib
+zfs create rpool/var/log
+zfs create rpool/var/spool
+
The datasets below are optional, depending on your preferences and/or +software choices.
+If you wish to exclude these from snapshots:
+zfs create -o com.sun:auto-snapshot=false rpool/var/cache
+zfs create -o com.sun:auto-snapshot=false rpool/var/tmp
+chmod 1777 /mnt/var/tmp
+
If you use /opt on this system:
+zfs create rpool/opt
+
If you use /srv on this system:
+zfs create rpool/srv
+
If you use /usr/local on this system:
+zfs create -o canmount=off rpool/usr
+zfs create rpool/usr/local
+
If this system will have games installed:
+zfs create rpool/var/games
+
If this system will store local email in /var/mail:
+zfs create rpool/var/mail
+
If this system will use Snap packages:
+zfs create rpool/var/snap
+
If you use /var/www on this system:
+zfs create rpool/var/www
+
If this system will use GNOME:
+zfs create rpool/var/lib/AccountsService
+
If this system will use Docker (which manages its own datasets & +snapshots):
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker
+
If this system will use NFS (locking):
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/nfs
+
A tmpfs is recommended later, but if you want a separate dataset for
+/tmp
:
zfs create -o com.sun:auto-snapshot=false rpool/tmp
+chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user data.
+This allows the root filesystem to be rolled back without rolling back user
+data. The com.sun:auto-snapshot
setting is used by some ZFS
+snapshot utilities to exclude transient data.
If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for
+/tmp
, as shown above. This keeps the /tmp
data out of snapshots
+of your root filesystem. It also allows you to set a quota on
+rpool/tmp
, if you want to limit the maximum space used. Otherwise,
+you can use a tmpfs (RAM filesystem) later.
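For example, to cap the separate /tmp dataset (the 2G value is only illustrative):

zfs set quota=2G rpool/tmp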
3.4 Install the minimal system:
+debootstrap bionic /mnt
+zfs set devices=off rpool
+
The debootstrap
command leaves the new system in an unconfigured
+state. An alternative to using debootstrap
is to copy the entirety
+of a working system into the new ZFS root.
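A minimal sketch of that alternative, assuming the source system is mounted read-only at /source (a hypothetical path; this is not the method used in the rest of this HOWTO):

apt install --yes rsync
rsync -aHAXx --info=progress2 /source/ /mnt/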
4.1 Configure the hostname:
+Replace HOSTNAME
with the desired hostname:
echo HOSTNAME > /mnt/etc/hostname
+vi /mnt/etc/hosts
+
Add a line:
+127.0.1.1 HOSTNAME
+or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
4.2 Configure the network interface:
+Find the interface name:
+ip addr show
+
Adjust NAME below to match your interface name:
+vi /mnt/etc/netplan/01-netcfg.yaml
+
network:
+ version: 2
+ ethernets:
+ NAME:
+ dhcp4: true
+
Customize this file if the system is not a DHCP client.
+4.3 Configure the package sources:
+vi /mnt/etc/apt/sources.list
+
deb http://archive.ubuntu.com/ubuntu bionic main restricted universe multiverse
+deb http://archive.ubuntu.com/ubuntu bionic-updates main restricted universe multiverse
+deb http://archive.ubuntu.com/ubuntu bionic-backports main restricted universe multiverse
+deb http://security.ubuntu.com/ubuntu bionic-security main restricted universe multiverse
+
4.4 Bind the virtual filesystems from the LiveCD environment to the new
+system and chroot
into it:
mount --rbind /dev /mnt/dev
+mount --rbind /proc /mnt/proc
+mount --rbind /sys /mnt/sys
+chroot /mnt /usr/bin/env DISK=$DISK bash --login
+
Note: This is using --rbind
, not --bind
.
4.5 Configure a basic system environment:
+ln -s /proc/self/mounts /etc/mtab
+apt update
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available:
dpkg-reconfigure locales
+dpkg-reconfigure tzdata
+
If you prefer nano
over vi
, install it:
apt install --yes nano
+
4.6 Install ZFS in the chroot environment for the new system:
+apt install --yes --no-install-recommends linux-image-generic
+apt install --yes zfs-initramfs
+
Hint: For the HWE kernel, install linux-image-generic-hwe-18.04
+instead of linux-image-generic
.
4.7 For LUKS installs only, setup /etc/crypttab
:
apt install --yes cryptsetup
+
+echo luks1 UUID=$(blkid -s UUID -o value ${DISK}-part4) none \
+ luks,discard,initramfs > /etc/crypttab
+
The use of initramfs is a work-around because cryptsetup does not support ZFS.
Hint: If you are creating a mirror or raidz topology, repeat the
+/etc/crypttab
entries for luks2
, etc. adjusting for each disk.
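For example, a second entry might look like this, assuming DISK2 holds the second disk’s /dev/disk/by-id path:

echo luks2 UUID=$(blkid -s UUID -o value ${DISK2}-part4) none \
    luks,discard,initramfs >> /etc/crypttab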
4.8 Install GRUB
+Choose one of the following options:
+4.8a Install GRUB for legacy (BIOS) booting:
+apt install --yes grub-pc
+
Select (using the space bar) all of the disks (not partitions) in your pool.
+4.8b Install GRUB for UEFI booting:
+apt install dosfstools
+mkdosfs -F 32 -s 1 -n EFI ${DISK}-part2
+mkdir /boot/efi
+echo PARTUUID=$(blkid -s PARTUUID -o value ${DISK}-part2) \
+ /boot/efi vfat nofail,x-systemd.device-timeout=1 0 1 >> /etc/fstab
+mount /boot/efi
+apt install --yes grub-efi-amd64-signed shim-signed
+
Notes:
+The -s 1
for mkdosfs
is only necessary for drives which present
+4 KiB logical sectors (“4Kn” drives) to meet the minimum cluster size
+(given the partition size of 512 MiB) for FAT32. It also works fine on
+drives which present 512 B sectors.
For a mirror or raidz topology, this step only installs GRUB on the +first disk. The other disk(s) will be handled later.
4.9 (Optional): Remove os-prober:
+apt purge --yes os-prober
+
This avoids error messages from update-grub. os-prober is only necessary +in dual-boot configurations.
+4.10 Set a root password:
+passwd
+
4.11 Enable importing bpool
+This ensures that bpool
is always imported, regardless of whether
+/etc/zfs/zpool.cache
exists, whether it is in the cachefile or not,
+or whether zfs-import-scan.service
is enabled.
vi /etc/systemd/system/zfs-import-bpool.service
+
[Unit]
+DefaultDependencies=no
+Before=zfs-import-scan.service
+Before=zfs-import-cache.service
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStart=/sbin/zpool import -N -o cachefile=none bpool
+
+[Install]
+WantedBy=zfs-import.target
+
systemctl enable zfs-import-bpool.service
+
4.12 Optional (but recommended): Mount a tmpfs to /tmp
If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+systemctl enable tmp.mount
+
4.13 Setup system groups:
+addgroup --system lpadmin
+addgroup --system sambashare
+
5.1 Verify that the ZFS boot filesystem is recognized:
+grub-probe /boot
+
5.2 Refresh the initrd files:
+update-initramfs -c -k all
+
Note: When using LUKS, this will print “WARNING could not determine +root device from /etc/fstab”. This is because cryptsetup does not +support ZFS.
+5.3 Workaround GRUB’s missing zpool-features support:
+vi /etc/default/grub
+# Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/ubuntu"
+
5.4 Optional (but highly recommended): Make debugging GRUB easier:
+vi /etc/default/grub
+# Comment out: GRUB_TIMEOUT_STYLE=hidden
+# Set: GRUB_TIMEOUT=5
+# Below GRUB_TIMEOUT, add: GRUB_RECORDFAIL_TIMEOUT=5
+# Remove quiet and splash from: GRUB_CMDLINE_LINUX_DEFAULT
+# Uncomment: GRUB_TERMINAL=console
+# Save and quit.
+
Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired.
+5.5 Update the boot configuration:
+update-grub
+
Note: Ignore errors from os-prober
, if present.
5.6 Install the boot loader:
+5.6a For legacy (BIOS) booting, install GRUB to the MBR:
+grub-install $DISK
+
Note that you are installing GRUB to the whole disk, not a partition.
+If you are creating a mirror or raidz topology, repeat the
+grub-install
command for each disk in the pool.
5.6b For UEFI booting, install GRUB:
+grub-install --target=x86_64-efi --efi-directory=/boot/efi \
+ --bootloader-id=ubuntu --recheck --no-floppy
+
It is not necessary to specify the disk here. If you are creating a +mirror or raidz topology, the additional disks will be handled later.
+5.7 Fix filesystem mount ordering:
+Until ZFS gains a systemd mount
+generator, there are
+races between mounting filesystems and starting certain daemons. In
+practice, the issues (e.g.
+#5754) seem to be
+with certain filesystems in /var
, specifically /var/log
and
+/var/tmp
. Setting these to use legacy
mounting, and listing them
+in /etc/fstab
makes systemd aware that these are separate
+mountpoints. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
+feature of systemd automatically use After=var-tmp.mount
.
Until there is support for mounting /boot
in the initramfs, we also
+need to mount that, because it was marked canmount=noauto
. Also,
+with UEFI, we need to ensure it is mounted before its child filesystem
+/boot/efi
.
rpool
is guaranteed to be imported by the initramfs, so there is no
+point in adding x-systemd.requires=zfs-import.target
to those
+filesystems.
For UEFI booting, unmount /boot/efi first:
+umount /boot/efi
+
Everything else applies to both BIOS and UEFI booting:
+zfs set mountpoint=legacy bpool/BOOT/ubuntu
+echo bpool/BOOT/ubuntu /boot zfs \
+ nodev,relatime,x-systemd.requires=zfs-import-bpool.service 0 0 >> /etc/fstab
+
+zfs set mountpoint=legacy rpool/var/log
+echo rpool/var/log /var/log zfs nodev,relatime 0 0 >> /etc/fstab
+
+zfs set mountpoint=legacy rpool/var/spool
+echo rpool/var/spool /var/spool zfs nodev,relatime 0 0 >> /etc/fstab
+
If you created a /var/tmp dataset:
+zfs set mountpoint=legacy rpool/var/tmp
+echo rpool/var/tmp /var/tmp zfs nodev,relatime 0 0 >> /etc/fstab
+
If you created a /tmp dataset:
+zfs set mountpoint=legacy rpool/tmp
+echo rpool/tmp /tmp zfs nodev,relatime 0 0 >> /etc/fstab
+
6.1 Snapshot the initial installation:
+zfs snapshot bpool/BOOT/ubuntu@install
+zfs snapshot rpool/ROOT/ubuntu@install
+
In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space.
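For example, a sketch of taking dated snapshots before an upgrade and removing the install snapshots later:

zfs snapshot bpool/BOOT/ubuntu@pre-upgrade-$(date +%Y%m%d)
zfs snapshot rpool/ROOT/ubuntu@pre-upgrade-$(date +%Y%m%d)

# Later, to reclaim space:
zfs destroy bpool/BOOT/ubuntu@install
zfs destroy rpool/ROOT/ubuntu@install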
+6.2 Exit from the chroot
environment back to the LiveCD environment:
exit
+
6.3 Run these commands in the LiveCD environment to unmount all +filesystems:
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {}
+zpool export -a
+
6.4 Reboot:
+reboot
+
Wait for the newly installed system to boot normally. Login as root.
+6.5 Create a user account:
+Replace username
with your desired username:
zfs create rpool/home/username
+adduser username
+
+cp -a /etc/skel/. /home/username
+chown -R username:username /home/username
+usermod -a -G audio,cdrom,dip,floppy,netdev,plugdev,sudo,video username
+
6.6 Mirror GRUB
+If you installed to multiple disks, install GRUB on the additional +disks:
+6.6a For legacy (BIOS) booting:
+dpkg-reconfigure grub-pc
+Hit enter until you get to the device selection screen.
+Select (using the space bar) all of the disks (not partitions) in your pool.
+
6.6b For UEFI booting:
+umount /boot/efi
+
For the second and subsequent disks (increment ubuntu-2 to -3, etc.):
+dd if=/dev/disk/by-id/scsi-SATA_disk1-part2 \
+ of=/dev/disk/by-id/scsi-SATA_disk2-part2
+efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \
+ -p 2 -L "ubuntu-2" -l '\EFI\ubuntu\shimx64.efi'
+
+mount /boot/efi
+
Caution: On systems with extremely high memory pressure, using a +zvol for swap can result in lockup, regardless of how much swap is still +available. This issue is currently being investigated in: +https://github.com/zfsonlinux/zfs/issues/7734
+7.1 Create a volume dataset (zvol) for use as a swap device:
+zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \
+ -o logbias=throughput -o sync=always \
+ -o primarycache=metadata -o secondarycache=none \
+ -o com.sun:auto-snapshot=false rpool/swap
+
You can adjust the size (the 4G
part) to your needs.
The compression algorithm is set to zle
because it is the cheapest
+available algorithm. As this guide recommends ashift=12
(4 kiB
+blocks on disk), the common case of a 4 kiB page size means that no
+compression algorithm can reduce I/O. The exception is all-zero pages,
+which are dropped by ZFS; but some form of compression has to be enabled
+to get this behavior.
7.2 Configure the swap device:
+Caution: Always use long /dev/zvol
aliases in configuration
+files. Never use a short /dev/zdX
device name.
mkswap -f /dev/zvol/rpool/swap
+echo /dev/zvol/rpool/swap none swap discard 0 0 >> /etc/fstab
+echo RESUME=none > /etc/initramfs-tools/conf.d/resume
+
The RESUME=none
is necessary to disable resuming from hibernation.
+This does not work, as the zvol is not present (because the pool has not
+yet been imported) at the time the resume script runs. If it is not
+disabled, the boot process hangs for 30 seconds waiting for the swap
+zvol to appear.
7.3 Enable the swap device:
+swapon -av
+
8.1 Upgrade the minimal system:
+apt dist-upgrade --yes
+
8.2 Install a regular set of software:
+Choose one of the following options:
+8.2a Install a command-line environment only:
+apt install --yes ubuntu-standard
+
8.2b Install a full GUI environment:
+apt install --yes ubuntu-desktop
+vi /etc/gdm3/custom.conf
+# In the [daemon] section, add: InitialSetupEnable=false
+
Hint: If you are installing a full GUI environment, you will likely +want to manage your network with NetworkManager:
+rm /etc/netplan/01-netcfg.yaml
+vi /etc/netplan/01-network-manager-all.yaml
+
network:
+ version: 2
+ renderer: NetworkManager
+
8.3 Optional: Disable log compression:
+As /var/log
is already compressed by ZFS, logrotate’s compression is
+going to burn CPU and disk I/O for (in most cases) very little gain.
+Also, if you are making snapshots of /var/log
, logrotate’s
+compression will actually waste space, as the uncompressed data will
+live on in the snapshot. You can edit the files in /etc/logrotate.d
+by hand to comment out compress
, or use this loop (copy-and-paste
+highly recommended):
for file in /etc/logrotate.d/* ; do
+ if grep -Eq "(^|[^#y])compress" "$file" ; then
+ sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file"
+ fi
+done
+
8.4 Reboot:
+reboot
+
9.1 Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
+9.2 Optional: Delete the snapshots of the initial installation:
+sudo zfs destroy bpool/BOOT/ubuntu@install
+sudo zfs destroy rpool/ROOT/ubuntu@install
+
9.3 Optional: Disable the root password:
+sudo usermod -p '*' root
+
9.4 Optional: Re-enable the graphical boot process:
+If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer.
+sudo vi /etc/default/grub
+# Uncomment: GRUB_TIMEOUT_STYLE=hidden
+# Add quiet and splash to: GRUB_CMDLINE_LINUX_DEFAULT
+# Comment out: GRUB_TERMINAL=console
+# Save and quit.
+
+sudo update-grub
+
Note: Ignore errors from os-prober
, if present.
9.5 Optional: For LUKS installs only, backup the LUKS header:
+sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected +by your LUKS passphrase, but you may wish to use additional encryption.
+Hint: If you created a mirror or raidz topology, repeat this for
+each LUKS volume (luks2
, etc.).
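For example, for the second disk (hypothetical device name shown):

sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk2-part4 \
    --header-backup-file luks2-header.dat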
Go through Step 1: Prepare The Install +Environment.
+For LUKS, first unlock the disk(s):
+cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1
+# Repeat for additional disks, if this is a mirror or raidz topology.
+
Mount everything correctly:
+zpool export -a
+zpool import -N -R /mnt rpool
+zpool import -N -R /mnt bpool
+zfs mount rpool/ROOT/ubuntu
+zfs mount -a
+
If needed, you can chroot into your installed environment:
+mount --rbind /dev /mnt/dev
+mount --rbind /proc /mnt/proc
+mount --rbind /sys /mnt/sys
+chroot /mnt /bin/bash --login
+mount /boot/efi
+mount -a
+
Do whatever you need to do to fix your system.
+When done, cleanup:
+exit
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {}
+zpool export -a
+reboot
+
Most problem reports for this tutorial involve mpt2sas
hardware that
+does slow asynchronous drive initialization, like some IBM M1015 or
+OEM-branded cards that have been flashed to the reference LSI firmware.
The basic problem is that disks on these controllers are not visible to +the Linux kernel until after the regular system is started, and ZoL does +not hotplug pool members. See +https://github.com/zfsonlinux/zfs/issues/330.
+Most LSI cards are perfectly compatible with ZoL. If your card has this
+glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X
in
+/etc/default/zfs
. The system will wait X
seconds for all drives to
+appear before importing the pool.
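For example, a sketch that waits 15 seconds (an illustrative value; tune it for your controller) and rebuilds the initramfs so the setting is picked up at early boot (you can also edit the existing line in the file instead of appending):

echo "ZFS_INITRD_PRE_MOUNTROOT_SLEEP='15'" >> /etc/default/zfs
update-initramfs -u -k all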
Systems that require the arcsas
blob driver should add it to the
+/etc/initramfs-tools/modules
file and run
+update-initramfs -c -k all
.
Upgrade or downgrade the Areca driver if something like
+RIP: 0010:[<ffffffff8101b316>] [<ffffffff8101b316>] native_read_tsc+0x6/0x20
+appears anywhere in kernel log. ZoL is unstable on systems that emit
+this error message.
Set disk.EnableUUID = "TRUE"
in the vmx file or vsphere
+configuration. Doing this ensures that /dev/disk
aliases are
+created in the guest.
Set a unique serial number on each virtual disk using libvirt or qemu
+(e.g. -drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890
).
To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host:
+sudo apt install ovmf
+sudo vi /etc/libvirt/qemu.conf
+
Uncomment these lines:
+nvram = [
+ "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd"
+]
+
sudo systemctl restart libvirtd.service
+
See Ubuntu 22.04 Root on ZFS for Raspberry Pi for new installs. This guide +is no longer receiving most updates. It continues to exist for reference +for existing installs that followed it.
This HOWTO uses a whole physical disk.
Backup your data. Any existing data will be lost.
A Raspberry Pi 4 B. (If you are looking to install on a regular PC, see +Ubuntu 20.04 Root on ZFS.)
A microSD card or USB disk. For microSD card recommendations, see Jeff +Geerling’s performance comparison. +When using a USB enclosure, ensure it supports UASP.
An Ubuntu system (with the ability to write to the microSD card or USB disk) +other than the target Raspberry Pi.
4 GiB of memory is recommended. Do not use deduplication, as it needs massive +amounts of RAM. +Enabling deduplication is a permanent change that cannot be easily reverted.
+A Raspberry Pi 3 B/B+ would probably work (as the Pi 3 is 64-bit, though it +has less RAM), but has not been tested. Please report your results (good or +bad) using the issue link below.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @rlaager.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo apt install python3-pip
+
+pip3 install -r docs/requirements.txt
+
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request. Mention @rlaager.
WARNING: Encryption has not yet been tested on the Raspberry Pi.
+This guide supports three different encryption options: unencrypted, ZFS +native encryption, and LUKS. With any option, all ZFS features are fully +available.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+ZFS native encryption encrypts the data and most metadata in the root
+pool. It does not encrypt dataset or snapshot names or properties. The
+boot pool is not encrypted at all, but it only contains the bootloader,
+kernel, and initrd. (Unless you put a password in /etc/fstab
, the
+initrd is unlikely to contain sensitive data.) The system cannot boot
+without the passphrase being entered at the console. Performance is
+good. As the encryption happens in ZFS, even if multiple disks (mirror
+or raidz topologies) are used, the data only has to be encrypted once.
LUKS encrypts almost everything. The only unencrypted data is the bootloader, +kernel, and initrd. The system cannot boot without the passphrase being +entered at the console. Performance is good, but LUKS sits underneath ZFS, so +if multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+The Raspberry Pi 4 runs much faster using a USB Solid State Drive (SSD) than +a microSD card. These instructions can also be used to install Ubuntu on a +USB-connected SSD or other USB disk. USB disks have three requirements that +do not apply to microSD cards:
+The Raspberry Pi’s Bootloader EEPROM must be dated 2020-09-03 or later.
+To check the bootloader version, power up the Raspberry Pi without an SD
+card inserted or a USB boot device attached; the date will be on the
+bootloader
line. (If you do not see the bootloader
line, the
+bootloader is too old.) Alternatively, run sudo rpi-eeprom-update
+on an existing OS on the Raspberry Pi (which on Ubuntu requires
+apt install rpi-eeprom
).
If needed, the bootloader can be updated from an existing OS on the
+Raspberry Pi using rpi-eeprom-update -a
and rebooting.
+For other options, see Updating the Bootloader.
The Raspberry Pi must be configured for USB boot. The bootloader will show a
+boot
line; if order
includes 4
, USB boot is enabled.
If not already enabled, it can be enabled from an existing OS on the
+Raspberry Pi using rpi-eeprom-config -e
: set BOOT_ORDER=0xf41
+and reboot to apply the change. On subsequent reboots, USB boot will be
+enabled.
Otherwise, it can be enabled without an existing OS as follows:
+Download the Raspberry Pi Imager Utility.
Flash the USB Boot
image to a microSD card. The USB Boot
image is
+listed under Bootload
in the Misc utility images
folder.
Boot the Raspberry Pi from the microSD card. USB Boot should be enabled +automatically.
U-Boot on Ubuntu 20.04 does not seem to support the Raspberry Pi USB.
+Ubuntu 20.10 may work. As a
+work-around, the Raspberry Pi bootloader is configured to directly boot
+Linux. For this to work, the Linux kernel must not be compressed. These
+instructions decompress the kernel and add a script to
+/etc/kernel/postinst.d
to handle kernel upgrades.
The commands in this step are run on the system other than the Raspberry Pi.
+This guide has you go to some extra work so that the stock ext4 partition can +be deleted.
+Download and unpack the official image:
+curl -O https://cdimage.ubuntu.com/releases/20.04.4/release/ubuntu-20.04.4-preinstalled-server-arm64+raspi.img.xz
+xz -d ubuntu-20.04.4-preinstalled-server-arm64+raspi.img.xz
+
+# or combine them to decompress as you download:
+curl https://cdimage.ubuntu.com/releases/20.04.4/release/ubuntu-20.04.4-preinstalled-server-arm64+raspi.img.xz | \
+ xz -d > ubuntu-20.04.4-preinstalled-server-arm64+raspi.img
+
Dump the partition table for the image:
+sfdisk -d ubuntu-20.04.4-preinstalled-server-arm64+raspi.img
+
That will output this:
+label: dos
+label-id: 0xddbefb06
+device: ubuntu-20.04.4-preinstalled-server-arm64+raspi.img
+unit: sectors
+
+<name>.img1 : start= 2048, size= 524288, type=c, bootable
+<name>.img2 : start= 526336, size= 6285628, type=83
+
The important numbers are 524288 and 6285628. Store those in variables:
+BOOT=524288
+ROOT=6285628
+
Create a partition script:
+cat > partitions << EOF
+label: dos
+unit: sectors
+
+1 : start= 2048, size=$BOOT, type=c, bootable
+2 : start=$((2048+BOOT)), size=$ROOT, type=83
+3 : start=$((2048+BOOT+ROOT)), size=$ROOT, type=83
+EOF
+
Connect the disk:
+Connect the disk to a machine other than the target Raspberry Pi. If any
+filesystems are automatically mounted (e.g. by GNOME) unmount them.
+Determine the device name. For SD, the device name is almost certainly
+/dev/mmcblk0
. For USB SSDs, the device name is /dev/sdX
, where
+X
is a lowercase letter. lsblk
can help determine the device name.
+Set the DISK
environment variable to the device name:
DISK=/dev/mmcblk0 # microSD card
+DISK=/dev/sdX # USB disk
+
Because partitions are named differently for /dev/mmcblk0
and /dev/sdX
+devices, set a second variable used when working with partitions:
export DISKP=${DISK}p # microSD card
+export DISKP=${DISK} # USB disk ($DISKP == $DISK for /dev/sdX devices)
+
Hint: microSD cards connected using a USB reader also have /dev/sdX
+names.
WARNING: The following steps destroy the existing data on the disk. Ensure
+DISK
and DISKP
are correct before proceeding.
Ensure swap partitions are not in use:
+swapon -v
+# If a partition is in use from the disk, disable it:
+sudo swapoff THAT_PARTITION
+
Clear old ZFS labels:
+sudo zpool labelclear -f ${DISK}
+
If a ZFS label still exists from a previous system/attempt, expanding the +pool will result in an unbootable system.
+Hint: If you do not already have the ZFS utilities installed, you can
+install them with: sudo apt install zfsutils-linux
Alternatively, you
+can zero the entire disk with:
+sudo dd if=/dev/zero of=${DISK} bs=1M status=progress
Delete existing partitions:
+echo "label: dos" | sudo sfdisk ${DISK}
+sudo partprobe
+ls ${DISKP}*
+
Make sure there are no partitions, just the file for the disk itself. This +step is not strictly necessary; it exists to catch problems.
+Create the partitions:
+sudo sfdisk $DISK < partitions
+
Loopback mount the image:
+IMG=$(sudo losetup -fP --show \
+ ubuntu-20.04.4-preinstalled-server-arm64+raspi.img)
+
Copy the bootloader data:
+sudo dd if=${IMG}p1 of=${DISKP}1 bs=1M
+
Clear old label(s) from partition 2:
+sudo wipefs -a ${DISKP}2
+
If a filesystem with the writable
label from the Ubuntu image is still
+present in partition 2, the system will not boot initially.
Copy the root filesystem data:
+# NOTE: the destination is p3, not p2.
+sudo dd if=${IMG}p2 of=${DISKP}3 bs=1M status=progress conv=fsync
+
Unmount the image:
+sudo losetup -d $IMG
+
If setting up a USB disk:
+Decompress the kernel:
+sudo -sE
+
+MNT=$(mktemp -d /mnt/XXXXXXXX)
+mkdir -p $MNT/boot $MNT/root
+mount ${DISKP}1 $MNT/boot
+mount ${DISKP}3 $MNT/root
+
+zcat -qf $MNT/boot/vmlinuz >$MNT/boot/vmlinux
+
Modify boot config:
+cat >> $MNT/boot/usercfg.txt << EOF
+kernel=vmlinux
+initramfs initrd.img followkernel
+boot_delay
+EOF
+
Create a script to automatically decompress the kernel after an upgrade:
+cat >$MNT/root/etc/kernel/postinst.d/zz-decompress-kernel << 'EOF'
+#!/bin/sh
+
+set -eu
+
+echo "Updating decompressed kernel..."
+[ -e /boot/firmware/vmlinux ] && \
+ cp /boot/firmware/vmlinux /boot/firmware/vmlinux.bak
+vmlinuxtmp=$(mktemp /boot/firmware/vmlinux.XXXXXXXX)
+zcat -qf /boot/vmlinuz > "$vmlinuxtmp"
+mv "$vmlinuxtmp" /boot/firmware/vmlinux
+EOF
+
+chmod +x $MNT/root/etc/kernel/postinst.d/zz-decompress-kernel
+
Cleanup:
+umount $MNT/*
+rm -rf $MNT
+exit
+
Boot the Raspberry Pi.
+Move the SD/USB disk to the Raspberry Pi. Boot it and login (e.g. via SSH)
+with ubuntu
as the username and password. If you are using SSH, note
+that it takes a little bit for cloud-init to enable password logins on the
+first boot. Set a new password when prompted and login again using that
+password. If you have your local SSH configured to use ControlPersist
,
+you will have to kill the existing SSH process before logging in the second
+time.
Become root:
+sudo -i
+
Set the DISK and DISKP variables again:
+DISK=/dev/mmcblk0 # microSD card
+DISKP=${DISK}p # microSD card
+
+DISK=/dev/sdX # USB disk
+DISKP=${DISK} # USB disk
+
WARNING: Device names can change when moving a device to a different +computer or switching the microSD card from a USB reader to a built-in +slot. Double check the device name before continuing.
+Install ZFS:
+apt update
+
+apt install pv zfs-initramfs
+
Note: Since this is the first boot, you may get Waiting for cache
+lock
because unattended-upgrades
is running in the background.
+Wait for it to finish.
Create the root pool:
+Choose one of the following options:
+Unencrypted:
+zpool create \
+ -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISKP}2
+
WARNING: Encryption has not yet been tested on the Raspberry Pi.
+ZFS native encryption:
+zpool create \
+ -o ashift=12 \
+ -O encryption=aes-256-gcm \
+ -O keylocation=prompt -O keyformat=passphrase \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISKP}2
+
LUKS:
+cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISKP}2
+cryptsetup luksOpen ${DISKP}2 luks1
+zpool create \
+ -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool /dev/mapper/luks1
+
Notes:
+The use of ashift=12
is recommended here because many drives
+today have 4 KiB (or larger) physical sectors, even though they
+present 512 B logical sectors. Also, a future replacement drive may
+have 4 KiB physical sectors (in which case ashift=12
is desirable)
+or 4 KiB logical sectors (in which case ashift=12
is required).
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires ACLs
+Also, disabling ACLs apparently breaks umask handling with NFSv4.
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only filenames.
recordsize
is unset (leaving it at the default of 128 KiB). If you
+want to tune it (e.g. -O recordsize=1M
), see these various blog
+posts.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s documentation
+for further information.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain controller.
+Note that xattr=sa
is Linux-specific. If you move your
+xattr=sa
pool to another OpenZFS implementation besides ZFS-on-Linux,
+extended attributes will not be readable (though your data will be). If
+portability of extended attributes is important to you, omit the
+-O xattr=sa
above. Even if you do not want xattr=sa
for the whole
+pool, it is probably fine to use it for /var/log
.
Make sure to include the partition portion of the device path (e.g. ${DISKP}2 rather than ${DISK}). If you forget that, you are specifying the whole disk, which ZFS will then re-partition, and you will lose the bootloader partition(s).
ZFS native encryption defaults to aes-256-ccm
, but the default has
+changed upstream
+to aes-256-gcm
. AES-GCM seems to be generally preferred over AES-CCM,
+is faster now,
+and will be even faster in the future.
For LUKS, the key size chosen is 512 bits. However, XTS mode requires two
+keys, so the LUKS key is split in half. Thus, -s 512
means AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup FAQ +for guidance.
Create a filesystem dataset to act as a container:
+zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+
Create a filesystem dataset for the root filesystem:
+UUID=$(dd if=/dev/urandom bs=1 count=100 2>/dev/null |
+ tr -dc 'a-z0-9' | cut -c-6)
+
+zfs create -o canmount=noauto -o mountpoint=/ \
+ -o com.ubuntu.zsys:bootfs=yes \
+ -o com.ubuntu.zsys:last-used=$(date +%s) rpool/ROOT/ubuntu_$UUID
+zfs mount rpool/ROOT/ubuntu_$UUID
+
With ZFS, it is not normally necessary to use a mount command (either
+mount
or zfs mount
). This situation is an exception because of
+canmount=noauto
.
Create datasets:
+zfs create -o com.ubuntu.zsys:bootfs=no \
+ rpool/ROOT/ubuntu_$UUID/srv
+zfs create -o com.ubuntu.zsys:bootfs=no -o canmount=off \
+ rpool/ROOT/ubuntu_$UUID/usr
+zfs create rpool/ROOT/ubuntu_$UUID/usr/local
+zfs create -o com.ubuntu.zsys:bootfs=no -o canmount=off \
+ rpool/ROOT/ubuntu_$UUID/var
+zfs create rpool/ROOT/ubuntu_$UUID/var/games
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/AccountsService
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/apt
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/dpkg
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/NetworkManager
+zfs create rpool/ROOT/ubuntu_$UUID/var/log
+zfs create rpool/ROOT/ubuntu_$UUID/var/mail
+zfs create rpool/ROOT/ubuntu_$UUID/var/snap
+zfs create rpool/ROOT/ubuntu_$UUID/var/spool
+zfs create rpool/ROOT/ubuntu_$UUID/var/www
+
+zfs create -o canmount=off -o mountpoint=/ \
+ rpool/USERDATA
+zfs create -o com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_$UUID \
+ -o canmount=on -o mountpoint=/root \
+ rpool/USERDATA/root_$UUID
+
If you want a separate dataset for /tmp
:
zfs create -o com.ubuntu.zsys:bootfs=no \
+ rpool/ROOT/ubuntu_$UUID/tmp
+chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data.
+If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for /tmp
,
+as shown above. This keeps the /tmp
data out of snapshots of your root
+filesystem. It also allows you to set a quota on rpool/tmp
, if you want
+to limit the maximum space used. Otherwise, you can use a tmpfs (RAM
+filesystem) later.
Optional: Ignore synchronous requests:
+microSD cards are relatively slow. If you want to increase performance
+(especially when installing packages) at the cost of some safety, you can
+disable flushing of synchronous requests (e.g. fsync()
, O_[D]SYNC
):
Choose one of the following options:
+For the root filesystem, but not user data:
+zfs set sync=disabled rpool/ROOT
+
For everything:
+zfs set sync=disabled rpool
+
ZFS is transactional, so it will still be crash consistent. However, you
+should leave sync
at its default of standard
if this system needs
+to guarantee persistence (e.g. if it is a database or NFS server).
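If you do disable it, a sketch of restoring the default later, matching whichever option you chose above:

zfs inherit sync rpool/ROOT
# or, if you set it on the whole pool:
zfs inherit sync rpool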
Copy the system into the ZFS filesystems:
+(cd /; tar -cf - --one-file-system --warning=no-file-ignored .) | \
+ pv -p -bs $(du -sxm --apparent-size / | cut -f1)m | \
+ (cd /mnt ; tar -x)
+
Configure the hostname:
+Replace HOSTNAME
with the desired hostname:
hostname HOSTNAME
+hostname > /mnt/etc/hostname
+vi /mnt/etc/hosts
+
Add a line:
+127.0.1.1 HOSTNAME
+or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
Stop zed
:
systemctl stop zed
+
Bind the virtual filesystems from the running environment to the new
+ZFS environment and chroot
into it:
mount --make-private --rbind /boot/firmware /mnt/boot/firmware
+mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /run /mnt/run
+mount --make-private --rbind /sys /mnt/sys
+chroot /mnt /usr/bin/env DISK=$DISK UUID=$UUID bash --login
+
Configure a basic system environment:
+apt update
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available:
dpkg-reconfigure locales
+dpkg-reconfigure tzdata
+
For LUKS installs only, setup /etc/crypttab
:
# cryptsetup is already installed, but this marks it as manually
+# installed so it is not automatically removed.
+apt install --yes cryptsetup
+
+echo luks1 UUID=$(blkid -s UUID -o value ${DISKP}2) none \
+ luks,discard,initramfs > /etc/crypttab
+
The use of initramfs is a work-around because cryptsetup does not support ZFS.
Optional: Mount a tmpfs to /tmp
If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+systemctl enable tmp.mount
+
Setup system groups:
+addgroup --system lpadmin
+addgroup --system sambashare
+
Patch a dependency loop:
+For ZFS native encryption or LUKS:
+apt install --yes curl patch
+
+curl https://launchpadlibrarian.net/478315221/2150-fix-systemd-dependency-loops.patch | \
+ sed "s|/etc|/lib|;s|\.in$||" | (cd / ; patch -p1)
+
Ignore the failure in Hunk #2 (say n
twice).
This patch is from Bug #1875577 Encrypted swap won’t load on 20.04 with +zfs root.
+Fix filesystem mount ordering:
+We need to activate zfs-mount-generator
. This makes systemd aware of
+the separate mountpoints, which is important for things like /var/log
+and /var/tmp
. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
feature
+of systemd automatically use After=var-tmp.mount
.
mkdir /etc/zfs/zfs-list.cache
+touch /etc/zfs/zfs-list.cache/rpool
+ln -s /usr/lib/zfs-linux/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d
+zed -F &
+
Force a cache update:
+zfs set canmount=noauto rpool/ROOT/ubuntu_$UUID
+
Verify that zed
updated the cache by making sure this is not empty,
+which will take a few seconds:
cat /etc/zfs/zfs-list.cache/rpool
+
Stop zed
:
fg
+Press Ctrl-C.
+
Fix the paths to eliminate /mnt
:
sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
+
Remove old filesystem from /etc/fstab
:
vi /etc/fstab
+# Remove the old root filesystem line:
+# LABEL=writable / ext4 ...
+
Configure kernel command line:
+cp /boot/firmware/cmdline.txt /boot/firmware/cmdline.txt.bak
+sed -i "s|root=LABEL=writable rootfstype=ext4|root=ZFS=rpool/ROOT/ubuntu_$UUID|" \
+ /boot/firmware/cmdline.txt
+sed -i "s| fixrtc||" /boot/firmware/cmdline.txt
+sed -i "s|$| init_on_alloc=0|" /boot/firmware/cmdline.txt
+
The fixrtc
script is not compatible with ZFS and will cause the boot
+to hang for 180 seconds.
The init_on_alloc=0
is to address performance regressions.
Optional (but highly recommended): Make debugging booting easier:
+sed -i "s|$| nosplash|" /boot/firmware/cmdline.txt
+
Reboot:
+exit
+reboot
+
Wait for the newly installed system to boot normally. Login as ubuntu
.
Become root:
+sudo -i
+
Set the DISK variable again:
+DISK=/dev/mmcblk0 # microSD card
+
+DISK=/dev/sdX # USB disk
+
Delete the ext4 partition and expand the ZFS partition:
+sfdisk $DISK --delete 3
+echo ", +" | sfdisk --no-reread -N 2 $DISK
+
Note: This does not automatically expand the pool. That will happen on reboot.
+Create a user account:
+Replace YOUR_USERNAME
with your desired username:
username=YOUR_USERNAME
+
+UUID=$(dd if=/dev/urandom bs=1 count=100 2>/dev/null |
+ tr -dc 'a-z0-9' | cut -c-6)
+ROOT_DS=$(zfs list -o name | awk '/ROOT\/ubuntu_/{print $1;exit}')
+zfs create -o com.ubuntu.zsys:bootfs-datasets=$ROOT_DS \
+ -o canmount=on -o mountpoint=/home/$username \
+ rpool/USERDATA/${username}_$UUID
+adduser $username
+
+cp -a /etc/skel/. /home/$username
+chown -R $username:$username /home/$username
+usermod -a -G adm,cdrom,dip,lpadmin,lxd,plugdev,sambashare,sudo $username
+
Reboot:
+reboot
+
Wait for the system to boot normally. Login using the account you +created.
+Become root:
+sudo -i
+
Expand the ZFS pool:
+Verify the pool expanded:
+zfs list rpool
+
If it did not automatically expand, try to expand it manually:
+DISK=/dev/mmcblk0 # microSD card
+DISKP=${DISK}p # microSD card
+
+DISK=/dev/sdX # USB disk
+DISKP=${DISK} # USB disk
+
+zpool online -e rpool ${DISKP}2
+
Delete the ubuntu
user:
deluser --remove-home ubuntu
+
Optional: Remove cloud-init:
+vi /etc/netplan/01-netcfg.yaml
+
network:
+ version: 2
+ ethernets:
+ eth0:
+ dhcp4: true
+
rm /etc/netplan/50-cloud-init.yaml
+apt purge --autoremove ^cloud-init
+rm -rf /etc/cloud
+
Optional: Remove other storage packages:
+apt purge --autoremove bcache-tools btrfs-progs cloud-guest-utils lvm2 \
+ mdadm multipath-tools open-iscsi overlayroot xfsprogs
+
Upgrade the minimal system:
+apt dist-upgrade --yes
+
Optional: Install a full GUI environment:
+apt install --yes ubuntu-desktop
+echo dtoverlay=vc4-fkms-v3d >> /boot/firmware/usercfg.txt
+
Hint: If you are installing a full GUI environment, you will likely +want to remove cloud-init as discussed above but manage your network with +NetworkManager:
+rm /etc/netplan/*.yaml
+vi /etc/netplan/01-network-manager-all.yaml
+
network:
+ version: 2
+ renderer: NetworkManager
+
Optional (but recommended): Disable log compression:
+As /var/log
is already compressed by ZFS, logrotate’s compression is
+going to burn CPU and disk I/O for (in most cases) very little gain. Also,
+if you are making snapshots of /var/log
, logrotate’s compression will
+actually waste space, as the uncompressed data will live on in the
+snapshot. You can edit the files in /etc/logrotate.d
by hand to comment
+out compress
, or use this loop (copy-and-paste highly recommended):
for file in /etc/logrotate.d/* ; do
+ if grep -Eq "(^|[^#y])compress" "$file" ; then
+ sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file"
+ fi
+done
+
Reboot:
+reboot
+
Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
Optional: For LUKS installs only, backup the LUKS header:
+sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected by +your LUKS passphrase, but you may wish to use additional encryption.
+Hint: If you created a mirror or raidz topology, repeat this for each
+LUKS volume (luks2
, etc.).
See Ubuntu 22.04 Root on ZFS for new +installs. This guide is no longer receiving most updates. It continues +to exist for reference for existing installs that followed it.
If you previously installed using this guide, please apply these fixes if +applicable:
+For a mirror or raidz topology, /boot/grub
is on a separate dataset. This
+was originally bpool/grub
, then changed on 2020-05-30 to
+bpool/BOOT/ubuntu_UUID/grub
to work-around zsys setting canmount=off
+which would result in /boot/grub
not mounting. This work-around lead to
+issues with snapshot restores. The underlying zsys
+issue was fixed and backported
+to 20.04, so it is now back to being bpool/grub
.
If you never applied the 2020-05-30 errata fix, then /boot/grub
is
+probably not mounting. Check that:
mount | grep /boot/grub
+
If it is mounted, everything is fine. Stop. Otherwise:
+zfs set canmount=on bpool/grub
+update-initramfs -c -k all
+update-grub
+
+grub-install --target=x86_64-efi --efi-directory=/boot/efi \
+ --bootloader-id=ubuntu --recheck --no-floppy
+
Run this for the additional disk(s), incrementing the “2” to “3” and so on
+for both /boot/efi2
and ubuntu-2
:
cp -a /boot/efi/EFI /boot/efi2
+grub-install --target=x86_64-efi --efi-directory=/boot/efi2 \
+ --bootloader-id=ubuntu-2 --recheck --no-floppy
+
Check that these have set prefix=($root)'/grub@'
:
grep prefix= \
+ /boot/efi/EFI/ubuntu/grub.cfg \
+ /boot/efi2/EFI/ubuntu-2/grub.cfg
+
If you applied the 2020-05-30 errata fix, then you should revert the dataset +rename:
+umount /boot/grub
+zfs rename bpool/BOOT/ubuntu_UUID/grub bpool/grub
+zfs set com.ubuntu.zsys:bootfs=no bpool/grub
+zfs mount bpool/grub
+
The HOWTO previously had a typo in AccountsService (where Accounts is plural) +as AccountServices (where Services is plural). This means that AccountsService +data will be written to the root filesystem. This is only harmful in the event +of a rollback of the root filesystem that does not include a rollback of the +user data. Check it:
+zfs list | grep Account
+
If the “s” is on “Accounts”, you are good. If it is on “Services”, fix it:
+mv /var/lib/AccountsService /var/lib/AccountsService-old
+zfs list -r rpool
+# Replace the UUID twice below:
+zfs rename rpool/ROOT/ubuntu_UUID/var/lib/AccountServices \
+ rpool/ROOT/ubuntu_UUID/var/lib/AccountsService
+mv /var/lib/AccountsService-old/* /var/lib/AccountsService
+rmdir /var/lib/AccountsService-old
+
The Ubuntu installer has support for root-on-ZFS. This HOWTO produces nearly identical results to the Ubuntu installer because of bidirectional collaboration.
+If you want a single-disk, unencrypted, desktop install, use the installer. It +is far easier and faster than doing everything by hand.
+If you want a ZFS native encrypted, desktop install, you can trivially edit
+the installer.
+The -O recordsize=1M
there is unrelated to encryption; omit that unless
+you understand it. Make sure to use a password that is at least 8 characters
+or this hack will crash the installer. Additionally, once the system is
+installed, you should switch to encrypted swap:
swapon -v
+# Note the device, including the partition.
+
+ls -l /dev/disk/by-id/
+# Find the by-id name of the disk.
+
+sudo swapoff -a
+sudo vi /etc/fstab
+# Remove the swap entry.
+
+sudo apt install --yes cryptsetup
+
+# Replace DISK-partN as appropriate from above:
+echo swap /dev/disk/by-id/DISK-partN /dev/urandom \
+ swap,cipher=aes-xts-plain64:sha256,size=512 | sudo tee -a /etc/crypttab
+echo /dev/mapper/swap none swap defaults 0 0 | sudo tee -a /etc/fstab
+
Hopefully the installer will gain encryption support in +the future.
+If you want to setup a mirror or raidz topology, use LUKS encryption, and/or +install a server (no desktop GUI), use this HOWTO.
+If you are looking to install on a Raspberry Pi, see +Ubuntu 20.04 Root on ZFS for Raspberry Pi.
+This HOWTO uses a whole physical disk.
Do not use these instructions for dual-booting.
Backup your data. Any existing data will be lost.
Ubuntu 20.04.4 (“Focal”) Desktop CD +(not any server images)
Installing on a drive which presents 4 KiB logical sectors (a “4Kn” drive) only works with UEFI booting. This is not unique to ZFS. GRUB does not and will not work on 4Kn with legacy (BIOS) booting.
Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of memory +is recommended for normal performance in basic workloads. If you wish to use +deduplication, you will need massive amounts of RAM. Enabling +deduplication is a permanent change that cannot be easily reverted.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @rlaager.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo apt install python3-pip
+
+pip3 install -r docs/requirements.txt
+
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request. Mention @rlaager.
This guide supports three different encryption options: unencrypted, ZFS +native encryption, and LUKS. With any option, all ZFS features are fully +available.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+ZFS native encryption encrypts the data and most metadata in the root
+pool. It does not encrypt dataset or snapshot names or properties. The
+boot pool is not encrypted at all, but it only contains the bootloader,
+kernel, and initrd. (Unless you put a password in /etc/fstab
, the
+initrd is unlikely to contain sensitive data.) The system cannot boot
+without the passphrase being entered at the console. Performance is
+good. As the encryption happens in ZFS, even if multiple disks (mirror
+or raidz topologies) are used, the data only has to be encrypted once.
LUKS encrypts almost everything. The only unencrypted data is the bootloader, +kernel, and initrd. The system cannot boot without the passphrase being +entered at the console. Performance is good, but LUKS sits underneath ZFS, so +if multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+Boot the Ubuntu Live CD. Select Try Ubuntu. Connect your system to the +Internet as appropriate (e.g. join your WiFi network). Open a terminal +(press Ctrl-Alt-T).
Setup and update the repositories:
+sudo apt update
+
Optional: Install and start the OpenSSH server in the Live CD environment:
+If you have a second system, using SSH to access the target system can be +convenient:
+passwd
+# There is no current password.
+sudo apt install --yes openssh-server vim
+
Installing the full vim
package fixes terminal problems that occur when
+using the vim-tiny
package (that ships in the Live CD environment) over
+SSH.
Hint: You can find your IP address with
+ip addr show scope global | grep inet
. Then, from your main machine,
+connect with ssh ubuntu@IP
.
Disable automounting:
+If the disk has been used before (with partitions at the same offsets), +previous filesystems (e.g. the ESP) will automount if not disabled:
+gsettings set org.gnome.desktop.media-handling automount false
+
Become root:
+sudo -i
+
Install ZFS in the Live CD environment:
+apt install --yes debootstrap gdisk zfsutils-linux
+
+systemctl stop zed
+
Set a variable with the disk name:
+DISK=/dev/disk/by-id/scsi-SATA_disk1
+
Always use the long /dev/disk/by-id/*
aliases with ZFS. Using the
+/dev/sd*
device nodes directly can cause sporadic import failures,
+especially on systems that have more than one storage pool.
Hints:
+ls -la /dev/disk/by-id
will list the aliases.
Are you doing this in a virtual machine? If your virtual disk is missing
+from /dev/disk/by-id
, use /dev/vda
if you are using KVM with
+virtio; otherwise, read the troubleshooting
+section.
For a mirror or raidz topology, use DISK1
, DISK2
, etc.
When choosing a boot pool size, consider how you will use the space. A +kernel and initrd may consume around 100M. If you have multiple kernels +and take snapshots, you may find yourself low on boot pool space, +especially if you need to regenerate your initramfs images, which may be +around 85M each. Size your boot pool appropriately for your needs.
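If you want a rough sense of those numbers before deciding, a quick check (purely illustrative, assuming an already-installed Ubuntu system with kernels in /boot) is to total the kernels and initramfs images it currently carries:

# Rough sizing check on an existing Ubuntu system:
du -sch /boot/vmlinuz-* /boot/initrd.img-* 2>/dev/null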
If you are re-using a disk, clear it as necessary:
+Ensure swap partitions are not in use:
+swapoff --all
+
If the disk was previously used in an MD array:
+apt install --yes mdadm
+
+# See if one or more MD arrays are active:
+cat /proc/mdstat
+# If so, stop them (replace ``md0`` as required):
+mdadm --stop /dev/md0
+
+# For an array using the whole disk:
+mdadm --zero-superblock --force $DISK
+# For an array using a partition (e.g. a swap partition per this HOWTO):
+mdadm --zero-superblock --force ${DISK}-part2
+
Clear the partition table:
+sgdisk --zap-all $DISK
+
If you get a message about the kernel still using the old partition table, +reboot and start over (except that you can skip this step).
+Create bootloader partition(s):
+sgdisk -n1:1M:+512M -t1:EF00 $DISK
+
+# For legacy (BIOS) booting:
+sgdisk -a1 -n5:24K:+1000K -t5:EF02 $DISK
+
Note: While the Ubuntu installer uses an MBR label for legacy (BIOS)
+booting, this HOWTO uses GPT partition labels for both UEFI and legacy
+(BIOS) booting. This is simpler than having two options. It also
+provides forward compatibility (future proofing). In other words, for
+legacy (BIOS) booting, this will allow you to move the disk(s) to a new
+system/motherboard in the future without having to rebuild the pool (and
+restore your data from a backup). The ESP is created in both cases for
+similar reasons. Additionally, the ESP is used for /boot/grub
in
+single-disk installs, as discussed below.
Create a partition for swap:
+Previous versions of this HOWTO put swap on a zvol. Ubuntu recommends +against this configuration due to deadlocks. There +is a bug report upstream.
+Putting swap on a partition gives up the benefit of ZFS checksums (for your +swap). That is probably the right trade-off given the reports of ZFS +deadlocks with swap. If you are bothered by this, simply do not enable +swap.
+Choose one of the following options if you want swap:
+For a single-disk install:
+sgdisk -n2:0:+500M -t2:8200 $DISK
+
For a mirror or raidz topology:
+sgdisk -n2:0:+500M -t2:FD00 $DISK
+
Adjust the swap size to your needs. If you wish to enable hibernation (which only works for unencrypted installs), the swap partition must be at least as large as the system’s RAM.
+Create a boot pool partition:
+sgdisk -n3:0:+2G -t3:BE00 $DISK
+
The Ubuntu installer uses 5% of the disk space constrained to a minimum of +500 MiB and a maximum of 2 GiB. Making this too small (and 500 MiB might +be too small) can result in an inability to upgrade the kernel.
+Create a root pool partition:
+Choose one of the following options:
+Unencrypted or ZFS native encryption:
+sgdisk -n4:0:0 -t4:BF00 $DISK
+
LUKS:
+sgdisk -n4:0:0 -t4:8309 $DISK
+
If you are creating a mirror or raidz topology, repeat the partitioning +commands for all the disks which will be part of the pool.
+Create the boot pool:
+zpool create \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o ashift=12 -o autotrim=on -d \
+ -o feature@async_destroy=enabled \
+ -o feature@bookmarks=enabled \
+ -o feature@embedded_data=enabled \
+ -o feature@empty_bpobj=enabled \
+ -o feature@enabled_txg=enabled \
+ -o feature@extensible_dataset=enabled \
+ -o feature@filesystem_limits=enabled \
+ -o feature@hole_birth=enabled \
+ -o feature@large_blocks=enabled \
+ -o feature@lz4_compress=enabled \
+ -o feature@spacemap_histogram=enabled \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
+ -O mountpoint=/boot -R /mnt \
+ bpool ${DISK}-part3
+
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c.
+This step creates a separate boot pool for /boot
with the features
+limited to only those that GRUB supports, allowing the root pool to use
+any/all features. Note that GRUB opens the pool read-only, so all
+read-only compatible features are “supported” by GRUB.
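If you want to verify which features ended up enabled on the boot pool (a check only, not part of the original steps), you can list them after creating the pool:

# Show the state (enabled/active/disabled) of each feature flag on bpool:
zpool get all bpool | grep feature@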
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ bpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part3 \
+ /dev/disk/by-id/scsi-SATA_disk2-part3
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
The boot pool name is no longer arbitrary. It _must_ be bpool
.
+If you really want to rename it, edit /etc/grub.d/10_linux_zfs
later,
+after GRUB is installed (and run update-grub
).
Feature Notes:
+The allocation_classes
feature should be safe to use. However, unless
+one is using it (i.e. a special
vdev), there is no point to enabling
+it. It is extremely unlikely that someone would use this feature for a
+boot pool. If one cares about speeding up the boot pool, it would make
+more sense to put the whole pool on the faster disk rather than using it
+as a special
vdev.
The project_quota
feature has been tested and is safe to use. This
+feature is extremely unlikely to matter for the boot pool.
The resilver_defer
should be safe but the boot pool is small enough
+that it is unlikely to be necessary.
The spacemap_v2
feature has been tested and is safe to use. The boot
+pool is small, so this does not matter in practice.
As a read-only compatible feature, the userobj_accounting
feature
+should be compatible in theory, but in practice, GRUB can fail with an
+“invalid dnode type” error. This feature does not matter for /boot
+anyway.
Create the root pool:
+Choose one of the following options:
+Unencrypted:
+zpool create \
+ -o ashift=12 -o autotrim=on \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
ZFS native encryption:
+zpool create \
+ -o ashift=12 -o autotrim=on \
+ -O encryption=aes-256-gcm \
+ -O keylocation=prompt -O keyformat=passphrase \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
LUKS:
+cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK}-part4
+cryptsetup luksOpen ${DISK}-part4 luks1
+zpool create \
+ -o ashift=12 -o autotrim=on \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool /dev/mapper/luks1
+
Notes:
+The use of ashift=12
is recommended here because many drives
+today have 4 KiB (or larger) physical sectors, even though they
+present 512 B logical sectors. Also, a future replacement drive may
+have 4 KiB physical sectors (in which case ashift=12
is desirable)
+or 4 KiB logical sectors (in which case ashift=12
is required).
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires ACLs
+Also, disabling ACLs apparently breaks umask handling with NFSv4.
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only filenames.
recordsize
is unset (leaving it at the default of 128 KiB). If you
+want to tune it (e.g. -O recordsize=1M
), see these various blog
+posts.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s documentation
+for further information.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain controller.
+Note that xattr=sa
is Linux-specific. If you move your
+xattr=sa
pool to another OpenZFS implementation besides ZFS-on-Linux,
+extended attributes will not be readable (though your data will be). If
+portability of extended attributes is important to you, omit the
+-O xattr=sa
above. Even if you do not want xattr=sa
for the whole
+pool, it is probably fine to use it for /var/log
.
Make sure to include the -part4
portion of the drive path. If you
+forget that, you are specifying the whole disk, which ZFS will then
+re-partition, and you will lose the bootloader partition(s).
ZFS native encryption defaults to aes-256-ccm
, but the default has
+changed upstream
+to aes-256-gcm
. AES-GCM seems to be generally preferred over AES-CCM,
+is faster now,
+and will be even faster in the future.
For LUKS, the key size chosen is 512 bits. However, XTS mode requires two
+keys, so the LUKS key is split in half. Thus, -s 512
means AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup FAQ +for guidance.
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ rpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ /dev/disk/by-id/scsi-SATA_disk2-part4
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
When using LUKS with mirror or raidz topologies, use
+/dev/mapper/luks1
, /dev/mapper/luks2
, etc., which you will have
+to create using cryptsetup
.
The pool name is arbitrary. If changed, the new name must be used
+consistently. On systems that can automatically install to ZFS, the root
+pool is named rpool
by default.
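Expanding on the LUKS hint above, a minimal sketch for a two-disk mirror might look like the following, assuming DISK2 was set as described in the disk-variable hints earlier (adjust and repeat for additional disks or raidz):

# Format and open LUKS on the second disk's root partition:
cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK2}-part4
cryptsetup luksOpen ${DISK2}-part4 luks2

# Then create the pool on the LUKS mappings:
zpool create \
    ... \
    rpool mirror \
    /dev/mapper/luks1 \
    /dev/mapper/luks2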
Create filesystem datasets to act as containers:
+zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+zfs create -o canmount=off -o mountpoint=none bpool/BOOT
+
Create filesystem datasets for the root and boot filesystems:
+UUID=$(dd if=/dev/urandom bs=1 count=100 2>/dev/null |
+ tr -dc 'a-z0-9' | cut -c-6)
+
+zfs create -o mountpoint=/ \
+ -o com.ubuntu.zsys:bootfs=yes \
+ -o com.ubuntu.zsys:last-used=$(date +%s) rpool/ROOT/ubuntu_$UUID
+
+zfs create -o mountpoint=/boot bpool/BOOT/ubuntu_$UUID
+
Create datasets:
+zfs create -o com.ubuntu.zsys:bootfs=no \
+ rpool/ROOT/ubuntu_$UUID/srv
+zfs create -o com.ubuntu.zsys:bootfs=no -o canmount=off \
+ rpool/ROOT/ubuntu_$UUID/usr
+zfs create rpool/ROOT/ubuntu_$UUID/usr/local
+zfs create -o com.ubuntu.zsys:bootfs=no -o canmount=off \
+ rpool/ROOT/ubuntu_$UUID/var
+zfs create rpool/ROOT/ubuntu_$UUID/var/games
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/AccountsService
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/apt
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/dpkg
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/NetworkManager
+zfs create rpool/ROOT/ubuntu_$UUID/var/log
+zfs create rpool/ROOT/ubuntu_$UUID/var/mail
+zfs create rpool/ROOT/ubuntu_$UUID/var/snap
+zfs create rpool/ROOT/ubuntu_$UUID/var/spool
+zfs create rpool/ROOT/ubuntu_$UUID/var/www
+
+zfs create -o canmount=off -o mountpoint=/ \
+ rpool/USERDATA
+zfs create -o com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_$UUID \
+ -o canmount=on -o mountpoint=/root \
+ rpool/USERDATA/root_$UUID
+chmod 700 /mnt/root
+
For a mirror or raidz topology, create a dataset for /boot/grub
:
zfs create -o com.ubuntu.zsys:bootfs=no bpool/grub
+
Mount a tmpfs at /run:
+mkdir /mnt/run
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+
A tmpfs is recommended later, but if you want a separate dataset for
+/tmp
:
zfs create -o com.ubuntu.zsys:bootfs=no \
+ rpool/ROOT/ubuntu_$UUID/tmp
+chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data.
+If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for /tmp
,
+as shown above. This keeps the /tmp
data out of snapshots of your root
+filesystem. It also allows you to set a quota on rpool/ROOT/ubuntu_$UUID/tmp, if you want
+to limit the maximum space used. Otherwise, you can use a tmpfs (RAM
+filesystem) later.
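For example, if you created the separate /tmp dataset and want to cap its size, a quota can be set at any time (the 2G value here is only an illustration):

# Limit the optional /tmp dataset to 2 GiB; adjust to taste:
zfs set quota=2G rpool/ROOT/ubuntu_$UUID/tmp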
Install the minimal system:
+debootstrap focal /mnt
+
The debootstrap
command leaves the new system in an unconfigured state.
+An alternative to using debootstrap
is to copy the entirety of a
+working system into the new ZFS root.
Copy in zpool.cache:
+mkdir /mnt/etc/zfs
+cp /etc/zfs/zpool.cache /mnt/etc/zfs/
+
Configure the hostname:
+Replace HOSTNAME
with the desired hostname:
hostname HOSTNAME
+hostname > /mnt/etc/hostname
+vi /mnt/etc/hosts
+
Add a line:
+127.0.1.1 HOSTNAME
+or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
Configure the network interface:
+Find the interface name:
+ip addr show
+
Adjust NAME
below to match your interface name:
vi /mnt/etc/netplan/01-netcfg.yaml
+
network:
+ version: 2
+ ethernets:
+ NAME:
+ dhcp4: true
+
Customize this file if the system is not a DHCP client.
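For instance, a static configuration might look roughly like this; the interface name, addresses, gateway, and DNS server below are placeholders, so substitute your own values:

network:
  version: 2
  ethernets:
    NAME:
      addresses: [192.168.1.10/24]
      gateway4: 192.168.1.1
      nameservers:
        addresses: [192.168.1.1]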
+Configure the package sources:
+vi /mnt/etc/apt/sources.list
+
deb http://archive.ubuntu.com/ubuntu focal main restricted universe multiverse
+deb http://archive.ubuntu.com/ubuntu focal-updates main restricted universe multiverse
+deb http://archive.ubuntu.com/ubuntu focal-backports main restricted universe multiverse
+deb http://security.ubuntu.com/ubuntu focal-security main restricted universe multiverse
+
Bind the virtual filesystems from the LiveCD environment to the new
+system and chroot
into it:
mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+chroot /mnt /usr/bin/env DISK=$DISK UUID=$UUID bash --login
+
Note: This is using --rbind
, not --bind
.
Configure a basic system environment:
+apt update
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available:
dpkg-reconfigure locales tzdata keyboard-configuration console-setup
+
Install your preferred text editor:
+apt install --yes nano
+
+apt install --yes vim
+
Installing the full vim
package fixes terminal problems that occur when
+using the vim-tiny
package (that is installed by debootstrap
) over
+SSH.
For LUKS installs only, setup /etc/crypttab
:
apt install --yes cryptsetup
+
+echo luks1 /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part4) \
+ none luks,discard,initramfs > /etc/crypttab
+
The use of initramfs is a work-around for the fact that cryptsetup does not support ZFS.
Hint: If you are creating a mirror or raidz topology, repeat the
+/etc/crypttab
entries for luks2
, etc. adjusting for each disk.
Create the EFI filesystem:
+Perform these steps for both UEFI and legacy (BIOS) booting:
+apt install --yes dosfstools
+
+mkdosfs -F 32 -s 1 -n EFI ${DISK}-part1
+mkdir /boot/efi
+echo /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part1) \
+ /boot/efi vfat defaults 0 0 >> /etc/fstab
+mount /boot/efi
+
For a mirror or raidz topology, repeat the mkdosfs for the additional +disks, but do not repeat the other commands.
+Note: The -s 1
for mkdosfs
is only necessary for drives which
+present 4 KiB logical sectors (“4Kn” drives) to meet the minimum cluster
+size (given the partition size of 512 MiB) for FAT32. It also works fine on
+drives which present 512 B sectors.
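For the mirror or raidz case mentioned above, the repeated command would look something like this, assuming DISK2 (and so on) were set earlier:

# Create an ESP filesystem on each additional disk:
mkdosfs -F 32 -s 1 -n EFI ${DISK2}-part1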
Put /boot/grub
on the EFI System Partition:
For a single-disk install only:
+mkdir /boot/efi/grub /boot/grub
+echo /boot/efi/grub /boot/grub none defaults,bind 0 0 >> /etc/fstab
+mount /boot/grub
+
This allows GRUB to write to /boot/grub
(since it is on a FAT-formatted
+ESP instead of on ZFS), which means that /boot/grub/grubenv
and the
+recordfail
feature works as expected: if the boot fails, the normally
+hidden GRUB menu will be shown on the next boot. For a mirror or raidz
+topology, we do not want GRUB writing to the EFI System Partition. This is
+because we duplicate it at install without a mechanism to update the copies
+when the GRUB configuration changes (e.g. as the kernel is upgraded). Thus,
+we keep /boot/grub
on the boot pool for the mirror or raidz topologies.
+This preserves correct mirroring/raidz behavior, at the expense of being
+able to write to /boot/grub/grubenv
and thus the recordfail
+behavior.
Install GRUB/Linux/ZFS in the chroot environment for the new system:
+Choose one of the following options:
+Install GRUB/Linux/ZFS for legacy (BIOS) booting:
+apt install --yes grub-pc linux-image-generic zfs-initramfs zsys
+
Select (using the space bar) all of the disks (not partitions) in your +pool.
+Install GRUB/Linux/ZFS for UEFI booting:
+apt install --yes \
+ grub-efi-amd64 grub-efi-amd64-signed linux-image-generic \
+ shim-signed zfs-initramfs zsys
+
Notes:
+Ignore any error messages saying ERROR: Couldn't resolve device
and
+WARNING: Couldn't determine root device
. cryptsetup does not
+support ZFS.
Ignore any error messages saying Module zfs not found
and
+couldn't connect to zsys daemon
. The first seems to occur due to a
+version mismatch between the Live CD kernel and the chroot environment,
+but this is irrelevant since the module is already loaded. The second
+may be caused by the first but either way is irrelevant since zed
+is started manually later.
For a mirror or raidz topology, this step only installs GRUB on the
+first disk. The other disk(s) will be handled later. For some reason,
+grub-efi-amd64 does not prompt for install_devices
here, but does
+after a reboot.
Optional: Remove os-prober:
+apt purge --yes os-prober
+
This avoids error messages from update-grub
. os-prober
is only
+necessary in dual-boot configurations.
Set a root password:
+passwd
+
Configure swap:
+Choose one of the following options if you want swap:
+For an unencrypted single-disk install:
+mkswap -f ${DISK}-part2
+echo /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part2) \
+ none swap discard 0 0 >> /etc/fstab
+swapon -a
+
For an unencrypted mirror or raidz topology:
+apt install --yes mdadm
+
+# Adjust the level (ZFS raidz = MD raid5, raidz2 = raid6) and
+# raid-devices if necessary and specify the actual devices.
+mdadm --create /dev/md0 --metadata=1.2 --level=mirror \
+ --raid-devices=2 ${DISK1}-part2 ${DISK2}-part2
+mkswap -f /dev/md0
+echo /dev/disk/by-uuid/$(blkid -s UUID -o value /dev/md0) \
+ none swap discard 0 0 >> /etc/fstab
+
For an encrypted (LUKS or ZFS native encryption) single-disk install:
+apt install --yes cryptsetup
+
+echo swap ${DISK}-part2 /dev/urandom \
+ swap,cipher=aes-xts-plain64:sha256,size=512 >> /etc/crypttab
+echo /dev/mapper/swap none swap defaults 0 0 >> /etc/fstab
+
For an encrypted (LUKS or ZFS native encryption) mirror or raidz +topology:
+apt install --yes cryptsetup mdadm
+
+# Adjust the level (ZFS raidz = MD raid5, raidz2 = raid6) and
+# raid-devices if necessary and specify the actual devices.
+mdadm --create /dev/md0 --metadata=1.2 --level=mirror \
+ --raid-devices=2 ${DISK1}-part2 ${DISK2}-part2
+echo swap /dev/md0 /dev/urandom \
+ swap,cipher=aes-xts-plain64:sha256,size=512 >> /etc/crypttab
+echo /dev/mapper/swap none swap defaults 0 0 >> /etc/fstab
+
Optional (but recommended): Mount a tmpfs to /tmp
If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+systemctl enable tmp.mount
+
Setup system groups:
+addgroup --system lpadmin
+addgroup --system lxd
+addgroup --system sambashare
+
Patch a dependency loop:
+For ZFS native encryption or LUKS:
+apt install --yes curl patch
+
+curl https://launchpadlibrarian.net/478315221/2150-fix-systemd-dependency-loops.patch | \
+ sed "s|/etc|/lib|;s|\.in$||" | (cd / ; patch -p1)
+
Ignore the failure in Hunk #2 (say n
twice).
This patch is from Bug #1875577 Encrypted swap won’t load on 20.04 with +zfs root.
+Optional: Install SSH:
+apt install --yes openssh-server
+
+vi /etc/ssh/sshd_config
+# Set: PermitRootLogin yes
+
Verify that the ZFS boot filesystem is recognized:
+grub-probe /boot
+
Refresh the initrd files:
+update-initramfs -c -k all
+
Note: Ignore any error messages saying ERROR: Couldn't resolve
+device
and WARNING: Couldn't determine root device
. cryptsetup
+does not support ZFS.
Disable memory zeroing:
+vi /etc/default/grub
+# Add init_on_alloc=0 to: GRUB_CMDLINE_LINUX_DEFAULT
+# Save and quit (or see the next step).
+
This is to address performance regressions.
+Optional (but highly recommended): Make debugging GRUB easier:
+vi /etc/default/grub
+# Comment out: GRUB_TIMEOUT_STYLE=hidden
+# Set: GRUB_TIMEOUT=5
+# Below GRUB_TIMEOUT, add: GRUB_RECORDFAIL_TIMEOUT=5
+# Remove quiet and splash from: GRUB_CMDLINE_LINUX_DEFAULT
+# Uncomment: GRUB_TERMINAL=console
+# Save and quit.
+
Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired.
+Update the boot configuration:
+update-grub
+
Note: Ignore errors from osprober
, if present.
Install the boot loader:
+Choose one of the following options:
+For legacy (BIOS) booting, install GRUB to the MBR:
+grub-install $DISK
+
Note that you are installing GRUB to the whole disk, not a partition.
+If you are creating a mirror or raidz topology, repeat the
+grub-install
command for each disk in the pool.
For UEFI booting, install GRUB to the ESP:
+grub-install --target=x86_64-efi --efi-directory=/boot/efi \
+ --bootloader-id=ubuntu --recheck --no-floppy
+
Disable grub-initrd-fallback.service
+For a mirror or raidz topology:
+systemctl mask grub-initrd-fallback.service
+
This is the service for /boot/grub/grubenv
which does not work on
+mirrored or raidz topologies. Disabling this keeps it from blocking
+subsequent mounts of /boot/grub
if that mount ever fails.
Another option would be to set RequiresMountsFor=/boot/grub
via a
+drop-in unit, but that is more work to do here for no reason. Hopefully
+this bug
+will be fixed upstream.
Fix filesystem mount ordering:
+We need to activate zfs-mount-generator
. This makes systemd aware of
+the separate mountpoints, which is important for things like /var/log
+and /var/tmp
. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
feature
+of systemd automatically use After=var-tmp.mount
.
mkdir /etc/zfs/zfs-list.cache
+touch /etc/zfs/zfs-list.cache/bpool
+touch /etc/zfs/zfs-list.cache/rpool
+ln -s /usr/lib/zfs-linux/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d
+zed -F &
+
Verify that zed
updated the cache by making sure these are not empty:
cat /etc/zfs/zfs-list.cache/bpool
+cat /etc/zfs/zfs-list.cache/rpool
+
If either is empty, force a cache update and check again:
+zfs set canmount=on bpool/BOOT/ubuntu_$UUID
+zfs set canmount=on rpool/ROOT/ubuntu_$UUID
+
If they are still empty, stop zed (as below), start zed (as above) and try +again.
+Once the files have data, stop zed
:
fg
+Press Ctrl-C.
+
Fix the paths to eliminate /mnt
:
sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
+
Exit from the chroot
environment back to the LiveCD environment:
exit
+
Run these commands in the LiveCD environment to unmount all +filesystems:
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+
Reboot:
+reboot
+
Wait for the newly installed system to boot normally. Login as root.
+Install GRUB to additional disks:
+For a UEFI mirror or raidz topology only:
+dpkg-reconfigure grub-efi-amd64
+
+Select (using the space bar) all of the ESP partitions (partition 1 on
+each of the pool disks).
+
Create a user account:
+Replace YOUR_USERNAME
with your desired username:
username=YOUR_USERNAME
+
+UUID=$(dd if=/dev/urandom bs=1 count=100 2>/dev/null |
+ tr -dc 'a-z0-9' | cut -c-6)
+ROOT_DS=$(zfs list -o name | awk '/ROOT\/ubuntu_/{print $1;exit}')
+zfs create -o com.ubuntu.zsys:bootfs-datasets=$ROOT_DS \
+ -o canmount=on -o mountpoint=/home/$username \
+ rpool/USERDATA/${username}_$UUID
+adduser $username
+
+cp -a /etc/skel/. /home/$username
+chown -R $username:$username /home/$username
+usermod -a -G adm,cdrom,dip,lpadmin,lxd,plugdev,sambashare,sudo $username
+
Upgrade the minimal system:
+apt dist-upgrade --yes
+
Install a regular set of software:
+Choose one of the following options:
+Install a command-line environment only:
+apt install --yes ubuntu-standard
+
Install a full GUI environment:
+apt install --yes ubuntu-desktop
+
Hint: If you are installing a full GUI environment, you will likely +want to manage your network with NetworkManager:
+rm /etc/netplan/01-netcfg.yaml
+vi /etc/netplan/01-network-manager-all.yaml
+
network:
+ version: 2
+ renderer: NetworkManager
+
Optional: Disable log compression:
+As /var/log
is already compressed by ZFS, logrotate’s compression is
+going to burn CPU and disk I/O for (in most cases) very little gain. Also,
+if you are making snapshots of /var/log
, logrotate’s compression will
+actually waste space, as the uncompressed data will live on in the
+snapshot. You can edit the files in /etc/logrotate.d
by hand to comment
+out compress
, or use this loop (copy-and-paste highly recommended):
for file in /etc/logrotate.d/* ; do
+ if grep -Eq "(^|[^#y])compress" "$file" ; then
+ sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file"
+ fi
+done
+
Reboot:
+reboot
+
Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
Optional: Disable the root password:
+sudo usermod -p '*' root
+
Optional (but highly recommended): Disable root SSH logins:
+If you installed SSH earlier, revert the temporary change:
+sudo vi /etc/ssh/sshd_config
+# Remove: PermitRootLogin yes
+
+sudo systemctl restart ssh
+
Optional: Re-enable the graphical boot process:
+If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer.
+sudo vi /etc/default/grub
+# Uncomment: GRUB_TIMEOUT_STYLE=hidden
+# Add quiet and splash to: GRUB_CMDLINE_LINUX_DEFAULT
+# Comment out: GRUB_TERMINAL=console
+# Save and quit.
+
+sudo update-grub
+
Note: Ignore errors from osprober
, if present.
Optional: For LUKS installs only, backup the LUKS header:
+sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected by +your LUKS passphrase, but you may wish to use additional encryption.
+Hint: If you created a mirror or raidz topology, repeat this for each
+LUKS volume (luks2
, etc.).
Go through Step 1: Prepare The Install Environment.
+For LUKS, first unlock the disk(s):
+cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1
+# Repeat for additional disks, if this is a mirror or raidz topology.
+
Mount everything correctly:
+zpool export -a
+zpool import -N -R /mnt rpool
+zpool import -N -R /mnt bpool
+zfs load-key -a
+# Replace “UUID” as appropriate; use zfs list to find it:
+zfs mount rpool/ROOT/ubuntu_UUID
+zfs mount bpool/BOOT/ubuntu_UUID
+zfs mount -a
+
If needed, you can chroot into your installed environment:
+mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+chroot /mnt /bin/bash --login
+mount -a
+
Do whatever you need to do to fix your system.
+When done, cleanup:
+exit
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+reboot
+
Systems that require the arcsas
blob driver should add it to the
+/etc/initramfs-tools/modules
file and run update-initramfs -c -k all
.
Upgrade or downgrade the Areca driver if something like
+RIP: 0010:[<ffffffff8101b316>] [<ffffffff8101b316>] native_read_tsc+0x6/0x20
+appears anywhere in kernel log. ZoL is unstable on systems that emit this
+error message.
Most problem reports for this tutorial involve mpt2sas
hardware that does
+slow asynchronous drive initialization, like some IBM M1015 or OEM-branded
+cards that have been flashed to the reference LSI firmware.
The basic problem is that disks on these controllers are not visible to the +Linux kernel until after the regular system is started, and ZoL does not +hotplug pool members. See https://github.com/zfsonlinux/zfs/issues/330.
+Most LSI cards are perfectly compatible with ZoL. If your card has this
+glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X
in
+/etc/default/zfs
. The system will wait X
seconds for all drives to
+appear before importing the pool.
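As a concrete sketch (the 15-second value is only an example), the setting and the initramfs rebuild it needs would look like this:

# Wait 15 seconds for slow controllers before importing the pool:
echo 'ZFS_INITRD_PRE_MOUNTROOT_SLEEP=15' >> /etc/default/zfs
update-initramfs -u -k all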
Set a unique serial number on each virtual disk using libvirt or qemu
+(e.g. -drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890
).
To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host:
+sudo apt install ovmf
+sudo vi /etc/libvirt/qemu.conf
+
Uncomment these lines:
+nvram = [
+ "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/OVMF/OVMF_CODE.secboot.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF32_CODE.fd:/usr/share/AAVMF/AAVMF32_VARS.fd",
+ "/usr/share/OVMF/OVMF_CODE.ms.fd:/usr/share/OVMF/OVMF_VARS.ms.fd"
+]
+
sudo systemctl restart libvirtd.service
+
Set disk.EnableUUID = "TRUE"
in the vmx file or vsphere configuration.
+Doing this ensures that /dev/disk
aliases are created in the guest.
Note
+These are beta instructions. The author still needs to test them. +Additionally, it may be possible to use U-Boot now, which would eliminate +some of the customizations.
+This HOWTO uses a whole physical disk.
Backup your data. Any existing data will be lost.
A Raspberry Pi 4 B. (If you are looking to install on a regular PC, see +Ubuntu 22.04 Root on ZFS.)
A microSD card or USB disk. For microSD card recommendations, see Jeff +Geerling’s performance comparison. +When using a USB enclosure, ensure it supports UASP.
An Ubuntu system (with the ability to write to the microSD card or USB disk) +other than the target Raspberry Pi.
4 GiB of memory is recommended. Do not use deduplication, as it needs massive +amounts of RAM. +Enabling deduplication is a permanent change that cannot be easily reverted.
+A Raspberry Pi 3 B/B+ would probably work (as the Pi 3 is 64-bit, though it +has less RAM), but has not been tested. Please report your results (good or +bad) using the issue link below.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @rlaager.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo apt install python3-pip
+
+pip3 install -r docs/requirements.txt
+
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request. Mention @rlaager.
WARNING: Encryption has not yet been tested on the Raspberry Pi.
+This guide supports three different encryption options: unencrypted, ZFS +native encryption, and LUKS. With any option, all ZFS features are fully +available.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+ZFS native encryption encrypts the data and most metadata in the root
+pool. It does not encrypt dataset or snapshot names or properties. The
+boot pool is not encrypted at all, but it only contains the bootloader,
+kernel, and initrd. (Unless you put a password in /etc/fstab
, the
+initrd is unlikely to contain sensitive data.) The system cannot boot
+without the passphrase being entered at the console. Performance is
+good. As the encryption happens in ZFS, even if multiple disks (mirror
+or raidz topologies) are used, the data only has to be encrypted once.
LUKS encrypts almost everything. The only unencrypted data is the bootloader, +kernel, and initrd. The system cannot boot without the passphrase being +entered at the console. Performance is good, but LUKS sits underneath ZFS, so +if multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+The Raspberry Pi 4 runs much faster using a USB Solid State Drive (SSD) than +a microSD card. These instructions can also be used to install Ubuntu on a +USB-connected SSD or other USB disk. USB disks have three requirements that +do not apply to microSD cards:
+The Raspberry Pi’s Bootloader EEPROM must be dated 2020-09-03 or later.
+To check the bootloader version, power up the Raspberry Pi without an SD
+card inserted or a USB boot device attached; the date will be on the
+bootloader
line. (If you do not see the bootloader
line, the
+bootloader is too old.) Alternatively, run sudo rpi-eeprom-update
+on an existing OS on the Raspberry Pi (which on Ubuntu requires
+apt install rpi-eeprom
).
If needed, the bootloader can be updated from an existing OS on the
+Raspberry Pi using rpi-eeprom-update -a
and rebooting.
+For other options, see Updating the Bootloader.
The Raspberry Pi must be configured for USB boot. The bootloader will show a
+boot
line; if order
includes 4
, USB boot is enabled.
If not already enabled, it can be enabled from an existing OS on the
+Raspberry Pi using rpi-eeprom-config -e
: set BOOT_ORDER=0xf41
+and reboot to apply the change. On subsequent reboots, USB boot will be
+enabled.
Otherwise, it can be enabled without an existing OS as follows:
+Download the Raspberry Pi Imager Utility.
Flash the USB Boot
image to a microSD card. The USB Boot
image is
+listed under Bootload
in the Misc utility images
folder.
Boot the Raspberry Pi from the microSD card. USB Boot should be enabled +automatically.
U-Boot on Ubuntu 20.04 does not seem to support the Raspberry Pi USB.
+Ubuntu 20.10 may work. As a
+work-around, the Raspberry Pi bootloader is configured to directly boot
+Linux. For this to work, the Linux kernel must not be compressed. These
+instructions decompress the kernel and add a script to
+/etc/kernel/postinst.d
to handle kernel upgrades.
The commands in this step are run on the system other than the Raspberry Pi.
+This guide has you go to some extra work so that the stock ext4 partition can +be deleted.
+Download and unpack the official image:
+curl -O https://cdimage.ubuntu.com/releases/22.04/release/ubuntu-22.04.1-preinstalled-server-arm64+raspi.img.xz
+xz -d ubuntu-22.04.1-preinstalled-server-arm64+raspi.img.xz
+
+# or combine them to decompress as you download:
+curl https://cdimage.ubuntu.com/releases/22.04/release/ubuntu-22.04.1-preinstalled-server-arm64+raspi.img.xz | \
+ xz -d > ubuntu-22.04.1-preinstalled-server-arm64+raspi.img
+
Dump the partition table for the image:
+sfdisk -d ubuntu-22.04.1-preinstalled-server-arm64+raspi.img
+
That will output this:
+label: dos
+label-id: 0x638274e3
+device: ubuntu-22.04.1-preinstalled-server-arm64+raspi.img
+unit: sectors
+
+<name>.img1 : start= 2048, size= 524288, type=c, bootable
+<name>.img2 : start= 526336, size= 7193932, type=83
+
The important numbers are 524288 and 7193932. Store those in variables:
+BOOT=524288
+ROOT=7193932
+
Create a partition script:
+cat > partitions << EOF
+label: dos
+unit: sectors
+
+1 : start= 2048, size=$BOOT, type=c, bootable
+2 : start=$((2048+BOOT)), size=$ROOT, type=83
+3 : start=$((2048+BOOT+ROOT)), size=$ROOT, type=83
+EOF
+
Connect the disk:
+Connect the disk to a machine other than the target Raspberry Pi. If any
+filesystems are automatically mounted (e.g. by GNOME) unmount them.
+Determine the device name. For SD, the device name is almost certainly
+/dev/mmcblk0
. For USB SSDs, the device name is /dev/sdX
, where
+X
is a lowercase letter. lsblk
can help determine the device name.
+Set the DISK
environment variable to the device name:
DISK=/dev/mmcblk0 # microSD card
+DISK=/dev/sdX # USB disk
+
Because partitions are named differently for /dev/mmcblk0
and /dev/sdX
+devices, set a second variable used when working with partitions:
export DISKP=${DISK}p # microSD card
+export DISKP=${DISK} # USB disk ($DISKP == $DISK for /dev/sdX devices)
+
Hint: microSD cards connected using a USB reader also have /dev/sdX
+names.
WARNING: The following steps destroy the existing data on the disk. Ensure
+DISK
and DISKP
are correct before proceeding.
Ensure swap partitions are not in use:
+swapon -v
+# If a partition is in use from the disk, disable it:
+sudo swapoff THAT_PARTITION
+
Clear old ZFS labels:
+sudo zpool labelclear -f ${DISK}
+
If a ZFS label still exists from a previous system/attempt, expanding the +pool will result in an unbootable system.
+Hint: If you do not already have the ZFS utilities installed, you can
+install them with: sudo apt install zfsutils-linux
Alternatively, you
+can zero the entire disk with:
+sudo dd if=/dev/zero of=${DISK} bs=1M status=progress
Delete existing partitions:
+echo "label: dos" | sudo sfdisk ${DISK}
+sudo partprobe
+ls ${DISKP}*
+
Make sure there are no partitions, just the file for the disk itself. This +step is not strictly necessary; it exists to catch problems.
+Create the partitions:
+sudo sfdisk $DISK < partitions
+
Loopback mount the image:
+IMG=$(sudo losetup -fP --show \
+ ubuntu-22.04.1-preinstalled-server-arm64+raspi.img)
+
Copy the bootloader data:
+sudo dd if=${IMG}p1 of=${DISKP}1 bs=1M
+
Clear old label(s) from partition 2:
+sudo wipefs -a ${DISKP}2
+
If a filesystem with the writable
label from the Ubuntu image is still
+present in partition 2, the system will not boot initially.
Copy the root filesystem data:
+# NOTE: the destination is p3, not p2.
+sudo dd if=${IMG}p2 of=${DISKP}3 bs=1M status=progress conv=fsync
+
Unmount the image:
+sudo losetup -d $IMG
+
If setting up a USB disk:
+Decompress the kernel:
+sudo -sE
+
+MNT=$(mktemp -d /mnt/XXXXXXXX)
+mkdir -p $MNT/boot $MNT/root
+mount ${DISKP}1 $MNT/boot
+mount ${DISKP}3 $MNT/root
+
+zcat -qf $MNT/boot/vmlinuz >$MNT/boot/vmlinux
+
Modify boot config:
+cat >> $MNT/boot/usercfg.txt << EOF
+kernel=vmlinux
+initramfs initrd.img followkernel
+boot_delay
+EOF
+
Create a script to automatically decompress the kernel after an upgrade:
+cat >$MNT/root/etc/kernel/postinst.d/zz-decompress-kernel << 'EOF'
+#!/bin/sh
+
+set -eu
+
+echo "Updating decompressed kernel..."
+[ -e /boot/firmware/vmlinux ] && \
+ cp /boot/firmware/vmlinux /boot/firmware/vmlinux.bak
+vmlinuxtmp=$(mktemp /boot/firmware/vmlinux.XXXXXXXX)
+zcat -qf /boot/vmlinuz > "$vmlinuxtmp"
+mv "$vmlinuxtmp" /boot/firmware/vmlinux
+EOF
+
+chmod +x $MNT/root/etc/kernel/postinst.d/zz-decompress-kernel
+
Cleanup:
+umount $MNT/*
+rm -rf $MNT
+exit
+
Boot the Raspberry Pi.
+Move the SD/USB disk to the Raspberry Pi. Boot it and login (e.g. via SSH)
+with ubuntu
as the username and password. If you are using SSH, note
+that it takes a little bit for cloud-init to enable password logins on the
+first boot. Set a new password when prompted and login again using that
+password. If you have your local SSH configured to use ControlPersist
,
+you will have to kill the existing SSH process before logging in the second
+time.
Become root:
+sudo -i
+
Set the DISK and DISKP variables again:
+DISK=/dev/mmcblk0 # microSD card
+DISKP=${DISK}p # microSD card
+
+DISK=/dev/sdX # USB disk
+DISKP=${DISK} # USB disk
+
WARNING: Device names can change when moving a device to a different +computer or switching the microSD card from a USB reader to a built-in +slot. Double check the device name before continuing.
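One quick sanity check is to list the block devices and confirm that the size and model match the disk you expect:

# Confirm the device really is the target disk before continuing:
lsblk -o NAME,SIZE,MODEL,SERIAL,TYPE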
+Install ZFS:
+apt update
+
+apt install pv zfs-initramfs
+
Note: Since this is the first boot, you may get Waiting for cache
+lock
because unattended-upgrades
is running in the background.
+Wait for it to finish.
Create the root pool:
+Choose one of the following options:
+Unencrypted:
+zpool create \
+ -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISKP}2
+
WARNING: Encryption has not yet been tested on the Raspberry Pi.
+ZFS native encryption:
+zpool create \
+ -o ashift=12 \
+ -O encryption=on \
+ -O keylocation=prompt -O keyformat=passphrase \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISKP}2
+
LUKS:
+cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISKP}2
+cryptsetup luksOpen ${DISKP}2 luks1
+zpool create \
+ -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool /dev/mapper/luks1
+
Notes:
+The use of ashift=12
is recommended here because many drives
+today have 4 KiB (or larger) physical sectors, even though they
+present 512 B logical sectors. Also, a future replacement drive may
+have 4 KiB physical sectors (in which case ashift=12
is desirable)
+or 4 KiB logical sectors (in which case ashift=12
is required).
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires ACLs
+Also, disabling ACLs apparently breaks umask handling with NFSv4.
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only filenames.
recordsize
is unset (leaving it at the default of 128 KiB). If you
+want to tune it (e.g. -O recordsize=1M
), see these various blog
+posts.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s documentation
+for further information.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain controller.
+Note that xattr=sa
is Linux-specific. If you move your
+xattr=sa
pool to another OpenZFS implementation besides ZFS-on-Linux,
+extended attributes will not be readable (though your data will be). If
+portability of extended attributes is important to you, omit the
+-O xattr=sa
above. Even if you do not want xattr=sa
for the whole
+pool, it is probably fine to use it for /var/log
.
Make sure to include the partition portion of the device path (e.g. the 2 in ${DISKP}2). If you
+forget that, you are specifying the whole disk, which ZFS will then
+re-partition, and you will lose the bootloader partition(s).
ZFS native encryption now
+defaults to aes-256-gcm
.
For LUKS, the key size chosen is 512 bits. However, XTS mode requires two
+keys, so the LUKS key is split in half. Thus, -s 512
means AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup FAQ +for guidance.
Create a filesystem dataset to act as a container:
+zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+
Create a filesystem dataset for the root filesystem:
+UUID=$(dd if=/dev/urandom bs=1 count=100 2>/dev/null |
+ tr -dc 'a-z0-9' | cut -c-6)
+
+zfs create -o canmount=noauto -o mountpoint=/ \
+ -o com.ubuntu.zsys:bootfs=yes \
+ -o com.ubuntu.zsys:last-used=$(date +%s) rpool/ROOT/ubuntu_$UUID
+zfs mount rpool/ROOT/ubuntu_$UUID
+
With ZFS, it is not normally necessary to use a mount command (either
+mount
or zfs mount
). This situation is an exception because of
+canmount=noauto
.
Create datasets:
+zfs create -o com.ubuntu.zsys:bootfs=no -o canmount=off \
+ rpool/ROOT/ubuntu_$UUID/usr
+zfs create -o com.ubuntu.zsys:bootfs=no -o canmount=off \
+ rpool/ROOT/ubuntu_$UUID/var
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib
+zfs create rpool/ROOT/ubuntu_$UUID/var/log
+zfs create rpool/ROOT/ubuntu_$UUID/var/spool
+
+zfs create -o canmount=off -o mountpoint=/ \
+ rpool/USERDATA
+zfs create -o com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_$UUID \
+ -o canmount=on -o mountpoint=/root \
+ rpool/USERDATA/root_$UUID
+chmod 700 /mnt/root
+
The datasets below are optional, depending on your preferences and/or +software choices.
+If you wish to separate these to exclude them from snapshots:
+zfs create rpool/ROOT/ubuntu_$UUID/var/cache
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/nfs
+zfs create rpool/ROOT/ubuntu_$UUID/var/tmp
+chmod 1777 /mnt/var/tmp
+
If desired (the Ubuntu installer creates these):
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/apt
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/dpkg
+
If you use /srv on this system:
+zfs create -o com.ubuntu.zsys:bootfs=no \
+ rpool/ROOT/ubuntu_$UUID/srv
+
If you use /usr/local on this system:
+zfs create rpool/ROOT/ubuntu_$UUID/usr/local
+
If this system will have games installed:
+zfs create rpool/ROOT/ubuntu_$UUID/var/games
+
If this system will have a GUI:
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/AccountsService
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/NetworkManager
+
If this system will use Docker (which manages its own datasets & +snapshots):
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/docker
+
If this system will store local email in /var/mail:
+zfs create rpool/ROOT/ubuntu_$UUID/var/mail
+
If this system will use Snap packages:
+zfs create rpool/ROOT/ubuntu_$UUID/var/snap
+
If you use /var/www on this system:
+zfs create rpool/ROOT/ubuntu_$UUID/var/www
+
For a mirror or raidz topology, create a dataset for /boot/grub
:
zfs create -o com.ubuntu.zsys:bootfs=no bpool/grub
+
A tmpfs is recommended later, but if you want a separate dataset for
+/tmp
:
zfs create -o com.ubuntu.zsys:bootfs=no \
+ rpool/ROOT/ubuntu_$UUID/tmp
+chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data.
+If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for /tmp
,
+as shown above. This keeps the /tmp
data out of snapshots of your root
+filesystem. It also allows you to set a quota on rpool/ROOT/ubuntu_$UUID/tmp, if you want
+to limit the maximum space used. Otherwise, you can use a tmpfs (RAM
+filesystem) later.
Note: If you separate a directory required for booting (e.g. /etc
)
+into its own dataset, you must add it to
+ZFS_INITRD_ADDITIONAL_DATASETS
in /etc/default/zfs
. Datasets
+with canmount=off
(like rpool/usr
above) do not matter for this.
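As a hypothetical example (this guide does not split /etc), the line you would add to /etc/default/zfs on the installed system might look like this, with ubuntu_UUID being the root dataset name created earlier:

# Only needed if a boot-critical directory lives on its own dataset:
ZFS_INITRD_ADDITIONAL_DATASETS="rpool/ROOT/ubuntu_UUID/etc"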
Optional: Ignore synchronous requests:
+microSD cards are relatively slow. If you want to increase performance
+(especially when installing packages) at the cost of some safety, you can
+disable flushing of synchronous requests (e.g. fsync()
, O_[D]SYNC
):
Choose one of the following options:
+For the root filesystem, but not user data:
+zfs set sync=disabled rpool/ROOT
+
For everything:
+zfs set sync=disabled rpool
+
ZFS is transactional, so it will still be crash consistent. However, you
+should leave sync
at its default of standard
if this system needs
+to guarantee persistence (e.g. if it is a database or NFS server).
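If you do disable it for the install, you can return to the default afterwards with a single command, for example:

# Revert sync to its inherited default (standard) after the install:
zfs inherit sync rpool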
Copy the system into the ZFS filesystems:
+(cd /; tar -cf - --one-file-system --warning=no-file-ignored .) | \
+ pv -p -bs $(du -sxm --apparent-size / | cut -f1)m | \
+ (cd /mnt ; tar -x)
+
Configure the hostname:
+Replace HOSTNAME
with the desired hostname:
hostname HOSTNAME
+hostname > /mnt/etc/hostname
+vi /mnt/etc/hosts
+
Add a line:
+127.0.1.1 HOSTNAME
+or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
Stop zed
:
systemctl stop zed
+
Bind the virtual filesystems from the running environment to the new
+ZFS environment and chroot
into it:
mount --make-private --rbind /boot/firmware /mnt/boot/firmware
+mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /run /mnt/run
+mount --make-private --rbind /sys /mnt/sys
+chroot /mnt /usr/bin/env DISK=$DISK UUID=$UUID bash --login
+
Configure a basic system environment:
+apt update
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available:
dpkg-reconfigure locales
+dpkg-reconfigure tzdata
+
For LUKS installs only, setup /etc/crypttab
:
# cryptsetup is already installed, but this marks it as manually
+# installed so it is not automatically removed.
+apt install --yes cryptsetup
+
+echo luks1 UUID=$(blkid -s UUID -o value ${DISKP}2) none \
+ luks,discard,initramfs > /etc/crypttab
+
The use of initramfs is a work-around for the fact that cryptsetup does not support ZFS.
Optional: Mount a tmpfs to /tmp
If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+systemctl enable tmp.mount
+
Setup system groups:
+addgroup --system lpadmin
+addgroup --system sambashare
+
Fix filesystem mount ordering:
+We need to activate zfs-mount-generator
. This makes systemd aware of
+the separate mountpoints, which is important for things like /var/log
+and /var/tmp
. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
feature
+of systemd automatically use After=var-tmp.mount
.
mkdir /etc/zfs/zfs-list.cache
+touch /etc/zfs/zfs-list.cache/rpool
+zed -F &
+
Force a cache update:
+zfs set canmount=noauto rpool/ROOT/ubuntu_$UUID
+
Verify that zed
updated the cache by making sure this is not empty,
+which will take a few seconds:
cat /etc/zfs/zfs-list.cache/rpool
+
Stop zed
:
fg
+Press Ctrl-C.
+
Fix the paths to eliminate /mnt
:
sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
+
Remove old filesystem from /etc/fstab
:
vi /etc/fstab
+# Remove the old root filesystem line:
+# LABEL=writable / ext4 ...
+
Configure kernel command line:
+cp /boot/firmware/cmdline.txt /boot/firmware/cmdline.txt.bak
+sed -i "s|root=LABEL=writable rootfstype=ext4|root=ZFS=rpool/ROOT/ubuntu_$UUID|" \
+ /boot/firmware/cmdline.txt
+sed -i "s| fixrtc||" /boot/firmware/cmdline.txt
+sed -i "s|$| init_on_alloc=0|" /boot/firmware/cmdline.txt
+
The fixrtc
script is not compatible with ZFS and will cause the boot
+to hang for 180 seconds.
The init_on_alloc=0
is to address performance regressions.
Optional (but highly recommended): Make debugging booting easier:
+sed -i "s|$| nosplash|" /boot/firmware/cmdline.txt
+
Reboot:
+exit
+reboot
+
Wait for the newly installed system to boot normally. Login as ubuntu
.
Become root:
+sudo -i
+
Set the DISK variable again:
+DISK=/dev/mmcblk0 # microSD card
+
+DISK=/dev/sdX # USB disk
+
Delete the ext4 partition and expand the ZFS partition:
+sfdisk $DISK --delete 3
+echo ", +" | sfdisk --no-reread -N 2 $DISK
+
Note: This does not automatically expand the pool. That will happen on reboot.
+Create a user account:
+Replace YOUR_USERNAME
with your desired username:
username=YOUR_USERNAME
+
+UUID=$(dd if=/dev/urandom bs=1 count=100 2>/dev/null |
+ tr -dc 'a-z0-9' | cut -c-6)
+ROOT_DS=$(zfs list -o name | awk '/ROOT\/ubuntu_/{print $1;exit}')
+zfs create -o com.ubuntu.zsys:bootfs-datasets=$ROOT_DS \
+ -o canmount=on -o mountpoint=/home/$username \
+ rpool/USERDATA/${username}_$UUID
+adduser $username
+
+cp -a /etc/skel/. /home/$username
+chown -R $username:$username /home/$username
+usermod -a -G adm,cdrom,dip,lpadmin,lxd,plugdev,sambashare,sudo $username
+
Reboot:
+reboot
+
Wait for the system to boot normally. Login using the account you +created.
+Become root:
+sudo -i
+
Expand the ZFS pool:
+Verify the pool expanded:
+zfs list rpool
+
If it did not automatically expand, try to expand it manually:
+DISK=/dev/mmcblk0 # microSD card
+DISKP=${DISK}p # microSD card
+
+DISK=/dev/sdX # USB disk
+DISKP=${DISK} # USB disk
+
+zpool online -e rpool ${DISKP}2
+
Delete the ubuntu
user:
deluser --remove-home ubuntu
+
Optional: Remove cloud-init:
+vi /etc/netplan/01-netcfg.yaml
+
network:
+ version: 2
+ ethernets:
+ eth0:
+ dhcp4: true
+
rm /etc/netplan/50-cloud-init.yaml
+apt purge --autoremove ^cloud-init
+rm -rf /etc/cloud
+
Optional: Remove other storage packages:
+apt purge --autoremove bcache-tools btrfs-progs cloud-guest-utils lvm2 \
+ mdadm multipath-tools open-iscsi overlayroot xfsprogs
+
Upgrade the minimal system:
+apt dist-upgrade --yes
+
Optional: Install a full GUI environment:
+apt install --yes ubuntu-desktop
+echo dtoverlay=vc4-fkms-v3d >> /boot/firmware/usercfg.txt
+
Hint: If you are installing a full GUI environment, you will likely +want to remove cloud-init as discussed above but manage your network with +NetworkManager:
+rm /etc/netplan/*.yaml
+vi /etc/netplan/01-network-manager-all.yaml
+
network:
+ version: 2
+ renderer: NetworkManager
+
Optional (but recommended): Disable log compression:
+As /var/log
is already compressed by ZFS, logrotate’s compression is
+going to burn CPU and disk I/O for (in most cases) very little gain. Also,
+if you are making snapshots of /var/log
, logrotate’s compression will
+actually waste space, as the uncompressed data will live on in the
+snapshot. You can edit the files in /etc/logrotate.d
by hand to comment
+out compress
, or use this loop (copy-and-paste highly recommended):
for file in /etc/logrotate.d/* ; do
+ if grep -Eq "(^|[^#y])compress" "$file" ; then
+ sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file"
+ fi
+done
+
Reboot:
+reboot
+
Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
Optional: For LUKS installs only, backup the LUKS header:
+sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected by +your LUKS passphrase, but you may wish to use additional encryption.
+Hint: If you created a mirror or raidz topology, repeat this for each
+LUKS volume (luks2
, etc.).
The Ubuntu installer still has ZFS support, but it was almost removed for 22.04 and it no longer installs zsys. At the moment, this HOWTO still uses zsys, but that will probably be removed in the near future.
+If you are looking to install on a Raspberry Pi, see +Ubuntu 20.04 Root on ZFS for Raspberry Pi.
+This HOWTO uses a whole physical disk.
Do not use these instructions for dual-booting.
Backup your data. Any existing data will be lost.
Ubuntu 22.04.1 (“jammy”) Desktop CD +(not any server images)
Installing on a drive which presents 4 KiB logical sectors (a “4Kn” drive) only works with UEFI booting. This is not unique to ZFS. GRUB does not and will not work on 4Kn with legacy (BIOS) booting.
Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of memory +is recommended for normal performance in basic workloads. If you wish to use +deduplication, you will need massive amounts of RAM. Enabling +deduplication is a permanent change that cannot be easily reverted.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @rlaager.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo apt install python3-pip
+
+pip3 install -r docs/requirements.txt
+
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request. Mention @rlaager.
This guide supports three different encryption options: unencrypted, ZFS +native encryption, and LUKS. With any option, all ZFS features are fully +available.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+ZFS native encryption encrypts the data and most metadata in the root
+pool. It does not encrypt dataset or snapshot names or properties. The
+boot pool is not encrypted at all, but it only contains the bootloader,
+kernel, and initrd. (Unless you put a password in /etc/fstab
, the
+initrd is unlikely to contain sensitive data.) The system cannot boot
+without the passphrase being entered at the console. Performance is
+good. As the encryption happens in ZFS, even if multiple disks (mirror
+or raidz topologies) are used, the data only has to be encrypted once.
LUKS encrypts almost everything. The only unencrypted data is the bootloader, +kernel, and initrd. The system cannot boot without the passphrase being +entered at the console. Performance is good, but LUKS sits underneath ZFS, so +if multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+Boot the Ubuntu Live CD. From the GRUB boot menu, select Try or Install Ubuntu. +On the Welcome page, select your preferred language and Try Ubuntu. +Connect your system to the Internet as appropriate (e.g. join your WiFi network). +Open a terminal (press Ctrl-Alt-T).
Setup and update the repositories:
+sudo apt update
+
Optional: Install and start the OpenSSH server in the Live CD environment:
+If you have a second system, using SSH to access the target system can be +convenient:
+passwd
+# There is no current password.
+sudo apt install --yes openssh-server vim
+
Installing the full vim
package fixes terminal problems that occur when
+using the vim-tiny
package (that ships in the Live CD environment) over
+SSH.
Hint: You can find your IP address with
+ip addr show scope global | grep inet
. Then, from your main machine,
+connect with ssh ubuntu@IP
.
Disable automounting:
+If the disk has been used before (with partitions at the same offsets), +previous filesystems (e.g. the ESP) will automount if not disabled:
+gsettings set org.gnome.desktop.media-handling automount false
+
Become root:
+sudo -i
+
Install ZFS in the Live CD environment:
+apt install --yes debootstrap gdisk zfsutils-linux
+
+systemctl stop zed
+
Set a variable with the disk name:
+DISK=/dev/disk/by-id/scsi-SATA_disk1
+
Always use the long /dev/disk/by-id/*
aliases with ZFS. Using the
+/dev/sd*
device nodes directly can cause sporadic import failures,
+especially on systems that have more than one storage pool.
Hints:
+ls -la /dev/disk/by-id
will list the aliases.
Are you doing this in a virtual machine? If your virtual disk is missing
+from /dev/disk/by-id
, use /dev/vda
if you are using KVM with
+virtio; otherwise, read the troubleshooting
+section.
For a mirror or raidz topology, use DISK1
, DISK2
, etc.
When choosing a boot pool size, consider how you will use the space. A +kernel and initrd may consume around 100M. If you have multiple kernels +and take snapshots, you may find yourself low on boot pool space, +especially if you need to regenerate your initramfs images, which may be +around 85M each. Size your boot pool appropriately for your needs.
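As a rough sanity check once the system is installed, something like the following shows the boot pool usage and how much each kernel and initramfs consumes (the paths and the bpool name assume this HOWTO's layout):
+zfs list bpool
+du -sh /boot/vmlinuz-* /boot/initrd.img-*
+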
If you are re-using a disk, clear it as necessary:
+Ensure swap partitions are not in use:
+swapoff --all
+
If the disk was previously used in an MD array:
+apt install --yes mdadm
+
+# See if one or more MD arrays are active:
+cat /proc/mdstat
+# If so, stop them (replace ``md0`` as required):
+mdadm --stop /dev/md0
+
+# For an array using the whole disk:
+mdadm --zero-superblock --force $DISK
+# For an array using a partition (e.g. a swap partition per this HOWTO):
+mdadm --zero-superblock --force ${DISK}-part2
+
If the disk was previously used with zfs:
+wipefs -a $DISK
+
For flash-based storage, if the disk was previously used, you may wish to +do a full-disk discard (TRIM/UNMAP), which can improve performance:
+blkdiscard -f $DISK
+
Clear the partition table:
+sgdisk --zap-all $DISK
+
If you get a message about the kernel still using the old partition table, +reboot and start over (except that you can skip this step).
+Create bootloader partition(s):
+sgdisk -n1:1M:+512M -t1:EF00 $DISK
+
+# For legacy (BIOS) booting:
+sgdisk -a1 -n5:24K:+1000K -t5:EF02 $DISK
+
Note: While the Ubuntu installer uses an MBR label for legacy (BIOS)
+booting, this HOWTO uses GPT partition labels for both UEFI and legacy
+(BIOS) booting. This is simpler than having two options. It also
+provides forward compatibility (future proofing). In other words, for
+legacy (BIOS) booting, this will allow you to move the disk(s) to a new
+system/motherboard in the future without having to rebuild the pool (and
+restore your data from a backup). The ESP is created in both cases for
+similar reasons. Additionally, the ESP is used for /boot/grub
in
+single-disk installs, as discussed below.
Create a partition for swap:
+Previous versions of this HOWTO put swap on a zvol. Ubuntu recommends +against this configuration due to deadlocks. There +is a bug report upstream.
+Putting swap on a partition gives up the benefit of ZFS checksums (for your +swap). That is probably the right trade-off given the reports of ZFS +deadlocks with swap. If you are bothered by this, simply do not enable +swap.
+Choose one of the following options if you want swap:
+For a single-disk install:
+sgdisk -n2:0:+500M -t2:8200 $DISK
+
For a mirror or raidz topology:
+sgdisk -n2:0:+500M -t2:FD00 $DISK
+
Adjust the swap size to your needs. If you wish to enable hibernation (which only works for unencrypted installs), the swap partition must be at least as large as the system’s RAM.
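For example, if you want hibernation, check the installed RAM first and then adjust the swap size in the commands above accordingly (the 16G below is only an illustration):
+free -h | awk '/^Mem:/ {print $2}'
+# e.g. use -n2:0:+16G in the sgdisk command above for a system with 16 GiB of RAM
+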
+Create a boot pool partition:
+sgdisk -n3:0:+2G -t3:BE00 $DISK
+
The Ubuntu installer uses 5% of the disk space constrained to a minimum of +500 MiB and a maximum of 2 GiB. Making this too small (and 500 MiB might +be too small) can result in an inability to upgrade the kernel.
+Create a root pool partition:
+Choose one of the following options:
+Unencrypted or ZFS native encryption:
+sgdisk -n4:0:0 -t4:BF00 $DISK
+
LUKS:
+sgdisk -n4:0:0 -t4:8309 $DISK
+
If you are creating a mirror or raidz topology, repeat the partitioning +commands for all the disks which will be part of the pool.
+Create the boot pool:
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o compatibility=grub2 \
+ -o feature@livelist=enabled \
+ -o feature@zpool_checkpoint=enabled \
+ -O devices=off \
+ -O acltype=posixacl -O xattr=sa \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/boot -R /mnt \
+ bpool ${DISK}-part3
+
You should not need to customize any of the options for the boot pool.
+Ignore the warnings about the features “not in specified ‘compatibility’ +feature set.”
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c.
+This step creates a separate boot pool for /boot
with the features
+limited to only those that GRUB supports, allowing the root pool to use
+any/all features. Note that GRUB opens the pool read-only, so all
+read-only compatible features are “supported” by GRUB.
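If you want to double-check the result after creating the pool, the compatibility setting and the feature states can be inspected (bpool is the name used in this HOWTO):
+zpool get compatibility bpool
+zpool get all bpool | grep feature@
+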
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ bpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part3 \
+ /dev/disk/by-id/scsi-SATA_disk2-part3
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
The boot pool name is no longer arbitrary. It _must_ be bpool
.
+If you really want to rename it, edit /etc/grub.d/10_linux_zfs
later,
+after GRUB is installed (and run update-grub
).
Feature Notes:
+The allocation_classes
feature should be safe to use. However, unless
+one is using it (i.e. a special
vdev), there is no point to enabling
+it. It is extremely unlikely that someone would use this feature for a
+boot pool. If one cares about speeding up the boot pool, it would make
+more sense to put the whole pool on the faster disk rather than using it
+as a special
vdev.
The device_rebuild
feature should be safe to use (except on raidz,
+which it is incompatible with), but the boot pool is small, so this does
+not matter in practice.
The log_spacemap
and spacemap_v2
features have been tested and
+are safe to use. The boot pool is small, so these do not matter in
+practice.
The project_quota
feature has been tested and is safe to use. This
+feature is extremely unlikely to matter for the boot pool.
The resilver_defer
should be safe but the boot pool is small enough
+that it is unlikely to be necessary.
As a read-only compatible feature, the userobj_accounting
feature
+should be compatible in theory, but in practice, GRUB can fail with an
+“invalid dnode type” error. This feature does not matter for /boot
+anyway.
Create the root pool:
+Choose one of the following options:
+Unencrypted:
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
ZFS native encryption:
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
+ -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
LUKS:
+cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK}-part4
+cryptsetup luksOpen ${DISK}-part4 luks1
+zpool create \
+ -o ashift=12 \
+ -o autotrim=on \
+ -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
+ -O compression=lz4 \
+ -O normalization=formD \
+ -O relatime=on \
+ -O canmount=off -O mountpoint=/ -R /mnt \
+ rpool /dev/mapper/luks1
+
Notes:
+The use of ashift=12
is recommended here because many drives
+today have 4 KiB (or larger) physical sectors, even though they
+present 512 B logical sectors. Also, a future replacement drive may
+have 4 KiB physical sectors (in which case ashift=12
is desirable)
+or 4 KiB logical sectors (in which case ashift=12
is required).
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires ACLs
+Also, disabling ACLs apparently breaks umask handling with NFSv4.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain controller.
+Note that xattr=sa
is Linux-specific. If you move your
+xattr=sa
pool to another OpenZFS implementation besides ZFS-on-Linux,
+extended attributes will not be readable (though your data will be). If
+portability of extended attributes is important to you, omit the
+-O xattr=sa
above. Even if you do not want xattr=sa
for the whole
+pool, it is probably fine to use it for /var/log
.
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only filenames.
recordsize
is unset (leaving it at the default of 128 KiB). If you
+want to tune it (e.g. -O recordsize=1M
), see these various blog
+posts.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s documentation
+for further information.
Make sure to include the -part4
portion of the drive path. If you
+forget that, you are specifying the whole disk, which ZFS will then
+re-partition, and you will lose the bootloader partition(s).
ZFS native encryption now
+defaults to aes-256-gcm
.
For LUKS, the key size chosen is 512 bits. However, XTS mode requires two
+keys, so the LUKS key is split in half. Thus, -s 512
means AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup FAQ +for guidance.
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ rpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ /dev/disk/by-id/scsi-SATA_disk2-part4
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
When using LUKS with mirror or raidz topologies, use
+/dev/mapper/luks1
, /dev/mapper/luks2
, etc., which you will have
+to create using cryptsetup
.
The pool name is arbitrary. If changed, the new name must be used
+consistently. On systems that can automatically install to ZFS, the root
+pool is named rpool
by default.
Create filesystem datasets to act as containers:
+zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+zfs create -o canmount=off -o mountpoint=none bpool/BOOT
+
Create filesystem datasets for the root and boot filesystems:
+UUID=$(dd if=/dev/urandom bs=1 count=100 2>/dev/null |
+ tr -dc 'a-z0-9' | cut -c-6)
+
+zfs create -o mountpoint=/ \
+ -o com.ubuntu.zsys:bootfs=yes \
+ -o com.ubuntu.zsys:last-used=$(date +%s) rpool/ROOT/ubuntu_$UUID
+
+zfs create -o mountpoint=/boot bpool/BOOT/ubuntu_$UUID
+
Create datasets:
+zfs create -o com.ubuntu.zsys:bootfs=no -o canmount=off \
+ rpool/ROOT/ubuntu_$UUID/usr
+zfs create -o com.ubuntu.zsys:bootfs=no -o canmount=off \
+ rpool/ROOT/ubuntu_$UUID/var
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib
+zfs create rpool/ROOT/ubuntu_$UUID/var/log
+zfs create rpool/ROOT/ubuntu_$UUID/var/spool
+
+zfs create -o canmount=off -o mountpoint=/ \
+ rpool/USERDATA
+zfs create -o com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_$UUID \
+ -o canmount=on -o mountpoint=/root \
+ rpool/USERDATA/root_$UUID
+chmod 700 /mnt/root
+
The datasets below are optional, depending on your preferences and/or +software choices.
+If you wish to separate these to exclude them from snapshots:
+zfs create rpool/ROOT/ubuntu_$UUID/var/cache
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/nfs
+zfs create rpool/ROOT/ubuntu_$UUID/var/tmp
+chmod 1777 /mnt/var/tmp
+
If desired (the Ubuntu installer creates these):
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/apt
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/dpkg
+
If you use /srv on this system:
+zfs create -o com.ubuntu.zsys:bootfs=no \
+ rpool/ROOT/ubuntu_$UUID/srv
+
If you use /usr/local on this system:
+zfs create rpool/ROOT/ubuntu_$UUID/usr/local
+
If this system will have games installed:
+zfs create rpool/ROOT/ubuntu_$UUID/var/games
+
If this system will have a GUI:
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/AccountsService
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/NetworkManager
+
If this system will use Docker (which manages its own datasets & +snapshots):
+zfs create rpool/ROOT/ubuntu_$UUID/var/lib/docker
+
If this system will store local email in /var/mail:
+zfs create rpool/ROOT/ubuntu_$UUID/var/mail
+
If this system will use Snap packages:
+zfs create rpool/ROOT/ubuntu_$UUID/var/snap
+
If you use /var/www on this system:
+zfs create rpool/ROOT/ubuntu_$UUID/var/www
+
For a mirror or raidz topology, create a dataset for /boot/grub
:
zfs create -o com.ubuntu.zsys:bootfs=no bpool/grub
+
A tmpfs is recommended later, but if you want a separate dataset for
+/tmp
:
zfs create -o com.ubuntu.zsys:bootfs=no \
+ rpool/ROOT/ubuntu_$UUID/tmp
+chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data.
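Purely as an illustration of that separation (not a step in this HOWTO), a snapshot of the OS dataset could later be rolled back while everything under rpool/USERDATA is left untouched:
+zfs snapshot rpool/ROOT/ubuntu_$UUID@pre-change
+# ...make changes to the OS...
+# zfs rollback rpool/ROOT/ubuntu_$UUID@pre-change
+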
+If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for /tmp
,
+as shown above. This keeps the /tmp
data out of snapshots of your root
+filesystem. It also allows you to set a quota on rpool/tmp
, if you want
+to limit the maximum space used. Otherwise, you can use a tmpfs (RAM
+filesystem) later.
Note: If you separate a directory required for booting (e.g. /etc
)
+into its own dataset, you must add it to
+ZFS_INITRD_ADDITIONAL_DATASETS
in /etc/default/zfs
. Datasets
+with canmount=off
(like rpool/usr
above) do not matter for this.
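As a hypothetical example (the /etc dataset here is illustrative and not part of this HOWTO's layout), the setting would look like this in /etc/default/zfs:
+# ZFS_INITRD_ADDITIONAL_DATASETS="rpool/ROOT/ubuntu_$UUID/etc"
+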
Mount a tmpfs at /run:
+mkdir /mnt/run
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+
Install the minimal system:
+debootstrap jammy /mnt
+
The debootstrap
command leaves the new system in an unconfigured state.
+An alternative to using debootstrap
is to copy the entirety of a
+working system into the new ZFS root.
Copy in zpool.cache:
+mkdir /mnt/etc/zfs
+cp /etc/zfs/zpool.cache /mnt/etc/zfs/
+
Configure the hostname:
+Replace HOSTNAME
with the desired hostname:
hostname HOSTNAME
+hostname > /mnt/etc/hostname
+vi /mnt/etc/hosts
+
Add a line:
+127.0.1.1 HOSTNAME
+or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
Configure the network interface:
+Find the interface name:
+ip addr show
+
Adjust NAME
below to match your interface name:
vi /mnt/etc/netplan/01-netcfg.yaml
+
network:
+ version: 2
+ ethernets:
+ NAME:
+ dhcp4: true
+
Customize this file if the system is not a DHCP client.
+Configure the package sources:
+vi /mnt/etc/apt/sources.list
+
deb http://archive.ubuntu.com/ubuntu jammy main restricted universe multiverse
+deb http://archive.ubuntu.com/ubuntu jammy-updates main restricted universe multiverse
+deb http://archive.ubuntu.com/ubuntu jammy-backports main restricted universe multiverse
+deb http://security.ubuntu.com/ubuntu jammy-security main restricted universe multiverse
+
Bind the virtual filesystems from the LiveCD environment to the new
+system and chroot
into it:
mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+chroot /mnt /usr/bin/env DISK=$DISK UUID=$UUID bash --login
+
Note: This is using --rbind
, not --bind
.
Configure a basic system environment:
+apt update
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available:
dpkg-reconfigure locales tzdata keyboard-configuration console-setup
+
Install your preferred text editor:
+apt install --yes nano
+
+apt install --yes vim
+
Installing the full vim
package fixes terminal problems that occur when
+using the vim-tiny
package (that is installed by debootstrap
) over
+SSH.
For LUKS installs only, setup /etc/crypttab
:
apt install --yes cryptsetup
+
+echo luks1 /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part4) \
+ none luks,discard,initramfs > /etc/crypttab
+
The use of initramfs
 is a work-around for the fact that cryptsetup does not support
+ZFS.
Hint: If you are creating a mirror or raidz topology, repeat the
+/etc/crypttab
entries for luks2
, etc. adjusting for each disk.
Create the EFI filesystem:
+Perform these steps for both UEFI and legacy (BIOS) booting:
+apt install --yes dosfstools
+
+mkdosfs -F 32 -s 1 -n EFI ${DISK}-part1
+mkdir /boot/efi
+echo /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part1) \
+ /boot/efi vfat defaults 0 0 >> /etc/fstab
+mount /boot/efi
+
For a mirror or raidz topology, repeat the mkdosfs for the additional +disks, but do not repeat the other commands.
+Note: The -s 1
for mkdosfs
is only necessary for drives which
+present 4 KiB logical sectors (“4Kn” drives) to meet the minimum cluster
+size (given the partition size of 512 MiB) for FAT32. It also works fine on
+drives which present 512 B sectors.
Put /boot/grub
on the EFI System Partition:
For a single-disk install only:
+mkdir /boot/efi/grub /boot/grub
+echo /boot/efi/grub /boot/grub none defaults,bind 0 0 >> /etc/fstab
+mount /boot/grub
+
This allows GRUB to write to /boot/grub
(since it is on a FAT-formatted
+ESP instead of on ZFS), which means that /boot/grub/grubenv
and the
+recordfail
feature works as expected: if the boot fails, the normally
+hidden GRUB menu will be shown on the next boot. For a mirror or raidz
+topology, we do not want GRUB writing to the EFI System Partition. This is
+because we duplicate it at install without a mechanism to update the copies
+when the GRUB configuration changes (e.g. as the kernel is upgraded). Thus,
+we keep /boot/grub
on the boot pool for the mirror or raidz topologies.
+This preserves correct mirroring/raidz behavior, at the expense of being
+able to write to /boot/grub/grubenv
and thus the recordfail
+behavior.
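If you are curious, the state GRUB records there can be inspected once the system is installed; grub-editenv ships with GRUB:
+grub-editenv /boot/grub/grubenv list
+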
Install GRUB/Linux/ZFS in the chroot environment for the new system:
+Choose one of the following options:
+Install GRUB/Linux/ZFS for legacy (BIOS) booting:
+apt install --yes grub-pc linux-image-generic zfs-initramfs zsys
+
Select (using the space bar) all of the disks (not partitions) in your +pool.
+Install GRUB/Linux/ZFS for UEFI booting:
+apt install --yes \
+ grub-efi-amd64 grub-efi-amd64-signed linux-image-generic \
+ shim-signed zfs-initramfs zsys
+
Notes:
+Ignore any error messages saying ERROR: Couldn't resolve device
and
+WARNING: Couldn't determine root device
. cryptsetup does not
+support ZFS.
Ignore any error messages saying Module zfs not found
and
+couldn't connect to zsys daemon
. The first seems to occur due to a
+version mismatch between the Live CD kernel and the chroot environment,
+but this is irrelevant since the module is already loaded. The second
+may be caused by the first but either way is irrelevant since zed
+is started manually later.
For a mirror or raidz topology, this step only installs GRUB on the
+first disk. The other disk(s) will be handled later. For some reason,
+grub-efi-amd64 does not prompt for install_devices
here, but does
+after a reboot.
Optional: Remove os-prober:
+apt purge --yes os-prober
+
This avoids error messages from update-grub
. os-prober
is only
+necessary in dual-boot configurations.
Set a root password:
+passwd
+
Configure swap:
+Choose one of the following options if you want swap:
+For an unencrypted single-disk install:
+mkswap -f ${DISK}-part2
+echo /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part2) \
+ none swap discard 0 0 >> /etc/fstab
+swapon -a
+
For an unencrypted mirror or raidz topology:
+apt install --yes mdadm
+
+# Adjust the level (ZFS raidz = MD raid5, raidz2 = raid6) and
+# raid-devices if necessary and specify the actual devices.
+mdadm --create /dev/md0 --metadata=1.2 --level=mirror \
+ --raid-devices=2 ${DISK1}-part2 ${DISK2}-part2
+mkswap -f /dev/md0
+echo /dev/disk/by-uuid/$(blkid -s UUID -o value /dev/md0) \
+ none swap discard 0 0 >> /etc/fstab
+
For an encrypted (LUKS or ZFS native encryption) single-disk install:
+apt install --yes cryptsetup
+
+echo swap ${DISK}-part2 /dev/urandom \
+ swap,cipher=aes-xts-plain64:sha256,size=512 >> /etc/crypttab
+echo /dev/mapper/swap none swap defaults 0 0 >> /etc/fstab
+
For an encrypted (LUKS or ZFS native encryption) mirror or raidz +topology:
+apt install --yes cryptsetup mdadm
+
+# Adjust the level (ZFS raidz = MD raid5, raidz2 = raid6) and
+# raid-devices if necessary and specify the actual devices.
+mdadm --create /dev/md0 --metadata=1.2 --level=mirror \
+ --raid-devices=2 ${DISK1}-part2 ${DISK2}-part2
+echo swap /dev/md0 /dev/urandom \
+ swap,cipher=aes-xts-plain64:sha256,size=512 >> /etc/crypttab
+echo /dev/mapper/swap none swap defaults 0 0 >> /etc/fstab
+
Optional (but recommended): Mount a tmpfs to /tmp
If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+systemctl enable tmp.mount
+
Setup system groups:
+addgroup --system lpadmin
+addgroup --system lxd
+addgroup --system sambashare
+
Optional: Install SSH:
+apt install --yes openssh-server
+
+vi /etc/ssh/sshd_config
+# Set: PermitRootLogin yes
+
Verify that the ZFS boot filesystem is recognized:
+grub-probe /boot
+
Refresh the initrd files:
+update-initramfs -c -k all
+
Note: Ignore any error messages saying ERROR: Couldn't resolve
+device
and WARNING: Couldn't determine root device
. cryptsetup
+does not support ZFS.
Disable memory zeroing:
+vi /etc/default/grub
+# Add init_on_alloc=0 to: GRUB_CMDLINE_LINUX_DEFAULT
+# Save and quit (or see the next step).
+
This is to address performance regressions.
+Optional (but highly recommended): Make debugging GRUB easier:
+vi /etc/default/grub
+# Comment out: GRUB_TIMEOUT_STYLE=hidden
+# Set: GRUB_TIMEOUT=5
+# Below GRUB_TIMEOUT, add: GRUB_RECORDFAIL_TIMEOUT=5
+# Remove quiet and splash from: GRUB_CMDLINE_LINUX_DEFAULT
+# Uncomment: GRUB_TERMINAL=console
+# Save and quit.
+
Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired.
+Update the boot configuration:
+update-grub
+
Note: Ignore errors from osprober
, if present.
Install the boot loader:
+Choose one of the following options:
+For legacy (BIOS) booting, install GRUB to the MBR:
+grub-install $DISK
+
Note that you are installing GRUB to the whole disk, not a partition.
+If you are creating a mirror or raidz topology, repeat the
+grub-install
command for each disk in the pool.
For UEFI booting, install GRUB to the ESP:
+grub-install --target=x86_64-efi --efi-directory=/boot/efi \
+ --bootloader-id=ubuntu --recheck --no-floppy
+
Disable grub-initrd-fallback.service
+For a mirror or raidz topology:
+systemctl mask grub-initrd-fallback.service
+
This is the service for /boot/grub/grubenv
which does not work on
+mirrored or raidz topologies. Disabling this keeps it from blocking
+subsequent mounts of /boot/grub
if that mount ever fails.
Another option would be to set RequiresMountsFor=/boot/grub
via a
+drop-in unit, but that is more work to do here for no reason. Hopefully
+this bug
+will be fixed upstream.
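For reference only, a minimal sketch of that drop-in approach (not used by this HOWTO) would be:
+mkdir -p /etc/systemd/system/grub-initrd-fallback.service.d
+cat > /etc/systemd/system/grub-initrd-fallback.service.d/override.conf <<'EOF'
+[Unit]
+RequiresMountsFor=/boot/grub
+EOF
+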
Fix filesystem mount ordering:
+We need to activate zfs-mount-generator
. This makes systemd aware of
+the separate mountpoints, which is important for things like /var/log
+and /var/tmp
. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
feature
+of systemd automatically use After=var-tmp.mount
.
mkdir /etc/zfs/zfs-list.cache
+touch /etc/zfs/zfs-list.cache/bpool
+touch /etc/zfs/zfs-list.cache/rpool
+zed -F &
+
Verify that zed
updated the cache by making sure these are not empty:
cat /etc/zfs/zfs-list.cache/bpool
+cat /etc/zfs/zfs-list.cache/rpool
+
If either is empty, force a cache update and check again:
+zfs set canmount=on bpool/BOOT/ubuntu_$UUID
+zfs set canmount=on rpool/ROOT/ubuntu_$UUID
+
If they are still empty, stop zed (as below), start zed (as above) and try +again.
+Once the files have data, stop zed
:
fg
+Press Ctrl-C.
+
Fix the paths to eliminate /mnt
:
sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
+
Exit from the chroot
environment back to the LiveCD environment:
exit
+
Run these commands in the LiveCD environment to unmount all +filesystems:
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+
Reboot:
+reboot
+
Wait for the newly installed system to boot normally. Login as root.
+Install GRUB to additional disks:
+For a UEFI mirror or raidz topology only:
+dpkg-reconfigure grub-efi-amd64
+
+Select (using the space bar) all of the ESP partitions (partition 1 on
+each of the pool disks).
+
Create a user account:
+Replace YOUR_USERNAME
with your desired username:
username=YOUR_USERNAME
+
+UUID=$(dd if=/dev/urandom bs=1 count=100 2>/dev/null |
+ tr -dc 'a-z0-9' | cut -c-6)
+ROOT_DS=$(zfs list -o name | awk '/ROOT\/ubuntu_/{print $1;exit}')
+zfs create -o com.ubuntu.zsys:bootfs-datasets=$ROOT_DS \
+ -o canmount=on -o mountpoint=/home/$username \
+ rpool/USERDATA/${username}_$UUID
+adduser $username
+
+cp -a /etc/skel/. /home/$username
+chown -R $username:$username /home/$username
+usermod -a -G adm,cdrom,dip,lpadmin,lxd,plugdev,sambashare,sudo $username
+
Upgrade the minimal system:
+apt dist-upgrade --yes
+
Install a regular set of software:
+Choose one of the following options:
+Install a command-line environment only:
+apt install --yes ubuntu-standard
+
Install a full GUI environment:
+apt install --yes ubuntu-desktop
+
Hint: If you are installing a full GUI environment, you will likely +want to manage your network with NetworkManager:
+rm /etc/netplan/01-netcfg.yaml
+vi /etc/netplan/01-network-manager-all.yaml
+
network:
+ version: 2
+ renderer: NetworkManager
+
Optional: Disable log compression:
+As /var/log
is already compressed by ZFS, logrotate’s compression is
+going to burn CPU and disk I/O for (in most cases) very little gain. Also,
+if you are making snapshots of /var/log
, logrotate’s compression will
+actually waste space, as the uncompressed data will live on in the
+snapshot. You can edit the files in /etc/logrotate.d
by hand to comment
+out compress
, or use this loop (copy-and-paste highly recommended):
for file in /etc/logrotate.d/* ; do
+ if grep -Eq "(^|[^#y])compress" "$file" ; then
+ sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file"
+ fi
+done
+
Reboot:
+reboot
+
Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
Optional: Disable the root password:
+sudo usermod -p '*' root
+
Optional (but highly recommended): Disable root SSH logins:
+If you installed SSH earlier, revert the temporary change:
+sudo vi /etc/ssh/sshd_config
+# Remove: PermitRootLogin yes
+
+sudo systemctl restart ssh
+
Optional: Re-enable the graphical boot process:
+If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer.
+sudo vi /etc/default/grub
+# Uncomment: GRUB_TIMEOUT_STYLE=hidden
+# Add quiet and splash to: GRUB_CMDLINE_LINUX_DEFAULT
+# Comment out: GRUB_TERMINAL=console
+# Save and quit.
+
+sudo update-grub
+
Note: Ignore errors from osprober
, if present.
Optional: For LUKS installs only, backup the LUKS header:
+sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected by +your LUKS passphrase, but you may wish to use additional encryption.
+Hint: If you created a mirror or raidz topology, repeat this for each
+LUKS volume (luks2
, etc.).
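One simple way to add that extra layer of encryption before uploading the backup is symmetric gpg encryption; gpg is just an example tool and will prompt for a separate passphrase:
+gpg --symmetric luks1-header.dat
+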
Go through Step 1: Prepare The Install Environment.
+For LUKS, first unlock the disk(s):
+cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1
+# Repeat for additional disks, if this is a mirror or raidz topology.
+
Mount everything correctly:
+zpool export -a
+zpool import -N -R /mnt rpool
+zpool import -N -R /mnt bpool
+zfs load-key -a
+# Replace “UUID” as appropriate; use zfs list to find it:
+zfs mount rpool/ROOT/ubuntu_UUID
+zfs mount bpool/BOOT/ubuntu_UUID
+zfs mount -a
+
If needed, you can chroot into your installed environment:
+mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+chroot /mnt /bin/bash --login
+mount -a
+
Do whatever you need to do to fix your system.
+When done, cleanup:
+exit
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+reboot
+
Systems that require the arcsas
blob driver should add it to the
+/etc/initramfs-tools/modules
file and run update-initramfs -c -k all
.
Upgrade or downgrade the Areca driver if something like
+RIP: 0010:[<ffffffff8101b316>] [<ffffffff8101b316>] native_read_tsc+0x6/0x20
+appears anywhere in the kernel log. ZoL is unstable on systems that emit this
+error message.
Most problem reports for this tutorial involve mpt2sas
hardware that does
+slow asynchronous drive initialization, like some IBM M1015 or OEM-branded
+cards that have been flashed to the reference LSI firmware.
The basic problem is that disks on these controllers are not visible to the +Linux kernel until after the regular system is started, and ZoL does not +hotplug pool members. See https://github.com/zfsonlinux/zfs/issues/330.
+Most LSI cards are perfectly compatible with ZoL. If your card has this
+glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X
in
+/etc/default/zfs
. The system will wait X
seconds for all drives to
+appear before importing the pool.
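For example (the value 15 is only an illustration; tune it for your controller), set the delay and rebuild the initramfs so it takes effect:
+echo 'ZFS_INITRD_PRE_MOUNTROOT_SLEEP=15' >> /etc/default/zfs
+update-initramfs -u -k all
+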
Set a unique serial number on each virtual disk using libvirt or qemu
+(e.g. -drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890
).
To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host:
+sudo apt install ovmf
+sudo vi /etc/libvirt/qemu.conf
+
Uncomment these lines:
+nvram = [
+ "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/OVMF/OVMF_CODE.secboot.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF32_CODE.fd:/usr/share/AAVMF/AAVMF32_VARS.fd",
+ "/usr/share/OVMF/OVMF_CODE.ms.fd:/usr/share/OVMF/OVMF_VARS.ms.fd"
+]
+
sudo systemctl restart libvirtd.service
+
Set disk.EnableUUID = "TRUE"
in the vmx file or vsphere configuration.
+Doing this ensures that /dev/disk
aliases are created in the guest.
Note
+If you want to use ZFS as your root filesystem, see the +Root on ZFS links below instead.
+On Ubuntu, ZFS is included in the default Linux kernel packages.
+To install the ZFS utilities, first make sure universe
is enabled in
+/etc/apt/sources.list
:
deb http://archive.ubuntu.com/ubuntu <CODENAME> main universe
+
Then install zfsutils-linux
:
apt update
+apt install zfsutils-linux
+
To get started with OpenZFS refer to the provided documentation for your +distribution. It will cover the recommended installation method and any +distribution specific information. First time OpenZFS users are +encouraged to check out Aaron Toponce’s excellent +documentation.
+If you want to use ZFS as your root filesystem, see the Root on ZFS +links below instead.
+ZFS packages are not included in the official openSUSE repositories, but the filesystems project repository of openSUSE provides packages for a number of filesystems, including OpenZFS.
+openSUSE progresses through three main distribution branches, called Tumbleweed, Leap and SLE. There are ZFS packages available for all three.
+This HOWTO uses a whole physical disk.
Do not use these instructions for dual-booting.
Backup your data. Any existing data will be lost.
This is not an official openSUSE HOWTO page. This document will be updated if openSUSE adds Root on ZFS support in the future. Also, openSUSE’s default system installer YaST2 does not support ZFS. The method used on this page, setting up the system with zypper and without YaST2, is based on installation methods developed from the experience of people in the community. For more information about this, please look at the external links.
Installing on a drive which presents 4 KiB logical sectors (a “4Kn” drive) only works with UEFI booting. This is not unique to ZFS. GRUB does not and will not work on 4Kn with legacy (BIOS) booting.
Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of memory +is recommended for normal performance in basic workloads. If you wish to use +deduplication, you will need massive amounts of RAM. Enabling +deduplication is a permanent change that cannot be easily reverted.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @Zaryob.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo zypper install python3-pip
+pip3 install -r docs/requirements.txt
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request.
This guide supports three different encryption options: unencrypted, ZFS +native encryption, and LUKS. With any option, all ZFS features are fully +available.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+ZFS native encryption encrypts the data and most metadata in the root
+pool. It does not encrypt dataset or snapshot names or properties. The
+boot pool is not encrypted at all, but it only contains the bootloader,
+kernel, and initrd. (Unless you put a password in /etc/fstab
, the
+initrd is unlikely to contain sensitive data.) The system cannot boot
+without the passphrase being entered at the console. Performance is
+good. As the encryption happens in ZFS, even if multiple disks (mirror
+or raidz topologies) are used, the data only has to be encrypted once.
LUKS encrypts almost everything. The only unencrypted data is the bootloader, +kernel, and initrd. The system cannot boot without the passphrase being +entered at the console. Performance is good, but LUKS sits underneath ZFS, so +if multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+You can use unofficial script LroZ (Linux Root On Zfs), which is based on this manual and automates most steps.
Boot the openSUSE Live CD. If prompted, login with the username
+linux
 without a password. Connect your system to the Internet as
+appropriate (e.g. join your WiFi network). Open a terminal.
Check your openSUSE Leap release:
+lsb_release -d
+Description: openSUSE Leap {$release}
+
Setup and update the repositories:
+sudo zypper addrepo https://download.opensuse.org/repositories/filesystems/$(lsb_release -rs)/filesystems.repo
+sudo zypper refresh # Refresh all repositories
+
Optional: Install and start the OpenSSH server in the Live CD environment:
+If you have a second system, using SSH to access the target system can be +convenient:
+sudo zypper install openssh-server
+sudo systemctl restart sshd.service
+
Hint: You can find your IP address with
+ip addr show scope global | grep inet
. Then, from your main machine,
+connect with ssh user@IP
. Do not forget to set a password for the user with passwd
.
Disable automounting:
+If the disk has been used before (with partitions at the same offsets), +previous filesystems (e.g. the ESP) will automount if not disabled:
+gsettings set org.gnome.desktop.media-handling automount false
+
Become root:
+sudo -i
+
Install ZFS in the Live CD environment:
+zypper install zfs zfs-kmp-default
+zypper install gdisk dkms
+modprobe zfs
+
Set a variable with the disk name:
+DISK=/dev/disk/by-id/scsi-SATA_disk1
+
Always use the long /dev/disk/by-id/*
aliases with ZFS. Using the
+/dev/sd*
device nodes directly can cause sporadic import failures,
+especially on systems that have more than one storage pool.
Hints:
+ls -la /dev/disk/by-id
will list the aliases.
Are you doing this in a virtual machine? If your virtual disk is missing
+from /dev/disk/by-id
, use /dev/vda
if you are using KVM with
+virtio; otherwise, read the troubleshooting
+section.
If you are re-using a disk, clear it as necessary:
+If the disk was previously used in an MD array:
+zypper install mdadm
+
+# See if one or more MD arrays are active:
+cat /proc/mdstat
+# If so, stop them (replace ``md0`` as required):
+mdadm --stop /dev/md0
+
+# For an array using the whole disk:
+mdadm --zero-superblock --force $DISK
+# For an array using a partition:
+mdadm --zero-superblock --force ${DISK}-part2
+
Clear the partition table:
+sgdisk --zap-all $DISK
+
If you get a message about the kernel still using the old partition table, +reboot and start over (except that you can skip this step).
+Partition your disk(s):
+Run this if you need legacy (BIOS) booting:
+sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK
+
Run this for UEFI booting (for use now or in the future):
+sgdisk -n2:1M:+512M -t2:EF00 $DISK
+
Run this for the boot pool:
+sgdisk -n3:0:+1G -t3:BF01 $DISK
+
Choose one of the following options:
+Unencrypted or ZFS native encryption:
+sgdisk -n4:0:0 -t4:BF00 $DISK
+
LUKS:
+sgdisk -n4:0:0 -t4:8309 $DISK
+
Hints:
+If you are creating a mirror or raidz topology, repeat the partitioning commands for all the disks which will be part of the pool.
Create the boot pool:
+zpool create \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o ashift=12 -d \
+ -o feature@async_destroy=enabled \
+ -o feature@bookmarks=enabled \
+ -o feature@embedded_data=enabled \
+ -o feature@empty_bpobj=enabled \
+ -o feature@enabled_txg=enabled \
+ -o feature@extensible_dataset=enabled \
+ -o feature@filesystem_limits=enabled \
+ -o feature@hole_birth=enabled \
+ -o feature@large_blocks=enabled \
+ -o feature@lz4_compress=enabled \
+ -o feature@spacemap_histogram=enabled \
+ -o feature@zpool_checkpoint=enabled \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
+ -O mountpoint=/boot -R /mnt \
+ bpool ${DISK}-part3
+
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c.
+This step creates a separate boot pool for /boot
with the features
+limited to only those that GRUB supports, allowing the root pool to use
+any/all features. Note that GRUB opens the pool read-only, so all
+read-only compatible features are “supported” by GRUB.
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ bpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part3 \
+ /dev/disk/by-id/scsi-SATA_disk2-part3
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
The pool name is arbitrary. If changed, the new name must be used
+consistently. The bpool
convention originated in this HOWTO.
Feature Notes:
+The allocation_classes
feature should be safe to use. However, unless
+one is using it (i.e. a special
vdev), there is no point to enabling
+it. It is extremely unlikely that someone would use this feature for a
+boot pool. If one cares about speeding up the boot pool, it would make
+more sense to put the whole pool on the faster disk rather than using it
+as a special
vdev.
The project_quota
feature has been tested and is safe to use. This
+feature is extremely unlikely to matter for the boot pool.
The resilver_defer
should be safe but the boot pool is small enough
+that it is unlikely to be necessary.
The spacemap_v2
feature has been tested and is safe to use. The boot
+pool is small, so this does not matter in practice.
As a read-only compatible feature, the userobj_accounting
feature
+should be compatible in theory, but in practice, GRUB can fail with an
+“invalid dnode type” error. This feature does not matter for /boot
+anyway.
Create the root pool:
+Choose one of the following options:
+Unencrypted:
+zpool create \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
ZFS native encryption:
+zpool create \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o ashift=12 \
+ -O encryption=on \
+ -O keylocation=prompt -O keyformat=passphrase \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
LUKS:
+zypper install cryptsetup
+cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK}-part4
+cryptsetup luksOpen ${DISK}-part4 luks1
+zpool create \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool /dev/mapper/luks1
+
Notes:
+The use of ashift=12
is recommended here because many drives
+today have 4 KiB (or larger) physical sectors, even though they
+present 512 B logical sectors. Also, a future replacement drive may
+have 4 KiB physical sectors (in which case ashift=12
is desirable)
+or 4 KiB logical sectors (in which case ashift=12
is required).
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires ACLs
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only filenames.
recordsize
is unset (leaving it at the default of 128 KiB). If you
+want to tune it (e.g. -O recordsize=1M
), see these various blog
+posts.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s documentation
+for further information.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain controller.
+Note that xattr=sa
is Linux-specific. If you move your
+xattr=sa
pool to another OpenZFS implementation besides ZFS-on-Linux,
+extended attributes will not be readable (though your data will be). If
+portability of extended attributes is important to you, omit the
+-O xattr=sa
above. Even if you do not want xattr=sa
for the whole
+pool, it is probably fine to use it for /var/log
.
Make sure to include the -part4
portion of the drive path. If you
+forget that, you are specifying the whole disk, which ZFS will then
+re-partition, and you will lose the bootloader partition(s).
ZFS native encryption now
+defaults to aes-256-gcm
.
For LUKS, the key size chosen is 512 bits. However, XTS mode requires two
+keys, so the LUKS key is split in half. Thus, -s 512
means AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup FAQ +for guidance.
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ rpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ /dev/disk/by-id/scsi-SATA_disk2-part4
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
When using LUKS with mirror or raidz topologies, use
+/dev/mapper/luks1
, /dev/mapper/luks2
, etc., which you will have
+to create using cryptsetup
.
The pool name is arbitrary. If changed, the new name must be used
+consistently. On systems that can automatically install to ZFS, the root
+pool is named rpool
by default.
If you want to use the GRUB bootloader, you must set:
+-o feature@async_destroy=enabled \
+-o feature@bookmarks=enabled \
+-o feature@embedded_data=enabled \
+-o feature@empty_bpobj=enabled \
+-o feature@enabled_txg=enabled \
+-o feature@extensible_dataset=enabled \
+-o feature@filesystem_limits=enabled \
+-o feature@hole_birth=enabled \
+-o feature@large_blocks=enabled \
+-o feature@lz4_compress=enabled \
+-o feature@spacemap_histogram=enabled \
+-o feature@zpool_checkpoint=enabled \
+
for your root pool. Relevant for grub 2.04 and Leap 15.3. Don’t use zpool upgrade for this pool or you will lose the ability to use the grub2-install command.
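If you are unsure which features are currently enabled or active on the pool, you can check without changing anything (avoid zpool upgrade here, as noted above):
+zpool get all rpool | grep feature@
+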
+Create filesystem datasets to act as containers:
+zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+zfs create -o canmount=off -o mountpoint=none bpool/BOOT
+
On Solaris systems, the root filesystem is cloned and the suffix is
+incremented for major system changes through pkg image-update
or
+beadm
. Similar functionality has been implemented in Ubuntu 20.04 with
+the zsys
tool, though its dataset layout is more complicated. Even
+without such a tool, the rpool/ROOT and bpool/BOOT containers can still
+be used for manually created clones. That said, this HOWTO assumes a single
+filesystem for /boot
for simplicity.
Create filesystem datasets for the root and boot filesystems:
+zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/suse
+zfs mount rpool/ROOT/suse
+
+zfs create -o mountpoint=/boot bpool/BOOT/suse
+
With ZFS, it is not normally necessary to use a mount command (either
+mount
or zfs mount
). This situation is an exception because of
+canmount=noauto
.
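You can confirm the property afterwards if you like; with canmount=noauto the dataset is mountable but is not mounted automatically at import:
+zfs get canmount,mountpoint rpool/ROOT/suse
+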
Create datasets:
+zfs create rpool/home
+zfs create -o mountpoint=/root rpool/home/root
+chmod 700 /mnt/root
+zfs create -o canmount=off rpool/var
+zfs create -o canmount=off rpool/var/lib
+zfs create rpool/var/log
+zfs create rpool/var/spool
+
The datasets below are optional, depending on your preferences and/or +software choices.
+If you wish to exclude these from snapshots:
+zfs create -o com.sun:auto-snapshot=false rpool/var/cache
+zfs create -o com.sun:auto-snapshot=false rpool/var/tmp
+chmod 1777 /mnt/var/tmp
+
If you use /opt on this system:
+zfs create rpool/opt
+
If you use /srv on this system:
+zfs create rpool/srv
+
If you use /usr/local on this system:
+zfs create -o canmount=off rpool/usr
+zfs create rpool/usr/local
+
If this system will have games installed:
+zfs create rpool/var/games
+
If this system will store local email in /var/mail:
+zfs create rpool/var/mail
+
If this system will use Snap packages:
+zfs create rpool/var/snap
+
If this system will use Flatpak packages:
+zfs create rpool/var/lib/flatpak
+
If you use /var/www on this system:
+zfs create rpool/var/www
+
If this system will use GNOME:
+zfs create rpool/var/lib/AccountsService
+
If this system will use Docker (which manages its own datasets & +snapshots):
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker
+
If this system will use NFS (locking):
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/nfs
+
Mount a tmpfs at /run:
+mkdir /mnt/run
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+
A tmpfs is recommended later, but if you want a separate dataset for
+/tmp
:
zfs create -o com.sun:auto-snapshot=false rpool/tmp
+chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data.
+If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for /tmp
,
+as shown above. This keeps the /tmp
data out of snapshots of your root
+filesystem. It also allows you to set a quota on rpool/tmp
, if you want
+to limit the maximum space used. Otherwise, you can use a tmpfs (RAM
+filesystem) later.
Copy in zpool.cache:
+mkdir /mnt/etc/zfs -p
+cp /etc/zfs/zpool.cache /mnt/etc/zfs/
+
Add repositories into chrooting directory:
+zypper --root /mnt ar http://download.opensuse.org/distribution/leap/$(lsb_release -rs)/repo/non-oss non-oss
+zypper --root /mnt ar http://download.opensuse.org/distribution/leap/$(lsb_release -rs)/repo/oss oss
+zypper --root /mnt ar http://download.opensuse.org/update/leap/$(lsb_release -rs)/oss update-oss
+zypper --root /mnt ar http://download.opensuse.org/update/leap/$(lsb_release -rs)/non-oss update-nonoss
+
Generate repository indexes:
+zypper --root /mnt refresh
+
You will be asked to accept the repository signing key fingerprint; press a to always trust the key and continue:
+New repository or package signing key received:
+
+Repository: oss
+Key Name: openSUSE Project Signing Key <opensuse@opensuse.org>
+Key Fingerprint: 22C07BA5 34178CD0 2EFE22AA B88B2FD4 3DBDC284
+Key Created: Mon May 5 11:37:40 2014
+Key Expires: Thu May 2 11:37:40 2024
+Rpm Name: gpg-pubkey-3dbdc284-53674dd4
+
+Do you want to reject the key, trust temporarily, or trust always? [r/t/a/?] (r):
+
Install openSUSE Leap with zypper:
+If you install the base pattern, zypper will install busybox-grep, which masks the default kernel package. That is why the enhanced_base pattern is recommended if you are new to openSUSE. However, enhanced_base pulls in extra packages that you may not want on a server, so choose one of the following options:
+Install base packages of openSUSE Leap with zypper (Recommended for server):
+zypper --root /mnt install -t pattern base
+
Install enhanced base of openSUSE Leap with zypper (Recommended for desktop):
+zypper --root /mnt install -t pattern enhanced_base
+
Install openSUSE zypper package system into chroot:
+zypper --root /mnt install zypper
+
Recommended: Install openSUSE yast2 system into chroot:
+zypper --root /mnt install yast2
+zypper --root /mnt install -t pattern yast2_basis
+
This makes it easier for beginners to configure the network and other settings.
+To install a desktop environment, see the openSUSE wiki
+Configure the hostname:
+Replace HOSTNAME
with the desired hostname:
echo HOSTNAME > /mnt/etc/hostname
+vi /mnt/etc/hosts
+
Add a line:
+127.0.1.1 HOSTNAME
+
or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
Copy network information:
+rm /mnt/etc/resolv.conf
+cp /etc/resolv.conf /mnt/etc/
+
You will reconfigure network with yast2 later.
+Bind the virtual filesystems from the LiveCD environment to the new
+system and chroot
into it:
mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+
+chroot /mnt /usr/bin/env DISK=$DISK bash --login
+
Note: This is using --rbind
, not --bind
.
Configure a basic system environment:
+ln -s /proc/self/mounts /etc/mtab
+zypper refresh
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available:
locale -a
+
The output must include these languages:
+C
C.utf8
en_US.utf8
POSIX
Find your locale in the output of locale -a, then set it with the following command:
+localectl set-locale LANG=en_US.UTF-8
+
Optional: Reinstallation for stability:
+Some packages may have minor errors after installation. If you wish, you can reinstall them. Since there is no command like dpkg-reconfigure in openSUSE, zypper install -f is suggested as an alternative, but it will reinstall the packages.
+zypper install -f permissions-config iputils ca-certificates ca-certificates-mozilla pam shadow dbus libutempter0 suse-module-tools util-linux
+
Install kernel:
+zypper install kernel-default kernel-firmware
+
Note: If you installed the base pattern, you need to remove busybox-grep before the kernel-default package can be installed.
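If you hit that situation, a minimal sketch (assuming the busybox-grep package name mentioned above) is to remove the conflicting package first and then retry the kernel installation:
+zypper remove busybox-grep
+zypper install kernel-default kernel-firmware
+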
+Install ZFS in the chroot environment for the new system:
+zypper install lsb-release
+zypper addrepo https://download.opensuse.org/repositories/filesystems/`lsb_release -rs`/filesystems.repo
+zypper refresh # Refresh all repositories
+zypper install zfs zfs-kmp-default
+
For LUKS installs only, setup /etc/crypttab
:
zypper install cryptsetup
+
+echo luks1 /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part4) none \
+ luks,discard,initramfs > /etc/crypttab
+
The use of initramfs
is a work-around because cryptsetup does not support
+ZFS.
Hint: If you are creating a mirror or raidz topology, repeat the
+/etc/crypttab
entries for luks2
, etc. adjusting for each disk.
For LUKS installs only, fix cryptsetup naming for ZFS:
+echo 'ENV{DM_NAME}!="", SYMLINK+="$env{DM_NAME}"
+ENV{DM_NAME}!="", SYMLINK+="dm-name-$env{DM_NAME}"' >> /etc/udev/rules.d/99-local-crypt.rules
+
Recommended: Generate and setup hostid:
+cd /root
+zypper install wget
+wget https://github.com/openzfs/zfs/files/4537537/genhostid.sh.gz
+gzip -d genhostid.sh.gz
+chmod +x genhostid.sh
+zgenhostid `/root/genhostid.sh`
+
Check that the generated hostid and the system hostid match:
+/root/genhostid.sh
+hostid
+
Install GRUB
+Choose one of the following options:
+Install GRUB for legacy (BIOS) booting:
+zypper install grub2-x86_64-pc
+
If your processor is 32-bit, use grub2-i386-pc instead of the x86_64 package.
+Install GRUB for UEFI booting:
+zypper install grub2-x86_64-efi dosfstools os-prober
+mkdosfs -F 32 -s 1 -n EFI ${DISK}-part2
+mkdir /boot/efi
+echo /dev/disk/by-uuid/$(blkid -s PARTUUID -o value ${DISK}-part2) \
+ /boot/efi vfat defaults 0 0 >> /etc/fstab
+mount /boot/efi
+
Notes:
+-s 1
for mkdosfs
is only necessary for drives which present 4 KiB logical sectors (“4Kn” drives) to meet the minimum cluster size +(given the partition size of 512 MiB) for FAT32. It also works fine on +drives which present 512 B sectors.
+For a mirror or raidz topology, this step only needs to be done for the first disk. The other disk(s) will be handled later.
+Optional: Remove os-prober:
+zypper remove os-prober
+
This avoids error messages from update-bootloader. os-prober is only +necessary in dual-boot configurations.
+Set a root password:
+passwd
+
Enable importing bpool
+This ensures that bpool
is always imported, regardless of whether
+/etc/zfs/zpool.cache
exists, whether it is in the cachefile or not,
+or whether zfs-import-scan.service
is enabled.
vi /etc/systemd/system/zfs-import-bpool.service
+
[Unit]
+DefaultDependencies=no
+Before=zfs-import-scan.service
+Before=zfs-import-cache.service
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStart=/usr/sbin/zpool import -N -o cachefile=none bpool
+# Work-around to preserve zpool cache:
+ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
+ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache
+
+[Install]
+WantedBy=zfs-import.target
+
systemctl enable zfs-import-bpool.service
+
Optional (but recommended): Mount a tmpfs to /tmp
If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+systemctl enable tmp.mount
+
Add zfs module into dracut:
+echo 'zfs'>> /etc/modules-load.d/zfs.conf
+
The kernel version on the LiveCD can differ from the version installed in the new system. Get the kernel version of your new OS:
+kernel_version=$(find /boot/vmlinuz-* | grep -Eo '[[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*\-.*-default')
+
Refresh kernel files:
+kernel-install add "$kernel_version" /boot/vmlinuz-"$kernel_version"
+
Refresh the initrd files:
+mkinitrd
+
Note: After some installations, the LUKS partition cannot be seen by dracut and the build will print “Failure occurred during following action: configuring encrypted DM device X VOLUME_CRYPTSETUP_FAILED”. To fix this issue, check your cryptsetup installation.
Note: Although we add the zfs module to /etc/modules-load.d, if it is not picked up by dracut, add it to dracut by force:
+dracut --kver $(uname -r) --force --add-drivers "zfs"
+Verify that the ZFS boot filesystem is recognized:
+grub2-probe /boot
+
The output must be zfs.
+If you are having trouble with the grub2-probe command, do this:
+echo 'export ZPOOL_VDEV_NAME_PATH=YES' >> /etc/profile
+export ZPOOL_VDEV_NAME_PATH=YES
+
then go back to the grub2-probe step.
+Optional (but highly recommended): Make debugging GRUB easier:
+vi /etc/default/grub
+# Remove quiet from: GRUB_CMDLINE_LINUX_DEFAULT
+# Uncomment: GRUB_TERMINAL=console
+# Save and quit.
+
Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired.
+Update the boot configuration:
+update-bootloader
+
Note: Ignore errors from osprober
, if present.
+Note: If you have had trouble with the grub2 installation, I suggest you use systemd-boot.
+Note: If this command don’t gives any output, use classic grub.cfg generation with following command:
+grub2-mkconfig -o /boot/grub2/grub.cfg
Check that /boot/grub2/grub.cfg
has a menuentry containing root=ZFS=rpool/ROOT/suse
, like this:
linux /boot@/vmlinuz-5.3.18-150300.59.60-default root=ZFS=rpool/ROOT/suse
+
If not, change /etc/default/grub
:
GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/suse"
+
and repeat the previous step.
+Install the boot loader:
+For legacy (BIOS) booting, install GRUB to the MBR:
+grub2-install $DISK
+
Note that you are installing GRUB to the whole disk, not a partition.
+If you are creating a mirror or raidz topology, repeat the grub2-install
+command for each disk in the pool, for example as sketched below.
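A hedged sketch for a hypothetical two-disk mirror (adjust the disk paths to match your pool):
+for d in /dev/disk/by-id/scsi-SATA_disk1 /dev/disk/by-id/scsi-SATA_disk2; do
+    grub2-install "$d"
+done
+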
For UEFI booting, install GRUB to the ESP:
+grub2-install --target=x86_64-efi --efi-directory=/boot/efi \
+ --bootloader-id=opensuse --recheck --no-floppy
+
It is not necessary to specify the disk here. If you are creating a +mirror or raidz topology, the additional disks will be handled later.
+Warning: This will break your YaST2 bootloader configuration. Only use it if you cannot fix the problem you are having with grub2; it is documented here because grub2 sometimes does not see the rpool pool.
+Install systemd-boot:
+bootctl install
+
Note: Only if the previous command replied “Failed to get machine id: No medium found”, run:
+systemd-machine-id-setup
+
and then repeat the systemd-boot installation.
+Configure bootloader configuration:
+tee -a /boot/efi/loader/loader.conf << EOF
+default openSUSE_Leap.conf
+timeout 5
+console-mode auto
+EOF
+
Write Entries:
+tee -a /boot/efi/loader/entries/openSUSE_Leap.conf << EOF
+title openSUSE Leap
+linux /EFI/openSUSE/vmlinuz
+initrd /EFI/openSUSE/initrd
+options root=zfs:rpool/ROOT/suse boot=zfs
+EOF
+
Copy files into EFI:
+mkdir /boot/efi/EFI/openSUSE
+cp /boot/{vmlinuz,initrd} /boot/efi/EFI/openSUSE
+
Update systemd-boot variables:
+bootctl update
+
Fix filesystem mount ordering:
+We need to activate zfs-mount-generator
. This makes systemd aware of
+the separate mountpoints, which is important for things like /var/log
+and /var/tmp
. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
feature
+of systemd automatically use After=var-tmp.mount
.
mkdir /etc/zfs/zfs-list.cache
+touch /etc/zfs/zfs-list.cache/bpool
+touch /etc/zfs/zfs-list.cache/rpool
+ln -s /usr/lib/zfs/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d
+zed -F &
+
Verify that zed
updated the cache by making sure these are not empty:
cat /etc/zfs/zfs-list.cache/bpool
+cat /etc/zfs/zfs-list.cache/rpool
+
If either is empty, force a cache update and check again:
+zfs set canmount=on bpool/BOOT/suse
+zfs set canmount=noauto rpool/ROOT/suse
+
If they are still empty, stop zed (as below), start zed (as above) and try +again.
+Stop zed
:
fg
+Press Ctrl-C.
+
Fix the paths to eliminate /mnt
:
sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
+
Optional: Install SSH:
+zypper install -y openssh-server
+
+vi /etc/ssh/sshd_config
+# Set: PermitRootLogin yes
+
Optional: Snapshot the initial installation:
+zfs snapshot -r bpool/BOOT/suse@install
+zfs snapshot -r rpool/ROOT/suse@install
+
In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space.
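For example, a possible pre-upgrade snapshot routine (the snapshot names here are only illustrative):
+zfs snapshot -r bpool/BOOT/suse@pre-upgrade-$(date +%Y%m%d)
+zfs snapshot -r rpool/ROOT/suse@pre-upgrade-$(date +%Y%m%d)
+# Later, when an old snapshot is no longer needed:
+# zfs destroy -r rpool/ROOT/suse@install
+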
+Exit from the chroot
environment back to the LiveCD environment:
exit
+
Run these commands in the LiveCD environment to unmount all +filesystems:
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+
Reboot:
+reboot
+
Wait for the newly installed system to boot normally. Login as root.
+Create a user account:
+Replace username
with your desired username:
zfs create rpool/home/username
+adduser username
+
+cp -a /etc/skel/. /home/username
+chown -R username:username /home/username
+usermod -a -G audio,cdrom,dip,floppy,netdev,plugdev,sudo,video username
+
Mirror GRUB
+If you installed to multiple disks, install GRUB on the additional +disks.
+For legacy (BIOS) booting: +Check which boot mode the system is using:
+efibootmgr -v
+
This must return a message containing legacy_boot.
+Then reconfigure grub:
+grub2-install $DISK
+
Hit enter until you get to the device selection screen. +Select (using the space bar) all of the disks (not partitions) in your pool.
+For UEFI booting:
+umount /boot/efi
+
For the second and subsequent disks (increment opensuse-2 to -3, etc.):
+dd if=/dev/disk/by-id/scsi-SATA_disk1-part2 \
+ of=/dev/disk/by-id/scsi-SATA_disk2-part2
+efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \
+ -p 2 -L "opensuse-2" -l '\EFI\opensuse\grubx64.efi'
+
+mount /boot/efi
+
Caution: On systems with extremely high memory pressure, using a +zvol for swap can result in lockup, regardless of how much swap is still +available. There is a bug report upstream.
+Create a volume dataset (zvol) for use as a swap device:
+zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \
+ -o logbias=throughput -o sync=always \
+ -o primarycache=metadata -o secondarycache=none \
+ -o com.sun:auto-snapshot=false rpool/swap
+
You can adjust the size (the 4G
part) to your needs.
The compression algorithm is set to zle
because it is the cheapest
+available algorithm. As this guide recommends ashift=12
(4 kiB
+blocks on disk), the common case of a 4 kiB page size means that no
+compression algorithm can reduce I/O. The exception is all-zero pages,
+which are dropped by ZFS; but some form of compression has to be enabled
+to get this behavior.
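If you are curious, you can inspect the resulting properties of the swap zvol after it has seen some use (a quick, optional check):
+zfs get compression,compressratio,volblocksize rpool/swap
+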
Configure the swap device:
+Caution: Always use long /dev/zvol
aliases in configuration
+files. Never use a short /dev/zdX
device name.
mkswap -f /dev/zvol/rpool/swap
+echo /dev/zvol/rpool/swap none swap discard 0 0 >> /etc/fstab
+echo RESUME=none > /etc/initramfs-tools/conf.d/resume
+
The RESUME=none
is necessary to disable resuming from hibernation.
+This does not work, as the zvol is not present (because the pool has not
+yet been imported) at the time the resume script runs. If it is not
+disabled, the boot process hangs for 30 seconds waiting for the swap
+zvol to appear.
Enable the swap device:
+swapon -av
+
Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
Optional: Delete the snapshots of the initial installation:
+sudo zfs destroy bpool/BOOT/suse@install
+sudo zfs destroy rpool/ROOT/suse@install
+
Optional: Disable the root password:
+sudo usermod -p '*' root
+
Optional (but highly recommended): Disable root SSH logins:
+If you installed SSH earlier, revert the temporary change:
+vi /etc/ssh/sshd_config
+# Remove: PermitRootLogin yes
+
+systemctl restart sshd
+
Optional: Re-enable the graphical boot process:
+If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer.
+sudo vi /etc/default/grub
+# Add quiet to GRUB_CMDLINE_LINUX_DEFAULT
+# Comment out GRUB_TERMINAL=console
+# Save and quit.
+
+sudo update-bootloader
+
Note: Ignore errors from osprober
, if present.
Optional: For LUKS installs only, backup the LUKS header:
+sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected by +your LUKS passphrase, but you may wish to use additional encryption.
+Hint: If you created a mirror or raidz topology, repeat this for each
+LUKS volume (luks2
, etc.).
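Should the header ever become damaged, the backup can be restored with cryptsetup (a sketch, using the backup file created above):
+sudo cryptsetup luksHeaderRestore /dev/disk/by-id/scsi-SATA_disk1-part4 \
+    --header-backup-file luks1-header.dat
+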
Go through Step 1: Prepare The Install Environment.
+For LUKS, first unlock the disk(s):
+zypper install cryptsetup
+cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1
+# Repeat for additional disks, if this is a mirror or raidz topology.
+
Mount everything correctly:
+zpool export -a
+zpool import -N -R /mnt rpool
+zpool import -N -R /mnt bpool
+zfs load-key -a
+zfs mount rpool/ROOT/suse
+zfs mount -a
+
If needed, you can chroot into your installed environment:
+mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+chroot /mnt /bin/bash --login
+mount /boot/efi
+mount -a
+
Do whatever you need to do to fix your system.
+When done, cleanup:
+exit
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+reboot
+
Systems that require the arcsas
blob driver should add it to the
+/etc/initramfs-tools/modules
file and run update-initramfs -c -k all
.
Upgrade or downgrade the Areca driver if something like
+RIP: 0010:[<ffffffff8101b316>] [<ffffffff8101b316>] native_read_tsc+0x6/0x20
+appears anywhere in kernel log. ZoL is unstable on systems that emit this
+error message.
Most problem reports for this tutorial involve mpt2sas
hardware that does
+slow asynchronous drive initialization, like some IBM M1015 or OEM-branded
+cards that have been flashed to the reference LSI firmware.
The basic problem is that disks on these controllers are not visible to the +Linux kernel until after the regular system is started, and ZoL does not +hotplug pool members. See https://github.com/zfsonlinux/zfs/issues/330.
+Most LSI cards are perfectly compatible with ZoL. If your card has this
+glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X
in
+/etc/default/zfs
. The system will wait X
seconds for all drives to
+appear before importing the pool.
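For example, a hedged sketch that sets a 15-second wait (the value is arbitrary; tune it to your controller) and rebuilds the initrd so the setting is picked up:
+echo 'ZFS_INITRD_PRE_MOUNTROOT_SLEEP=15' >> /etc/default/zfs
+mkinitrd
+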
Set a unique serial number on each virtual disk using libvirt or qemu
+(e.g. -drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890
).
To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host:
+sudo zypper install ovmf
+sudo vi /etc/libvirt/qemu.conf
+
Uncomment these lines:
+nvram = [
+ "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/OVMF/OVMF_CODE.secboot.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF32_CODE.fd:/usr/share/AAVMF/AAVMF32_VARS.fd"
+]
+
sudo systemctl restart libvirtd.service
+
Set disk.EnableUUID = "TRUE"
in the vmx file or vsphere configuration.
+Doing this ensures that /dev/disk
aliases are created in the guest.
This HOWTO uses a whole physical disk.
Do not use these instructions for dual-booting.
Backup your data. Any existing data will be lost.
This is not an official openSUSE HOWTO page. This document will be updated if Root on ZFS support is added to +openSUSE in the future. +Also, openSUSE's default system installer YaST2 does not support ZFS. The method used in this page, setting up the system +with zypper and without YaST2, is based on installation methods developed from the +experience of people in the community. +For more information, please see the external links.
Installing on a drive which presents 4 KiB logical sectors (a “4Kn” drive) +only works with UEFI booting. This is not unique to ZFS. GRUB does not and +will not work on 4Kn with legacy (BIOS) booting.
Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of memory +is recommended for normal performance in basic workloads. If you wish to use +deduplication, you will need massive amounts of RAM. Enabling +deduplication is a permanent change that cannot be easily reverted.
+If you need help, reach out to the community using the Mailing Lists or IRC at +#zfsonlinux on Libera Chat. If you have a bug report or feature request +related to this HOWTO, please file a new issue and mention @Zaryob.
+Fork and clone: https://github.com/openzfs/openzfs-docs
Install the tools:
+sudo zypper install python3-pip
+pip3 install -r docs/requirements.txt
+# Add ~/.local/bin to your $PATH, e.g. by adding this to ~/.bashrc:
+PATH=$HOME/.local/bin:$PATH
+
Make your changes.
Test:
+cd docs
+make html
+sensible-browser _build/html/index.html
+
git commit --signoff
to a branch, git push
, and create a pull
+request.
This guide supports three different encryption options: unencrypted, ZFS +native encryption, and LUKS. With any option, all ZFS features are fully +available.
+Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance.
+ZFS native encryption encrypts the data and most metadata in the root
+pool. It does not encrypt dataset or snapshot names or properties. The
+boot pool is not encrypted at all, but it only contains the bootloader,
+kernel, and initrd. (Unless you put a password in /etc/fstab
, the
+initrd is unlikely to contain sensitive data.) The system cannot boot
+without the passphrase being entered at the console. Performance is
+good. As the encryption happens in ZFS, even if multiple disks (mirror
+or raidz topologies) are used, the data only has to be encrypted once.
LUKS encrypts almost everything. The only unencrypted data is the bootloader, +kernel, and initrd. The system cannot boot without the passphrase being +entered at the console. Performance is good, but LUKS sits underneath ZFS, so +if multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk.
+Boot the openSUSE Live CD. If prompted, login with the username
+live
and password live
. Connect your system to the Internet as
+appropriate (e.g. join your WiFi network). Open a terminal.
Setup and update the repositories:
+sudo zypper addrepo https://download.opensuse.org/repositories/filesystems/openSUSE_Tumbleweed/filesystems.repo
+sudo zypper refresh # Refresh all repositories
+
Optional: Install and start the OpenSSH server in the Live CD environment:
+If you have a second system, using SSH to access the target system can be +convenient:
+sudo zypper install openssh-server
+sudo systemctl restart sshd.service
+
Hint: You can find your IP address with
+ip addr show scope global | grep inet
. Then, from your main machine,
+connect with ssh user@IP
.
Disable automounting:
+If the disk has been used before (with partitions at the same offsets), +previous filesystems (e.g. the ESP) will automount if not disabled:
+gsettings set org.gnome.desktop.media-handling automount false
+
Become root:
+sudo -i
+
Install ZFS in the Live CD environment:
+zypper install zfs zfs-kmp-default
+zypper install gdisk
+modprobe zfs
+
Set a variable with the disk name:
+DISK=/dev/disk/by-id/scsi-SATA_disk1
+
Always use the long /dev/disk/by-id/*
aliases with ZFS. Using the
+/dev/sd*
device nodes directly can cause sporadic import failures,
+especially on systems that have more than one storage pool.
Hints:
+ls -la /dev/disk/by-id
will list the aliases.
Are you doing this in a virtual machine? If your virtual disk is missing
+from /dev/disk/by-id
, use /dev/vda
if you are using KVM with
+virtio; otherwise, read the troubleshooting
+section.
If you are re-using a disk, clear it as necessary:
+If the disk was previously used in an MD array:
+zypper install mdadm
+
+# See if one or more MD arrays are active:
+cat /proc/mdstat
+# If so, stop them (replace ``md0`` as required):
+mdadm --stop /dev/md0
+
+# For an array using the whole disk:
+mdadm --zero-superblock --force $DISK
+# For an array using a partition:
+mdadm --zero-superblock --force ${DISK}-part2
+
Clear the partition table:
+sgdisk --zap-all $DISK
+
If you get a message about the kernel still using the old partition table, +reboot and start over (except that you can skip this step).
+Partition your disk(s):
+Run this if you need legacy (BIOS) booting:
+sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK
+
Run this for UEFI booting (for use now or in the future):
+sgdisk -n2:1M:+512M -t2:EF00 $DISK
+
Run this for the boot pool:
+sgdisk -n3:0:+1G -t3:BF01 $DISK
+
Choose one of the following options:
+Unencrypted or ZFS native encryption:
+sgdisk -n4:0:0 -t4:BF00 $DISK
+
LUKS:
+sgdisk -n4:0:0 -t4:8309 $DISK
+
If you are creating a mirror or raidz topology, repeat the partitioning +commands for all the disks which will be part of the pool.
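As an illustration, a hypothetical two-disk layout could be partitioned in a loop (adjust the disk list and use only the partition commands matching the options you chose above):
+for d in /dev/disk/by-id/scsi-SATA_disk1 /dev/disk/by-id/scsi-SATA_disk2; do
+    sgdisk -n2:1M:+512M -t2:EF00 "$d"
+    sgdisk -n3:0:+1G    -t3:BF01 "$d"
+    sgdisk -n4:0:0      -t4:BF00 "$d"
+done
+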
+Create the boot pool:
+zpool create \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o ashift=12 -d \
+ -o feature@async_destroy=enabled \
+ -o feature@bookmarks=enabled \
+ -o feature@embedded_data=enabled \
+ -o feature@empty_bpobj=enabled \
+ -o feature@enabled_txg=enabled \
+ -o feature@extensible_dataset=enabled \
+ -o feature@filesystem_limits=enabled \
+ -o feature@hole_birth=enabled \
+ -o feature@large_blocks=enabled \
+ -o feature@lz4_compress=enabled \
+ -o feature@spacemap_histogram=enabled \
+ -o feature@zpool_checkpoint=enabled \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
+ -O mountpoint=/boot -R /mnt \
+ bpool ${DISK}-part3
+
You should not need to customize any of the options for the boot pool.
+GRUB does not support all of the zpool features. See spa_feature_names
+in grub-core/fs/zfs/zfs.c.
+This step creates a separate boot pool for /boot
with the features
+limited to only those that GRUB supports, allowing the root pool to use
+any/all features. Note that GRUB opens the pool read-only, so all
+read-only compatible features are “supported” by GRUB.
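If you want to double-check which features ended up enabled on the new pool, one quick, optional way is:
+zpool get all bpool | grep feature@
+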
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ bpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part3 \
+ /dev/disk/by-id/scsi-SATA_disk2-part3
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
The pool name is arbitrary. If changed, the new name must be used
+consistently. The bpool
convention originated in this HOWTO.
Feature Notes:
+The allocation_classes
feature should be safe to use. However, unless
+one is using it (i.e. a special
vdev), there is no point to enabling
+it. It is extremely unlikely that someone would use this feature for a
+boot pool. If one cares about speeding up the boot pool, it would make
+more sense to put the whole pool on the faster disk rather than using it
+as a special
vdev.
The project_quota
feature has been tested and is safe to use. This
+feature is extremely unlikely to matter for the boot pool.
The resilver_defer
should be safe but the boot pool is small enough
+that it is unlikely to be necessary.
The spacemap_v2
feature has been tested and is safe to use. The boot
+pool is small, so this does not matter in practice.
As a read-only compatible feature, the userobj_accounting
feature
+should be compatible in theory, but in practice, GRUB can fail with an
+“invalid dnode type” error. This feature does not matter for /boot
+anyway.
Create the root pool:
+Choose one of the following options:
+Unencrypted:
+zpool create \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
ZFS native encryption:
+zpool create \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o ashift=12 \
+ -O encryption=on \
+ -O keylocation=prompt -O keyformat=passphrase \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool ${DISK}-part4
+
LUKS:
+zypper install cryptsetup
+cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK}-part4
+cryptsetup luksOpen ${DISK}-part4 luks1
+zpool create \
+ -o cachefile=/etc/zfs/zpool.cache \
+ -o ashift=12 \
+ -O acltype=posixacl -O canmount=off -O compression=lz4 \
+ -O dnodesize=auto -O normalization=formD -O relatime=on \
+ -O xattr=sa -O mountpoint=/ -R /mnt \
+ rpool /dev/mapper/luks1
+
Notes:
+The use of ashift=12
is recommended here because many drives
+today have 4 KiB (or larger) physical sectors, even though they
+present 512 B logical sectors. Also, a future replacement drive may
+have 4 KiB physical sectors (in which case ashift=12
is desirable)
+or 4 KiB logical sectors (in which case ashift=12
is required).
Setting -O acltype=posixacl
enables POSIX ACLs globally. If you
+do not want this, remove that option, but later add
+-o acltype=posixacl
(note: lowercase “o”) to the zfs create
+for /var/log
, as journald requires ACLs
Setting normalization=formD
eliminates some corner cases relating
+to UTF-8 filename normalization. It also implies utf8only=on
,
+which means that only UTF-8 filenames are allowed. If you care to
+support non-UTF-8 filenames, do not use this option. For a discussion
+of why requiring UTF-8 filenames may be a bad idea, see The problems
+with enforced UTF-8 only filenames.
recordsize
is unset (leaving it at the default of 128 KiB). If you
+want to tune it (e.g. -O recordsize=1M
), see these various blog
+posts.
Setting relatime=on
is a middle ground between classic POSIX
+atime
behavior (with its significant performance impact) and
+atime=off
(which provides the best performance by completely
+disabling atime updates). Since Linux 2.6.30, relatime
has been
+the default for other filesystems. See RedHat’s documentation
+for further information.
Setting xattr=sa
vastly improves the performance of extended
+attributes.
+Inside ZFS, extended attributes are used to implement POSIX ACLs.
+Extended attributes can also be used by user-space applications.
+They are used by some desktop GUI applications.
+They can be used by Samba to store Windows ACLs and DOS attributes;
+they are required for a Samba Active Directory domain controller.
+Note that xattr=sa
is Linux-specific. If you move your
+xattr=sa
pool to another OpenZFS implementation besides ZFS-on-Linux,
+extended attributes will not be readable (though your data will be). If
+portability of extended attributes is important to you, omit the
+-O xattr=sa
above. Even if you do not want xattr=sa
for the whole
+pool, it is probably fine to use it for /var/log
.
Make sure to include the -part4
portion of the drive path. If you
+forget that, you are specifying the whole disk, which ZFS will then
+re-partition, and you will lose the bootloader partition(s).
ZFS native encryption now
+defaults to aes-256-gcm
.
For LUKS, the key size chosen is 512 bits. However, XTS mode requires two
+keys, so the LUKS key is split in half. Thus, -s 512
means AES-256.
Your passphrase will likely be the weakest link. Choose wisely. See +section 5 of the cryptsetup FAQ +for guidance.
Hints:
+If you are creating a mirror topology, create the pool using:
+zpool create \
+ ... \
+ rpool mirror \
+ /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ /dev/disk/by-id/scsi-SATA_disk2-part4
+
For raidz topologies, replace mirror
in the above command with
+raidz
, raidz2
, or raidz3
and list the partitions from
+the additional disks.
When using LUKS with mirror or raidz topologies, use
+/dev/mapper/luks1
, /dev/mapper/luks2
, etc., which you will have
+to create using cryptsetup
.
The pool name is arbitrary. If changed, the new name must be used
+consistently. On systems that can automatically install to ZFS, the root
+pool is named rpool
by default.
Create filesystem datasets to act as containers:
+zfs create -o canmount=off -o mountpoint=none rpool/ROOT
+zfs create -o canmount=off -o mountpoint=none bpool/BOOT
+
On Solaris systems, the root filesystem is cloned and the suffix is
+incremented for major system changes through pkg image-update
or
+beadm
. Similar functionality has been implemented in Ubuntu 20.04 with
+the zsys
tool, though its dataset layout is more complicated. Even
+without such a tool, the rpool/ROOT and bpool/BOOT containers can still
+be used for manually created clones. That said, this HOWTO assumes a single
+filesystem for /boot
for simplicity.
Create filesystem datasets for the root and boot filesystems:
+zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/suse
+zfs mount rpool/ROOT/suse
+
+zfs create -o mountpoint=/boot bpool/BOOT/suse
+
With ZFS, it is not normally necessary to use a mount command (either
+mount
or zfs mount
). This situation is an exception because of
+canmount=noauto
.
Create datasets:
+zfs create rpool/home
+zfs create -o mountpoint=/root rpool/home/root
+chmod 700 /mnt/root
+zfs create -o canmount=off rpool/var
+zfs create -o canmount=off rpool/var/lib
+zfs create rpool/var/log
+zfs create rpool/var/spool
+
The datasets below are optional, depending on your preferences and/or +software choices.
+If you wish to exclude these from snapshots:
+zfs create -o com.sun:auto-snapshot=false rpool/var/cache
+zfs create -o com.sun:auto-snapshot=false rpool/var/tmp
+chmod 1777 /mnt/var/tmp
+
If you use /opt on this system:
+zfs create rpool/opt
+
If you use /srv on this system:
+zfs create rpool/srv
+
If you use /usr/local on this system:
+zfs create -o canmount=off rpool/usr
+zfs create rpool/usr/local
+
If this system will have games installed:
+zfs create rpool/var/games
+
If this system will store local email in /var/mail:
+zfs create rpool/var/spool/mail
+
If this system will use Snap packages:
+zfs create rpool/var/snap
+
If this system will use Flatpak packages:
+zfs create rpool/var/lib/flatpak
+
If you use /var/www on this system:
+zfs create rpool/var/www
+
If this system will use GNOME:
+zfs create rpool/var/lib/AccountsService
+
If this system will use Docker (which manages its own datasets & +snapshots):
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker
+
If this system will use NFS (locking):
+zfs create -o com.sun:auto-snapshot=false rpool/var/lib/nfs
+
Mount a tmpfs at /run:
+mkdir /mnt/run
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+
A tmpfs is recommended later, but if you want a separate dataset for
+/tmp
:
zfs create -o com.sun:auto-snapshot=false rpool/tmp
+chmod 1777 /mnt/tmp
+
The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data.
+If you do nothing extra, /tmp
will be stored as part of the root
+filesystem. Alternatively, you can create a separate dataset for /tmp
,
+as shown above. This keeps the /tmp
data out of snapshots of your root
+filesystem. It also allows you to set a quota on rpool/tmp
, if you want
+to limit the maximum space used. Otherwise, you can use a tmpfs (RAM
+filesystem) later.
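For instance, a quota could be applied like this (the 2G value is only an example):
+zfs set quota=2G rpool/tmp
+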
Copy in zpool.cache:
+mkdir /mnt/etc/zfs -p
+cp /etc/zfs/zpool.cache /mnt/etc/zfs/
+
Add repositories into chrooting directory:
+zypper --root /mnt ar http://download.opensuse.org/tumbleweed/repo/non-oss/ non-oss
+zypper --root /mnt ar http://download.opensuse.org/tumbleweed/repo/oss/ oss
+
Generate repository indexes:
+zypper --root /mnt refresh
+
You will get a package signing key prompt; press a to trust the key always and continue:
+New repository or package signing key received:
+
+Repository: oss
+Key Name: openSUSE Project Signing Key <opensuse@opensuse.org>
+Key Fingerprint: 22C07BA5 34178CD0 2EFE22AA B88B2FD4 3DBDC284
+Key Created: Mon May 5 11:37:40 2014
+Key Expires: Thu May 2 11:37:40 2024
+Rpm Name: gpg-pubkey-3dbdc284-53674dd4
+
+Do you want to reject the key, trust temporarily, or trust always? [r/t/a/?] (r):
+
Install openSUSE Tumbleweed with zypper:
+If you install the base pattern, zypper will install busybox-grep, which masks the default kernel package. That is why the enhanced_base pattern is recommended if you are new to openSUSE. However, enhanced_base pulls in extra packages that you may not want on a server. Choose one of the following options:
+Install base packages of openSUSE Tumbleweed with zypper (Recommended for server):
+zypper --root /mnt install -t pattern base
+
Install enhanced base of openSUSE Tumbleweed with zypper (Recommended for desktop):
+zypper --root /mnt install -t pattern enhanced_base
+
Install openSUSE zypper package system into chroot:
+zypper --root /mnt install zypper
+
Recommended: Install openSUSE yast2 system into chroot:
+zypper --root /mnt install yast2
+
+Note
+If your /etc/resolv.conf file is empty, run this command:
+echo "nameserver 8.8.4.4" | tee -a /mnt/etc/resolv.conf
+This will make it easier for beginners to configure the network and other settings.
+
To install a desktop environment, see the openSUSE wiki
+Configure the hostname:
+Replace HOSTNAME
with the desired hostname:
echo HOSTNAME > /mnt/etc/hostname
+vi /mnt/etc/hosts
+
Add a line:
+127.0.1.1 HOSTNAME
+
or if the system has a real name in DNS:
+127.0.1.1 FQDN HOSTNAME
+
Hint: Use nano
if you find vi
confusing.
Copy network information:
+cp /etc/resolv.conf /mnt/etc
+
You will reconfigure the network with yast2.
+Note
+If your /etc/resolv.conf file is empty, run this command:
+echo "nameserver 8.8.4.4" | tee -a /mnt/etc/resolv.conf
+Bind the virtual filesystems from the LiveCD environment to the new
+system and chroot
into it:
mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+mount -t tmpfs tmpfs /mnt/run
+mkdir /mnt/run/lock
+
+chroot /mnt /usr/bin/env DISK=$DISK bash --login
+
Note: This is using --rbind
, not --bind
.
Configure a basic system environment:
+ln -s /proc/self/mounts /etc/mtab
+zypper refresh
+
Even if you prefer a non-English system language, always ensure that
+en_US.UTF-8
is available:
locale -a
+
The output must include these locales:
+C
C.UTF-8
en_US.utf8
POSIX
Find your locale in the output of locale -a, then set it with the following command:
+localectl set-locale LANG=en_US.UTF-8
+
Optional: Reinstallation for stability:
+After installation, some packages may have minor configuration errors; if you wish, you can reinstall them now. Since openSUSE has no equivalent of dpkg-reconfigure, zypper install -f is the closest alternative, but note that it reinstalls the packages:
+zypper install -f permissions-config iputils ca-certificates ca-certificates-mozilla pam shadow dbus-1 libutempter0 suse-module-tools util-linux
+
Install kernel:
+zypper install kernel-default kernel-firmware
+
Note
+If you installed the base pattern, you need to remove busybox-grep before the kernel-default package can be installed.
+Install ZFS in the chroot environment for the new system:
+zypper addrepo https://download.opensuse.org/repositories/filesystems/openSUSE_Tumbleweed/filesystems.repo
+zypper refresh # Refresh all repositories
+zypper install zfs
+
For LUKS installs only, setup /etc/crypttab
:
zypper install cryptsetup
+
+echo luks1 /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part4) none \
+ luks,discard,initramfs > /etc/crypttab
+
The use of initramfs
is a work-around because cryptsetup does not support
+ZFS.
Hint: If you are creating a mirror or raidz topology, repeat the
+/etc/crypttab
entries for luks2
, etc. adjusting for each disk.
For LUKS installs only, fix cryptsetup naming for ZFS:
+echo 'ENV{DM_NAME}!="", SYMLINK+="$env{DM_NAME}"
+ENV{DM_NAME}!="", SYMLINK+="dm-name-$env{DM_NAME}"' >> /etc/udev/rules.d/99-local-crypt.rules
+
Install GRUB
+Choose one of the following options:
+Install GRUB for legacy (BIOS) booting:
+zypper install grub2-i386-pc
+
Install GRUB for UEFI booting:
+zypper install grub2-x86_64-efi dosfstools os-prober
+mkdosfs -F 32 -s 1 -n EFI ${DISK}-part2
+mkdir /boot/efi
+echo /dev/disk/by-uuid/$(blkid -s PARTUUID -o value ${DISK}-part2) \
+ /boot/efi vfat defaults 0 0 >> /etc/fstab
+mount /boot/efi
+
Notes:
+-s 1
for mkdosfs
is only necessary for drives which present 4 KiB logical sectors (“4Kn” drives) to meet the minimum cluster size +(given the partition size of 512 MiB) for FAT32. It also works fine on +drives which present 512 B sectors.
+For a mirror or raidz topology, this step only needs to be done for the first disk. The other disk(s) will be handled later.
+Optional: Remove os-prober:
+zypper remove os-prober
+
This avoids error messages from update-bootloader. os-prober is only +necessary in dual-boot configurations.
+Set a root password:
+passwd
+
Enable importing bpool
+This ensures that bpool
is always imported, regardless of whether
+/etc/zfs/zpool.cache
exists, whether it is in the cachefile or not,
+or whether zfs-import-scan.service
is enabled.
vi /etc/systemd/system/zfs-import-bpool.service
+
[Unit]
+DefaultDependencies=no
+Before=zfs-import-scan.service
+Before=zfs-import-cache.service
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStart=/sbin/zpool import -N -o cachefile=none bpool
+# Work-around to preserve zpool cache:
+ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
+ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache
+
+[Install]
+WantedBy=zfs-import.target
+
systemctl enable zfs-import-bpool.service
+
Optional (but recommended): Mount a tmpfs to /tmp
If you chose to create a /tmp
dataset above, skip this step, as they
+are mutually exclusive choices. Otherwise, you can put /tmp
on a
+tmpfs (RAM filesystem) by enabling the tmp.mount
unit.
cp /usr/share/systemd/tmp.mount /etc/systemd/system/
+systemctl enable tmp.mount
+
Add zfs module into dracut:
+echo 'zfs'>> /etc/modules-load.d/zfs.conf
+
Refresh kernel files:
+kernel-install add $(uname -r) /boot/vmlinuz-$(uname -r)
+
Refresh the initrd files:
+mkinitrd
+
Note: After some installations, the LUKS partition cannot be seen by dracut and the build will print “Failure occurred during following action: configuring encrypted DM device X VOLUME_CRYPTSETUP_FAILED”. To fix this issue, check your cryptsetup installation.
Note: Although we add the zfs module to /etc/modules-load.d, if it is not picked up by dracut, add it to dracut by force:
+dracut --kver $(uname -r) --force --add-drivers "zfs"
+Verify that the ZFS boot filesystem is recognized:
+grub2-probe /boot
+
The output must be zfs.
+If you are having trouble with the grub2-probe command, do this:
+echo 'export ZPOOL_VDEV_NAME_PATH=YES' >> /etc/profile
+export ZPOOL_VDEV_NAME_PATH=YES
+
then go back to the grub2-probe step.
+Workaround GRUB’s missing zpool-features support:
+vi /etc/default/grub
+# Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/suse"
+
Optional (but highly recommended): Make debugging GRUB easier:
+vi /etc/default/grub
+# Remove quiet from: GRUB_CMDLINE_LINUX_DEFAULT
+# Uncomment: GRUB_TERMINAL=console
+# Save and quit.
+
Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired.
+Update the boot configuration:
+update-bootloader
+
Note: Ignore errors from osprober
, if present.
+Note: If you have had trouble with the grub2 installation, I suggest you use systemd-boot.
+Note: If this command don’t gives any output, use classic grub.cfg generation with following command:
+grub2-mkconfig -o /boot/grub2/grub.cfg
Install the boot loader:
+For legacy (BIOS) booting, install GRUB to the MBR:
+grub2-install $DISK
+
Note that you are installing GRUB to the whole disk, not a partition.
+If you are creating a mirror or raidz topology, repeat the grub2-install
+command for each disk in the pool.
For UEFI booting, install GRUB to the ESP:
+grub2-install --target=x86_64-efi --efi-directory=/boot/efi \
+ --bootloader-id=opensuse --recheck --no-floppy
+
It is not necessary to specify the disk here. If you are creating a +mirror or raidz topology, the additional disks will be handled later.
+Warning: This will break your YaST2 bootloader configuration. Only use it if you cannot fix the problem you are having with grub2; it is documented here because grub2 sometimes does not see the rpool pool.
+Install systemd-boot:
+bootctl install
+
Configure bootloader configuration:
+tee -a /boot/efi/loader/loader.conf << EOF
+default openSUSE_Tumbleweed.conf
+timeout 5
+console-mode auto
+EOF
+
Write Entries:
+tee -a /boot/efi/loader/entries/openSUSE_Tumbleweed.conf << EOF
+title openSUSE Tumbleweed
+linux /EFI/openSUSE/vmlinuz
+initrd /EFI/openSUSE/initrd
+options root=zfs:rpool/ROOT/suse boot=zfs
+EOF
+
Copy files into EFI:
+mkdir /boot/efi/EFI/openSUSE
+cp /boot/{vmlinuz,initrd} /boot/efi/EFI/openSUSE
+
Update systemd-boot variables:
+bootctl update
+
Fix filesystem mount ordering:
+We need to activate zfs-mount-generator
. This makes systemd aware of
+the separate mountpoints, which is important for things like /var/log
+and /var/tmp
. In turn, rsyslog.service
depends on var-log.mount
+by way of local-fs.target
and services using the PrivateTmp
feature
+of systemd automatically use After=var-tmp.mount
.
mkdir /etc/zfs/zfs-list.cache
+touch /etc/zfs/zfs-list.cache/bpool
+touch /etc/zfs/zfs-list.cache/rpool
+ln -s /usr/lib/zfs/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d
+zed -F &
+
Verify that zed
updated the cache by making sure these are not empty:
cat /etc/zfs/zfs-list.cache/bpool
+cat /etc/zfs/zfs-list.cache/rpool
+
If either is empty, force a cache update and check again:
+zfs set canmount=on bpool/BOOT/suse
+zfs set canmount=noauto rpool/ROOT/suse
+
If they are still empty, stop zed (as below), start zed (as above) and try +again.
+Stop zed
:
fg
+Press Ctrl-C.
+
Fix the paths to eliminate /mnt
:
sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
+
Optional: Install SSH:
+zypper install --yes openssh-server
+
+vi /etc/ssh/sshd_config
+# Set: PermitRootLogin yes
+
Optional: Snapshot the initial installation:
+zfs snapshot bpool/BOOT/suse@install
+zfs snapshot rpool/ROOT/suse@install
+
In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space.
+Exit from the chroot
environment back to the LiveCD environment:
exit
+
Run these commands in the LiveCD environment to unmount all +filesystems:
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+
Reboot:
+reboot
+
Wait for the newly installed system to boot normally. Login as root.
+Create a user account:
+Replace username
with your desired username:
zfs create rpool/home/username
+adduser username
+
+cp -a /etc/skel/. /home/username
+chown -R username:username /home/username
+usermod -a -G audio,cdrom,dip,floppy,netdev,plugdev,sudo,video username
+
Mirror GRUB
+If you installed to multiple disks, install GRUB on the additional +disks.
+For legacy (BIOS) booting: +Check which boot mode the system is using:
+efibootmgr -v
+
This must return a message containing legacy_boot.
+Then reconfigure grub:
+grub2-install $DISK
+
Hit enter until you get to the device selection screen. +Select (using the space bar) all of the disks (not partitions) in your pool.
+For UEFI booting:
+umount /boot/efi
+
For the second and subsequent disks (increment opensuse-2 to -3, etc.):
+dd if=/dev/disk/by-id/scsi-SATA_disk1-part2 \
+ of=/dev/disk/by-id/scsi-SATA_disk2-part2
+efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \
+ -p 2 -L "opensuse-2" -l '\EFI\opensuse\grubx64.efi'
+
+mount /boot/efi
+
Caution: On systems with extremely high memory pressure, using a +zvol for swap can result in lockup, regardless of how much swap is still +available. There is a bug report upstream.
+Create a volume dataset (zvol) for use as a swap device:
+zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \
+ -o logbias=throughput -o sync=always \
+ -o primarycache=metadata -o secondarycache=none \
+ -o com.sun:auto-snapshot=false rpool/swap
+
You can adjust the size (the 4G
part) to your needs.
The compression algorithm is set to zle
because it is the cheapest
+available algorithm. As this guide recommends ashift=12
(4 kiB
+blocks on disk), the common case of a 4 kiB page size means that no
+compression algorithm can reduce I/O. The exception is all-zero pages,
+which are dropped by ZFS; but some form of compression has to be enabled
+to get this behavior.
Configure the swap device:
+Caution: Always use long /dev/zvol
aliases in configuration
+files. Never use a short /dev/zdX
device name.
mkswap -f /dev/zvol/rpool/swap
+echo /dev/zvol/rpool/swap none swap discard 0 0 >> /etc/fstab
+echo RESUME=none > /etc/initramfs-tools/conf.d/resume
+
The RESUME=none
is necessary to disable resuming from hibernation.
+This does not work, as the zvol is not present (because the pool has not
+yet been imported) at the time the resume script runs. If it is not
+disabled, the boot process hangs for 30 seconds waiting for the swap
+zvol to appear.
Enable the swap device:
+swapon -av
+
Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally.
Optional: Delete the snapshots of the initial installation:
+sudo zfs destroy bpool/BOOT/suse@install
+sudo zfs destroy rpool/ROOT/suse@install
+
Optional: Disable the root password:
+sudo usermod -p '*' root
+
Optional (but highly recommended): Disable root SSH logins:
+If you installed SSH earlier, revert the temporary change:
+vi /etc/ssh/sshd_config
+# Remove: PermitRootLogin yes
+
+systemctl restart sshd
+
Optional: Re-enable the graphical boot process:
+If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer.
+sudo vi /etc/default/grub
+# Add quiet to GRUB_CMDLINE_LINUX_DEFAULT
+# Comment out GRUB_TERMINAL=console
+# Save and quit.
+
+sudo update-bootloader
+
Note: Ignore errors from osprober
, if present.
Optional: For LUKS installs only, backup the LUKS header:
+sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \
+ --header-backup-file luks1-header.dat
+
Store that backup somewhere safe (e.g. cloud storage). It is protected by +your LUKS passphrase, but you may wish to use additional encryption.
+Hint: If you created a mirror or raidz topology, repeat this for each
+LUKS volume (luks2
, etc.).
Go through Step 1: Prepare The Install Environment.
+For LUKS, first unlock the disk(s):
+zypper install cryptsetup
+cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1
+# Repeat for additional disks, if this is a mirror or raidz topology.
+
Mount everything correctly:
+zpool export -a
+zpool import -N -R /mnt rpool
+zpool import -N -R /mnt bpool
+zfs load-key -a
+zfs mount rpool/ROOT/suse
+zfs mount -a
+
If needed, you can chroot into your installed environment:
+mount --make-private --rbind /dev /mnt/dev
+mount --make-private --rbind /proc /mnt/proc
+mount --make-private --rbind /sys /mnt/sys
+chroot /mnt /bin/bash --login
+mount /boot/efi
+mount -a
+
Do whatever you need to do to fix your system.
+When done, cleanup:
+exit
+mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
+ xargs -i{} umount -lf {}
+zpool export -a
+reboot
+
Systems that require the arcsas
blob driver should add it to the
+/etc/initramfs-tools/modules
file and run update-initramfs -c -k all
.
Upgrade or downgrade the Areca driver if something like
+RIP: 0010:[<ffffffff8101b316>] [<ffffffff8101b316>] native_read_tsc+0x6/0x20
+appears anywhere in kernel log. ZoL is unstable on systems that emit this
+error message.
Most problem reports for this tutorial involve mpt2sas
hardware that does
+slow asynchronous drive initialization, like some IBM M1015 or OEM-branded
+cards that have been flashed to the reference LSI firmware.
The basic problem is that disks on these controllers are not visible to the +Linux kernel until after the regular system is started, and ZoL does not +hotplug pool members. See https://github.com/zfsonlinux/zfs/issues/330.
+Most LSI cards are perfectly compatible with ZoL. If your card has this
+glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X
in
+/etc/default/zfs
. The system will wait X
seconds for all drives to
+appear before importing the pool.
Set a unique serial number on each virtual disk using libvirt or qemu
+(e.g. -drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890
).
To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host:
+sudo zypper install ovmf
+sudo vi /etc/libvirt/qemu.conf
+
Uncomment these lines:
+nvram = [
+ "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/OVMF/OVMF_CODE.secboot.fd:/usr/share/OVMF/OVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd",
+ "/usr/share/AAVMF/AAVMF32_CODE.fd:/usr/share/AAVMF/AAVMF32_VARS.fd"
+]
+
sudo systemctl restart libvirtd.service
+
Set disk.EnableUUID = "TRUE"
in the vmx file or vsphere configuration.
+Doing this ensures that /dev/disk
aliases are created in the guest.
This section is compatible with Alpine, Arch, Fedora and RHEL guides. +Not necessary for NixOS. Incompatible with Ubuntu and Debian guides.
+Note: boot environments as described below are intended only for +system recovery purposes, that is, you boot into the alternate boot +environment once to perform system recovery on the default datasets:
+rpool/distro/root
+bpool/distro/root
+
then reboot to those datasets once you have successfully recovered the +system.
+Switching the default boot environment complicates bootloader recovery +and other maintenance operations and is thus currently not supported.
+If you want to use the @initial-installation
snapshot created
+during installation, set my_boot_env=initial-installation
and
+skip Step 3 and 4.
Identify which dataset is currently mounted as root
+/
and boot /boot
set -x
+boot_dataset=$(df -P /boot | tail -n1 | cut -f1 -d' ' || true )
+root_dataset=$(df -P / | tail -n1 | cut -f1 -d' ' || true )
+
Choose a name for the new boot environment
+my_boot_env=backup
+
Take snapshots of the /
and /boot
datasets
zfs snapshot "${boot_dataset}"@"${my_boot_env}"
+zfs snapshot "${root_dataset}"@"${my_boot_env}"
+
Create clones from read-only snapshots
+new_root_dataset="${root_dataset%/*}"/"${my_boot_env}"
+new_boot_dataset="${boot_dataset%/*}"/"${my_boot_env}"
+
+zfs clone -o canmount=noauto \
+ -o mountpoint=/ \
+ "${root_dataset}"@"${my_boot_env}" \
+ "${new_root_dataset}"
+
+zfs clone -o canmount=noauto \
+ -o mountpoint=legacy \
+ "${boot_dataset}"@"${my_boot_env}" \
+ "${new_boot_dataset}"
+
Mount clone and update file system table (fstab)
+MNT=$(mktemp -d)
+mount -t zfs -o zfsutil "${new_root_dataset}" "${MNT}"
+mount -t zfs "${new_boot_dataset}" "${MNT}"/boot
+
+sed -i s,"${root_dataset}","${new_root_dataset}",g "${MNT}"/etc/fstab
+sed -i s,"${boot_dataset}","${new_boot_dataset}",g "${MNT}"/etc/fstab
+
+if test -f "${MNT}"/boot/grub/grub.cfg; then
+ is_grub2=n
+ sed -i s,"${boot_dataset#bpool/}","${new_boot_dataset#bpool/}",g "${MNT}"/boot/grub/grub.cfg
+elif test -f "${MNT}"/boot/grub2/grub.cfg; then
+ is_grub2=y
+ sed -i s,"${boot_dataset#bpool/}","${new_boot_dataset#bpool/}",g "${MNT}"/boot/grub2/grub.cfg
+else
+ echo "ERROR: no grub menu found!"
+ exit 1
+fi
+
Do not proceed if no grub menu was found!
+Unmount clone
+umount -Rl "${MNT}"
+
Add new boot environment as GRUB menu entry
+echo "# ${new_boot_dataset}" > new_boot_env_entry_"${new_boot_dataset##*/}"
+printf '\n%s' "menuentry 'Boot environment ${new_boot_dataset#bpool/} from ${boot_dataset#bpool/}' " \
+ >> new_boot_env_entry_"${new_boot_dataset##*/}"
+if [ "${is_grub2}" = y ]; then
+ # shellcheck disable=SC2016
+ printf '{ search --set=drive1 --label bpool; configfile ($drive1)/%s@/grub2/grub.cfg; }' \
+ "${new_boot_dataset#bpool/}" >> new_boot_env_entry_"${new_boot_dataset##*/}"
+else
+ # shellcheck disable=SC2016
+ printf '{ search --set=drive1 --label bpool; configfile ($drive1)/%s@/grub/grub.cfg; }' \
+ "${new_boot_dataset#bpool/}" >> new_boot_env_entry_"${new_boot_dataset##*/}"
+fi
+
+find /boot/efis/ -name "grub.cfg" -print0 \
+| xargs -t -0I '{}' sh -vxc "tail -n1 new_boot_env_entry_${new_boot_dataset##*/} >> '{}'"
+
Do not delete new_boot_env_entry_"${new_boot_dataset##*/}"
file. It
+is needed when you want to remove the new boot environment from
+GRUB menu later.
After reboot, select boot environment entry from GRUB +menu to boot from the clone. Press ESC inside +submenu to return to the previous menu.
Steps above can also be used to create a new clone +from an existing snapshot.
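As a sketch of that variant, Step 3 can be skipped and Step 4 pointed at an existing snapshot instead (the snapshot name my-old-snap below is hypothetical; the other variables are those set in Steps 1 and 2):
+existing_snap=my-old-snap
+zfs clone -o canmount=noauto -o mountpoint=/ \
+    "${root_dataset}"@"${existing_snap}" "${root_dataset%/*}"/"${my_boot_env}"
+zfs clone -o canmount=noauto -o mountpoint=legacy \
+    "${boot_dataset}"@"${existing_snap}" "${boot_dataset%/*}"/"${my_boot_env}"
+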
To delete the boot environment, first store its name in a +variable:
+my_boot_env=backup
+
Ensure that the boot environment is not +currently used
+set -x
+boot_dataset=$(df -P /boot | tail -n1 | cut -f1 -d' ' || true )
+root_dataset=$(df -P / | tail -n1 | cut -f1 -d' ' || true )
+new_boot_dataset="${boot_dataset%/*}"/"${my_boot_env}"
+rm_boot_dataset=$(head -n1 new_boot_env_entry_"${new_boot_dataset##*/}" | sed 's|^# *||' || true )
+
+if [ "${boot_dataset}" = "${rm_boot_dataset}" ]; then
+ echo "ERROR: the dataset you want to delete is the current root! abort!"
+ exit 1
+fi
+
Then check the origin snapshot
+rm_root_dataset=rpool/"${rm_boot_dataset#bpool/}"
+
+rm_boot_dataset_origin=$(zfs get -H origin "${rm_boot_dataset}"|cut -f3 || true )
+rm_root_dataset_origin=$(zfs get -H origin "${rm_root_dataset}"|cut -f3 || true )
+
Finally, destroy clone (boot environment) and its +origin snapshot
+zfs destroy "${rm_root_dataset}"
+zfs destroy "${rm_root_dataset_origin}"
+zfs destroy "${rm_boot_dataset}"
+zfs destroy "${rm_boot_dataset_origin}"
+
Remove GRUB entry
+new_entry_escaped=$(tail -n1 new_boot_env_entry_"${new_boot_dataset##*/}" | sed -e 's/[\/&]/\\&/g' || true )
+find /boot/efis/ -name "grub.cfg" -print0 | xargs -t -0I '{}' sed -i "/${new_entry_escaped}/d" '{}'
+
When a disk fails in a mirrored setup, the disk can be replaced with +the following procedure.
+Shutdown the computer.
Replace the failed disk with another disk. The replacement should +be at least the same size or larger than the failed disk.
Boot the computer.
+When a disk fails, the system will boot, albeit several minutes +slower than normal.
+For NixOS, this is due to the initrd and systemd designed to only +import a pool in degraded state after a 90s timeout.
+Swap partition on that disk will also fail.
+Install GNU parted
with your distribution package manager.
Identify the bad disk and a working old disk
+ZPOOL_VDEV_NAME_PATH=1 zpool status
+
+pool: bpool
+status: DEGRADED
+action: Replace the device using 'zpool replace'.
+...
+config: bpool
+ mirror-0
+ 2387489723748 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-BAD-part2
+ /dev/disk/by-id/ata-disk_known_good-part2 ONLINE 0 0 0
+
Store the bad disk and a working old disk in variables, omitting the partition number -partN:
disk_to_replace=/dev/disk/by-id/ata-disk_to_replace
+disk_known_good=/dev/disk/by-id/ata-disk_known_good
+
Identify the new disk
+find /dev/disk/by-id/
+
+/dev/disk/by-id/ata-disk_known_good-part1
+/dev/disk/by-id/ata-disk_known_good-part2
+...
+/dev/disk/by-id/ata-disk_known_good-part5
+/dev/disk/by-id/ata-disk_new <-- new disk w/o partition table
+
Store the new disk in a variable
+disk_new=/dev/disk/by-id/ata-disk_new
+
Create partition table on "${disk_new}"
, refer to respective
+installation pages for details.
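If the new disk is the same size or larger, one possible shortcut (a sketch; --replicate and --randomize-guids are standard sgdisk options, and the variables are those set above) is to copy the partition table from the known-good disk and then give the copy fresh GUIDs:
+sgdisk --replicate="${disk_new}" "${disk_known_good}"
+sgdisk --randomize-guids "${disk_new}"
+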
Format and mount EFI system partition, refer to respective +installation pages for details.
Replace failed disk in ZFS pool
+zpool offline bpool "${disk_to_replace}"-part2
+zpool offline rpool "${disk_to_replace}"-part3
+zpool replace bpool "${disk_to_replace}"-part2 "${disk_new}"-part2
+zpool replace rpool "${disk_to_replace}"-part3 "${disk_new}"-part3
+zpool online bpool "${disk_new}"-part2
+zpool online rpool "${disk_new}"-part3
+
Let the new disk resilver. Check status with zpool status
.
Reinstall and mirror bootloader, refer to respective installation +pages for details.
+If you are using NixOS, see below.
+For NixOS, replace bad disk with new disk inside per-host +configuration file.
+sed -i "s|"${disk_to_replace##*/}"|"${disk_new##*/}"|" /etc/nixos/hosts/exampleHost/default.nix
+
Commit and apply the changed configuration, reinstall bootloader, then reboot
+git -C /etc/nixos commit -asm "replace "${disk_to_replace##*/}" with "${disk_new##*/}"."
+
+nixos-rebuild boot --install-bootloader
+
+reboot
+
This section is compatible with Alpine, Arch, Fedora, RHEL and NixOS +root on ZFS guides.
+Sometimes the GRUB bootloader might be accidentally overwritten, +rendering the system inaccessible. However, as long as the disk +partitions where boot pool and root pool resides remain untouched, the +system can still be booted easily.
+Download GRUB rescue image from this repo.
+You can also build the image yourself if you are familiar with Nix +package manager.
+Extract either x86_64-efi or i386-pc image from the archive.
Write the image to a disk.
Boot the computer from the GRUB rescue disk. Select your distro in +GRUB menu.
Reinstall bootloader. See respective installation pages for details.
The OpenZFS software is licensed under the Common Development and Distribution License +(CDDL) unless otherwise noted.
The OpenZFS documentation content is licensed under a Creative Commons Attribution-ShareAlike +license (CC BY-SA 3.0) +unless otherwise noted.
OpenZFS is an associated project of SPI (Software in the Public Interest). SPI is a 501(c)(3) nonprofit +organization which handles the donations, finances, and legal holdings of the project.
Note
+The Linux Kernel is licensed under the GNU General Public License +Version 2 (GPLv2). While +both (OpenZFS and Linux Kernel) are free open source licenses they are +restrictive licenses. The combination of them causes problems because it +prevents using pieces of code exclusively available under one license +with pieces of code exclusively available under the other in the same binary. +In the case of the Linux Kernel, this prevents us from distributing OpenZFS +as part of the Linux Kernel binary. However, there is nothing in either license +that prevents distributing it in the form of a binary module or in the form +of source code.
+Additional reading and opinions:
+ +The number of concurrent operations issued for the async write I/O class +follows a piece-wise linear function defined by a few adjustable points.
+ | o---------| <-- zfs_vdev_async_write_max_active
+ ^ | /^ |
+ | | / | |
+active | / | |
+ I/O | / | |
+count | / | |
+ | / | |
+ |-------o | | <-- zfs_vdev_async_write_min_active
+ 0|_______^______|_________|
+ 0% | | 100% of zfs_dirty_data_max
+ | |
+ | `-- zfs_vdev_async_write_active_max_dirty_percent
+ `--------- zfs_vdev_async_write_active_min_dirty_percent
+
Until the amount of dirty data exceeds a minimum percentage of the dirty +data allowed in the pool, the I/O scheduler will limit the number of +concurrent operations to the minimum. As that threshold is crossed, the +number of concurrent operations issued increases linearly to the maximum +at the specified maximum percentage of the dirty data allowed in the +pool.
+Ideally, the amount of dirty data on a busy pool will stay in the sloped +part of the function between +zfs_vdev_async_write_active_min_dirty_percent and +zfs_vdev_async_write_active_max_dirty_percent. If it exceeds the maximum +percentage, this indicates that the rate of incoming data is greater +than the rate that the backend storage can handle. In this case, we must +further throttle incoming writes, as described in the next section.
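As an illustration of the interpolation only (the real computation happens inside the kernel), the target number of concurrent async writes can be sketched in shell arithmetic; the breakpoint and limit values below are hypothetical stand-ins for the zfs_vdev_async_write_* module parameters:
+# hypothetical values for the four tunables and the current dirty percentage
+min_active=2;  max_active=10
+min_pct=30;    max_pct=60
+dirty_pct=45   # current dirty data as a percentage of zfs_dirty_data_max
+if [ "$dirty_pct" -le "$min_pct" ]; then
+  active=$min_active
+elif [ "$dirty_pct" -ge "$max_pct" ]; then
+  active=$max_active
+else
+  # linear interpolation between the two breakpoints (integer math)
+  active=$(( min_active + (max_active - min_active) * (dirty_pct - min_pct) / (max_pct - min_pct) ))
+fi
+echo "target concurrent async writes per vdev: $active"
+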
+Storage before ZFS involved rather expensive hardware that was unable to +protect against silent corruption and did not scale very well. The +introduction of ZFS has enabled people to use far less expensive +hardware than previously used in the industry with superior scaling. +This page attempts to provide some basic guidance to people buying +hardware for use in ZFS-based servers and workstations.
+Hardware that adheres to this guidance will enable ZFS to reach its full +potential for performance and reliability. Hardware that does not adhere +to it will serve as a handicap. Unless otherwise stated, such handicaps +apply to all storage stacks and are by no means specific to ZFS. Systems +built using competing storage stacks will also benefit from these +suggestions.
+Running the latest BIOS and CPU microcode is highly recommended.
+Computer microprocessors are very complex designs that often have bugs, +which are called errata. Modern microprocessors are designed to utilize +microcode. This puts part of the hardware design into quasi-software +that can be patched without replacing the entire chip. Errata are often +resolved through CPU microcode updates. These are often bundled in BIOS +updates. In some cases, the BIOS interactions with the CPU through +machine registers can be modified to fix things with the same microcode. +If a newer microcode is not bundled as part of a BIOS update, it can +often be loaded by the operating system bootloader or the operating +system itself.
+Bit flips can have fairly dramatic consequences for all computer +filesystems and ZFS is no exception. No technique used in ZFS (or any +other filesystem) is capable of protecting against bit flips. +Consequently, ECC Memory is highly recommended.
+Ordinary background radiation will randomly flip bits in computer +memory, which causes undefined behavior. These are known as “bit flips”. +Each bit flip can have any of four possible consequences depending on +which bit is flipped:
+Bit flips can have no effect.
+Bit flips that have no effect occur in unused memory.
Bit flips can cause runtime failures.
+This is the case when a bit flip occurs in something read from +disk.
Failures are typically observed when program code is altered.
If the bit flip is in a routine within the system’s kernel or +/sbin/init, the system will likely crash. Otherwise, reloading the +affected data can clear it. This is typically achieved by a +reboot.
It can cause data corruption.
+This is the case when the bit is in use by data being written to +disk.
If the bit flip occurs before ZFS’ checksum calculation, ZFS will +not realize that the data is corrupt.
If the bit flip occurs after ZFS’ checksum calculation, but before +write-out, ZFS will detect it, but it might not be able to correct +it.
It can cause metadata corruption.
+This is the case when a bit flips in an on-disk structure being +written to disk.
If the bit flip occurs before ZFS’ checksum calculation, ZFS will +not realize that the metadata is corrupt.
If the bit flip occurs after ZFS’ checksum calculation, but before +write-out, ZFS will detect it, but it might not be able to correct +it.
Recovery from such an event will depend on what was corrupted. In the worst case, a pool could be rendered unimportable.
+All filesystems have poor reliability in their absolute worst +case bit-flip failure scenarios. Such scenarios should be +considered extraordinarily rare.
ZFS depends on the block device layer for storage. Consequently, ZFS is +affected by the same things that affect other filesystems, such as +driver support and non-working hardware. Consequently, there are a few +things to note:
+Never place SATA disks into a SAS expander without a SAS interposer.
+If you do this and it does work, it is the exception, rather than +the rule.
Do not expect SAS controllers to be compatible with SATA port +multipliers.
+This configuration is typically not tested.
The disks could be unrecognized.
Support for SATA port multipliers is inconsistent across OpenZFS +platforms
+Linux drivers generally support them.
Illumos drivers generally do not support them.
FreeBSD drivers are somewhere between Linux and Illumos in terms +of support.
USB hard drives and USB-to-SATA adapters have problems involving sector size reporting, SMART passthrough, the ability to set ERC and other areas. ZFS will perform as well on such devices as they are capable of allowing, but try to avoid them. They should not be expected to have the same up-time as SAS and SATA drives and should be considered unreliable.
+The ideal storage controller for ZFS has the following attributes:
+Driver support on major OpenZFS platforms
+Stability is important.
High per-port bandwidth
+PCI Express interface bandwidth divided by the number of ports
Low cost
+Support for RAID, Battery Backup Units and hardware write caches +is unnecessary.
Marc Bevand’s blog post From 32 to 2 ports: Ideal SATA/SAS Controllers +for ZFS & Linux MD RAID contains an +excellent list of storage controllers that meet these criteria. He +regularly updates it as newer controllers become available.
+Hardware RAID controllers should not be used with ZFS. While ZFS will +likely be more reliable than other filesystems on Hardware RAID, it will +not be as reliable as it would be on its own.
+Hardware RAID will limit opportunities for ZFS to perform self healing on checksum failures. When ZFS does RAID-Z or mirroring, a checksum failure on one disk can be corrected by treating the disk containing the sector as bad for the purpose of reconstructing the original information. This cannot be done when a RAID controller handles the redundancy, unless ZFS stores a duplicate copy of the data, which is the case when the corruption involves metadata, the copies flag is set, or the RAID array is part of a mirror/raid-z vdev within ZFS.
Sector size information is not necessarily passed correctly by hardware RAID on RAID 1. Sector size information cannot be passed correctly on RAID 5/6. Hardware RAID 1 is more likely to experience read-modify-write overhead from partial sector writes, while Hardware RAID 5/6 will almost certainly suffer from partial stripe writes (i.e. the RAID write hole). ZFS using the disks natively allows it to obtain the sector size information reported by the disks to avoid read-modify-write on sectors, while ZFS avoids partial stripe writes on RAID-Z by design from using copy-on-write.
+There can be sector alignment problems on ZFS when a drive +misreports its sector size. Such drives are typically NAND-flash +based solid state drives and older SATA drives from the advanced +format (4K sector size) transition before Windows XP EoL occurred. +This can be manually corrected at +vdev creation.
It is possible for the RAID header to cause misalignment of sector +writes on RAID 1 by starting the array within a sector on an +actual drive, such that manual correction of sector alignment at +vdev creation does not solve the problem.
RAID controller failures can require that the controller be replaced with +the same model, or in less extreme cases, a model from the same +manufacturer. Using ZFS by itself allows any controller to be used.
If a hardware RAID controller’s write cache is used, an additional +failure point is introduced that can only be partially mitigated by +additional complexity from adding flash to save data in power loss +events. The data can still be lost if the battery fails when it is +required to survive a power loss event or there is no flash and power +is not restored in a timely manner. The loss of the data in the write +cache can severely damage anything stored on a RAID array when many +outstanding writes are cached. In addition, all writes are stored in +the cache rather than just synchronous writes that require a write +cache, which is inefficient, and the write cache is relatively small. +ZFS allows synchronous writes to be written directly to flash, which +should provide similar acceleration to hardware RAID and the ability +to accelerate many more in-flight operations.
Behavior during RAID reconstruction when silent corruption damages +data is undefined. There are reports of RAID 5 and 6 arrays being +lost during reconstruction when the controller encounters silent +corruption. ZFS’ checksums allow it to avoid this situation by +determining whether enough information exists to reconstruct data. If +not, the file is listed as damaged in zpool status and the +system administrator has the opportunity to restore it from a backup.
IO response times will be reduced whenever the OS blocks on IO +operations because the system CPU blocks on a much weaker embedded +CPU used in the RAID controller. This lowers IOPS relative to what +ZFS could have achieved.
The controller’s firmware is an additional layer of complexity that +cannot be inspected by arbitrary third parties. The ZFS source code +is open source and can be inspected by anyone.
If multiple RAID arrays are formed by the same controller and one +fails, the identifiers provided by the arrays exposed to the OS might +become inconsistent. Giving the drives directly to the OS allows this +to be avoided via naming that maps to a unique port or unique drive +identifier.
+e.g. If you have arrays A, B, C and D and array B dies, the interaction between the hardware RAID controller and the OS might rename arrays C and D to look like arrays B and C respectively. This can fault pools imported verbatim from the cachefile.
Not all RAID controllers behave this way. This issue has +been observed on both Linux and FreeBSD when system administrators +used single drive RAID 0 arrays, however. It has also been observed +with controllers from different vendors.
One might be inclined to try using single-drive RAID 0 arrays to try to +use a RAID controller like a HBA, but this is not recommended for many +of the reasons listed for other hardware RAID types. It is best to use a +HBA instead of a RAID controller, for both performance and reliability.
+Historically, all hard drives had 512-byte sectors, with the exception +of some SCSI drives that could be modified to support slightly larger +sectors. In 2009, the industry migrated from 512-byte sectors to +4096-byte “Advanced Format” sectors. Since Windows XP is not compatible +with 4096-byte sectors or drives larger than 2TB, some of the first +advanced format drives implemented hacks to maintain Windows XP +compatibility.
+The first advanced format drives on the market misreported their +sector size as 512-bytes for Windows XP compatibility. As of 2013, it +is believed that such hard drives are no longer in production. +Advanced format hard drives made during or after this time should +report their true physical sector size.
Drives storing 2TB and smaller might have a jumper that can be set to map all sectors off by 1. This is to provide proper alignment for Windows XP, which started its first partition at sector 63. This jumper setting should be off when using such drives with ZFS.
As of 2014, there are still 512-byte and 4096-byte drives on the market, but they are known to properly identify themselves unless behind a USB to SATA controller. Replacing a 512-byte sector drive with a 4096-byte sector drive in a vdev created with 512-byte sector drives will adversely affect performance. Replacing a 4096-byte sector drive with a 512-byte sector drive will have no negative effect on performance.
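The sector size a pool assumes is fixed per vdev at creation time through the ashift value. A hedged sketch of checking and setting it (pool and device names are placeholders):
+# show the ashift recorded for an existing pool's vdevs
+zdb -C tank | grep ashift
+# create a pool that always issues 4096-byte-aligned I/O (ashift=12 means 2^12 bytes)
+zpool create -o ashift=12 tank mirror /dev/disk/by-id/disk1 /dev/disk/by-id/disk2
+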
+ZFS is said to be able to use cheap drives. This was true when it was +introduced and hard drives supported Error recovery control. Since ZFS’ +introduction, error recovery control has been removed from low-end +drives from certain manufacturers, most notably Western Digital. +Consistent performance requires hard drives that support error recovery +control.
+Hard drives store data using small polarized regions on a magnetic surface. Reading from and/or writing to this surface poses a few reliability problems. One is that imperfections in the surface can corrupt bits. Another is that vibrations can cause drive heads to miss their targets. Consequently, hard drive sectors are composed of three regions:
+A sector number
The actual data
ECC
The sector number and ECC enables hard drives to detect and respond to +such events. When either event occurs during a read, hard drives will +retry the read many times until they either succeed or conclude that the +data cannot be read. The latter case can take a substantial amount of +time and consequently, IO to the drive will stall.
+Enterprise hard drives and some consumer hard drives implement a feature +called Time-Limited Error Recovery (TLER) by Western Digital, Error +Recovery Control (ERC) by Seagate and Command Completion Time Limit by +Hitachi and Samsung, which permits the time drives are willing to spend +on such events to be limited by the system administrator.
+Drives that lack such functionality can be expected to have arbitrarily +high limits. Several minutes is not impossible. Drives with this +functionality typically default to 7 seconds. ZFS does not currently +adjust this setting on drives. However, it is advisable to write a +script to set the error recovery time to a low value, such as 0.1 +seconds until ZFS is modified to control it. This must be done on every +boot.
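A minimal sketch of such a boot-time script, assuming smartmontools is installed and the drives accept the SCT ERC command (the value is in units of 100 ms, so 1 means 0.1 seconds); drives without ERC support will simply reject the command:
+#!/bin/sh
+# set SCT Error Recovery Control to 0.1 s for reads and writes on whole disks
+for disk in /dev/disk/by-id/ata-*; do
+    case "$disk" in
+        *-part*) continue ;;   # skip partition links, only touch whole disks
+    esac
+    smartctl -q errorsonly -l scterc,1,1 "$disk"
+done
+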
+High RPM drives have lower seek times, which is historically regarded as +being desirable. They increase cost and sacrifice storage density in +order to achieve what is typically no more than a factor of 6 +improvement over their lower RPM counterparts.
+To provide some numbers, a 15k RPM drive from a major manufacturer is +rated for 3.4 millisecond average read and 3.9 millisecond average +write. Presumably, this number assumes that the target sector is at most +half the number of drive tracks away from the head and half the disk +away. Being even further away is worst-case 2 times slower. Manufacturer +numbers for 7200 RPM drives are not available, but they average 13 to 16 +milliseconds in empirical measurements. 5400 RPM drives can be expected +to be slower.
+ARC and ZIL are able to mitigate much of the benefit of lower seek +times. Far larger increases in IOPS performance can be obtained by +adding additional RAM for ARC, L2ARC devices and SLOG devices. Even +higher increases in performance can be obtained by replacing hard drives +with solid state storage entirely. Such things are typically more cost +effective than high RPM drives when considering IOPS.
+Drives with command queues are able to reorder IO operations to increase IOPS. This is called Native Command Queuing on SATA and Tagged Command Queuing on PATA/SCSI/SAS. ZFS stores objects in metaslabs and it can use several metaslabs at any given time. Consequently, ZFS is not only designed to take advantage of command queuing, but good ZFS performance requires command queuing. Almost all drives manufactured within the past 10 years can be expected to support command queuing. The exceptions are:
+Consumer PATA/IDE drives
First generation SATA drives, which used IDE to SATA translation +chips, from 2003 to 2004.
SATA drives operating under IDE emulation that was configured in the +system BIOS.
Each OpenZFS system has different methods for checking whether command
+queuing is supported. On Linux, hdparm -I /path/to/device | grep
+Queue
is used. On FreeBSD, camcontrol identify $DEVICE
is used.
As of 2014, Solid state storage is dominated by NAND-flash and most +articles on solid state storage focus on it exclusively. As of 2014, the +most popular form of flash storage used with ZFS involve drives with +SATA interfaces. Enterprise models with SAS interfaces are beginning to +become available.
+As of 2017, Solid state storage using NAND-flash with PCI-E interfaces +are widely available on the market. They are predominantly enterprise +drives that utilize a NVMe interface that has lower overhead than the +ATA used in SATA or SCSI used in SAS. There is also an interface known +as M.2 that is primarily used by consumer SSDs, although not necessarily +limited to them. It can provide electrical connectivity for multiple +buses, such as SATA, PCI-E and USB. M.2 SSDs appear to use either SATA +or NVME.
+Many NVMe SSDs support both 512-byte sectors and 4096-byte sectors. They +often ship with 512-byte sectors, which are less performant than +4096-byte sectors. Some also support metadata for T10/DIF CRC to try to +improve reliability, although this is unnecessary with ZFS.
+NVMe drives should be
+formatted
+to use 4096-byte sectors without metadata prior to being given to ZFS
+for best performance unless they indicate that 512-byte sectors are as
+performant as 4096-byte sectors, although this is unlikely. Lower
+numbers in the Rel_Perf of Supported LBA Sizes from smartctl -a
+/dev/$device_namespace
(for example smartctl -a /dev/nvme1n1
)
+indicate higher performance low level formats, with 0 being the best.
+The current formatting will be marked by a plus sign under the format
+Fmt.
You may format a drive using nvme format /dev/nvme1n1 -l $ID
. The $ID
+corresponds to the Id field value from the Supported LBA Sizes SMART
+information.
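Putting the commands above together, an illustrative sequence looks like the following; the namespace name and the LBA format Id are drive-specific:
+# list Supported LBA Sizes and note the Id with the lowest Rel_Perf
+smartctl -a /dev/nvme1n1
+# reformat the namespace to that LBA format; this destroys all data on it,
+# so do it before the device is given to ZFS
+nvme format /dev/nvme1n1 -l "$ID"
+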
On-flash data structures are highly complex and traditionally have been +highly vulnerable to corruption. In the past, such corruption would +result in the loss of *all* drive data and an event such as a PSU +failure could result in multiple drives simultaneously failing. Since +the drive firmware is not available for review, the traditional +conclusion was that all drives that lack hardware features to avoid +power failure events cannot be trusted, which was found to be the case +multiple times in the +past [1] [2] [3]. +Discussion of power failures bricking NAND flash SSDs appears to have +vanished from literature following the year 2015. SSD manufacturers now +claim that firmware power loss protection is robust enough to provide +equivalent protection to hardware power loss protection. Kingston is one +example. +Firmware power loss protection is used to guarantee the protection of +flushed data and the drives’ own metadata, which is all that filesystems +such as ZFS need.
+However, those that either need or want strong guarantees that firmware +bugs are unlikely to be able to brick drives following power loss events +should continue to use drives that provide hardware power loss +protection. The basic concept behind how hardware power failure +protection works has been documented by +Intel +for those who wish to read about the details. As of 2020, use of +hardware power loss protection is now a feature solely of enterprise +SSDs that attempt to protect unflushed data in addition to drive +metadata and flushed data. This additional protection beyond protecting +flushed data and the drive metadata provides no additional benefit to +ZFS, but it does not hurt it.
+It should also be noted that drives in data centers and laptops are +unlikely to experience power loss events, reducing the usefulness of +hardware power loss protection. This is especially the case in +datacenters where redundant power, UPS power and the use of IPMI to do +forced reboots should prevent most drives from experiencing power loss +events.
+Lists of drives that provide hardware power loss protection are +maintained below for those who need/want it. Since ZFS, like other +filesystems, only requires power failure protection for flushed data and +drive metadata, older drives that only protect these things are included +on the lists.
+A non-exhaustive list of NVMe drives with power failure protection is as +follows:
+Intel 750
Intel DC P3500/P3600/P3608/P3700
Micron 7300/7400/7450 PRO/MAX
Samsung PM963 (M.2 form factor)
Samsung PM1725/PM1725a
Samsung XS1715
Toshiba ZD6300
Seagate Nytro 5000 M.2 (XP1920LE30002 tested; read notes below +before buying)
+Inexpensive 22110 M.2 enterprise drive using consumer MLC that is +optimized for read mostly workloads. It is not a good choice for a +SLOG device, which is a write mostly workload.
The manual for this drive specifies airflow requirements. If the drive does not receive sufficient airflow from case fans, it will overheat at idle. Its thermal throttling will severely degrade performance such that write throughput performance will be limited to 1/10 of the specification and read latencies will reach several hundred milliseconds. Under continuous load, the device will continue to become hotter until it suffers a “degraded reliability” event where all data on at least one NVMe namespace is lost. The NVMe namespace is then unusable until a secure erase is done. Even with sufficient airflow under normal circumstances, data loss is possible under load following the failure of fans in an enterprise environment. Anyone deploying this into production in an enterprise environment should be mindful of this failure mode.
Those who wish to use this drive in a low airflow situation can +workaround this failure mode by placing a passive heatsink such as +this on the +NAND flash controller. It is the chip under the sticker closest to +the capacitors. This was tested by placing the heatsink over the +sticker (as removing it was considered undesirable). The heatsink +will prevent the drive from overheating to the point of data loss, +but it will not fully alleviate the overheating situation under +load without active airflow. A scrub will cause it to overheat +after a few hundred gigabytes are read. However, the thermal +throttling will quickly cool the drive from 76 degrees Celsius to +74 degrees Celsius, restoring performance.
+It might be possible to use the heatsink in an enterprise +environment to provide protection against data loss following +fan failures. However, this was not evaluated. Furthermore, +operating temperatures for consumer NAND flash should be at or +above 40 degrees Celsius for long term data integrity. +Therefore, the use of a heatsink to provide protection against +data loss following fan failures in an enterprise environment +should be evaluated before deploying drives into production to +ensure that the drive is not overcooled.
A non-exhaustive list of SAS drives with power failure protection is as +follows:
+Samsung PM1633/PM1633a
Samsung SM1625
Samsung PM853T
Toshiba PX05SHB***/PX04SHB***/PX04SHQ***
Toshiba PX05SLB***/PX04SLB***/PX04SLQ***
Toshiba PX05SMB***/PX04SMB***/PX04SMQ***
Toshiba PX05SRB***/PX04SRB***/PX04SRQ***
Toshiba PX05SVB***/PX04SVB***/PX04SVQ***
A non-exhaustive list of SATA drives with power failure protection is as +follows:
+Crucial MX100/MX200/MX300
Crucial M500/M550/M600
Intel 320
+Early reports claimed that the 330 and 335 had power failure +protection too, but they do +not.
Intel 710
Intel 730
Intel DC S3500/S3510/S3610/S3700/S3710
Kingston DC500R/DC500M
Micron 5210 Ion
+First QLC drive on the list. High capacity with a low price per +gigabyte.
Samsung PM863/PM863a
Samsung SM843T (do not confuse with SM843)
Samsung SM863/SM863a
Samsung 845DC Evo
Samsung 845DC Pro
+ +Toshiba HK4E/HK3E2
Toshiba HK4R/HK3R2/HK3R
These lists have been compiled on a volunteer basis by OpenZFS contributors (mainly Richard Yao) from trustworthy sources of information. The lists are intended to be vendor neutral and are not intended to benefit any particular manufacturer. Any perceived bias toward any manufacturer is caused by a lack of awareness and a lack of time to research additional options. Confirmation of the presence of adequate power loss protection by a reliable source is the only requirement for inclusion into this list. Adequate power loss protection means that the drive must protect both its own internal metadata and all flushed data. Protection of unflushed data is irrelevant and therefore not a requirement. ZFS only expects storage to protect flushed data. Consequently, solid state drives whose power loss protection only protects flushed data are sufficient for ZFS to ensure that data remains safe.
+Anyone who believes an unlisted drive to provide adequate power failure +protection may contact the Mailing Lists with +a request for inclusion and substantiation for the claim that power +failure protection is provided. Examples of substantiation include +pictures of drive internals showing the presence of capacitors, +statements by well regarded independent review sites such as Anandtech +and manufacturer specification sheets. The latter are accepted on the +honor system until a manufacturer is found to misstate reality on the +protection of the drives’ own internal metadata structures and/or the +protection of flushed data. Thus far, all manufacturers have been +honest.
The smallest unit on a NAND chip that can be written is a flash page. The first NAND-flash SSDs on the market had 4096-byte pages. Further complicating matters is that the page size has been doubled twice since then. NAND flash SSDs should report these pages as being sectors, but so far, all of them incorrectly report 512-byte sectors for Windows XP compatibility. The consequence is that we have a similar situation to what we had with early advanced format hard drives.
+As of 2014, most NAND-flash SSDs on the market have 8192-byte page +sizes. However, models using 128-Gbit NAND from certain manufacturers +have a 16384-byte page size. Maximum performance requires that vdevs be +created with correct ashift values (13 for 8192-byte and 14 for +16384-byte). However, not all OpenZFS platforms support this. The Linux +port supports ashift=13, while others are limited to ashift=12 +(4096-byte).
+As of 2017, NAND-flash SSDs are tuned for 4096-byte IOs. Matching the +flash page size is unnecessary and ashift=12 is usually the correct +choice. Public documentation on flash page size is also nearly +non-existent.
+It should be noted that this is a separate case from +discard on zvols or hole punching on filesystems. Those work regardless +of whether ATA TRIM / SCSI UNMAP is sent to the actual block devices.
+The ATA TRIM command in SATA 3.0 and earlier is a non-queued command. +Issuing a TRIM command on a SATA drive conforming to SATA 3.0 or earlier +will cause the drive to drain its IO queue and stop servicing requests +until it finishes, which hurts performance. SATA 3.1 removed this +limitation, but very few SATA drives on the market are conformant to +SATA 3.1 and it is difficult to distinguish them from SATA 3.0 drives. +At the same time, SCSI UNMAP has no such problems.
+These are SSDs with far better latencies and write endurance than NAND +flash SSDs. They are byte addressable, such that ashift=9 is fine for +use on them. Unlike NAND flash SSDs, they do not require any special +power failure protection circuitry for reliability. There is also no +need to run TRIM on them. However, they cost more per GB than NAND flash +(as of 2020). The enterprise models make excellent SLOG devices. Here is +a list of models that are known to perform well:
+ +Note that SLOG devices rarely have more than 4GB in use at any given +time, so the smaller sized devices are generally the best choice in +terms of cost, with larger sizes giving no benefit. Larger sizes could +be a good choice for other vdev types, depending on performance needs +and cost considerations.
+Ensuring that computers are properly grounded is highly recommended. +There have been cases in user homes where machines experienced random +failures when plugged into power receptacles that had open grounds (i.e. +no ground wire at all). This can cause random failures on any computer +system, whether it uses ZFS or not.
+Power should also be relatively stable. Large dips in voltages from +brownouts are preferably avoided through the use of UPS units or line +conditioners. Systems subject to unstable power that do not outright +shutdown can exhibit undefined behavior. PSUs with longer hold-up times +should be able to provide partial protection against this, but hold up +times are often undocumented and are not a substitute for a UPS or line +conditioner.
PSUs are supposed to deassert a PWR_OK signal to indicate that provided voltages are no longer within the rated specification. This should force an immediate shutdown. However, the system clock of a developer workstation was observed to significantly deviate from the expected value during a series of ~1 second brownouts. This machine did not use a UPS at the time. However, the PWR_OK mechanism should have protected against this. The observation of the PWR_OK signal failing to force a shutdown with adverse consequences (to the system clock in this case) suggests that the PWR_OK mechanism is not a strict guarantee.
A PSU hold-up time is the amount of time that a PSU can continue to output power at maximum output within standard voltage tolerances following the loss of input power. This is important for supporting UPS units because the transfer time taken by a standard UPS to supply power from its battery can leave machines without power for “5-12 ms”. Intel’s ATX Power Supply design guide specifies a hold-up time of 17 milliseconds at maximum continuous output. The hold-up time is an inverse function of how much power is being output by the PSU, with lower power output increasing hold-up times.
+Capacitor aging in PSUs will lower the hold-up time below what it was +when new, which could cause reliability issues as equipment ages. +Machines using substandard PSUs with hold-up times below the +specification therefore require higher end UPS units for protection to +ensure that the transfer time does not exceed the hold-up time. A +hold-up time below the transfer time during a transfer to battery power +can cause undefined behavior should the PWR_OK signal not become +deasserted to force the machine to power off.
If in doubt, use a double conversion UPS unit. Double conversion UPS units always run off the battery, such that the transfer time is 0. This is unless they are high efficiency models that are hybrids between standard UPS units and double conversion UPS units, although these are reported to have much lower transfer times than standard UPS units. You could also contact your PSU manufacturer for the hold-up time specification, but if reliability for years is a requirement, you should use a higher end UPS with a low transfer time.
+Note that double conversion units are at most 94% efficient unless they +support a high efficiency mode, which adds latency to the time to +transition to battery power.
+The lead acid batteries in UPS units generally need to be replaced +regularly to ensure that they provide power during power outages. For +home systems, this is every 3 to 5 years, although this varies with +temperature [4]. For +enterprise systems, contact your vendor.
+Footnotes
+ +Most of the ZFS kernel module parameters are accessible in the SysFS
+/sys/module/zfs/parameters
directory. Current values can be observed
+by
cat /sys/module/zfs/parameters/PARAMETER
+
Many of these can be changed by writing new values. These are denoted by +Change|Dynamic in the PARAMETER details below.
+echo NEWVALUE >> /sys/module/zfs/parameters/PARAMETER
+
If the parameter is not dynamically adjustable, an error can occur and +the value will not be set. It can be helpful to check the permissions +for the PARAMETER file in SysFS.
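For example, to see whether a given parameter can be written at run time (zfs_arc_max is used here purely as an example of a commonly tuned parameter):
+# a write bit in the mode (e.g. -rw-------) indicates the parameter is
+# dynamically adjustable; -r--r--r-- indicates it is read-only
+ls -l /sys/module/zfs/parameters/zfs_arc_max
+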
+In some cases, the parameter must be set prior to loading the kernel
+modules or it is desired to have the parameters set automatically at
+boot time. For many distros, this can be accomplished by creating a file
+named /etc/modprobe.d/zfs.conf
containing a text line for each
+module parameter using the format:
# change PARAMETER for workload XZY to solve problem PROBLEM_DESCRIPTION
+# changed by YOUR_NAME on DATE
+options zfs PARAMETER=VALUE
+
Some parameters related to ZFS operations are located in module
+parameters other than in the zfs
kernel module. These are documented
+in the individual parameter description. Unless otherwise noted, the
+tunable applies to the zfs
kernel module. For example, the icp
+kernel module parameters are visible in the
+/sys/module/icp/parameters
directory and can be set by default at
+boot time by changing the /etc/modprobe.d/icp.conf
file.
See the man page for modprobe.d for more information.
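As an illustration only (parameter names and accepted values depend on the OpenZFS version and CPU; icp_aes_impl is assumed to exist in your build), pinning the ICP AES implementation could look like:
+# /etc/modprobe.d/icp.conf
+# use the AES-NI implementation instead of the benchmarked "fastest" choice
+options icp icp_aes_impl=aesni
+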
+The zfs(4) and spl(4) man
+pages (previously zfs-
and spl-module-parameters(5)
, respectively,
+prior to OpenZFS 2.1) contain brief descriptions of
+the module parameters. Alas, man pages are not as suitable for quick
+reference as documentation pages. This page is intended to be a better
+cross-reference and capture some of the wisdom of ZFS developers and
+practitioners.
The ZFS kernel module, zfs.ko
, parameters are detailed below.
To observe the list of parameters along with a short synopsis of each
+parameter, use the modinfo
command:
modinfo zfs
+
When set, the hole_birth optimization will not be used and all holes
+will always be sent by zfs send.
In the source code,
+ignore_hole_birth is an alias for, and the SysFS PARAMETER for,
+send_holes_without_birth_time.
ignore_hole_birth |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Enable if you suspect your datasets are
+affected by a bug in hole_birth during
+zfs send |
+
Data Type |
+boolean |
+
Range |
+0=disabled, 1=enabled |
+
Default |
+1 (hole birth optimization is ignored) |
+
Change |
+Dynamic |
+
Versions Affected |
+TBD |
+
Controls whether buffers present on special vdevs are eligible for +caching into L2ARC.
+l2arc_exclude_special |
+Notes |
+
---|---|
Tags |
+ARC, +L2ARC, +special_vdev, |
+
When to change |
+If cache and special devices exist and caching +data on special devices in L2ARC is not desired |
+
Data Type |
+boolean |
+
Range |
+0=disabled, 1=enabled |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+TBD |
+
Turbo L2ARC cache warm-up. When the L2ARC is cold the fill interval will +be set to aggressively fill as fast as possible.
+l2arc_feed_again |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If cache devices exist and it is desired to +fill them as fast as possible |
+
Data Type |
+boolean |
+
Range |
+0=disabled, 1=enabled |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+TBD |
+
Minimum time period for aggressively feeding the L2ARC. The L2ARC feed
+thread wakes up once per second (see
+l2arc_feed_secs) to look for data to feed into
+the L2ARC. l2arc_feed_min_ms
only affects the turbo L2ARC cache
+warm-up and allows the aggressiveness to be adjusted.
l2arc_feed_min_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If cache devices exist and +l2arc_feed_again and +the feed is too aggressive, then this tunable +can be adjusted to reduce the impact of the +fill |
+
Data Type |
+uint64 |
+
Units |
+milliseconds |
+
Range |
+0 to (1000 * l2arc_feed_secs) |
+
Default |
+200 |
+
Change |
+Dynamic |
+
Versions Affected |
+0.6 and later |
+
Seconds between waking the L2ARC feed thread. One feed thread works for +all cache devices in turn.
+If the pool that owns a cache device is imported readonly, then the feed +thread is delayed 5 * l2arc_feed_secs before +moving onto the next cache device. If multiple pools are imported with +cache devices and one pool with cache is imported readonly, the L2ARC +feed rate to all caches can be slowed.
+l2arc_feed_secs |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Do not change |
+
Data Type |
+uint64 |
+
Units |
+seconds |
+
Range |
+1 to UINT64_MAX |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+0.6 and later |
+
How far through the ARC lists to search for L2ARC cacheable content, +expressed as a multiplier of l2arc_write_max
+l2arc_headroom |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If the rate of change in the ARC is faster than +the overall L2ARC feed rate, then increasing +l2arc_headroom can increase L2ARC efficiency. +Setting the value too large can cause the L2ARC +feed thread to consume more CPU time looking +for data to feed. |
+
Data Type |
+uint64 |
+
Units |
+unit |
+
Range |
+0 to UINT64_MAX |
+
Default |
+2 |
+
Change |
+Dynamic |
+
Versions Affected |
+0.6 and later |
+
Percentage scale for l2arc_headroom when L2ARC +contents are being successfully compressed before writing.
+l2arc_headroom_boost |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If average compression efficiency is greater than 2:1, then increasing l2arc_headroom_boost can increase the L2ARC feed rate |
+
Data Type |
+uint64 |
+
Units |
+percent |
+
Range |
+100 to UINT64_MAX, when set to 100, the +L2ARC headroom boost feature is effectively +disabled |
+
Default |
+200 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
Disable writing compressed data to cache devices. Disabling allows the +legacy behavior of writing decompressed data to cache devices.
+l2arc_nocompress |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When testing compressed L2ARC feature |
+
Data Type |
+boolean |
+
Range |
+0=store compressed blocks in cache device, +1=store uncompressed blocks in cache device |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+deprecated in v0.7.0 by new compressed ARC +design |
+
Percent of ARC size allowed for L2ARC-only headers. Since L2ARC buffers are not evicted on memory pressure, too large an amount of headers on a system with an irrationally large L2ARC can render it slow or unusable. This parameter limits L2ARC writes and rebuilds to enforce the limit.
l2arc_meta_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When the workload really requires an enormous L2ARC. |
+
Data Type |
+int |
+
Range |
+0 to 100 |
+
Default |
+33 |
+
Change |
+Dynamic |
+
Versions Affected |
+v2.0 and later |
+
Controls whether only MFU metadata and data are cached from ARC into L2ARC. +This may be desirable to avoid wasting space on L2ARC when reading/writing +large amounts of data that are not expected to be accessed more than once. +By default both MRU and MFU data and metadata are cached in the L2ARC.
+l2arc_mfuonly |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When accessing a large amount of data only +once. |
+
Data Type |
+boolean |
+
Range |
+0=store MRU and MFU blocks in cache device, +1=store MFU blocks in cache device |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v2.0 and later |
+
Disables writing prefetched, but unused, buffers to cache devices.
+l2arc_noprefetch |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Setting to 0 can increase L2ARC hit rates for +workloads where the ARC is too small for a read +workload that benefits from prefetching. Also, +if the main pool devices are very slow, setting +to 0 can improve some workloads such as +backups. |
+
Data Type |
+boolean |
+
Range |
+0=write prefetched but unused buffers to cache +devices, 1=do not write prefetched but unused +buffers to cache devices |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.0 and later |
+
Disables writing to cache devices while they are being read.
+l2arc_norw |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+In the early days of SSDs, some devices did not +perform well when reading and writing +simultaneously. Modern SSDs do not have these +issues. |
+
Data Type |
+boolean |
+
Range |
+0=read and write simultaneously, 1=avoid writes +when reading for antique SSDs |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
The minimum required size (in bytes) of an L2ARC device in order to write log blocks in it. The log blocks are used upon importing the pool to rebuild the persistent L2ARC. For L2ARC devices less than 1GB, the overhead involved offsets most of the benefit, so log blocks are not written for cache devices smaller than this.
+l2arc_rebuild_blocks_min_l2size |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+The cache device is small and +the pool is frequently imported. |
+
Data Type |
+bytes |
+
Range |
+0 to UINT64_MAX |
+
Default |
+1,073,741,824 |
+
Change |
+Dynamic |
+
Versions Affected |
+v2.0 and later |
+
Rebuild the persistent L2ARC when importing a pool.
+l2arc_rebuild_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If there are problems importing a pool or +attaching an L2ARC device. |
+
Data Type |
+boolean |
+
Range |
+0=disable persistent L2ARC rebuild, +1=enable persistent L2ARC rebuild |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v2.0 and later |
+
Once the cache device has been filled, TRIM ahead of the current write size
+l2arc_write_max
on L2ARC devices by this percentage. This can speed
+up future writes depending on the performance characteristics of the
+cache device.
When set to 100%, TRIM twice the space required to accommodate upcoming writes. A minimum of 64MB will be trimmed. If set, this also enables TRIM of the whole L2ARC device when it is added to a pool. By default, this option is disabled since it can put significant stress on the underlying storage devices.
+l2arc_trim_ahead |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Consider setting for cache devices which efficiently handle TRIM commands. |
+
Data Type |
+ulong |
+
Units |
+percent of l2arc_write_max |
+
Range |
+0 to 100 |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v2.0 and later |
+
Until the ARC fills, increase the L2ARC fill rate
+l2arc_write_max by l2arc_write_boost
.
l2arc_write_boost |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To fill the cache devices more aggressively +after pool import. |
+
Data Type |
+uint64 |
+
Units |
+bytes |
+
Range |
+0 to UINT64_MAX |
+
Default |
+8,388,608 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
Maximum number of bytes to be written to each cache device for each +L2ARC feed thread interval (see l2arc_feed_secs). +The actual limit can be adjusted by +l2arc_write_boost. By default +l2arc_feed_secs is 1 second, delivering a maximum +write workload to cache devices of 8 MiB/sec.
+l2arc_write_max |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If the cache devices can sustain the write +workload, increasing the rate of cache device +fill when workloads generate new data at a rate +higher than l2arc_write_max can increase L2ARC +hit rate |
+
Data Type |
+uint64 |
+
Units |
+bytes |
+
Range |
+1 to UINT64_MAX |
+
Default |
+8,388,608 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
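For example, a persistent tuning that doubles the steady-state fill rate and allows extra warm-up bandwidth could be expressed as follows; the values are illustrative, not a recommendation:
+# /etc/modprobe.d/zfs.conf
+# raise the per-interval L2ARC fill limit to 16 MiB and allow an additional
+# 32 MiB per interval until the ARC has filled
+options zfs l2arc_write_max=16777216 l2arc_write_boost=33554432
+
Both parameters are dynamic, so the same values can also be written to /sys/module/zfs/parameters for testing before making them persistent.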
Sets the metaslab granularity. Nominally, ZFS will try to allocate this +amount of data to a top-level vdev before moving on to the next +top-level vdev. This is roughly similar to what would be referred to as +the “stripe size” in traditional RAID arrays.
+When tuning for HDDs, it can be more efficient to have a few larger,
+sequential writes to a device rather than switching to the next device.
+Monitoring the size of contiguous writes to the disks relative to the
+write throughput can be used to determine if increasing
+metaslab_aliquot
can help. For modern devices, it is unlikely that
+decreasing metaslab_aliquot
from the default will help.
If there is only one top-level vdev, this tunable is not used.
+metaslab_aliquot |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If write performance increases as devices more +efficiently write larger, contiguous blocks |
+
Data Type |
+uint64 |
+
Units |
+bytes |
+
Range |
+0 to UINT64_MAX |
+
Default |
+524,288 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
Enables metaslab group biasing based on a top-level vdev’s utilization +relative to the pool. Nominally, all top-level devs are the same size +and the allocation is spread evenly. When the top-level vdevs are not of +the same size, for example if a new (empty) top-level is added to the +pool, this allows the new top-level vdev to get a larger portion of new +allocations.
+metaslab_bias_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If a new top-level vdev is added and you do +not want to bias new allocations to the new +top-level vdev |
+
Data Type |
+boolean |
+
Range |
+0=spread evenly across top-level vdevs, +1=bias spread to favor less full top-level +vdevs |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
Enables metaslab allocation based on largest free segment rather than +total amount of free space. The goal is to avoid metaslabs that exhibit +free space fragmentation: when there is a lot of small free spaces, but +few larger free spaces.
+If zfs_metaslab_segment_weight_enabled
is enabled, then
+metaslab_fragmentation_factor_enabled
+is ignored.
zfs_metaslab_segment_weight_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When testing allocation and +fragmentation |
+
Data Type |
+boolean |
+
Range |
+0=do not consider metaslab +fragmentation, 1=avoid metaslabs +where free space is highly +fragmented |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
When using segment-based metaslab selection (see
+zfs_metaslab_segment_weight_enabled),
+continue allocating from the active metaslab until
+zfs_metaslab_switch_threshold
worth of free space buckets have been
+exhausted.
zfs_metaslab_switch_threshold |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When testing allocation and +fragmentation |
+
Data Type |
+uint64 |
+
Units |
+free spaces |
+
Range |
+0 to UINT64_MAX |
+
Default |
+2 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
When enabled, all metaslabs are loaded into memory during pool import. +Nominally, metaslab space map information is loaded and unloaded as +needed (see metaslab_debug_unload)
+It is difficult to predict how much RAM is required to store a space +map. An empty or completely full metaslab has a small space map. +However, a highly fragmented space map can consume significantly more +memory.
+Enabling metaslab_debug_load
can increase pool import time.
metaslab_debug_load |
+Notes |
+
---|---|
Tags |
+allocation, +memory, +metaslab |
+
When to change |
+When RAM is plentiful and pool import time is +not a consideration |
+
Data Type |
+boolean |
+
Range |
+0=dynamically load metaslab info as needed, 1=load all metaslab info at pool import |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
When enabled, prevents metaslab information from being dynamically +unloaded from RAM. Nominally, metaslab space map information is loaded +and unloaded as needed (see +metaslab_debug_load)
+It is difficult to predict how much RAM is required to store a space +map. An empty or completely full metaslab has a small space map. +However, a highly fragmented space map can consume significantly more +memory.
+Enabling metaslab_debug_unload
consumes RAM that would otherwise be
+freed.
metaslab_debug_unload |
+Notes |
+
---|---|
Tags |
+allocation, +memory, +metaslab |
+
When to change |
+When RAM is plentiful and the penalty for +dynamically reloading metaslab info from +the pool is high |
+
Data Type |
+boolean |
+
Range |
+0=dynamically unload metaslab info, +1=unload metaslab info only upon pool +export |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
Enable use of the fragmentation metric in computing metaslab weights.
+In version v0.7.0, if
+zfs_metaslab_segment_weight_enabled
+is enabled, then metaslab_fragmentation_factor_enabled
is ignored.
metaslab_fragmentation_factor_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To test metaslab fragmentation |
+
Data Type |
+boolean |
+
Range |
+0=do not consider metaslab free +space fragmentation, 1=try to +avoid fragmented metaslabs |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
When a vdev is added, it will be divided into approximately, but no more +than, this number of metaslabs.
+metaslabs_per_vdev |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When testing metaslab allocation |
+
Data Type |
+uint64 |
+
Units |
+metaslabs |
+
Range |
+16 to UINT64_MAX |
+
Default |
+200 |
+
Change |
+Prior to pool creation or adding new top-level +vdevs |
+
Versions Affected |
+all |
+
Enable metaslab group preloading. Each top-level vdev has a metaslab
+group. By default, up to 3 copies of metadata can exist and are
+distributed across multiple top-level vdevs.
+metaslab_preload_enabled
allows the corresponding metaslabs to be
+preloaded, thus improving allocation efficiency.
metaslab_preload_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When testing metaslab allocation |
+
Data Type |
+boolean |
+
Range |
+0=do not preload metaslab info, +1=preload up to 3 metaslabs |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
Modern HDDs have uniform bit density and constant angular velocity.
+Therefore, the outer recording zones are faster (higher bandwidth) than
+the inner zones by the ratio of outer to inner track diameter. The
+difference in bandwidth can be 2:1, and is often available in the HDD
+detailed specifications or drive manual. For HDDs when
+metaslab_lba_weighting_enabled
is true, write allocation preference
+is given to the metaslabs representing the outer recording zones. Thus
+the allocation to metaslabs prefers faster bandwidth over free space.
If the devices are not rotational, yet misrepresent themselves to the OS
+as rotational, then disabling metaslab_lba_weighting_enabled
can
+result in more even, free-space-based allocation.
metaslab_lba_weighting_enabled |
+Notes |
+
---|---|
Tags |
+allocation, +metaslab, +HDD, SSD |
+
When to change |
+disable if using only SSDs and +version v0.6.4 or earlier |
+
Data Type |
+boolean |
+
Range |
+0=do not use LBA weighting, 1=use +LBA weighting |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Verification |
+The rotational setting reported by a
+block device can be observed in sysfs:
+/sys/block/DISK_NAME/queue/rotational |
+
Versions Affected |
+prior to v0.6.5, the check for +non-rotation media did not exist |
+
By default, the zpool import
command searches for pool information
+in the zpool.cache
file. If the pool to be imported has an entry in
+zpool.cache
then the devices do not have to be scanned to determine
+if they are pool members. The path to the cache file is spa_config_path.
For more information on zpool import
and the -o cachefile
and
+-d
options, see the man page for zpool(8).
See also zfs_autoimport_disable
+spa_config_path |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If creating a non-standard distribution and the +cachefile property is inconvenient |
+
Data Type |
+string |
+
Default |
+
|
+
Change |
+Dynamic, applies only to the next invocation of
+ |
+
Versions Affected |
+all |
+
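For example, a pool can be imported by scanning a device directory instead of consulting the cache file, or told to record itself in an alternate cache file; the pool name and paths are illustrative:
+# scan by-id device links rather than relying on the default zpool.cache
+zpool import -d /dev/disk/by-id tank
+# import while recording the pool in an alternate cache file
+zpool import -o cachefile=/etc/zfs/alternate.cache tank
+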
Multiplication factor used to estimate actual disk consumption from the +size of data being written. The default value is a worst case estimate, +but lower values may be valid for a given pool depending on its +configuration. Pool administrators who understand the factors involved +may wish to specify a more realistic inflation factor, particularly if +they operate close to quota or capacity limits.
+The worst case space requirement for allocation is single-sector
+max-parity RAIDZ blocks, in which case the space requirement is exactly
+4 times the size, accounting for a maximum of 3 parity blocks. This is
+multiplied by the maximum number of ZFS copies
parameter (copies max=3).
+Additional space is required if the block could impact deduplication
+tables (a further factor of 2). Altogether, the worst case is 4 × 3 × 2 = 24.
If the estimation is not correct, then quotas or out-of-space conditions +can lead to optimistic expectations of the ability to allocate. +Applications are typically not prepared to deal with such failures and +can misbehave.
+spa_asize_inflation |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If the allocation requirements for the +workload are well known and quotas are used |
+
Data Type |
+uint64 |
+
Units |
+unit |
+
Range |
+1 to 24 |
+
Default |
+24 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.3 and later |
+
An extreme rewind import (see zpool import -X
) normally performs a
+full traversal of all blocks in the pool for verification. If this
+parameter is set to 0, the traversal skips non-metadata blocks. It can
+be toggled once the import has started to stop or start the traversal of
+non-metadata blocks. See also
+spa_load_verify_metadata.
spa_load_verify_data |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+At the risk of data integrity, to speed +extreme import of large pool |
+
Data Type |
+boolean |
+
Range |
+0=do not verify data upon pool import, +1=verify pool data upon import |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
An extreme rewind import (see zpool import -X
) normally performs a
+full traversal of all blocks in the pool for verification. If this
+parameter is set to 0, the traversal is not performed. It can be toggled
+once the import has started to stop or start the traversal. See
+spa_load_verify_data
spa_load_verify_metadata |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+At the risk of data integrity, to speed +extreme import of large pool |
+
Data Type |
+boolean |
+
Range |
+0=do not verify metadata upon pool +import, 1=verify pool metadata upon +import |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
Maximum number of concurrent I/Os during the data verification performed
+during an extreme rewind import (see zpool import -X
)
spa_load_verify_maxinflight |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+During an extreme rewind import, to +match the concurrent I/O capabilities +of the pool devices |
+
Data Type |
+int |
+
Units |
+I/Os |
+
Range |
+1 to MAX_INT |
+
Default |
+10,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
Normally, the last 3.2% (1/(2^spa_slop_shift
)) of pool space is
+reserved to ensure the pool doesn’t run completely out of space, due to
+unaccounted changes (e.g. to the MOS). This also limits the worst-case
+time to allocate space. When less than this amount of free space exists,
+most ZPL operations (e.g. write, create) return the error ENOSPC (no space).
Changing spa_slop_shift affects the currently loaded ZFS module and all imported pools. spa_slop_shift is not stored on disk. Beware that importing full pools on systems with a larger spa_slop_shift can lead to over-full conditions.
+The minimum SPA slop space is limited to 128 MiB. +The maximum SPA slop space is limited to 128 GiB.
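A rough sketch of the reserved slop space for an imported pool, before the 128 MiB and 128 GiB clamps are applied ("tank" is a placeholder pool name):

SIZE=$(zpool list -Hp -o size tank)
SHIFT=$(cat /sys/module/zfs/parameters/spa_slop_shift)
echo $(( SIZE >> SHIFT ))   # bytes reserved, approximately 3.2% at the default shift of 5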
+spa_slop_shift |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+For large pools, when 3.2% may be too
+conservative and more usable space is desired,
+consider increasing |
+
Data Type |
+int |
+
Units |
+shift |
+
Range |
+1 to MAX_INT, however the practical upper limit +is 15 for a system with 4TB of RAM |
+
Default |
+5 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later (max. slop space since v2.1.0) |
+
If prefetching is enabled, do not prefetch blocks larger than
+zfetch_array_rd_sz
size.
zfetch_array_rd_sz |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To allow prefetching when using large block sizes |
+
Data Type |
+unsigned long |
+
Units |
+bytes |
+
Range |
+0 to MAX_ULONG |
+
Default |
+1,048,576 (1 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
Limits the maximum number of bytes to prefetch per stream.
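To make such a change persistent across module reloads, the usual approach is an options line in a modprobe configuration file. A sketch, assuming a 64 MiB per-stream prefetch distance suits the workload:

# /etc/modprobe.d/zfs.conf
options zfs zfetch_max_distance=67108864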
+zfetch_max_distance |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Consider increasing read workloads that use +large blocks and exhibit high prefetch hit +ratios |
+
Data Type |
+uint |
+
Units |
+bytes |
+
Range |
+0 to UINT_MAX |
+
Default |
+8,388,608 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
Maximum number of prefetch streams per file.
For version v0.7.0 and later, when prefetching small files, the number of prefetch streams is automatically reduced to prevent the streams from overlapping.
+zfetch_max_streams |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If the workload benefits from prefetching and
+has more than |
+
Data Type |
+uint |
+
Units |
+streams |
+
Range |
+1 to MAX_UINT |
+
Default |
+8 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
Prefetch streams that have been accessed in zfetch_min_sec_reap
+seconds are automatically stopped.
zfetch_min_sec_reap |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To test prefetch efficiency |
+
Data Type |
+uint |
+
Units |
+seconds |
+
Range |
+0 to MAX_UINT |
+
Default |
+2 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
Percentage of ARC metadata space that can be used for dnodes.
+The value calculated for zfs_arc_dnode_limit_percent
can be
+overridden by zfs_arc_dnode_limit.
zfs_arc_dnode_limit_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Consider increasing if |
+
Data Type |
+int |
+
Units |
+percent of arc_meta_limit |
+
Range |
+0 to 100 |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
When the number of bytes consumed by dnodes in the ARC exceeds
+zfs_arc_dnode_limit
bytes, demand for new metadata can take from the
+space consumed by dnodes.
The default value of 0 indicates that a percentage of the ARC metadata buffers, determined by zfs_arc_dnode_limit_percent, may be used for dnodes.
+zfs_arc_dnode_limit
is similar to
+zfs_arc_meta_prune which serves a similar
+purpose for metadata.
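Current dnode consumption is visible in the arcstats kstat, which can guide whether an explicit limit is worthwhile. A minimal sketch, assuming a 4 GiB limit is desired:

grep dnode_size /proc/spl/kstat/zfs/arcstats
echo $(( 4 * 1024 * 1024 * 1024 )) > /sys/module/zfs/parameters/zfs_arc_dnode_limit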
zfs_arc_dnode_limit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Consider increasing if |
+
Data Type |
+uint64 |
+
Units |
+bytes |
+
Range |
+0 to MAX_UINT64 |
+
Default |
0 (uses zfs_arc_dnode_limit_percent) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
Percentage of ARC dnodes to try to evict in response to demand for +non-metadata when the number of bytes consumed by dnodes exceeds +zfs_arc_dnode_limit.
+zfs_arc_dnode_reduce_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing dnode cache efficiency |
+
Data Type |
+uint64 |
+
Units |
percent of size of dnode space used above zfs_arc_dnode_limit |
+
Range |
+0 to 100 |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
The ARC’s buffer hash table is sized based on the assumption of an
+average block size of zfs_arc_average_blocksize
. The default of 8
+KiB uses approximately 1 MiB of hash table per 1 GiB of physical memory
+with 8-byte pointers.
zfs_arc_average_blocksize |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+For workloads where the known average
+blocksize is larger, increasing
+ |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+512 to 16,777,216 |
+
Default |
+8,192 |
+
Change |
+Prior to zfs module load |
+
Versions Affected |
+all |
+
Number of ARC headers to evict per sublist before proceeding to another sublist. This batch-style operation prevents entire sublists from being evicted at once but comes at a cost of additional unlocking and locking.
+zfs_arc_evict_batch_limit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing ARC multilist features |
+
Data Type |
+int |
+
Units |
+count of ARC headers |
+
Range |
+1 to INT_MAX |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
When the ARC is shrunk due to memory demand, do not retry growing the
+ARC for zfs_arc_grow_retry
seconds. This operates as a damper to
+prevent oscillating grow/shrink cycles when there is memory pressure.
If zfs_arc_grow_retry
= 0, the internal default of 5 seconds is
+used.
zfs_arc_grow_retry |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+seconds |
+
Range |
+1 to MAX_INT |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
Throttle ARC memory consumption, effectively throttling I/O, when free
+system memory drops below this percentage of total system memory.
+Setting zfs_arc_lotsfree_percent
to 0 disables the throttle.
The arcstat_memory_throttle_count counter in
+/proc/spl/kstat/arcstats
can indicate throttle activity.
zfs_arc_lotsfree_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
+0 to 100 |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
Maximum size of ARC in bytes.
+If set to 0 then the maximum size of ARC +is determined by the amount of system memory installed:
+Linux: 1/2 of system memory
FreeBSD: the larger of all_system_memory - 1GB
and 5/8 × all_system_memory
zfs_arc_max
can be changed dynamically with some caveats. It cannot
+be set back to 0 while running and reducing it below the current ARC
+size will not cause the ARC to shrink without memory pressure to induce
+shrinking.
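For example, to cap the ARC at 8 GiB and confirm the new target against c_max in arcstats (a sketch; choose a size appropriate for the system):

echo $(( 8 * 1024 * 1024 * 1024 )) > /sys/module/zfs/parameters/zfs_arc_max
awk '/^c_max/ {print $3}' /proc/spl/kstat/zfs/arcstats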
zfs_arc_max |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Reduce if ARC competes too much with other +applications, increase if ZFS is the primary +application and can use more RAM |
+
Data Type |
+uint64 |
+
Units |
+bytes |
+
Range |
+67,108,864 to RAM size in bytes |
+
Default |
+0 (see description above, OS-dependent) |
+
Change |
+Dynamic (see description above) |
+
Verification |
+
|
+
Versions Affected |
+all |
+
The number of restart passes to make while scanning the ARC, attempting to free buffers in order to stay below zfs_arc_meta_limit.
+zfs_arc_meta_adjust_restarts |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing ARC metadata adjustment feature |
+
Data Type |
+int |
+
Units |
+restarts |
+
Range |
+0 to INT_MAX |
+
Default |
+4,096 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
Sets the maximum allowed size of metadata buffers in the ARC. When zfs_arc_meta_limit is reached, metadata buffers are reclaimed, even if the overall c_max has not been reached.
In version v0.7.0, with a default value of 0, zfs_arc_meta_limit_percent is used to set arc_meta_limit
zfs_arc_meta_limit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+For workloads where the metadata to data ratio +in the ARC can be changed to improve ARC hit +rates |
+
Data Type |
+uint64 |
+
Units |
+bytes |
+
Range |
+0 to |
+
Default |
+0 |
+
Change |
+Dynamic, except that it cannot be set back to +0 for a specific percent of the ARC; it must +be set to an explicit value |
+
Verification |
+
|
+
Versions Affected |
+all |
+
Sets the limit to ARC metadata, arc_meta_limit
, as a percentage of
+the maximum size target of the ARC, c_max
Prior to version v0.7.0, the
+zfs_arc_meta_limit was used to set the limit
+as a fixed size. zfs_arc_meta_limit_percent
provides a more
+convenient interface for setting the limit.
zfs_arc_meta_limit_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+For workloads where the metadata to +data ratio in the ARC can be changed +to improve ARC hit rates |
+
Data Type |
+uint64 |
+
Units |
+percent of |
+
Range |
+0 to 100 |
+
Default |
+75 |
+
Change |
+Dynamic |
+
Verification |
+
|
+
Versions Affected |
+v0.7.0 and later |
+
The minimum allowed size in bytes that metadata buffers may consume in the ARC. This value defaults to 0, which disables a floor on the amount of the ARC devoted to metadata.
+When evicting data from the ARC, if the metadata_size
is less than
+arc_meta_min
then data is evicted instead of metadata.
zfs_arc_meta_min |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint64 |
+
Units |
+bytes |
+
Range |
+16,777,216 to |
+
Default |
+0 (use internal default 16 MiB) |
+
Change |
+Dynamic |
+
Verification |
+
|
+
Versions Affected |
+all |
+
zfs_arc_meta_prune
sets the number of dentries and znodes to be
+scanned looking for entries which can be dropped. This provides a
+mechanism to ensure the ARC can honor the arc_meta_limit and
reclaim
+otherwise pinned ARC buffers. Pruning may be required when the ARC size
+drops to arc_meta_limit
because dentries and znodes can pin buffers
in the ARC. Increasing this value will cause the dentry and znode caches
+to be pruned more aggressively and the arc_prune thread becomes more
+active. Setting zfs_arc_meta_prune
to 0 will disable pruning.
zfs_arc_meta_prune |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+uint64 |
+
Units |
+entries |
+
Range |
+0 to INT_MAX |
+
Default |
+10,000 |
+
Change |
+Dynamic |
+
! Verification |
+Prune activity is counted by the
+ |
+
Versions Affected |
+v0.6.5 and later |
+
Defines the strategy for ARC metadata eviction (meta reclaim strategy). +A value of 0 (META_ONLY) will evict only the ARC metadata. A value of 1 +(BALANCED) indicates that additional data may be evicted if required in +order to evict the requested amount of metadata.
+zfs_arc_meta_strategy |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing ARC metadata eviction |
+
Data Type |
+int |
+
Units |
+enum |
+
Range |
+0=evict metadata only, 1=also evict data +buffers if they can free metadata buffers +for eviction |
+
Default |
+1 (BALANCED) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
Minimum ARC size limit. When the ARC is asked to shrink, it will stop
+shrinking at c_min
as tuned by zfs_arc_min
.
zfs_arc_min |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If the primary focus of the system is ZFS, then +increasing can ensure the ARC gets a minimum +amount of RAM |
+
Data Type |
+uint64 |
+
Units |
+bytes |
+
Range |
+33,554,432 to |
+
Default |
+For kernel: greater of 33,554,432 (32 MiB) and
+memory size / 32. For user-land: greater of
+33,554,432 (32 MiB) and |
+
Change |
+Dynamic |
+
Verification |
+
|
+
Versions Affected |
+all |
+
Minimum time prefetched blocks are locked in the ARC.
+A value of 0 represents the default of 1 second. However, once changed, +dynamically setting to 0 will not return to the default.
+zfs_arc_min_prefetch_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+milliseconds |
+
Range |
+1 to INT_MAX |
+
Default |
+0 (use internal default of 1000 ms) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 and later |
+
Minimum time “prescient prefetched” blocks are locked in the ARC. These +blocks are meant to be prefetched fairly aggressively ahead of the code +that may use them.
+A value of 0 represents the default of 6 seconds. However, once changed, +dynamically setting to 0 will not return to the default.
zfs_arc_min_prescient_prefetch_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+milliseconds |
+
Range |
+1 to INT_MAX |
+
Default |
+0 (use internal default of 6000 +ms) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 and later |
+
To allow more fine-grained locking, each ARC state contains a series of +lists (sublists) for both data and metadata objects. Locking is +performed at the sublist level. This parameters controls the number of +sublists per ARC state, and also applies to other uses of the multilist +data structure.
+zfs_multilist_num_sublists |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+lists |
+
Range |
+1 to INT_MAX |
+
Default |
+0 (internal value is greater of number +of online CPUs or 4) |
+
Change |
+Prior to zfs module load |
+
Versions Affected |
+v0.7.0 and later |
+
The ARC size is considered to be overflowing if it exceeds the current
+ARC target size (/proc/spl/kstat/zfs/arcstats
entry c
) by a
+threshold determined by zfs_arc_overflow_shift
. The threshold is
calculated as a fraction of the ARC target size c using the formula: threshold = c >> zfs_arc_overflow_shift
The default value of 8 causes the ARC to be considered to be overflowing +if it exceeds the target size by 1/256th (0.3%) of the target size.
+When the ARC is overflowing, new buffer allocations are stalled until +the reclaim thread catches up and the overflow condition no longer +exists.
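As a worked example, with the default shift of 8 and an ARC target size of 8 GiB, the overflow threshold works out to 32 MiB above the target:

echo $(( (8 * 1024 * 1024 * 1024) >> 8 ))   # 33554432 bytes = 32 MiB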
+zfs_arc_overflow_shift |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+shift |
+
Range |
+1 to INT_MAX |
+
Default |
+8 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
arc_p_min_shift is used as a shift of the ARC target size
+(/proc/spl/kstat/zfs/arcstats
entry c
) for calculating both
+minimum and maximum most recently used (MRU) target size
+(/proc/spl/kstat/zfs/arcstats
entry p
)
A value of 0 represents the default setting of arc_p_min_shift
= 4.
+However, once changed, dynamically setting zfs_arc_p_min_shift
to 0
+will not return to the default.
zfs_arc_p_min_shift |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+shift |
+
Range |
+1 to INT_MAX |
+
Default |
+0 (internal default = 4) |
+
Change |
+Dynamic |
+
Verification |
+Observe changes to
+ |
+
Versions Affected |
+all |
+
When data is being added to the ghost lists, the MRU target size is +adjusted. The amount of adjustment is based on the ratio of the MRU/MFU +sizes. When enabled, the ratio is capped to 10, avoiding large +adjustments.
+zfs_arc_p_dampener_disable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing ARC ghost list behaviour |
+
Data Type |
+boolean |
+
Range |
+0=avoid large adjustments, 1=permit +large adjustments |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
arc_shrink_shift
is used to adjust the ARC target sizes when large
+reduction is required. The current ARC target size, c
, and MRU size
+p
can be reduced by the current size >> arc_shrink_shift
. For
+the default value of 7, this reduces the target by approximately 0.8%.
A value of 0 represents the default setting of arc_shrink_shift = 7. +However, once changed, dynamically setting arc_shrink_shift to 0 will +not return to the default.
+zfs_arc_shrink_shift |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+During memory shortfall, reducing
+ |
+
Data Type |
+int |
+
Units |
+shift |
+
Range |
+1 to INT_MAX |
+
Default |
+0 ( |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
zfs_arc_pc_percent
allows ZFS arc to play more nicely with the
+kernel’s LRU pagecache. It can guarantee that the arc size won’t
+collapse under scanning pressure on the pagecache, yet still allows arc
+to be reclaimed down to zfs_arc_min if necessary. This value is
+specified as percent of pagecache size (as measured by
+NR_FILE_PAGES
) where that percent may exceed 100. This only operates
+during memory pressure/reclaim.
zfs_arc_pc_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When using file systems under memory
+shortfall, if the page scanner causes the ARC
+to shrink too fast, then adjusting
+ |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
+0 to 100 |
+
Default |
+0 (disabled) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
zfs_arc_sys_free
is the target number of bytes the ARC should leave
+as free memory on the system. Defaults to the larger of 1/64 of physical
+memory or 512K. Setting this option to a non-zero value will override
+the default.
A value of 0 represents the default setting of larger of 1/64 of +physical memory or 512 KiB. However, once changed, dynamically setting +zfs_arc_sys_free to 0 will not return to the default.
+zfs_arc_sys_free |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Change if more free memory is desired as a +margin against memory demand by applications |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+0 to ULONG_MAX |
+
Default |
+0 (default to larger of 1/64 of physical memory +or 512 KiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
Disable reading zpool.cache file (see +spa_config_path) when loading the zfs module.
+zfs_autoimport_disable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Leave as default so that zfs behaves as +other Linux kernel modules |
+
Data Type |
+boolean |
+
Range |
+0=read |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_commit_timeout_pct
controls the amount of time that a log (ZIL)
+write block (lwb) remains “open” when it isn’t “full” and it has a
+thread waiting to commit to stable storage. The timeout is scaled based
+on a percentage of the last lwb latency to avoid significantly impacting
+the latency of each individual intent log transaction (itx).
zfs_commit_timeout_pct |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
+1 to 100 |
+
Default |
+5 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 |
+
/proc/spl/kstat/zfs/dbgmsg
file./proc/spl/kstat/zfs/dbgmsg
file clears the log.See also zfs_dbgmsg_maxsize
+zfs_dbgmsg_enable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To view ZFS internal debug log |
+
Data Type |
+boolean |
+
Range |
+0=do not log debug messages, 1=log debug messages |
+
Default |
+0 (1 for debug builds) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
The /proc/spl/kstat/zfs/dbgmsg
file size limit is set by
+zfs_dbgmsg_maxsize.
See also zfs_dbgmsg_enable
+zfs_dbgmsg_maxsize |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 to INT_MAX |
+
Default |
+4 MiB |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
The zfs_dbuf_state_index
feature is currently unused. It is normally
+used for controlling values in the /proc/spl/kstat/zfs/dbufs
file.
zfs_dbuf_state_index |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Do not change |
+
Data Type |
+int |
+
Units |
+TBD |
+
Range |
+TBD |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
When a pool sync operation takes longer than zfs_deadman_synctime_ms
+milliseconds, a “slow spa_sync” message is logged to the debug log (see
+zfs_dbgmsg_enable). If zfs_deadman_enabled
+is set to 1, then all pending IO operations are also checked and if any
+haven’t completed within zfs_deadman_synctime_ms milliseconds, a “SLOW
+IO” message is logged to the debug log and a “deadman” system event (see
+zpool events command) with the details of the hung IO is posted.
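Deadman activity can be followed from the event stream and the internal debug log, for example:

zpool events -f | grep -i deadman
cat /proc/spl/kstat/zfs/dbgmsg | tail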
zfs_deadman_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To disable logging of slow I/O |
+
Data Type |
+boolean |
+
Range |
+0=do not log slow I/O, 1=log slow I/O |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 |
+
Once a pool sync operation has taken longer than +zfs_deadman_synctime_ms milliseconds, +continue to check for slow operations every +zfs_deadman_checktime_ms milliseconds.
+zfs_deadman_checktime_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When debugging slow I/O |
+
Data Type |
+ulong |
+
Units |
+milliseconds |
+
Range |
+1 to ULONG_MAX |
+
Default |
+60,000 (1 minute) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 |
+
When an individual I/O takes longer than zfs_deadman_ziotime_ms
+milliseconds, then the operation is considered to be “hung”. If
+zfs_deadman_enabled is set then the deadman
+behaviour is invoked as described by the
+zfs_deadman_failmode option.
zfs_deadman_ziotime_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing ABD features |
+
Data Type |
+ulong |
+
Units |
+milliseconds |
+
Range |
+1 to ULONG_MAX |
+
Default |
+300,000 (5 minutes) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 |
+
The I/O deadman timer expiration time has two meanings
+determines when the spa_deadman()
logic should fire, indicating
+the txg sync has not completed in a timely manner
determines if an I/O is considered “hung”
In version v0.8.0, any I/O that has not completed in
+zfs_deadman_synctime_ms
is considered “hung” resulting in one of
+three behaviors controlled by the
+zfs_deadman_failmode parameter.
zfs_deadman_synctime_ms
takes effect if
+zfs_deadman_enabled = 1.
zfs_deadman_synctime_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When debugging slow I/O |
+
Data Type |
+ulong |
+
Units |
+milliseconds |
+
Range |
+1 to ULONG_MAX |
+
Default |
+600,000 (10 minutes) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
zfs_deadman_failmode controls the behavior of the I/O deadman timer when +it detects a “hung” I/O. Valid values are:
+wait - Wait for the “hung” I/O (default)
continue - Attempt to recover from a “hung” I/O
panic - Panic the system
zfs_deadman_failmode |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+In some cluster cases, panic can be appropriate |
+
Data Type |
+string |
+
Range |
+wait, continue, or panic |
+
Default |
+wait |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 |
+
ZFS can prefetch deduplication table (DDT) entries.
+zfs_dedup_prefetch
allows DDT prefetches to be enabled.
zfs_dedup_prefetch |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+For systems with limited RAM using the dedup +feature, disabling deduplication table +prefetch can reduce memory pressure |
+
Data Type |
+boolean |
+
Range |
+0=do not prefetch, 1=prefetch dedup table +entries |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
zfs_delete_blocks
defines a large file for the purposes of delete.
+Files containing more than zfs_delete_blocks
blocks will be deleted
+asynchronously while smaller files are deleted synchronously. Decreasing
+this value reduces the time spent in an unlink(2)
system call at the
+expense of a longer delay before the freed space is available.
The zfs_delete_blocks
value is specified in blocks, not bytes. The
+size of blocks can vary and is ultimately limited by the filesystem’s
+recordsize property.
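As a worked example, a 4 GiB file on a dataset with the default 128 KiB recordsize spans well over the default threshold and would therefore be freed asynchronously:

echo $(( (4 * 1024 * 1024 * 1024) / (128 * 1024) ))   # 32768 blocks, above the 20,480 default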
zfs_delete_blocks |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If applications delete large files and blocking
+on |
+
Data Type |
+ulong |
+
Units |
+blocks |
+
Range |
+1 to ULONG_MAX |
+
Default |
+20,480 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
The ZFS write throttle begins to delay each transaction when the amount
+of dirty data reaches the threshold zfs_delay_min_dirty_percent
of
+zfs_dirty_data_max. This value should be >=
+zfs_vdev_async_write_active_max_dirty_percent.
zfs_delay_min_dirty_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+See section “ZFS TRANSACTION DELAY” |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
+0 to 100 |
+
Default |
+60 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_delay_scale
controls how quickly the ZFS write throttle
+transaction delay approaches infinity. Larger values cause longer delays
+for a given amount of dirty data.
For the smoothest delay, this value should be about 1 billion divided by
+the maximum number of write operations per second the pool can sustain.
+The throttle will smoothly handle between 10x and 1/10th
+zfs_delay_scale
.
Note: zfs_delay_scale
*
+zfs_dirty_data_max must be < 2^64.
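For example, a pool that can sustain roughly 20,000 write operations per second suggests a scale of about 50,000, following the rule of thumb above (a sketch, not a recommendation):

echo $(( 1000000000 / 20000 ))   # 50000
echo 50000 > /sys/module/zfs/parameters/zfs_delay_scale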
zfs_delay_scale |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+See section “ZFS TRANSACTION DELAY” |
+
Data Type |
+ulong |
+
Units |
+scalar (nanoseconds) |
+
Range |
+0 to ULONG_MAX |
+
Default |
+500,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_dirty_data_max
is the ZFS write throttle dirty space limit. Once
+this limit is exceeded, new writes are delayed until space is freed by
+writes being committed to the pool.
zfs_dirty_data_max takes precedence over +zfs_dirty_data_max_percent.
+zfs_dirty_data_max |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+See section “ZFS TRANSACTION DELAY” |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
1 to zfs_dirty_data_max_max |
+
Default |
+10% of physical RAM |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_dirty_data_max_percent
is an alternative method of specifying
+zfs_dirty_data_max, the ZFS write throttle
+dirty space limit. Once this limit is exceeded, new writes are delayed
+until space is freed by writes being committed to the pool.
zfs_dirty_data_max takes precedence over
+zfs_dirty_data_max_percent
.
zfs_dirty_data_max_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+See section “ZFS TRANSACTION DELAY” |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
+1 to 100 |
+
Default |
+10% of physical RAM |
+
Change |
+Prior to zfs module load or a memory +hot plug event |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_dirty_data_max_max
is the maximum allowable value of
+zfs_dirty_data_max.
zfs_dirty_data_max_max
takes precedence over
+zfs_dirty_data_max_max_percent.
zfs_dirty_data_max_max |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+See section “ZFS TRANSACTION DELAY” |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+1 to physical RAM size |
+
Default |
physical_ram/4; since v0.7: min(physical_ram/4, 4 GiB); since v2.0 for 32-bit systems: min(physical_ram/4, 1 GiB) |
+
Change |
+Prior to zfs module load |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_dirty_data_max_max_percent
an alternative to
+zfs_dirty_data_max_max for setting the
+maximum allowable value of zfs_dirty_data_max
zfs_dirty_data_max_max takes precedence
+over zfs_dirty_data_max_max_percent
zfs_dirty_data_max_max_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+See section “ZFS TRANSACTION DELAY” |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
+1 to 100 |
+
Default |
+25% of physical RAM |
+
Change |
+Prior to zfs module load |
+
Versions Affected |
+v0.6.4 and later |
+
When there is at least zfs_dirty_data_sync
dirty data, a transaction
+group sync is started. This allows a transaction group sync to occur
+more frequently than the transaction group timeout interval (see
+zfs_txg_timeout) when there is dirty data to be
+written.
zfs_dirty_data_sync |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+1 to ULONG_MAX |
+
Default |
+67,108,864 (64 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 through v0.8.x, deprecation planned +for v2 |
+
When there is at least zfs_dirty_data_sync_percent
of
+zfs_dirty_data_max dirty data, a transaction
+group sync is started. This allows a transaction group sync to occur
+more frequently than the transaction group timeout interval (see
+zfs_txg_timeout) when there is dirty data to be
+written.
zfs_dirty_data_sync_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
++ |
Default |
+20 |
+
Change |
+Dynamic |
+
Versions Affected |
planned for v2, deprecates zfs_dirty_data_sync |
+
Fletcher-4 is the default checksum algorithm for metadata and data. When
+the zfs kernel module is loaded, a set of microbenchmarks are run to
+determine the fastest algorithm for the current hardware. The
+zfs_fletcher_4_impl
parameter allows a specific implementation to be
+specified other than the default (fastest). Selectors other than
+fastest and scalar require instruction set extensions to be
+available and will only appear if ZFS detects their presence. The
+scalar implementation works on all processors.
The results of the microbenchmark are visible in the
+/proc/spl/kstat/zfs/fletcher_4_bench
file. Larger numbers indicate
+better performance. Since ZFS is processor endian-independent, the
+microbenchmark is run against both big and little-endian transformation.
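For example, to review the benchmark results and pin a specific implementation (only selectors reported for the local hardware are accepted; avx2 here assumes the CPU supports it):

cat /proc/spl/kstat/zfs/fletcher_4_bench
echo avx2 > /sys/module/zfs/parameters/zfs_fletcher_4_impl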
zfs_fletcher_4_impl |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing Fletcher-4 algorithms |
+
Data Type |
+string |
+
Range |
+fastest, scalar, superscalar, +superscalar4, sse2, ssse3, avx2, +avx512f, or aarch64_neon depending on +hardware support |
+
Default |
+fastest |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
The processing of the free_bpobj object can be enabled by
+zfs_free_bpobj_enabled
zfs_free_bpobj_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If there’s a problem with processing +free_bpobj (e.g. i/o error or bug) |
+
Data Type |
+boolean |
+
Range |
+0=do not process free_bpobj objects, +1=process free_bpobj objects |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
zfs_free_max_blocks
sets the maximum number of blocks to be freed in
+a single transaction group (txg). For workloads that delete (free) large
+numbers of blocks in a short period of time, the processing of the frees
+can negatively impact other operations, including txg commits.
+zfs_free_max_blocks
acts as a limit to reduce the impact.
zfs_free_max_blocks |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+For workloads that delete large files,
+ |
+
Data Type |
+ulong |
+
Units |
+blocks |
+
Range |
+1 to ULONG_MAX |
+
Default |
+100,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
Maximum asynchronous read I/Os active to each device.
+zfs_vdev_async_read_max_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
+1 to +zfs_vdev_ma +x_active |
+
Default |
+3 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
Minimum asynchronous read I/Os active to each device.
+zfs_vdev_async_read_min_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
1 to (zfs_vdev_async_read_max_active - 1) |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
When the amount of dirty data exceeds the threshold
+zfs_vdev_async_write_active_max_dirty_percent
of
+zfs_dirty_data_max dirty data, then
+zfs_vdev_async_write_max_active
+is used to limit active async writes. If the dirty data is between
+zfs_vdev_async_write_active_min_dirty_percent
+and zfs_vdev_async_write_active_max_dirty_percent
, the active I/O
+limit is linearly interpolated between
+zfs_vdev_async_write_min_active
+and
+zfs_vdev_async_write_max_active
zfs_vdev_async_write_active_max_dirty_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+int |
+
Units |
percent of zfs_dirty_data_max |
+
Range |
+0 to 100 |
+
Default |
+60 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
If the amount of dirty data is between
+zfs_vdev_async_write_active_min_dirty_percent
and
+zfs_vdev_async_write_active_max_dirty_percent
+of zfs_dirty_data_max, the active I/O limit is
+linearly interpolated between
+zfs_vdev_async_write_min_active
+and
+zfs_vdev_async_write_max_active
zfs_vdev_async_write_active_min_dirty_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+int |
+
Units |
+percent of zfs_dirty_data_max |
+
Range |
0 to (zfs_vdev_async_write_active_max_dirty_percent - 1) |
+
Default |
+30 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_vdev_async_write_max_active
sets the maximum asynchronous write
+I/Os active to each device.
zfs_vdev_async_write_max_active |
+Notes |
+
---|---|
Tags |
vdev, ZIO_scheduler |
+
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
+1 to +zfs_vdev_max +_active |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_vdev_async_write_min_active
sets the minimum asynchronous write
+I/Os active to each device.
Lower values are associated with better latency on rotational media but +poorer resilver performance. The default value of 2 was chosen as a +compromise. A value of 3 has been shown to improve resilver performance +further at a cost of further increasing latency.
+zfs_vdev_async_write_min_active |
+Notes |
+
---|---|
Tags |
vdev, ZIO_scheduler |
+
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
++ |
Default |
+1 for v0.6.x, 2 for v0.7.0 and +later |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
The maximum number of I/Os active to each device. Ideally,
+zfs_vdev_max_active
>= the sum of each queue’s max_active.
Once queued to the device, the ZFS I/O scheduler is no longer able to +prioritize I/O operations. The underlying device drivers have their own +scheduler and queue depth limits. Values larger than the device’s +maximum queue depth can have the affect of increased latency as the I/Os +are queued in the intervening device driver layers.
+zfs_vdev_max_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
+sum of each queue’s min_active to UINT32_MAX |
+
Default |
+1,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_vdev_scrub_max_active
sets the maximum scrub or scan read I/Os
+active to each device.
zfs_vdev_scrub_max_active |
+Notes |
+
---|---|
Tags |
+vdev, +ZIO_scheduler, +scrub, +resilver |
+
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
+1 to +zfs_vd +ev_max_active |
+
Default |
+2 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_vdev_scrub_min_active
sets the minimum scrub or scan read I/Os
+active to each device.
zfs_vdev_scrub_min_active |
+Notes |
+
---|---|
Tags |
+vdev, +ZIO_scheduler, +scrub, +resilver |
+
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
++ |
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
Maximum synchronous read I/Os active to each device.
+zfs_vdev_sync_read_max_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
+1 to +zfs_vdev_m +ax_active |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_vdev_sync_read_min_active
sets the minimum synchronous read I/Os
+active to each device.
zfs_vdev_sync_read_min_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
++ |
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_vdev_sync_write_max_active
sets the maximum synchronous write
+I/Os active to each device.
zfs_vdev_sync_write_max_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
+1 to +zfs_vdev_ma +x_active |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
zfs_vdev_sync_write_min_active
sets the minimum synchronous write
+I/Os active to each device.
zfs_vdev_sync_write_min_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
++ |
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
Maximum number of queued allocations per top-level vdev expressed as a
+percentage of
+zfs_vdev_async_write_max_active.
+This allows the system to detect devices that are more capable of
+handling allocations and to allocate more blocks to those devices. It
+also allows for dynamic allocation distribution when devices are
+imbalanced as fuller devices will tend to be slower than empty devices.
+Once the queue depth reaches (zfs_vdev_queue_depth_pct
*
+zfs_vdev_async_write_max_active /
+100) then allocator will stop allocating blocks on that top-level device
+and switch to the next.
See also zio_dva_throttle_enabled
+zfs_vdev_queue_depth_pct |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
+1 to UINT32_MAX |
+
Default |
+1,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
Disable duplicate buffer eviction from ARC.
+zfs_disable_dup_eviction |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+boolean |
+
Range |
+0=duplicate buffers can be evicted, 1=do +not evict duplicate buffers |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5, deprecated in v0.7.0 |
+
Snapshots of filesystems are normally automounted under the filesystem’s
+.zfs/snapshot
subdirectory. When not in use, snapshots are unmounted
+after zfs_expire_snapshot seconds.
zfs_expire_snapshot |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+seconds |
+
Range |
+0 disables automatic unmounting, maximum time +is INT_MAX |
+
Default |
+300 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.1 and later |
+
Allow the creation, removal, or renaming of entries in the
+.zfs/snapshot
subdirectory to cause the creation, destruction, or
+renaming of snapshots. When enabled this functionality works both
+locally and over NFS exports which have the “no_root_squash” option set.
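When enabled, ordinary directory operations inside .zfs/snapshot manage snapshots. A sketch using a placeholder dataset mounted at /tank/fs:

mkdir /tank/fs/.zfs/snapshot/before-upgrade    # creates the snapshot tank/fs@before-upgrade
rmdir /tank/fs/.zfs/snapshot/before-upgrade    # destroys it again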
zfs_admin_snapshot |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+boolean |
+
Range |
+0=do not allow snapshot manipulation via the +filesystem, 1=allow snapshot manipulation via +the filesystem |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
Set additional debugging flags (see +zfs_dbgmsg_enable)
+flag value |
+symbolic name |
+description |
+
---|---|---|
0x1 |
+ZFS_DEBUG_DPRINTF |
+Enable dprintf entries in +the debug log |
+
0x2 |
+ZFS_DEBUG_DBUF_VERIFY |
+Enable extra dnode +verifications |
+
0x4 |
+ZFS_DEBUG_DNODE_VERIFY |
+Enable extra dnode +verifications |
+
0x8 |
+ZFS_DEBUG_SNAPNAMES |
+Enable snapshot name +verification |
+
0x10 |
+ZFS_DEBUG_MODIFY |
+Check for illegally +modified ARC buffers |
+
0x20 |
+ZFS_DEBUG_SPA |
+Enable spa_dbgmsg entries +in the debug log |
+
0x40 |
+ZFS_DEBUG_ZIO_FREE |
+Enable verification of +block frees |
+
0x80 |
ZFS_DEBUG_HISTOGRAM_VERIFY |
+Enable extra spacemap +histogram verifications |
+
0x100 |
+ZFS_DEBUG_METASLAB_VERIFY |
+Verify space accounting +on disk matches in-core +range_trees |
+
0x200 |
+ZFS_DEBUG_SET_ERROR |
+Enable SET_ERROR and +dprintf entries in the +debug log |
+
zfs_flags |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When debugging ZFS |
+
Data Type |
+int |
+
Default |
+0 no debug flags set, for debug builds: all +except ZFS_DEBUG_DPRINTF and ZFS_DEBUG_SPA |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
If destroy encounters an I/O error (EIO) while reading metadata (eg
+indirect blocks), space referenced by the missing metadata cannot be
+freed. Normally, this causes the background destroy to become “stalled”,
+as the destroy is unable to make forward progress. While in this stalled
+state, all remaining space to free from the error-encountering
+filesystem is temporarily leaked. Set zfs_free_leak_on_eio = 1
to
+ignore the EIO, permanently leak the space from indirect blocks that can
+not be read, and continue to free everything else that it can.
The default, stalling behavior is useful if the storage partially fails +(eg some but not all I/Os fail), and then later recovers. In this case, +we will be able to continue pool operations while it is partially +failed, and when it recovers, we can continue to free the space, with no +leaks. However, note that this case is rare.
+Typically pools either:
fail completely, but perhaps temporarily (eg a top-level vdev going offline)
have localized, permanent errors (eg disk returns the wrong data due +to bit flip or firmware bug)
In case (1), the zfs_free_leak_on_eio
setting does not matter
+because the pool will be suspended and the sync thread will not be able
+to make forward progress. In case (2), because the error is permanent,
the best we can do is leak the minimum amount of space. Therefore, it is reasonable for zfs_free_leak_on_eio to be set, but by default the more
+conservative approach is taken, so that there is no possibility of
+leaking space in the “partial temporary” failure case.
zfs_free_leak_on_eio |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When debugging I/O errors during destroy |
+
Data Type |
+boolean |
+
Range |
+0=normal behavior, 1=ignore error and +permanently leak space |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
During a zfs destroy
operation using feature@async_destroy
a
+minimum of zfs_free_min_time_ms
time will be spent working on
+freeing blocks per txg commit.
zfs_free_min_time_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+milliseconds |
+
Range |
+1 to (zfs_txg_timeout * 1000) |
+
Default |
+1,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.0 and later |
+
If a pool does not have a log device, data blocks equal to or larger
+than zfs_immediate_write_sz
are treated as if the dataset being
+written to had the property setting logbias=throughput
Terminology note: logbias=throughput
writes the blocks in “indirect
+mode” to the ZIL where the data is written to the pool and a pointer to
+the data is written to the ZIL.
zfs_immediate_write_sz |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+long |
+
Units |
+bytes |
+
Range |
+512 to 16,777,216 (valid block sizes) |
+
Default |
+32,768 (32 KiB) |
+
Change |
+Dynamic |
+
Verification |
+Data blocks that exceed
+ |
+
Versions Affected |
+all |
+
ZFS supports logical record (block) sizes from 512 bytes to 16 MiB. The
+benefits of larger blocks, and thus larger average I/O sizes, can be
+weighed against the cost of copy-on-write of large block to modify one
+byte. Additionally, very large blocks can have a negative impact on both
+I/O latency at the device level and the memory allocator. The
+zfs_max_recordsize
parameter limits the upper bound of the dataset
+volblocksize and recordsize properties.
Larger blocks can be created by enabling zpool
large_blocks
+feature and changing this zfs_max_recordsize
. Pools with larger
+blocks can always be imported and used, regardless of the value of
+zfs_max_recordsize
.
For 32-bit systems, zfs_max_recordsize
also limits the size of
+kernel virtual memory caches used in the ZFS I/O pipeline (zio_buf_*
+and zio_data_buf_*
).
See also the zpool
large_blocks
feature.
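A sketch of raising the limit and creating a dataset with a larger recordsize, assuming the pool's large_blocks feature is enabled and "tank" is a placeholder pool name:

echo $(( 16 * 1024 * 1024 )) > /sys/module/zfs/parameters/zfs_max_recordsize
zfs create -o recordsize=4M tank/bigblocks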
zfs_max_recordsize |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To create datasets with larger volblocksize or +recordsize |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+512 to 16,777,216 (valid block sizes) |
+
Default |
+1,048,576 |
+
Change |
+Dynamic, set prior to creating volumes or +changing filesystem recordsize |
+
Versions Affected |
+v0.6.5 and later |
+
zfs_mdcomp_disable
allows metadata compression to be disabled.
zfs_mdcomp_disable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When CPU cycles cost less than I/O |
+
Data Type |
+boolean |
+
Range |
+0=compress metadata, 1=do not compress metadata |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+from v0.6.0 to v0.8.0 |
+
Allow metaslabs to keep their active state as long as their
+fragmentation percentage is less than or equal to this value. When
+writing, an active metaslab whose fragmentation percentage exceeds
+zfs_metaslab_fragmentation_threshold
is avoided allowing metaslabs
+with less fragmentation to be preferred.
Metaslab fragmentation is used to calculate the overall pool
+fragmentation
property value. However, individual metaslab
+fragmentation levels are observable using the zdb
with the -mm
+option.
zfs_metaslab_fragmentation_threshold
works at the metaslab level and
+each top-level vdev has approximately
+metaslabs_per_vdev metaslabs. See also
+zfs_mg_fragmentation_threshold
zfs_metaslab_fragmentation_threshold |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing metaslab allocation |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
+1 to 100 |
+
Default |
+70 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
Metaslab groups (top-level vdevs) are considered eligible for
+allocations if their fragmentation percentage metric is less than or
+equal to zfs_mg_fragmentation_threshold
. If a metaslab group exceeds
+this threshold then it will be skipped unless all metaslab groups within
+the metaslab class have also crossed the
+zfs_mg_fragmentation_threshold
threshold.
zfs_mg_fragmentation_threshold |
+Notes |
+
---|---|
Tags |
allocation, fragmentation, vdev |
+
When to change |
+Testing metaslab allocation |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
+1 to 100 |
+
Default |
+85 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
Metaslab groups (top-level vdevs) with free space percentage greater
+than zfs_mg_noalloc_threshold
are eligible for new allocations. If a
+metaslab group’s free space is less than or equal to the threshold, the
+allocator avoids allocating to that group unless all groups in the pool
+have reached the threshold. Once all metaslab groups have reached the
+threshold, all metaslab groups are allowed to accept allocations. The
+default value of 0 disables the feature and causes all metaslab groups
+to be eligible for allocations.
This parameter allows one to deal with pools having heavily imbalanced
+vdevs such as would be the case when a new vdev has been added. Setting
+the threshold to a non-zero percentage will stop allocations from being
+made to vdevs that aren’t filled to the specified percentage and allow
+lesser filled vdevs to acquire more allocations than they otherwise
+would under the older zfs_mg_alloc_failures
facility.
zfs_mg_noalloc_threshold |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To force rebalancing as top-level vdevs +are added or expanded |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
+0 to 100 |
+
Default |
+0 (disabled) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
The pool multihost
multimodifier protection (MMP) subsystem can
+record historical updates in the
+/proc/spl/kstat/zfs/POOL_NAME/multihost
file for debugging purposes.
+The number of lines of history is determined by zfs_multihost_history.
zfs_multihost_history |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When testing multihost feature |
+
Data Type |
+int |
+
Units |
+lines |
+
Range |
+0 to INT_MAX |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
zfs_multihost_interval
controls the frequency of multihost writes
+performed by the pool multihost multimodifier protection (MMP)
+subsystem. The multihost write period is (zfs_multihost_interval
/
+number of leaf-vdevs) milliseconds. Thus on average a multihost write
+will be issued for each leaf vdev every zfs_multihost_interval
+milliseconds. In practice, the observed period can vary with the I/O
+load and this observed value is the delay which is stored in the
+uberblock.
On import the multihost activity check waits a minimum amount of time
+determined by (zfs_multihost_interval
*
+zfs_multihost_import_intervals)
+with a lower bound of 1 second. The activity check time may be further
+extended if the value of mmp delay found in the best uberblock indicates
+actual multihost updates happened at longer intervals than
+zfs_multihost_interval
Note: the multihost protection feature applies to storage devices that +can be shared between multiple systems.
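As a worked example with the defaults, the activity check on import waits at least zfs_multihost_interval * zfs_multihost_import_intervals milliseconds:

echo $(( 1000 * 20 ))   # 20000 ms (20 seconds) minimum activity check, before the random extension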
+zfs_multihost_interval |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To optimize pool import time against +possibility of simultaneous import by +another system |
+
Data Type |
+ulong |
+
Units |
+milliseconds |
+
Range |
+100 to ULONG_MAX |
+
Default |
+1000 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
zfs_multihost_import_intervals
controls the duration of the activity
+test on pool import for the multihost multimodifier protection (MMP)
+subsystem. The activity test can be expected to take a minimum time of
+(zfs_multihost_import_interval
s *
+zfs_multihost_interval * random(25%)
)
+milliseconds. The random period of up to 25% improves simultaneous
+import detection. For example, if two hosts are rebooted at the same
time and automatically attempt to import the pool, then it is highly
+probable that one host will win.
Smaller values of zfs_multihost_import_intervals
reduces the import
+time but increases the risk of failing to detect an active pool. The
+total activity check time is never allowed to drop below one second.
Note: the multihost protection feature applies to storage devices that +can be shared between multiple systems.
+zfs_multihost_import_intervals |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+intervals |
+
Range |
+1 to UINT_MAX |
+
Default |
+20 since v0.8, previously 10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
zfs_multihost_fail_intervals
controls the behavior of the pool when
+write failures are detected in the multihost multimodifier protection
+(MMP) subsystem.
If zfs_multihost_fail_intervals = 0
then multihost write failures
+are ignored. The write failures are reported to the ZFS event daemon
+(zed
) which can take action such as suspending the pool or offlining
+a device.
zfs_multihost_fail_intervals > 0
then sequential multihost
+write failures will cause the pool to be suspended. This occurs when
+(zfs_multihost_fail_intervals
*
+zfs_multihost_interval) milliseconds
+have passed since the last successful multihost write.zfs_multihost_fail_intervals |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+intervals |
+
Range |
+0 to UINT_MAX |
+
Default |
+10 since v0.8, previously 5 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
The ZFS Event Daemon (zed) processes events from ZFS. However, it can be
+overwhelmed by high rates of error reports which can be generated by
+failing, high-performance devices. zfs_delays_per_second
limits the
+rate of delay events reported to zed.
zfs_delays_per_second |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If processing delay events at a higher rate +is desired |
+
Data Type |
+uint |
+
Units |
+events per second |
+
Range |
+0 to UINT_MAX |
+
Default |
+20 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.7 and later |
+
The ZFS Event Daemon (zed) processes events from ZFS. However, it can be
+overwhelmed by high rates of error reports which can be generated by
+failing, high-performance devices. zfs_checksums_per_second
limits
+the rate of checksum events reported to zed.
Note: do not set this value lower than the SERD limit for checksum
+in zed. By default, checksum_N
= 10 and checksum_T
= 10 minutes,
+resulting in a practical lower limit of 1.
zfs_checksums_per_second |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If processing checksum error events at a +higher rate is desired |
+
Data Type |
+uint |
+
Units |
+events per second |
+
Range |
+0 to UINT_MAX |
+
Default |
+20 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.7 and later |
+
When zfs_no_scrub_io = 1
scrubs do not actually scrub data and
simply do a metadata crawl of the pool instead.
zfs_no_scrub_io |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing scrub feature |
+
Data Type |
+boolean |
+
Range |
+0=perform scrub I/O, 1=do not perform scrub I/O |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.0 and later |
+
When zfs_no_scrub_prefetch = 1
, prefetch is disabled for scrub I/Os.
zfs_no_scrub_prefetch |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing scrub feature |
+
Data Type |
+boolean |
+
Range |
+0=prefetch scrub I/Os, 1=do not prefetch scrub I/Os |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.4 and later |
+
ZFS uses barriers (volatile cache flush commands) to ensure data is +committed to permanent media by devices. This ensures consistent +on-media state for devices where caches are volatile (eg HDDs).
+For devices with nonvolatile caches, the cache flush operation can be a +no-op. However, in some RAID arrays, cache flushes can cause the entire +cache to be flushed to the backing devices.
+To ensure on-media consistency, keep cache flush enabled.
+zfs_nocacheflush |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If the storage device has nonvolatile cache, +then disabling cache flush can save the cost of +occasional cache flush commands |
+
Data Type |
+boolean |
+
Range |
+0=send cache flush commands, 1=do not send +cache flush commands |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
The NOP-write feature is enabled by default when a
cryptographically secure checksum algorithm is in use by the dataset.
+zfs_nopwrite_enabled
allows the NOP-write feature to be completely
+disabled.
zfs_nopwrite_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+boolean |
+
Range |
+0=disable NOP-write feature, 1=enable +NOP-write feature |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.0 and later |
+
zfs_dmu_offset_next_sync
enables forcing txg sync to find holes.
+This causes ZFS to act like older versions when SEEK_HOLE
or
+SEEK_DATA
flags are used: when a dirty dnode causes txgs to be
+synced so the previous data can be found.
zfs_dmu_offset_next_sync |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+to exchange strict hole reporting for +performance |
+
Data Type |
+boolean |
+
Range |
+0=do not force txg sync to find holes, +1=force txg sync to find holes |
+
Default |
+1 since v2.1.5, previously 0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
zfs_pd_bytes_max
limits the number of bytes prefetched during a pool
+traversal (eg zfs send
or other data crawling operations). These
+prefetches are referred to as “prescient prefetches” and are always 100%
+hit rate. The traversal operations do not use the default data or
+metadata prefetcher.
zfs_pd_bytes_max |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int32 |
+
Units |
+bytes |
+
Range |
+0 to INT32_MAX |
+
Default |
+52,428,800 (50 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+TBD |
+
zfs_per_txg_dirty_frees_percent
as a percentage of
+zfs_dirty_data_max controls the percentage of
+dirtied blocks from frees in one txg. After the threshold is crossed,
+additional dirty blocks from frees wait until the next txg. Thus, when
+deleting large files, filling consecutive txgs with deletes/frees, does
+not throttle other, perhaps more important, writes.
A side effect of this throttle can impact zfs receive
workloads that
+contain a large number of frees and the
+ignore_hole_birth optimization is disabled. The
+symptom is that the receive workload causes an increase in the frequency
+of txg commits. The frequency of txg commits is observable via the
+otime
column of /proc/spl/kstat/zfs/POOLNAME/txgs
. Since txg
+commits also flush data from volatile caches in HDDs to media, HDD
+performance can be negatively impacted. Also, since the frees do not
+consume much bandwidth over the pipe, the pipe can appear to stall. Thus
+the overall progress of receives is slower than expected.
A value of zero will disable this throttle.
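For a receive-heavy workload exhibiting the txg-commit symptom described above, the throttle can be relaxed or disabled, for example:

echo 0 > /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent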
+zfs_per_txg_dirty_frees_percent |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+For |
+
Data Type |
+ulong |
+
Units |
+percent |
+
Range |
+0 to 100 |
+
Default |
+30 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
zfs_prefetch_disable
controls the predictive prefetcher.
Note that it leaves “prescient” prefetch (eg prefetch for zfs send
)
+intact (see zfs_pd_bytes_max)
zfs_prefetch_disable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+In some case where the workload is +completely random reads, overall performance +can be better if prefetch is disabled |
+
Data Type |
+boolean |
+
Range |
+0=prefetch enabled, 1=prefetch disabled |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Verification |
+prefetch efficacy is observed by
+ |
+
Versions Affected |
+all |
+
zfs_read_chunk_size
is the limit for ZFS filesystem reads. If an
+application issues a read()
larger than zfs_read_chunk_size
,
+then the read()
is divided into multiple operations no larger than
+zfs_read_chunk_size
zfs_read_chunk_size |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+512 to ULONG_MAX |
+
Default |
+1,048,576 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
Historical statistics for the last zfs_read_history
reads are
+available in /proc/spl/kstat/zfs/POOL_NAME/reads
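For example, to capture and inspect the most recent reads for a pool (POOL_NAME is a placeholder; 1000 is an arbitrary history length):
# record the last 1000 read operations
echo 1000 > /sys/module/zfs/parameters/zfs_read_history
cat /proc/spl/kstat/zfs/POOL_NAME/reads
# turn collection off again
echo 0 > /sys/module/zfs/parameters/zfs_read_history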
zfs_read_history |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To observe read operation details |
+
Data Type |
+int |
+
Units |
+lines |
+
Range |
+0 to INT_MAX |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
When zfs_read_history > 0
,
+zfs_read_history_hits controls whether ARC hits are displayed in the
+read history file, /proc/spl/kstat/zfs/POOL_NAME/reads
zfs_read_history_hits |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To observe read operation details with ARC +hits |
+
Data Type |
+boolean |
+
Range |
+0=do not include data for ARC hits, +1=include ARC hit data |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
zfs_recover
can be set to true (1) to attempt to recover from
+otherwise-fatal errors, typically caused by on-disk corruption. When
+set, calls to zfs_panic_recover()
will turn into warning messages
+rather than calling panic()
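A last-resort recovery sketch: enable the parameter only for the import attempt and restore it immediately afterwards (POOL_NAME is a placeholder):
# convert zfs_panic_recover() panics into warnings, then attempt the import
echo 1 > /sys/module/zfs/parameters/zfs_recover
zpool import POOL_NAME
# restore normal behaviour once recovery is complete
echo 0 > /sys/module/zfs/parameters/zfs_recover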
zfs_recover |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+zfs_recover should only be used as a last +resort, as it typically results in leaked +space, or worse |
+
Data Type |
+boolean |
+
Range |
+0=normal operation, 1=attempt recovery zpool +import |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Verification |
+check output of |
+
Versions Affected |
+v0.6.4 or later |
+
Resilvers are processed by the sync thread in syncing context. While
+resilvering, ZFS spends at least zfs_resilver_min_time_ms
time
+working on a resilver between txg commits.
The zfs_txg_timeout tunable sets a nominal
+timeout value for the txg commits. By default, this timeout is 5 seconds
+and the zfs_resilver_min_time_ms
is 3 seconds. However, many
+variables contribute to changing the actual txg times. The measured txg
+interval is observed as the otime
column (in nanoseconds) in the
+/proc/spl/kstat/zfs/POOL_NAME/txgs
file.
See also zfs_txg_timeout and +zfs_scan_min_time_ms
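A sketch of prioritizing a resilver on an otherwise idle pool by giving it more time per txg; 4,500 ms is an arbitrary example that stays below the default 5-second zfs_txg_timeout, and POOL_NAME is a placeholder:
# spend up to 4.5 seconds per txg on resilver work
echo 4500 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
# confirm the resulting txg cadence via the otime column
cat /proc/spl/kstat/zfs/POOL_NAME/txgs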
+zfs_resilver_min_time_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+In some resilvering cases, increasing
+ |
+
Data Type |
+int |
+
Units |
+milliseconds |
+
Range |
+1 to +zfs_txg_timeout +converted to milliseconds |
+
Default |
+3,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
Scrubs are processed by the sync thread in syncing context. While
+scrubbing, ZFS spends at least zfs_scan_min_time_ms
time working on
+a scrub between txg commits.
See also zfs_txg_timeout and +zfs_resilver_min_time_ms
+zfs_scan_min_time_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+In some scrub cases, increasing
+ |
+
Data Type |
+int |
+
Units |
+milliseconds |
+
Range |
+1 to zfs_txg_timeout +converted to milliseconds |
+
Default |
+1,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
To preserve progress across reboots, the sequential scan algorithm
+periodically needs to stop metadata scanning and issue all the
+verification I/Os to disk every zfs_scan_checkpoint_intval
seconds.
zfs_scan_checkpoint_intval |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+seconds |
+
Range |
+1 to INT_MAX |
+
Default |
+7,200 (2 hours) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 and later |
+
This tunable affects how scrub and resilver I/O segments are ordered. A +higher number indicates that we care more about how filled in a segment +is, while a lower number indicates we care more about the size of the +extent without considering the gaps within a segment.
+zfs_scan_fill_weight |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing sequential scrub and resilver |
+
Data Type |
+int |
+
Units |
+scalar |
+
Range |
+0 to INT_MAX |
+
Default |
+3 |
+
Change |
+Prior to zfs module load |
+
Versions Affected |
+v0.8.0 and later |
+
zfs_scan_issue_strategy
controls the order of data verification
+while scrubbing or resilvering.
value |
+description |
+
---|---|
0 |
+ZFS will use strategy 1 during normal verification and +strategy 2 while taking a checkpoint |
+
1 |
+data is verified as sequentially as possible, given the +amount of memory reserved for scrubbing (see +zfs_scan_mem_lim_fact). This +can improve scrub performance if the pool’s data is heavily +fragmented. |
+
2 |
+the largest mostly-contiguous chunk of found data is +verified first. By deferring scrubbing of small segments, +we may later find adjacent data to coalesce and increase +the segment size. |
+
zfs_scan_issue_strategy |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+enum |
+
Range |
+0 to 2 |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+TBD |
+
Setting zfs_scan_legacy = 1
enables the legacy scan and scrub
+behavior instead of the newer sequential behavior.
zfs_scan_legacy |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+In some cases, the new scan mode can consume +more memory as it collects and sorts I/Os; +using the legacy algorithm can be more memory +efficient at the expense of HDD read efficiency |
+
Data Type |
+boolean |
+
Range |
+0=use new method: scrubs and resilvers will +gather metadata in memory before issuing +sequential I/O, 1=use legacy algorithm where +I/O is initiated as soon as it is +discovered |
+
Default |
+0 |
+
Change |
+Dynamic, however changing to 0 does not affect +in-progress scrubs or resilvers |
+
Versions Affected |
+v0.8.0 and later |
+
zfs_scan_max_ext_gap
limits the largest gap in bytes between scrub
+and resilver I/Os that will still be considered sequential for sorting
+purposes.
zfs_scan_max_ext_gap |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+512 to ULONG_MAX |
+
Default |
+2,097,152 (2 MiB) |
+
Change |
+Dynamic, however changing to 0 does not +affect in-progress scrubs or resilvers |
+
Versions Affected |
+v0.8.0 and later |
+
zfs_scan_mem_lim_fact
limits the maximum fraction of RAM used for
+I/O sorting by sequential scan algorithm. When the limit is reached
+scanning metadata is stopped and data verification I/O is started. Data
+verification I/O continues until the memory used by the sorting
+algorithm drops by
+zfs_scan_mem_lim_soft_fact
Memory used by the sequential scan algorithm can be observed as the kmem
+sio_cache. This is visible from procfs as
+grep sio_cache /proc/slabinfo
and can be monitored using
+slab-monitoring tools such as slabtop
zfs_scan_mem_lim_fact |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+divisor of physical RAM |
+
Range |
+TBD |
+
Default |
+20 (physical RAM / 20 or 5%) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 and later |
+
zfs_scan_mem_lim_soft_fact
sets the fraction of the hard limit,
+zfs_scan_mem_lim_fact, used to determined
+the RAM soft limit for I/O sorting by the sequential scan algorithm.
+After zfs_scan_mem_lim_fact has been
+reached, metadata scanning is stopped until the RAM usage drops by
+zfs_scan_mem_lim_soft_fact
zfs_scan_mem_lim_soft_fact |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+divisor of (physical RAM / +zfs_scan_mem +_lim_fact) |
+
Range |
+1 to INT_MAX |
+
Default |
+20 (for default +zfs_scan_mem +_lim_fact, +0.25% of physical RAM) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 and later |
+
zfs_scan_vdev_limit
is the maximum amount of data that can be
+concurrently issued at once for scrubs and resilvers per leaf vdev.
+zfs_scan_vdev_limit
attempts to strike a balance between keeping the
+leaf vdev queues full of I/Os while not overflowing the queues causing
+high latency resulting in long txg sync times. While
+zfs_scan_vdev_limit
represents a bandwidth limit, the existing I/O
+limit of zfs_vdev_scrub_max_active
+remains in effect, too.
zfs_scan_vdev_limit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+512 to ULONG_MAX |
+
Default |
+4,194,304 (4 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 and later |
+
zfs_send_corrupt_data
enables zfs send
to send corrupt data
+by ignoring read and checksum errors. The corrupted or unreadable blocks
+are replaced with the value 0x2f5baddb10c
(ZFS bad block)
zfs_send_corrupt_data |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When data corruption exists and an attempt
+to recover at least some data via
+ |
+
Data Type |
+boolean |
+
Range |
+0=do not send corrupt data, 1=replace +corrupt data with cookie |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.0 and later |
+
The SPA sync process is performed in multiple passes. Once the pass
+number reaches zfs_sync_pass_deferred_free
, frees are no longer
+processed and must wait for the next SPA sync.
The zfs_sync_pass_deferred_free
value is expected to be removed as a
+tunable once the optimal value is determined during field testing.
The zfs_sync_pass_deferred_free
pass must be greater than 1 to
+ensure that regular blocks are not deferred.
zfs_sync_pass_deferred_free |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing SPA sync process |
+
Data Type |
+int |
+
Units |
+SPA sync passes |
+
Range |
+1 to INT_MAX |
+
Default |
+2 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
The SPA sync process is performed in multiple passes. Once the pass
+number reaches zfs_sync_pass_dont_compress
, data block compression
+is no longer processed and must wait for the next SPA sync.
The zfs_sync_pass_dont_compress
value is expected to be removed as a
+tunable once the optimal value is determined during field testing.
zfs_sync_pass_dont_compress |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing SPA sync process |
+
Data Type |
+int |
+
Units |
+SPA sync passes |
+
Range |
+1 to INT_MAX |
+
Default |
+5 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
The SPA sync process is performed in multiple passes. Once the pass
+number reaches zfs_sync_pass_rewrite
, blocks can be split into gang
+blocks.
The zfs_sync_pass_rewrite
value is expected to be removed as a
+tunable once the optimal value is determined during field testing.
zfs_sync_pass_rewrite |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing SPA sync process |
+
Data Type |
+int |
+
Units |
+SPA sync passes |
+
Range |
+1 to INT_MAX |
+
Default |
+2 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
zfs_sync_taskq_batch_pct
controls the number of threads used by the
+DSL pool sync taskq, dp_sync_taskq
zfs_sync_taskq_batch_pct |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+to adjust the number of
+ |
+
Data Type |
+int |
+
Units |
+percent of number of online CPUs |
+
Range |
+1 to 100 |
+
Default |
+75 |
+
Change |
+Prior to zfs module load |
+
Versions Affected |
+v0.7.0 and later |
+
Historical statistics for the last zfs_txg_history
txg commits are
+available in /proc/spl/kstat/zfs/POOL_NAME/txgs
The work required to measure the txg commit (SPA statistics) is low. +However, for debugging purposes, it can be useful to observe the SPA +statistics.
+zfs_txg_history |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To observe details of SPA sync behavior. |
+
Data Type |
+int |
+
Units |
+lines |
+
Range |
+0 to INT_MAX |
+
Default |
+0 for version v0.6.0 to v0.7.6, 100 for version v0.8.0 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
The open txg is committed to the pool periodically (SPA sync) and
+zfs_txg_timeout
represents the default target upper limit.
txg commits can occur more frequently and a rapid rate of txg commits +often indicates a busy write workload, quota limits reached, or the free +space is critically low.
+Many variables contribute to changing the actual txg times. txg commits
+can also take longer than zfs_txg_timeout
if the ZFS write throttle
+is not properly tuned or the time to sync is otherwise delayed (eg slow
+device). Shorter txg commit intervals can occur due to
+zfs_dirty_data_sync for write-intensive
+workloads. The measured txg interval is observed as the otime
column
+(in nanoseconds) in the /proc/spl/kstat/zfs/POOL_NAME/txgs
file.
See also zfs_dirty_data_sync and +zfs_txg_history
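For example, lengthening the nominal commit interval on a lightly loaded pool; 10 seconds is only an illustration, and the other triggers mentioned above can still force earlier commits (POOL_NAME is a placeholder):
# stretch the nominal txg commit interval to 10 seconds
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
# verify by watching the otime column
cat /proc/spl/kstat/zfs/POOL_NAME/txgs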
+zfs_txg_timeout |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To optimize the work done by txg commit +relative to the pool requirements. See also +section ZFS I/O +Scheduler |
+
Data Type |
+int |
+
Units |
+seconds |
+
Range |
+1 to INT_MAX |
+
Default |
+5 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
To reduce IOPs, small, adjacent I/Os can be aggregated (coalesced) into
+a large I/O. For reads, aggregations occur across small adjacency gaps.
+For writes, aggregation can occur at the ZFS or disk level.
+zfs_vdev_aggregation_limit
is the upper bound on the size of the
+larger, aggregated I/O.
Setting zfs_vdev_aggregation_limit = 0
effectively disables
+aggregation by ZFS. However, the block device scheduler can still merge
+(aggregate) I/Os. Also, many devices, such as modern HDDs, contain
+schedulers that can aggregate I/Os.
In general, I/O aggregation can improve performance for devices, such as
+HDDs, where ordering I/O operations for contiguous LBAs is a benefit.
+For random access devices, such as SSDs, aggregation might not improve
+performance relative to the CPU cycles needed to aggregate. For devices
+that represent themselves as having no rotation, the
+zfs_vdev_aggregation_limit_non_rotating
+parameter is used instead of zfs_vdev_aggregation_limit
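A hedged before/after test: the request-size histograms from zpool iostat -r show how much I/O is being aggregated, and the limit can then be adjusted at runtime. POOL_NAME is a placeholder and 131,072 bytes (the pre-v0.8 default) is used only as an example:
# observe aggregated vs. individual request sizes, refreshed every 5 seconds
zpool iostat -r POOL_NAME 5
# shrink the aggregation limit to 128 KiB for comparison
echo 131072 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit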
zfs_vdev_aggregation_limit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If the workload does not benefit from
+aggregation, the
+ |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 to 1,048,576 (default) or 16,777,216
+(if |
+
Default |
+1,048,576, or 131,072 for <v0.8 |
+
Change |
+Dynamic |
+
Verification |
+ZFS aggregation is observed with
+ |
+
Versions Affected |
+all |
+
Note: with the current ZFS code, the vdev cache is not helpful and in
+some cases actually harmful. Thus it is disabled by setting the
+zfs_vdev_cache_size = 0
zfs_vdev_cache_size
is the size of the vdev cache.
zfs_vdev_cache_size |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Do not change |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 to MAX_INT |
+
Default |
+0 (vdev cache is disabled) |
+
Change |
+Dynamic |
+
Verification |
+vdev cache statistics are available in the
+ |
+
Versions Affected |
+all |
+
Note: with the current ZFS code, the vdev cache is not helpful and in +some cases actually harmful. Thus it is disabled by setting the +zfs_vdev_cache_size to zero. This related +tunable is, by default, inoperative.
+All read I/Os smaller than zfs_vdev_cache_max
+are turned into (1 << zfs_vdev_cache_bshift
) byte reads by the vdev
+cache. At most zfs_vdev_cache_size bytes will
+be kept in each vdev’s cache.
zfs_vdev_cache_bshift |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Do not change |
+
Data Type |
+int |
+
Units |
+shift |
+
Range |
+1 to INT_MAX |
+
Default |
+16 (65,536 bytes) |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
Note: with the current ZFS code, the vdev cache is not helpful and in +some cases actually harmful. Thus it is disabled by setting the +zfs_vdev_cache_size to zero. This related +tunable is, by default, inoperative.
+All read I/Os smaller than zfs_vdev_cache_max will be turned into
+(1 <<
zfs_vdev_cache_bshift byte reads
+by the vdev cache. At most zfs_vdev_cache_size
bytes will be kept in
+each vdev’s cache.
zfs_vdev_cache_max |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Do not change |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+512 to INT_MAX |
+
Default |
+16,384 (16 KiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
The mirror read algorithm uses current load and an incremental weighting
+value to determine the vdev to service a read operation. Lower values
+determine the preferred vdev. The weighting value is
+zfs_vdev_mirror_rotating_inc
for rotating media and
+zfs_vdev_mirror_non_rotating_inc
+for nonrotating media.
Verify the rotational setting described by a block device in sysfs by
+observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_rotating_inc |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Increasing for mirrors with both +rotating and nonrotating media more +strongly favors the nonrotating +media |
+
Data Type |
+int |
+
Units |
+scalar |
+
Range |
+0 to MAX_INT |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
The mirror read algorithm uses current load and an incremental weighting
+value to determine the vdev to service a read operation. Lower values
+determine the preferred vdev. The weighting value is
+zfs_vdev_mirror_rotating_inc for
+rotating media and zfs_vdev_mirror_non_rotating_inc
for nonrotating
+media.
Verify the rotational setting described by a block device in sysfs by
+observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_non_rotating_inc |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+scalar |
+
Range |
+0 to INT_MAX |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
For rotating media in a mirror, if the next I/O offset is within
+zfs_vdev_mirror_rotating_seek_offset
+then the weighting factor is incremented by
+(zfs_vdev_mirror_rotating_seek_inc / 2
). Otherwise the weighting
+factor is increased by zfs_vdev_mirror_rotating_seek_inc
. This
+algorithm prefers rotating media with lower seek distance.
Verify the rotational setting described by a block device in sysfs by
+observing /sys/block/DISK_NAME/queue/rotational
z +fs_vdev_mirror_rotating_seek_inc |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+scalar |
+
Range |
+0 to INT_MAX |
+
Default |
+5 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
For rotating media in a mirror, if the next I/O offset is within
+zfs_vdev_mirror_rotating_seek_offset
then the weighting factor is
+incremented by
+(zfs_vdev_mirror_rotating_seek_inc/ 2
).
+Otherwise the weighting factor is increased by
+zfs_vdev_mirror_rotating_seek_inc
. This algorithm prefers rotating
+media with lower seek distance.
Verify the rotational setting described by a block device in sysfs by
+observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_rotating_seek_off +set |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 to INT_MAX |
+
Default |
+1,048,576 (1 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
For nonrotating media in a mirror, a seek penalty is applied as +sequential I/Os can be aggregated into fewer operations, avoiding +unnecessary per-command overhead, often boosting performance.
+Verify the rotational setting described by a block device in SysFS by
+observing /sys/block/DISK_NAME/queue/rotational
zfs_v +dev_mirror_non_rotating_seek_inc |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+scalar |
+
Range |
+0 to INT_MAX |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
To reduce IOPs, small, adjacent I/Os are aggregated (coalesced) into
+a large I/O. For reads, aggregations occur across small adjacency
+gaps where the gap is less than zfs_vdev_read_gap_limit
zfs_vdev_read_gap_limit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 to INT_MAX |
+
Default |
+32,768 (32 KiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
To reduce IOPs, small, adjacent I/Os are aggregated (coalesced) into
+a large I/O. For writes, aggregations occur across small adjacency
+gaps where the gap is less than zfs_vdev_write_gap_limit
zfs_vdev_write_gap_limit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 to INT_MAX |
+
Default |
+4,096 (4 KiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
Prior to version 0.8.3, when the pool is imported, for whole disk vdevs,
+the block device I/O scheduler is set to zfs_vdev_scheduler
.
+The most common schedulers are: noop, cfq, bfq, and deadline.
+In some cases, the scheduler is not changeable using this method.
+Known schedulers that cannot be changed are: scsi_mq and none.
+In these cases, the scheduler is unchanged and an error message can be
+reported to logs.
The parameter was disabled in v0.8.3 but left in place to avoid breaking
+loading of the zfs
module if the parameter is specified in modprobe
+configuration on existing installations. It is recommended that users
+leave the default scheduler “unless you’re encountering a specific
+problem, or have clearly measured a performance improvement for your
+workload,”
+and if so, to change it via the /sys/block/<device>/queue/scheduler
+interface and/or udev rule.
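A sketch of that replacement approach; the device name, scheduler, and udev rule filename are examples only and must be adapted to the system (the first command lists the schedulers the kernel actually offers):
# show available schedulers for one whole-disk vdev, then select one
cat /sys/block/sda/queue/scheduler
echo none > /sys/block/sda/queue/scheduler
# or persist the choice with a udev rule, e.g. /etc/udev/rules.d/66-zfs-scheduler.rules:
# ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"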
zfs_vdev_scheduler |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+since ZFS has its own I/O scheduler, using a +simple scheduler can result in more consistent +performance |
+
Data Type |
+string |
+
Range |
+expected: noop, cfq, bfq, and deadline |
+
Default |
+noop |
+
Change |
+Dynamic, but takes effect upon pool creation +or import |
+
Versions Affected |
+all, but no effect since v0.8.3 |
+
zfs_vdev_raidz_impl
overrides the raidz parity algorithm. By
+default, the algorithm is selected at zfs module load time by the
+results of a microbenchmark of algorithms based on the current hardware.
Once the module is loaded, the content of
+/sys/module/zfs/parameters/zfs_vdev_raidz_impl
shows available
+options with the currently selected enclosed in []
. Details of the
+results of the microbenchmark are observable in the
+/proc/spl/kstat/zfs/vdev_raidz_bench
file.
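For example, to review the benchmark results and pin a specific implementation (only use a value that the first command actually lists for the hardware; avx2 is shown purely as an illustration):
# available implementations; the current selection is shown in brackets
cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
# per-implementation microbenchmark results
cat /proc/spl/kstat/zfs/vdev_raidz_bench
# pin an implementation for testing
echo avx2 > /sys/module/zfs/parameters/zfs_vdev_raidz_impl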
algorithm |
+architecture |
+description |
+
---|---|---|
fastest |
+all |
+fastest implementation +selected by +microbenchmark |
+
original |
+all |
+original raidz +implementation |
+
scalar |
+all |
+scalar raidz +implementation |
+
sse2 |
+64-bit x86 |
+uses SSE2 instruction +set |
+
ssse3 |
+64-bit x86 |
+uses SSSE3 instruction +set |
+
avx2 |
+64-bit x86 |
+uses AVX2 instruction +set |
+
avx512f |
+64-bit x86 |
+uses AVX512F +instruction set |
+
avx512bw |
+64-bit x86 |
+uses AVX512F & AVX512BW +instruction sets |
+
aarch64_neon |
+aarch64/64 bit ARMv8 |
+uses NEON |
+
aarch64_neonx2 |
+aarch64/64 bit ARMv8 |
+uses NEON with more +unrolling |
+
zfs_vdev_raidz_impl |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+testing raidz algorithms |
+
Data Type |
+string |
+
Range |
+see table above |
+
Default |
+fastest |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
zfs_zevent_cols
is a soft wrap limit in columns (characters) for ZFS
+events logged to the console.
zfs_zevent_cols |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+if 80 columns isn’t enough |
+
Data Type |
+int |
+
Units |
+characters |
+
Range |
+1 to INT_MAX |
+
Default |
+80 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
If zfs_zevent_console
is true (1), then ZFS events are logged to the
+console.
More logging and log filtering capabilities are provided by zed
zfs_zevent_console |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+to log ZFS events to the console |
+
Data Type |
+boolean |
+
Range |
+0=do not log to console, 1=log to console |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
zfs_zevent_len_max
is the maximum ZFS event queue length. A value of
+0 results in a calculated value (16 * number of CPUs) with a minimum of
+64. Events in the queue can be viewed with the zpool events
command.
zfs_zevent_len_max |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+increase to see more ZFS events |
+
Data Type |
+int |
+
Units |
+events |
+
Range |
+0 to INT_MAX |
+
Default |
+0 (calculate as described above) |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
During a SPA sync, intent log transaction groups (itxg) are cleaned. The
+cleaning work is dispatched to the DSL pool ZIL clean taskq
+(dp_zil_clean_taskq
).
+zfs_zil_clean_taskq_minalloc is the
+minimum and zfs_zil_clean_taskq_maxalloc
is the maximum number of
+cached taskq entries for dp_zil_clean_taskq
. The actual number of
+taskq entries dynamically varies between these values.
When zfs_zil_clean_taskq_maxalloc
is exceeded, transaction records
+(itxs) are cleaned synchronously with possible negative impact to the
+performance of SPA sync.
Ideally taskq entries are pre-allocated prior to being needed by
+zil_clean()
, thus avoiding dynamic allocation of new taskq entries.
zfs_zil_clean_taskq_maxalloc |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If more |
+
Data Type |
+int |
+
Units |
+
|
+
Range |
+zfs_zil_clean_taskq_minallo
+c
+to |
+
Default |
+1,048,576 |
+
Change |
+Dynamic, takes effect per-pool when +the pool is imported |
+
Versions Affected |
+v0.8.0 |
+
During a SPA sync, intent log transaction groups (itxg) are cleaned. The
+cleaning work is dispatched to the DSL pool ZIL clean taskq
+(dp_zil_clean_taskq
). zfs_zil_clean_taskq_minalloc
is the
+minimum and
+zfs_zil_clean_taskq_maxalloc is the
+maximum number of cached taskq entries for dp_zil_clean_taskq
. The
+actual number of taskq entries dynamically varies between these values.
zfs_zil_clean_taskq_minalloc
is the minimum number of ZIL
+transaction records (itxs).
Ideally taskq entries are pre-allocated prior to being needed by
+zil_clean()
, thus avoiding dynamic allocation of new taskq entries.
zfs_zil_clean_taskq_minalloc |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+dp_zil_clean_taskq taskq entries |
+
Range |
++ |
Default |
+1,024 |
+
Change |
+Dynamic, takes effect per-pool when +the pool is imported |
+
Versions Affected |
+v0.8.0 |
+
zfs_zil_clean_taskq_nthr_pct
controls the number of threads used by
+the DSL pool ZIL clean taskq (dp_zil_clean_taskq
). The default value
+of 100% will create a maximum of one thread per cpu.
zfs_zil_clean_taskq_nthr_pct |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing ZIL clean and SPA sync +performance |
+
Data Type |
+int |
+
Units |
+percent of number of CPUs |
+
Range |
+1 to 100 |
+
Default |
+100 |
+
Change |
+Dynamic, takes effect per-pool when +the pool is imported |
+
Versions Affected |
+v0.8.0 |
+
If zil_replay_disable = 1
, then when a volume or filesystem is
+brought online, no attempt to replay the ZIL is made and any existing
+ZIL is destroyed. This can result in loss of data without notice.
zil_replay_disable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Do not change |
+
Data Type |
+boolean |
+
Range |
+0=replay ZIL, 1=destroy ZIL |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 |
+
zil_slog_bulk
is the log device write size limit per commit executed
+with synchronous priority. Writes below zil_slog_bulk
are executed
+with synchronous priority. Writes above zil_slog_bulk
are executed
+with lower (asynchronous) priority to reduce potential log device abuse
+by a single active ZIL writer.
zil_slog_bulk |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+0 to ULONG_MAX |
+
Default |
+786,432 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 |
+
If a ZFS I/O operation takes more than zio_delay_max
milliseconds to
+complete, then an event is logged. Note that this is only a logging
+facility, not a timeout on operations. See also zpool events
zio_delay_max |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+when debugging slow I/O |
+
Data Type |
+int |
+
Units |
+milliseconds |
+
Range |
+1 to INT_MAX |
+
Default |
+30,000 (30 seconds) |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
zio_dva_throttle_enabled
controls throttling of block allocations in
+the ZFS I/O (ZIO) pipeline. When enabled, the maximum number of pending
+allocations per top-level vdev is limited by
+zfs_vdev_queue_depth_pct
zio_dva_throttle_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing ZIO block allocation algorithms |
+
Data Type |
+boolean |
+
Range |
+0=do not throttle ZIO block allocations, +1=throttle ZIO block allocations |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
zio_requeue_io_start_cut_in_line
controls prioritization of a
+re-queued ZFS I/O (ZIO) in the ZIO pipeline by the ZIO taskq.
zio_requeue_io_start_cut_in_line |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Do not change |
+
Data Type |
+boolean |
+
Range |
+0=don’t prioritize re-queued +I/Os, 1=prioritize re-queued +I/Os |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+all |
+
zio_taskq_batch_pct
sets the number of I/O worker threads as a
+percentage of online CPUs. These worker threads are responsible for I/O
+work such as compression and checksum calculations.
Each block is handled by one worker thread, so maximum overall worker +thread throughput is a function of the number of concurrent blocks being +processed, the number of worker threads, and the algorithms used. The +default value of 75% is chosen to avoid using all CPUs, which can result +in latency issues and inconsistent application performance, especially +when high compression is enabled.
+The taskq batch processes are:
+taskq |
+process name |
+Notes |
+
---|---|---|
Write issue |
+z_wr_iss[_#] |
+Can be CPU intensive, runs at lower +priority than other taskqs |
+
Other taskqs exist, but most have fixed numbers of instances and +therefore require recompiling the kernel module to adjust.
+zio_taskq_batch_pct |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+To tune parallelism in multiprocessor systems |
+
Data Type |
+int |
+
Units |
+percent of number of CPUs |
+
Range |
+1 to 100, fractional number of CPUs are +rounded down |
+
Default |
+75 |
+
Change |
+Prior to zfs module load |
+
Verification |
+The number of taskqs for each batch group can
+be observed using |
+
Versions Affected |
+TBD |
+
zvol_inhibit_dev
controls the creation of volume device nodes upon
+pool import.
zvol_inhibit_dev |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Inhibiting can slightly improve startup time on +systems with a very large number of volumes |
+
Data Type |
+boolean |
+
Range |
+0=create volume device nodes, 1=do not create +volume device nodes |
+
Default |
+0 |
+
Change |
+Dynamic, takes effect per-pool when the pool is +imported |
+
Versions Affected |
+v0.6.0 and later |
+
zvol_major
is the default major number for volume devices.
zvol_major |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Do not change |
+
Data Type |
+uint |
+
Default |
+230 |
+
Change |
+Dynamic, takes effect per-pool when the pool is +imported or volumes are created |
+
Versions Affected |
+all |
+
Discard (aka ATA TRIM or SCSI UNMAP) operations done on volumes are done
+in batches of zvol_max_discard_blocks
blocks. The block size is
+determined by the volblocksize
property of a volume.
Some applications, such as mkfs
, discard the whole volume at once
+using the maximum possible discard size. As a result, many gigabytes of
+discard requests are not uncommon. Unfortunately, if a large amount of
+data is already allocated in the volume, ZFS can be quite slow to
+process discard requests. This is especially true if the volblocksize is
+small (eg default=8KB). As a result, very large discard requests can
+take a very long time (perhaps minutes under heavy load) to complete.
+This can cause a number of problems, most notably if the volume is
+accessed remotely (eg via iSCSI), in which case the client has a high
+probability of timing out on the request.
Limiting the zvol_max_discard_blocks
can decrease the size of
+each discard request, because it sets the discard_max_bytes
and
+discard_max_hw_bytes
for the volume’s block device in SysFS. This
+value is readable by volume device consumers.
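A sketch of checking the advertised limit and then halving the batch size; zd0 is a placeholder volume device name and the resulting byte limit depends on the volume's volblocksize:
# discard limit advertised to consumers of this volume's block device
cat /sys/block/zd0/queue/discard_max_bytes
# halve the default batch size; applies to volumes created or imported afterwards
echo 8192 > /sys/module/zfs/parameters/zvol_max_discard_blocks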
zvol_max_discard_blocks |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+if volume discard activity severely +impacts other workloads |
+
Data Type |
+ulong |
+
Units |
+number of blocks of size volblocksize |
+
Range |
+0 to ULONG_MAX |
+
Default |
+16,384 |
+
Change |
+Dynamic, takes effect per-pool when the +pool is imported or volumes are created |
+
Verification |
+Observe value of
+ |
+
Versions Affected |
+v0.6.0 and later |
+
When importing a pool with volumes or adding a volume to a pool,
+zvol_prefetch_bytes
are prefetched from the start and end of the
+volume. Prefetching these regions of the volume is desirable because
+they are likely to be accessed immediately by blkid(8)
or by the
+kernel scanning for a partition table.
zvol_prefetch_bytes |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+bytes |
+
Range |
+0 to UINT_MAX |
+
Default |
+131,072 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.5 and later |
+
When zvol_request_sync is set to 1, I/O requests for a volume are submitted synchronously. +This effectively limits the queue depth to 1 for each I/O submitter. +When set to 0, requests are handled asynchronously by the “zvol” thread +pool.
+See also zvol_threads
+zvol_request_sync |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing concurrent volume requests |
+
Data Type |
+boolean |
+
Range |
+0=do concurrent (async) volume requests, 1=do +sync volume requests |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.2 and later |
+
zvol_threads controls the maximum number of threads handling concurrent +volume I/O requests.
+The default of 32 threads behaves similarly to a disk with a 32-entry +command queue. The actual number of threads required can vary widely by +workload and available CPUs. If lock analysis shows high contention in +the zvol taskq threads, then reducing the number of zvol_threads or +workload queue depth can improve overall throughput.
+See also zvol_request_sync
+zvol_threads |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Matching the number of concurrent volume +requests with workload requirements can improve +concurrency |
+
Data Type |
+uint |
+
Units |
+threads |
+
Range |
+1 to UINT_MAX |
+
Default |
+32 |
+
Change |
+Dynamic, takes effect per-volume when the pool +is imported or volumes are created |
+
Verification |
+
|
+
Versions Affected |
+v0.7.0 and later |
+
zvol_volmode
defines volume block devices behaviour when the
+volmode
property is set to default
Note: to maintain compatibility with ZFS on BSD, “geom” is synonymous +with “full”
+value |
+volmode |
+Description |
+
---|---|---|
1 |
+full |
+legacy fully functional behaviour (default) |
+
2 |
+dev |
+hide partitions on volume block devices |
+
3 |
+none |
+not exposing volumes outside ZFS |
+
zvol_volmode |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+enum |
+
Range |
+1, 2, or 3 |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
zfs_qat_disable
controls the Intel QuickAssist Technology (QAT)
+driver providing hardware acceleration for gzip compression. When the
+QAT hardware is present and qat driver available, the default behaviour
+is to enable QAT.
zfs_qat_disable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing QAT functionality |
+
Data Type |
+boolean |
+
Range |
+0=use QAT acceleration if available, 1=do not +use QAT acceleration |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7, renamed to +zfs_qat_ +compress_disable +in v0.8 |
+
zfs_qat_checksum_disable
controls the Intel QuickAssist Technology
+(QAT) driver providing hardware acceleration for checksums. When the QAT
+hardware is present and qat driver available, the default behaviour is
+to enable QAT.
zfs_qat_checksum_disable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing QAT functionality |
+
Data Type |
+boolean |
+
Range |
+0=use QAT acceleration if available, +1=do not use QAT acceleration |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 |
+
zfs_qat_compress_disable
controls the Intel QuickAssist Technology
+(QAT) driver providing hardware acceleration for gzip compression. When
+the QAT hardware is present and qat driver available, the default
+behaviour is to enable QAT.
zfs_qat_compress_disable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing QAT functionality |
+
Data Type |
+boolean |
+
Range |
+0=use QAT acceleration if available, +1=do not use QAT acceleration |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 |
+
zfs_qat_encrypt_disable
controls the Intel QuickAssist Technology
+(QAT) driver providing hardware acceleration for encryption. When the
+QAT hardware is present and qat driver available, the default behaviour
+is to enable QAT.
zfs_qat_encrypt_disable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing QAT functionality |
+
Data Type |
+boolean |
+
Range |
+0=use QAT acceleration if available, 1=do +not use QAT acceleration |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 |
+
The dbuf_cache_hiwater_pct
and
+dbuf_cache_lowater_pct define the
+operating range for the dbuf cache evict thread. The hiwater and lowater are
+percentages of the dbuf_cache_max_bytes
+value. When the dbuf cache grows above ((100% +
+dbuf_cache_hiwater_pct
) *
+dbuf_cache_max_bytes) then the dbuf cache
+thread begins evicting. When the dbuf cache falls below ((100% -
+dbuf_cache_lowater_pct) *
+dbuf_cache_max_bytes) then the dbuf cache
+thread stops evicting.
dbuf_cache_hiwater_pct |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing dbuf cache algorithms |
+
Data Type |
+uint |
+
Units |
+percent |
+
Range |
+0 to UINT_MAX |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
The dbuf_cache_hiwater_pct and dbuf_cache_lowater_pct define the
+operating range for the dbuf cache evict thread. The hiwater and lowater are
+percentages of the dbuf_cache_max_bytes
+value. When the dbuf cache grows above ((100% +
+dbuf_cache_hiwater_pct) *
+dbuf_cache_max_bytes) then the dbuf cache
+thread begins evicting. When the dbuf cache falls below ((100% -
+dbuf_cache_lowater_pct
) *
+dbuf_cache_max_bytes) then the dbuf cache
+thread stops evicting.
dbuf_cache_lowater_pct |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing dbuf cache algorithms |
+
Data Type |
+uint |
+
Units |
+percent |
+
Range |
+0 to UINT_MAX |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
The dbuf cache maintains a list of dbufs that are not currently held but +have been recently released. These dbufs are not eligible for ARC +eviction until they are aged out of the dbuf cache. Dbufs are added to +the dbuf cache once the last hold is released. If a dbuf is later +accessed and still exists in the dbuf cache, then it will be removed +from the cache and later re-added to the head of the cache. Dbufs that +are aged out of the cache will be immediately destroyed and become +eligible for ARC eviction.
+The size of the dbuf cache is set by dbuf_cache_max_bytes
. The
+actual size is dynamically adjusted to the minimum of current ARC target
+size (c
) >> dbuf_cache_max_shift and the
+default dbuf_cache_max_bytes
dbuf_cache_max_bytes |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing dbuf cache algorithms |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+16,777,216 to ULONG_MAX |
+
Default |
+104,857,600 (100 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
The dbuf_cache_max_bytes minimum is the
+lesser of dbuf_cache_max_bytes and the
+current ARC target size (c
) >> dbuf_cache_max_shift
dbuf_cache_max_shift |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing dbuf cache algorithms |
+
Data Type |
+int |
+
Units |
+shift |
+
Range |
+1 to 63 |
+
Default |
+5 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
Each of the concurrent object allocators grabs
+2^dmu_object_alloc_chunk_shift
dnode slots at a time. The default is
+to grab 128 slots, or 4 blocks worth. This default value was
+experimentally determined to be the lowest value that eliminates the
+measurable effect of lock contention in the DMU object allocation code
+path.
dmu_object_alloc_chunk_shift |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If the workload creates many files
+concurrently on a system with many
+CPUs, then increasing
+ |
+
Data Type |
+int |
+
Units |
+shift |
+
Range |
+7 to 9 |
+
Default |
+7 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
Alias for ignore_hole_birth
+zfs_abd_scatter_enabled
controls the ARC Buffer Data (ABD)
+scatter/gather feature.
When disabled, the legacy behaviour is selected using linear buffers.
+For linear buffers, all the data in the ABD is stored in one contiguous
+buffer in memory (from a zio_[data_]buf_*
kmem cache).
When enabled (default), the data in the ABD is split into equal-sized
+chunks (from the abd_chunk_cache
kmem_cache), with pointers to the
+chunks recorded in an array at the end of the ABD structure. This allows
+more efficient memory allocation for buffers, especially when large
+recordsizes are used.
zfs_abd_scatter_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing ABD |
+
Data Type |
+boolean |
+
Range |
+0=use linear allocation only, 1=allow +scatter/gather |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Verification |
+ABD statistics are observable in
+ |
+
Versions Affected |
+v0.7.0 and later |
+
zfs_abd_scatter_max_order
sets the maximum order for physical page
+allocation when ABD is enabled (see
+zfs_abd_scatter_enabled)
See also Buddy Memory Allocation in the Linux kernel documentation.
+zfs_abd_scatter_max_order |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing ABD features |
+
Data Type |
+int |
+
Units |
+orders |
+
Range |
+1 to 10 (upper limit is +hardware-dependent) |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Verification |
+ABD statistics are observable in
+ |
+
Versions Affected |
+v0.7.0 and later |
+
When compression is enabled for a dataset, later reads of the data can +store the blocks in ARC in their on-disk, compressed state. This can +increase the effective size of the ARC, as counted in blocks, and thus +improve the ARC hit ratio.
+zfs_compressed_arc_enabled |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing ARC compression feature |
+
Data Type |
+boolean |
+
Range |
+0=compressed ARC disabled (legacy +behaviour), 1=compress ARC data |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Verification |
+raw ARC statistics are observable in
+ |
+
Versions Affected |
+v0.7.0 and later |
+
For encrypted datasets, the salt is regenerated every
+zfs_key_max_salt_uses
blocks. This automatic regeneration reduces
+the probability of collisions due to the Birthday problem. When set to
+the default (400,000,000) the probability of collision is approximately
+1 in 1 trillion.
zfs_key_max_salt_uses |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing encryption features |
+
Data Type |
+ulong |
+
Units |
+blocks encrypted |
+
Range |
+1 to ULONG_MAX |
+
Default |
+400,000,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 and later |
+
zfs_object_mutex_size
facilitates resizing the per-dataset znode
+mutex array for testing deadlocks therein.
zfs_object_mutex_size |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Testing znode mutex array deadlocks |
+
Data Type |
+uint |
+
Units |
+orders |
+
Range |
+1 to UINT_MAX |
+
Default |
+64 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 and later |
+
When scrubbing or resilvering, by default, ZFS checks to ensure it is
+not over the hard memory limit before each txg commit. If finer-grained
+control of this is needed zfs_scan_strict_mem_lim
can be set to 1 to
+enable checking before scanning each block.
zfs_scan_strict_mem_lim |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+Do not change |
+
Data Type |
+boolean |
+
Range |
+0=normal scan behaviour, 1=check hard +memory limit strictly during scan |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.0 |
+
zfs_send_queue_length
is the maximum number of bytes allowed in the
+zfs send queue.
zfs_send_queue_length |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When using the largest recordsize or +volblocksize (16 MiB), increasing can +improve send efficiency |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+Must be at least twice the maximum +recordsize or volblocksize in use |
+
Default |
+16,777,216 bytes (16 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.1 |
+
zfs_recv_queue_length
is the maximum number of bytes allowed in the
+zfs receive queue.
zfs_recv_queue_length |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+When using the largest recordsize or +volblocksize (16 MiB), increasing can +improve receive efficiency |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+Must be at least twice the maximum +recordsize or volblocksize in use |
+
Default |
+16,777,216 bytes (16 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.1 |
+
arc_min_prefetch_lifespan
is the minimum time for a prefetched block
+to remain in ARC before it is eligible for eviction.
zfs_arc_min_prefetch_lifespan |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+clock ticks |
+
Range |
+0 = use default value |
+
Default |
+1 second (as expressed in clock ticks) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
zfs_scan_ignore_errors
allows errors discovered during scrub or
+resilver to be ignored. This can be tuned as a workaround to remove the
+dirty time list (DTL) when completing a pool scan. It is intended to be
+used during pool repair or recovery to prevent resilvering when the pool
+is imported.
zfs_scan_ignore_errors |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+See description above |
+
Data Type |
+boolean |
+
Range |
+0 = do not ignore errors, 1 = ignore +errors during pool scrub or resilver |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.8.1 |
+
zfs_top_maxinflight
is used to limit the maximum number of I/Os
+queued to top-level vdevs during scrub or resilver operations. The
+actual top-level vdev limit is calculated by multiplying the number of
+child vdevs by zfs_top_maxinflight
This limit is an additional cap
+over and above the scan limits
zfs_top_maxinflight |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+for modern ZFS versions, the ZIO scheduler +limits usually take precedence |
+
Data Type |
+int |
+
Units |
+I/O operations |
+
Range |
+1 to MAX_INT |
+
Default |
+32 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.0 |
+
zfs_resilver_delay
sets a time-based delay for resilver I/Os. This
+delay is in addition to the ZIO scheduler’s treatment of scrub
+workloads. See also zfs_scan_idle
zfs_resilver_delay |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+increasing can reduce impact of resilver +workload on dynamic workloads |
+
Data Type |
+int |
+
Units |
+clock ticks |
+
Range |
+0 to MAX_INT |
+
Default |
+2 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.0 |
+
zfs_scrub_delay
sets a time-based delay for scrub I/Os. This delay
+is in addition to the ZIO scheduler’s treatment of scrub workloads. See
+also zfs_scan_idle
zfs_scrub_delay |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+increasing can reduce impact of scrub workload +on dynamic workloads |
+
Data Type |
+int |
+
Units |
+clock ticks |
+
Range |
+0 to MAX_INT |
+
Default |
+4 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.0 |
+
When a non-scan I/O has occurred in the past zfs_scan_idle
clock
+ticks, then zfs_resilver_delay or
+zfs_scrub_delay are enabled.
zfs_scan_idle |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+as part of a resilver/scrub tuning effort |
+
Data Type |
+int |
+
Units |
+clock ticks |
+
Range |
+0 to MAX_INT |
+
Default |
+50 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.0 |
+
By default, ZFS will choose the highest performance, hardware-optimized
+implementation of the AES encryption algorithm. The icp_aes_impl
+tunable overrides this automatic choice.
Note: icp_aes_impl
is set in the icp
kernel module, not the
+zfs
kernel module.
To observe the available options
+cat /sys/module/icp/parameters/icp_aes_impl
The default option is
+shown in brackets ‘[]’
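For example (use only a value that the first command lists for the hardware; generic is shown purely as an illustration):
# list AES implementations; the current selection is shown in brackets
cat /sys/module/icp/parameters/icp_aes_impl
# force a specific implementation while debugging
echo generic > /sys/module/icp/parameters/icp_aes_impl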
icp_aes_impl |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+icp |
+
When to change |
+debugging ZFS encryption on hardware |
+
Data Type |
+string |
+
Range |
+varies by hardware |
+
Default |
+automatic, depends on the hardware |
+
Change |
+dynamic |
+
Versions Affected |
+planned for v2 |
+
By default, ZFS will choose the highest performance, hardware-optimized
+implementation of the GCM encryption algorithm. The icp_gcm_impl
+tunable overrides this automatic choice.
Note: icp_gcm_impl
is set in the icp
kernel module, not the
+zfs
kernel module.
To observe the available options
+cat /sys/module/icp/parameters/icp_gcm_impl
The default option is
+shown in brackets ‘[]’
icp_gcm_impl |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+icp |
+
When to change |
+debugging ZFS encryption on hardware |
+
Data Type |
+string |
+
Range |
+varies by hardware |
+
Default |
+automatic, depends on the hardware |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_abd_scatter_min_size
changes the ARC buffer data (ABD)
+allocator’s threshold for using linear or page-based scatter buffers.
+Allocations smaller than zfs_abd_scatter_min_size
use linear ABDs.
Scatter ABD’s use at least one page each, so sub-page allocations waste +some space when allocated as scatter allocations. For example, 2KB +scatter allocation wastes half of each page. Using linear ABD’s for +small allocations results in slabs containing many allocations. This can +improve memory efficiency, at the expense of more work for ARC evictions +attempting to free pages, because all the buffers on one slab need to be +freed in order to free the slab and its underlying pages.
+Typically, 512B and 1KB kmem caches have 16 buffers per slab, so it’s +possible for them to actually waste more memory than scatter +allocations:
+one page per buf = wasting 3/4 or 7/8
one buf per slab = wasting 15/16
Spill blocks are typically 512B and are heavily used on systems running
+selinux with the default dnode size and the xattr=sa
property set.
By default, linear allocations for 512B and 1KB, and scatter allocations +for larger (>= 1.5KB) allocation requests.
+zfs_abd_scatter_min_size |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+debugging memory allocation, especially +for large pages |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 to MAX_INT |
+
Default |
+1536 (512B and 1KB allocations will be +linear) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_unlink_suspend_progress
changes the policy for removing pending
+unlinks. When enabled, files will not be asynchronously removed from the
+list of pending unlinks and the space they consume will be leaked. Once
+this option has been disabled and the dataset is remounted, the pending
+unlinks will be processed and the freed space returned to the pool.
zfs_unlink_suspend_progress |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+used by the ZFS test suite (ZTS) to +facilitate testing |
+
Data Type |
+boolean |
+
Range |
+0 = use async unlink removal, 1 = do +not async unlink thus leaking space |
+
Default |
+0 |
+
Change |
+prior to dataset mount |
+
Versions Affected |
+planned for v2 |
+
spa_load_verify_shift
sets the fraction of ARC that can be used by
+inflight I/Os when verifying the pool during import. This value is a
+“shift” representing the fraction of ARC target size
+(grep -w c /proc/spl/kstat/zfs/arcstats
). The ARC target size is
+shifted to the right. Thus a value of ‘2’ results in the fraction = 1/4,
+while a value of ‘4’ results in the fraction = 1/8.
For large memory machines, pool import can consume large amounts of ARC:
+much larger than the value of maxinflight. This can result in
+spa_load_verify_maxinflight having a
+value of 0 causing the system to hang. Setting spa_load_verify_shift
+can reduce this limit and allow importing without hanging.
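A sketch for an import that hangs or fails on a large-memory machine; the shift value 6 (1/64 of the ARC target) and POOL_NAME are illustrative only:
# allow at most 1/64 of the ARC target for verification I/O during import
echo 6 > /sys/module/zfs/parameters/spa_load_verify_shift
zpool import POOL_NAME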
spa_load_verify_shift |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+troubleshooting pool import on large memory +machines |
+
Data Type |
+int |
+
Units |
+shift |
+
Range |
+1 to MAX_INT |
+
Default |
+4 |
+
Change |
+prior to importing a pool |
+
Versions Affected |
+planned for v2 |
+
spa_load_print_vdev_tree
enables printing of the attempted pool
+import’s vdev tree to the ZFS debug message log
+/proc/spl/kstat/zfs/dbgmsg
Both the provided vdev tree and MOS vdev
+tree are printed, which can be useful for debugging problems with the
+zpool cachefile
spa_load_print_vdev_tree |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+troubleshooting pool import failures |
+
Data Type |
+boolean |
+
Range |
+0 = do not print pool configuration in +logs, 1 = print pool configuration in +logs |
+
Default |
+0 |
+
Change |
+prior to pool import |
+
Versions Affected |
+planned for v2 |
+
When importing a pool in readonly mode
+(zpool import -o readonly=on ...
) then up to
+zfs_max_missing_tvds
top-level vdevs can be missing, but the import
+can attempt to progress.
Note: This is strictly intended for advanced pool recovery cases since
+missing data is almost inevitable. Pools with missing devices can only
+be imported read-only for safety reasons, and the pool’s failmode
+property is automatically set to continue
The expected use case is to recover pool data immediately after +accidentally adding a non-protected vdev to a protected pool.
+With 1 missing top-level vdev, ZFS should be able to import the pool +and mount all datasets. User data that was not modified after the +missing device has been added should be recoverable. Thus snapshots +created prior to the addition of that device should be completely +intact.
With 2 missing top-level vdevs, some datasets may fail to mount since +there are dataset statistics that are stored as regular metadata. +Some data might be recoverable if those vdevs were added recently.
With 3 or more top-level missing vdevs, the pool is severely damaged +and MOS entries may be missing entirely. Chances of data recovery are +very low. Note that there are also risks of performing an inadvertent +rewind as we might be missing all the vdevs with the latest +uberblocks.
zfs_max_missing_tvds |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+troubleshooting pools with missing devices |
+
Data Type |
+int |
+
Units |
+missing top-level vdevs |
+
Range |
+0 to MAX_INT |
+
Default |
+0 |
+
Change |
+prior to pool import |
+
Versions Affected |
+planned for v2 |
+
dbuf_metadata_cache_shift
sets the size of the dbuf metadata cache
+as a fraction of ARC target size. This is an alternate method for
+setting dbuf metadata cache size than
+dbuf_metadata_cache_max_bytes.
dbuf_metadata_cache_max_bytes
+overrides dbuf_metadata_cache_shift
This value is a “shift” representing the fraction of ARC target size
+(grep -w c /proc/spl/kstat/zfs/arcstats
). The ARC target size is
+shifted to the right. Thus a value of ‘2’ results in the fraction = 1/4,
+while a value of ‘6’ results in the fraction = 1/64.
dbuf_metadata_cache_shift |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+int |
+
Units |
+shift |
+
Range |
+practical range is (dbuf_cache_shift + 1) to MAX_INT |
+
Default |
+6 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
dbuf_metadata_cache_max_bytes
sets the size of the dbuf metadata
+cache as a number of bytes. This is an alternate method for setting dbuf
+metadata cache size than
+dbuf_metadata_cache_shift
dbuf_metadata_cache_max_bytes
+overrides dbuf_metadata_cache_shift
dbuf_metadata_cache_max_bytes |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 = use
+dbuf_metadata_cache_sh
+ift
+to ARC |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
dbuf_cache_shift
sets the size of the dbuf cache as a fraction of
+ARC target size. This is an alternate method for setting dbuf cache size
+than dbuf_cache_max_bytes.
dbuf_cache_max_bytes overrides
+dbuf_cache_shift
This value is a “shift” representing the fraction of ARC target size
+(grep -w c /proc/spl/kstat/zfs/arcstats
). The ARC target size is
+shifted to the right. Thus a value of ‘2’ results in the fraction = 1/4,
+while a value of ‘5’ results in the fraction = 1/32.
Performance tuning of dbuf cache can be monitored using:
+dbufstat
command
node_exporter ZFS +module for prometheus environments
telegraf ZFS plugin for +general-purpose metric collection
/proc/spl/kstat/zfs/dbufstats
kstat
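Combining the observation sources above with a runtime change, a minimal sketch (kstat field names can differ between versions, and the shift value 6 is only an example):
# current dbuf cache statistics
cat /proc/spl/kstat/zfs/dbufstats
# shrink the dbuf cache target to 1/64 of the ARC target size
echo 6 > /sys/module/zfs/parameters/dbuf_cache_shift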
dbuf_cache_shift |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+to improve performance of read-intensive +channel programs |
+
Data Type |
+int |
+
Units |
+shift |
+
Range |
+5 to MAX_INT |
+
Default |
+5 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
dbuf_cache_max_bytes
sets the size of the dbuf cache in bytes. This is an alternative to dbuf_cache_shift for setting the dbuf cache size.
Performance tuning of dbuf cache can be monitored using:
+dbufstat
command
node_exporter ZFS +module for prometheus environments
telegraf ZFS plugin for +general-purpose metric collection
/proc/spl/kstat/zfs/dbufstats
kstat
dbuf_cache_max_bytes |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+int |
+
Units |
+bytes |
+
Range |
0 = use dbuf_cache_shift to ARC c_max |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
When testing allocation code, metaslab_force_ganging
forces blocks
+above the specified size to be ganged.
metaslab_force_ganging |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+for development testing purposes only |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+SPA_MINBLOCKSIZE to (SPA_MAXBLOCKSIZE + 1) |
+
Default |
+SPA_MAXBLOCKSIZE + 1 (16,777,217 bytes) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
When adding a top-level vdev, zfs_vdev_default_ms_count
is the
+target number of metaslabs.
zfs_vdev_default_ms_count |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+for development testing purposes only |
+
Data Type |
+int |
+
Range |
+16 to MAX_INT |
+
Default |
+200 |
+
Change |
+prior to creating a pool or adding a +top-level vdev |
+
Versions Affected |
+planned for v2 |
+
During top-level vdev removal, chunks of data are copied from the vdev
+which may include free space in order to trade bandwidth for IOPS.
+vdev_removal_max_span
sets the maximum span of free space included
+as unnecessary data in a chunk of copied data.
vdev_removal_max_span |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 to MAX_INT |
+
Default |
+32,768 (32 KiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
When removing a device, zfs_removal_ignore_errors
controls the
+process for handling hard I/O errors. When set, if a device encounters a
+hard IO error during the removal process the removal will not be
+cancelled. This can result in a normally recoverable block becoming
+permanently damaged and is not recommended. This should only be used as
+a last resort when the pool cannot be returned to a healthy state prior
+to removing the device.
zfs_removal_ignore_errors |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+See description for caveat |
+
Data Type |
+boolean |
+
Range |
+during device removal: 0 = hard errors +are not ignored, 1 = hard errors are +ignored |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_removal_suspend_progress
is used during automated testing of the
+ZFS code to increase test coverage.
zfs_removal_suspend_progress |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+do not change |
+
Data Type |
+boolean |
+
Range |
+0 = do not suspend during vdev removal |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
During vdev removal, the vdev indirection layer sleeps for
+zfs_condense_indirect_commit_entry_delay_ms
milliseconds during
+mapping generation. This parameter is used during automated testing of
+the ZFS code to improve test coverage.
zfs_condens +e_indirect_commit_entry_delay_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+do not change |
+
Data Type |
+int |
+
Units |
+milliseconds |
+
Range |
+0 to MAX_INT |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
During vdev removal, the condensing process attempts to save memory by
+removing obsolete mappings. zfs_condense_indirect_vdevs_enable
+enables condensing indirect vdev mappings. When set, ZFS attempts to
+condense indirect vdev mappings if the mapping uses more than
+zfs_condense_min_mapping_bytes
+bytes of memory and if the obsolete space map object uses more than
+zfs_condense_max_obsolete_bytes
+bytes on disk.
zf +s_condense_indirect_vdevs_enable |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+boolean |
+
Range |
+0 = do not save memory, 1 = save +memory by condensing obsolete +mapping after vdev removal |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
After vdev removal, zfs_condense_max_obsolete_bytes
sets the limit
+for beginning the condensing process. Condensing begins if the obsolete
+space map takes up more than zfs_condense_max_obsolete_bytes
of
+space on disk (logically). The default of 1 GiB is small enough relative
+to a typical pool that the space consumed by the obsolete space map is
+minimal.
See also +zfs_condense_indirect_vdevs_enable
+zfs_condense_max_obsolete_bytes |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+do not change |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+0 to MAX_ULONG |
+
Default |
+1,073,741,824 (1 GiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
After vdev removal, zfs_condense_min_mapping_bytes
is the lower
+limit for determining when to condense the in-memory obsolete space map.
+The condensing process will not continue unless a minimum of
+zfs_condense_min_mapping_bytes
of memory can be freed.
See also +zfs_condense_indirect_vdevs_enable
+zfs_condense_min_mapping_bytes |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+do not change |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+0 to MAX_ULONG |
+
Default |
+128 KiB |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_vdev_initializing_max_active
sets the maximum initializing I/Os
+active to each device.
zfs_vdev_initializing_max_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
+1 to +zfs_vdev_max_ +active |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_vdev_initializing_min_active
sets the minimum initializing I/Os
+active to each device.
zfs_vdev_initializing_min_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
++ |
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_vdev_removal_max_active
sets the maximum top-level vdev removal
+I/Os active to each device.
zfs_vdev_removal_max_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
+1 to +zfs_vdev +_max_active |
+
Default |
+2 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_vdev_removal_min_active
sets the minimum top-level vdev removal
+I/Os active to each device.
zfs_vdev_removal_min_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
++ |
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_vdev_trim_max_active
sets the maximum trim I/Os active to each
+device.
zfs_vdev_trim_max_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
+1 to +zfs_v +dev_max_active |
+
Default |
+2 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_vdev_trim_min_active
sets the minimum trim I/Os active to each
+device.
zfs_vdev_trim_min_active |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+uint32 |
+
Units |
+I/O operations |
+
Range |
++ |
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
When initializing a vdev, ZFS writes patterns of
+zfs_initialize_value
bytes to the device.
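For example, a hedged sketch (the pool name and the Linux module parameter path are placeholders) of changing the pattern before starting initialization:

# Sketch: write zeroes instead of the default 0xdeadbeef pattern, then initialize.
echo 0 > /sys/module/zfs/parameters/zfs_initialize_value
zpool initialize tank
zpool status tank    # shows per-vdev initialization progress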
zfs_initialize_value |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+when debugging initialization code |
+
Data Type |
+uint32 or uint64 |
+
Default |
+0xdeadbeef for 32-bit systems, +0xdeadbeefdeadbeee for 64-bit systems |
+
Change |
+prior to running zpool initialize |
+
Versions Affected |
+planned for v2 |
+
zfs_lua_max_instrlimit
limits the maximum time for a ZFS channel
+program to run.
zfs_lua_max_instrlimit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+to enforce a CPU usage limit on ZFS +channel programs |
+
Data Type |
+ulong |
+
Units |
+LUA instructions |
+
Range |
+0 to MAX_ULONG |
+
Default |
+100,000,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_lua_max_memlimit is the maximum memory limit for a ZFS channel program.
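As an illustration of how these module-wide maximums relate to individual channel programs, zfs program accepts per-invocation limits that may not exceed them (the pool name and script path below are placeholders):

# Sketch: run a channel program with explicit instruction and memory limits.
# -t must not exceed zfs_lua_max_instrlimit; -m must not exceed zfs_lua_max_memlimit.
zfs program -t 10000000 -m 10485760 tank /root/cleanup.zcp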
+zfs_lua_max_memlimit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+0 to MAX_ULONG |
+
Default |
+104,857,600 (100 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_max_dataset_nesting
limits the depth of nested datasets. Deeply
+nested datasets can overflow the stack. The maximum stack depth depends
+on kernel compilation options, so it is impractical to predict the
+possible limits. For kernels compiled with small stack sizes,
+zfs_max_dataset_nesting
may require changes.
zfs_max_dataset_nesting |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+can be tuned temporarily to fix existing +datasets that exceed the predefined limit |
+
Data Type |
+int |
+
Units |
+datasets |
+
Range |
+0 to MAX_INT |
+
Default |
+50 |
+
Change |
+Dynamic, though once on-disk the value +for the pool is set |
+
Versions Affected |
+planned for v2 |
+
zfs_ddt_data_is_special
enables the deduplication table (DDT) to
+reside on a special top-level vdev.
zfs_ddt_data_is_special |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+when using a special top-level vdev and +no dedup top-level vdev and it is desired +to store the DDT in the main pool +top-level vdevs |
+
Data Type |
+boolean |
+
Range |
+0=do not use special vdevs to store DDT, +1=store DDT in special vdevs |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
If special vdevs are in use, zfs_user_indirect_is_special
enables
+user data indirect blocks (a form of metadata) to be written to the
+special vdevs.
zfs_user_indirect_is_special |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+to force user data indirect blocks +to remain in the main pool top-level +vdevs |
+
Data Type |
+boolean |
+
Range |
+0=do not write user indirect blocks +to a special vdev, 1=write user +indirect blocks to a special vdev |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
After device removal, if an indirect split block contains more than
+zfs_reconstruct_indirect_combinations_max
many possible unique
+combinations when being reconstructed, it can be considered too
+computationally expensive to check them all. Instead, at most
+zfs_reconstruct_indirect_combinations_max
randomly-selected
+combinations are attempted each time the block is accessed. This allows
+all segment copies to participate fairly in the reconstruction when all
+combinations cannot be checked and prevents repeated use of one bad
+copy.
zfs_recon +struct_indirect_combinations_max |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+attempts |
+
Range |
+0=do not limit attempts, 1 to +MAX_INT = limit for attempts |
+
Default |
+4096 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_send_unmodified_spill_blocks
enables sending of unmodified spill
+blocks in the send stream. Under certain circumstances, previous
+versions of ZFS could incorrectly remove the spill block from an
+existing object. Including unmodified copies of the spill blocks creates
+a backwards compatible stream which will recreate a spill block if it
+was incorrectly removed.
zfs_send_unmodified_spill_blocks |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+boolean |
+
Range |
+0=do not send unmodified spill +blocks, 1=send unmodified spill +blocks |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_spa_discard_memory_limit
sets the limit for maximum memory used
+for prefetching a pool’s checkpoint space map on each vdev while
+discarding a pool checkpoint.
zfs_spa_discard_memory_limit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 to MAX_INT |
+
Default |
+16,777,216 (16 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_special_class_metadata_reserve_pct
sets a threshold for space in
+special vdevs to be reserved exclusively for metadata. This prevents
+small blocks or dedup table from completely consuming a special vdev.
zfs_special_class_metadata_reserve_pct |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+percent |
+
Range |
+0 to 100 |
+
Default |
+25 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_trim_extent_bytes_max
sets the maximum size of a trim (aka
+discard, scsi unmap) command. Ranges larger than
+zfs_trim_extent_bytes_max
are split into chunks no larger than
+zfs_trim_extent_bytes_max
bytes prior to being issued to the device.
+Use zpool iostat -w
to observe the latency of trim commands.
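For example (a sketch; the pool name and the Linux module parameter path are assumptions), the current limit can be checked and trim latency observed while a manual trim runs:

# Sketch: inspect the trim chunk limit, start a manual trim, and watch latency.
cat /sys/module/zfs/parameters/zfs_trim_extent_bytes_max
zpool trim tank
zpool iostat -w tank 5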
zfs_trim_extent_bytes_max |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+if the device can efficiently handle +larger trim requests |
+
Data Type |
+uint |
+
Units |
+bytes |
+
Range |
+zfs_trim_extent_by +tes_min +to MAX_UINT |
+
Default |
+134,217,728 (128 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_trim_extent_bytes_min
sets the minimum size of trim (aka
+discard, scsi unmap) commands. Trim ranges smaller than
+zfs_trim_extent_bytes_min
are skipped unless they’re part of a
+larger range which was broken into chunks. Some devices have
+performance degradation during trim operations, so using a larger
+zfs_trim_extent_bytes_min
can reduce the total amount of space
+trimmed. Use zpool iostat -w
to observe the latency of trim
+commands.
zfs_trim_extent_bytes_min |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+when trim is in use and device +performance suffers from trimming small +allocations |
+
Data Type |
+uint |
+
Units |
+bytes |
+
Range |
+0=trim all unallocated space, otherwise +minimum physical block size to MAX_ |
+
Default |
+32,768 (32 KiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_trim_metaslab_skip
enables uninitialized metaslabs to be
+skipped during the trim (aka discard, scsi unmap) process.
+zfs_trim_metaslab_skip
can be useful for pools constructed from
large thinly-provisioned devices where trim operations perform slowly. Use zpool iostat -w to observe the latency of trim commands.
+the latency of trim commands.zfs_trim_metaslab_skip |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+boolean |
+
Range |
+0=do not skip uninitialized metaslabs +during trim, 1=skip uninitialized +metaslabs during trim |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_trim_queue_limit
sets the maximum queue depth for leaf vdevs.
+See also zfs_vdev_trim_max_active and
+zfs_trim_extent_bytes_max Use
+zpool iostat -q
to observe trim queue depth.
zfs_trim_queue_limit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+to restrict the number of trim commands in the queue |
+
Data Type |
+uint |
+
Units |
+I/O operations |
+
Range |
+1 to MAX_UINT |
+
Default |
+10 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_trim_txg_batch
sets the number of transaction groups worth of
+frees which should be aggregated before trim (aka discard, scsi unmap)
+commands are issued to a device. This setting represents a trade-off
+between issuing larger, more efficient trim commands and the delay
+before the recently trimmed space is available for use by the device.
Increasing this value will allow frees to be aggregated for a longer +time. This will result is larger trim operations and potentially +increased memory usage. Decreasing this value will have the opposite +effect. The default value of 32 was empirically determined to be a +reasonable compromise.
+zfs_trim_txg_batch |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+metaslabs to stride |
+
Range |
+1 to MAX_UINT |
+
Default |
+32 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_vdev_aggregate_trim
allows trim I/Os to be aggregated. This is
+normally not helpful because the extents to be trimmed will have been
+already been aggregated by the metaslab.
zfs_vdev_aggregate_trim |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+when debugging trim code or trim +performance issues |
+
Data Type |
+boolean |
+
Range |
+0=do not attempt to aggregate trim +commands, 1=attempt to aggregate trim +commands |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_vdev_aggregation_limit_non_rotating
is the equivalent of
+zfs_vdev_aggregation_limit for devices
+which represent themselves as non-rotating to the Linux blkdev
+interfaces. Such devices have a value of 0 in
+/sys/block/DEVICE/queue/rotational
and are expected to be SSDs.
zfs_vde +v_aggregation_limit_non_rotating |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+0 to MAX_INT |
+
Default |
+131,072 bytes (128 KiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
ZFS uses barriers (volatile cache flush commands) to ensure data is +committed to permanent media by devices. This ensures consistent +on-media state for devices where caches are volatile (eg HDDs).
+zil_nocacheflush
disables the cache flush commands that are normally
+sent to devices by the ZIL after a log write has completed.
The difference between zil_nocacheflush
and
+zfs_nocacheflush is zil_nocacheflush
applies
+to ZIL writes while zfs_nocacheflush disables
+barrier writes to the pool devices at the end of transaction group syncs.
WARNING: setting this can cause ZIL corruption on power loss if the +device has a volatile write cache.
+zil_nocacheflush |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+If the storage device has nonvolatile cache, +then disabling cache flush can save the cost of +occasional cache flush commands |
+
Data Type |
+boolean |
+
Range |
+0=send cache flush commands, 1=do not send +cache flush commands |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zio_deadman_log_all
enables debugging messages for all ZFS I/Os,
+rather than only for leaf ZFS I/Os for a vdev. This is meant to be used
+by developers to gain diagnostic information for hang conditions which
+don’t involve a mutex or other locking primitive. Typically these are
+conditions where a thread in the zio pipeline is looping indefinitely.
See also zfs_dbgmsg_enable
+zio_deadman_log_all |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+when debugging ZFS I/O pipeline |
+
Data Type |
+boolean |
+
Range |
+0=do not log all deadman events, 1=log all +deadman events |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
If non-zero, zio_decompress_fail_fraction
represents the denominator
+of the probability that ZFS should induce a decompression failure. For
+instance, for a 5% decompression failure rate, this value should be set
+to 20.
zio_decompress_fail_fraction |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+when debugging ZFS internal +compressed buffer code |
+
Data Type |
+ulong |
+
Units |
+probability of induced decompression failure is 1/zio_decompress_fail_fraction |
+
Range |
+0 = do not induce failures, or 1 to +MAX_ULONG |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
An I/O operation taking more than zio_slow_io_ms
milliseconds to
+complete is marked as a slow I/O. Slow I/O counters can be observed with
+zpool status -s
. Each slow I/O causes a delay zevent, observable
+using zpool events
. See also zfs-events(5)
.
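A minimal sketch of using this parameter while investigating a suspect device (the pool name is a placeholder):

# Sketch: lower the slow I/O threshold to 10 seconds, then watch counters and events.
echo 10000 > /sys/module/zfs/parameters/zio_slow_io_ms
zpool status -s tank    # per-vdev slow I/O counters
zpool events -f         # follow zevents, including delay events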
zio_slow_io_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+when debugging slow devices and the default +value is inappropriate |
+
Data Type |
+int |
+
Units |
+milliseconds |
+
Range |
+0 to MAX_INT |
+
Default |
+30,000 (30 seconds) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
vdev_validate_skip
disables label validation steps during pool
+import. Changing is not recommended unless you know what you are doing
+and are recovering a damaged label.
vdev_validate_skip |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+do not change |
+
Data Type |
+boolean |
+
Range |
+0=validate labels during pool import, 1=do not +validate vdev labels during pool import |
+
Default |
+0 |
+
Change |
+prior to pool import |
+
Versions Affected |
+planned for v2 |
+
zfs_async_block_max_blocks
limits the number of blocks freed in a
+single transaction group commit. During deletes of large objects, such
+as snapshots, the number of freed blocks can cause the DMU to extend txg
+sync times well beyond zfs_txg_timeout.
+zfs_async_block_max_blocks
is used to limit these effects.
zfs_async_block_max_blocks |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+ulong |
+
Units |
+blocks |
+
Range |
+1 to MAX_ULONG |
+
Default |
+MAX_ULONG (do not limit) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_checksum_events_per_second
is a rate limit for checksum events.
+Note that this should not be set below the zed
thresholds (currently
+10 checksums over 10 sec) or else zed
may not trigger any action.
zfs_checksum_events_per_second |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+checksum events |
+
Range |
+
|
+
Default |
+20 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_disable_ivset_guid_check
disables requirement for IVset guids to
+be present and match when doing a raw receive of encrypted datasets.
+Intended for users whose pools were created with ZFS on Linux
+pre-release versions and now have compatibility issues.
For a ZFS raw receive, from a send stream created by zfs send --raw
,
+the crypt_keydata nvlist includes a to_ivset_guid to be set on the new
+snapshot. This value will override the value generated by the snapshot
+code. However, this value may not be present, because older
+implementations of the raw send code did not include this value. When
+zfs_disable_ivset_guid_check
is enabled, the receive proceeds and a
+newly-generated value is used.
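A sketch of the affected workflow (dataset names are placeholders); the parameter only matters when receiving raw streams produced by those older senders:

# Sketch: allow a raw receive to proceed when the stream lacks a to_ivset_guid.
echo 1 > /sys/module/zfs/parameters/zfs_disable_ivset_guid_check
zfs send --raw pool/enc@snap | zfs receive -u backup/enc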
zfs_disable_ivset_guid_check |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+debugging pre-release ZFS raw sends |
+
Data Type |
+boolean |
+
Range |
+0=check IVset guid, 1=do not check +IVset guid |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_obsolete_min_time_ms
is similar to
+zfs_free_min_time_ms and used for cleanup of
+old indirection records for vdevs removed using the zpool remove
+command.
zfs_obsolete_min_time_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+milliseconds |
+
Range |
+0 to MAX_INT |
+
Default |
+500 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_override_estimate_recordsize
overrides the default logic for
+estimating block sizes when doing a zfs send. The default heuristic is
+that the average block size will be the current recordsize.
zfs_override_estimate_recordsize |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+if most data in your dataset is +not of the current recordsize +and you require accurate zfs +send size estimates |
+
Data Type |
+ulong |
+
Units |
+bytes |
+
Range |
+0=do not override, 1 to +MAX_ULONG |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_remove_max_segment
sets the largest contiguous segment that ZFS
+attempts to allocate when removing a vdev. This can be no larger than
+16MB. If there is a performance problem with attempting to allocate
+large blocks, consider decreasing this. The value is rounded up to a
+power-of-2.
zfs_remove_max_segment |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+after removing a top-level vdev, consider +decreasing if there is a performance +degradation when attempting to allocate +large blocks |
+
Data Type |
+int |
+
Units |
+bytes |
+
Range |
+maximum of the physical block size of all +vdevs in the pool to 16,777,216 bytes (16 +MiB) |
+
Default |
+16,777,216 bytes (16 MiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_resilver_disable_defer
disables the resilver_defer
pool
+feature. The resilver_defer
feature allows ZFS to postpone new
+resilvers if an existing resilver is in progress.
zfs_resilver_disable_defer |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+if resilver postponement is not +desired due to overall resilver time +constraints |
+
Data Type |
+boolean |
+
Range |
+0=allow |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_scan_suspend_progress
causes a scrub or resilver scan to freeze
+without actually pausing.
zfs_scan_suspend_progress |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+testing or debugging scan code |
+
Data Type |
+boolean |
+
Range |
+0=do not freeze scans, 1=freeze scans |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
Scrubs are processed by the sync thread. While scrubbing at least
+zfs_scrub_min_time_ms
time is spent working on a scrub between txg
+syncs.
zfs_scrub_min_time_ms |
+Notes |
+
---|---|
Tags |
++ |
When to change |
++ |
Data Type |
+int |
+
Units |
+milliseconds |
+
Range |
+1 to (zfs_txg_timeout - 1) |
+
Default |
+1,000 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_slow_io_events_per_second
is a rate limit for slow I/O events.
+Note that this should not be set below the zed
thresholds (currently
+10 checksums over 10 sec) or else zed
may not trigger any action.
zfs_slow_io_events_per_second |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+slow I/O events |
+
Range |
+
|
+
Default |
+20 |
+
Change |
+Dynamic |
+
Versions Affected |
+planned for v2 |
+
zfs_vdev_min_ms_count
is the minimum number of metaslabs to create
+in a top-level vdev.
zfs_vdev_min_ms_count |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+metaslabs |
+
Range |
+16 to +zfs_vdev_m +s_count_limit |
+
Default |
+16 |
+
Change |
+prior to creating a pool or adding a +top-level vdev |
+
Versions Affected |
+planned for v2 |
+
zfs_vdev_ms_count_limit
is the practical upper limit for the number
+of metaslabs per top-level vdev.
zfs_vdev_ms_count_limit |
+Notes |
+
---|---|
Tags |
++ |
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+metaslabs |
+
Range |
+zfs_vdev +_min_ms_count +to 131,072 |
+
Default |
+131,072 |
+
Change |
+prior to creating a pool or adding a +top-level vdev |
+
Versions Affected |
+planned for v2 |
+
spl_hostid
is a unique system id number. It originated in Sun’s
+products where most systems had a unique id assigned at the factory.
+This assignment does not exist in modern hardware.spl_hostid
can be used to uniquely identify a system. By default
+this value is set to zero which indicates the hostid is disabled. It
+can be explicitly enabled by placing a unique non-zero value in the
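On Linux the hostid file is normally created with zgenhostid; a short sketch (the chosen value is arbitrary):

# Sketch: write an explicit hostid to /etc/hostid, then verify it.
zgenhostid 0x00bab10c
hostid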
+file shown in spl_hostid_pathspl_hostid |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+to uniquely identify a system when vdevs can be +shared across multiple systems |
+
Data Type |
+ulong |
+
Range |
+0=ignore hostid, 1 to 4,294,967,295 (32-bits or +0xffffffff) |
+
Default |
+0 |
+
Change |
+prior to importing pool |
+
Versions Affected |
+v0.6.1 |
+
spl_hostid_path
is the path name for a file that can contain a
+unique hostid. For testing purposes, spl_hostid_path
can be
+overridden by the ZFS_HOSTID environment variable.
spl_hostid_path |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+when creating a new ZFS distribution where the +default value is inappropriate |
+
Data Type |
+string |
+
Default |
+“/etc/hostid” |
+
Change |
+read-only, can only be changed prior to spl +module load |
+
Versions Affected |
+v0.6.1 |
+
Large kmem_alloc()
allocations fail if they exceed KMALLOC_MAX_SIZE,
+as determined by the kernel source. Allocations which are marginally
+smaller than this limit may succeed but should still be avoided due to
+the expense of locating a contiguous range of free pages. Therefore, a
+maximum kmem size with a reasonable safety margin of 4x is set.
+kmem_alloc()
allocations larger than this maximum will quickly fail.
+vmem_alloc()
allocations less than or equal to this value will use
+kmalloc()
, but shift to vmalloc()
when exceeding this value.
spl_kmem_alloc_max |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+bytes |
+
Range |
+TBD |
+
Default |
+KMALLOC_MAX_SIZE / 4 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
As a general rule kmem_alloc()
allocations should be small,
+preferably just a few pages since they must be physically contiguous.
+Therefore, a rate limited warning is printed to the console for any
+kmem_alloc()
which exceeds the threshold spl_kmem_alloc_warn.
The default warning threshold is set to eight pages but capped at 32K to accommodate systems using large pages. This value was selected to be small enough to ensure the largest allocations are quickly noticed and fixed, but large enough to avoid logging any warnings when an allocation size is larger than optimal but not a serious concern. Since this value is tunable, developers are encouraged to set it lower when testing so any new largish allocations are quickly caught. These warnings may be disabled by setting the threshold to zero.
+spl_kmem_alloc_warn |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+developers are encouraged lower when testing +so any new, large allocations are quickly +caught |
+
Data Type |
+uint |
+
Units |
+bytes |
+
Range |
+0=disable the warnings, |
+
Default |
+32,768 (32 KiB) |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
Cache expiration is part of default illumos cache behavior. The idea is +that objects in magazines which have not been recently accessed should +be returned to the slabs periodically. This is known as cache aging and +when enabled objects will be typically returned after 15 seconds.
+On the other hand Linux slabs are designed to never move objects back to +the slabs unless there is memory pressure. This is possible because +under Linux the cache will be notified when memory is low and objects +can be released.
+By default only the Linux method is enabled. It has been shown to
+improve responsiveness on low memory systems and not negatively impact
+the performance of systems with more memory. This policy may be changed
+by setting the spl_kmem_cache_expire
bit mask as follows, both
+policies may be enabled concurrently.
spl_kmem_cache_expire |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+TBD |
+
Data Type |
+bitmask |
+
Range |
+0x01 - Aging (illumos), 0x02 - Low memory (Linux) |
+
Default |
+0x02 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.6.1 to v0.8.x |
+
Depending on the size of a memory cache object it may be backed by
+kmalloc()
or vmalloc()
memory. This is because the size of the
+required allocation greatly impacts the best way to allocate the memory.
When objects are small and only a small number of memory pages need to
+be allocated, ideally just one, then kmalloc()
is very efficient.
+However, allocating multiple pages with kmalloc()
gets increasingly
+expensive because the pages must be physically contiguous.
For this reason we shift to vmalloc()
for slabs of large objects
+which removes the need for contiguous pages. vmalloc()
cannot
+be used in all cases because there is significant locking overhead
+involved. This function takes a single global lock over the entire
+virtual address range which serializes all allocations. Using slightly
+different allocation functions for small and large objects allows us to
+handle a wide range of object sizes.
The spl_kmem_cache_kmem_limit
value is used to determine this cutoff
+size. One quarter of the kernel’s compiled PAGE_SIZE is used as the
+default value because
+spl_kmem_cache_obj_per_slab defaults
+to 16. With these default values, at most four contiguous pages are
+allocated.
spl_kmem_cache_kmem_limit |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+pages |
+
Range |
+TBD |
+
Default |
+PAGE_SIZE / 4 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 to v0.8.x |
+
spl_kmem_cache_max_size
is the maximum size of a kmem cache slab in
+MiB. This effectively limits the maximum cache object size to
+spl_kmem_cache_max_size
/
+spl_kmem_cache_obj_per_slab Kmem
+caches may not be created with an object size larger than this limit.
spl_kmem_cache_max_size |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+MiB |
+
Range |
+TBD |
+
Default |
+4 for 32-bit kernel, 32 for 64-bit kernel |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
spl_kmem_cache_obj_per_slab
is the preferred number of objects per
+slab in the kmem cache. In general, a larger value will increase the
+caches memory footprint while decreasing the time required to perform an
+allocation. Conversely, a smaller value will minimize the footprint and
+improve cache reclaim time but individual allocations may take longer.
spl_kmem_cache_obj_per_slab |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+kmem cache objects |
+
Range |
+TBD |
+
Default |
+8 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 to v0.8.x |
+
spl_kmem_cache_obj_per_slab_min
is the minimum number of objects
+allowed per slab. Normally slabs will contain
+spl_kmem_cache_obj_per_slab objects
+but for caches that contain very large objects it’s desirable to only
+have a few, or even just one, object per slab.
spl_kmem_cache_obj_per_slab_min |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+debugging kmem cache operations |
+
Data Type |
+uint |
+
Units |
+kmem cache objects |
+
Range |
+TBD |
+
Default |
+1 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
spl_kmem_cache_reclaim
prevents Linux from being able to rapidly
+reclaim all the memory held by the kmem caches. This may be useful in
+circumstances where it’s preferable that Linux reclaim memory from some
+other subsystem first. Setting spl_kmem_cache_reclaim
increases the
+likelihood of out-of-memory events on a memory constrained system.
spl_kmem_cache_reclaim |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+TBD |
+
Data Type |
+boolean |
+
Range |
+0=enable rapid memory reclaim from kmem +caches, 1=disable rapid memory reclaim +from kmem caches |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
For small objects the Linux slab allocator should be used to make the
+most efficient use of the memory. However, large objects are not
+supported by the Linux slab allocator and therefore the SPL
+implementation is preferred. spl_kmem_cache_slab_limit
is used to
+determine the cutoff between a small and large object.
Objects of spl_kmem_cache_slab_limit
or smaller will be allocated
+using the Linux slab allocator, large objects use the SPL allocator. A
+cutoff of 16 KiB was determined to be optimal for architectures using 4
+KiB pages.
spl_kmem_cache_slab_limit |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+bytes |
+
Range |
+TBD |
+
Default |
+16,384 (16 KiB) when kernel PAGE_SIZE = +4KiB, 0 for other PAGE_SIZE values |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
spl_max_show_tasks
is the limit of tasks per pending list in each
+taskq shown in /proc/spl/taskq
and /proc/spl/taskq-all
. Reading
the ProcFS files walks the lists with the lock held, which could cause a lockup if a list grows too large. If a list is larger than the limit, the string "(truncated)" is printed.
spl_max_show_tasks |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+TBD |
+
Data Type |
+uint |
+
Units |
+tasks reported |
+
Range |
+0 disables the limit, 1 to MAX_UINT |
+
Default |
+512 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
spl_panic_halt
enables kernel panic upon assertion failures. When
+not enabled, the asserting thread is halted to facilitate further
+debugging.
spl_panic_halt |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+when debugging assertions and kernel core dumps +are desired |
+
Data Type |
+boolean |
+
Range |
+0=halt thread upon assertion, 1=panic kernel +upon assertion |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
Upon writing a non-zero value to spl_taskq_kick
, all taskqs are
+scanned. If any taskq has a pending task more than 5 seconds old, the
+taskq spawns more threads. This can be useful in rare deadlock
+situations caused by one or more taskqs not spawning a thread when it
+should.
spl_taskq_kick |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+See description above |
+
Data Type |
+uint |
+
Units |
+N/A |
+
Default |
+0 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
spl_taskq_thread_bind
enables binding taskq threads to specific
+CPUs, distributed evenly over the available CPUs. By default, this
+behavior is disabled to allow the Linux scheduler the maximum
+flexibility to determine where a thread should run.
spl_taskq_thread_bind |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+when debugging CPU scheduling options |
+
Data Type |
+boolean |
+
Range |
+0=taskqs are not bound to specific CPUs, +1=taskqs are bound to CPUs |
+
Default |
+0 |
+
Change |
+prior to loading spl kernel module |
+
Versions Affected |
+v0.7.0 |
+
spl_taskq_thread_dynamic
enables dynamic taskqs. A taskq created with the TASKQ_DYNAMIC flag will by default create only a single thread. New threads will be
+created on demand up to a maximum allowed number to facilitate the
+completion of outstanding tasks. Threads which are no longer needed are
+promptly destroyed. By default this behavior is enabled but it can be disabled.
See also +zfs_zil_clean_taskq_nthr_pct, +zio_taskq_batch_pct
+spl_taskq_thread_dynamic |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+disable for performance analysis or +troubleshooting |
+
Data Type |
+boolean |
+
Range |
+0=taskq threads are not dynamic, 1=taskq +threads are dynamically created and +destroyed |
+
Default |
+1 |
+
Change |
+prior to loading spl kernel module |
+
Versions Affected |
+v0.7.0 |
+
spl_taskq_thread_priority
allows newly created taskq threads to
+set a non-default scheduler priority. When enabled the priority
+specified when a taskq is created will be applied to all threads
+created by that taskq.spl_taskq_thread_priority |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+when troubleshooting CPU +scheduling-related performance issues |
+
Data Type |
+boolean |
+
Range |
+0=taskq threads use the default Linux +kernel thread priority, 1= |
+
Default |
+1 |
+
Change |
+prior to loading spl kernel module |
+
Versions Affected |
+v0.7.0 |
+
spl_taskq_thread_sequential
is the number of items a taskq worker
+thread must handle without interruption before requesting a new worker
+thread be spawned. spl_taskq_thread_sequential
controls how quickly
+taskqs ramp up the number of threads processing the queue. Because Linux
+thread creation and destruction are relatively inexpensive a small
+default value has been selected. Thus threads are created aggressively,
+which is typically desirable. Increasing this value results in a slower
+thread creation rate which may be preferable for some configurations.
spl_taskq_thread_sequential |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+TBD |
+
Data Type |
+int |
+
Units |
+taskq items |
+
Range |
+1 to MAX_INT |
+
Default |
+4 |
+
Change |
+Dynamic |
+
Versions Affected |
+v0.7.0 |
+
spl_kmem_cache_kmem_threads
shows the current number of
+spl_kmem_cache
threads. This task queue is responsible for
+allocating new slabs for use by the kmem caches. For the majority of
+systems and workloads only a small number of threads are required.
spl_kmem_cache_kmem_threads |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
+read-only |
+
Data Type |
+int |
+
Range |
+1 to MAX_INT |
+
Units |
+threads |
+
Default |
+4 |
+
Change |
+read-only, can only be changed prior +to spl module load |
+
Versions Affected |
+v0.7.0 |
+
spl_kmem_cache_magazine_size
controls the cache magazine size. Cache magazines are
+an optimization designed to minimize the cost of allocating memory. They
+do this by keeping a per-cpu cache of recently freed objects, which can
+then be reallocated without taking a lock. This can improve performance
+on highly contended caches. However, because objects in magazines will
+prevent otherwise empty slabs from being immediately released this may
+not be ideal for low memory machines.
For this reason spl_kmem_cache_magazine_size can be used to set a +maximum magazine size. When this value is set to 0 the magazine size +will be automatically determined based on the object size. Otherwise +magazines will be limited to 2-256 objects per magazine (eg per CPU). +Magazines cannot be disabled entirely in this implementation.
+spl_kmem_cache_magazine_size |
+Notes |
+
---|---|
Tags |
++ |
Kernel module |
+spl |
+
When to change |
++ |
Data Type |
+int |
+
Units |
+threads |
+
Range |
+0=automatically scale magazine size, +otherwise 2 to 256 |
+
Default |
+0 |
+
Change |
+read-only, can only be changed prior +to spl module load |
+
Versions Affected |
+v0.7.0 |
+
Below are tips for various workloads.
+ +Descriptions of ZFS internals that have an effect on application +performance follow.
+For decades, operating systems have used RAM as a cache to avoid the +necessity of waiting on disk IO, which is extremely slow. This concept +is called page replacement. Until ZFS, virtually all filesystems used +the Least Recently Used (LRU) page replacement algorithm in which the +least recently used pages are the first to be replaced. Unfortunately, +the LRU algorithm is vulnerable to cache flushes, where a brief change +in workload that occurs occasionally removes all frequently used data +from cache. The Adaptive Replacement Cache (ARC) algorithm was +implemented in ZFS to replace LRU. It solves this problem by maintaining +four lists:
+A list for recently cached entries.
A list for recently cached entries that have been accessed more than +once.
A list for entries evicted from #1.
A list of entries evicited from #2.
Data is evicted from the first list while an effort is made to keep data +in the second list. In this way, ARC is able to outperform LRU by +providing a superior hit rate.
+In addition, a dedicated cache device (typically a SSD) can be added to
+the pool, with
+zpool add POOLNAME cache DEVICENAME
. The cache
+device is managed by the L2ARC, which scans entries that are next to be
+evicted and writes them to the cache device. The data stored in ARC and
+L2ARC can be controlled via the primarycache
and secondarycache
+zfs properties respectively, which can be set on both zvols and
+datasets. Possible settings are all
, none
and metadata
. It
+is possible to improve performance when a zvol or dataset hosts an
+application that does its own caching by caching only metadata. One
+example would be a virtual machine using ZFS. Another would be a
+database system which manages its own cache (Oracle for instance).
+PostgreSQL, by contrast, depends on the OS-level file cache for the
+majority of cache.
Top-level vdevs contain an internal property called ashift, which stands
+for alignment shift. It is set at vdev creation and it is immutable. It
+can be read using the zdb
command. It is calculated as the maximum
+base 2 logarithm of the physical sector size of any child vdev and it
+alters the disk format such that writes are always done according to it.
+This makes 2^ashift the smallest possible IO on a vdev. Configuring
+ashift correctly is important because partial sector writes incur a
+penalty where the sector must be read into a buffer before it can be
+written. ZFS makes the implicit assumption that the sector size reported
+by drives is correct and calculates ashift based on that.
In an ideal world, physical sector size is always reported correctly and +therefore, this requires no attention. Unfortunately, this is not the +case. The sector size on all storage devices was 512-bytes prior to the +creation of flash-based solid state drives. Some operating systems, such +as Windows XP, were written under this assumption and will not function +when drives report a different sector size.
+Flash-based solid state drives came to market around 2007. These devices +report 512-byte sectors, but the actual flash pages, which roughly +correspond to sectors, are never 512-bytes. The early models used +4096-byte pages while the newer models have moved to an 8192-byte page. +In addition, “Advanced Format” hard drives have been created which also +use a 4096-byte sector size. Partial page writes suffer from similar +performance degradation as partial sector writes. In some cases, the +design of NAND-flash makes the performance degradation even worse, but +that is beyond the scope of this description.
+Reporting the correct sector sizes is the responsibility the block +device layer. This unfortunately has made proper handling of devices +that misreport drives different across different platforms. The +respective methods are as follows:
+sd.conf +on illumos
gnop(8) +on FreeBSD; see for example FreeBSD on 4K sector +drives +(2011-01-01)
ashift= +on ZFS on Linux
-o ashift= also works with both MacZFS (pool version 8) and ZFS-OSX +(pool version 5000).
-o ashift= is convenient, but it is flawed in that the creation of pools +containing top level vdevs that have multiple optimal sector sizes +require the use of multiple commands. A newer +syntax +that will rely on the actual sector sizes has been discussed as a cross +platform replacement and will likely be implemented in the future.
+In addition, there is a database of +drives known to misreport sector +sizes +to the ZFS on Linux project. It is used to automatically adjust ashift +without the assistance of the system administrator. This approach is +unable to fully compensate for misreported sector sizes whenever drive +identifiers are used ambiguously (e.g. virtual machines, iSCSI LUNs, +some rare SSDs), but it does a great amount of good. The format is +roughly compatible with illumos’ sd.conf and it is expected that other +implementations will integrate the database in future releases. Strictly +speaking, this database does not belong in ZFS, but the difficulty of +patching the Linux kernel (especially older ones) necessitated that this +be implemented in ZFS itself for Linux. The same is true for MacZFS. +However, FreeBSD and illumos are both able to implement this in the +correct layer.
+Internally, ZFS allocates data using multiples of the device’s sector
+size, typically either 512 bytes or 4KB (see above). When compression is
+enabled, a smaller number of sectors can be allocated for each block.
+The uncompressed block size is set by the recordsize
(defaults to
+128KB) or volblocksize
(defaults to 8KB) property (for filesystems
+vs volumes).
The following compression algorithms are available:
+LZ4
+New algorithm added after feature flags were created. It is +significantly superior to LZJB in all metrics tested. It is new +default compression algorithm +(compression=on) in OpenZFS. +It is available on all platforms as of 2020.
LZJB
+Original default compression algorithm (compression=on) for ZFS. +It was created to satisfy the desire for a compression algorithm +suitable for use in filesystems. Specifically, that it provides +fair compression, has a high compression speed, has a high +decompression speed and detects incompressible data +quickly.
GZIP (1 through 9)
+Classic Lempel-Ziv implementation. It provides high compression, +but it often makes IO CPU-bound.
ZLE (Zero Length Encoding)
+A very simple algorithm that only compresses zeroes.
ZSTD (Zstandard)
+Zstandard is a modern, high performance, general compression +algorithm which provides similar or better compression levels to +GZIP, but with much better performance. Zstandard offers a very +wide range of performance/compression trade-off, and is backed by +an extremely fast decoder. +It is available from OpenZFS 2.0 version.
If you want to use compression and are uncertain which to use, use LZ4. +It averages a 2.1:1 compression ratio while gzip-1 averages 2.7:1, but +gzip is much slower. Both figures are obtained from testing by the LZ4 +project on the Silesia corpus. The +greater compression ratio of gzip is usually only worthwhile for rarely +accessed data.
+Choose a RAID-Z stripe width based on your IOPS needs and the amount of +space you are willing to devote to parity information. If you need more +IOPS, use fewer disks per stripe. If you need more usable space, use +more disks per stripe. Trying to optimize your RAID-Z stripe width based +on exact numbers is irrelevant in nearly all cases. See this blog +post +for more details.
+ZFS datasets use an internal recordsize of 128KB by default. The dataset +recordsize is the basic unit of data used for internal copy-on-write on +files. Partial record writes require that data be read from either ARC +(cheap) or disk (expensive). recordsize can be set to any power of 2 +from 512 bytes to 1 megabyte. Software that writes in fixed record +sizes (e.g. databases) will benefit from the use of a matching +recordsize.
+Changing the recordsize on a dataset will only take effect for new +files. If you change the recordsize because your application should +perform better with a different one, you will need to recreate its +files. A cp followed by a mv on each file is sufficient. Alternatively, +send/recv should recreate the files with the correct recordsize when a +full receive is done.
+Record sizes of up to 16M are supported with the large_blocks pool +feature, which is enabled by default on new pools on systems that +support it.
+Record sizes larger than 1M were disabled by default +before openZFS v2.2, +unless the zfs_max_recordsize kernel module parameter was set to allow +sizes higher than 1M.
+`zfs send` operations must specify -L +to ensure that larger than 128KB blocks are sent and the receiving pools +must support the large_blocks feature.
+Zvols have a volblocksize
property that is analogous to recordsize
.
+Current default (16KB since v2.2) balances the metadata overhead, compression
+opportunities and decent space efficiency on majority of pool configurations
+due to 4KB disk physical block rounding (especially on RAIDZ and DRAID),
+while incurring some write amplification on guest FSes that run with smaller
+block sizes [7].
Users are advised to test their scenarios and see whether the volblocksize
+needs to be changed to favor one or the other:
sector alignment of guest FS is crucial
most of guest FSes use default block size of 4-8KB, so:
+Larger volblocksize
can help with mostly sequential workloads and
+will gain a compression efficiency
Smaller volblocksize
can help with random workloads and minimize
+IO amplification, but will use more metadata
+(e.g. more small IOs will be generated by ZFS) and may have worse
+space efficiency (especially on RAIDZ and DRAID)
It’s meaningless to set volblocksize
less than guest FS’s block size
+or ashift
See Dataset recordsize +for additional information
Deduplication uses an on-disk hash table, using extensible
+hashing as
+implemented in the ZAP (ZFS Attribute Processor). Each cached entry uses
+slightly more than 320 bytes of memory. The DDT code relies on ARC for
+caching the DDT entries, such that there is no double caching or
+internal fragmentation from the kernel memory allocator. Each pool has a
+global deduplication table shared across all datasets and zvols on which
+deduplication is enabled. Each entry in the hash table is a record of a
+unique block in the pool. (Where the block size is set by the
+recordsize
or volblocksize
properties.)
The hash table (also known as the DDT or DeDup Table) must be accessed +for every dedup-able block that is written or freed (regardless of +whether it has multiple references). If there is insufficient memory for +the DDT to be cached in memory, each cache miss will require reading a +random block from disk, resulting in poor performance. For example, if +operating on a single 7200RPM drive that can do 100 io/s, uncached DDT +reads would limit overall write throughput to 100 blocks per second, or +400KB/s with 4KB blocks.
+The consequence is that sufficient memory to store deduplication data is
+required for good performance. The deduplication data is considered
+metadata and therefore can be cached if the primarycache
or
+secondarycache
properties are set to metadata
. In addition, the
+deduplication table will compete with other metadata for metadata
+storage, which can have a negative effect on performance. Simulation of
+the number of deduplication table entries needed for a given pool can be
+done using the -D option to zdb. Then a simple multiplication by
+320-bytes can be done to get the approximate memory requirements.
+Alternatively, you can estimate an upper bound on the number of unique
+blocks by dividing the amount of storage you plan to use on each dataset
+(taking into account that partial records each count as a full
+recordsize for the purposes of deduplication) by the recordsize and each
+zvol by the volblocksize, summing and then multiplying by 320-bytes.
ZFS top level vdevs are divided into metaslabs from which blocks can be +independently allocated so allow for concurrent IOs to perform +allocations without blocking one another. At present, there is a +regression on the +Linux and Mac OS X ports that causes serialization to occur.
+By default, the selection of a metaslab is biased toward lower LBAs to +improve performance of spinning disks, but this does not make sense on +solid state media. This behavior can be adjusted globally by setting the +ZFS module’s global metaslab_lba_weighting_enabled tuanble to 0. This +tunable is only advisable on systems that only use solid state media for +pools.
+The metaslab allocator will allocate blocks on a first-fit basis when a
+metaslab has more than or equal to 4 percent free space and a best-fit
+basis when a metaslab has less than 4 percent free space. The former is
+much faster than the latter, but it is not possible to tell when this
+behavior occurs from the pool’s free space. However, the command zdb
+-mmm $POOLNAME
will provide this information.
If small random IOPS are of primary importance, mirrored vdevs will +outperform raidz vdevs. Read IOPS on mirrors will scale with the number +of drives in each mirror while raidz vdevs will each be limited to the +IOPS of the slowest drive.
+If sequential writes are of primary importance, raidz will outperform +mirrored vdevs. Sequential write throughput increases linearly with the +number of data disks in raidz while writes are limited to the slowest +drive in mirrored vdevs. Sequential read performance should be roughly +the same on each.
+Both IOPS and throughput will increase by the respective sums of the +IOPS and throughput of each top level vdev, regardless of whether they +are raidz or mirrors.
+ZFS will behave differently on different platforms when given a whole +disk.
+On illumos, ZFS attempts to enable the write cache on a whole disk. The +illumos UFS driver cannot ensure integrity with the write cache enabled, +so by default Sun/Solaris systems using UFS file system for boot were +shipped with drive write cache disabled (long ago, when Sun was still an +independent company). For safety on illumos, if ZFS is not given the +whole disk, it could be shared with UFS and thus it is not appropriate +for ZFS to enable write cache. In this case, the write cache setting is +not changed and will remain as-is. Today, most vendors ship drives with +write cache enabled by default.
+On Linux, the Linux IO elevator is largely redundant given that ZFS has +its own IO elevator.
+ZFS will also create a GPT partition table own partitions when given a +whole disk under illumos on x86/amd64 and on Linux. This is mainly to +make booting through UEFI possible because UEFI requires a small FAT +partition to be able to boot the system. The ZFS driver will be able to +tell the difference between whether the pool had been given the entire +disk or not via the whole_disk field in the label.
+This is not done on FreeBSD. Pools created on FreeBSD will always have the whole_disk field set to true, such that a pool created on FreeBSD and imported on another platform will always be treated as if the whole disks were given to ZFS.
+Some Linux distributions (at least Debian, Ubuntu) enable
+init_on_alloc
option as security precaution by default.
+This option can help to [6]:
++prevent possible information leaks and +make control-flow bugs that depend on uninitialized values more +deterministic.
+
Unfortunately, it can lower ARC throughput considerably +(see bug).
+If you’re ready to cope with these security risks [6],
+you may disable it
+by setting init_on_alloc=0
in the GRUB kernel boot parameters.
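+A minimal sketch of how this can be done on a Debian/Ubuntu-style system with GRUB (file locations and the update command vary by distribution):
+# /etc/default/grub
+GRUB_CMDLINE_LINUX_DEFAULT="quiet init_on_alloc=0"
+
+$ sudo update-grub    # regenerate grub.cfg, then reboot for the change to take effect
+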
Make sure that you create your pools such that the vdevs have the correct alignment shift for your storage device’s sector size. If dealing with flash media, this is going to be either 12 (4K sectors) or 13 (8K sectors). For SSD ephemeral storage on Amazon EC2, the proper setting is 12.
+Set either relatime=on or atime=off to minimize the IOs used to update access time stamps. For backward compatibility with the small percentage of software that relies on access times, relatime is preferred when available and should be set on your entire pool. atime=off should be used more selectively.
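+For example, on a hypothetical pool named tank with a scratch dataset:
+$ zfs set relatime=on tank            # inherited by every dataset in the pool
+$ zfs set atime=off tank/scratch      # disable atime selectively where nothing needs it
+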
+Keep pool free space above 10% to prevent many metaslabs from reaching the 4% free space threshold that switches allocation from first-fit to best-fit. When the threshold is hit, the metaslab allocator becomes very CPU intensive in an attempt to protect itself from fragmentation. This reduces IOPS, especially as more metaslabs reach the 4% threshold.
+The recommendation is 10% rather than 5% because metaslab selection considers both location and free space unless the global metaslab_lba_weighting_enabled tunable is set to 0. When that tunable is 0, ZFS will consider only free space, so the expense of the best-fit allocator can be avoided by keeping free space above 5%. That setting should only be used on systems with pools that consist of solid state drives, because it will reduce sequential IO performance on mechanical disks.
+Set compression=lz4 on your pools’ root datasets so that all datasets inherit it, unless you have a reason not to enable it. Userland tests of single-threaded LZ4 compression of incompressible data have shown that it can process 10GB/sec, so it is unlikely to be a bottleneck even on incompressible data. Furthermore, incompressible data will be stored without compression, so reads of incompressible data with compression enabled are not subject to decompression. Writes are so fast that incompressible data is unlikely to see a performance penalty from the use of LZ4 compression. The reduction in IO from LZ4 will typically be a performance win.
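+For example, on a hypothetical pool named tank:
+$ zfs set compression=lz4 tank     # child datasets inherit the property
+$ zfs get -r compression tank      # verify what each dataset inherited
+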
+Note that larger record sizes will increase compression ratios on +compressible data by allowing compression algorithms to process more +data at a time.
+Do not put more than ~16 disks in raidz. The rebuild times on mechanical +disks will be excessive when the pool is full.
+If your workload involves fsync or O_SYNC and your pool is backed by +mechanical storage, consider adding one or more SLOG devices. Pools that +have multiple SLOG devices will distribute ZIL operations across them. +The best choice for SLOG device(s) are likely Optane / 3D XPoint SSDs. +See Optane / 3D XPoint SSDs +for a description of them. If an Optane / 3D XPoint SSD is an option, +the rest of this section on synchronous I/O need not be read. If Optane +/ 3D XPoint SSDs is not an option, see +NAND Flash SSDs for suggestions +for NAND flash SSDs and also read the information below.
+To ensure maximum ZIL performance on NAND flash SSD-based SLOG devices, you should also overprovision spare area to increase IOPS [1]. Only about 4GB is needed, so the rest can be left as overprovisioned storage. The choice of 4GB is somewhat arbitrary. Most systems do not write anything close to 4GB to the ZIL between transaction group commits, so overprovisioning all storage beyond the 4GB partition should be alright. If a workload needs more, then make it no more than the maximum ARC size. Even under extreme workloads, ZFS will not benefit from more SLOG storage than the maximum ARC size. That is half of system memory on Linux and 3/4 of system memory on illumos.
+You can do this with a mix of a secure erase and a partition table +trick, such as the following:
+Run a secure erase on the NAND-flash SSD.
Create a partition table on the NAND-flash SSD.
Create a 4GB partition.
Give the partition to ZFS to use as a log device.
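+A minimal sketch of those steps, assuming the pool is named tank and the SSD appears as the hypothetical device link /dev/disk/by-id/ata-EXAMPLE-SSD (the secure erase itself is vendor-specific and omitted here):
+$ sgdisk --zap-all /dev/disk/by-id/ata-EXAMPLE-SSD          # fresh GPT partition table
+$ sgdisk -n 1:0:+4G /dev/disk/by-id/ata-EXAMPLE-SSD         # 4GB partition, leave the rest unallocated
+$ zpool add tank log /dev/disk/by-id/ata-EXAMPLE-SSD-part1
+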
If using the secure erase and partition table trick, do not use the +unpartitioned space for other things, even temporarily. That will reduce +or eliminate the overprovisioning by marking pages as dirty.
+Alternatively, some devices allow you to change the sizes that they report. This would also work, although a secure erase should be done prior to changing the reported size to ensure that the SSD recognizes the additional spare area. Changing the reported size can be done on drives that support it with `hdparm -N` on systems that have laptop-mode-tools.
+On NVMe, you can use namespaces to achieve overprovisioning:
+Do a sanitize command as a precaution to ensure the device is +completely clean.
Delete the default namespace.
Create a new namespace of size 4GB.
Give the namespace to ZFS to use as a log device, e.g. zpool add tank log /dev/nvme1n1
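+A hypothetical sketch of those four steps using nvme-cli, assuming the controller is /dev/nvme1 with a 512-byte LBA format and controller ID 0 (4GB is 8,388,608 512-byte blocks; verify the controller ID, LBA format, and namespace ID for your device before using any of this):
+$ nvme sanitize /dev/nvme1 --sanact=2                                # block-erase sanitize
+$ nvme delete-ns /dev/nvme1 --namespace-id=1                         # remove the default namespace
+$ nvme create-ns /dev/nvme1 --nsze=8388608 --ncap=8388608 --flbas=0  # new 4GB namespace
+$ nvme attach-ns /dev/nvme1 --namespace-id=1 --controllers=0
+$ zpool add tank log /dev/nvme1n1
+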
Whole disks should be given to ZFS rather than partitions. If you must +use a partition, make certain that the partition is properly aligned to +avoid read-modify-write overhead. See the section on +Alignment Shift (ashift) +for a description of proper alignment. Also, see the section on +Whole Disks versus Partitions +for a description of changes in ZFS behavior when operating on a +partition.
+Single disk RAID 0 arrays from RAID controllers are not equivalent to +whole disks. The Hardware RAID controllers page +explains in detail.
+BitTorrent performs 16KB random reads/writes. The 16KB writes cause read-modify-write overhead, which can reduce performance by a factor of 16 with 128KB record sizes when the amount of data written exceeds system memory. This can be avoided by using a dedicated dataset for BitTorrent downloads with recordsize=16K.
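+For example, assuming a pool named tank and a hypothetical download dataset:
+$ zfs create -o recordsize=16K tank/torrents
+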
+When the files are later read sequentially through an HTTP server, the random order in which the files were written creates fragmentation that has been observed to reduce sequential read performance by a factor of two on 7200RPM hard disks. If performance is a problem, fragmentation can be eliminated by rewriting the files sequentially in either of two ways:
+The first method is to configure your client to download the files to a +temporary directory and then copy them into their final location when +the downloads are finished, provided that your client supports this.
+The second method is to use send/recv to recreate a dataset +sequentially.
+In practice, defragmenting files obtained through bit torrent should +only improve performance when the files are stored on magnetic storage +and are subject to significant sequential read workloads after creation.
+Setting redundant_metadata=most
can increase IOPS by at least a few
+percentage points by eliminating redundant metadata at the lowest level
+of the indirect block tree. This comes with the caveat that data loss
+will occur if a metadata block pointing to data blocks is corrupted and
+there are no duplicate copies, but this is generally not a problem in
+production on mirrored or raidz vdevs.
Make separate datasets for InnoDB’s data files and log files. Set
+recordsize=16K
on InnoDB’s data files to avoid expensive partial record
+writes and leave recordsize=128K on the log files. Set
+primarycache=metadata
on both to prefer InnoDB’s
+caching [2].
+Set logbias=throughput
on the data to stop ZIL from writing twice.
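+A sketch of the above, assuming a pool named tank; the dataset names and mountpoints are illustrative only:
+$ zfs create -o recordsize=16K -o primarycache=metadata -o logbias=throughput \
+    -o mountpoint=/var/lib/mysql tank/mysql-data
+$ zfs create -o recordsize=128K -o primarycache=metadata \
+    -o mountpoint=/var/lib/mysql-log tank/mysql-log
+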
Set skip-innodb_doublewrite
in my.cnf to prevent innodb from writing
+twice. The double writes are a data integrity feature meant to protect
+against corruption from partially-written records, but those are not
+possible on ZFS. It should be noted that Percona’s
+blog had advocated
+using an ext4 configuration where double writes were
+turned off for a performance gain, but later recanted it because it
+caused data corruption. Following a well-timed power failure, an in-place
+filesystem such as ext4 can have half of an 8KB record be old while
+the other half would be new. This would be the corruption that caused
+Percona to recant its advice. However, ZFS’ copy on write design would
+cause it to return the old correct data following a power failure (no
+matter what the timing is). That prevents the corruption that the double
+write feature is intended to prevent from ever happening. The double
+write feature is therefore unnecessary on ZFS and can be safely turned
+off for better performance.
On Linux, the driver’s AIO implementation is a compatibility shim that
+just barely passes the POSIX standard. InnoDB performance suffers when
+using its default AIO codepath. Set innodb_use_native_aio=0
and
+innodb_use_atomic_writes=0
in my.cnf to disable AIO. Both of these
+settings must be disabled to disable AIO.
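+Taken together, the my.cnf settings discussed above might look like the following sketch (adjust for your installation; option availability varies between MySQL, MariaDB, and Percona versions):
+[mysqld]
+skip-innodb_doublewrite
+innodb_use_native_aio = 0
+innodb_use_atomic_writes = 0
+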
Make separate datasets for PostgreSQL’s data and WAL. Set
+compression=lz4
and recordsize=32K
(64K also works well, as
+does the 128K default) on both. Configure full_page_writes = off
+for PostgreSQL, as ZFS will never commit a partial write. For a database
+with large updates, experiment with logbias=throughput
on
+PostgreSQL’s data to avoid writing twice, but be aware that with this
+setting smaller updates can cause severe fragmentation.
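+A sketch of the above, assuming a pool named tank and illustrative mountpoints:
+$ zfs create -o recordsize=32K -o compression=lz4 \
+    -o mountpoint=/srv/pgdata tank/pgdata
+$ zfs create -o recordsize=32K -o compression=lz4 \
+    -o mountpoint=/srv/pgwal tank/pgwal
+
+# postgresql.conf
+full_page_writes = off
+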
Make a separate dataset for the database. Set the recordsize to 64K. Set +the SQLite page size to 65536 +bytes [3].
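+A sketch, assuming a pool named tank and a hypothetical database under /srv/db:
+$ zfs create -o recordsize=64K -o mountpoint=/srv/db tank/sqlite
+$ sqlite3 /srv/db/app.db 'PRAGMA page_size = 65536; VACUUM;'   # rebuild an existing database with 64K pages
+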
+Note that SQLite databases typically are not exercised enough to merit +special tuning, but this will provide it. Note the side effect on cache +size mentioned at +SQLite.org [4].
+Create a dedicated dataset for files being served.
+See +Sequential workloads +for configuration recommendations.
+Windows/DOS clients don’t support case sensitive file names.
+If your main workload won’t need case sensitivity for other supported clients,
+create the dataset with zfs create -o casesensitivity=insensitive
+so Samba may search filenames faster in the future [5].
See case sensitive
option in
+smb.conf(5).
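+For example, for a hypothetical share dataset:
+$ zfs create -o casesensitivity=insensitive tank/share
+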
Set recordsize=1M
on datasets that are subject to sequential workloads.
+Read
+Larger record sizes
+for documentation on things that should be known before setting 1M
+record sizes.
Set compression=lz4
as per the general recommendation for LZ4
+compression.
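+For example, for a hypothetical dataset holding large media files:
+$ zfs create -o recordsize=1M -o compression=lz4 tank/media
+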
Create a dedicated dataset, use chown to make it user accessible (or +create a directory under it and use chown on that) and then configure +the game download application to place games there. Specific information +on how to configure various ones is below.
+See +Sequential workloads +for configuration recommendations before installing games.
+Note that the performance gains from this tuning are likely to be small and limited to load times. However, the combination of 1M records and LZ4 will allow more games to be stored, which is why this tuning is documented despite the performance gains being limited. A Steam library of 300 games (mostly from Humble Bundle) that had these tweaks applied to it saw 20% space savings. Both faster load times and significant space savings are possible on compressible games when this tuning has been done. Games whose assets are already compressed will see little to no benefit.
+Open the context menu by left clicking on the triple bar icon in the +upper right. Go to “Preferences” and then the “System options” tab. +Change the default installation directory and click save.
+Go to “Settings” -> “Downloads” -> “Steam Library Folders” and use “Add Library Folder” to set the directory for Steam to use to store games. Make sure to set it to the default by right clicking on it and clicking “Make Default Folder” before closing the dialog.
+If you’ll use Proton to run non-native games,
+create the dataset with zfs create -o casesensitivity=insensitive
+so Wine may search filenames faster in the future [5].
Windows file systems’ standard behavior is to be case-insensitive.
+Create the dataset with zfs create -o casesensitivity=insensitive
+so Wine may search filenames faster in the future [5].
Virtual machine images on ZFS should be stored using either zvols or raw +files to avoid unnecessary overhead. The recordsize/volblocksize and +guest filesystem may be configured to match to avoid overhead from +partial record modification, see zvol volblocksize. +If raw files are used, a separate dataset should be used to make it easy to configure +recordsize independently of other things stored on ZFS.
+AIO should be used to maximize IOPS when using files for guest storage.
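+A hypothetical sketch of both approaches; the dataset names, sizes and guest paths are made up, and the qemu-system options shown are just one way of requesting native AIO with the host page cache bypassed:
+# zvol-backed guest disk with volblocksize matched to the guest filesystem's allocation size
+$ zfs create -V 100G -o volblocksize=16K tank/guest1-disk
+
+# raw image files on a dedicated dataset with a matching recordsize
+$ zfs create -o recordsize=64K tank/vm-images
+$ qemu-system-x86_64 ... -drive file=/tank/vm-images/guest2.raw,format=raw,aio=native,cache=none
+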
+Footnotes
+<https://www.patpro.net/blog/index.php/2014/03/09/2617-mysql-on-zfs-on-freebsd/>
+ZFS write operations are delayed when the backend storage isn’t able to +accommodate the rate of incoming writes. This delay process is known as +the ZFS write throttle.
+If there is already a write transaction waiting, the delay is relative +to when that transaction will finish waiting. Thus the calculated delay +time is independent of the number of threads concurrently executing +transactions.
+If there is only one waiter, the delay is relative to when the +transaction started, rather than the current time. This credits the +transaction for “time already served.” For example, if a write +transaction requires reading indirect blocks first, then the delay is +counted at the start of the transaction, just prior to the indirect +block reads.
+The minimum time for a transaction to take is calculated as:
+min_time = zfs_delay_scale * (dirty - min) / (max - dirty)
+min_time is then capped at 100 milliseconds
+
The delay has two degrees of freedom that can be adjusted via tunables:
The percentage of dirty data at which we start to delay is defined by zfs_delay_min_dirty_percent. This is typically at or above zfs_vdev_async_write_active_max_dirty_percent, so delays occur after writing at full speed has failed to keep up with the incoming write rate.
The scale of the curve is defined by zfs_delay_scale. Roughly +speaking, this variable determines the amount of delay at the +midpoint of the curve.
delay
+ 10ms +-------------------------------------------------------------*+
+ | *|
+ 9ms + *+
+ | *|
+ 8ms + *+
+ | * |
+ 7ms + * +
+ | * |
+ 6ms + * +
+ | * |
+ 5ms + * +
+ | * |
+ 4ms + * +
+ | * |
+ 3ms + * +
+ | * |
+ 2ms + (midpoint) * +
+ | | ** |
+ 1ms + v *** +
+ | zfs_delay_scale ----------> ******** |
+ 0 +-------------------------------------*********----------------+
+ 0% <- zfs_dirty_data_max -> 100%
+
Note that since the delay is added to the outstanding time remaining on +the most recent transaction, the delay is effectively the inverse of +IOPS. Here the midpoint of 500 microseconds translates to 2000 IOPS. The +shape of the curve was chosen such that small changes in the amount of +accumulated dirty data in the first 3/4 of the curve yield relatively +small differences in the amount of delay.
+The effects can be easier to understand when the amount of delay is +represented on a log scale:
+delay
+100ms +-------------------------------------------------------------++
+ + +
+ | |
+ + *+
+ 10ms + *+
+ + ** +
+ | (midpoint) ** |
+ + | ** +
+ 1ms + v **** +
+ + zfs_delay_scale ----------> ***** +
+ | **** |
+ + **** +
+100us + ** +
+ + * +
+ | * |
+ + * +
+ 10us + * +
+ + +
+ | |
+ + +
+ +--------------------------------------------------------------+
+ 0% <- zfs_dirty_data_max -> 100%
+
Note here that only as the amount of dirty data approaches its limit +does the delay start to increase rapidly. The goal of a properly tuned +system should be to keep the amount of dirty data out of that range by +first ensuring that the appropriate limits are set for the I/O scheduler +to reach optimal throughput on the backend storage, and then by changing +the value of zfs_delay_scale to increase the steepness of the curve.
+ZFS issues I/O operations to leaf vdevs (usually devices) to satisfy and +complete I/Os. The ZIO scheduler determines when and in what order those +operations are issued. Operations are divided into five I/O classes +prioritized in the following order:
Priority | I/O Class | Description
---|---|---
highest | sync read | most reads
 | sync write | as defined by application or via the ‘zfs’ ‘sync’ property
 | async read | prefetch reads
 | async write | most writes
lowest | scrub read | scan read: includes both scrub and resilver
Each queue defines the minimum and maximum number of concurrent +operations issued to the device. In addition, the device has an +aggregate maximum, zfs_vdev_max_active. Note that the sum of the +per-queue minimums must not exceed the aggregate maximum. If the sum of +the per-queue maximums exceeds the aggregate maximum, then the number of +active I/Os may reach zfs_vdev_max_active, in which case no further I/Os +are issued regardless of whether all per-queue minimums have been met.
I/O Class | Min Active Parameter | Max Active Parameter
---|---|---
sync read | zfs_vdev_sync_read_min_active | zfs_vdev_sync_read_max_active
sync write | zfs_vdev_sync_write_min_active | zfs_vdev_sync_write_max_active
async read | zfs_vdev_async_read_min_active | zfs_vdev_async_read_max_active
async write | zfs_vdev_async_write_min_active | zfs_vdev_async_write_max_active
scrub read | zfs_vdev_scrub_min_active | zfs_vdev_scrub_max_active
For many physical devices, throughput increases with the number of +concurrent operations, but latency typically suffers. Further, physical +devices typically have a limit at which more concurrent operations have +no effect on throughput or can cause the disk performance to +decrease.
+The ZIO scheduler selects the next operation to issue by first looking +for an I/O class whose minimum has not been satisfied. Once all are +satisfied and the aggregate maximum has not been hit, the scheduler +looks for classes whose maximum has not been satisfied. Iteration +through the I/O classes is done in the order specified above. No further +operations are issued if the aggregate maximum number of concurrent +operations has been hit or if there are no operations queued for an I/O +class that has not hit its maximum. Every time an I/O is queued or an +operation completes, the I/O scheduler looks for new operations to +issue.
+In general, smaller max_active’s will lead to lower latency of +synchronous operations. Larger max_active’s may lead to higher overall +throughput, depending on underlying storage and the I/O mix.
+The ratio of the queues’ max_actives determines the balance of +performance between reads, writes, and scrubs. For example, when there +is contention, increasing zfs_vdev_scrub_max_active will cause the scrub +or resilver to complete more quickly, but reads and writes to have +higher latency and lower throughput.
+All I/O classes have a fixed maximum number of outstanding operations except for the async write class. Asynchronous writes represent the data that is committed to stable storage during the syncing stage of transaction groups (txgs). Transaction groups enter the syncing state periodically, so the number of queued async writes quickly bursts up and then reduces to zero. The zfs_txg_timeout tunable (default 5 seconds) sets the target interval for txg sync. Thus a burst of async writes every 5 seconds is a normal ZFS I/O pattern.
+Rather than servicing I/Os as quickly as possible, the ZIO scheduler changes the maximum number of active async write I/Os according to the amount of dirty data in the pool. Since both throughput and latency typically increase as the number of concurrent operations issued to physical devices increases, reducing the burstiness in the number of concurrent operations also stabilizes the response time of operations from other queues. This is particularly important for the sync read and write queues, where the periodic async write bursts of the txg sync can lead to device-level contention. In broad strokes, the ZIO scheduler issues more concurrent operations from the async write queue as there is more dirty data in the pool.
+The hole_birth feature has/had bugs, the result of which is that, if you
+do a zfs send -i
(or -R
, since it uses -i
) from an affected
+dataset, the receiver will not see any checksum or other errors, but the
+resulting destination snapshot will not match the source.
ZoL versions 0.6.5.8 and 0.7.0-rc1 (and above) default to ignoring the +faulty metadata which causes this issue on the sender side.
+It is technically possible to calculate whether you have any affected files, but it requires scraping zdb output for each file in each snapshot in each dataset, which is a combinatoric nightmare. (If you really want it, there is a proof of concept here.)
+No, the data you need was simply not present in the send stream, +unfortunately, and cannot feasibly be rewritten in place.
+hole_birth is a feature to speed up zfs send -i. In particular, ZFS previously did not store metadata on when “holes” (sparse regions) in files were created, so every zfs send -i needed to include every hole.
+hole_birth, as the name implies, added tracking of the txg (transaction group) in which a hole was created, so that zfs send -i could send only the holes with a birth_time between (starting snapshot txg) and (ending snapshot txg), and life was wonderful.
+Unfortunately, hole_birth had a number of edge cases where it could “forget” to set the birth_time of holes, causing it to record the birth_time as 0 (the value used prior to hole_birth, and essentially equivalent to “since file creation”).
+This meant that, when you did a zfs send -i, since zfs send does not +have any knowledge of the surrounding snapshots when sending a given +snapshot, it would see the creation txg as 0, conclude “oh, it is 0, I +must have already sent this before”, and not include it.
+This means that, on the receiving side, it does not know those holes +should exist, and does not create them. This leads to differences +between the source and the destination.
+ZoL versions 0.6.5.8 and 0.7.0-rc1 (and above) default to ignoring this
+metadata and always sending holes with birth_time 0, configurable using
+the tunable known as ignore_hole_birth
or
+send_holes_without_birth_time
. The latter is what OpenZFS
+standardized on. ZoL version 0.6.5.8 only has the former, but for any
+ZoL version with send_holes_without_birth_time
, they point to the
+same value, so changing either will work.
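+For example, on Linux the setting can be inspected and changed at runtime through the module parameter (shown here with the newer name; a sketch):
+$ cat /sys/module/zfs/parameters/send_holes_without_birth_time    # 1 = ignore hole_birth metadata (the default)
+$ echo 1 > /sys/module/zfs/parameters/send_holes_without_birth_time
+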
OpenZFS is an outstanding storage platform that +encompasses the functionality of traditional filesystems, volume +managers, and more, with consistent reliability, functionality and +performance across all distributions. Additional information about +OpenZFS can be found in the OpenZFS wikipedia +article.
+Because ZFS was originally designed for Sun Solaris, it was long considered a filesystem for large servers and for companies that could afford the best and most powerful hardware available. But since the porting of ZFS to numerous open-source platforms (the BSDs, illumos and Linux, under the umbrella organization “OpenZFS”), these requirements have been lowered.
+The suggested hardware requirements are:
+ECC memory. This isn’t really a requirement, but it’s highly +recommended.
8GB+ of memory for the best performance. It’s perfectly possible to +run with 2GB or less (and people do), but you’ll need more if using +deduplication.
+Using ECC memory for OpenZFS is strongly recommended for enterprise environments where the strongest data integrity guarantees are required. Without ECC memory, rare random bit flips caused by cosmic rays or by faulty memory can go undetected. If this were to occur, OpenZFS (or any other filesystem) would write the damaged data to disk and be unable to automatically detect the corruption.
+Unfortunately, ECC memory is not always supported by consumer grade +hardware. And even when it is, ECC memory will be more expensive. For +home users the additional safety brought by ECC memory might not justify +the cost. It’s up to you to determine what level of protection your data +requires.
+OpenZFS is available for FreeBSD and all major Linux distributions. Refer to +the getting started section of the wiki for +links to installations instructions. If your distribution/OS isn’t +listed you can always build OpenZFS from the latest official +tarball.
+OpenZFS is regularly compiled for the following architectures: +aarch64, arm, ppc, ppc64, x86, x86_64.
+The notes for a given +OpenZFS release will include a range of supported kernels. Point +releases will be tagged as needed in order to support the stable +kernel available from kernel.org. The +oldest supported kernel is 2.6.32 due to its prominence in Enterprise +Linux distributions.
+You are strongly encouraged to use a 64-bit kernel. OpenZFS +will build for 32-bit systems but you may encounter stability problems.
+ZFS was originally developed for the Solaris kernel which differs from +some OpenZFS platforms in several significant ways. Perhaps most importantly +for ZFS it is common practice in the Solaris kernel to make heavy use of +the virtual address space. However, use of the virtual address space is +strongly discouraged in the Linux kernel. This is particularly true on +32-bit architectures where the virtual address space is limited to 100M +by default. Using the virtual address space on 64-bit Linux kernels is +also discouraged but the address space is so much larger than physical +memory that it is less of an issue.
+If you are bumping up against the virtual memory limit on a 32-bit
+system you will see the following message in your system logs. You can
+increase the virtual address size with the boot option vmalloc=512M
.
vmap allocation for size 4198400 failed: use vmalloc=<size> to increase size.
+
However, even after making this change your system will likely not be +entirely stable. Proper support for 32-bit systems is contingent upon +the OpenZFS code being weaned off its dependence on virtual memory. This +will take some time to do correctly but it is planned for OpenZFS. This +change is also expected to improve how efficiently OpenZFS manages the +ARC cache and allow for tighter integration with the standard Linux page +cache.
+Booting from ZFS on Linux is possible and many people do it. There are +excellent walk throughs available for +Debian, +Ubuntu, and +Gentoo.
+On FreeBSD 13+ booting from ZFS is supported out of the box.
+There are different /dev/ names that can be used when creating a ZFS +pool. Each option has advantages and drawbacks, the right choice for +your ZFS pool really depends on your requirements. For development and +testing using /dev/sdX naming is quick and easy. A typical home server +might prefer /dev/disk/by-id/ naming for simplicity and readability. +While very large configurations with multiple controllers, enclosures, +and switches will likely prefer /dev/disk/by-vdev naming for maximum +control. But in the end, how you choose to identify your disks is up to +you.
+/dev/sdX, /dev/hdX: Best for development/test pools
+Summary: The top level /dev/ names are the default for consistency +with other ZFS implementations. They are available under all Linux +distributions and are commonly used. However, because they are not +persistent they should only be used with ZFS for development/test +pools.
Benefits: This method is easy for a quick test, the names are +short, and they will be available on all Linux distributions.
Drawbacks: The names are not persistent and will change depending +on what order the disks are detected in. Adding or removing +hardware for your system can easily cause the names to change. You +would then need to remove the zpool.cache file and re-import the +pool using the new names.
Example: zpool create tank sda sdb
/dev/disk/by-id/: Best for small pools (less than 10 disks)
+Summary: This directory contains disk identifiers with more human +readable names. The disk identifier usually consists of the +interface type, vendor name, model number, device serial number, +and partition number. This approach is more user friendly because +it simplifies identifying a specific disk.
Benefits: Nice for small systems with a single disk controller. +Because the names are persistent and guaranteed not to change, it +doesn’t matter how the disks are attached to the system. You can +take them all out, randomly mix them up on the desk, put them +back anywhere in the system and your pool will still be +automatically imported correctly.
Drawbacks: Configuring redundancy groups based on physical +location becomes difficult and error prone. Unreliable on many +personal virtual machine setups because the software does not +generate persistent unique names by default.
Example:
+zpool create tank scsi-SATA_Hitachi_HTS7220071201DP1D10DGG6HMRP
/dev/disk/by-path/: Good for large pools (greater than 10 disks)
+Summary: This approach is to use device names which include the +physical cable layout in the system, which means that a particular +disk is tied to a specific location. The name describes the PCI +bus number, as well as enclosure names and port numbers. This +allows the most control when configuring a large pool.
Benefits: Encoding the storage topology in the name is not only +helpful for locating a disk in large installations. But it also +allows you to explicitly layout your redundancy groups over +multiple adapters or enclosures.
Drawbacks: These names are long, cumbersome, and difficult for a +human to manage.
Example:
+zpool create tank pci-0000:00:1f.2-scsi-0:0:0:0 pci-0000:00:1f.2-scsi-1:0:0:0
/dev/disk/by-vdev/: Best for large pools (greater than 10 disks)
+Summary: This approach provides administrative control over device +naming using the configuration file /etc/zfs/vdev_id.conf. Names +for disks in JBODs can be generated automatically to reflect their +physical location by enclosure IDs and slot numbers. The names can +also be manually assigned based on existing udev device links, +including those in /dev/disk/by-path or /dev/disk/by-id. This +allows you to pick your own unique meaningful names for the disks. +These names will be displayed by all the zfs utilities so it can +be used to clarify the administration of a large complex pool. See +the vdev_id and vdev_id.conf man pages for further details.
Benefits: The main benefit of this approach is that it allows you +to choose meaningful human-readable names. Beyond that, the +benefits depend on the naming method employed. If the names are +derived from the physical path the benefits of /dev/disk/by-path +are realized. On the other hand, aliasing the names based on drive +identifiers or WWNs has the same benefits as using +/dev/disk/by-id.
Drawbacks: This method relies on having a /etc/zfs/vdev_id.conf +file properly configured for your system. To configure this file +please refer to section Setting up the /etc/zfs/vdev_id.conf +file. As with +benefits, the drawbacks of /dev/disk/by-id or /dev/disk/by-path +may apply depending on the naming method employed.
Example: zpool create tank mirror A1 B1 mirror A2 B2
/dev/disk/by-uuid/: Not a great option
Summary: One might think from the use of “UUID” that this would +be an ideal option - however, in practice, this ends up listing +one device per pool ID, which is not very useful for importing +pools with multiple disks.
/dev/disk/by-partuuid/, /dev/disk/by-partlabel/: Works only for existing partitions
Summary: a partition UUID is generated at its creation, so usage is limited
Drawbacks: you can’t refer to a partition unique ID on +an unpartitioned disk for
zpool replace
/add
/attach
, and you can’t find a failed disk easily without a mapping written down ahead of time.
In order to use /dev/disk/by-vdev/ naming the /etc/zfs/vdev_id.conf
+must be configured. The format of this file is described in the
+vdev_id.conf man page. Several examples follow.
A non-multipath configuration with direct-attached SAS enclosures and an +arbitrary slot re-mapping.
+multipath no
+topology sas_direct
+phys_per_port 4
+
+# PCI_SLOT HBA PORT CHANNEL NAME
+channel 85:00.0 1 A
+channel 85:00.0 0 B
+
+# Linux Mapped
+# Slot Slot
+slot 0 2
+slot 1 6
+slot 2 0
+slot 3 3
+slot 4 5
+slot 5 7
+slot 6 4
+slot 7 1
+
A SAS-switch topology. Note that the channel keyword takes only two +arguments in this example.
+topology sas_switch
+
+# SWITCH PORT CHANNEL NAME
+channel 1 A
+channel 2 B
+channel 3 C
+channel 4 D
+
A multipath configuration. Note that channel names have multiple +definitions - one per physical path.
+multipath yes
+
+# PCI_SLOT HBA PORT CHANNEL NAME
+channel 85:00.0 1 A
+channel 85:00.0 0 B
+channel 86:00.0 1 A
+channel 86:00.0 0 B
+
A configuration using device link aliases.
+# by-vdev
+# name fully qualified or base name of device link
+alias d1 /dev/disk/by-id/wwn-0x5000c5002de3b9ca
+alias d2 wwn-0x5000c5002def789e
+
After defining the new disk names run udevadm trigger
to prompt udev
+to parse the configuration file. This will result in a new
+/dev/disk/by-vdev directory which is populated with symlinks to /dev/sdX
+names. Following the first example above, you could then create the new
+pool of mirrors with the following command:
$ zpool create tank \
+ mirror A0 B0 mirror A1 B1 mirror A2 B2 mirror A3 B3 \
+ mirror A4 B4 mirror A5 B5 mirror A6 B6 mirror A7 B7
+
+$ zpool status
+ pool: tank
+ state: ONLINE
+ scan: none requested
+config:
+
+ NAME STATE READ WRITE CKSUM
+ tank ONLINE 0 0 0
+ mirror-0 ONLINE 0 0 0
+ A0 ONLINE 0 0 0
+ B0 ONLINE 0 0 0
+ mirror-1 ONLINE 0 0 0
+ A1 ONLINE 0 0 0
+ B1 ONLINE 0 0 0
+ mirror-2 ONLINE 0 0 0
+ A2 ONLINE 0 0 0
+ B2 ONLINE 0 0 0
+ mirror-3 ONLINE 0 0 0
+ A3 ONLINE 0 0 0
+ B3 ONLINE 0 0 0
+ mirror-4 ONLINE 0 0 0
+ A4 ONLINE 0 0 0
+ B4 ONLINE 0 0 0
+ mirror-5 ONLINE 0 0 0
+ A5 ONLINE 0 0 0
+ B5 ONLINE 0 0 0
+ mirror-6 ONLINE 0 0 0
+ A6 ONLINE 0 0 0
+ B6 ONLINE 0 0 0
+ mirror-7 ONLINE 0 0 0
+ A7 ONLINE 0 0 0
+ B7 ONLINE 0 0 0
+
+errors: No known data errors
+
Changing the /dev/ names on an existing pool can be done by simply +exporting the pool and re-importing it with the -d option to specify +which new names should be used. For example, to use the custom names in +/dev/disk/by-vdev:
+$ zpool export tank
+$ zpool import -d /dev/disk/by-vdev tank
+
Whenever a pool is imported on the system it will be added to the
+/etc/zfs/zpool.cache file
. This file stores pool configuration
+information, such as the device names and pool state. If this file
+exists when running the zpool import
command then it will be used to
+determine the list of pools available for import. When a pool is not
+listed in the cache file it will need to be detected and imported using
+the zpool import -d /dev/disk/by-id
command.
The /etc/zfs/zpool.cache
file will be automatically updated when
+your pool configuration is changed. However, if for some reason it
+becomes stale you can force the generation of a new
+/etc/zfs/zpool.cache
file by setting the cachefile property on the
+pool.
$ zpool set cachefile=/etc/zfs/zpool.cache tank
+
Conversely the cache file can be disabled by setting cachefile=none
.
+This is useful for failover configurations where the pool should always
+be explicitly imported by the failover software.
$ zpool set cachefile=none tank
+
The hole_birth feature has/had bugs, the result of which is that, if you
+do a zfs send -i
(or -R
, since it uses -i
) from an affected
+dataset, the receiver will not see any checksum or other errors, but
+the resulting destination snapshot will not match the source.
ZoL versions 0.6.5.8 and 0.7.0-rc1 (and above) default to ignoring the +faulty metadata which causes this issue on the sender side.
+For more details, see the hole_birth FAQ.
+When sending incremental streams which contain large blocks (>128K) the
+--large-block
flag must be specified. Inconsistent use of the flag
+between incremental sends can result in files being incorrectly zeroed
+when they are received. Raw encrypted send/recvs automatically imply the
+--large-block
flag and are therefore unaffected.
For more details, see issue +6224.
There is a lot of tuning that can be done that’s dependent on the workload that is being put on CEPH/ZFS, as well as some general guidelines. Some are as follows:
+The CEPH filestore back-end heavily relies on xattrs; for optimal performance, all CEPH workloads will benefit from the following ZFS dataset parameters:
+xattr=sa
dnodesize=auto
Beyond that, rbd/cephfs focused workloads typically benefit from a small recordsize (16K-128K), while objectstore/s3/rados focused workloads benefit from a large recordsize (128K-1M).
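+For example, a hypothetical dataset for an OSD filestore might be created as:
+$ zfs create -o xattr=sa -o dnodesize=auto -o recordsize=32K tank/ceph-osd0
+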
+Additionally, CEPH sets various values internally for handling xattrs based on the underlying filesystem. As CEPH only officially supports/detects XFS and BTRFS, for all other filesystems it falls back to rather limited “safe” values. On newer releases, the need for larger xattrs will prevent OSDs from even starting.
+The officially recommended workaround (see +here) +has some severe downsides, and more specifically is geared toward +filesystems with “limited” xattr support such as ext4.
+ZFS does not have an internal limit on xattr length; as such, we can treat it similarly to how CEPH treats XFS. We can override 3 internal values to match those used with XFS (see here and here) and allow it to be used without the severe limitations of the “official” workaround.
+[osd]
+filestore_max_inline_xattrs = 10
+filestore_max_inline_xattr_size = 65536
+filestore_max_xattr_value_size = 65536
+
Use a separate journal device. Do not colocate the CEPH journal on a ZFS dataset if at all possible; this will quickly lead to terrible fragmentation, not to mention terrible performance up front even before fragmentation (the CEPH journal does a dsync for every write).
Use a SLOG device, even with a separate CEPH journal device. For some
+workloads, skipping SLOG and setting logbias=throughput
may be
+acceptable.
Use a high-quality SLOG/CEPH journal device. A consumer based SSD, or +even NVMe WILL NOT DO (Samsung 830, 840, 850, etc) for a variety of +reasons. CEPH will kill them quickly, on-top of the performance being +quite low in this use. Generally recommended devices are [Intel DC S3610, +S3700, S3710, P3600, P3700], or [Samsung SM853, SM863], or better.
If using a high quality SSD or NVMe device (as mentioned above), you +CAN share SLOG and CEPH Journal to good results on single device. A +ratio of 4 HDDs to 1 SSD (Intel DC S3710 200GB), with each SSD +partitioned (remember to align!) to 4x10GB (for ZIL/SLOG) + 4x20GB +(for CEPH journal) has been reported to work well.
Again - CEPH + ZFS will KILL a consumer based SSD VERY quickly. Even +ignoring the lack of power-loss protection, and endurance ratings, you +will be very disappointed with performance of consumer based SSD under +such a workload.
+To achieve good performance with your pool there are some easy best +practices you should follow.
+Evenly balance your disks across controllers: Often the limiting +factor for performance is not the disks but the controller. By +balancing your disks evenly across controllers you can often improve +throughput.
Create your pool using whole disks: When running zpool create use +whole disk names. This will allow ZFS to automatically partition the +disk to ensure correct alignment. It will also improve +interoperability with other OpenZFS implementations which honor the +wholedisk property.
Have enough memory: A minimum of 2GB of memory is recommended for +ZFS. Additional memory is strongly recommended when the compression +and deduplication features are enabled.
Improve performance by setting ashift=12: You may be able to
+improve performance for some workloads by setting ashift=12
. This
+tuning can only be set when block devices are first added to a pool,
+such as when the pool is first created or when a new vdev is added to
+the pool. This tuning parameter can result in a decrease of capacity
+for RAIDZ configurations.
Advanced Format (AF) is a new disk format which natively uses a 4,096 +byte, instead of 512 byte, sector size. To maintain compatibility with +legacy systems many AF disks emulate a sector size of 512 bytes. By +default, ZFS will automatically detect the sector size of the drive. +This combination can result in poorly aligned disk accesses which will +greatly degrade the pool performance.
+Therefore, the ability to set the ashift property has been added to the +zpool command. This allows users to explicitly assign the sector size +when devices are first added to a pool (typically at pool creation time +or adding a vdev to the pool). The ashift values range from 9 to 16 with +the default value 0 meaning that zfs should auto-detect the sector size. +This value is actually a bit shift value, so an ashift value for 512 +bytes is 9 (2^9 = 512) while the ashift value for 4,096 bytes is 12 +(2^12 = 4,096).
+To force the pool to use 4,096 byte sectors at pool creation time, you +may run:
+$ zpool create -o ashift=12 tank mirror sda sdb
+
To force the pool to use 4,096 byte sectors when adding a vdev to a +pool, you may run:
+$ zpool add -o ashift=12 tank mirror sdc sdd
+
The used and referenced properties reported by a zvol may be larger than the “actual” space that is being used as reported by the consumer. On its own this is not much of a problem: once the used property reaches the configured volsize, the underlying filesystem will start reusing blocks. But the problem arises if it is desired to snapshot the zvol, as the space referenced by the snapshots will contain the unused blocks. This can be mitigated by issuing a trim (for example, the fstrim command on Linux) to allow the kernel to tell ZFS which blocks are unused. Setting the discard option for the mounted ZVOL in /etc/fstab effectively enables the kernel to issue the trim commands continuously, without the need to execute fstrim on-demand.
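+For example, an /etc/fstab entry for a hypothetical ext4-formatted zvol might look like:
+/dev/zvol/tank/vol1  /mnt/vol1  ext4  defaults,discard  0  0
+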
+CAUTION: for now swap on zvol may lead to deadlock, in this case +please send your logs +here.
+Set the volume block size to match your systems page size. This +tuning prevents ZFS from having to perform read-modify-write options +on a larger block while the system is already low on memory.
Set the logbias=throughput
and sync=always
properties. Data
+written to the volume will be flushed immediately to disk freeing up
+memory as quickly as possible.
Set primarycache=metadata
to avoid keeping swap data in RAM via
+the ARC.
Disable automatic snapshots of the swap device.
$ zfs create -V 4G -b $(getconf PAGESIZE) \
+ -o logbias=throughput -o sync=always \
+ -o primarycache=metadata \
+ -o com.sun:auto-snapshot=false rpool/swap
+
It is usually recommended to keep virtual machine storage and hypervisor pools quite separate, although a few people have managed to successfully deploy and run OpenZFS using the same machine configured as Dom0. There are a few caveats:
+Set a fair amount of memory in grub.conf, dedicated to Dom0.
+dom0_mem=16384M,max:16384M
Allocate no more than 30-40% of Dom0’s memory to ZFS in
+/etc/modprobe.d/zfs.conf
.
options zfs zfs_arc_max=6442450944
Disable Xen’s auto-ballooning in /etc/xen/xl.conf
Watch out for any Xen bugs, such as this +one related to +ballooning
To prevent udisks2 from creating /dev/mapper entries that must be
+manually removed or maintained during zvol remove / rename, create a
+udev rule such as /etc/udev/rules.d/80-udisks2-ignore-zfs.rules
with
+the following contents:
ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_FS_TYPE}=="zfs_member", ENV{ID_PART_ENTRY_TYPE}=="6a898cc3-1dd2-11b2-99a6-080020736631", ENV{UDISKS_IGNORE}="1"
+
License information can be found here.
+You can open a new issue and search existing issues using the public +issue tracker. The issue +tracker is used to organize outstanding bug reports, feature requests, +and other development tasks. Anyone may post comments after signing up +for a github account.
+Please make sure that what you’re actually seeing is a bug and not a +support issue. If in doubt, please ask on the mailing list first, and if +you’re then asked to file an issue, do so.
+When opening a new issue include this information at the top of the +issue:
+What distribution you’re using and the version.
What spl/zfs packages you’re using and the version.
Describe the problem you’re observing.
Describe how to reproduce the problem.
Include any warnings/errors/backtraces from the system logs.
When a new issue is opened it’s not uncommon for a developer to request +additional information about the problem. In general, the more detail +you share about a problem the quicker a developer can resolve it. For +example, providing a simple test case is always exceptionally helpful. +Be prepared to work with the developer looking in to your bug in order +to get it resolved. They may ask for information like:
+Your pool configuration as reported by zdb
or zpool status
.
Your hardware configuration, such as
+Number of CPUs.
Amount of memory.
Whether your system has ECC memory.
Whether it is running under a VMM/Hypervisor.
Kernel version.
Values of the spl/zfs module parameters.
Stack traces which may be logged to dmesg
.
Yes, the OpenZFS community has a code of conduct. See the Code of +Conduct for details.
List | Description | List Archive
---|---|---
zfs-announce@list.zfsonlinux.org | A low-traffic list for announcements such as new releases | 
zfs-discuss@list.zfsonlinux.org | A user discussion list for issues related to functionality and usability | 
zfs-devel@list.zfsonlinux.org | A development list for developers to discuss technical issues | 
developer@open-zfs.org | A platform-independent mailing list for ZFS developers to review ZFS code and architecture changes from all platforms | 
All tagged ZFS on Linux +releases are signed by +the official maintainer for that branch. These signatures are +automatically verified by GitHub and can be checked locally by +downloading the maintainers public key.
+First import the public key listed above in to your key ring.
+$ gpg --keyserver pgp.mit.edu --recv C6AF658B
+gpg: requesting key C6AF658B from hkp server pgp.mit.edu
+gpg: key C6AF658B: "Brian Behlendorf <behlendorf1@llnl.gov>" not changed
+gpg: Total number processed: 1
+gpg: unchanged: 1
+
After the public key is imported the signature of a git tag can be +verified as shown.
+$ git tag --verify zfs-0.6.5
+object 7a27ad00ae142b38d4aef8cc0af7a72b4c0e44fe
+type commit
+tag zfs-0.6.5
+tagger Brian Behlendorf <behlendorf1@llnl.gov> 1441996302 -0700
+
+ZFS Version 0.6.5
+gpg: Signature made Fri 11 Sep 2015 11:31:42 AM PDT using DSA key ID C6AF658B
+gpg: Good signature from "Brian Behlendorf <behlendorf1@llnl.gov>"
+gpg: aka "Brian Behlendorf (LLNL) <behlendorf1@llnl.gov>"
+
OpenZFS is storage software which combines the functionality of +traditional filesystems, volume manager, and more. OpenZFS includes +protection against data corruption, support for high storage capacities, +efficient data compression, snapshots and copy-on-write clones, +continuous integrity checking and automatic repair, remote replication +with ZFS send and receive, and RAID-Z.
+OpenZFS brings together developers from the illumos, Linux, FreeBSD and +OS X platforms, and a wide range of companies – both online and at the +annual OpenZFS Developer Summit. High-level goals of the project include +raising awareness of the quality, utility and availability of +open-source implementations of ZFS, encouraging open communication about +ongoing efforts toward improving open-source variants of ZFS, and +ensuring consistent reliability, functionality and performance of all +distributions of ZFS.
+