This repository has been archived by the owner on Oct 2, 2024. It is now read-only.

use kernel fast paths for file copying #1742

Merged
merged 9 commits into from
Oct 27, 2023
84 changes: 78 additions & 6 deletions doc/best_practices.rst
@@ -20,6 +20,78 @@ This isn’t the last word. Also consider:
NIST Special Publication 800-190; Souppaya, Morello, and Scarfone 2017.


Filesystems
===========

There are two performance gotchas to be aware of for Charliecloud.

Metadata traffic
----------------

Directory-format container images and the Charliecloud storage directory often
contain, and thus Charliecloud must manipulate, a very large number of files.
For example, after running the test suite, the storage directory contains
almost 140,000 files. That is, metadata traffic can be quite high.

Such images and the storage directory should be stored on a filesystem with
reasonable metadata performance. Notably, this *excludes* Lustre, which is
commonly used for scratch filesystems in HPC; i.e., don’t store these things
on Lustre. NFS is usually fine, though in general it performs worse than a
local filesystem.

In contrast, SquashFS images, which encapsulate the image into a single file
that is mounted using FUSE at runtime, insulate the filesystem from this
metadata traffic. Images in this format are suitable for any filesystem,
including Lustre.

.. _best-practices_file-copy:

File copy performance
---------------------

:code:`ch-image` does a lot of file copying. The bulk of this is copying
images around in the storage directory. Importantly, this includes :ref:`large
files <ch-image_bu-large>` stored by the build cache outside its Git
repository, which by definition hold a lot of data to copy.

Copies are costly both in time (to read, transfer, and write the duplicate
bytes) and space (to store the bytes). However, with the right Python and
filesystem, significant optimizations are available. Charliecloud’s internal
file copies (unfortunately not sub-programs like Git) can take advantage of
multiple file-copy optimized paths offered by Linux:

1. Copy data in-kernel without passing through user-space. Saves time but not
space. All filesystems support this.

2. Copy data server-side without sending it over the network, relevant of
course only for network filesystems. Saves time but not space. NFS 4
supports this, among others.

3. Copy-on-write via “`reflink
<https://blog.ram.rachum.com/post/620335081764077568/symlinks-and-hardlinks-move-over-make-room-for>`_”.
The destination file gets a new inode but shares the data extents of the
source file — i.e., no data are copied! — with extents copied and unshared
later if/when they are written. Saves potentially a lot of both time and
space. BTRFS, XFS, and ZFS support this, among others.
BTRFS, XFS, and ZFS support this, among others.

Support of course varies by kernel and filesystem tools version, and we have
listed only the most common filesystems above. In-kernel filesystem support
can be checked in the `Linux source code
<https://elixir.bootlin.com/linux/latest/A/ident/remap_file_range>`_, and ZFS
has `release notes <https://github.com/openzfs/zfs/releases>`_. Also, paths 2
and 3 require that source and destination be on the same filesystem.
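Whether source and destination share a filesystem can be checked by comparing
stat(2) device IDs. A minimal sketch (the function name is ours for
illustration, not part of Charliecloud):

```python
import os

def same_filesystem(a, b):
    # Paths on the same filesystem report the same device ID, so the
    # in-kernel paths 2 and 3 above are at least possible between them.
    return os.stat(a).st_dev == os.stat(b).st_dev
```

For example, one might check an image directory against the storage directory
before expecting reflink copies to happen.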

If available (Python ≥3.8), :code:`ch-image` copies file data with
:code:`os.copy_file_range()` (`docs
<https://docs.python.org/3/library/os.html#os.copy_file_range>`_), which wraps
:code:`copy_file_range(2)` (`man page
<https://man7.org/linux/man-pages/man2/copy_file_range.2.html>`_). This system
call copies data between files using the best method available of the three
above.

Thus, we recommend using a kernel, filesystem, and other tools that support
path 3 or at least path 2.

Installing your own software
============================

@@ -36,7 +108,7 @@ Charliecloud container:
trustworthy image on Docker Hub you can use as a base?

Third-party software via package manager
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----------------------------------------

This approach is the simplest and fastest way to install stuff in your image.
The :code:`examples/hello` Dockerfile does this to install the package
@@ -57,9 +129,8 @@ you add an HTTP cache, which is out of scope of this documentation).
rather troublesome in containers, and we suspect there are bugs we haven’t
ironed out yet. If you encounter problems, please do file a bug!


Third-party software compiled from source
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----------------------------------------

Under this method, one uses :code:`RUN` commands to fetch the desired software
using :code:`curl` or :code:`wget`, compile it, and install. Our example does
@@ -104,7 +175,7 @@ So what is going on here?
:code:`/usr` rather than :code:`/usr/local`.

Your software stored in the image
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
---------------------------------

This method covers software provided by you that is included in the image.
This is recommended when your software is relatively stable or is not easily
@@ -154,7 +225,7 @@ Once the image is built, we can see the results. (Install the image into
-rwxrwx--- 1 charlie charlie 441 Aug 5 22:37 test.sh

Your software stored on the host
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--------------------------------

This method leaves your software on the host but compiles it in the image.
This is recommended when your software is volatile or each image user needs a
@@ -187,4 +258,5 @@ A common use case is to leave a container shell open in one terminal for
building, and then run using a separate container invoked from a different
terminal.

.. LocalWords: userguide Gruening Souppaya Morello Scarfone openmpi

.. LocalWords: userguide Gruening Souppaya Morello Scarfone openmpi nist
45 changes: 28 additions & 17 deletions doc/ch-image.rst
@@ -72,13 +72,14 @@ Common options placed before or after the sub-command:

:code:`--cache-large SIZE`
Set the cache’s large file threshold to :code:`SIZE` MiB, or :code:`0` for
no large files, which is the default. This can speed up some builds.
no large files, which is the default. Values greater than zero can speed
up many builds but can also cause performance degradation.
**Experimental.** See section :ref:`Large file threshold
<ch-image_bu-large>` for details.

:code:`--debug`
Add a stack trace to fatal error hints. This can also be done by setting
the environment variable `CH_IMAGE_DEBUG`.
the environment variable :code:`CH_IMAGE_DEBUG`.

:code:`--no-cache`
Disable build cache. Default if a sufficiently new Git is not available.
@@ -451,27 +452,37 @@ Large file threshold

Because Git uses content-addressed storage, upon commit, it must read in full
all files modified by an instruction. This I/O cost can be a significant
fraction of build time for some large images. Regular files larger than the
experimental *large file threshold* are stored outside the Git repository,
somewhat like `Git Large File Storage <https://git-lfs.github.com/>`_.
:code:`ch-image` uses hard links to bring large files in and out of images as
needed, which is a fast metadata operation that ignores file content.
fraction of build time for some images. To mitigate this, regular files larger
than the experimental *large file threshold* are stored outside the Git
repository, somewhat like `Git Large File Storage
<https://git-lfs.github.com/>`_.

:code:`ch-image` copies large files in and out of images at each instruction
commit. It tries to do this with a fast metadata-only copy-on-write operation
called “reflink”, but that is only supported with the right Python version,
Linux kernel version, and filesystem. If unsupported, Charliecloud falls back
to an expensive standard copy, which is likely slower than letting Git deal
with the files. See :ref:`File copy performance <best-practices_file-copy>`
for details.

Every version of a large file is stored verbatim and uncompressed (e.g., a
large file with a one-byte change will be stored in full twice), so Git’s
de-duplication does not apply. *However*, on filesystems with reflink support,
files can share extents (e.g., each of the two files will have its own extent
containing the changed byte, but the rest of the extents will remain shared).
This provides de-duplication between large files in images that share
ancestry.
Also, unused large files are deleted by :code:`ch-image build-cache --gc`.

A final caveat: Large files in any image with the same path, mode, size, and
mtime (to nanosecond precision if possible) are considered identical, even if
their content is not actually identical (e.g., :code:`touch(1)` shenanigans
can corrupt an image).

Option :code:`--cache-large` sets the threshold in MiB; if not set,
environment variable :code:`CH_IMAGE_CACHE_LARGE` is used; if that is not set
either, the default value :code:`0` indicates that no files are considered
large.

There are two trade-offs. First, large files in any image with the same path,
mode, size, and mtime (to nanosecond precision if possible) are considered
identical, *even if their content is not actually identical*; e.g.,
:code:`touch(1)` shenanigans can corrupt an image. Second, every version of a
large file is stored verbatim and uncompressed (e.g., a large file with a
one-byte change will be stored in full twice), and large files do not
participate in the build cache’s de-duplication, so more storage space will
likely be used. Unused versions *are* deleted by :code:`ch-image build-cache
--gc`.

(Note that Git has an unrelated setting called :code:`core.bigFileThreshold`.)

Example
9 changes: 6 additions & 3 deletions lib/build.py
@@ -206,7 +206,7 @@ def build_arg_get(arg):
% (ml.instruction_total_ct, ml.inst_prev.image))
# FIXME: remove when we’re done encouraging people to use the build cache.
if (isinstance(bu.cache, bu.Disabled_Cache)):
ch.INFO("build slow? consider enabling the new build cache",
ch.INFO("build slow? consider enabling the build cache",
"https://hpc.github.io/charliecloud/command-usage.html#build-cache")


@@ -767,7 +767,7 @@ def onerror(x):
dst_path.rmtree()
else:
dst_path.unlink_()
ch.copy2(src_path, dst_path, follow_symlinks=False)
src_path.copy(dst_path)

def copy_src_file(self, src, dst):
"""Copy file src to dst. src might be a symlink, but dst is a canonical
@@ -789,8 +789,11 @@ def copy_src_file(self, src, dst):
assert (not dst.is_symlink())
assert ( (dst.exists() and (dst.is_dir() or dst.is_file()))
or (not dst.exists() and dst.parent.is_dir()))
if (dst.is_dir()):
dst //= src.name
src = src.resolve()
ch.DEBUG("copying named file: %s -> %s" % (src, dst))
ch.copy2(src, dst, follow_symlinks=True)
src.copy(dst)

def dest_realpath(self, unpack_path, dst):
"""Return the canonicalized version of path dst within (canonical) image
4 changes: 2 additions & 2 deletions lib/build_cache.py
@@ -538,10 +538,10 @@ def large_prepare(self):
return large_name

def large_restore(self):
"Hard link my file to the copy in large file storage."
"Restore large file from OOB storage."
target = ch.storage.build_large_path(self.large_name)
ch.DEBUG("large file: %s: copying: %s" % (self.path_abs, self.large_name))
ch.copy2(target, self.path_abs)
fs.copy(target, self.path_abs)

def pickle(self):
(self.image_root // PICKLE_PATH) \
4 changes: 0 additions & 4 deletions lib/charliecloud.py
@@ -609,10 +609,6 @@ def color_set(color, fp):
if (fp.isatty()):
print("\033[" + color, end="", flush=True, file=fp)

def copy2(src, dst, **kwargs):
"Wrapper for shutil.copy2() with error checking."
ossafe(shutil.copy2, "can’t copy: %s -> %s" % (src, dst), src, dst, **kwargs)

def dependencies_check():
"""Check more dependencies. If any dependency problems found, here or above
(e.g., lark module checked at import time), then complain and exit."""
77 changes: 75 additions & 2 deletions lib/filesystem.py
@@ -32,6 +32,20 @@
storage_lock = True


### Functions ###

def copy(src, dst, follow_symlinks=False):
"""Copy file src to dst. Wrapper function providing same signature as
shutil.copy2(). See Path.copy() for lots of gory details. Accepts
follow_symlinks, but the only valid value is False."""
assert (not follow_symlinks)
if (isinstance(src, str)):
src = Path(src)
if (isinstance(dst, str)):
dst = Path(dst)
src.copy(dst)


## Classes ##

class Path(pathlib.PosixPath):
@@ -187,9 +201,68 @@ def chmod_min(self, st=None):
ch.ossafe(os.chmod, "can’t chmod: %s" % self, self, perms_new)
return (st.st_mode | perms_new)

def copy(self, dst):
"""Copy file myself to dst, including metadata, overwriting dst if it
exists. dst must be the actual destination path, i.e., it may not be
a directory. Does not follow symlinks.

If (a) src is a regular file, (b) src and dst are on the same
filesystem, and (c) Python is version ≥3.8, then use
os.copy_file_range() [1,2], which at a minimum does an in-kernel data
transfer. If that filesystem also (d) supports copy-on-write [3],
then this is a very fast lazy reflink copy.

[1]: https://docs.python.org/3/library/os.html#os.copy_file_range
[2]: https://man7.org/linux/man-pages/man2/copy_file_range.2.html
[3]: https://elixir.bootlin.com/linux/latest/A/ident/remap_file_range
"""
src_st = self.stat_(False)
# dst is not a directory, so parent must be on the same filesystem. We
# *do* want to follow symlinks on the parent.
dst_dev = dst.parent.stat_(True).st_dev
if ( stat.S_ISREG(src_st.st_mode)
and src_st.st_dev == dst_dev
and hasattr(os, "copy_file_range")):
# Fast path. The same-filesystem restriction is because reliable
# copy_file_range(2) between filesystems seems quite new (maybe
# kernel 5.18?).
try:
if (dst.exists()):
# If dst is a symlink, we get ELOOP from os.open(). Delete it
# unconditionally though, for simplicity.
dst.unlink()
src_fd = os.open(self, os.O_RDONLY|os.O_NOFOLLOW)
dst_fd = os.open(dst, os.O_WRONLY|os.O_NOFOLLOW|os.O_CREAT)
# copy_file_range(2) may copy fewer bytes than requested (as with
# read(2) and write(2)), so loop until everything is copied, as in
# the man page example.
remaining = src_st.st_size
while (remaining > 0):
copied = os.copy_file_range(src_fd, dst_fd, remaining)
if (copied == 0):
ch.FATAL("zero bytes copied: %s -> %s" % (self, dst))
remaining -= copied
os.close(src_fd)
os.close(dst_fd)
except OSError as x:
ch.FATAL("can’t copy data (fast): %s -> %s: %s"
% (self, dst, x.strerror))
else:
# Slow path.
try:
shutil.copyfile(self, dst, follow_symlinks=False)
except OSError as x:
ch.FATAL("can’t copy data (slow): %s -> %s: %s"
% (self, dst, x.strerror))
try:
# Metadata.
shutil.copystat(self, dst, follow_symlinks=False)
except OSError as x:
ch.FATAL("can’t copy metadata: %s -> %s: %s" % (self, dst, x.strerror))

def copytree(self, *args, **kwargs):
"Wrapper for shutil.copytree() that exits on the first error."
shutil.copytree(str(self), copy_function=ch.copy2, *args, **kwargs)
shutil.copytree(self, copy_function=copy, *args, **kwargs)

def disk_bytes(self):
"""Return the number of disk bytes consumed by path. Note this is
@@ -435,7 +508,7 @@ def stat_(self, links):
follow_symlinks kwarg is absent in pathlib for Python 3.6, which we
want to retain compatibility with."""
return ch.ossafe(os.stat, "can’t stat: %s" % self, self,
follow_symlinks=links)
follow_symlinks=links)

def strip(self, left=0, right=0):
"""Return a copy of myself with n leading components removed. E.g.:
2 changes: 1 addition & 1 deletion lib/image.py
@@ -396,7 +396,7 @@ def metadata_replace(self, config_json):
else:
# Copy pulled config file into the image so we still have it.
path = self.metadata_path // "config.pulled.json"
ch.copy2(config_json, path)
config_json.copy(path)
ch.VERBOSE("pulled config path: %s" % path)
self.metadata_merge_from_config(path.json_from_file("config"))
self.metadata_save()
2 changes: 1 addition & 1 deletion test/build/50_dockerfile.bats
@@ -159,7 +159,7 @@ test 7a
test 7 b
--force=seccomp: modified 0 RUN instructions
grown in 16 instructions: tmpimg
build slow? consider enabling the new build cache
build slow? consider enabling the build cache
hint: https://hpc.github.io/charliecloud/command-usage.html#build-cache
warning: reprinting 1 warning(s)
warning: not yet supported, ignored: issue #777: .dockerignore file
6 changes: 3 additions & 3 deletions test/build/55_cache.bats
@@ -1326,7 +1326,7 @@ EOF
[[ -z $output ]]

echo
echo '*** threshold = 4'
echo '*** threshold = 5'
ch-image build-cache --reset
echo "$df" | ch-image build --cache-large=5 -t tmpimg -
run ls "$CH_IMAGE_STORAGE"/bularge
@@ -1337,7 +1337,7 @@
EOF

echo
echo '*** threshold = 3, rebuild'
echo '*** threshold = 4, rebuild'
echo "$df" | ch-image build --rebuild --cache-large=4 -t tmpimg -
run ls "$CH_IMAGE_STORAGE"/bularge
echo "$output"
@@ -1349,7 +1349,7 @@
EOF

echo
echo '*** threshold = 3, reset'
echo '*** threshold = 4, reset'
ch-image build-cache --reset
echo "$df" | ch-image build --rebuild --cache-large=4 -t tmpimg -
run ls "$CH_IMAGE_STORAGE"/bularge