This repository has been archived by the owner on Oct 2, 2024. It is now read-only.

use kernel fast paths for file copying #1742

Merged
merged 9 commits into from
Oct 27, 2023
84 changes: 78 additions & 6 deletions doc/best_practices.rst
@@ -20,6 +20,78 @@ This isn’t the last word. Also consider:
NIST Special Publication 800-190; Souppaya, Morello, and Scarfone 2017.


Filesystems
===========

There are two performance gotchas to be aware of for Charliecloud.

Metadata traffic
----------------

Directory-format container images and the Charliecloud storage directory often
contain, and thus Charliecloud must manipulate, a very large number of files.
For example, after running the test suite, the storage directory contains
almost 140,000 files. That is, metadata traffic can be quite high.

Such images and the storage directory should be stored on a filesystem with
reasonable metadata performance. Notably, this *excludes* Lustre, which is
commonly used for scratch filesystems in HPC; i.e., don’t store these things
on Lustre. NFS is usually fine, though in general it performs worse than a
local filesystem.

In contrast, SquashFS images, which encapsulate the image into a single file
that is mounted using FUSE at runtime, insulate the filesystem from this
metadata traffic. Images in this format are suitable for any filesystem,
including Lustre.

.. _best-practices_file-copy:

File copy performance
---------------------

:code:`ch-image` does a lot of file copying. The bulk of this is copying
images around in the storage directory. Importantly, this includes :ref:`large
files <ch-image_bu-large>` stored by the build cache outside its Git
repository, which by definition hold a lot of data to copy.

Copies are costly both in time (to read, transfer, and write the duplicate
bytes) and space (to store the bytes). However, with the right Python and
filesystem, significant optimizations are available. Charliecloud’s internal
file copies (unfortunately not sub-programs like Git) can take advantage of
multiple file-copy optimized paths offered by Linux:

1. Copy data in-kernel without passing through user-space. Saves time but not
space. All filesystems support this.

2. Copy data server-side without sending it over the network, relevant of
course only for network filesystems. Saves time but not space. NFS 4
supports this, among others.

3. Copy-on-write via “`reflink
<https://blog.ram.rachum.com/post/620335081764077568/symlinks-and-hardlinks-move-over-make-room-for>`_”.
The destination file gets a new inode but shares the data extents of the
source file — i.e., no data are copied! — with extents copied and unshared
later if/when they are written. Saves potentially a lot of both time and
space. BTRFS, XFS, and ZFS support this, among others.
BTRFS, XFS, and ZFS support this, among others.

Support of course varies by kernel and filesystem tools version, and we have
listed only the most common filesystems above. In-kernel filesystem support
can be checked in the `Linux source code
<https://elixir.bootlin.com/linux/latest/A/ident/remap_file_range>`_, and ZFS
has `release notes <https://github.com/openzfs/zfs/releases>`_. Also, paths 2
and 3 require that source and destination be on the same filesystem.
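Whether source and destination share a filesystem can be checked by comparing
stat(2) device IDs. A minimal sketch (the function name is ours for
illustration, not part of Charliecloud):

```python
import os

def same_filesystem(a, b):
    # Paths on the same filesystem report the same device ID, so the
    # in-kernel paths 2 and 3 above are at least possible between them.
    return os.stat(a).st_dev == os.stat(b).st_dev
```

For example, one might check an image directory against the storage directory
before expecting reflink copies to happen.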

If available (Python ≥3.8), :code:`ch-image` copies file data with
:code:`os.copy_file_range()` (`docs
<https://docs.python.org/3/library/os.html#os.copy_file_range>`_), which wraps
:code:`copy_file_range(2)` (`man page
<https://man7.org/linux/man-pages/man2/copy_file_range.2.html>`_). This system
call copies data between files using the best method available of the three
above.

Thus, we recommend using a kernel, filesystem, and other tools that support
path 3 or at least path 2.

Installing your own software
============================

@@ -36,7 +108,7 @@ Charliecloud container:
trustworthy image on Docker Hub you can use as a base?

Third-party software via package manager
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----------------------------------------

This approach is the simplest and fastest way to install stuff in your image.
The :code:`examples/hello` Dockerfile does this to install the package
@@ -57,9 +129,8 @@ you add an HTTP cache, which is out of scope of this documentation).
rather troublesome in containers, and we suspect there are bugs we haven’t
ironed out yet. If you encounter problems, please do file a bug!


Third-party software compiled from source
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----------------------------------------

Under this method, one uses :code:`RUN` commands to fetch the desired software
using :code:`curl` or :code:`wget`, compile it, and install. Our example does
@@ -104,7 +175,7 @@ So what is going on here?
:code:`/usr` rather than :code:`/usr/local`.

Your software stored in the image
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
---------------------------------

This method covers software provided by you that is included in the image.
This is recommended when your software is relatively stable or is not easily
@@ -154,7 +225,7 @@ Once the image is built, we can see the results. (Install the image into
-rwxrwx--- 1 charlie charlie 441 Aug 5 22:37 test.sh

Your software stored on the host
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--------------------------------

This method leaves your software on the host but compiles it in the image.
This is recommended when your software is volatile or each image user needs a
@@ -187,4 +258,5 @@ A common use case is to leave a container shell open in one terminal for
building, and then run using a separate container invoked from a different
terminal.

.. LocalWords: userguide Gruening Souppaya Morello Scarfone openmpi

.. LocalWords: userguide Gruening Souppaya Morello Scarfone openmpi nist
45 changes: 28 additions & 17 deletions doc/ch-image.rst
@@ -72,13 +72,14 @@ Common options placed before or after the sub-command:

:code:`--cache-large SIZE`
Set the cache’s large file threshold to :code:`SIZE` MiB, or :code:`0` for
no large files, which is the default. This can speed up some builds.
no large files, which is the default. Values greater than zero can speed
up many builds but can also cause performance degradation.
**Experimental.** See section :ref:`Large file threshold
<ch-image_bu-large>` for details.

:code:`--debug`
Add a stack trace to fatal error hints. This can also be done by setting
the environment variable `CH_IMAGE_DEBUG`.
the environment variable :code:`CH_IMAGE_DEBUG`.

:code:`--no-cache`
Disable build cache. Default if a sufficiently new Git is not available.
@@ -451,27 +452,37 @@ Large file threshold

Because Git uses content-addressed storage, upon commit, it must read in full
all files modified by an instruction. This I/O cost can be a significant
fraction of build time for some large images. Regular files larger than the
experimental *large file threshold* are stored outside the Git repository,
somewhat like `Git Large File Storage <https://git-lfs.github.com/>`_.
:code:`ch-image` uses hard links to bring large files in and out of images as
needed, which is a fast metadata operation that ignores file content.
fraction of build time for some images. To mitigate this, regular files larger
than the experimental *large file threshold* are stored outside the Git
repository, somewhat like `Git Large File Storage
<https://git-lfs.github.com/>`_.

:code:`ch-image` copies large files in and out of images at each instruction
commit. It tries to do this with a fast metadata-only copy-on-write operation
called “reflink”, but that is only supported with the right Python version,
Linux kernel version, and filesystem. If unsupported, Charliecloud falls back
to an expensive standard copy, which is likely slower than letting Git deal
with the files. See :ref:`File copy performance <best-practices_file-copy>`
for details.

Every version of a large file is stored verbatim and uncompressed (e.g., a
large file with a one-byte change will be stored in full twice), so Git’s
de-duplication does not apply. *However*, on filesystems with reflink support,
files can share extents (e.g., each of the two files will have its own extent
containing the changed byte, but the rest of the extents will remain shared).
This provides de-duplication between large files in images that share
ancestry.
Also, unused large files are deleted by :code:`ch-image build-cache --gc`.

A final caveat: Large files in any image with the same path, mode, size, and
mtime (to nanosecond precision if possible) are considered identical, even if
their content is not actually identical (e.g., :code:`touch(1)` shenanigans
can corrupt an image).

Option :code:`--cache-large` sets the threshold in MiB; if not set,
environment variable :code:`CH_IMAGE_CACHE_LARGE` is used; if that is not set
either, the default value :code:`0` indicates that no files are considered
large.

There are two trade-offs. First, large files in any image with the same path,
mode, size, and mtime (to nanosecond precision if possible) are considered
identical, *even if their content is not actually identical*; e.g.,
:code:`touch(1)` shenanigans can corrupt an image. Second, every version of a
large file is stored verbatim and uncompressed (e.g., a large file with a
one-byte change will be stored in full twice), and large files do not
participate in the build cache’s de-duplication, so more storage space will
likely be used. Unused versions *are* deleted by :code:`ch-image build-cache
--gc`.

(Note that Git has an unrelated setting called :code:`core.bigFileThreshold`.)

Example
9 changes: 6 additions & 3 deletions lib/build.py
@@ -206,7 +206,7 @@ def build_arg_get(arg):
% (ml.instruction_total_ct, ml.inst_prev.image))
# FIXME: remove when we’re done encouraging people to use the build cache.
if (isinstance(bu.cache, bu.Disabled_Cache)):
ch.INFO("build slow? consider enabling the new build cache",
ch.INFO("build slow? consider enabling the build cache",
"https://hpc.github.io/charliecloud/command-usage.html#build-cache")


@@ -767,7 +767,7 @@ def onerror(x):
dst_path.rmtree()
else:
dst_path.unlink_()
ch.copy2(src_path, dst_path, follow_symlinks=False)
src_path.copy(dst_path)

def copy_src_file(self, src, dst):
"""Copy file src to dst. src might be a symlink, but dst is a canonical
@@ -789,8 +789,11 @@ def copy_src_file(self, src, dst):
assert (not dst.is_symlink())
assert ( (dst.exists() and (dst.is_dir() or dst.is_file()))
or (not dst.exists() and dst.parent.is_dir()))
if (dst.is_dir()):
dst //= src.name
src = src.resolve()
ch.DEBUG("copying named file: %s -> %s" % (src, dst))
ch.copy2(src, dst, follow_symlinks=True)
src.copy(dst)

def dest_realpath(self, unpack_path, dst):
"""Return the canonicalized version of path dst within (canonical) image
4 changes: 2 additions & 2 deletions lib/build_cache.py
@@ -538,10 +538,10 @@ def large_prepare(self):
return large_name

def large_restore(self):
"Hard link my file to the copy in large file storage."
"Restore large file from OOB storage."
target = ch.storage.build_large_path(self.large_name)
ch.DEBUG("large file: %s: copying: %s" % (self.path_abs, self.large_name))
ch.copy2(target, self.path_abs)
fs.copy(target, self.path_abs)

def pickle(self):
(self.image_root // PICKLE_PATH) \
4 changes: 0 additions & 4 deletions lib/charliecloud.py
@@ -609,10 +609,6 @@ def color_set(color, fp):
if (fp.isatty()):
print("\033[" + color, end="", flush=True, file=fp)

def copy2(src, dst, **kwargs):
"Wrapper for shutil.copy2() with error checking."
ossafe(shutil.copy2, "can’t copy: %s -> %s" % (src, dst), src, dst, **kwargs)

def dependencies_check():
"""Check more dependencies. If any dependency problems found, here or above
(e.g., lark module checked at import time), then complain and exit."""
77 changes: 75 additions & 2 deletions lib/filesystem.py
@@ -32,6 +32,20 @@
storage_lock = True


### Functions ###

def copy(src, dst, follow_symlinks=False):
"""Copy file src to dst. Wrapper function providing same signature as
shutil.copy2(). See Path.copy() for lots of gory details. Accepts
follow_symlinks, but the only valid value is False."""
assert (not follow_symlinks)
if (isinstance(src, str)):
src = Path(src)
if (isinstance(dst, str)):
dst = Path(dst)
src.copy(dst)


## Classes ##

class Path(pathlib.PosixPath):
@@ -187,9 +201,68 @@ def chmod_min(self, st=None):
ch.ossafe(os.chmod, "can’t chmod: %s" % self, self, perms_new)
return (st.st_mode | perms_new)

def copy(self, dst):
"""Copy file myself to dst, including metadata, overwriting dst if it
exists. dst must be the actual destination path, i.e., it may not be
a directory. Does not follow symlinks.

If (a) src is a regular file, (b) src and dst are on the same
filesystem, and (c) Python is version ≥3.8, then use
os.copy_file_range() [1,2], which at a minimum does an in-kernel data
transfer. If that filesystem also (d) supports copy-on-write [3],
then this is a very fast lazy reflink copy.

[1]: https://docs.python.org/3/library/os.html#os.copy_file_range
[2]: https://man7.org/linux/man-pages/man2/copy_file_range.2.html
[3]: https://elixir.bootlin.com/linux/latest/A/ident/remap_file_range
"""
src_st = self.stat_(False)
# dst is not a directory, so parent must be on the same filesystem. We
# *do* want to follow symlinks on the parent.
dst_dev = dst.parent.stat_(True).st_dev
if ( stat.S_ISREG(src_st.st_mode)
and src_st.st_dev == dst_dev
and hasattr(os, "copy_file_range")):
# Fast path. The same-filesystem restriction is because reliable
# copy_file_range(2) between filesystems seems quite new (maybe
# kernel 5.18?).
try:
if (dst.exists()):
# If dst is a symlink, we get ELOOP from os.open(). Delete it
# unconditionally though, for simplicity.
dst.unlink()
src_fd = os.open(self, os.O_RDONLY|os.O_NOFOLLOW)
dst_fd = os.open(dst, os.O_WRONLY|os.O_NOFOLLOW|os.O_CREAT)
# copy_file_range(2) may copy fewer bytes than requested (as with
# read(2) and write(2)), so loop until everything is copied, as in
# the man page example.
remaining = src_st.st_size
while (remaining > 0):
copied = os.copy_file_range(src_fd, dst_fd, remaining)
if (copied == 0):
ch.FATAL("zero bytes copied: %s -> %s" % (self, dst))
remaining -= copied
os.close(src_fd)
os.close(dst_fd)
except OSError as x:
ch.FATAL("can’t copy data (fast): %s -> %s: %s"
% (self, dst, x.strerror))
else:
# Slow path.
try:
shutil.copyfile(self, dst, follow_symlinks=False)
except OSError as x:
ch.FATAL("can’t copy data (slow): %s -> %s: %s"
% (self, dst, x.strerror))
try:
# Metadata.
shutil.copystat(self, dst, follow_symlinks=False)
except OSError as x:
ch.FATAL("can’t copy metadata: %s -> %s: %s" % (self, dst, x.strerror))

def copytree(self, *args, **kwargs):
"Wrapper for shutil.copytree() that exits on the first error."
shutil.copytree(str(self), copy_function=ch.copy2, *args, **kwargs)
shutil.copytree(self, copy_function=copy, *args, **kwargs)

def disk_bytes(self):
"""Return the number of disk bytes consumed by path. Note this is
@@ -435,7 +508,7 @@ def stat_(self, links):
follow_symlinks kwarg is absent in pathlib for Python 3.6, which we
want to retain compatibility with."""
return ch.ossafe(os.stat, "can’t stat: %s" % self, self,
follow_symlinks=links)
follow_symlinks=links)

def strip(self, left=0, right=0):
"""Return a copy of myself with n leading components removed. E.g.:
2 changes: 1 addition & 1 deletion lib/image.py
@@ -396,7 +396,7 @@ def metadata_replace(self, config_json):
else:
# Copy pulled config file into the image so we still have it.
path = self.metadata_path // "config.pulled.json"
ch.copy2(config_json, path)
config_json.copy(path)
ch.VERBOSE("pulled config path: %s" % path)
self.metadata_merge_from_config(path.json_from_file("config"))
self.metadata_save()
2 changes: 1 addition & 1 deletion test/build/50_dockerfile.bats
@@ -159,7 +159,7 @@ test 7a
test 7 b
--force=seccomp: modified 0 RUN instructions
grown in 16 instructions: tmpimg
build slow? consider enabling the new build cache
build slow? consider enabling the build cache
hint: https://hpc.github.io/charliecloud/command-usage.html#build-cache
warning: reprinting 1 warning(s)
warning: not yet supported, ignored: issue #777: .dockerignore file
6 changes: 3 additions & 3 deletions test/build/55_cache.bats
@@ -1326,7 +1326,7 @@ EOF
[[ -z $output ]]

echo
echo '*** threshold = 4'
echo '*** threshold = 5'
ch-image build-cache --reset
echo "$df" | ch-image build --cache-large=5 -t tmpimg -
run ls "$CH_IMAGE_STORAGE"/bularge
@@ -1337,7 +1337,7 @@
EOF

echo
echo '*** threshold = 3, rebuild'
echo '*** threshold = 4, rebuild'
echo "$df" | ch-image build --rebuild --cache-large=4 -t tmpimg -
run ls "$CH_IMAGE_STORAGE"/bularge
echo "$output"
@@ -1349,7 +1349,7 @@
EOF

echo
echo '*** threshold = 3, reset'
echo '*** threshold = 4, reset'
ch-image build-cache --reset
echo "$df" | ch-image build --rebuild --cache-large=4 -t tmpimg -
run ls "$CH_IMAGE_STORAGE"/bularge