OCI image reproducibility fail #930

Open
DYefimov opened this issue Jul 12, 2022 · 1 comment

DYefimov commented Jul 12, 2022

Abstract

OCI chose the tar format as the basis for its image storage layer without, AFAIK, specifying any constraints on the tar format itself.

In #805 @vbatts says:

The big determinism is once a layer is built. That unpacking and repacking of the same content is deterministic. Compression can and does mess this up, but at least for the *.tar itself, this should hold true.

I found that this does not always hold true, so the content-addressable scheme might be affected.

Steps to reproduce (with linux + GNU tar):

  • pull hello-world image with skopeo
  • gunzip the tar.gz layer holding the hello binary
  • tar x the hello binary inside
  • tar c the extracted hello binary
  • compare two tar archives
  • observe a 1-bit difference in the mode field of the tar header for hello (and, as a consequence, in the header checksum)

I narrowed the case down to differences between the GNU tar implementation and the Golang one.

Given that most containerization software nowadays is written in Go, someone might find this useful.
As a side note, I don't intend to dig deeper into this, and I hope more experienced OCI/Golang people will pick it up.

And after this small discussion on the #reproducible-builds IRC channel,

[10:43] Hello. Diving deep into containerization software, I found out that GO implementation of tar behaves differently from GNU tar. More specifically it writes first three file stat mode triplets into the archive for every regular file, while GNU tar clamps (zeroes) them. GNU tar is conformant to the POSIX spec. OCI (Open Container Initiative) relies on tar format for its storage layer, without specifying any details. Resulting in: container images built with GO differ by 1 bit (+crc in the header) for every file within the archives, meaning that hashes used by the content addressable scheme differ too. Is this a reproducibility issue at all?
[11:14] <*> DYefimov: kind-of, usually using different build tools means all bets are off, even for different versions of build tools. that said, it seems worth it to fix this
[11:21] Thanks gotcha. Ok, the root cause is in the GO tar implementation, but probably the best place to file the issue would be an OCI? as they are a more involved party and might be interested in fixing this by themselves... but frankly - I don't see an easy way around - most of containerization software nowadays built with GO... the impact seems to be so huge
[11:30] <*> not sure about where to file the issue though, maybe both????

I decided to share this finding here.

Testcase and explanation

Please take a look at this testcase (tar_issue_test.sh):
#!/usr/bin/env sh
set -e

SKOPEO_IMG=quay.io/skopeo/stable:latest

uname -srvmpio
docker --version
docker run --rm \
    --security-opt seccomp=unconfined \
    $SKOPEO_IMG --version
tar --version | head -n 1

echo '================================'
IMAGE_NAME=hello-world

TMP_DIR=$(mktemp -dt tar_issue_test.XXXXXXXX)
mkdir "$TMP_DIR/$IMAGE_NAME"
echo "Created \"$TMP_DIR\""
trap "echo \"Removing \\\"$TMP_DIR\\\"\"; rm -rf \"$TMP_DIR\"" EXIT

docker run --rm \
    --security-opt seccomp=unconfined \
    --user $(id -u):$(id -g) \
    -v "$TMP_DIR/$IMAGE_NAME":"/$IMAGE_NAME" \
    $SKOPEO_IMG \
    copy docker://$IMAGE_NAME oci:$IMAGE_NAME:latest

mkdir "$TMP_DIR/$IMAGE_NAME/testdir"
cd "$TMP_DIR/$IMAGE_NAME/testdir"
mv \
    "$TMP_DIR/$IMAGE_NAME/blobs/sha256/2db29710123e3e53a794f2694094b9b4338aa9ee5c40b930cb8063a1be392c54" \
    "./src.tar.gz"
gunzip -q ./src.tar.gz

echo '================================'
tar xvf ./src.tar # contains just the "hello" binary
SOURCE_DATE_EPOCH=$(date +%s)
tar \
    --format=ustar \
    -b 1 \
    --sort=name \
    --numeric-owner --owner=0 --group=0 \
    --mtime="@${SOURCE_DATE_EPOCH}" --clamp-mtime \
    -cf repacked.tar hello
chmod g-w repacked.tar

echo '================================'
set -x
ls -lt --time-style=full-iso
tar -tvf src.tar
tar -tvf repacked.tar
cmp -l src.tar repacked.tar || true
hexdump -C src.tar | head
hexdump -C repacked.tar | head
set +x
echo '================================'

and its output in my environment (the kernel is a bit outdated for unrelated reasons):

Linux 4.15.0-176-generic #185-Ubuntu SMP Tue Mar 29 17:40:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Docker version 20.10.7, build f0df350
skopeo version 1.8.0
tar (GNU tar) 1.29
================================
Created "/tmp/tar_issue_test.h3RcZgaW"
Getting image source signatures
Copying blob sha256:2db29710123e3e53a794f2694094b9b4338aa9ee5c40b930cb8063a1be392c54
Copying config sha256:811f3caa888b1ee5310e2135cfd3fe36b42e233fe0d76d9798ebd324621238b9
Writing manifest to image destination
Storing signatures
================================
hello
================================
+ ls -lt --time-style=full-iso
total 48
-rw-r--r-- 1 dyefimov dyefimov 14848 2022-07-12 16:49:56.793553576 +0300 repacked.tar
-rw-r--r-- 1 dyefimov dyefimov 14848 2022-07-12 16:49:55.357547252 +0300 src.tar
-rwxrwxr-x 1 dyefimov dyefimov 13256 2021-09-24 02:47:50.000000000 +0300 hello
+ tar -tvf src.tar
-rwxrwxr-x 0/0           13256 2021-09-24 02:47 hello
+ tar -tvf repacked.tar
-rwxrwxr-x 0/0           13256 2021-09-24 02:47 hello
+ cmp -l src.tar repacked.tar
  102  61  60
  154  64  63
+ true
+ hexdump -C src.tar
+ head
00000000  68 65 6c 6c 6f 00 00 00  00 00 00 00 00 00 00 00  |hello...........|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 30 31 30 30  37 37 35 00 30 30 30 30  |....0100775.0000|
00000070  30 30 30 00 30 30 30 30  30 30 30 00 30 30 30 30  |000.0000000.0000|
00000080  30 30 33 31 37 31 30 00  31 34 31 32 33 32 31 31  |0031710.14123211|
00000090  30 34 36 00 30 31 30 32  37 34 00 20 30 00 00 00  |046.010274. 0...|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000100  00 75 73 74 61 72 00 30  30 00 00 00 00 00 00 00  |.ustar.00.......|
+ hexdump -C repacked.tar
+ head
00000000  68 65 6c 6c 6f 00 00 00  00 00 00 00 00 00 00 00  |hello...........|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 30 30 30 30  37 37 35 00 30 30 30 30  |....0000775.0000|
00000070  30 30 30 00 30 30 30 30  30 30 30 00 30 30 30 30  |000.0000000.0000|
00000080  30 30 33 31 37 31 30 00  31 34 31 32 33 32 31 31  |0031710.14123211|
00000090  30 34 36 00 30 31 30 32  37 33 00 20 30 00 00 00  |046.010273. 0...|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000100  00 75 73 74 61 72 00 30  30 00 00 00 00 00 00 00  |.ustar.00.......|
+ set +x
================================
Removing "/tmp/tar_issue_test.h3RcZgaW"

There are two differences between the original and the repacked tar files (bytes 102 and 154 in cmp's 1-based numbering).
The second one is the header checksum (a plain byte sum rather than a true CRC) and is a direct consequence of the first one.
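
To see why the second difference follows from the first: the ustar header checksum is the plain sum of all 512 header bytes, with the 8-byte checksum field itself (offsets 148-155) counted as ASCII spaces. Changing one byte from '1' to '0' therefore lowers the sum by exactly one (010274 vs 010273 in the dumps above). A minimal sketch that recomputes the checksum for the first header of an archive; the function name tar_header_checksum is made up for illustration:

```shell
# Recompute the ustar checksum of the first 512-byte header in a tar file.
# Hypothetical helper, not part of any tool mentioned in this issue.
tar_header_checksum() {
    head -c 512 "$1" | od -An -v -t u1 | awk '
        {
            for (i = 1; i <= NF; i++) {
                n++
                # Bytes 149..156 (1-based) hold the checksum itself and
                # are summed as spaces (32), per the ustar format.
                sum += (n >= 149 && n <= 156) ? 32 : $i
            }
        }
        END { printf "%06o\n", sum }'
}
```

Running tar_header_checksum on src.tar and on repacked.tar should print the six octal digits stored at offset 0x94 of each header.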

As you can see, the original tar file has one extra bit set at offset 0x65.
The 8 bytes starting at offset 100 (0x64) of the header hold the file mode of the entry.
So in src.tar the mode string (octal) of the hello binary is 0100775, while in the one repacked by GNU tar it is 0000775.

That extra bit corresponds to the S_IFREG file-type flag that the stat() syscall reports in st_mode for regular files.
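
The split between file-type bits and permission bits can be checked directly. A small sketch, assuming the usual Linux values S_IFMT = 0170000 and S_IFREG = 0100000 from sys/stat.h; the helper names raw_mode and split_mode are illustrative only:

```shell
# Print the raw 7-digit octal mode field of the first header in a tar file
# (the field starts at offset 100). Hypothetical helper.
raw_mode() {
    dd if="$1" bs=1 skip=100 count=7 2>/dev/null
}

# Split an octal mode string into file-type bits and permission bits.
# Assumes S_IFMT = 0170000, as on Linux.
split_mode() {
    mode=$((0$1))   # the leading 0 makes the shell parse the digits as octal
    printf 'type=%07o perm=%07o\n' $((mode & 0170000)) $((mode & 07777))
}

split_mode 0100775   # type=0100000 perm=0000775  (S_IFREG set, as in src.tar)
split_mode 0000775   # type=0000000 perm=0000775  (as in repacked.tar)
```

With the archives from the testcase, split_mode "$(raw_mode src.tar)" would be one way to confirm which archive carries the file-type bit.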

"Possible" root cause and question

GNU tar truncates the first three triplets of the mode string, while Golang tar does not.

GNU states:

Starting from version 1.14 GNU tar features full support for POSIX.1-2001 archives.
A POSIX conformant archive will be created if tar was given ‘--format=posix’ (‘--format=pax’) option. No special option is required to read and extract from a POSIX archive.

POSIX tar.h knows nothing about S_IFREG.

Doesn't this mean that Golang tar is not POSIX compliant? Am I missing something?

There are some inconsistencies in the above, like POSIX vs --format=ustar etc. Of course, I double- and triple-checked all of them, with the same result.

[UPDATE] After a bit more tracing...

Golang seems to be fine, at least for the S_IFREG part - it truncates it here right after the fstatat call, and here in the tar itself

Skopeo seems to be alright:
it pulls application/vnd.docker.image.rootfs.diff.tar.gzip from docker.io and silently puts it as application/vnd.oci.image.layer.v1.tar+gzip according to spec:

application/vnd.oci.image.layer.v1.tar+gzip

Interchangeable and fully compatible mime-types
application/vnd.docker.image.rootfs.diff.tar.gzip

So in the end, the docker.io registry somehow stores a non-canonical tarball in the rootfs blob of library/hello-world. Where that extra S_IFREG bit came from is unknown. Nevertheless, it contradicts the statement by @vbatts that "unpacking and repacking of the same content is deterministic", which also affects the content-addressable scheme and reproducibility.

@sudo-bmitch
Contributor

In general, I see reproducibility as a best effort, but not a guarantee. For the guarantee, you'd need to match the tooling that produced the image, and that tooling would need to provide a reproducibility guarantee itself. There are a lot of variables, including things like gzip compression levels, various attributes in the tar headers, seekable tar formats (estargz), and various digest algorithms. The JSON schemas can be extended with custom fields, and some implementations aren't consistent with ordering of those fields or the white space used in the JSON.

Ideally we'll identify as many of these as possible, and specify a canonical standard for everyone to follow to maximize the possibility of reproducibility. However, consumers of image content will also be flexible in what they allow, to maximize the portability of content and compatibility between tools.

Given this, are there any specific changes needed to the image-spec right now, or should this be closed and we can revisit individual spec issues on a case-by-case basis?
