The big determinism is once a layer is built. That unpacking and repacking of the same content is deterministic. Compression can and does mess this up, but at least for the *.tar itself, this should hold true.
I found that this does not always hold true, so the content-addressable scheme might be affected.
Steps to reproduce (with linux + GNU tar):
1. pull the hello-world image with skopeo
2. gunzip the tar.gz layer holding the hello binary
3. tar x the hello binary inside
4. tar c the extracted hello binary
5. compare the two tar archives
6. observe a 1-bit difference in the file mode field of the hello entry's tar header (and, consequently, in the header checksum)
I narrowed the case down to differences between the GNU tar implementation and the Golang one.
Given that most containerization software nowadays is written in Go, someone might find this useful.
As a side note, I don't intend to dig deeper into this and hope that people more experienced with OCI/Golang will pick it up.
And after this small discussion at #reproducible-builds IRC (click for convo log),
[10:43] Hello. Diving deep into containerization software, I found out that GO implementation of tar behaves differently from GNU tar. More specifically it writes first three file stat mode triplets into the archive for every regular file, while GNU tar clamps (zeroes) them. GNU tar is conformant to the POSIX spec. OCI (Open Container Initiative) relies on tar format for its storage layer, without specifying any details. Resulting in: container images built with GO differ by 1 bit (+crc in the header) for every file within the archives, meaning that hashes used by the content addressable scheme differ too. Is this a reproducibility issue at all?
[11:14] <*> DYefimov: kind-of, usually using different build tools means all bets are off, even for different versions of build tools. that said, it seems worth it to fix this
[11:21] Thanks gotcha. Ok, the root cause is in the GO tar implementation, but probably the best place to file the issue would be an OCI? as they are a more involved party and might be interested in fixing this by themselves... but frankly - I don't see an easy way around - most of containerization software nowadays built with GO... the impact seems to be so huge
[11:30] <*> not sure about where to file the issue though, maybe both????
I decided to share this finding here.
Testcase and explanation
Please, take a look at this testcase (click for tar_issue_test.sh source)
#!/usr/bin/env sh
set -e

SKOPEO_IMG=quay.io/skopeo/stable:latest

# Report the environment: kernel, docker, skopeo and tar versions.
uname -srvmpio
docker --version
docker run --rm \
    --security-opt seccomp=unconfined \
    $SKOPEO_IMG --version
tar --version | head -n 1
echo '================================'

IMAGE_NAME=hello-world
TMP_DIR=$(mktemp -dt tar_issue_test.XXXXXXXX)
mkdir "$TMP_DIR/$IMAGE_NAME"
echo "Created \"$TMP_DIR\""
trap "echo \"Removing \\\"$TMP_DIR\\\"\"; rm -rf \"$TMP_DIR\"" EXIT

# Pull the image into an OCI layout with skopeo.
docker run --rm \
    --security-opt seccomp=unconfined \
    --user $(id -u):$(id -g) \
    -v "$TMP_DIR/$IMAGE_NAME":"/$IMAGE_NAME" \
    $SKOPEO_IMG \
    copy docker://$IMAGE_NAME oci:$IMAGE_NAME:latest

mkdir "$TMP_DIR/$IMAGE_NAME/testdir"
cd "$TMP_DIR/$IMAGE_NAME/testdir"

# Take the layer blob and decompress it.
mv \
    "$TMP_DIR/$IMAGE_NAME/blobs/sha256/2db29710123e3e53a794f2694094b9b4338aa9ee5c40b930cb8063a1be392c54" \
    "./src.tar.gz"
gunzip -q ./src.tar.gz
echo '================================'

# Unpack the layer, then repack the same content with GNU tar.
tar xvf ./src.tar # contains just the "hello" binary
SOURCE_DATE_EPOCH=$(date +%s)
tar \
    --format=ustar \
    -b 1 \
    --sort=name \
    --numeric-owner --owner=0 --group=0 \
    --mtime="@${SOURCE_DATE_EPOCH}" --clamp-mtime \
    -cf repacked.tar hello
chmod g-w repacked.tar
echo '================================'

# Compare the original and the repacked archive byte by byte.
set -x
ls -lt --time-style=full-iso
tar -tvf src.tar
tar -tvf repacked.tar
cmp -l src.tar repacked.tar || true
hexdump -C src.tar | head
hexdump -C repacked.tar | head
set +x
echo '================================'
and its output in my environment (the kernel is a bit outdated for unrelated reasons):
There are two differences between the original and the repacked tar files (bytes 102 and 154).
The second one is the header checksum, which differs as a direct consequence of the first one.
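To illustrate why a single changed mode byte also changes a checksum byte, here is a minimal sketch of the ustar header checksum: the unsigned sum of all 512 header bytes, with the checksum field itself (offsets 148-155) counted as eight spaces. The function name `tar_header_checksum` is made up for this example, and the sketch only looks at the first header in the file.

```sh
#!/bin/sh
# tar_header_checksum FILE: print, as 6 octal digits, the checksum of the
# first 512-byte header in FILE. Hypothetical helper, not part of tar.
tar_header_checksum() {
    {
        dd if="$1" bs=1 count=148 2>/dev/null           # bytes before the checksum field
        printf '        '                               # 8 spaces stand in for the field
        dd if="$1" bs=1 skip=156 count=356 2>/dev/null  # bytes after the checksum field
    } | od -An -v -tu1 |
        awk '{ for (i = 1; i <= NF; i++) s += $i } END { printf "%06o\n", s }'
}
```

Assuming the mode byte is the only other difference, `tar_header_checksum src.tar` and `tar_header_checksum repacked.tar` should then differ by exactly one, since the ASCII codes of '1' and '0' differ by one.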
As you can see, the original tar file has 1 extra bit set at offset 0x65.
Bytes 101-108 of the header (offset 100, length 8) hold the file mode of the entry.
So in src.tar the mode string (octal) of the hello binary is 0100775, while in the one repacked with GNU tar it is 0000775.
That extra bit corresponds to S_IFREG, which the stat() syscall returns for regular files.
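The two fields discussed above can also be read straight from the header without a hex dump. This is a rough sketch that assumes the ustar layout (mode at offset 100, checksum at offset 148) and inspects only the first entry; `tar_header_fields` is a name invented for the example.

```sh
#!/bin/sh
# tar_header_fields FILE: print the mode and checksum fields of the first
# header in FILE. Hypothetical helper; offsets follow the ustar layout.
tar_header_fields() {
    mode=$(dd if="$1" bs=1 skip=100 count=7 2>/dev/null)    # 7 octal digits
    chksum=$(dd if="$1" bs=1 skip=148 count=6 2>/dev/null)  # 6 octal digits
    printf 'mode=%s chksum=%s\n' "$mode" "$chksum"
}
```

Running it on both src.tar and repacked.tar puts the two mode strings and the two stored checksums side by side.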
GNU states:
"Starting from version 1.14 GNU tar features full support for POSIX.1-2001 archives.
A POSIX conformant archive will be created if tar was given ‘--format=posix’ (‘--format=pax’) option. No special option is required to read and extract from a POSIX archive."
Skopeo seems to be alright: it pulls application/vnd.docker.image.rootfs.diff.tar.gzip from docker.io and silently stores it as application/vnd.oci.image.layer.v1.tar+gzip, according to the spec.
So in the end, the docker.io registry somehow stores a non-canonical tarball in the rootfs blob of library/hello-world. Where that extra S_IFREG bit came from is unknown. Nevertheless, it contradicts the statement by @vbatts, "That unpacking and repacking of the _same_ content is deterministic", and also affects the content-addressable scheme and reproducibility.
In general, I see reproducibility as a best effort, but not a guarantee. For the guarantee, you'd need to match the tooling that produced the image, and that tooling would need to provide a reproducibility guarantee itself. There are a lot of variables, including things like gzip compression levels, various attributes in the tar headers, seekable tar formats (estargz), and various digest algorithms. The JSON schemas can be extended with custom fields, and some implementations aren't consistent with ordering of those fields or the white space used in the JSON.
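As a small illustration of one of these variables, the sketch below compresses the same stand-in file at two gzip levels and prints digests of the compressed and of the decompressed data. The file name and the helper names `digest` and `uncompressed_digest` are assumptions made for the example; whether the compressed digests actually differ depends on the gzip implementation and the input, while the decompressed digests should always match.

```sh
#!/bin/sh
set -e

# digest FILE: hex SHA-256 of FILE as stored.
digest() {
    sha256sum "$1" | cut -d' ' -f1
}

# uncompressed_digest FILE.gz: hex SHA-256 of the decompressed content.
uncompressed_digest() {
    gzip -dc "$1" | sha256sum | cut -d' ' -f1
}

tmp=$(mktemp -d)
printf 'stand-in for a layer tarball\n' > "$tmp/layer.tar"

# -n keeps the file name and timestamp out of the gzip header.
gzip -n -1 -c "$tmp/layer.tar" > "$tmp/fast.gz"
gzip -n -9 -c "$tmp/layer.tar" > "$tmp/best.gz"

echo "compressed:   $(digest "$tmp/fast.gz") $(digest "$tmp/best.gz")"
echo "uncompressed: $(uncompressed_digest "$tmp/fast.gz") $(uncompressed_digest "$tmp/best.gz")"
rm -rf "$tmp"
```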
Ideally we'll identify as many of these as possible, and specify a canonical standard for everyone to follow to maximize the possibility of reproducibility. However, consumers of image content will also be flexible in what they allow, to maximize the portability of content and compatibility between tools.
Given this, are there any specific changes needed to the image-spec right now, or should this be closed and we can revisit individual spec issues on a case-by-case basis?
Abstract
OCI chose the tar format as the basis for its image storage layer, while not specifying any constraints on the tar format itself, AFAIK.
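In #805 @vbatts says: "The big determinism is once a layer is built. That unpacking and repacking of the same content is deterministic. Compression can and does mess this up, but at least for the *.tar itself, this should hold true."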
"Possible" root cause and question
GNU tar truncates the first three triplets (octal digits) of the mode string, while Golang tar apparently does not.
GNU states:
POSIX tar.h knows nothing about S_IFREG.
Doesn't that mean Golang tar is not POSIX compliant? Am I missing something?
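The truncation in question amounts to keeping only the 12 permission bits of the mode. A minimal sketch in shell arithmetic, assuming the Linux value of S_IFREG (0100000); `strip_file_type` is a made-up name, not something GNU tar or Go exposes:

```sh
#!/bin/sh
S_IFREG=0100000   # regular-file type bit (value assumed from Linux <sys/stat.h>)

# strip_file_type MODE: keep only the 12 permission bits (07777) and print
# the result as a 7-digit octal string, like a tar mode field.
strip_file_type() {
    printf '%07o\n' $(( $1 & 07777 ))
}

strip_file_type 0100775                               # prints 0000775
printf '%07o\n' $(( 0100775 ^ 0000775 ))              # prints 0100000
[ $(( 0100775 ^ 0000775 )) -eq $(( $S_IFREG )) ] && echo "difference is S_IFREG"
```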
There are some inconsistencies in the above, like POSIX vs --format=ustar, etc. Of course, I double- and triple-checked all of them, with the same result.
[UPDATE] After a bit more tracing...
Golang seems to be fine, at least for the S_IFREG part: it truncates it here, right after the fstatat call, and here, in the tar itself.