This document outlines where to find and access logs and metrics for the snapshotter. It covers some common error paths that a user might run into while using the snapshotter and provides guidance on how to root-cause or resolve them.
For the most part, the `soci-snapshotter-grpc` logs will be the most important place to look when debugging. If `soci-snapshotter-grpc` was started via `systemd`, you can obtain logs using `journalctl`:
```shell
sudo journalctl -u soci-snapshotter.service
```
Note: The command above assumes that you have used the unit file definition `soci-snapshotter.service` that we have provided. If you have created your own unit file for `soci-snapshotter-grpc`, replace `soci-snapshotter.service` with the name of your unit.
If you have started `soci-snapshotter-grpc` manually, logs will either be emitted to stderr/stdout or to the destination of your choice.
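When the snapshotter runs under systemd, `journalctl`'s standard flags can narrow the output. For example (assuming the `soci-snapshotter.service` unit name from above):

```shell
# Follow new snapshotter logs as they arrive
sudo journalctl -u soci-snapshotter.service -f

# Only show logs from the current boot
sudo journalctl -u soci-snapshotter.service -b

# Only show logs newer than a given point in time
sudo journalctl -u soci-snapshotter.service --since "1 hour ago"
```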
The snapshotter emits Prometheus metrics. To collect and access these metrics, you need to configure `metrics_address` within SOCI's `config.toml` (located at `/etc/soci-snapshotter-grpc/config.toml` by default) before starting the snapshotter. You can provide any local TCP address or UNIX socket (if you are using a TCP address, make sure the port is not in use). To view the metrics, send a `GET` request via any HTTP client to the `/metrics` endpoint and Prometheus will return all the metrics emitted by the snapshotter.
```shell
$ cat /etc/soci-snapshotter-grpc/config.toml
metrics_address="localhost:8000"

$ curl localhost:8000/metrics
soci_fs_operation_duration_milliseconds_bucket{layer="sha256:328b9d3248edeb3ae6e7f9c347bcdb5632c122be218e6ecd89543cca9c8f1997",operation_type="init_metadata_store",le="1"} 1
soci_fs_operation_duration_milliseconds_bucket{layer="sha256:328b9d3248edeb3ae6e7f9c347bcdb5632c122be218e6ecd89543cca9c8f1997",operation_type="init_metadata_store",le="2"} 1
soci_fs_operation_duration_milliseconds_bucket{layer="sha256:328b9d3248edeb3ae6e7f9c347bcdb5632c122be218e6ecd89543cca9c8f1997",operation_type="init_metadata_store",le="4"} 1
soci_fs_operation_duration_milliseconds_bucket{layer="sha256:328b9d3248edeb3ae6e7f9c347bcdb5632c122be218e6ecd89543cca9c8f1997",operation_type="init_metadata_store",le="8"} 1
...
```
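Since the metrics are exposed as plain text, individual metrics can be filtered client-side. A minimal sketch, assuming the `localhost:8000` address configured above and the `operation_type` label pattern visible in the sample output:

```shell
# Histogram buckets for a single operation type
curl -s localhost:8000/metrics | grep 'operation_type="init_metadata_store"'
```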
Below is a list of metrics emitted by the snapshotter:
- Mount
  - `operation_duration_mount` (ms) - measures how long it takes to mount a layer during pull. Pulling should only take a couple of seconds, so a value above 3-5 seconds can indicate an issue while mounting.
  - `operation_duration_init_metadata_store` (ms) - measures the time it takes to parse a zTOC and prepare the respective metadata records in the metadata bbolt db (it records the layer digest as well). This is one of the components of pulling, so there should be a correlation between the time to parse a zTOC and update the metadata db and the duration of the layer mount operation.
- Fetch from remote registry
  - `operation_duration_remote_registry_get` (ms) - measures the time it takes to complete a `GET` operation from the remote registry for a specific layer. This metric should help identify network issues when layer data is fetched lazily and container start time increases.
- FUSE
  - `operation_duration_node_readdir` (us) - measures the time it takes to complete a readdir() operation for a file from a specific layer. The per-layer granularity reflects that each layer has its own FUSE mount, so it doesn't make sense to generalize. Large readdir times may indicate problems with the request speed from the metadata db or issues with the FUSE implementation (less likely, since this part is least likely to be modified).
  - `operation_duration_synchronous_read` (us) - measures the duration of a FUSE read() operation for a specific FUSE mountpoint, defined by the layer digest.
  - `synchronous_read_count` - measures how many read() operations have been issued for a specific FUSE mountpoint (defined by the layer digest) to date. The same value can be obtained from `operation_duration_synchronous_read` as the Count property.
  - `synchronous_bytes_served` - measures the number of bytes served for synchronous reads.
  - `fuse_mount_failure_count` - the number of times the snapshotter falls back to a normal overlay mount instead of mounting the layer as a FUSE mount.
  - `background_span_fetch_failure_count` - the number of errors during span fetches by the background fetcher.
  - `background_span_fetch_count` - the number of spans fetched by the background fetcher.
  - `background_fetch_work_queue_size` - the number of items in the background fetcher's work queue.
  - `operation_duration_background_fetch` (ms) - the time to complete a background fetch for a layer.
  - Individual FUSE operation failure counts (these can be scanned in one query; see the example after this list):
    - `fuse_node_getattr_failure_count`
    - `fuse_node_listxattr_failure_count`
    - `fuse_node_lookup_failure_count`
    - `fuse_node_open_failure_count`
    - `fuse_node_readdir_failure_count`
    - `fuse_file_read_failure_count`
    - `fuse_file_getattr_failure_count`
    - `fuse_whiteout_getattr_failure_count`
    - `fuse_unknown_operation_failure_count`
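A quick way to check whether any FUSE operations are failing is to scan all of the failure counters in one query (again assuming the `localhost:8000` metrics address from the example above; the exact exposed metric prefixes may differ on your build):

```shell
curl -s localhost:8000/metrics | grep -E 'fuse_.*_failure_count'
```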
Below is a list of common scenarios that may occur during pulling and the lifetime of a running container. For scenarios not covered, please feel free to open an issue.
Note: To allow for more verbose logging, you can set the `--log-level` flag to `debug` when starting the snapshotter.
While pulling, the image manifest, config, and layers without zTOCs are fetched directly from the remote registry. Layers that have a zTOC are mounted as a FUSE file system and will be pulled lazily when launching a container.
Below is a list of common error paths that may occur in this phase:
If you notice that all layers are being fetched for an image, or that FUSE mounts are not being created for layers with a zTOC, then remote mounting has failed for those layers. Once you inspect the logs, you should come across an error log containing the message `failed to prepare remote snapshot` with an `error` key describing the error that was propagated up within the snapshotter.
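If the snapshotter runs under systemd, you can search for these entries directly (assuming the `soci-snapshotter.service` unit name from earlier):

```shell
sudo journalctl -u soci-snapshotter.service | grep "failed to prepare remote snapshot"
```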
Some possible error keys include:
- `skipping mounting layer <layer_digest> as FUSE mount: no zTOC for layer`

  This "error" message is not indicative of a true error, but rather implies that the current layer does not have an associated zTOC. This is expected for layers that do not meet the minimum layer size criteria established when creating the SOCI index/zTOCs.

- `unable to fetch SOCI artifacts: <error>`

  This error indicates that the SOCI index, along with the corresponding zTOCs, could not be fetched from the remote registry. This can happen for a variety of reasons; the most common is that the resolver could not authenticate against the remote registry. The snapshotter uses the docker resolver to resolve blobs in the remote, so you must authenticate with docker first. If you are using ECR as your registry you can:

  ```shell
  export ECR_PASS=$(aws ecr get-login-password --region <region>)
  echo $ECR_PASS | sudo docker login -u AWS --password-stdin $ECR_REGISTRY
  ```
Note: SOCI artifacts are only fetched when preparing the first layer. If they cannot be fetched, the snapshotter will fall back entirely to the default snapshotter configured (e.g., overlayfs).
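To confirm whether remote (FUSE) mounts were actually created, you can check the mount table. A sketch, assuming the default `/var/lib/soci-snapshotter-grpc` snapshotter root (the exact mountpoint layout may vary):

```shell
# FUSE mounts created by the snapshotter should appear under its root directory
mount | grep fuse | grep soci-snapshotter-grpc
```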
If you notice that pulling takes a considerable amount of time you can:

- Look for `failed to resolve layer (timeout)` within the logs. Remote mounts may take too long if something's wrong with layer resolving. By default, remote mounts time out after 30 seconds if a layer can't be resolved.
- Check the `operation_duration_mount` metric to see if mounting a layer takes an unusually long time (see the query sketch after this list). A pull should take a couple of seconds, so check whether any of these operations take more than 3-5 seconds.
- Parsing zTOCs and initializing the metadata db is part of the pull command. You can check the `operation_duration_init_metadata_store` metric to see if initializing the metadata bbolt db is too slow.
- Look for HTTP failure codes in the logs. Such logs contain the message `Received status code`.
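A minimal sketch of the mount-duration check, assuming the `localhost:8000` metrics address configured earlier and that mount durations are exposed under the same `operation_type` label pattern as the sample output:

```shell
# Histogram buckets for layer mount durations; look for counts landing in large buckets
curl -s localhost:8000/metrics | grep 'operation_type="mount"'
```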
The background fetcher is initialized as soon as the snapshotter starts. If you have not explicitly disabled it via the snapshotter's config, it will perform network requests to fetch data during and after pulling. To analyze the background fetcher you can:

- Look at the `background_span_fetch_failure_count` metric to determine how many times a background fetch failed.
- Look at the `background_span_fetch_count` metric to determine how many spans were fetched by the background fetcher. If this number is 0, it may indicate network failures (see the polling sketch after this list).
- Look for `Retrying request` within the logs to determine the error and response returned from the remote registry.
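A simple way to watch the background fetcher make progress (or stall) is to poll its counters, assuming the `localhost:8000` metrics address from earlier:

```shell
# Re-query every 5 seconds; a queue size that never shrinks or a growing
# failure count points at network problems
watch -n 5 'curl -s localhost:8000/metrics | grep background_'
```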
A running container produces many read requests. If there is a read request for a file residing within a lazily loaded layer, the read request is routed through the layer's FUSE filesystem. This path can produce several different errors:

Look for `failed to read the file` or `unexpected copied data size for on-demand fetch` in the logs.
Corrupt Data

- Span verification failures can occur if the fetched data is corrupt or has been altered since zTOC creation. You can look for `span digests do not match` within the logs to verify that this is the root cause.
- Check whether the zTOC contains appropriate data by running the `soci ztoc info <digest>` command to inspect it (see the example after this list). If the dictionaries are all 0-ed, the zTOC initially generated and subsequently pulled was corrupt.
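For example, you can locate the zTOC digest for a suspect layer and then inspect it (the digest below is a placeholder):

```shell
# List all zTOCs, then inspect one by digest
soci ztoc list
soci ztoc info sha256:<ztoc-digest>
```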
Network Failures

The snapshotter contains custom retry logic when fetching spans (data) from the remote registry. By default, it will try to fetch from the remote a maximum of 9 times before returning an error.

- You can look for `retrying request` within the logs to determine the error and response returned from the remote registry.
- You can also check the `operation_duration_remote_registry_get` metric to see how long it takes to complete a `GET` from the remote registry.
Removing an image additionally removes any associated snapshots. A simple `sudo nerdctl image rm [image tag]` should remove all snapshots associated with the image before the image itself is removed. You can confirm the image is gone by ensuring it is no longer present in `sudo nerdctl image ls`.
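If you suspect snapshots were left behind, you can list them directly with containerd's `ctr` tool. A sketch, assuming the snapshotter is registered under the name `soci` and images live in the `default` namespace:

```shell
sudo ctr --namespace default snapshots --snapshotter soci ls
```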
While a graceful stop/restart of the process is preferred, sometimes it is not possible or is simply easier to kill the process. However, this often comes with issues that prevent the snapshotter from starting properly, particularly if a previously loaded snapshot is still present. For instance, if you pull from a repository that requires credentials and then stop the snapshotter, the snapshotter will fail to start properly if those credentials have expired by the time the daemon is restarted.
Many errors related to loaded snapshots can be bypassed by setting `allow_invalid_mounts_on_restart=true` in `/etc/soci-snapshotter-grpc/config.toml`. Note that the snapshotter will likely load in a broken state, and common functionality (such as pulling another image) will be unavailable until the currently loaded snapshot is removed.
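For example (a sketch that appends to the default config path; it assumes the key is not already set and that the unit is named `soci-snapshotter.service`):

```shell
echo 'allow_invalid_mounts_on_restart = true' | sudo tee -a /etc/soci-snapshotter-grpc/config.toml
sudo systemctl restart soci-snapshotter
```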
If all else fails, these steps should bring you back to a clean slate. (NOTE: this includes wiping your entire container store, so be sure to back up any important files.)
```shell
sudo killall -2 soci-snapshotter-grpc # SIGINT allows for a more graceful cleanup; omit the -2 flag to send a SIGTERM instead
# If necessary, unmount any remaining FUSE mounts, though a SIGINT to the daemon should handle that automatically
sudo rm -rf /var/lib/containerd
sudo rm -rf /var/lib/soci-snapshotter-grpc
sudo systemctl restart containerd
sudo soci-snapshotter-grpc
```
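If any FUSE mounts linger after the daemon exits, they can be found and unmounted by hand. A sketch, assuming the default `/var/lib/soci-snapshotter-grpc` root:

```shell
# Unmount every leftover mount under the snapshotter root
mount | grep /var/lib/soci-snapshotter-grpc | awk '{print $3}' | xargs -r sudo umount
```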
Here are some SOCI CLI commands that can be helpful when debugging issues relating to SOCI indices/zTOCs:
| SOCI CLI Command | Description |
|---|---|
| `soci ztoc get-file` | retrieve a file from a local image layer using a specified zTOC |
| `soci ztoc info` | get detailed info about a zTOC (list of files and offsets, number of spans, etc.) |
| `soci ztoc list` | list all zTOCs |
| `soci index info` | retrieve the contents of an index |
| `soci index list [--ref <image-ref>]` | list indices across all images, or filter to those associated with a specific image ref |
| `soci index rm [--ref <image-ref>]` | remove indices from the local db, or only those associated with a specific image ref |
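For example, to see which indices exist for a given image and then remove them (the image ref is a placeholder):

```shell
soci index list --ref <registry>/<repository>:<tag>
soci index rm --ref <registry>/<repository>:<tag>
```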
We can use Go's `pprof` tool to profile the snapshotter. To enable profiling, you must set the `debug_address` within the snapshotter's config (default: `/etc/soci-snapshotter-grpc/config.toml`):
```toml
debug_address = "localhost:6060"
```
Once you have configured the debug address, you can send a `GET` request to the `/debug/pprof/profile` endpoint to receive a CPU profile of the snapshotter. You can specify an optional `seconds` argument to limit profiling to a certain time span:
```shell
curl http://localhost:6060/debug/pprof/profile?seconds=40 > out.pprof
```
You can use the `pprof` tool provided by the Go CLI to visualize the data within a web browser:
```shell
go tool pprof -http=:8080 out.pprof
```
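The same debug address exposes the other standard Go `net/http/pprof` endpoints, which can help with memory or goroutine issues. A sketch, assuming the `localhost:6060` address above and that the standard handlers are registered:

```shell
# Heap profile (memory usage)
curl http://localhost:6060/debug/pprof/heap > heap.pprof

# Plain-text dump of all goroutines, useful for spotting stuck ones
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutines.txt
```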