
The big fat image problem #22

Closed

shykes opened this issue May 22, 2023 · 3 comments

shykes commented May 22, 2023

Problem

When discussing Github Actions and Act, a problem I keep hearing about is "the big fat image". In order to reproduce the original Github Actions environment in a container, you need to download a massive Docker image - tens of GB. That makes things slow and expensive, and makes many other improvements harder.

So it seems that this "BFI" is a bottleneck; and removing that bottleneck would make a lot of people in the Github Actions ecosystem happier and more productive. But how?

I am starting this issue to invite conversation and debate on this topic. I know @aweris and @tiborvass have opinions. Perhaps @kjuulh and @cpdeethree too? :)

Solution

To get us started, I will list possible solutions that were mentioned to me at some point.

  • Just keep the BFI. This is the default option, where we decide this problem is not really that big of a problem in practice. Therefore there is no need to change what "ain't broke".

  • Dockerfiles. Just write a custom Dockerfile (or equivalent) describing the dependencies for your particular workflow. As your workflow changes and evolves, it's your responsibility to keep that Dockerfile up-to-date. This involves trial and error, since the upstream developer of the Github Action doesn't share a dependency list with you: you have to run the thing, wait for an error, then try again. Sounds cumbersome, but maybe in practice it's fine?

  • Error parsing. Dynamically run the thing, catch errors, dynamically infer from the error message what is missing, modify the container image accordingly, then try again. Do this with just the right balance of magic and manual configuration, so that it's easy to get started, but you never get stuck when the default configuration doesn't work. For example, "command not found: go" would be caught and resolved with "apk add golang" - note that my example involves a distro-specific solution. (A rough sketch of this mapping follows this list.)

  • Syscall tracing. Same idea as "error parsing", but uses strace and other system-level tracing to catch errors at a deeper layer. Same general problem of balancing magic and manual control. As discussed with @aweris

  • stargz: take advantage of Dagger/buildkit's support for stargz to keep the "BFI", but make everything fast and lightweight because only the files that are needed get downloaded, just in time. This could be the best of both worlds: no magical tooling to develop or massive packaging/annotation effort to scale to the whole Github Actions ecosystem; but we get the benefits anyway. I am lazy so I am instinctively drawn to the solution with the most benefits and the least work needed :) But there may be a catch that makes this option simply infeasible. cc @sipsma @tiborvass

  • Other?
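
To make the "error parsing" option a bit more concrete, here is a minimal Go sketch of the core step: turning a "command not found" error into a package to install before retrying. The regex and the command-to-package table are illustrative assumptions (no such mapping ships with any tool today), it assumes an Alpine-based image, and the loop that actually re-runs the step in a rebuilt container is left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// notFound matches shell errors like `sh: command not found: go`.
// The exact message varies by shell and distro; this pattern is only an illustration.
var notFound = regexp.MustCompile(`command not found: (\S+)`)

// apkPackages is a hypothetical mapping from missing commands to Alpine
// packages; a real tool would need a much larger, distro-aware table.
var apkPackages = map[string]string{
	"go":   "golang",
	"node": "nodejs",
	"git":  "git",
}

// suggestPackage inspects a step's stderr and, if it recognizes a
// "command not found" error, returns the apk package to install before retrying.
func suggestPackage(stderr string) (string, bool) {
	m := notFound.FindStringSubmatch(stderr)
	if m == nil {
		return "", false
	}
	pkg, ok := apkPackages[m[1]]
	return pkg, ok
}

func main() {
	stderr := "sh: command not found: go"
	if pkg, ok := suggestPackage(stderr); ok {
		fmt.Printf("would run: apk add %s, then retry the step\n", pkg)
	}
}
```

The hard part is not this lookup but deciding when the automation should give up and hand control back to the user, which is exactly the magic/manual balance mentioned above.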

aweris (Owner) commented May 23, 2023

Just to highlight the issue, here is Act's information about its runner image sizes.

? Please choose the default image you want to use with act:

  - Large size image: +20GB Docker image, includes almost all tools used on GitHub Actions (IMPORTANT: currently only ubuntu-18.04 platform is available)
  - Medium size image: ~500MB, includes only necessary tools to bootstrap actions and aims to be compatible with all actions
  - Micro size image: <200MB, contains only NodeJS required to bootstrap actions, doesn't work with all actions

And here is actions/runner-images/ubuntu-22.04, the list of tools and libraries installed on the ubuntu-latest machines that we're trying to replicate.


For the solutions:

Just keep the BFI. This is the default option, where we decide this problem is not really that big of a problem in practice. Therefore there is no need to change what "ain't broke".

While this approach may be suitable for the MVP and early phases, it could potentially disrupt our goal of achieving a smooth and uninterrupted user experience in the long term.

Dockerfiles. Just write a custom Dockerfile (or equivalent) describing the dependencies for your particular workflow. As your workflow changes and evolves, it's your responsibility to keep that Dockerfile up-to-date. This involves trial and error, since the upstream developer of the Github Action doesn't share a dependency list with you: you have to run the thing, wait for an error, then try again. Sounds cumbersome, but maybe in practice it's fine?

In my opinion, this solution can be easily accomplished. Although it may not result in the best user experience, it is straightforward and adaptable. As a user, I would be satisfied if the customization process is quick and effortless.

Error parsing. Dynamically run the thing, catch errors, dynamically infer from the error message what is missing, modify the container image accordingly, then try again. Do this with just the right balance of magic and manual configuration, so that it's easy to get started, but you never get stuck when the default configuration doesn't work. For example, "command not found: go" would be caught and resolved with "apk add golang" - note that my example involves a distro-specific solution.

Syscall tracing. Same idea as "error parsing", but uses strace and other system-level tracing to catch errors at a deeper layer. Same general problem of balancing magic and manual control. As discussed with @aweris

As a user, I would be satisfied with these two options. Additionally, this could aid in diagnosing issues. However, I am concerned about the amount of effort or investment required for this code and how to balance it with manual effort.
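
To give a rough sense of what the tracing side could look like (and how much plumbing it needs), here is a minimal Go sketch of the syscall-tracing idea. It assumes strace is available alongside the step being traced and that collecting ENOENT results from openat is enough to surface missing files; mapping those paths back to packages, which is the genuinely hard part, is not shown.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"regexp"
)

// Matches trace lines like:
// openat(AT_FDCWD, "/usr/bin/go", O_RDONLY) = -1 ENOENT (No such file or directory)
var enoent = regexp.MustCompile(`openat\(.*?"([^"]+)".*= -1 ENOENT`)

// missingPaths runs `strace -f -e trace=openat <step...>` and returns every
// path the step tried to open but could not find.
func missingPaths(step ...string) ([]string, error) {
	args := append([]string{"-f", "-e", "trace=openat"}, step...)
	cmd := exec.Command("strace", args...)
	stderr, err := cmd.StderrPipe() // strace writes its trace to stderr
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	var paths []string
	sc := bufio.NewScanner(stderr)
	for sc.Scan() {
		if m := enoent.FindStringSubmatch(sc.Text()); m != nil {
			paths = append(paths, m[1])
		}
	}
	_ = cmd.Wait() // the traced step may legitimately exit non-zero
	return paths, sc.Err()
}

func main() {
	paths, err := missingPaths("go", "version")
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	for _, p := range paths {
		fmt.Println("missing:", p)
	}
}
```

Compared to parsing error messages, this catches failures that never surface in a readable message, but it is noisy (every optional config file a tool probes for shows up as ENOENT), so some filtering heuristics would be unavoidable.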

stargz: take advantage of Dagger/buildkit's support for stargz to keep the "BFI", but make everything fast and lightweight because only the files that are needed get downloaded, just in time. This could be the best of both worlds: no magical tooling to develop or massive packaging/annotation effort to scale to the whole Github Actions ecosystem; but we get the benefits anyway. I am lazy so I am instinctively drawn to the solution with the most benefits and the least work needed :) But there may be a catch that makes this option simply infeasible.

It's quick to start and only downloads the necessary components during execution, which is convenient. Additionally, it doesn't require any user input. However, my understanding of 'stargz'/'estargz' is limited, so I'm not certain if it will function as I envision.


Here's what I envision for a solution:

First I would try stargz or estargz (see the sketch after the pros and cons below).

Pros:

  • No magical tooling
  • One image for all workflows, everyone would only use what they need

Cons:

  • Not a common solution, not sure if theory and practice would match
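
Since the open question is mainly whether eStargz behaves in practice the way we hope, one small, low-risk experiment is to check whether a given image is already lazily pullable: eStargz layers carry a TOC digest annotation in the image manifest. Below is a rough Go sketch of that check, assuming the github.com/google/go-containerregistry module; the image reference is only an example and may not exist.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
)

// Annotation set on each layer by eStargz-aware builders/converters.
const tocDigestAnnotation = "containerd.io/snapshot/stargz/toc.digest"

// manifest captures only the fields we need from the OCI image manifest.
type manifest struct {
	Layers []struct {
		Annotations map[string]string `json:"annotations"`
	} `json:"layers"`
}

func main() {
	ref := "ghcr.io/stargz-containers/node:18-esgz" // example reference, may not exist
	raw, err := crane.Manifest(ref)                 // fetch the raw manifest from the registry
	if err != nil {
		log.Fatal(err)
	}
	var m manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		log.Fatal(err)
	}
	for i, layer := range m.Layers {
		_, ok := layer.Annotations[tocDigestAnnotation]
		fmt.Printf("layer %d: eStargz=%v\n", i, ok)
	}
}
```

If the base image we care about is not eStargz yet, it would have to be converted and re-pushed (the stargz-snapshotter project ships tooling for that), which is also where a hidden catch could show up.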

If stargz does not work, then I would start with "Just keep the BFI" and incrementally introduce a combined solution: a custom Dockerfile/config format with automated generation, kept up to date using error parsing/syscall tracing.

Pros:

  • User complexity and involvement are minimal
  • Adds dependency mapping and additional ways to diagnose issues as a side effect

Cons:

  • Too much magic code is involved.
  • There may be friction between the automated and manual processes. To minimize user interaction, a comprehensive automated background process is needed to detect, update, and maintain the workflow configuration.

sheldonhull commented

Devcontainer and build/CI agent containers are large because they aim to eliminate the need for setting up numerous common dependencies. They are generally better than the hassle of maintaining and updating the latest versions of common tools.

Personally, I'm okay with downloading a large, pre-configured image. I've done this before with devcontainers that include everything I need, even if they are not fast or small. My priority is functionality and reducing my testing loop, so I don't mind the size. Trimming things down would only create more debugging work for me. I did this recently with a simulated GitLab runner. Even if it takes a while to download, it gives a better feedback loop, similar to the gains of using Dagger. The size is a tradeoff, but an acceptable one.

However, I have started to move away from using huge images and now prefer starting with an Ubuntu base image and adding my own tooling, as it helps reduce the size.

If it's a CI test image, I'm okay with it being larger if it allows for proper testing.

If I want to manage the environment precisely in a deterministic way, I would consider using Nix. However, it has a steep learning curve and adoption challenges within an organization.

Personally, I don't want to deal with managing more image definitions. I'm fine with a few apt-get installs in my CI job instead of trying to pre-build every CI image.

I'm not sure how helpful this information is, but I thought I'd share my perspective.

aweris (Owner) commented Nov 8, 2023

It's a low-priority issue. It may not be worth investing time in, as we haven't felt pressure in this direction.

aweris closed this as completed Nov 8, 2023