How to Fetch a Commit

njlr edited this page Mar 20, 2019 · 1 revision

An early design decision we made for Buckaroo was that package installation should require only the lock-file. This guarantees that everyone (including your CI server) is always working from the exact same versions of your dependencies.

But it also leads to a challenge: how do we fetch a package given only a Git URL and a commit hash?

A First Attempt

The simplest solution is the following:

git clone $GIT_URL $PACKAGE_DIR
cd $PACKAGE_DIR
git checkout $GIT_COMMIT

But this is inefficient. Cloning an entire repository can take anywhere from a few seconds to hours, depending on the size of its history. We only need the code at one commit, not the entire history.

Shallow Clones

Many of you are probably already thinking: why not do a shallow clone?

git clone --depth=1 -n $GIT_URL $PACKAGE_DIR
cd $PACKAGE_DIR
git checkout $GIT_COMMIT

But this only works if $GIT_COMMIT is the latest commit on the default branch, since a depth-1 clone fetches only the branch tip.

What about fetch?

git fetch origin $GIT_COMMIT

The problem is that to fetch an arbitrary commit into a shallow clone, the Git server must enable this feature (the uploadpack.allowReachableSHA1InWant option) on its side. GitHub, our most common package host, does not.

error: Server does not allow request for unadvertised object...

An advertised object is one at the tip of a branch or a tag. We can query these using ls-remote:

git ls-remote $GIT_URL

What's really nice about ls-remote is that we don't even need to clone first!
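As a sketch (the helper name is illustrative, not part of Buckaroo), checking whether a commit is advertised is a one-line pipe over the ls-remote output:

```shell
# Exit status 0 means the commit is advertised on the remote,
# i.e. it is the tip of a branch or a tag. No clone required.
# $1 = Git URL, $2 = full commit hash.
is_advertised() {
  git ls-remote "$1" | cut -f1 | grep -q "^$2$"
}
```

Usage: `is_advertised "$GIT_URL" "$GIT_COMMIT"`. Note that ls-remote prints `<hash>\t<ref>` lines, so we take the first tab-separated field and match the whole hash.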

This gives us a more efficient procedure:

  1. Use ls-remote to determine if the commit is advertised
     1.1. If so, do a shallow clone and fetch
     1.2. If not, do a full clone
  2. Checkout the commit
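The steps above can be sketched as a single shell function (a sketch, not Buckaroo's actual implementation; the function name is illustrative):

```shell
# Fetch one commit as cheaply as possible:
# shallow-fetch when the commit is advertised, full clone otherwise.
# $1 = Git URL, $2 = full commit hash, $3 = destination directory.
fetch_commit() {
  url="$1"; commit="$2"; dir="$3"
  if git ls-remote "$url" | cut -f1 | grep -q "^$commit$"; then
    # Advertised: a depth-1 fetch of just this commit is always allowed.
    git init -q "$dir"
    git -C "$dir" fetch -q --depth=1 "$url" "$commit"
  else
    # Not advertised: fall back to a full clone.
    git clone -q "$url" "$dir"
  fi
  git -C "$dir" checkout -q "$commit"
}
```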

An Even Better Solution?

This solution is pretty good, but can we do better?

It turns out that git fetch has an option called --deepen. This will fetch additional commits beyond those from previous fetches. With this, we can reach commits older than the tip in a shallow clone.

  1. Do a shallow clone
  2. Does the commit exist?
     2.1. Yes: do a checkout
     2.2. No: git fetch --deepen=n, then go to 2

In this way, we can walk backwards through the commit history from the branch tips to the commit that we need. In a very large repository, this can save us considerable fetching time.
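The loop above might look like this (an illustrative sketch; the function name and step size are not from Buckaroo):

```shell
# Widen a shallow clone in steps of $3 commits until $2 is present
# locally, then check it out. $1 = directory of the shallow clone.
deepen_until_found() {
  dir="$1"; commit="$2"; step="${3:-10}"
  until git -C "$dir" cat-file -e "$commit^{commit}" 2>/dev/null; do
    # If the repository is no longer shallow, we have the whole
    # history and the commit simply does not exist here.
    [ "$(git -C "$dir" rev-parse --is-shallow-repository)" = "true" ] || return 1
    git -C "$dir" fetch -q --deepen="$step"
  done
  git -C "$dir" checkout -q "$commit"
}
```

The shallowness check is what terminates the loop on failure: once --deepen has pulled in the root commit, Git drops the shallow marker, so a missing commit can no longer keep us fetching forever.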

Of course, --deepen assumes we know which branch we want to fetch the commit from. Where can we get this information? Well, during the resolution process (generating the lock-file), Buckaroo will have explored the branches and tags that contain the commit we locked down to. We store this information in the lock-file, allowing us to use it as a hint for later installs.

[lock."github.com/buckaroo-pm/boost-config"]
versions = [ "branch=master" ]
revision = "4392ed19b232ed2dde7623843d7e30ef669d860e"

Here we know that 4392ed will likely be found on master.
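Using that hint, the initial shallow clone can be restricted to the expected branch before any deepening happens (a sketch; the helper name is illustrative):

```shell
# Shallow-clone only the branch named in the lock-file hint, without
# checking out files (-n), ready to be deepened toward the commit.
# $1 = Git URL, $2 = branch hint, $3 = destination directory.
clone_with_hint() {
  git clone -q --depth=1 --single-branch --branch "$2" -n "$1" "$3"
}
```

--single-branch keeps the fetch from pulling in tips of unrelated branches, so each later --deepen walks backwards along the branch we actually expect the commit to be on.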

With this in place, we now have a series of increasingly expensive strategies to try:

  1. ls-remote
  2. Shallow clone
  3. Deepen the expected branch
  4. Full clone
  5. Fail

One Last Trick

We have one last trick for speeding things up. Each package in the packages folder (/buckaroo) is actually a Git repository. This allows us to upgrade a package very cheaply, since we only need to fetch the diff from the current version to the next.

Even better, the remote we fetch from can be a local Git repository! Buckaroo will try to fetch from its global Git cache before making requests to the remote.
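A cache lookup of this kind might be sketched as follows (the function name and fallback logic are illustrative, not Buckaroo's actual code; the cache is just another Git repository addressed by its path):

```shell
# Try to fetch the commit from a local cache repository first; only
# on a miss do we go to the network. $1 = package repo directory,
# $2 = path of the local cache repo, $3 = remote URL, $4 = commit.
fetch_via_cache() {
  git -C "$1" fetch -q "$2" "$4" 2>/dev/null ||
    git -C "$1" fetch -q "$3" "$4"
}
```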

Conclusion

  • Clones are expensive for large repositories
  • Shallow clones only work for commits near the branch tip
  • GitHub does not allow you to fetch arbitrary commits
  • Use ls-remote to discover the latest commits
  • Use --deepen to expand a shallow clone
  • Git history is singly-linked (commits point only to their parents), so save information when you have it