WIP: experiment with custom git archive command #424
For the last few days I've been experimenting with an alternative to git archive. go-git just seems to be slow: it is over twice as slow, even though it ends up needing to unmarshal far fewer objects. This approach is likely worth exploring further though, since I suspect it would scale with the size of the output if we were as fast as git. See this table for comparison
Note: megarepo was recorded on the git-combine pod. sourcegraph was recorded on my macbook.
From profiling, there are surprising things. For example 12% is spent in
The other approach I was considering next was writing this command in Rust. In the past I wrote a small program using Rust's libgit2 bindings and it was pleasant.
GIT_DIR=$PWD /usr/bin/time -v git archive HEAD 2> git-archive.time | wc -c > git-archive.size
GIT_DIR=$PWD /usr/bin/time -v git-sg 2> git-sg.time | wc -c > git-sg.size
Sourcegraph repo on the mac
Update from yesterday that I forgot to post: I spent a lot of time writing some fun integration with git-cat-file. The code is quite nice and performant, but it still doesn't beat git archive, even though archive sends 1.5x more data (145 MB vs 96 MB). This is on sourcegraph/sourcegraph. Hyperfine results:
Looking at CPU profiles for cat-file, we spend as much time running Info as Contents. To me this is a sign that the overhead of RPC / Info is not worth it. We could look into a queue-like design that sends multiple blob/info requests out before reading, but that seems complicated, and based on the perf I doubt it will be faster than archive. Final attempt in this experiment, mix together
Using ls-tree is pretty much the same speed as git archive on the sourcegraph repo. We only skip 7 files in that repo, which means it's hard to beat the speed of git archive. There is opportunity to make it faster:
I did some profiling, and this solution barely generated any garbage, so it is super efficient. This means I'll export the code and integrate it directly into gitserver to try and create an end-to-end demo. A note on buffering: testing with hyperfine, adding output buffering slowed it down slightly. I wonder if in practice though the buffer will be more important due to the output being over the network rather than to
The profile output was really hard to read due to the arbitrary nesting of calls to writeTree. This introduces a manual stack, but slightly adjusts the order of output. I'd prefer the normal DFS order to match git archive, but at least for profiling this is good for now.
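As an illustration of replacing recursion with a manual stack, the sketch below walks a small in-memory tree iteratively. It is not the PR's writeTree code: the `node` type and `walk` function are hypothetical. This variant pushes children in reverse so that they pop in their original order, which is one possible way to keep the usual pre-order DFS output.

```go
package main

import "fmt"

// node is a hypothetical in-memory tree entry; Dir marks a directory.
type node struct {
	Name     string
	Dir      bool
	Children []*node
}

// walk returns every file path under root using an explicit stack instead of
// recursion, so a profile shows one flat frame rather than nested calls.
func walk(root *node) []string {
	type frame struct {
		n    *node
		path string
	}
	var files []string
	stack := []frame{{n: root, path: ""}}
	for len(stack) > 0 {
		// Pop the most recently pushed frame.
		f := stack[len(stack)-1]
		stack = stack[:len(stack)-1]

		if !f.n.Dir {
			files = append(files, f.path)
			continue
		}
		// Push children in reverse so the first child is popped first.
		for i := len(f.n.Children) - 1; i >= 0; i-- {
			c := f.n.Children[i]
			p := c.Name
			if f.path != "" {
				p = f.path + "/" + c.Name
			}
			stack = append(stack, frame{n: c, path: p})
		}
	}
	return files
}

func main() {
	root := &node{Dir: true, Children: []*node{
		{Name: "a", Dir: true, Children: []*node{{Name: "x"}, {Name: "y"}}},
		{Name: "b"},
	}}
	for _, p := range walk(root) {
		fmt.Println(p)
	}
}
```

Pushing children in forward order instead would still visit every file, but siblings would come out reversed, which is the kind of ordering change a naive manual stack introduces.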
This is significantly faster than using go-git.
This will avoid allocations when using it.
We only read the entries field, so this makes it easier to use a different impl.
And it's pretty much the same speed as git archive on the sourcegraph repo. We only skip 7 files in that repo, which means it's hard to beat the speed of git archive.
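For readers unfamiliar with the ls-tree approach discussed in this conversation, the following is a minimal sketch of parsing `git ls-tree -z` output, whose records have the form `<mode> <type> <oid>\t<path>` and are NUL-terminated. The `treeEntry` type and `parseLsTree` function are hypothetical names for illustration; the PR's actual parser and its rules for skipping files are not shown here.

```go
package main

import (
	"bytes"
	"fmt"
)

// treeEntry is a hypothetical representation of one `git ls-tree -z` record.
type treeEntry struct {
	Mode string
	Type string
	OID  string
	Path string
}

// parseLsTree parses NUL-terminated records of the form
// "<mode> <type> <oid>\t<path>".
func parseLsTree(out []byte) ([]treeEntry, error) {
	var entries []treeEntry
	for _, rec := range bytes.Split(out, []byte{0}) {
		if len(rec) == 0 {
			continue // the trailing NUL leaves an empty final record
		}
		tab := bytes.IndexByte(rec, '\t')
		if tab < 0 {
			return nil, fmt.Errorf("missing tab in record %q", rec)
		}
		// Only the metadata is split on spaces; paths may contain spaces.
		fields := bytes.Fields(rec[:tab])
		if len(fields) != 3 {
			return nil, fmt.Errorf("unexpected metadata %q", rec[:tab])
		}
		entries = append(entries, treeEntry{
			Mode: string(fields[0]),
			Type: string(fields[1]),
			OID:  string(fields[2]),
			Path: string(rec[tab+1:]),
		})
	}
	return entries, nil
}

func main() {
	// Simulated output; the object IDs are shortened placeholders.
	out := []byte("100644 blob abc123\tREADME.md\x00100755 blob def456\tdev/run script.sh\x00")
	entries, err := parseLsTree(out)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s %s %s %q\n", e.Mode, e.Type, e.OID, e.Path)
	}
}
```

Using `-z` avoids git's path quoting, so paths with spaces or unusual characters can be taken verbatim from the bytes after the tab.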
00a6a29 to b07c069
This is our slowest implementation so far! I believe this is because gitobj has no caching between parsing packfiles so it pays the cost on each object retrieval.
The goal is to assess the viability of replacing git archive with a format and command optimized to only send what sourcegraph cares about.