Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing: properly block on shard building #689

Merged
merged 1 commit into from
Nov 14, 2023
Merged

Conversation

jtibshirani
Copy link
Member

@jtibshirani jtibshirani commented Nov 11, 2023

When indexing, we build shards in parallel based on the parallelism flag.
Each shard handles ~100MB of document contents, which should limit the memory
usage to roughly 100MB * parallelism.

Looking at the size of the buffered document contents in memory profiles, we
see much higher usage than this. The issue seems to be that we continue to
buffer up documents even if all threads are busy building shards. This can be a
real problem if shards take a super long time to build (say because ctags is
slow) -- we could end up buffering a ton of content into memory at once.

This change fixes the throttling logic so we block indexing when all threads
are busy building shards.

@jtibshirani
Copy link
Member Author

jtibshirani commented Nov 11, 2023

I noticed this when I took a memory profile with -parallelism 4. Before the change, there is consistently over 2GB attributed to bytes.growSlice, which is what holds the document contents:

Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
    2.14GB 62.83% 62.83%     2.14GB 62.83%  bytes.growSlice
    0.51GB 15.07% 77.89%     0.51GB 15.07%  github.com/go-git/go-git/v5/plumbing/format/idxfile.(*MemoryIndex).genOffsetHash
    0.30GB  8.91% 86.80%     0.30GB  8.91%  github.com/go-git/go-git/v5/plumbing/format/idxfile.readObjectNames

After, this is consistently only 700MB:

Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
  762.59MB 37.60% 37.60%   762.59MB 37.60%  bytes.growSlice
  523.74MB 25.82% 63.42%   523.74MB 25.82%  github.com/go-git/go-git/v5/plumbing/format/idxfile.(*MemoryIndex).genOffsetHash
  304.30MB 15.00% 78.42%   304.30MB 15.00%  github.com/go-git/go-git/v5/plumbing/format/idxfile.readObjectNames

So this seems to fix an important issue. I need to look further into why this is still not closer to 500MB (what you'd expect from 100MB buffer + 100MB * 4 threads) -- I think there is some overhead from git-go.

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!! So now we will have at most parallelism + 1 shards in memory right? Since you can have parallelism documents having buildShard called on, and then 1 full todo slice trying to have flush called on it?

I double checked calls to flush and th euse of the b.building waitgroup. I don't see any issues with potential deadlocks/etc. LGTM!

@jtibshirani
Copy link
Member Author

Indeed now it will be at most parallelism + 1 shards in memory, plus whatever memory for building index structures. Thanks for double-checking the concurrency logic!

@jtibshirani jtibshirani merged commit 5e2620e into main Nov 14, 2023
8 checks passed
@jtibshirani jtibshirani deleted the jtibs/index-throttle branch November 14, 2023 16:08
@jtibshirani
Copy link
Member Author

@keegancsmith @stefanhengl general note about the indexing memory fixes: I plan to let these "bake" for ~2 weeks on S2 / dot com before backporting this to a 5.2 patch. I'm being pretty conservative since it's very core code and this logic hasn't been touched in a while.

jtibshirani added a commit that referenced this pull request Nov 16, 2023
When indexing, we build shards in parallel based on the `parallelism` flag.
Each shard handles ~100MB of document contents, which should limit the memory
usage to roughly `100MB * parallelism`.

Looking at the size of the buffered document contents in memory profiles, we
see much higher usage than this. The issue seems to be that we continue to
buffer up documents even if all threads are busy building shards. This can be a
real problem if shards take a super long time to build (say because ctags is
slow) -- we could end up buffering a ton of content into memory at once.

This change fixes the throttling logic so we block indexing when all threads
are busy building shards.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants