Indexing: properly block on shard building #689

jtibshirani · 2023-11-11T02:04:23Z

When indexing, we build shards in parallel based on the parallelism flag.
Each shard handles ~100MB of document contents, which should limit the memory
usage to roughly 100MB * parallelism.

Looking at the size of the buffered document contents in memory profiles, we
see much higher usage than this. The issue seems to be that we continue to
buffer up documents even if all threads are busy building shards. This can be a
real problem if shards take a super long time to build (say because ctags is
slow) -- we could end up buffering a ton of content into memory at once.

This change fixes the throttling logic so we block indexing when all threads
are busy building shards.

jtibshirani · 2023-11-11T02:13:17Z

I noticed this when I took a memory profile with -parallelism 4. Before the change, there is consistently over 2GB attributed to bytes.growSlice, which is what holds the document contents:

Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
    2.14GB 62.83% 62.83%     2.14GB 62.83%  bytes.growSlice
    0.51GB 15.07% 77.89%     0.51GB 15.07%  github.com/go-git/go-git/v5/plumbing/format/idxfile.(*MemoryIndex).genOffsetHash
    0.30GB  8.91% 86.80%     0.30GB  8.91%  github.com/go-git/go-git/v5/plumbing/format/idxfile.readObjectNames

After the change, this is consistently only ~700MB (762.59MB in this profile):

Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
  762.59MB 37.60% 37.60%   762.59MB 37.60%  bytes.growSlice
  523.74MB 25.82% 63.42%   523.74MB 25.82%  github.com/go-git/go-git/v5/plumbing/format/idxfile.(*MemoryIndex).genOffsetHash
  304.30MB 15.00% 78.42%   304.30MB 15.00%  github.com/go-git/go-git/v5/plumbing/format/idxfile.readObjectNames

So this seems to fix an important issue. I need to look further into why this is still not closer to 500MB (what you'd expect from a 100MB buffer + 100MB * 4 threads) -- I think there is some overhead from go-git.

keegancsmith

Nice catch! So now we will have at most parallelism + 1 shards in memory, right? You can have parallelism batches of documents with buildShard running on them, plus 1 full todo slice waiting for flush to be called on it.

I double-checked the calls to flush and the use of the b.building waitgroup. I don't see any issues with potential deadlocks, etc. LGTM!

jtibshirani · 2023-11-14T16:07:24Z

Indeed now it will be at most parallelism + 1 shards in memory, plus whatever memory for building index structures. Thanks for double-checking the concurrency logic!

jtibshirani · 2023-11-14T16:11:03Z

@keegancsmith @stefanhengl general note about the indexing memory fixes: I plan to let these "bake" for ~2 weeks on S2 / dot com before backporting this to a 5.2 patch. I'm being pretty conservative since it's very core code and this logic hasn't been touched in a while.

Indexing: properly block on shard building

acf4e7a

jtibshirani marked this pull request as ready for review November 11, 2023 02:13

jtibshirani mentioned this pull request Nov 11, 2023

☂️ Search: improve Zoekt indexing sourcegraph/sourcegraph-public-snapshot#58133

Closed

14 tasks

keegancsmith approved these changes Nov 14, 2023

jtibshirani merged commit 5e2620e into main Nov 14, 2023
8 checks passed

jtibshirani deleted the jtibs/index-throttle branch November 14, 2023 16:08

jtibshirani mentioned this pull request Nov 20, 2023

zoekt: only one ctags process per build sourcegraph/sourcegraph-public-snapshot#58112

Closed
