
Indexing: respect indexing buffer limit #686

Merged: jtibshirani merged 1 commit from jtibs/drop-content into main on Nov 10, 2023

Conversation

@jtibshirani (Member) commented Nov 10, 2023

When indexing documents, we buffer up documents until we reach the shard size
limit (100MB), then flush the shard. If we decide to skip a document because
it's a binary file, then (naturally) we don't count its content size towards
the shard limit. But we still buffer the full document. So if there are a large
number of binary files, we can easily blow past the 100MB limit and run into
memory issues.

This change simply clears `Content` whenever `SkipReason` is set. The
invariant: a buffered document should only ever have `SkipReason` or `Content`,
not both.
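
A minimal sketch of the invariant, assuming it is enforced where documents enter the buffer (the helper name here is hypothetical, not the actual change):

```go
package build

import "github.com/sourcegraph/zoekt"

// normalizeSkipped (hypothetical helper) enforces the invariant this PR
// introduces: a buffered document holds SkipReason or Content, not both.
func normalizeSkipped(doc *zoekt.Document) {
	if doc.SkipReason != "" {
		// These bytes will never be indexed, so holding on to them
		// only inflates the buffer past the shard size limit.
		doc.Content = nil
	}
}
```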

Update: this also fixes a bug where we still ran ctags even if we identified a
file was binary and should be skipped. Now, we avoid running ctags in these
cases.

@jtibshirani (Member, Author) commented:

I started digging into this after I noticed indexserver memory spikes on S2:

(screenshot: indexserver memory spikes, 2023-11-09 5:22 PM)

I then correlated these with a large Perforce repo that is consistently failing to index:

19:23:40.318963	 .  4440	... error: command [zoekt-git-index -submodules=false -incremental -branches HEAD -language_map c_sharp:scip,go:scip,python:scip,scala:scip,typescript:scip,kotlin:scip,ruby:scip,javascript:scip,rust:scip,zig:scip -file_limit 1048576 -parallelism 8 -index /data/index -require_ctags -large_file **/fixtures.json /data/index/.indexserver.tmp/perforce-sgdev-org%2Fdevx-80k-files.git] failed: signal: killed OUT: 2023/11/09 19:22:30 attempting to index 86024 total files
19:23:40.318966	 .     3	... state: fail

The repo contents are auto-generated and contain a large number of non-source files. With this fix, Zoekt no longer chokes on indexing the repo.

@jtibshirani marked this pull request as ready for review on November 10, 2023 01:27
@keegancsmith (Member) left a comment

Nice find. What is surprising, though, is that we avoid reading large files from git. However, it is possible those files end up being excluded for other reasons afterwards. Or we just build up a very large number of files that are excluded (e.g. a third_party or vendor dir). I.e., this is still a great change.

Given that we sometimes set `SkipReason` before even looking at a document, maybe we should update `Builder.Add` to skip most of the work if `SkipReason` is already set? (A sketch of this follows the snippet below.)

zoekt/gitindex/index.go, lines 557 to 572 in db067d1:

```go
	if blob.Size > int64(opts.BuildOptions.SizeMax) && !opts.BuildOptions.IgnoreSizeMax(keyFullPath) {
		if err := builder.Add(zoekt.Document{
			SkipReason:        fmt.Sprintf("file size %d exceeds maximum size %d", blob.Size, opts.BuildOptions.SizeMax),
			Name:              keyFullPath,
			Branches:          brs,
			SubRepositoryPath: key.SubRepoPath,
		}); err != nil {
			return err
		}
		continue
	}

	contents, err := blobContents(blob)
	if err != nil {
		return err
	}
```
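
A rough sketch of that early exit; the internal helper `bufferDoc` is hypothetical, and the real `Builder.Add` has more steps:

```go
// Hypothetical early exit in Builder.Add: if the caller has already
// decided to skip this document, avoid the text checks, language
// detection, and ctags parsing entirely.
func (b *Builder) Add(doc zoekt.Document) error {
	if doc.SkipReason != "" {
		doc.Content = nil // invariant from this PR: SkipReason or Content, not both
		return b.bufferDoc(doc) // hypothetical helper: record metadata only
	}
	// ... full path: text checks, symbol parsing, buffering, flushing.
	return b.bufferDoc(doc)
}
```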

@keegancsmith commented:

I just had a realisation, and I think this PR will fix it. In our `ctagsAddSymbolsParserMap` code we don't skip files with a non-empty skip reason. That means all those skipped files still get jammed into ctags! Maybe you can also update that code path to skip parsing if `Content` is empty?
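
A sketch of that guard, assuming the symbol-parsing pass iterates over buffered documents (the function and loop shape here are assumed, not lifted from the actual code):

```go
// parseSymbols sketches the suggested check: skipped documents now have
// nil Content (per this PR), so an empty check keeps them out of ctags.
func parseSymbols(docs []zoekt.Document) {
	for i := range docs {
		if len(docs[i].Content) == 0 {
			continue // skipped or empty file: nothing for ctags to parse
		}
		// ... feed docs[i].Content to the ctags parser as before.
	}
}
```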

@jtibshirani (Member, Author) commented Nov 10, 2023

> What is surprising, though, is that we avoid reading large files from git. However, it is possible those files end up being excluded for other reasons afterwards.

Indeed, what happened with this Perforce depot is that all the files were relatively small (so we didn't skip them upfront for being too large), but they were not recognized as source by `zoekt.CheckText`, so they didn't contribute to the calculated buffer size. This is probably not super common, but I guess it can happen with test repos that contain a lot of auto-generated content.
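
For context, a sketch of the accounting gap; the surrounding function, variable names, and the exact `zoekt.CheckText` signature are assumptions:

```go
// bufferAdd is a sketch, not the real builder code: a document failing
// the text check was marked skipped, but (before this PR) its bytes
// stayed in the buffer while never being counted toward the limit.
func bufferAdd(doc *zoekt.Document, bufferedBytes *int, maxTrigramCount int) {
	if err := zoekt.CheckText(doc.Content, maxTrigramCount); err != nil {
		doc.SkipReason = err.Error()
		doc.Content = nil // the fix: drop the bytes once we decide to skip
	} else {
		*bufferedBytes += len(doc.Content) // only indexed docs count toward the shard limit
	}
}
```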

> That means all those skipped files still get jammed into ctags! Maybe you can also update that code path to skip parsing if `Content` is empty?

This is a good point! I'll merge this now to fix the issues on S2, then follow up with a refactor and a more complete fix.

@jtibshirani merged commit 2355607 into main on Nov 10, 2023
8 checks passed
@jtibshirani deleted the jtibs/drop-content branch on November 10, 2023 16:18
jtibshirani added a commit that referenced this pull request on Nov 16, 2023.