Indexing: respect indexing buffer limit #686
Conversation
Nice find. What is surprising, though, is that we avoid reading large files from git. However, it is possible those files end up being excluded for other reasons afterwards, or we just build up a very large number of excluded files (e.g. a third_party or vendor dir). In other words, this is still a great change.
Given we sometimes set SkipReason before looking at a document, maybe we should update Builder.Add to skip most of the work if SkipReason is already set?
Lines 557 to 572 in db067d1

```go
if blob.Size > int64(opts.BuildOptions.SizeMax) && !opts.BuildOptions.IgnoreSizeMax(keyFullPath) {
	if err := builder.Add(zoekt.Document{
		SkipReason:        fmt.Sprintf("file size %d exceeds maximum size %d", blob.Size, opts.BuildOptions.SizeMax),
		Name:              keyFullPath,
		Branches:          brs,
		SubRepositoryPath: key.SubRepoPath,
	}); err != nil {
		return err
	}
	continue
}
contents, err := blobContents(blob)
if err != nil {
	return err
}
```
I just had a realisation, and I think this PR will fix it: in our ctagsAddSymbolsParserMap code we don't skip files with a non-empty SkipReason. That means all those skipped files will still get jammed into ctags! Maybe you can also update that code path to skip parsing if Content is empty?
Indeed, what happened with this Perforce depot is that all the files were relatively small (so we didn't skip them upfront for being too large), but they were not recognized as source by
This is a good point! I'll merge this now to fix the issues on S2, but then follow up with a refactor + more complete fix.
When indexing documents, we buffer up documents until we reach the shard size limit (100MB), then flush the shard. If we decide to skip a document because it's a binary file, then (naturally) we don't count its content size towards the shard limit. But we still buffered the full document. So if there are a large number of binary files, we could easily blow past the 100MB limit and run into memory issues. This change simply clears `Content` whenever `SkipReason` is set. The invariant: a buffered document should only ever have `SkipReason` or `Content`, not both.
Update: this also fixes a bug where we still ran ctags even after we had identified that a file was binary and should be skipped. Now, we avoid running ctags in these cases.