Skip to content

Commit

Permalink
Indexing: respect indexing buffer limit (#686)
Browse files Browse the repository at this point in the history
When indexing documents, we buffer up documents until we reach the shard size
limit (100MB), then flush the shard. If we decide to skip a document because
it's a binary file, then (naturally) we don't count its content size towards
the shard limit. But we still buffered the full document. So if there are a large
number of binary files, we could easily blow past the 100MB limit and run into
memory issues.

This change simply clears `Content` whenever `SkipReason` is set. The
invariant: a buffered document should only ever have `SkipReason` or `Content`,
not both.
  • Loading branch information
jtibshirani authored Nov 10, 2023
1 parent db067d1 commit 2355607
Show file tree
Hide file tree
Showing 2 changed files with 6 additions and 0 deletions.
3 changes: 3 additions & 0 deletions build/builder.go
Original file line number Diff line number Diff line change
Expand Up @@ -642,6 +642,9 @@ func (b *Builder) Add(doc zoekt.Document) error {
b.size += len(doc.Name) + len(doc.Content)
} else {
b.size += len(doc.Name) + len(doc.SkipReason)
// Drop the content if we are skipping the document. Skipped content is not counted towards the
// shard size limit, so otherwise we might buffer too much data in memory before flushing.
doc.Content = nil
}

if b.size > b.opts.ShardMax {
Expand Down
3 changes: 3 additions & 0 deletions build/builder_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -244,6 +244,9 @@ func TestDontCountContentOfSkippedFiles(t *testing.T) {
if len(b.todo) != 1 || b.todo[0].SkipReason == "" {
t.Fatalf("document should have been skipped")
}
if b.todo[0].Content != nil {
t.Fatalf("document content should be empty")
}
if b.size >= 100 {
t.Fatalf("content of skipped documents should not count towards shard size thresold")
}
Expand Down

0 comments on commit 2355607

Please sign in to comment.