Indexing: respect indexing buffer limit (#686)

When indexing documents, we buffer up documents until we reach the shard size limit (100MB), then flush the shard. If we decide to skip a document because it's a binary file, then (naturally) we don't count its content size towards the shard limit. But we still buffered the full document. So if there are a large number of binary files, we could easily blow past the 100MB limit and run into memory issues. This change simply clears `Content` whenever `SkipReason` is set. The invariant: a buffered document should only ever have `SkipReason` or `Content`, not both.
sourcegraph · Nov 10, 2023 · 2355607 · 2355607
1 parent db067d1
commit 2355607
Show file tree

Hide file tree

Showing 2 changed files with 6 additions and 0 deletions.
diff --git a/build/builder.go b/build/builder.go
@@ -642,6 +642,9 @@ func (b *Builder) Add(doc zoekt.Document) error {
 		b.size += len(doc.Name) + len(doc.Content)
 	} else {
 		b.size += len(doc.Name) + len(doc.SkipReason)
+		// Drop the content if we are skipping the document. Skipped content is not counted towards the
+		// shard size limit, so otherwise we might buffer too much data in memory before flushing.
+		doc.Content = nil
 	}
 
 	if b.size > b.opts.ShardMax {

diff --git a/build/builder_test.go b/build/builder_test.go
@@ -244,6 +244,9 @@ func TestDontCountContentOfSkippedFiles(t *testing.T) {
 	if len(b.todo) != 1 || b.todo[0].SkipReason == "" {
 		t.Fatalf("document should have been skipped")
 	}
+	if b.todo[0].Content != nil {
+		t.Fatalf("document content should be empty")
+	}
 	if b.size >= 100 {
 		t.Fatalf("content of skipped documents should not count towards shard size thresold")
 	}