
Indexing: respect indexing buffer limit #686

Merged: jtibshirani merged 1 commit from jtibs/drop-content into main on Nov 10, 2023

Conversation

@jtibshirani (Member) commented Nov 10, 2023

When indexing documents, we buffer up documents until we reach the shard size
limit (100MB), then flush the shard. If we decide to skip a document because
it's a binary file, then (naturally) we don't count its content size towards
the shard limit. But we still buffer the full document. So if there are a large
number of binary files, we can easily blow past the 100MB limit and run into
memory issues.

This change simply clears `Content` whenever `SkipReason` is set. The
invariant: a buffered document should only ever have `SkipReason` or `Content`,
not both.
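
A minimal sketch of the invariant, assuming it is enforced where documents enter the buffer (the helper name here is hypothetical, not the actual change):

```go
package build

import "github.com/sourcegraph/zoekt"

// normalizeSkipped (hypothetical helper) enforces the invariant this PR
// introduces: a buffered document holds SkipReason or Content, not both.
func normalizeSkipped(doc *zoekt.Document) {
	if doc.SkipReason != "" {
		// These bytes will never be indexed, so holding on to them
		// only inflates the buffer past the shard size limit.
		doc.Content = nil
	}
}
```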

Update: this also fixes a bug where we still ran ctags even if we identified a
file was binary and should be skipped. Now, we avoid running ctags in these
cases.

@jtibshirani (Member, Author) commented:

I started digging into this after I noticed indexserver memory spikes on S2:

(screenshot: indexserver memory spikes, 2023-11-09 5:22 PM)

I then correlated these with a large Perforce repo that is consistently failing to index:

19:23:40.318963	 .  4440	... error: command [zoekt-git-index -submodules=false -incremental -branches HEAD -language_map c_sharp:scip,go:scip,python:scip,scala:scip,typescript:scip,kotlin:scip,ruby:scip,javascript:scip,rust:scip,zig:scip -file_limit 1048576 -parallelism 8 -index /data/index -require_ctags -large_file **/fixtures.json /data/index/.indexserver.tmp/perforce-sgdev-org%2Fdevx-80k-files.git] failed: signal: killed OUT: 2023/11/09 19:22:30 attempting to index 86024 total files
19:23:40.318966	 .     3	... state: fail

The repo contents are auto-generated and contain a large number of non-source files. With this fix, Zoekt no longer chokes on indexing the repo.

@jtibshirani marked this pull request as ready for review on November 10, 2023 01:27
@keegancsmith (Member) left a comment

Nice find. What is surprising, though, is that we avoid reading large files from git. However, it is possible those files end up being excluded for other reasons afterwards. Or we just build up a very large number of files that are excluded (e.g. a third_party or vendor dir). I.e., this is still a great change.

Given that we sometimes set `SkipReason` before even looking at a document, maybe we should update `Builder.Add` to skip most of the work if `SkipReason` is already set? (A sketch of this follows the snippet below.)

zoekt/gitindex/index.go, lines 557 to 572 in db067d1:

```go
	if blob.Size > int64(opts.BuildOptions.SizeMax) && !opts.BuildOptions.IgnoreSizeMax(keyFullPath) {
		if err := builder.Add(zoekt.Document{
			SkipReason:        fmt.Sprintf("file size %d exceeds maximum size %d", blob.Size, opts.BuildOptions.SizeMax),
			Name:              keyFullPath,
			Branches:          brs,
			SubRepositoryPath: key.SubRepoPath,
		}); err != nil {
			return err
		}
		continue
	}

	contents, err := blobContents(blob)
	if err != nil {
		return err
	}
```
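
A rough sketch of that early exit; the internal helper `bufferDoc` is hypothetical, and the real `Builder.Add` has more steps:

```go
// Hypothetical early exit in Builder.Add: if the caller has already
// decided to skip this document, avoid the text checks, language
// detection, and ctags parsing entirely.
func (b *Builder) Add(doc zoekt.Document) error {
	if doc.SkipReason != "" {
		doc.Content = nil // invariant from this PR: SkipReason or Content, not both
		return b.bufferDoc(doc) // hypothetical helper: record metadata only
	}
	// ... full path: text checks, symbol parsing, buffering, flushing.
	return b.bufferDoc(doc)
}
```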

@keegancsmith commented:

I just had a realisation, and I think this PR will fix it. In our `ctagsAddSymbolsParserMap` code we don't skip files with a non-empty skip reason. That means all those skipped files still get jammed into ctags! Maybe you can also update that code path to skip parsing if `Content` is empty?
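
A sketch of that guard, assuming the symbol-parsing pass iterates over buffered documents (the function and loop shape here are assumed, not lifted from the actual code):

```go
// parseSymbols sketches the suggested check: skipped documents now have
// nil Content (per this PR), so an empty check keeps them out of ctags.
func parseSymbols(docs []zoekt.Document) {
	for i := range docs {
		if len(docs[i].Content) == 0 {
			continue // skipped or empty file: nothing for ctags to parse
		}
		// ... feed docs[i].Content to the ctags parser as before.
	}
}
```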

@jtibshirani (Member, Author) commented Nov 10, 2023

> What is surprising, though, is that we avoid reading large files from git. However, it is possible those files end up being excluded for other reasons afterwards.

Indeed, what happened with this Perforce depot is that all the files were relatively small (so we didn't skip them upfront for being too large), but they were not recognized as source by `zoekt.CheckText`, so they didn't contribute to the calculated buffer size. This is probably not super common, but I guess it can happen with test repos that contain a lot of auto-generated content.
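
For context, a sketch of the accounting gap; the surrounding function, variable names, and the exact `zoekt.CheckText` signature are assumptions:

```go
// bufferAdd is a sketch, not the real builder code: a document failing
// the text check was marked skipped, but (before this PR) its bytes
// stayed in the buffer while never being counted toward the limit.
func bufferAdd(doc *zoekt.Document, bufferedBytes *int, maxTrigramCount int) {
	if err := zoekt.CheckText(doc.Content, maxTrigramCount); err != nil {
		doc.SkipReason = err.Error()
		doc.Content = nil // the fix: drop the bytes once we decide to skip
	} else {
		*bufferedBytes += len(doc.Content) // only indexed docs count toward the shard limit
	}
}
```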

> That means all those skipped files still get jammed into ctags! Maybe you can also update that code path to skip parsing if `Content` is empty?

This is a good point! I'll merge this now to fix the issues on S2, then follow up with a refactor and a more complete fix.

@jtibshirani merged commit 2355607 into main on Nov 10, 2023
8 checks passed
@jtibshirani deleted the jtibs/drop-content branch on November 10, 2023 16:18
jtibshirani added a commit that referenced this pull request on Nov 16, 2023.